Independent claim |
1. A computer system to extract contender values as positively associated with a pre-defined value from a compilation of one or more electronically stored documents, the system comprising:
one or more computer readable storage devices configured to store
one or more software modules including computer executable instructions, and
the compilation, wherein the electronically stored documents comprise one or more semi-structured document(s), one or more unstructured document(s), or a combination thereof, and each of the one or more electronically stored documents comprises one or more pages;
a network configured to distribute information to a user workstation;
one or more hardware computer processors in communication with the one or more computer readable storage devices and configured to execute the one or more software modules in order to cause the computer system to
access, from the one or more computer readable storage devices, the compilation;
receive information regarding the pre-defined value, wherein the pre-defined value has a certain format, has a certain two-dimensional spatial relationship to words in a pre-selected page, and is associated with one or more keywords;
for each page of the compilation,
identify words and contender values on the page using optical character recognition (OCR) and post-OCR processing, and
group the identified words and the identified contender values into anchor blocks based on their spatial positioning on the page, such that the page comprises a plurality of anchor blocks and each anchor block comprises one or more words, one contender value, or a combination thereof;
on the page, for each of the contender values,
numerically determine a first confidence that the contender value is associated with the pre-defined value based at least in part on a comparison of a calculated two-dimensional spatial relationship between the contender value and the anchor blocks on the page with the pre-defined two-dimensional spatial relationship between the pre-defined value and the words in the pre-selected page,
numerically determine a second confidence that the contender value is associated with the pre-defined value based at least in part on a comparison of words in the anchor blocks on the page with the one or more keywords associated with the pre-defined value, and
numerically determine a third confidence that the contender value is associated with the pre-defined value based at least in part on a comparison of a format of the contender value with the certain format of the pre-defined value;
over all the pages of the compilation, extract positive contender values as positively associated with the pre-defined value based at least in part on the first confidence, the second confidence, and the third confidence;
store the positive contender values in the one or more computer readable storage devices; and
transmit the positive contender values over the network to the user workstation in response to a search for values associated with the pre-defined value at the user workstation. |
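
The three confidences and the extraction step recited in the claim can be illustrated with a minimal Python sketch. The claim does not say how each confidence is computed or how the three are combined, so everything specific here is an assumption made for illustration: the `AnchorBlock` type, the function names, the exponential distance decay for the spatial comparison, the keyword-hit fraction, the regular-expression format check, and the equal-weight average with a 0.5 threshold. OCR and the grouping of words into anchor blocks are taken as already done.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class AnchorBlock:
    """Hypothetical anchor block: grouped text plus its centre on the page."""
    text: str
    x: float
    y: float


def spatial_confidence(value_xy, blocks, expected_offset, scale=100.0):
    """First confidence (assumed form): compare the offset from each anchor
    block to the contender value with the expected offset, and keep the best
    match. The score decays exponentially with the mismatch distance."""
    best = 0.0
    for b in blocks:
        dx = value_xy[0] - b.x
        dy = value_xy[1] - b.y
        err = math.hypot(dx - expected_offset[0], dy - expected_offset[1])
        best = max(best, math.exp(-err / scale))
    return best


def keyword_confidence(blocks, keywords):
    """Second confidence (assumed form): fraction of the keywords that occur
    as words in the page's anchor blocks."""
    if not keywords:
        return 0.0
    page_words = {w.lower() for b in blocks for w in b.text.split()}
    hits = sum(1 for k in keywords if k.lower() in page_words)
    return hits / len(keywords)


def format_confidence(value, pattern):
    """Third confidence (assumed form): 1.0 if the whole contender value
    matches the format pattern, otherwise 0.0."""
    return 1.0 if re.fullmatch(pattern, value) else 0.0


def extract_positive(candidates, threshold=0.5):
    """Keep contender values whose equal-weight average of the three
    confidences reaches the threshold (combination rule is an assumption)."""
    positives = []
    for value, c1, c2, c3 in candidates:
        score = (c1 + c2 + c3) / 3.0
        if score >= threshold:
            positives.append((value, score))
    return positives


# Usage on a toy page with two anchor blocks and two contender values.
blocks = [AnchorBlock("Invoice Total", 100, 200), AnchorBlock("Date", 100, 50)]
amount_format = r"\d{1,3}(,\d{3})*\.\d{2}"  # hypothetical "certain format"
keywords = ["total", "amount"]

candidates = []
for value, xy in [("1,234.56", (250, 200)), ("12/03/2020", (900, 900))]:
    c1 = spatial_confidence(xy, blocks, expected_offset=(150, 0))
    c2 = keyword_confidence(blocks, keywords)
    c3 = format_confidence(value, amount_format)
    candidates.append((value, c1, c2, c3))

positives = extract_positive(candidates)
```

In this sketch the keyword confidence is computed per page, so it is the same for every contender value on that page; a variant could restrict it to the anchor blocks nearest each contender value.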