Title of invention: Index extraction from documents
Abstract: Systems, methods, and programs embodied in a computer readable medium are provided for index extraction. Ground truth documents are stored in a database and organized according to a plurality of classifications, each classification having a group of predefined indices. A document to be indexed is classified by drawing an association between the document and one of the classifications. An attempt is made to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications. Upon a failure to extract the subset of the group of predefined indices, attempts are made to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications.
Publication number: US8805803(B2) Publication date: 2014.08.12
Application number: US200410916877 Filing date: 2004.08.12
Applicant: Hewlett-Packard Development Company, L.P. Inventors: Simske Steven J.; Wright David W.
Classification: G06F7/00; G06F17/30 Main classification: G06F7/00
Agency: Agent:
Principal claim: 1. A method for index extraction, comprising the steps of: storing a plurality of ground truth documents in a database, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices; classifying a document by drawing an association in a computer system between the document to be indexed and one of the classifications; attempting in the computer system to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and attempting in the computer system to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document.
Address: Houston TX US
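The flow recited in the principal claim (classify, attempt extraction, then fall back to the salient dictionary for indices that were not extracted) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `Classification` structure, keyword-overlap classification, regex-based extraction, the sample invoice text, and the layout of the salient dictionary as a map from index name to anticipated misspellings.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Classification:
    """Hypothetical stand-in for one classification of ground truth documents."""
    name: str
    keywords: set[str]                 # terms typical of this classification
    index_patterns: dict[str, str]     # index name -> regex with one capture group
    # Salient dictionary (assumed layout): index name -> {misspelling: correct term}
    salient_dictionary: dict[str, dict[str, str]] = field(default_factory=dict)


def classify(text: str, classifications: list[Classification]) -> Classification:
    """Associate the document with the classification sharing the most keywords."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return max(classifications, key=lambda c: len(c.keywords & words))


def extract(text: str, cls: Classification) -> dict[str, str]:
    """Try to extract each predefined index; return only the ones found."""
    found = {}
    for index_name, pattern in cls.index_patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found[index_name] = match.group(1)
    return found


def correct_missing(text: str, cls: Classification, missing: set[str]) -> str:
    """Search for anticipated misspellings of the indices not yet extracted."""
    for index_name in missing:
        for wrong, right in cls.salient_dictionary.get(index_name, {}).items():
            text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text,
                          flags=re.IGNORECASE)
    return text


def index_document(text: str, classifications: list[Classification]):
    cls = classify(text, classifications)
    found = extract(text, cls)
    missing = set(cls.index_patterns) - set(found)
    if missing:  # extraction failed for some indices: correct and retry
        found = extract(correct_missing(text, cls, missing), cls)
    return cls.name, found


# Hypothetical classifications and an OCR-damaged sample document.
CLASSES = [
    Classification(
        name="invoice",
        keywords={"invoice", "total", "due"},
        index_patterns={
            "invoice_number": r"invoice\s+no\.?\s*[:#]?\s*(\w+)",
            "total": r"total\s*:?\s*\$?([\d.,]+)",
        },
        salient_dictionary={
            "invoice_number": {"lnvoice": "Invoice", "1nvoice": "Invoice"},
            "total": {"T0tal": "Total", "Tota1": "Total"},
        },
    ),
    Classification(
        name="resume",
        keywords={"experience", "education", "skills"},
        index_patterns={"email": r"([\w.]+@[\w.]+)"},
    ),
]

SAMPLE = "lnvoice No: A123\nT0tal: $45.00\nAmount due in 30 days"
RESULT = index_document(SAMPLE, CLASSES)
```

In this sketch the first extraction pass finds neither index, because the recognition errors "lnvoice" and "T0tal" defeat the patterns; the second pass succeeds after the anticipated misspellings for the missing indices are replaced. Restricting the correction step to missing indices mirrors the claim's wording and avoids rewriting text that already extracted cleanly.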