Title System and method for automatically extracting metadata from unstructured electronic documents
Abstract A system and method for automatically extracting metadata from unstructured electronic documents is disclosed. In one embodiment, the unstructured electronic document is converted into a plain text document. A document header of the unstructured electronic document is then extracted from the plain text document using a rule-based document header extractor. The extractor may be based on a rule that includes determining, for a text line in the plain text document, the ratio of the number of words whose initial letters are capitalized to the total number of words in that line. Metadata is then extracted from the extracted document header using a heuristic approach.
Publication number US8843815(B2) Publication date 2014.09.23
Application number US201013258484 Filing date 2010.01.18
Applicant Hewlett-Packard Development Company, L.P. Inventors Yang Sheng-Wen; Xiong Yuhong; Liu Wei
Classification G06F17/00; G06F17/27; G06F17/30 Main classification G06F17/00
Agency Agent
Main claim 1. A computer implemented method for automatically extracting metadata from an unstructured electronic document, comprising: converting the unstructured electronic document into a plain text document; extracting, using a rule-based document header extractor, a document header of the unstructured electronic document from the plain text document based on a rule comprising: determining a total number of words in a text line of the plain text document; computing a ratio of the number of words with their initial letters capitalized in the text line over the total number of words in the text line; determining whether the ratio is greater than or equal to a first predetermined header threshold value; if so, declaring the text line as belonging to the document header and repeating the above steps for a next text line in the converted plain text document; if not, declaring the text line as belonging to a main text of the plain text document and repeating the above steps for the next text line of the converted plain text document until a number of main text lines becomes greater than a second predetermined header threshold value; and extracting metadata from the extracted document header using a heuristic approach.
Address Houston TX US
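
The header-extraction rule recited in the main claim can be illustrated with a minimal Python sketch. This is not the patented implementation. The function names, the whitespace-based word splitting, the treatment of blank lines as main text, and both default threshold values are assumptions chosen for illustration; the record does not specify them.

```python
def capitalized_ratio(line: str) -> float:
    """Ratio of words with a capitalized initial letter to all words in the line."""
    words = line.split()  # assumption: words are whitespace-separated tokens
    if not words:
        return 0.0  # assumption: a blank line has ratio 0 and counts as main text
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words)


def extract_header(lines, ratio_threshold=0.5, max_main_text_lines=3):
    """Collect header lines until the main-text line count exceeds the limit.

    ratio_threshold stands in for the claim's first predetermined header
    threshold, max_main_text_lines for the second; both defaults are
    hypothetical values, not taken from the patent.
    """
    header = []
    main_text_count = 0
    for line in lines:
        if capitalized_ratio(line) >= ratio_threshold:
            header.append(line)       # line declared part of the document header
        else:
            main_text_count += 1      # line declared part of the main text
            if main_text_count > max_main_text_lines:
                break                 # stop scanning once the limit is exceeded
    return header


if __name__ == "__main__":
    sample = [
        "Automatic Metadata Extraction From Documents",
        "Jane Doe and John Smith",
        "this paper describes a simple approach to the problem",
        "Example Labs, Palo Alto",
        "we first convert the file to plain text",
        "then we scan each line in order",
        "Late Capitalized Line Ignored",
    ]
    for header_line in extract_header(sample, max_main_text_lines=2):
        print(header_line)
```

In this sketch the scan does not stop at the first main-text line: a header line that appears after a few body-like lines is still collected, and scanning ends only once the main-text count exceeds the second threshold. The final step of the claim, heuristic metadata extraction from the collected header, is not shown because the record gives no details of that heuristic.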