Title of invention |
System and method for automatically extracting metadata from unstructured electronic documents |
Abstract |
A system and method for automatically extracting metadata from unstructured electronic documents is disclosed. In one embodiment, the unstructured electronic document is converted into a plain text document. Further, a document header of the unstructured electronic document is extracted from the plain text document using a rule-based document header extractor, where the rule-based document header extractor may be based on a rule that includes determining a ratio of a number of words with their initial letters capitalized in a text line over a total number of words in the text line in the plain text document. Moreover, metadata is extracted from the extracted document header using a heuristic approach. |
Publication number |
US8843815(B2) |
Publication date |
2014.09.23 |
Application number |
US201013258484 |
Filing date |
2010.01.18 |
Applicant |
Hewlett-Packard Development Company, L. P. |
Inventors |
Yang Sheng-Wen;Xiong Yuhong;Liu Wei |
Classification |
G06F17/00;G06F17/27;G06F17/30 |
Main classification |
G06F17/00 |
Agency |
|
Agent |
|
Independent claim |
1. A computer implemented method for automatically extracting metadata from an unstructured electronic document, comprising:
converting the unstructured electronic document into a plain text document;
extracting, using a rule-based document header extractor, a document header of the unstructured electronic document from the plain text document based on a rule comprising:
  determining a total number of words in a text line of the plain text document;
  computing a ratio of the number of words with their initial letters capitalized in the text line over a total number of words in the text line;
  determining whether the ratio is greater than or equal to a first predetermined header threshold value;
  if so, declaring the text line as belonging to the document header and repeat the above steps for a next text line in the converted plain text document;
  if not, declaring the text line as belonging to a main text of the plain text document and repeat the above steps for the next text line of the converted plain text document till a number of main text lines becomes greater than a second predetermined header threshold value; and
extracting metadata from the extracted document header using a heuristic approach. |
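The header-extraction rule recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names `capitalized_ratio` and `extract_header`, the parameter names, and the default threshold values are hypothetical; the record gives no concrete values for either threshold. The sketch also assumes that words are whitespace-separated, that blank lines are skipped, and that the count of main text lines is cumulative rather than consecutive, none of which the claim specifies.

```python
def capitalized_ratio(line: str) -> float:
    """Ratio of words whose initial letter is capitalized to all words in the line."""
    words = line.split()  # assumption: words are whitespace-separated tokens
    if not words:
        return 0.0
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words)


def extract_header(lines, ratio_threshold=0.5, max_main_lines=3):
    """Collect header lines until the main text line count exceeds max_main_lines.

    ratio_threshold stands in for the "first predetermined header threshold value"
    and max_main_lines for the "second predetermined header threshold value";
    both defaults are illustrative assumptions, not values from the patent.
    """
    header = []
    main_count = 0
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are neither header nor main text
        if capitalized_ratio(line) >= ratio_threshold:
            header.append(line)  # line declared part of the document header
        else:
            main_count += 1  # line declared part of the main text
            if main_count > max_main_lines:
                break  # stop scanning once enough main text has been seen
    return header


sample = [
    "Automatic Metadata Extraction From Documents",
    "Jane Doe and John Smith",
    "this paper describes a method for parsing text.",
    "we evaluate it on several corpora.",
    "Later Heading Text",
]
header_lines = extract_header(sample, ratio_threshold=0.5, max_main_lines=1)
```

With these sample lines and `max_main_lines=1`, scanning stops at the second lowercase line, so the final title-cased line is never examined. The subsequent heuristic step that pulls metadata out of the header is not sketched here, because the record does not describe it.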
Address |
Houston TX US |