System and method for automatically extracting metadata from unstructured electronic documents, Application No. US201013258484

Title of invention	System and method for automatically extracting metadata from unstructured electronic documents
Abstract	A system and method for automatically extracting metadata from unstructured electronic documents is disclosed. In one embodiment, the unstructured electronic document is converted into a plain text document. Further, a document header of the unstructured electronic document is extracted from the plain text document using a rule-based document header extractor, where the rule-based document header extractor may be based on a rule that includes determining the ratio of the number of words with their initial letters capitalized in a text line to the total number of words in that text line of the plain text document. Moreover, metadata is extracted from the extracted document header using a heuristic approach.
Publication No.	US8843815(B2)	Publication date	2014.09.23
Application No.	US201013258484	Application date	2010.01.18
Applicant	Hewlett-Packard Development Company, L.P.	Inventors	Yang Sheng-Wen; Xiong Yuhong; Liu Wei
Classification No.	G06F17/00; G06F17/27; G06F17/30	Main classification No.	G06F17/00
Agency		Agent
Main claim	1. A computer implemented method for automatically extracting metadata from an unstructured electronic document, comprising: converting the unstructured electronic document into a plain text document; extracting, using a rule-based document header extractor, a document header of the unstructured electronic document from the plain text document based on a rule comprising: determining a total number of words in a text line of the plain text document; computing a ratio of the number of words with their initial letters capitalized in the text line over a total number of words in the text line; determining whether the ratio is greater than or equal to a first predetermined header threshold value; if so, declaring the text line as belonging to the document header and repeat the above steps for a next text line in the converted plain text document; if not, declaring the text line as belonging to a main text of the plain text document and repeat the above steps for the next text line of the converted plain text document till a number of main text lines becomes greater than a second predetermined header threshold value; and extracting metadata from the extracted document header using a heuristic approach.
Address	Houston TX US
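
The header-extraction rule recited in the main claim can be illustrated with a short Python sketch. This is only one possible reading of the claim, not the patentee's implementation. The function names (`capitalized_ratio`, `extract_header`), the parameter names, and the default threshold values (0.5 and 2) are hypothetical, since the record gives no concrete values. The sketch also assumes that blank lines are skipped and that main-text lines are counted cumulatively rather than consecutively. Conversion to plain text and the heuristic metadata-extraction step are not covered, as the record does not describe them in detail.

```python
from typing import Iterable, List


def capitalized_ratio(line: str) -> float:
    """Ratio of words whose first character is uppercase to all words in the line."""
    words = line.split()
    if not words:
        return 0.0  # assumption: an empty line has ratio 0
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words)


def extract_header(
    lines: Iterable[str],
    ratio_threshold: float = 0.5,  # hypothetical "first predetermined header threshold"
    max_main_lines: int = 2,       # hypothetical "second predetermined header threshold"
) -> List[str]:
    """Collect header lines until more than max_main_lines main-text lines are seen."""
    header: List[str] = []
    main_line_count = 0
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are neither header nor main text
        if capitalized_ratio(line) >= ratio_threshold:
            header.append(line)  # line is declared part of the document header
        else:
            main_line_count += 1  # line is declared part of the main text
            if main_line_count > max_main_lines:
                break  # stop scanning once enough main-text lines were seen
    return header


sample = [
    "Automatic Metadata Extraction From Documents",
    "Jane Doe and John Smith",
    "this paper describes a simple rule for headers.",
    "we evaluate it on a small corpus.",
    "results are reported in section four.",
    "Late Capitalized Line Here",
]
print(extract_header(sample))
```

With these illustrative thresholds, the first two sample lines are kept as the header and scanning stops at the third lowercase line, so the final capitalized line is never examined. Real documents would likely need tuned thresholds.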