METHOD AND APPARATUS FOR EXTRACTION OF TEXTUAL CONTENT FROM HYPERTEXT WEB DOCUMENTS, Application No. US20080027625

Title of invention	METHOD AND APPARATUS FOR EXTRACTION OF TEXTUAL CONTENT FROM HYPERTEXT WEB DOCUMENTS
Abstract	Textual content is extracted from hypertext documents by generating for each text document a pruned document model tree of merged text nodes by removing selected tag nodes from a document model tree of the text document, calculating for each merged text node of the pruned document model tree a set of text features which are compared with predetermined feature criteria to decide whether the merged text node is an informative merged text node, and assembling the informative merged text nodes to generate a text file containing the textual content.
Publication No.	US2009030891(A1)	Publication date	2009.01.29
Application No.	US20080027625	Filing date	2008.02.07
Applicant	SIEMENS AKTIENGESELLSCHAFT	Inventor(s)	SKUBACZ MICHAL;ZIEGLER CAI-NICOLAS
Classification	G06F17/30	Main classification	G06F17/30
Agency		Agent
Main claim
Address
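
The pipeline summarized in the abstract (prune tag nodes, merge text, compute features, filter, assemble) can be illustrated with a minimal sketch. This is not the patented implementation: the record does not name the pruned tags, the text features, or the criteria, so the tag set, the three features (word count, sentence count, link ratio), and all thresholds below are assumptions chosen for illustration. The sketch also approximates the pruned tree in a streaming fashion with Python's standard `html.parser` rather than building an explicit document model tree.

```python
import re
from html.parser import HTMLParser

# Assumption: inline formatting tags stand in for the "selected tag nodes"
# that are pruned; the record does not list the actual tags.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "font", "u"}
SKIP_TAGS = {"script", "style"}  # content that is never visible text


class MergedTextCollector(HTMLParser):
    """Collects merged text nodes: text runs not interrupted by block tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.nodes = []       # finished nodes as (text, non-space chars in links)
        self._parts = []      # text pieces of the node being built
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        text = " ".join("".join(self._parts).split())
        if text:
            self.nodes.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag not in INLINE_TAGS:
            self._flush()     # a block-level tag ends the current merged node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len("".join(data.split()))

    def close(self):
        super().close()
        self._flush()


def text_features(text, link_chars):
    """Hypothetical feature set for one merged text node."""
    visible = len(text.replace(" ", ""))
    return {
        "words": len(text.split()),
        "sentences": len(re.findall(r"[.!?](?:\s|$)", text)),
        "link_ratio": link_chars / visible if visible else 0.0,
    }


def is_informative(features, min_words=8, min_sentences=1, max_link_ratio=0.5):
    """Compare features with illustrative (assumed) threshold criteria."""
    return (features["words"] >= min_words
            and features["sentences"] >= min_sentences
            and features["link_ratio"] <= max_link_ratio)


def extract_text(html):
    """Assemble the informative merged text nodes into one text string."""
    collector = MergedTextCollector()
    collector.feed(html)
    collector.close()
    kept = [text for text, link_chars in collector.nodes
            if is_informative(text_features(text, link_chars))]
    return "\n".join(kept)
```

Calling `extract_text(page_html)` on a page string returns the retained paragraphs joined by newlines. With these assumed thresholds, short navigation links and footer fragments tend to be dropped, while sentence-bearing paragraphs are kept even when they contain inline markup such as `<b>`.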