Title of invention: METHOD AND APPARATUS FOR EXTRACTION OF TEXTUAL CONTENT FROM HYPERTEXT WEB DOCUMENTS
Abstract: Textual content is extracted from hypertext documents by generating for each text document a pruned document model tree of merged text nodes by removing selected tag nodes from a document model tree of the text document, calculating for each merged text node of the pruned document model tree a set of text features which are compared with predetermined feature criteria to decide whether the merged text node is an informative merged text node, and assembling the informative merged text nodes to generate a text file containing the textual content.
Publication No.: US2009030891(A1) Publication date: 2009.01.29
Application No.: US20080027625 Filing date: 2008.02.07
Applicant: SIEMENS AKTIENGESELLSCHAFT Inventors: SKUBACZ MICHAL; ZIEGLER CAI-NICOLAS
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim:
Address:
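
The pipeline summarized in the abstract (prune tag nodes, merge text, compute features, compare with criteria, assemble) can be illustrated with a minimal Python sketch. This is not the patented implementation: the record does not say which tags are removed, which features are computed, or what the criteria are. The names PRUNED_TAGS, SKIPPED_TAGS, CRITERIA, MergedTextCollector, text_features, is_informative and extract_text, as well as every threshold, are assumptions chosen only for illustration. The sketch also works on the parser's tag event stream rather than building an explicit tree.

```python
import re
from html.parser import HTMLParser

# Assumption: inline formatting tags stand in for the "selected tag nodes".
# Pruning them lets the text on either side merge into a single node.
PRUNED_TAGS = {"a", "b", "i", "em", "strong", "span", "font", "u"}
# Assumption: text inside these elements is never treated as content.
SKIPPED_TAGS = {"script", "style"}
# Assumption: hypothetical minimum values standing in for the
# "predetermined feature criteria".
CRITERIA = {"word_count": 8, "avg_sentence_length": 4.0}


class MergedTextCollector(HTMLParser):
    """Collects merged text nodes: runs of text not split by an unpruned tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.nodes = []        # finished merged text nodes
        self._buffer = []      # text pieces of the node being built
        self._skip_depth = 0   # greater than 0 while inside script/style

    def _flush(self):
        # Join the buffered pieces and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.nodes.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        if tag not in PRUNED_TAGS:
            self._flush()      # an unpruned tag ends the current node

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag not in PRUNED_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()          # keep any trailing text


def text_features(text):
    """Computes a small, illustrative feature set for one merged text node."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


def is_informative(features, criteria=CRITERIA):
    """A node is informative if every feature meets its minimum."""
    return all(features[name] >= minimum for name, minimum in criteria.items())


def extract_text(html):
    """Assembles the informative merged text nodes into one text string."""
    collector = MergedTextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(
        node for node in collector.nodes if is_informative(text_features(node))
    )
```

Calling extract_text on a page with a short navigation bar, one full paragraph and a copyright line would, under these assumed thresholds, keep only the paragraph. A real system would presumably tune both the feature set and the criteria against labeled pages rather than hard-code them.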