Title: Selective content extraction
Abstract: A method for extracting web content includes detecting, within a web page, a hierarchical structure that includes a plurality of nodes. Potential article nodes from the plurality of nodes are identified. The identified potential article node with a highest rank in the hierarchical structure is identified as an article node. Content is extracted from the article node.
Publication number: US9032285(B2) Publication date: 2015.05.12
Application number: US200913378153 Filing date: 2009.06.30
Applicant: Hewlett-Packard Development Company, L.P. Inventors: Liu Sam; Joshi Parag; Xiong Yuhong; Atkins Clayton; Liu Jerry
Classification: G06F17/00; G06F17/30 Main classification: G06F17/00
Agency: Agent:
Principal claim: 1. A method for extracting web content, comprising:
detecting, within a first web page, a hierarchical structure of the first web page that includes a first plurality of nodes;
identifying potential first article nodes from the first plurality of nodes, wherein the potential first article nodes include a plurality of nodes that correspond to an article section of the first web page;
selecting as a first article node one of the identified potential first article nodes that appears first in the hierarchical structure of the first web page, wherein the first article node corresponds to content of an article of the article section of the first webpage;
selecting a number of sibling article nodes at a same level in the hierarchical structure of the first web page as the first article node, wherein the number of sibling article nodes include additional content of the article of the first article node;
extracting content from the first article node and the number of sibling nodes;
identifying a link specifying a second web page within a sibling node of the number of sibling nodes, wherein the link is indicative that the second web page includes a continuation of the content of the article of the article section of the first webpage;
detecting, responsive to identifying the link, a hierarchical structure of the second web page specified by the link that includes a second plurality of nodes;
identifying potential second article nodes from the second plurality of nodes, wherein the potential second article nodes include a plurality of nodes that correspond to an article section of the second web page;
selecting as the second article node the identified potential article node that appears first in the hierarchical structure of the second webpage;
extracting content from the second article node;
merging the content extracted from the second article node with the content from the first article node and the number of sibling article nodes; and
producing the merged content.
Address: Houston TX US
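
The steps of the principal claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and the record does not say how potential article nodes or continuation links are recognized. Several details are therefore assumptions made only for illustration: a potential article node is taken to be a `<p>` element with at least `MIN_CHARS` characters of text, "appears first in the hierarchical structure" is read as first in document order, a continuation link is taken to be an `<a rel="next">` element, and pages are looked up in an in-memory `PAGES` dictionary instead of being fetched. All function names and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

MIN_CHARS = 20  # hypothetical threshold for a "potential article node"


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and raw text strings, in order


class TreeBuilder(HTMLParser):
    """Builds a node tree; assumes every element has an explicit close tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    """Yield element nodes in document (pre-order) order."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def is_next_link(node):
    # Assumption: a continuation link is marked with rel="next".
    return node.tag == "a" and node.attrs.get("rel") == "next" and "href" in node.attrs


def text_of(node):
    """Text of a node and its descendants, skipping continuation links."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif not is_next_link(child):
            parts.append(text_of(child))
    return " ".join("".join(parts).split())


def article_nodes(root):
    """First potential article node plus its candidate siblings at the same level."""
    candidates = [n for n in walk(root)
                  if n.tag == "p" and len(text_of(n)) >= MIN_CHARS]
    if not candidates:
        return []
    first = candidates[0]
    return [n for n in candidates if n.parent is first.parent]


def extract(pages, url):
    """Extract article text, follow continuation links, and merge the content."""
    chunks, seen = [], set()
    while url in pages and url not in seen:
        seen.add(url)
        nodes = article_nodes(parse(pages[url]))
        chunks.extend(text_of(n) for n in nodes)
        links = [a for n in nodes for a in walk(n) if is_next_link(a)]
        url = links[0].attrs["href"] if links else None
    return "\n".join(chunks)


PAGES = {  # hypothetical sample pages standing in for fetched web pages
    "page1.html": """<html><body>
      <div id="nav"><a href="/">Home</a></div>
      <div id="story">
        <p>The first paragraph of the article starts here.</p>
        <p>The second paragraph ends page one. <a href="page2.html" rel="next">Continued</a></p>
      </div>
      <div id="footer"><p>Copyright</p></div>
    </body></html>""",
    "page2.html": """<html><body>
      <div id="story"><p>The article concludes on the second page.</p></div>
    </body></html>""",
}

if __name__ == "__main__":
    print(extract(PAGES, "page1.html"))
```

Calling `extract(PAGES, "page1.html")` returns the two paragraphs of the first page followed by the paragraph of the second page, one per line, while the navigation and footer text are left out. The loop generalizes the claim's two-page case to any chain of continuation links, and the `seen` set guards against pages that link back to each other.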