A method for extracting web content includes detecting, within a web page, a hierarchical structure that includes a plurality of nodes. Potential article nodes are identified from the plurality of nodes. The potential article node with the highest rank in the hierarchical structure is selected as the article node, and content is extracted from that article node.
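The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say how potential article nodes are recognized or how rank is measured, so two assumptions are made here. A node counts as a potential article node when its direct `<p>` children hold at least `min_chars` characters of text (a hypothetical heuristic), and "highest rank" is read as closest to the root of the tree. The names `Node`, `TreeBuilder`, `is_potential_article`, and `extract_article` are illustrative only.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Step 1: detect the page's hierarchical structure as a tree of Nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore stray closing tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_of(node):
    """Concatenate the text under a node, skipping script and style content."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in ("script", "style"):
            parts.append(text_of(child))
    return " ".join(p for p in parts if p)


def walk(node):
    """Yield a node and all of its descendant Nodes in document order."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def is_potential_article(node, min_chars=40):
    # Assumed heuristic: enough paragraph text directly under this node.
    chars = sum(len(text_of(c)) for c in node.children
                if isinstance(c, Node) and c.tag == "p")
    return chars >= min_chars


def extract_article(html, min_chars=40):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Step 2: identify potential article nodes.
    candidates = [n for n in walk(builder.root)
                  if is_potential_article(n, min_chars)]
    if not candidates:
        return None
    # Step 3: pick the candidate of highest rank (assumed: smallest depth).
    article = min(candidates, key=lambda n: n.depth)
    # Step 4: extract the content of the article node.
    return text_of(article)
```

Calling `extract_article(page_source)` returns the text of the selected node, or `None` when no node passes the heuristic. Choosing the shallowest candidate means that when one candidate is nested inside another, the outer one is selected and the extracted content includes the inner one.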
Application publication number
WO2011002456(A1)
Application publication date
2011.01.06
Application number
WO2009US49298
Filing date
2009.06.30
Applicant
HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P.; LIU, SAM; JOSHI, PARAG; XIONG, YUHONG; ATKINS, CLAYTON; LIU, JERRY
Inventor
LIU, SAM; JOSHI, PARAG; XIONG, YUHONG; ATKINS, CLAYTON; LIU, JERRY