Title of invention: Information block extraction apparatus and method for Web pages
Abstract: A method and apparatus for identifying coherent areas within a Web page. First, a Web page is parsed into an HTML DOM tree and an HTML tag token stream. Next, repeated patterns are induced from the Web page. After improper repeated patterns are filtered out and corresponding instances of the remaining repeated patterns are generated, the repeated patterns are mapped back to the corresponding regions in the Web page. Based on these mappings, a hierarchical RST tree containing information blocks is generated. Information items within the information blocks are detected and then used to generate a hierarchical structural information block tree. Information blocks from the structural information block tree are then classified into text information blocks and link information blocks. Based on this classification and on block semantic similarity, the blocks are clustered and then grouped into semantic information blocks. The semantic information blocks contain main text information blocks and related link blocks, which can be labeled if necessary.
Publication number: US2005066269(A1) Publication date: 2005.03.24
Application number: US20040943157 Filing date: 2004.09.17
Applicant: FUJITSU LIMITED; NANJING UNIVERSITY Inventors: WANG JUN; WANG JICHENG; WU GANGSHAN; TSUDA HIROSHI
Classification: G06F17/30; G06F12/00; G06F17/00; (IPC1-7): G06F17/00 Main classification: G06F17/30
Agency: Agent:
Principal claim:
Address:
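
The first two steps named in the abstract (parsing a page into an HTML tag token stream, then inducing repeated patterns from it) can be illustrated with a minimal sketch. This is not the patented method: the names `TagTokenizer`, `tag_token_stream`, and `repeated_patterns`, the `#text` placeholder token, and the brute-force n-gram counting are all assumptions made for illustration. The record does not say how patterns are induced or filtered, and the DOM tree, RST tree, classification, and clustering steps are not covered here.

```python
from collections import Counter
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Hypothetical helper: flattens HTML into a list of tag tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped so that structurally equal tags compare equal.
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Any non-whitespace text is collapsed to one placeholder token.
        if data.strip():
            self.tokens.append("#text")


def tag_token_stream(html):
    """Return the tag token stream for an HTML string."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def repeated_patterns(tokens, min_len=3, min_count=2):
    """Count every token n-gram of length >= min_len and keep those that
    occur at least min_count times (overlapping occurrences are counted).
    Brute force, assumed here only for illustration."""
    counts = Counter()
    max_len = len(tokens) // min_count
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    page = ("<ul><li><a href='a'>A</a></li>"
            "<li><a href='b'>B</a></li>"
            "<li><a href='c'>C</a></li></ul>")
    stream = tag_token_stream(page)
    for pattern, count in sorted(repeated_patterns(stream).items(),
                                 key=lambda kv: (-len(kv[0]), -kv[1]))[:3]:
        print(count, " ".join(pattern))
```

In this sketch the three list items produce the same five-token sequence, so that sequence is reported as a repeated pattern. A real system would still need to filter out improper patterns (for example, fragments that begin or end in the middle of an element) before mapping the survivors back to page regions, as the abstract describes.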