Title of invention: TEXT CONTENT EXTRACTION METHOD AND DEVICE
Abstract: A text content extraction method and device are disclosed. The method comprises: dividing an input HTML webpage into a plurality of modules, determining a location score for each module according to its position in the webpage layout, and calculating the text length of each module; extracting the link addresses contained in each module, finding the character string that occurs most frequently across all the link addresses, marking each link address that contains this string as a valid link, and marking each link address that does not contain it as an invalid link; and determining a comprehensive score for each module according to comprehensive score = location score × (text length + word length of valid links) / word length of invalid links, and identifying each module whose comprehensive score exceeds a set threshold as a content module. The method can effectively remove redundant non-content information from a webpage and extract the effective content of the webpage more accurately.
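The scoring rule in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Module` class, the function names, and the sample data are hypothetical. It also makes three assumptions the abstract does not state: the location score is already computed, the "most frequent character content" is taken to be the most common host name among the link addresses, and the denominator is clamped to 1 when a module has no invalid links.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Module:
    name: str
    location_score: float  # assumed to be precomputed from the page layout
    text: str              # visible non-link text of the module
    links: list = field(default_factory=list)  # (href, anchor_text) pairs


def most_frequent_token(modules):
    """Assumption: the most frequent 'character content' is the most common host."""
    hosts = Counter(urlsplit(href).netloc for m in modules for href, _ in m.links)
    return hosts.most_common(1)[0][0] if hosts else ""


def comprehensive_score(module, token):
    """location score x (text length + valid link length) / invalid link length."""
    valid = invalid = 0
    for href, anchor in module.links:
        if token and token in href:
            valid += len(anchor)    # link address contains the frequent string
        else:
            invalid += len(anchor)  # link address does not contain it
    # Assumption: clamp the denominator to 1 to avoid division by zero.
    return module.location_score * (len(module.text) + valid) / max(invalid, 1)


def content_modules(modules, threshold):
    """Return the names of modules whose score exceeds the threshold."""
    token = most_frequent_token(modules)
    return [m.name for m in modules if comprehensive_score(m, token) > threshold]


# Hypothetical page split into three modules.
page = [
    Module("nav", 0.2, "Home", [("http://ads.example.net/a", "Buy now"),
                                ("http://other.org/x", "Partners")]),
    Module("body", 1.0, "x" * 200, [("http://news.example.com/1", "Related story"),
                                    ("http://news.example.com/2", "More")]),
    Module("footer", 0.1, "Copyright", [("http://news.example.com/about", "About")]),
]
selected = content_modules(page, threshold=10)
```

With this sample data only the `body` module exceeds the threshold: its long text and valid links raise the numerator, while the navigation module is penalized by its invalid (off-site) links and low location score.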
Publication number: WO2013178193(A3) Publication date: 2014.01.23
Application number: WO2013CN80666 Filing date: 2013.08.01
Applicant: ZTE CORPORATION Inventor: YE, WEI
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim:
Address: