Title of invention RECOGNITION SYSTEM AND RECOGNITION METHOD OF NON-BODY TEXT IN WEBPAGE
Abstract <p>Disclosed are a recognition system and a recognition method for non-body text in a webpage, which relate to the field of text extraction. The system comprises: a webpage grabber, used for grabbing the webpage data of a target website; a DOM tree construction unit, used for constructing the DOM tree corresponding to each webpage in the target website; a DOM tree analysis unit, used for finding the unit text sections in a webpage according to the DOM trees; a text statistics unit, used for counting the number of occurrences of each unit text section in the webpages of the target website; and a text recognition unit, used for recognizing a unit text section as non-body text when its number of occurrences is larger than a preset threshold value. The system and the method overcome the problem of recognition lag for non-body text in existing methods and have high recognition accuracy.</p>
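The counting-and-threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `TextSectionCollector`, `unit_text_sections` and `find_non_body_text` are hypothetical, the webpage grabber is omitted (pages are passed in as HTML strings), and the standard library's event-driven `html.parser` stands in for a real DOM tree. Two further assumptions are made: a "unit text section" is taken to be the stripped content of a single text node, and each section is counted at most once per page.

```python
from collections import Counter
from html.parser import HTMLParser


class TextSectionCollector(HTMLParser):
    """Collects the stripped text of each text node, skipping script/style.

    Assumption: one text node stands in for one "unit text section".
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.sections = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.sections.append(text)


def unit_text_sections(html):
    """Return the text sections found in one page, in document order."""
    collector = TextSectionCollector()
    collector.feed(html)
    collector.close()
    return collector.sections


def find_non_body_text(pages, threshold):
    """Return the sections that occur on more than `threshold` pages."""
    counts = Counter()
    for html in pages:
        # Assumption: a section repeated within one page counts once.
        counts.update(set(unit_text_sections(html)))
    return {text for text, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # Hypothetical pages sharing a navigation bar and a footer.
    template = (
        "<html><body><div>Home | About</div>"
        "<p>{body}</p><div>Copyright 2013</div></body></html>"
    )
    pages = [template.format(body=f"Article {i} text.") for i in (1, 2, 3)]
    print(find_non_body_text(pages, threshold=2))
```

With these sample pages, the navigation bar and footer text appear on all three pages and exceed the threshold of 2, while each article paragraph appears only once and is kept as body text. A suitable threshold in practice would depend on how many pages of the site are sampled.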
Publication number WO2014000571(A1) Publication date 2014.01.03
Application number WO2013CN77102 Filing date 2013.06.09
Applicant BEIJING QIHOO TECHNOLOGY COMPANY LIMITED;QIZHI SOFTWARE (BEIJING) COMPANY LIMITED Inventor WANG, ZHIGANG
Classification G06F17/30 Main classification G06F17/30
Agency Agent
Principal claim
Address