Abstract |
<p>Disclosed are a system and a method for recognizing non-body text in a webpage, which relate to the field of text extraction. The system comprises: a webpage grabber, used for grabbing webpage data of a target website; a DOM tree construction unit, used for constructing the DOM tree corresponding to each webpage in the target website; a DOM tree analysis unit, used for finding unit text sections in each webpage according to the DOM trees; a text statistics unit, used for counting the number of times each unit text section occurs in the webpages of the target website; and a text recognition unit, used for recognizing a unit text section as non-body text when its number of occurrences is larger than a preset threshold value. The system and the method overcome the recognition lag for non-body text in existing methods and achieve high recognition accuracy.</p> |
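The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and several points are assumptions made only for illustration: the webpage grabber is replaced by a list of in-memory HTML strings, the DOM tree is approximated by the standard library's event-based `html.parser`, a "unit text section" is taken to be the stripped text of a single text node, and each section is counted at most once per page. The names `TextSectionParser`, `unit_text_sections`, and `find_non_body_text` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextSectionParser(HTMLParser):
    """Collects the stripped text of each text node as a unit text section.

    Assumption: one text node equals one unit text section.
    """

    def __init__(self):
        super().__init__()
        self.sections = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.sections.append(text)


def unit_text_sections(html):
    """Return the unit text sections found in one page, in document order."""
    parser = TextSectionParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.sections


def find_non_body_text(pages, threshold):
    """Return the sections that occur on more than `threshold` pages."""
    counts = Counter()
    for html in pages:
        # Assumption: count each section once per page, not once per occurrence.
        counts.update(set(unit_text_sections(html)))
    return {section for section, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # Stand-in for the grabbed pages of one target website.
    pages = [
        "<html><body><div>Home</div><p>First article.</p>"
        "<div>Copyright Example</div></body></html>",
        "<html><body><div>Home</div><p>Second article.</p>"
        "<div>Copyright Example</div></body></html>",
        "<html><body><div>Home</div><p>Third article.</p>"
        "<div>Copyright Example</div></body></html>",
    ]
    print(sorted(find_non_body_text(pages, threshold=2)))
    # prints ['Copyright Example', 'Home']
```

In this sketch the navigation label and the copyright line appear on all three pages, exceed the threshold of 2, and are flagged as non-body text, while each article sentence appears once and is kept as body text. A real deployment would need to choose the threshold relative to the number of pages grabbed from the site.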