摘要 |
<p>Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces and wherein a container is a Web page element that is associated with tags‘body’,‘div’,‘td’,‘li’,‘article/section’and pieces are candidates that do not include other candidates. Candidate scores may be modified according to a number of ratios corresponding to text and link density.</p> |