Title of invention: EXTRACTING PRINCIPAL CONTENT FROM WEB PAGES
Abstract: Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top-scoring candidate, performing clean-up processing for the top-scoring candidate, and performing final page processing for the top-scoring candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces, where a container is a Web page element associated with the tags 'body', 'div', 'td', 'li', or 'article/section', and pieces are candidates that do not include other candidates. Candidate scores may be modified according to a number of ratios corresponding to text and link density.
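The scoring and selection steps named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the claimed method: the `Candidate` fields, the weights, the size buckets, the link-density penalty, and the reading of the CJK step as counting punctuation-delimited segments are all hypothetical, since the abstract gives no formulas or numeric values.

```python
from dataclasses import dataclass, field

# Hypothetical weights; the abstract gives no numeric values.
PARAGRAPH_WEIGHT = 10.0
IMAGE_WEIGHTS = {"small": 0.0, "medium": 2.0, "large": 5.0}
# A small, assumed set of full-width CJK punctuation marks.
CJK_PUNCTUATION = "，。、；：？！"


@dataclass
class Candidate:
    tag: str            # e.g. 'div', 'td', 'li'
    text: str           # all text inside the element
    link_text: str      # the part of that text inside links
    paragraphs: int     # number of paragraphs in the element
    images: dict = field(default_factory=dict)  # size bucket -> count


def link_density(c: Candidate) -> float:
    """Share of the candidate's text that is link text (1.0 if no text)."""
    if not c.text:
        return 1.0
    return len(c.link_text) / len(c.text)


def cjk_segment_count(text: str) -> int:
    """Count punctuation-delimited segments as a stand-in for word count."""
    for mark in CJK_PUNCTUATION:
        text = text.replace(mark, " ")
    return len(text.split())


def score(c: Candidate) -> float:
    """Paragraph and image credit, scaled down by link density."""
    base = c.paragraphs * PARAGRAPH_WEIGHT
    base += sum(IMAGE_WEIGHTS.get(size, 0.0) * n for size, n in c.images.items())
    return base * (1.0 - link_density(c))


def select_top(candidates: list) -> Candidate:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)
```

Under these assumed weights, a text-heavy `div` with few links outranks a navigation list whose text is mostly links, which is the behavior the text-density and link-density ratios are presumably meant to produce.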
Publication number: EP2776945(A1) Publication date: 2014.09.17
Application number: EP20120847034 Filing date: 2012.11.07
Applicant: EVERNOTE CORPORATION Inventors: BIGNERT, JAKOB; COARNA, GABRIEL ALEXANDRU
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Principal claim:
Address: