Title of invention: EXTRACTING INFORMATION FROM WEB PAGES
Abstract: Methods and apparatus, including computer program products, for identifying Web page content with a granularity finer than individual Web pages, e.g., finer than individual HTML documents. The invention provides a computer-implemented method for identifying the Web page content. The method includes receiving a string of markup language source code that includes tags. The method includes identifying sub-sequences in which tags occur in the string. Each sub-sequence is associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence. The sub-sequences identified are ones that satisfy criteria for being classified as associated with a portion of the string that defines Web page content constituting an entire listing. The criteria include a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string. The method includes returning the identified sub-sequences.
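The tandem-repeat criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method: it handles exact repeats only (the abstract also allows approximate ones), it reports tag positions rather than offsets into the source string, and the names `TagCollector`, `tag_sequence`, `find_tandem_repeats`, `min_len`, and `min_copies` are hypothetical, introduced here for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Reduce a markup string to its sequence of tags, ignoring text."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def find_tandem_repeats(tags, min_len=2, min_copies=2):
    """Return (start_index, unit, copies) for tag units repeated back to back.

    Assumption: only exact repeats are detected; approximate matching
    is out of scope for this sketch.
    """
    results = []
    n = len(tags)
    i = 0
    while i < n:
        best = None  # (unit length, number of copies)
        for length in range(min_len, (n - i) // min_copies + 1):
            unit = tags[i:i + length]
            copies = 1
            # Count how many adjacent copies of the unit follow position i.
            while tags[i + copies * length:i + (copies + 1) * length] == unit:
                copies += 1
            covered = copies * length
            if copies >= min_copies and (best is None or covered > best[0] * best[1]):
                best = (length, copies)
        if best:
            length, copies = best
            results.append((i, tags[i:i + length], copies))
            i += length * copies  # skip past the whole repeated run
        else:
            i += 1
    return results
```

For a three-item list such as `<ul><li><a href='x'>A</a></li><li><a href='y'>B</a></li><li><a href='z'>C</a></li></ul>`, `find_tandem_repeats(tag_sequence(html))` reports the unit `li, a, /a, /li` starting at tag index 1 and repeated three times, which corresponds to one candidate listing per repetition.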
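Publication number: WO2005109178(A2)  Publication date: 2005.11.17
Application number: WO2005US15279  Filing date: 2005.05.03
Applicant: HARIK, RALPH  Inventor: HARIK, RALPH
Classification: G06F7/00;G06F17/22;G06F17/27  Main classification: G06F7/00
Agency:  Agent:
Principal claim:
Address: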