Title of invention: Determination of table of content links for a hyperlinked document
Abstract: The present invention relates to a methodology for assembling a document from content spanning multiple web pages, employing two cooperative processes. Given a starting location, the first process analyzes a single page at a time to find candidate links. The links are followed recursively and the pages they lead to are analyzed in turn. A detailed set of heuristics determines what is or is not a candidate link. The links are examined for link clusters, and a table of contents, if found, is identified. The candidate pages are then fed to a document-level analyzer. This second process compares the attributes of each page against those of the others and looks for a document-like structure. Using another detailed set of heuristics, the document-level analyzer determines whether each page should be included in the document.
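The two cooperating processes described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the in-memory SITE dictionary, the URL-prefix test standing in for the candidate-link heuristics, the "three or more consecutive links" rule standing in for link-cluster detection, and the majority "style" attribute standing in for the document-level heuristics.

```python
from collections import Counter

# Hypothetical in-memory site: URL -> page attributes and ordered page items.
SITE = {
    "/doc/index": {"style": "manual", "items": [
        ("text", "User Guide"),
        ("link", "/doc/ch1"), ("link", "/doc/ch2"), ("link", "/doc/ch3"),
        ("text", "See also"),
        ("link", "/ads/promo"),
    ]},
    "/doc/ch1": {"style": "manual", "items": [("text", "Chapter 1"), ("link", "/doc/ch2")]},
    "/doc/ch2": {"style": "manual", "items": [("text", "Chapter 2"), ("link", "/doc/ch3")]},
    "/doc/ch3": {"style": "forum", "items": [("text", "Comments")]},
    "/ads/promo": {"style": "ad", "items": [("text", "Buy now")]},
}

def is_candidate(url, prefix):
    """Assumed heuristic: a link is a candidate if it is known and shares the start prefix."""
    return url in SITE and url.startswith(prefix)

def link_clusters(items, min_size=3):
    """Assumed heuristic: a cluster is a run of at least min_size consecutive links."""
    clusters, run = [], []
    for kind, value in items:
        if kind == "link":
            run.append(value)
            continue
        if len(run) >= min_size:
            clusters.append(run)
        run = []
    if len(run) >= min_size:
        clusters.append(run)
    return clusters

def collect_pages(url, prefix, seen, depth=0, max_depth=3):
    """Page-level process: analyze one page, then recursively follow its candidate links."""
    if url in seen or depth > max_depth:
        return
    seen.append(url)
    for kind, value in SITE[url]["items"]:
        if kind == "link" and is_candidate(value, prefix):
            collect_pages(value, prefix, seen, depth + 1, max_depth)

def find_toc(pages):
    """Pick the largest link cluster on any collected page as the table of contents."""
    best = []
    for url in pages:
        for cluster in link_clusters(SITE[url]["items"]):
            if len(cluster) > len(best):
                best = cluster
    return best

def assemble(pages):
    """Document-level process: keep pages whose style matches the most common style."""
    common_style, _ = Counter(SITE[u]["style"] for u in pages).most_common(1)[0]
    return [u for u in pages if SITE[u]["style"] == common_style]

if __name__ == "__main__":
    pages = []
    collect_pages("/doc/index", "/doc/", pages)
    print("candidate pages:", pages)
    print("table of contents:", find_toc(pages))
    print("document pages:", assemble(pages))
```

In this toy data the advertisement link is rejected at the page level, while the third chapter is collected as a candidate but dropped at the document level because its attributes differ from those of the other pages. A real implementation would replace each stand-in rule with the detailed heuristics the abstract refers to.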
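Application publication number: US2005076000(A1) Publication date: 2005.04.07
Application number: US20030608591 Filing date: 2003.06.27
Applicant: XEROX CORPORATION Inventors: SWEET JAMES M.; HARRINGTON STEVEN J.; JONES RHYS PRICE; SAVAKIS ANDREAS
Classification: G06F17/30; (IPC1-7): G06F7/00 Main classification: G06F17/30
Agency: Agent:
Principal claim:
Address: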