Title of invention |
Determination of table of content links for a hyperlinked document |
Abstract |
The present invention relates to a methodology for assembling a document from content spanning multiple web pages, employing two cooperative processes. Given a starting location, the first process analyzes a single page at a time to find candidate links. The links are followed recursively and the pages they lead to are analyzed in turn. A detailed set of heuristics determines what is or is not a candidate link. The links are examined for link clusters, and a table of contents, if found, is identified. The candidate pages are then fed to the second process, a document-level analyzer, which compares the attributes of each page against the others and looks for a document-like structure. Using another detailed set of heuristics, the document-level analyzer determines whether each page should be included in the document.
|
Publication number |
US2005076000(A1) |
Publication date |
2005.04.07 |
Application number |
US20030608591 |
Filing date |
2003.06.27 |
Applicant |
XEROX CORPORATION |
Inventors |
SWEET JAMES M.;HARRINGTON STEVEN J.;JONES RHYS PRICE;SAVAKIS ANDREAS |
Classification |
G06F17/30;(IPC1-7):G06F7/00 |
Main classification |
G06F17/30 |
Agency |
|
Agent |
|
Principal claim |
|
Address |
|
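
The page-level process summarized in the abstract (find candidate links, follow them recursively, look for a link cluster that marks a table of contents) can be illustrated with a minimal Python sketch. This is not the patented method: the record does not disclose the actual heuristics, so the same-host filter, the "largest group of candidate links" stand-in for link clustering, the depth limit, the `min_cluster` threshold, and all names and URLs below (`candidate_links`, `crawl`, `find_toc`, `PAGES`) are assumptions made purely for illustration. Pages are served from an in-memory dictionary instead of the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def candidate_links(base_url, html):
    """Assumed heuristic: keep only links on the same host as the page."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href).split("#")[0]  # resolve, drop fragment
        if urlparse(url).netloc == host and url != base_url and url not in links:
            links.append(url)
    return links


def crawl(start, fetch, max_depth=2):
    """Recursively follow candidate links; returns {url: [candidate links]}."""
    seen = {}

    def visit(url, depth):
        if url in seen or depth > max_depth:
            return
        html = fetch(url)  # fetch returns the page text, or None if missing
        if html is None:
            return
        seen[url] = candidate_links(url, html)
        for link in seen[url]:
            visit(link, depth + 1)

    visit(start, 0)
    return seen


def find_toc(link_map, min_cluster=3):
    """Stand-in for link clustering: the page with the most candidate links
    is reported as the table of contents if it has at least min_cluster."""
    best = max(link_map, key=lambda u: len(link_map[u]), default=None)
    if best is None or len(link_map[best]) < min_cluster:
        return None
    return best


# Hypothetical four-page document used as sample input.
PAGES = {
    "http://example.org/doc/index.html":
        '<a href="ch1.html">1</a><a href="ch2.html">2</a>'
        '<a href="ch3.html">3</a><a href="http://other.example/ad">ad</a>',
    "http://example.org/doc/ch1.html":
        '<a href="index.html">Contents</a><a href="ch2.html">Next</a>',
    "http://example.org/doc/ch2.html":
        '<a href="index.html">Contents</a><a href="ch3.html">Next</a>',
    "http://example.org/doc/ch3.html":
        '<a href="index.html">Contents</a>',
}

link_map = crawl("http://example.org/doc/index.html", PAGES.get)
toc_page = find_toc(link_map)
```

With this sample input, the crawl reaches all four pages and the index page is reported as the table of contents because it holds the largest group of same-host links. The document-level analyzer described in the abstract, which compares page attributes to decide inclusion, is not modeled here.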