Title: Method and apparatus for enhanced web browsing
Abstract: Methods and apparatus for searching the World Wide Web are disclosed. The method includes searching all the pages of at least one web site and then searching at least one search engine index for all the pages of at least one web site and determining if the pages are cached in the search engine index. A further embodiment provides for searching an index of a search engine, repeating the search after a specified period of time and then determining if any changes have been made to the web pages in the search engine index.
Publication number: US9323861(B2); Publication date: 2016.04.26
Application number: US201012949685; Filing date: 2010.11.18
Applicant: Shepherd Daniel W.; Inventor: Shepherd Daniel W.
Classification: G06F17/30; Main classification: G06F17/30
Agency: Mandour & Associates, APC; Agent: Mandour & Associates, APC
Principal claim: 1. An apparatus for searching at least one web site, comprising: a processor coupled to a display device, the processor containing instructions for a web crawler, wherein the web crawler comprises a parallel web crawler comprising the following operations:
1) searching all pages on the at least one web site;
2) searching at least one search engine index for said all pages on the at least one web site;
3) determining if said all pages on the at least one web site are cached on the at least one search engine index;
4) repeating the searching of the at least one search engine index after a specified period of time, and determining what changes have occurred;
5) determining, from the at least one web site, all anchor text links, internal links, and external links, and displaying the anchor text links, internal links, and external links in a report, wherein the anchor text links are used to predict the similarity of a page to a query before the page is downloaded, wherein the predicting of the similarity is based on the anchor text links and the predicting allows the processor to engage in focused crawling;
6) searching the at least one web site for broken links;
7) searching the at least one web site for leading links;
8) searching the at least one web site for external links;
9) searching the at least one web site and displaying all image links;
10) extracting a specified type of data from the at least one web site;
11) copying and indexing source code from the at least one web site;
12) blocking directories, pages, and sections from the at least one web site during a search; and
13) selecting only static pages to search on the at least one web site,
wherein the web crawler is executed with these thirteen operations being selected; wherein each of these thirteen operations, starting at the searching all pages on the at least one web site, are run in parallel, wherein the web crawler results in up to date data for said all pages, the web crawler results in gathering information from said all pages, and the web crawler results in automated browsing and maintaining of links and HTML code for the at least one website,
the processor further comprising a politeness policy that provides guidelines for avoiding overloading web pages of the at least one web site revisited by the web crawler, wherein the politeness policy is based on Universal Resource Locator (URL) normalization or URL canonicalization of at least one URL of the at least one web site;
the processor further comprising a re-visit policy that dictates when to check for changes to pages of the at least one web site already examined, wherein the revisit policy comprises a combination of uniform and proportional policies that monotonically and sub-linearly increase with rate of change, wherein the revisit policy comprises a binary measure that indicates whether a page of the at least one web site is accurate, wherein the revisit policy comprises a measure of how outdated the page is, wherein the revisit policy maintains a high value of average freshness by ignoring pages of the at least one web site that change too often;
the processor further comprising a selection policy searching all pages on the at least one web site, wherein the using of the anchor text links to predict the similarity of the page is based on the selection policy;
the processor further comprising a parallelization policy for these thirteen operations, wherein coordination of these thirteen operations is based on the parallelization policy.
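Illustrative note on the re-visit policy: the claim names a binary measure of whether a stored page is accurate and a measure of how outdated it is. The Python sketch below shows one possible reading of those two measures, often called freshness and age in the web crawling literature. It is not taken from the patent: the `PageRecord` type, its field names, and the assumption that the time of the page's last change is known to the crawler are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical bookkeeping for one crawled page (times in seconds)."""
    last_crawled: float   # when the crawler last fetched its copy
    last_modified: float  # when the live page last changed (assumed known)


def freshness(page: PageRecord) -> int:
    """Binary measure: 1 if the stored copy still matches the live page."""
    return 1 if page.last_modified <= page.last_crawled else 0


def age(page: PageRecord, now: float) -> float:
    """How outdated the copy is: 0 while fresh, else time since the change."""
    if freshness(page):
        return 0.0
    return now - page.last_modified


def average_freshness(pages: list[PageRecord]) -> float:
    """Mean freshness over a collection; 0.0 for an empty collection."""
    if not pages:
        return 0.0
    return sum(freshness(p) for p in pages) / len(pages)
```

Under this reading, a page that changes more often than it can be re-fetched is almost always stale, so spending visits on it adds little to the average; that is one way to interpret the claim's statement about ignoring pages that change too often.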
Address: Escondido, CA, US
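Illustrative note on operation 5 and URL normalization: the principal claim describes collecting anchor text links, internal links, and external links into a report, and mentions URL normalization. The sketch below is one minimal way such a report could be built with the Python standard library. The names `normalize`, `LinkCollector`, and `link_report` are hypothetical, the normalization rules shown (lowercase scheme and host, drop fragments and default ports) are a common subset assumed here rather than the patent's own, and "internal" is assumed to mean "same host as the starting page".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and default ports."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


class LinkCollector(HTMLParser):
    """Collects (normalized absolute URL, anchor text) pairs from <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Only web links are reported; mailto:, javascript:, etc. are skipped.
        if urlsplit(absolute).scheme in ("http", "https"):
            self._href = normalize(absolute)
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_report(base_url: str, html: str) -> dict:
    """Split a page's links into internal and external by host name."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    site = urlsplit(base_url).hostname
    report = {"internal": [], "external": []}
    for url, text in collector.links:
        kind = "internal" if urlsplit(url).hostname == site else "external"
        report[kind].append((url, text))
    return report
```

A call such as `link_report("http://example.com/index.html", page_html)` returns a dictionary with `internal` and `external` lists, each entry pairing a normalized URL with its anchor text. The sketch does not attempt the similarity prediction, broken-link checks, or parallel execution that the claim also recites.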