Title of invention: Optimizing web crawling with user history
Abstract: A politeness manager estimates traffic to web sites based on historical log data generated and sent by plug-ins or toolbars on client web browsers. The historical log data records the dates and times at which the web browsers visit different web sites, and is used to determine the timeframes in which specific web sites are busy and those in which they are not. Crawl rates for different timeframes for a web site are determined based on the historical log data, and web crawlers are scheduled to crawl the web site according to the crawl rates, minimizing the chance that web crawler requests cause the site to crash.
Publication number: US8782031(B2)  Publication date: 2014.07.15
Application number: US201113206256  Filing date: 2011.08.09
Applicant: Microsoft Corporation  Inventors: Wierman Dean M.; Canel Fabrice; Shyamkumar Balaji; Zhang Charles (Xi)
Classification: G06F17/30  Main classification: G06F17/30
Agency: (none listed)  Agents: Ream Dave; Haslam Brian; Minhas Micky
Principal claim: 1. A method for crawling a web site, comprising: receiving, at a server device, log data from a plurality of web browsers, the log data indicating users accessing the web site through the web browsers; using, at the server device, the log data to estimate traffic to the web site during a timeframe; determining, by the server device, a threshold frequency of page requests for the web site during the timeframe based on the estimate of traffic; determining, at the server device, a crawl rate during the timeframe that is less than the threshold frequency of page requests; and using the crawl rate to schedule one or more web crawlers to request the web site.
Address: Redmond WA US
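
The steps of the principal claim (estimate traffic per timeframe from browser log data, derive a threshold frequency of page requests, pick a crawl rate below it, and schedule crawlers) might be sketched as follows. This is a minimal illustration, not the patented implementation: the record gives no formulas or numbers, so the hourly timeframes, the constants `CAPACITY_PER_HOUR`, `VISITORS_PER_REPORTING_BROWSER` and `CRAWL_FRACTION`, and all function names are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical constants; the patent record specifies none of these values.
CAPACITY_PER_HOUR = 1000             # assumed page requests per hour the site can serve
VISITORS_PER_REPORTING_BROWSER = 10  # assumed scale-up from logging browsers to all visitors
CRAWL_FRACTION = 0.5                 # assumed share of spare capacity granted to crawlers


def estimate_traffic(visits, site):
    """Estimate average page requests per hour of day (0-23) for one site.

    `visits` is an iterable of (site, datetime) pairs reported by browsers.
    """
    counts = Counter()
    days = set()
    for visited_site, timestamp in visits:
        if visited_site == site:
            counts[timestamp.hour] += 1
            days.add(timestamp.date())
    n_days = max(len(days), 1)  # avoid division by zero when there is no data
    return {
        hour: counts[hour] / n_days * VISITORS_PER_REPORTING_BROWSER
        for hour in range(24)
    }


def crawl_rates(traffic):
    """Derive a per-hour crawl rate that stays below the threshold frequency.

    The threshold here is the assumed spare capacity left after user traffic.
    When the threshold is zero, the rate is zero and no crawling is scheduled.
    """
    rates = {}
    for hour, estimated_requests in traffic.items():
        threshold = max(CAPACITY_PER_HOUR - estimated_requests, 0.0)
        rates[hour] = threshold * CRAWL_FRACTION
    return rates


def schedule(rates):
    """Turn hourly crawl rates into a delay in seconds between crawler requests.

    An hour with a zero rate maps to None, meaning "do not crawl in this hour".
    """
    return {
        hour: (3600.0 / rate if rate > 0 else None)
        for hour, rate in rates.items()
    }


if __name__ == "__main__":
    sample_visits = [
        ("example.com", datetime(2011, 8, 1, 9, 5)),
        ("example.com", datetime(2011, 8, 1, 9, 30)),
        ("example.com", datetime(2011, 8, 2, 9, 10)),
        ("other.org", datetime(2011, 8, 1, 9, 0)),
    ]
    delays = schedule(crawl_rates(estimate_traffic(sample_visits, "example.com")))
    for hour in (3, 9):
        print(hour, delays[hour])
```

In this sketch, busy hours leave less spare capacity and therefore get a longer delay between crawler requests, which is the behavior the abstract describes; a production system would need a more careful traffic model than a flat per-hour average.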