Title: Resource download policies based on user browsing statistics
Abstract: Web crawling policies are generated based on user web browsing statistics. User browsing statistics are aggregated at the granularity of resource identifier patterns (such as URL patterns) that denote groups of resources within a particular domain or website that share syntax at a certain level of granularity. The web crawl policies rank the resource identifier patterns according to their associated aggregated user browsing statistics. A crawl ordering defined by the web crawl policies is used to download and discover new resources within a domain or website.
Publication number: US9495453(B2) Publication date: 2016.11.15
Application number: US201113114643 Filing date: 2011.05.24
Applicant: Microsoft Technology Licensing, LLC Inventors: Cai Rui;Fan Xiaodong;Zhang Lei
Classification: G06F7/00;G06F17/30;G06F17/00 Main classification: G06F7/00
Agency: Lee & Hayes, PLLC Agents: Swain Sandy;Minhas Micky;Lee & Hayes, PLLC
Main claim: 1. A computer-implemented method, comprising: under control of one or more processors configured with executable instructions: determining a resource identifier pattern that identifies a subset of a plurality of resource identifiers of a website, the resource identifier pattern having a higher level of generality than the subset of the plurality of resource identifiers, the determining comprising: analyzing the plurality of resource identifiers to determine syntactical relationships between individual resource identifiers of the plurality of resource identifiers; determining that at least a first resource identifier of the subset of the plurality of resource identifiers represents an individual resource identifier having a lower syntactical level than at least a second resource identifier of the subset of the plurality of resource identifiers; determining user browsing behaviors associated with the first resource identifier and the second resource identifier; and based at least in part on the first resource identifier and the second resource identifier sharing at least one syntactical element, first user browsing behaviors associated with the first resource identifier, and second user browsing behaviors associated with the second resource identifier, merging the first resource identifier into the second resource identifier to generate a single resource identifier; and generating a policy for downloading one or more of a plurality of resources identified by the resource identifier pattern based at least in part on aggregated user browsing statistics associated with the subset of the plurality of resource identifiers.
Address: Redmond WA US
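
Illustrative sketch (not part of the patent record): the abstract describes aggregating user browsing statistics per resource identifier pattern and ranking the patterns to obtain a crawl ordering. The Python below is a minimal sketch of that idea under simplifying assumptions. The function names (`to_pattern`, `build_crawl_policy`, `crawl_order`), the rule that replaces numeric path segments with `*`, and the use of raw visit counts as the statistic are all hypothetical choices made for illustration; the claimed method determines patterns through syntactic analysis and merging of identifiers, which this sketch does not reproduce.

```python
from collections import Counter
from urllib.parse import urlsplit


def to_pattern(url: str) -> str:
    """Generalize a URL into a coarser pattern.

    Assumption for illustration only: numeric path segments are treated
    as interchangeable and replaced with '*'.
    """
    parts = urlsplit(url)
    segments = ["*" if seg.isdigit() else seg for seg in parts.path.split("/")]
    return parts.netloc + "/".join(segments)


def build_crawl_policy(visit_log):
    """Aggregate visit counts per pattern and rank patterns by total visits.

    visit_log: iterable of (url, visit_count) pairs.
    Returns a list of (pattern, total_visits), most visited first.
    """
    totals = Counter()
    for url, visits in visit_log:
        totals[to_pattern(url)] += visits
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def crawl_order(candidate_urls, policy):
    """Order candidate URLs by the rank of the pattern each one matches.

    URLs whose pattern is absent from the policy are placed last.
    """
    rank = {pattern: i for i, (pattern, _) in enumerate(policy)}
    return sorted(
        candidate_urls,
        key=lambda u: (rank.get(to_pattern(u), len(rank)), u),
    )


if __name__ == "__main__":
    log = [
        ("http://example.com/thread/12", 50),
        ("http://example.com/thread/97", 30),
        ("http://example.com/user/5", 10),
    ]
    policy = build_crawl_policy(log)
    print(policy)
    print(crawl_order(
        ["http://example.com/user/9", "http://example.com/thread/3"],
        policy,
    ))
```

In this sketch, newly discovered URLs inherit the priority of the pattern they match, so a page with no browsing history of its own can still be scheduled ahead of pages in less visited groups.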