Title of invention: Systems and methods for inferring uniform resource locator (URL) normalization rules
Abstract: Different URLs that reference the same web page or other web resource are detected, and that information is used to download only one instance of the page or resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once substantially identical web pages or web resources with different URLs are found, those URLs are analyzed to identify which portions of the URL are essential for identifying a particular web page or web resource and which portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (referred to herein as an "equivalence class"), the per-equivalence-class rules are generalized to trans-equivalence-class rules. There are thus two rule-learning steps: step (1) learns, for each equivalence class, which portions of the URLs in that class are relevant for selecting the page and which are not; step (2) generalizes the per-equivalence-class rules constructed in step (1) to rules that cover many equivalence classes. Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler in future crawls to avoid downloading duplicate web pages or web resources.
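The two learning steps and the error check described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: it assumes exact content-hash equality as a stand-in for "substantially identical", treats only query parameters as the URL "portions", and defines an error as a rule that maps pages with different content to the same normalized URL. All function names and the sample URLs are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def equivalence_classes(pages):
    """Group URLs whose downloaded bodies hash identically (assumed stand-in
    for 'substantially identical'); keep only classes with 2+ URLs."""
    classes = defaultdict(list)
    for url, body in pages.items():
        classes[hashlib.sha256(body.encode()).hexdigest()].append(url)
    return [urls for urls in classes.values() if len(urls) > 1]


def irrelevant_params(urls):
    """Step 1: within one class, a query parameter whose value varies
    cannot be what selects the page, so it is marked irrelevant."""
    queries = [dict(parse_qsl(urlsplit(u).query)) for u in urls]
    names = set().union(*queries)
    return {n for n in names if len({q.get(n) for q in queries}) > 1}


def generalize(classes):
    """Step 2 (simplified): propose one cross-class rule that drops every
    parameter found irrelevant in at least one class."""
    candidates = set()
    for urls in classes:
        candidates |= irrelevant_params(urls)
    return candidates


def normalize(url, drop):
    """Apply a rule: remove the dropped parameters and sort the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in drop)
    return urlunsplit(parts._replace(query=urlencode(kept)))


def rule_has_errors(pages, drop):
    """Assumed error test: the rule must never map two pages with
    different content to the same normalized URL."""
    seen = {}
    for url, body in pages.items():
        if seen.setdefault(normalize(url, drop), body) != body:
            return True
    return False
```

On a small crawl where each class differs only in a session parameter such as `sid`, `generalize` would propose dropping `sid`, and the rule would be activated only if `rule_has_errors` returns `False`. A rule that also dropped a page-selecting parameter such as `id` would collapse distinct pages and be rejected.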
Publication number: US2006218143(A1) Publication date: 2006.09.28
Application number: US20050089988 Filing date: 2005.03.25
Applicant: MICROSOFT CORPORATION Inventor: NAJORK MARC A.
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim:
Address: