Title of invention: Techniques for clustering structurally similar web pages
Abstract: The web page clustering techniques described herein are URL clustering and page clustering, in which clustering algorithms group together pages that are structurally similar. Regarding URL clustering, because similarly structured pages have similar patterns in their URLs, grouping similar URL patterns groups structurally similar pages. Embodiments of URL clustering may involve (a) URL normalization and (b) URL variation computation. Regarding page clustering, page-feature-based techniques further cluster any given set of homogeneous clusters, reducing the number of clusters based on the underlying page code. Embodiments of page clustering may reduce the number of clusters based on tag probabilities and tag sequence, using an Approximate Nearest Neighborhood (ANN) graph along with an evaluation of intra-cluster and inter-cluster compactness.
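The URL clustering idea in the abstract (similar URL patterns imply structurally similar pages) can be illustrated with a minimal sketch. The abstract does not state the actual normalization rules, so everything below is an assumption made for illustration: the function names `normalize_url` and `cluster_by_pattern`, the `<num>` placeholder, lowercasing the whole URL, and reducing the query string to its sorted parameter names. The sketch does not attempt the URL variation computation or any page-feature step.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a coarse pattern (hypothetical rules, not the patent's)."""
    parts = urlsplit(url.lower())
    # Split the path into non-empty segments.
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: purely numeric segments are record IDs, so replace them.
    pattern_segments = ["<num>" if s.isdigit() else s for s in segments]
    # Assumption: only query parameter names matter, not their values or order.
    keys = sorted(kv.split("=", 1)[0] for kv in parts.query.split("&") if kv)
    pattern = parts.netloc + "/" + "/".join(pattern_segments)
    if keys:
        pattern += "?" + "&".join(keys)
    return pattern


def cluster_by_pattern(urls):
    """Group URLs that share the same normalized pattern."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[normalize_url(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "http://example.com/item/123?id=5&ref=a",
        "http://example.com/item/456?ref=b&id=9",
        "http://example.com/about",
    ]
    for pattern, members in cluster_by_pattern(sample).items():
        print(pattern, len(members))
```

Under these assumed rules, the two `item` URLs fall into one cluster and the `about` page into another. A real system would likely need more careful normalization than exact pattern matching.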
Publication number: US7680858(B2)    Publication date: 2010.03.16
Application number: US20060481734    Filing date: 2006.07.05
Applicant: YAHOO! INC.    Inventors: POOLA KRISHNA LEELA; RAMANUJAPURAM ARUN
Classification: G06F17/30    Main classification: G06F17/30
Agency:    Agent:
Main claim:
Address: