Title of invention: System for incrementally clustering news stories
Abstract: Disclosed are methods and apparatus for clustering news stories, which are to be presented over a computer network. In general, an incremental clustering system is configured to update a current set of news clusters with newly arrived news articles without having to recompute the clusters for the entire corpus, as well as form new clusters for recently generated news topics. In one embodiment, a plurality of news articles are initially obtained via the computer network, and the news articles are clustered into a plurality of initial clusters. For only news articles, including any unclustered news articles, that are less than a predetermined age limit, it is determined in an incremental clustering process whether to form one or more new clusters or assign to the initial clusters. Indications of the initial clusters and the one or more new clusters, if any, are then stored so as to be accessible for sending a portion of the news articles to users in a clustered format based on the initial clusters and the one or more new clusters, if any.
Publication number: US8832105(B2) Publication date: 2014.09.09
Application number: US201113117022 Filing date: 2011.05.26
Applicant: Yahoo! Inc. Inventors: Punera Kunal;Rajan Suju;Teo Choon Hui;Vadrevu Srinivas
Classification: G06F7/00;G06F17/30 Main classification: G06F7/00
Agency: Weaver Austin Villeneuve & Sampson LLP Agent: Weaver Austin Villeneuve & Sampson LLP
Main claim: 1. A method of clustering news stories that are to be accessed over a computer network, comprising: obtaining a plurality of news articles via the computer network; clustering the news articles into a plurality of initial clusters; for a subset of the news articles in the initial clusters that are less than a predetermined age limit and any unclustered news articles that are less than the predetermined age limit, determining whether to form one or more new clusters or assign to the initial clusters in an incremental clustering process; and storing indications of the initial clusters and the one or more new clusters, if any, so as to be accessible for sending a portion of the news articles to users in a clustered format based on the initial clusters and the one or more new clusters, if any; wherein the news articles that are less than the predetermined age limit are defined as transient articles and the remaining news articles are defined as fixed articles, wherein the incremental clustering is withheld from being performed on the fixed articles so that the fixed articles retain their initial clusters; wherein the incremental clustering process is performed by: for each transient article, finding one or more nearest neighbor articles from the entire corpus of articles, including fixed and transient articles; for each transient article selected from a randomly ordered set, determining whether a ratio of nearest neighbors that are fixed articles to nearest neighbors that are transient articles is greater than a predetermined threshold; and for each transient article selected from the randomly ordered set and based on the determination as to whether the ratio is greater than the predetermined threshold, adding such transient article and its one or more nearest neighbors that are transient articles to one or more of the initial clusters or forming a new cluster for such transient article and its one or more nearest neighbors that are transient articles.
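The incremental step recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The `Article` type, the cosine similarity over a feature vector, the parameter names (`age_limit_hours`, `k`, `ratio_threshold`, `seed`), and the default values are all hypothetical. The claim does not say which initial cluster receives an article when the ratio exceeds the threshold, nor how to treat a transient article that an earlier iteration already placed. The sketch assumes a majority vote among the fixed nearest neighbors for the first case, and skips already-placed articles in the second.

```python
import math
import random
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Article:
    id: str
    vector: List[float]            # hypothetical feature vector (e.g. TF-IDF weights)
    age_hours: float
    cluster: Optional[int] = None  # initial cluster label, None if unclustered


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def incremental_cluster(articles: List[Article], age_limit_hours: float,
                        k: int = 2, ratio_threshold: float = 1.0,
                        seed: int = 0) -> Dict[str, Optional[int]]:
    # Articles younger than the age limit are transient; the rest are fixed
    # and keep their initial clusters untouched.
    transient = [a for a in articles if a.age_hours < age_limit_hours]
    transient_ids = {a.id for a in transient}

    # Nearest neighbors are searched over the entire corpus (fixed + transient).
    neighbors = {}
    for a in transient:
        others = [b for b in articles if b.id != a.id]
        others.sort(key=lambda b: cosine(a.vector, b.vector), reverse=True)
        neighbors[a.id] = others[:k]

    next_label = max((a.cluster for a in articles if a.cluster is not None),
                     default=-1) + 1
    placed = set()

    # Visit transient articles in random order.
    order = list(transient)
    random.Random(seed).shuffle(order)
    for a in order:
        if a.id in placed:          # assumption: already handled as a neighbor
            continue
        fixed_nn = [b for b in neighbors[a.id]
                    if b.id not in transient_ids and b.cluster is not None]
        transient_nn = [b for b in neighbors[a.id] if b.id in transient_ids]
        if transient_nn:
            ratio = len(fixed_nn) / len(transient_nn)
        else:
            ratio = math.inf if fixed_nn else 0.0

        if ratio > ratio_threshold and fixed_nn:
            # Assumption: join the most common cluster among fixed neighbors.
            label = Counter(b.cluster for b in fixed_nn).most_common(1)[0][0]
        else:
            label = next_label      # start a new cluster for a new topic
            next_label += 1

        # The article and its not-yet-placed transient neighbors move together.
        for member in [a] + [b for b in transient_nn if b.id not in placed]:
            member.cluster = label
            placed.add(member.id)

    return {a.id: a.cluster for a in articles}
```

In this sketch, a transient article surrounded mostly by fixed articles is treated as a late arrival on an existing story, while one surrounded mostly by other transient articles seeds a new cluster. Only the transient subset is ever relabeled, which is what avoids reclustering the full corpus.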
Address: Sunnyvale CA US