Title: Phrase clustering
Abstract: Systems and associated methods for enhanced concept understanding in large document collections through phrase clustering are described. Embodiments take as input an initial set of phrases and estimate centroids using a clustering process. Embodiments then generate new phrases around each of the current centroids using the current phrases. These new phrases are added to the current set, and the clustering process is iterated. Upon convergence, embodiments finalize clusters based on phrases of any given length.
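The loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: documents are token lists, phrases are word tuples, a centroid is simply the set of most frequent phrases in a cluster, and "convergence" is taken to mean that no new phrases are generated. All function names (`contains`, `assign`, `update_centroids`, `grow_phrases`, `phrase_cluster`) are hypothetical.

```python
from collections import Counter


def contains(doc, phrase):
    """True if the token list `doc` contains `phrase` (a word tuple) contiguously."""
    n = len(phrase)
    return any(tuple(doc[i:i + n]) == phrase for i in range(len(doc) - n + 1))


def assign(docs, phrases, centroids):
    """Assign each document to the centroid that shares the most phrases with it."""
    labels = []
    for doc in docs:
        present = {p for p in phrases if contains(doc, p)}
        scores = [len(present & c) for c in centroids]
        labels.append(scores.index(max(scores)))  # ties go to the lowest index
    return labels


def update_centroids(docs, phrases, labels, k, top=3):
    """Assumed centroid model: the `top` most frequent phrases in each cluster."""
    centroids = []
    for j in range(k):
        counts = Counter(
            p
            for doc, label in zip(docs, labels) if label == j
            for p in phrases if contains(doc, p)
        )
        centroids.append({p for p, _ in counts.most_common(top)})
    return centroids


def grow_phrases(docs, labels, centroids):
    """Propose phrases one word longer by extending centroid phrases in place."""
    new = set()
    for doc, label in zip(docs, labels):
        for p in centroids[label]:
            n = len(p)
            for i in range(len(doc) - n):
                if tuple(doc[i:i + n]) == p:
                    new.add(tuple(doc[i:i + n + 1]))
    return new


def phrase_cluster(docs, seed_phrases, k, max_iter=10):
    """Cluster, generate phrases around centroids, add them, and re-cluster."""
    phrases = set(seed_phrases)
    centroids = [{p} for p in sorted(phrases)[:k]]  # assumed seeding rule
    labels = []
    for _ in range(max_iter):
        labels = assign(docs, phrases, centroids)
        centroids = update_centroids(docs, phrases, labels, k)
        new = grow_phrases(docs, labels, centroids) - phrases
        if not new:  # assumed convergence test: nothing new to add
            break
        phrases |= new
    return labels, phrases
```

Starting from single-word seeds, the phrase set grows by one word per iteration around whichever phrases dominate each cluster, so the final clusters can rest on phrases of mixed lengths.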
Publication number: US8880526(B2) Publication date: 2014.11.04
Application number: US201213596678 Filing date: 2012.08.28
Applicant: International Business Machines Corporation Inventors: Bhattacharya Indrajit; Godbole Shantanu Ravindra; Sharma Akshit
Classification: G06F17/30 Main classification: G06F17/30
Agency: Ference & Associates LLC Agent: Ference & Associates LLC
Independent claim: 1. A method for phrase based clustering comprising: utilizing at least one processor to execute computer code configured to perform the steps of: accessing a collection of items to be clustered; receiving an initial set of phrases as input; clustering the collection of items to be clustered using the initial set of phrases to create centroids; generating a new set of phrases around the centroids; adding the new set of phrases to the initial set of phrases to produce a combined set of phrases; and re-clustering the collection of items to be clustered using the combined set of phrases; wherein said generating of a new set of phrases around the centroids comprises: finding high weight words in a context vector for a centroid; finding existing phrases that appear around words of a centroid; and pruning phrases that do not have high weight for at least one of the words of the centroid; said pruning comprising: generating a higher-order phrase via combining two lower-order phrases, each of the higher-order phrase and the two lower-order phrases comprising a context vector; and employing a monotonicity property, wherein the higher-order phrase has high weight for a word in its context vector if both of the lower order phrases individually each have high weight for the at least one word in their context vectors.
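One way to read the pruning step of the claim is as an Apriori-style candidate filter: a higher-order phrase is only proposed when both of its lower-order parts have high weight for a common centroid word. The sketch below illustrates that reading under several assumptions that are not in the record: context vectors are plain word-to-weight dictionaries, "high weight" means at or above a fixed `threshold`, and two n-word phrases are combined when the tail of one equals the head of the other. The names `high_weight_words`, `join`, and `candidate_phrases` are hypothetical.

```python
def high_weight_words(context, centroid_words, threshold):
    """Centroid words whose weight in this context vector meets the threshold."""
    return {w for w in centroid_words if context.get(w, 0.0) >= threshold}


def join(a, b):
    """Combine two n-word phrases into one (n+1)-word phrase if a's tail is b's head."""
    if len(a) == len(b) and a[1:] == b[:-1]:
        return a + b[-1:]
    return None


def candidate_phrases(contexts, centroid_words, threshold):
    """Return higher-order candidates that survive the monotonicity filter.

    `contexts` maps each lower-order phrase (a word tuple) to its context
    vector (dict of word -> weight). A candidate is kept only when both parent
    phrases have high weight for at least one common centroid word; the value
    stored for it is that set of shared words.
    """
    strong = {
        p: high_weight_words(c, centroid_words, threshold)
        for p, c in contexts.items()
    }
    kept = {}
    for a in contexts:
        for b in contexts:
            if a == b:
                continue  # assumed: do not join a phrase with itself
            combined = join(a, b)
            if combined is None:
                continue
            shared = strong[a] & strong[b]
            if shared:  # otherwise pruned without building its context vector
                kept[combined] = shared
    return kept
```

The benefit of this kind of filter is that most longer phrases are rejected from their parts alone. In a fuller system the surviving candidates would still need their own context vectors computed from the collection.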
Address: Armonk NY US