Title of invention: Document clustering system, document clustering method, and recording medium
Abstract: In the provided document clustering system (100), a concept tree structure accumulation unit (11) stores a concept tree structure that represents a hierarchical relationship among concepts represented by each of a plurality of words. For any two words, a concept similarity computation unit (12) obtains a concept similarity, which is an index indicating how close the concepts represented by the two words are. Using concept similarities for words that appear in two documents in a document set, an inter-document similarity computation unit (13) obtains an inter-document similarity, which indicates how similar the two documents are semantically. A clustering unit (14) uses inter-document similarities to cluster the documents in the document set.
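The pipeline described in the abstract (concept tree, concept similarity, inter-document similarity, clustering) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the patent: the toy tree in PARENT, the Wu-Palmer-style depth ratio used as the concept similarity, the averaged best-match rule used as the inter-document similarity, and the threshold-based single-link grouping in cluster.

```python
from itertools import combinations

# Hypothetical concept tree, stored as child -> parent (not from the patent).
PARENT = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal",
    "car": "vehicle", "truck": "vehicle",
    "animal": "entity", "vehicle": "entity",
}

def path_to_root(word):
    """Return [word, parent, ..., root]; an unknown word is its own root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def concept_similarity(w1, w2):
    """Assumed measure: 2 * depth(lowest common ancestor) / (depth(w1) + depth(w2))."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    if p1[-1] != p2[-1]:
        return 0.0  # the words share no ancestor
    ancestors = set(p1)
    lca = next(node for node in p2 if node in ancestors)
    # Depth is the length of the path to the root (the root has depth 1).
    return 2 * len(path_to_root(lca)) / (len(p1) + len(p2))

def document_similarity(doc_a, doc_b):
    """Assumed measure: symmetric average of each word's best match in the other document."""
    if not doc_a or not doc_b:
        return 0.0
    def directed(src, dst):
        return sum(max(concept_similarity(s, d) for d in dst) for s in src) / len(src)
    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2

def cluster(docs, threshold=0.6):
    """Single-link grouping: documents at or above the threshold share a label."""
    labels = list(range(len(docs)))
    for i, j in combinations(range(len(docs)), 2):
        if document_similarity(docs[i], docs[j]) >= threshold:
            old, new = labels[j], labels[i]
            labels = [new if label == old else label for label in labels]
    return labels

docs = [["dog", "cat"], ["cat", "sparrow"], ["car", "truck"]]
labels = cluster(docs)
```

With this toy data the two animal-related documents receive the same label even though they share only one literal word, while the vehicle document stays separate. Any other tree-based similarity or clustering algorithm could be substituted without changing the overall structure.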
Publication number: US8965896(B2) Publication date: 2015.02.24
Application number: US201013518401 Filing date: 2010.12.21
Applicant: NEC Corporation Inventors: Mizuguchi Hironori;Kusui Dai
Classification: G06F7/00;G06F17/30 Main classification: G06F7/00
Agency: Young & Thompson Agent: Young & Thompson
Main claim: 1. A document clustering system comprising: a control device comprising a computer device that executes each of: a concept tree structure acquisition unit that acquires a concept tree structure that represents the hierarchical relationship of a concept of a plurality of words; a document set acquisition unit that acquires a document set, which is a collection of documents; a concept similarity computation unit that finds the concept similarity between two arbitrary words of the document set that was acquired by the document set acquisition unit, the concept similarity being an index indicating the closeness of the two words in a concept; an inter-document similarity computation unit that finds inter-document similarity, which is the degree of semantic similarity between two documents that are included in the document set that was acquired by the document set acquisition unit, based on the concept similarity found by the concept similarity computation unit; a clustering unit that performs document clustering of the document set based on the inter-document similarity that was found by the inter-document similarity computation unit; a co-occurring pattern acquisition unit that acquires co-occurring patterns that include words and co-occurrence that co-occur with the words in the concept tree structure that was acquired by the concept tree structure acquisition unit; and a superconcept setting unit that selects context conforming higher-order words, which are higher-order words of the higher-order words that are common with the two words, and of which the number of the words in a specified range in each document that includes each of the two words that coincide with the words of the co-occurring pattern acquired by the co-occurring pattern acquisition unit for the common higher-order words is a maximum, wherein the concept similarity computation unit finds the concept similarity based on the context conforming higher-order words that were selected by the superconcept setting unit from among the higher-order words that are common with the two words.
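The superconcept setting step in the claim can be read as a form of disambiguation: when two words share several higher-order words, the one whose co-occurring pattern best matches the words near each occurrence is selected. The following Python sketch shows one possible reading under stated assumptions. The tables HIGHER and CO_OCCUR, the function names, the window radius of 4 tokens, the summing of match counts over both documents, and the alphabetical tie-break are all hypothetical and are not taken from the patent.

```python
# Hypothetical data: higher-order words for each word (a word may have several).
HIGHER = {
    "jaguar": {"animal", "car"},
    "mustang": {"animal", "car"},
}

# Hypothetical co-occurring patterns: words that tend to appear with each higher-order word.
CO_OCCUR = {
    "animal": {"jungle", "prey", "wild", "herd"},
    "car": {"engine", "drive", "dealer", "road"},
}

def window(tokens, word, radius):
    """Collect the distinct tokens within `radius` positions of each occurrence of `word`."""
    near = set()
    for i, token in enumerate(tokens):
        if token == word:
            near.update(tokens[max(0, i - radius):i])
            near.update(tokens[i + 1:i + 1 + radius])
    return near

def select_context_conforming(w1, doc1, w2, doc2, radius=4):
    """Pick the common higher-order word whose pattern matches the most nearby words."""
    common = HIGHER.get(w1, set()) & HIGHER.get(w2, set())
    if not common:
        return None
    near1 = window(doc1, w1, radius)
    near2 = window(doc2, w2, radius)

    def matches(higher_word):
        pattern = CO_OCCUR.get(higher_word, set())
        # Assumption: the counts from both documents are simply added.
        return len(near1 & pattern) + len(near2 & pattern)

    # Sorting first makes ties resolve alphabetically (an arbitrary choice).
    return max(sorted(common), key=matches)

car_doc1 = "the jaguar has a powerful engine and is fun to drive".split()
car_doc2 = "the dealer sold a mustang with a new engine".split()
selected = select_context_conforming("jaguar", car_doc1, "mustang", car_doc2)
```

In this toy example the surrounding words ("engine", "dealer") match the pattern for "car", so the concept similarity between "jaguar" and "mustang" would then be computed relative to "car" rather than "animal".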
Address: Tokyo JP