Abstract |
PURPOSE: To build a thesaurus for high-speed natural language processing by sorting words through repeated division into clusters, using the cooccurrence frequency vectors of the words to be sorted and an information criterion. CONSTITUTION: A statistical processing part 1 extracts words from an input document, totals the cooccurrence frequency between each extracted word and the specified context of that word, and prepares a cooccurrence frequency vector for the word. An automatic word sorting part 2 sorts the words using the cooccurrence frequency vectors prepared by the statistical processing part 1 and outputs a thesaurus in which those words are sorted. To sort the words, the automatic word sorting part 2 first divides the group of words to be sorted into two clusters, computes the relation between the two clusters at that point (the total description length), and exchanges words between the two clusters so that this relation is minimized according to the prescribed information criterion. Clustering is then applied again to each of the two resulting clusters, and the division is repeated until the clusters cannot be divided any further. |
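
The procedure described above can be sketched roughly as follows. This is only an illustrative sketch, not the method as claimed: the abstract does not give the description length formula, so the score used here (the cost in bits of encoding each cluster's pooled context counts, plus a penalty of half a log per model parameter) is an assumption. The function names, the alternating initial division, the one-word-at-a-time moves, and the rule that a cluster "cannot be divided" when splitting does not lower the score are likewise hypothetical choices.

```python
import math
from collections import Counter


def cooccurrence_vectors(pairs):
    """Statistical processing: count (word, context) cooccurrences per word."""
    vectors = {}
    for word, context in pairs:
        vectors.setdefault(word, Counter())[context] += 1
    return vectors


def cluster_cost(words, vectors):
    """Bits needed to encode a cluster's pooled context counts (assumed form)."""
    pooled = Counter()
    for w in words:
        pooled.update(vectors[w])
    total = sum(pooled.values())
    return -sum(c * math.log2(c / total) for c in pooled.values())


def description_length(clusters, vectors, n_contexts, n_samples):
    """Assumed total description length: data cost plus parameter cost."""
    data = sum(cluster_cost(c, vectors) for c in clusters)
    n_params = len(clusters) * (n_contexts - 1)
    return data + 0.5 * n_params * math.log2(n_samples)


def split(words, vectors, n_contexts):
    """Divide words into two clusters, moving words while the score drops."""
    words = sorted(words)
    if len(words) < 2:
        return None
    n_samples = sum(sum(vectors[w].values()) for w in words)
    a, b = set(words[::2]), set(words[1::2])  # arbitrary initial division
    best = description_length([a, b], vectors, n_contexts, n_samples)
    improved = True
    while improved:
        improved = False
        for w in words:
            src, dst = (a, b) if w in a else (b, a)
            if len(src) == 1:  # never empty a cluster
                continue
            src.remove(w)
            dst.add(w)
            score = description_length([a, b], vectors, n_contexts, n_samples)
            if score < best - 1e-9:
                best, improved = score, True
            else:  # undo a move that did not help
                dst.remove(w)
                src.add(w)
    whole = description_length([set(words)], vectors, n_contexts, n_samples)
    # Assumed stopping rule: keep the split only if it lowers the score.
    return (a, b) if best < whole - 1e-9 else None


def build_thesaurus(words, vectors, n_contexts):
    """Recursively split clusters; leaves are sorted lists of words."""
    parts = split(words, vectors, n_contexts)
    if parts is None:
        return sorted(words)
    return [build_thesaurus(p, vectors, n_contexts) for p in parts]


# Toy input: each word observed 8 times with a single context word.
pairs = ([("wine", "drink"), ("beer", "drink"),
          ("car", "drive"), ("bus", "drive")] * 8)
vectors = cooccurrence_vectors(pairs)
contexts = {c for v in vectors.values() for c in v}
tree = build_thesaurus(vectors.keys(), vectors, len(contexts))
print(tree)
```

On this toy input the words that share the "drink" context and the words that share the "drive" context end up in separate leaves, and neither leaf is divided further because splitting it would only add parameter cost.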