摘要 |
<p>A method and apparatus is provided for extracting key terms from a data set, the method including the steps of identifying a first set of one or more word groups of one or more words that occur more than once in the data set, and removing from this first set a second set of word groups that are sub-strings of longer word groups in the first set. The remaining word groups are key terms. Each word group is weighted according to its frequency of occurrence within the data set. The weighting of any word group may be increased by the frequency of any sub-string of words occurring in the second set and then dividing each weighting by the number of words in the word group. This weighting process operates to determine the order of occurrence of the word groups. Prefixes and suffixes are also removed from each word in the data set. This produces a neutral form of each word so that the weighting values are prefix and suffix independent.</p> |