Abstract |
A method for classifying a new text document using a collection of training instances, some with known classes and some with unknown classes, includes: a first parameter learning step of estimating the word distribution θz for each class z; a second parameter learning step of estimating the background distribution γ and the degree of interpolation δ between γ and θz, such that the probability of observing the collection of all instances with known and unknown classes is maximized; and a classification step, including calculating, for each word of a new instance, the probabilities that the word is generated from the word distribution θz and from the background distribution γ; combining the two probabilities using δ; and combining the probabilities of all words to estimate, for each class z, a document probability that indicates how likely the document is to have been generated from the class z; the new instance being classified as the class z* for which the document probability is the highest. |
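
The classification step described above can be illustrated with a minimal Python sketch. It assumes the parameters θz, γ, and δ have already been learned and covers neither learning step. The abstract does not state the direction of the interpolation, so the sketch assumes the per-word probability is (1 − δ)·θz(w) + δ·γ(w); the opposite convention is equally plausible. The function name `classify`, the dictionary layout, and the toy numbers are hypothetical and are not taken from the source. Word probabilities are combined by summing logarithms, which is equivalent to multiplying them but avoids numerical underflow.

```python
import math


def classify(words, theta, gamma, delta):
    """Score a document against every class and return the best class.

    words: list of tokens in the new document
    theta: dict mapping class z -> dict mapping word -> probability (theta_z)
    gamma: dict mapping word -> probability (background distribution)
    delta: interpolation weight in [0, 1]; ASSUMED here to be the weight
           placed on the background distribution gamma
    """
    scores = {}
    for z, theta_z in theta.items():
        log_prob = 0.0
        for w in words:
            # Probability of the word under the class and the background.
            p_class = theta_z.get(w, 0.0)
            p_background = gamma.get(w, 0.0)
            # Combine the two probabilities using delta (assumed convention).
            p = (1.0 - delta) * p_class + delta * p_background
            if p <= 0.0:
                # The word is impossible under both distributions.
                log_prob = float("-inf")
                break
            # Combine word probabilities in log space.
            log_prob += math.log(p)
        scores[z] = log_prob
    # z* is the class with the highest document probability.
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up parameters for illustration only.
theta = {
    "sports": {"ball": 0.7, "vote": 0.1, "the": 0.2},
    "politics": {"ball": 0.1, "vote": 0.7, "the": 0.2},
}
gamma = {"ball": 0.2, "vote": 0.2, "the": 0.6}

best, scores = classify(["the", "ball", "ball"], theta, gamma, delta=0.3)
print(best)  # prints: sports
```

In this toy example the common word "the" contributes the same factor to both classes, so the decision rests on the class-specific word "ball".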