Abstract |
A method to automatically categorize messages or documents containing text. The method fits within the general framework of supervised learning, in which one or more rules for categorizing data are automatically constructed by a computer on the basis of training data that has been labeled beforehand. More specifically, the method involves the construction of a linear separator: training data is used to construct, for each category, a weight vector w and a threshold t, and the decision of whether a hitherto unseen document d is in the category depends on the outcome of the test w^T x ≥ t, where x is a vector derived from the document d. The method also uses a set L of features selected from the training data in order to construct the numerical vector representation x of a document. The preferred method uses an algorithm based on Gauss-Seidel iteration to determine the weight vector w, which is defined as the solution of a regularized convex optimization problem derived from the principle of minimizing a modified training error.
|
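The decision rule and the Gauss-Seidel style of training described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the function names, the toy documents, the feature list, the threshold t = 0, and in particular the loss. The abstract does not specify the modified training error, so a regularized squared error is swapped in here; it is not the patented objective. The cyclic update solves for one weight at a time while holding the others at their latest values, which is Gauss-Seidel iteration applied to the normal equations of that substitute problem.

```python
# Illustrative sketch only. All names, data, and the squared-error loss are
# assumptions; the abstract does not define the "modified training error".

def featurize(text, features):
    """Build the vector x for a document: one count per feature in the list L."""
    tokens = text.lower().split()
    return [float(tokens.count(f)) for f in features]


def train_gauss_seidel(X, y, lam=0.1, sweeps=200):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * ||w||^2 by cyclic coordinate updates.

    Each inner step solves exactly for w[j] with the other weights fixed at
    their most recent values (Gauss-Seidel on (X^T X + lam I) w = X^T y).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # residual[i] = y[i] - w.x_i, and w starts at zero
    for _ in range(sweeps):
        for j in range(d):
            col = [X[i][j] for i in range(n)]
            denom = sum(c * c for c in col) + lam
            # Add back the current contribution of w[j] before solving for it.
            num = sum(col[i] * (residual[i] + w[j] * col[i]) for i in range(n))
            new_wj = num / denom
            delta = new_wj - w[j]
            for i in range(n):
                residual[i] -= delta * col[i]
            w[j] = new_wj
    return w


def in_category(w, x, t=0.0):
    """Linear separator test from the abstract: w^T x >= t."""
    return sum(wi * xi for wi, xi in zip(w, x)) >= t


# Hypothetical feature set L and labeled training documents (+1 = in category).
L = ["price", "offer", "meeting"]
train_docs = [
    ("special offer price offer", 1.0),
    ("low price offer", 1.0),
    ("team meeting today", -1.0),
    ("meeting notes", -1.0),
]
X = [featurize(text, L) for text, _ in train_docs]
y = [label for _, label in train_docs]
w = train_gauss_seidel(X, y)

promo = in_category(w, featurize("best price offer", L))
work = in_category(w, featurize("project meeting", L))
```

In this toy setup `promo` evaluates to true and `work` to false. A practical system would train one (w, t) pair per category and choose t from held-out data rather than fixing it at zero.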