Independent claim |
1. A method comprising:
obtaining, using one or more processors, a plurality of unlabeled text documents;
obtaining, using the one or more processors, an initial concept;
obtaining, using the one or more processors, keywords from a knowledge source based on the initial concept;
scoring, using the one or more processors, the plurality of unlabeled documents based at least in part on the initial keywords;
determining, using the one or more processors, a categorization of the documents based on the scores;
performing, using the one or more processors, a first feature selection and creating a first vector space representation of each document in a first category and a second category associated with the first feature selection, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and
generating, using the one or more processors, a training set including a subset of the obtained unlabeled textual documents, the subset of the obtained unlabeled documents including documents belonging to the first category based on the scores and documents belonging to the second category based on the scores, the training set including the first vector space representations for the subset of the obtained unlabeled documents belonging to the first category based on the scores and the second category based on the scores, the first vector space representations serving as one or more labels of the subset of the obtained unlabeled documents belonging to the first category and the second category. |
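The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation; it only shows one way the steps could fit together. Everything named here is an assumption made for illustration: the in-memory `KNOWLEDGE_SOURCE` dictionary, the keyword-fraction scoring rule, the two thresholds `hi` and `lo` used to form the first and second categories, and the document-frequency-difference feature selection. The claim does not specify any of these choices.

```python
from collections import Counter

# Hypothetical knowledge source mapping a concept to related keywords.
KNOWLEDGE_SOURCE = {"diabetes": {"diabetes", "insulin", "glucose"}}

def tokenize(text):
    """Lowercase whitespace tokenization with surrounding punctuation stripped."""
    tokens = (w.strip(".,;:!?()").lower() for w in text.split())
    return [t for t in tokens if t]

def score(doc, keywords):
    """Assumed scoring rule: fraction of tokens that are keywords."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    return sum(t in keywords for t in tokens) / len(tokens)

def categorize(scores, hi, lo):
    """High scores go to the first category, low scores to the second.
    Documents in between are left uncategorized (None)."""
    return ["first" if s >= hi else "second" if s <= lo else None for s in scores]

def select_features(docs, categories, k):
    """Assumed feature selection: keep the k terms whose document frequency
    differs most between the two categories (ties broken alphabetically)."""
    df = {"first": Counter(), "second": Counter()}
    for doc, cat in zip(docs, categories):
        if cat is not None:
            df[cat].update(set(tokenize(doc)))
    vocab = set(df["first"]) | set(df["second"])
    ranked = sorted(vocab, key=lambda t: (-abs(df["first"][t] - df["second"][t]), t))
    return ranked[:k]

def vectorize(doc, features):
    """Vector space representation: term counts over the selected features."""
    counts = Counter(tokenize(doc))
    return [counts[f] for f in features]

def build_training_set(docs, concept, hi=0.3, lo=0.05, k=5):
    keywords = KNOWLEDGE_SOURCE[concept]
    scores = [score(d, keywords) for d in docs]
    categories = categorize(scores, hi, lo)
    features = select_features(docs, categories, k)
    # Only the subset of documents placed in a category enters the training set.
    training_set = [
        {"text": d, "category": c, "vector": vectorize(d, features)}
        for d, c in zip(docs, categories)
        if c is not None
    ]
    return training_set, features

docs = [
    "Insulin regulates blood glucose in diabetes patients.",
    "The football match ended in a draw.",
    "Glucose is a sugar found in many foods.",
]
training_set, features = build_training_set(docs, "diabetes")
for row in training_set:
    print(row["category"], row["vector"])
```

In this sketch the third document scores between the two thresholds, so it is left out of the training set, which is one possible reading of "a subset of the obtained unlabeled textual documents". A production system would likely swap in a real knowledge source and a statistical feature selection method.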