Title: Creating a Training Data Set Based on Unlabeled Textual Data
Abstract: A system and method are disclosed for obtaining a plurality of unlabeled text documents; obtaining an initial concept; obtaining keywords from a knowledge source based on the initial concept; scoring the plurality of unlabeled documents based at least in part on the initial keywords; determining a categorization of the documents based on the scores; performing a first feature selection and creating a first vector space representation of each document in a first category and a second category, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and generating the training set including a subset of the obtained unlabeled textual documents, the subset of the obtained unlabeled documents including documents belonging to the first category and documents belonging to the second category.
Publication number: US2017060993(A1) Publication date: 2017.03.02
Application number: US201615253249 Filing date: 2016.08.31
Applicant: Skytree, Inc. Inventors: Pendar Nick; Wang Zhuang
Classification: G06F17/30; G06N99/00 Main classification: G06F17/30
Agency: Agent:
Principal claim: 1. A method comprising: obtaining, using one or more processors, a plurality of unlabeled text documents; obtaining, using the one or more processors, an initial concept; obtaining, using the one or more processors, keywords from a knowledge source based on the initial concept; scoring, using the one or more processors, the plurality of unlabeled documents based at least in part on the initial keywords; determining, using the one or more processors, a categorization of the documents based on the scores; performing, using the one or more processors, a first feature selection and creating a first vector space representation of each document in a first category and a second category associated with the first feature selection, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and generating, using the one or more processors, the training set including a subset of the obtained unlabeled textual documents, the subset of the obtained unlabeled documents including documents belonging to the first category based on the scores and documents belonging to the second category based on the scores, the training set including the first vector space representations for the subset of the obtained unlabeled documents belonging to the first category based on the scores and the second category based on the scores, the first vector space representations serving as one or more labels of the subset of the obtained unlabeled documents belonging to the first category and the second category.
Address: San Jose CA US
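
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the record does not specify the knowledge source, the scoring rule, the category thresholds, or the feature selection criterion, so everything below is an assumption chosen for illustration. `KNOWLEDGE_SOURCE` is a hypothetical in-memory dictionary, the score is the fraction of tokens that match a keyword, the thresholds `hi` and `lo` are arbitrary, and features are ranked by the difference in document frequency between the two categories.

```python
import re
from collections import Counter

# Hypothetical stand-in for a knowledge source (e.g. a thesaurus or ontology).
KNOWLEDGE_SOURCE = {
    "finance": ["bank", "loan", "interest", "credit"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(doc, keywords):
    """Assumed scoring rule: fraction of tokens that are keywords."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in keywords) / len(tokens)


def categorize(scores, hi=0.2, lo=0.0):
    """Assumed thresholds: >= hi is 'first', <= lo is 'second', else None."""
    cats = []
    for s in scores:
        if s >= hi:
            cats.append("first")
        elif s <= lo:
            cats.append("second")
        else:
            cats.append(None)  # ambiguous documents are left out
    return cats


def select_features(docs, cats, k=5):
    """Keep the k terms whose document frequency differs most between categories."""
    df = {"first": Counter(), "second": Counter()}
    for doc, cat in zip(docs, cats):
        if cat is not None:
            df[cat].update(set(tokenize(doc)))
    vocab = set(df["first"]) | set(df["second"])
    # Sort by descending frequency gap, then alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-abs(df["first"][t] - df["second"][t]), t))
    return ranked[:k]


def vectorize(doc, features):
    """Vector space representation: term counts over the selected features."""
    counts = Counter(tokenize(doc))
    return [counts[f] for f in features]


def build_training_set(docs, concept, k=5):
    """Run the pipeline: keywords, scores, categories, features, vectors."""
    keywords = set(KNOWLEDGE_SOURCE[concept])
    scores = [score(d, keywords) for d in docs]
    cats = categorize(scores)
    features = select_features(docs, cats, k)
    training = [
        {"text": d, "category": c, "vector": vectorize(d, features)}
        for d, c in zip(docs, cats)
        if c is not None
    ]
    return features, training
```

Calling `build_training_set(docs, "finance")` on a list of raw strings returns the selected feature terms and one entry per retained document. Documents whose score falls between the two thresholds are dropped, which mirrors the claim's "subset of the obtained unlabeled textual documents". The sketch stores the category next to the vector for readability; the claim itself describes the vector space representation as serving as the label.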