Title of invention: System and method for training data generation in predictive coding
Abstract: A predictive coding system updates a plurality of training documents for an untrained classification model based on a plurality of additional documents. The plurality of additional documents are selected from a plurality of unlabeled documents based on a decision hyperplane associated with a first trained classification model. The predictive coding system provides the updated plurality of training documents to the untrained classification model to cause the untrained classification model to be retrained and to cause a second trained classification model to be generated.
Publication number: US9607272(B1) Publication date: 2017.03.28
Application number: US201313843501 Filing date: 2013.03.15
Applicant: Veritas Technologies LLC Inventors: Yu Shengke; Rangan Venkat
Classification: G06N99/00 Main classification: G06N99/00
Agency: Wilmer Cutler Pickering Hale and Dorr LLP Agent: Wilmer Cutler Pickering Hale and Dorr LLP
Main claim: 1. A method comprising: determining to improve an effectiveness measure of a first trained classification model, wherein the first trained model is trained using a set of training documents; selecting a plurality of unlabeled documents, wherein the plurality of unlabeled documents are not part of the set of training documents used to train the first trained classification model; generating a support vector based on a determination that one or more of the plurality of unlabeled documents are within a margin of a decision hyperplane associated with the first trained classification model; calculating, by a processor in a predictive coding system, an overall score for each unlabeled document of the plurality of unlabeled documents based on a distance of a respective unlabeled document to the decision hyperplane and an angle diversity of the respective unlabeled document; comparing, by the processor in the predictive coding system, the overall scores of the unlabeled documents to each other to select a pre-determined number of unlabeled documents having lowest scores in the plurality of unlabeled documents; updating, by the processor in the predictive coding system, the set of training documents used to train the first trained classification model by adding the pre-determined number of unlabeled documents having the lowest scores in the plurality of unlabeled documents to the set of training documents; updating the decision hyperplane based on the support vector; providing, by the predictive coding system, the updated set of training documents to the first trained classification model to improve the effectiveness measure of the first trained classification model by generating a second trained classification model from the updated set of training documents; identifying an effectiveness measure of the second trained classification model; and generating a third trained classification model based on a determination that the effectiveness measure of the second trained classification model has improved from the effectiveness measure of the first trained classification model.
Address: Mountain View CA US
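The scoring and selection steps of the main claim (an overall score from hyperplane distance and angle diversity, then picking a pre-determined number of lowest-scoring documents) can be illustrated with a minimal Python sketch. The record does not give the scoring formula, so everything specific here is an assumption for illustration: the weighted sum, the weight `lam`, the use of the largest absolute cosine against a set of `reference` vectors as the angle term, and the function names `overall_scores` and `select_lowest`. It is not presented as the patented method.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(_dot(a, a))


def overall_scores(unlabeled, reference, w, b, lam=0.5):
    """Score each unlabeled document vector (assumed formula, see lead-in).

    unlabeled, reference: lists of non-zero feature vectors.
    w, b: normal vector and offset of the hyperplane w.x + b = 0.
    lam: hypothetical weight between the distance and angle terms.
    """
    w_norm = _norm(w)
    scores = []
    for doc in unlabeled:
        # Distance of the document to the decision hyperplane.
        distance = abs(_dot(w, doc) + b) / w_norm
        # Angle term: largest |cosine| against any reference vector.
        # A small value means the document points in a new direction.
        max_cos = max(
            abs(_dot(doc, ref)) / (_norm(doc) * _norm(ref))
            for ref in reference
        )
        scores.append(lam * distance + (1.0 - lam) * max_cos)
    return scores


def select_lowest(scores, k):
    """Indices of the k lowest scores, ties broken by original order."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


# Toy data: hyperplane x0 = 0, one reference vector along x0.
unlabeled = [[0.1, 1.0], [2.0, 0.0], [0.5, 0.5]]
reference = [[1.0, 0.0]]
scores = overall_scores(unlabeled, reference, w=[1.0, 0.0], b=0.0)
picked = select_lowest(scores, k=2)
print(picked)  # [0, 2]
```

In this toy run the first document is both closest to the hyperplane and nearly orthogonal to the reference vector, so it scores lowest. The selected indices would then be labeled and added to the training set before retraining, as the claim describes.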