Title of invention: Systems and methods for identifying and categorizing electronic documents through machine learning
Abstract: Computer-implemented systems and methods are disclosed for identifying and categorizing electronic documents through machine learning. In accordance with some embodiments, a seed set of categorized electronic documents may be used to train a document categorizer based on a machine learning algorithm. The trained document categorizer may categorize electronic documents in a large corpus of electronic documents. Performance metrics associated with performance of the trained document categorizer may be tracked, and additional seed sets of categorized electronic documents may be used to improve the performance of the document categorizer by retraining it on subsequent seed sets. Additional seed sets and categorizations may be iterated through until a desired document categorization performance is reached.
Publication number: US9514414(B1) Publication date: 2016.12.06
Application number: US201615088481 Filing date: 2016.04.01
Applicant: PALANTIR TECHNOLOGIES INC. Inventors: Rosswog James; Gerhardt Matthew; Raboin Eric; Erenrich Daniel; Bogomolov Arseny; Bills Cooper; Anderson Eric; Grossman Jack; Simons Kevin; Levan Matthew; Klein Nathaniel; Beiermeister Ryan; O'Brien Tim
Classification: G06F15/18; G06N5/04; G06F17/30; G06N7/00; G06N99/00 Main classification: G06F15/18
Agency: Knobbe, Martens, Olson & Bear LLP Agent: Knobbe, Martens, Olson & Bear LLP
Principal claim: 1. A system for categorizing electronic documents, comprising: a memory device that stores a set of instructions; at least one processor that executes the instructions to: receive categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents; train a document categorizer on the categorizations using a machine learning algorithm; categorize the remaining electronic documents in the corpus using the trained document categorizer; compare one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations; in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically: analyze a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents; designate the one or more electronic documents of the portion as a second seed set; and provide the second seed set for categorization; receive categorizations for the electronic documents included in the second seed set; retrain the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm; re-categorize the remaining electronic documents in the corpus using the retrained document categorizer; compare one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and iterate through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
Address: Palo Alto CA US
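
The iterative loop recited in the principal claim (seed set, train, categorize, compare to a first threshold, pick a new seed set by a per-document second threshold, retrain) might be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation. Every name here (`train`, `categorize`, `run_seed_loop`, `reviewer`, `perf_threshold`, `doc_threshold`, `batch`) is hypothetical. The bag-of-words scorer stands in for the unspecified machine learning algorithm, and the use of mean confidence as the performance metric and of low confidence as the per-document metric are assumptions.

```python
from collections import Counter


def train(labeled):
    """Build per-category word counts from {doc_id: (text, category)}."""
    model = {}
    for text, category in labeled.values():
        model.setdefault(category, Counter()).update(text.lower().split())
    return model


def categorize(model, text):
    """Return (category, confidence) for one document.

    Confidence is the winning category's share of the smoothed
    word-overlap scores (an assumed per-document metric).
    """
    words = text.lower().split()
    scores = {cat: sum(counts[w] for w in words) + 1  # +1 smoothing
              for cat, counts in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def run_seed_loop(corpus, reviewer, first_seed_ids,
                  perf_threshold, doc_threshold, batch=2, max_rounds=10):
    """Iterate seed sets until mean confidence exceeds perf_threshold.

    corpus:   {doc_id: text}
    reviewer: callable(text) -> category, standing in for a human reviewer
    Returns (model, labeled, remaining, rounds_used).
    """
    # Receive categorizations for the first seed set.
    labeled = {i: (corpus[i], reviewer(corpus[i])) for i in first_seed_ids}
    model, remaining, rounds = {}, {}, 0
    for rounds in range(1, max_rounds + 1):
        # (Re)train, then (re)categorize every document not yet labeled.
        model = train(labeled)
        remaining = {i: categorize(model, text)
                     for i, text in corpus.items() if i not in labeled}
        if not remaining:
            break
        # Performance metric (assumed): mean confidence over remaining docs.
        performance = sum(c for _, c in remaining.values()) / len(remaining)
        if performance > perf_threshold:
            break
        # Second threshold: pick the least confident documents as next seed set.
        uncertain = sorted((i for i, (_, c) in remaining.items()
                            if c < doc_threshold),
                           key=lambda i: remaining[i][1])
        next_seed = uncertain[:batch]
        if not next_seed:
            break  # nothing falls below the per-document threshold
        for i in next_seed:
            labeled[i] = (corpus[i], reviewer(corpus[i]))
    return model, labeled, remaining, rounds


if __name__ == "__main__":
    docs = {1: "invoice payment due amount", 2: "meeting schedule agenda room",
            3: "payment received invoice number", 4: "agenda for team meeting"}
    finance_words = {"invoice", "payment", "amount"}
    label = lambda t: "finance" if finance_words & set(t.split()) else "admin"
    _, seeds, rest, n = run_seed_loop(docs, label, [1, 2], 0.7, 0.8)
    print(n, sorted(seeds), rest)
```

Selecting the least confident documents for the next seed set is one common active-learning choice. The claim language also covers selecting documents that satisfy the second threshold, so a high-confidence variant would fit the same loop.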