Title of invention: SYSTEMS AND METHODS FOR A SCALABLE CONTINUOUS ACTIVE LEARNING APPROACH TO INFORMATION CLASSIFICATION
Abstract: Systems and methods for classifying electronic information are provided by way of a Technology-Assisted Review (“TAR”) process. In certain embodiments, the TAR process is a Scalable Continuous Active Learning (“S-CAL”) approach. In certain embodiments, S-CAL selects an initial sample from a document collection, trains a classifier by using a default classification for a portion of the initial sample, scores the initial sample, selects a sub-sample from the initial sample for review, removes the reviewed sub-sample from the initial sample, and repeats the process by re-training the classifier until the initial sample is exhausted. In certain embodiments, a classification threshold is determined using a calculated estimate of the prevalence of relevant information such that the threshold classifies the information in accordance with a determined target criteria. In certain embodiments, the estimate of prevalence is determined from the results of iterations of a TAR process such as S-CAL.
Publication number: US2016371262(A1) Publication date: 2016.12.22
Application number: US201615186387 Filing date: 2016.06.17
Applicant: Cormack Gordon V.;Grossman Maura R. Inventor: Cormack Gordon V.;Grossman Maura R.
Classification: G06F17/30;G06N99/00 Main classification: G06F17/30
Agency: Agent:
Main claim: 1. A system for classifying information, the system comprising: at least one computing device having a processor and physical memory, the physical memory storing instructions that cause the processor to: receive an identification of a relevant document; select a set of documents U from a document collection, wherein the document collection is stored on a non-transitory storage medium; assign a default classification to one or more documents in U to be used as a training set along with the relevant document; train a classifier using the training set; score the documents in U using the classifier; remove one or more documents from the training set; select a first batch size documents from U to form a set V; select a first sub-sample of documents from V to form a set W; present one or more documents in W to a reviewer; receive from the reviewer one or more user coding decisions associated with the presented documents; add one or more of the documents presented to the reviewer to the training set and remove said documents from U; estimate a number of relevant documents in V using the number of relevant documents identified in the user coding decision received from the reviewer; update the classifier using one or more documents in the training set; estimate a prevalence of relevant documents in the document collection; and upon determining that a stopping criteria has been reached, calculate a threshold for the classifier using the estimated prevalence, and classify the documents in the document collection using the classifier and the calculated threshold.
Address: Waterloo CA
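
The loop recited in the main claim can be illustrated with a minimal Python sketch. It is not the applicants' implementation. Everything the record does not state is an assumption made for illustration: the toy centroid classifier, the names `train`, `score` and `s_cal`, the parameters `n_default` and `sub_sample`, the batch growth rule, and the threshold rule (label as relevant roughly as many documents as the prevalence estimate predicts). The sketch also removes the whole batch V from U, so that each document counts once toward the estimate, whereas the claim only requires removing the reviewed documents.

```python
import math
import random


def train(labeled):
    """Toy centroid classifier (assumption): mean(relevant) - mean(non-relevant)."""
    dim = len(labeled[0][0])
    w = [0.0] * dim
    for label, sign in ((1, 1.0), (0, -1.0)):
        vecs = [x for x, y in labeled if y == label]
        for x in vecs:
            for i in range(dim):
                w[i] += sign * x[i] / len(vecs)
    return w


def score(w, x):
    """Higher score means more likely relevant."""
    return sum(wi * xi for wi, xi in zip(w, x))


def s_cal(collection, seed_doc, reviewer, sample_size,
          batch=1, sub_sample=5, n_default=10, rng=None):
    """Return (estimated prevalence, threshold, 0/1 label per document)."""
    rng = rng or random.Random(0)
    U = rng.sample(range(len(collection)), sample_size)  # set U, as indices
    reviewed = [(seed_doc, 1)]         # starts with the identified relevant doc
    est_relevant = 0.0
    B = batch
    while U:                           # stopping criterion here: U is exhausted
        # Default classification: provisionally treat some of U as non-relevant.
        defaults = [(collection[i], 0)
                    for i in rng.sample(U, min(len(U), n_default))]
        # The defaults are used for this round only and never stored, which
        # plays the role of removing them from the training set afterwards.
        w = train(reviewed + defaults)
        U.sort(key=lambda i: score(w, collection[i]), reverse=True)
        V = U[:B]                                      # top-scoring batch
        W = rng.sample(V, min(len(V), sub_sample))     # sub-sample for review
        labels = [reviewer(i) for i in W]              # user coding decisions
        reviewed += [(collection[i], y) for i, y in zip(W, labels)]
        # Scale the relevant count found in W up to the whole batch V.
        est_relevant += sum(labels) * len(V) / len(W)
        U = U[len(V):]                 # assumption: drop all of V, not only W
        B += math.ceil(B / 10)         # assumption: grow the batch each round
    prevalence = est_relevant / sample_size
    w = train(reviewed)                # final classifier from reviewed docs only
    scores = [score(w, x) for x in collection]
    # Assumed threshold rule: cut at the k-th highest score, where k is the
    # number of relevant documents implied by the prevalence estimate.
    k = max(1, round(prevalence * len(collection)))
    threshold = sorted(scores, reverse=True)[k - 1]
    return prevalence, threshold, [1 if s >= threshold else 0 for s in scores]
```

Here `reviewer` is a hypothetical callable that takes a document index and returns 1 (relevant) or 0 (not relevant). In practice it would stand in for a human coding decision, and the toy classifier would be replaced by a real learning algorithm.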