Title of invention: SYSTEM AND METHOD FOR REAL-TIME DYNAMIC MEASUREMENT OF BEST-ESTIMATE QUALITY LEVELS WHILE REVIEWING CLASSIFIED OR ENRICHED DATA
Abstract: A system, method and computer program product for validating a document classification process, including a document collection; a document classification process performed on the document collection; a random selection module configured to automatically generate a random validation set of documents from the document collection; and a document review process performed on the random validation set of documents to validate results of the document classification process. The system, method and computer program product are configured to measure, dynamically and in real time, and display on a computer display device a best-case estimate of the quality of the results of the document classification process, based on the documents that have been validated and given the size of the total data set of the document collection.
Publication number: US2016048587(A1) Publication date: 2016.02.18
Application number: US201514922747 Filing date: 2015.10.26
Applicant: MSC INTELLECTUAL PROPERTIES B.V. Inventors: Scholtes Johannes Cornelis; Pasichnyk Yuriy
Classification: G06F17/30; G06N7/00; G06N99/00 Main classification: G06F17/30
Agency: Agent:
Main claim: 1. A computer implemented system for validating a document classification process for eDiscovery, internal investigations, law enforcement activities, compliance audits, records management, legacy data clean-up, or defensible dispositions, the system comprising: a document collection of N documents related to eDiscovery, internal investigations, law enforcement activities, compliance audits, records management, legacy data clean-up, or defensible dispositions; a document classification process performed on the document collection; a random selection module configured to automatically generate a random validation set S of documents based on a user selectable percentage P of the N documents from the document collection; and a manual document review process performed on the random validation set of documents to validate overall results of all of the documents classified by the document classification process, wherein the system is configured to dynamically and in real-time measure and display on a computer display device a best case estimate of a quality of the results of the overall document classification process based on the documents that are validated, given the size N of a total data set of the document collection, and based on a predetermined quality threshold for an overall classification quality desired for the document classification process, and wherein the system is configured to employ automatic document classification methods including at least one of Technology Assisted Review (TAR), Predictive Coding, Machine Assisted Review (MAR), or Computer Assisted Review (CAR), support vector machines (SVM), naive-Bayes classifiers, k-nearest neighbors, rules-based classification, Linear discriminant analysis (LDA), Maximum Entropy Markov Model (MEMM), scatter-gather clustering, and hierarchical agglomerate clustering (HAC).
Address: AMSTERDAM NL
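Illustrative sketch: the main claim describes drawing a random validation set S from a user-selected percentage P of the N documents, then updating a quality estimate as each reviewed document comes in and comparing it with a predetermined threshold. The record does not state which statistical formula the system uses, so the Python below is only a minimal sketch under stated assumptions: quality is taken to be the share of reviewed documents whose classification the reviewer confirms, the "best case" is read as the upper bound of a normal-approximation confidence interval with a finite population correction for N, and all function names (`select_validation_set`, `quality_estimate`, `review`) are hypothetical.

```python
import math
import random


def select_validation_set(doc_ids, percent, seed=None):
    """Draw a simple random sample of roughly `percent` % of the N documents."""
    size = max(1, round(len(doc_ids) * percent / 100))
    rng = random.Random(seed)
    return rng.sample(doc_ids, size)


def quality_estimate(n_total, n_reviewed, n_correct, z=1.96):
    """Return (point, lower, upper) for the share of correct classifications.

    Assumption: normal approximation to the binomial, shrunk by a finite
    population correction so the interval narrows as the reviewed count
    approaches N. The patent record does not specify this formula.
    """
    if n_reviewed == 0:
        return (None, 0.0, 1.0)  # nothing reviewed yet: no information
    p = n_correct / n_reviewed
    if n_total > 1:
        fpc = math.sqrt((n_total - n_reviewed) / (n_total - 1))
    else:
        fpc = 0.0
    margin = z * math.sqrt(p * (1 - p) / n_reviewed) * fpc
    return (p, max(0.0, p - margin), min(1.0, p + margin))


def review(doc_ids, percent, reviewer_agrees, threshold, seed=None):
    """Review the validation set one document at a time.

    After each decision, record the running estimate and whether the
    assumed best case (upper bound) still reaches the quality threshold.
    """
    sample = select_validation_set(doc_ids, percent, seed)
    correct = 0
    history = []
    for reviewed, doc in enumerate(sample, start=1):
        if reviewer_agrees(doc):
            correct += 1
        p, lower, upper = quality_estimate(len(doc_ids), reviewed, correct)
        history.append((reviewed, p, lower, upper, upper >= threshold))
    return history
```

Under these assumptions, a review could be stopped early once the upper bound falls below the threshold, since even the most favorable reading of the reviewed documents would no longer meet the desired quality. A real display would redraw the latest `history` entry after every reviewer decision.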