Title of Invention: FORECASTABLE SUPERVISED LABELS AND CORPUS SETS FOR TRAINING A NATURAL-LANGUAGE PROCESSING SYSTEM
Abstract: A method and associated systems for forecastable supervised labels and corpus sets for training a natural-language processing system. An NLP-training system asks an “oracle” expert to answer a predictive test question and, in response, receives from the oracle an answer, rationales for selecting that answer, and identifications of extrinsic natural-language sources of evidence that supports those rationales. The system retrieves updated versions of that evidence at a later time, and returns that updated evidence to the oracle. In response, the oracle returns an updated answer and rationales based on the updated evidence. The system then compares time-varying characteristics of the evidence in order to determine the relative contributions of each piece of evidence to the oracle's selections. Less relevant evidence is discarded and the remaining, optimized, evidence is forwarded to the NLP system to be used as training data.
Publication No.: US2017124479(A1) Publication Date: 2017.05.04
Application No.: US201514927766 Filing Date: 2015.10.30
Applicant: International Business Machines Corporation Inventors: Baughman Aaron K.;Diamanti Gary F.;Marzorati Mauro
Classification: G06N99/00;G06F17/28 Main Classification: G06N99/00
Agency: Agent:
Principal Claim: 1. An NLP-training system comprising a processor, a memory coupled to the processor, and a computer-readable hardware storage device coupled to the processor, the storage device containing program code configured to be run by the processor via the memory to implement a method for forecastable supervised labels and corpus sets for training a natural-language processing system, the method comprising: the training system selecting an oracle, wherein an oracle is a human or computerized expert in a particular field of endeavor; the training system receiving from the oracle a first label and a first set of rationales, wherein the oracle selected the first label and the first set of rationales at a first time as a function of natural-language content stored at one or more extrinsic electronic sources, wherein the first label identifies a correct answer to a predictive question, wherein the first set of rationales identifies one or more reasons why the oracle selected the first label, and wherein answering the predictive question comprises predicting a future occurrence; the training system adding the natural-language content to a first set of corpora; the training system retrieving from the one or more extrinsic electronic sources, at a second time, a later version of the natural-language content; the training system creating a second set of corpora by adding to the first set of corpora the later version of the natural-language content; the training system communicating the second set of corpora to the oracle; the training system accepting from the oracle, in response to the communicating, a second label and a second set of rationales; and the training system eliminating less relevant datasets from the second set of corpora as a function of the relative relevance of each dataset of the second set of corpora to the oracle's selection of the second label and the second set of rationales.
Address: Armonk, NY, US
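
Illustrative sketch (not part of the patent record): the last two corpus-handling steps of claim 1, building the second set of corpora and eliminating less relevant datasets, might look roughly like the Python below. The claim does not say how relevance is computed, so the sketch assumes relevance scores are supplied by some other component. The names Dataset, build_second_corpora, prune_corpora, and the keep_fraction cutoff rule are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """One piece of natural-language evidence (hypothetical structure)."""
    source: str     # identifier of the extrinsic electronic source
    retrieved: str  # "first" or "second" retrieval time
    text: str       # the natural-language content itself


def build_second_corpora(first: list[Dataset], later: list[Dataset]) -> list[Dataset]:
    """Second set of corpora = first set plus the later versions of the content."""
    return list(first) + list(later)


def prune_corpora(
    corpora: list[Dataset],
    relevance: dict[Dataset, float],
    keep_fraction: float = 0.5,
) -> list[Dataset]:
    """Drop datasets scoring below keep_fraction of the highest score.

    Assumption: `relevance` maps each dataset to a non-negative score that
    reflects its contribution to the oracle's second label and rationales.
    The relative cutoff is an illustrative choice, not taken from the claim.
    """
    if not corpora:
        return []
    top = max(relevance.get(d, 0.0) for d in corpora)
    cutoff = top * keep_fraction
    return [d for d in corpora if relevance.get(d, 0.0) >= cutoff]


# Example with made-up evidence and made-up scores.
first = [
    Dataset("news-site", "first", "Team A leads the league."),
    Dataset("forum", "first", "Rumours about a transfer."),
]
later = [Dataset("news-site", "second", "Team A's striker is injured.")]

second = build_second_corpora(first, later)
scores = {second[0]: 0.8, second[1]: 0.1, second[2]: 0.9}
kept = prune_corpora(second, scores)
```

With these made-up scores the cutoff is half of the top score, so the low-scoring forum entry is dropped and both news-site versions are retained as training data.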