Title of invention: RECOMBINING INCORRECTLY SEPARATED TOKENS IN NATURAL LANGUAGE PROCESSING
Abstract: To recombine incorrectly separated tokens in NLP, a determination is made whether a token from an ordered set of tokens is present in a dictionary related to a corpus from which the ordered set is extracted. When the token is not present in the dictionary, and when a compounding threshold has not been reached, the token is agglutinated with a next adjacent token in the ordered set to form a compound token. The compounding threshold limits the number of tokens that can be agglutinated to form a compound token. A determination is made whether the compound token is present in the dictionary. A weight is assigned to the compound token when the compound token is present in the dictionary, and a confidence rating of the compound token is computed as a function of the weight. The compound token and the confidence rating are used in NLP of the corpus.
Publication number: US2016299885(A1) Publication date: 2016.10.13
Application number: US201514683504 Filing date: 2015.04.10
Applicant: International Business Machines Corporation Inventors: Emanuel Barton W.; Nassar Ahmed M.A.; Rakshit Sarbajit K.; Trim Craig M.; Wong Albert T.
Classification: G06F17/28; G06F17/27 Main classification: G06F17/28
Agency: (not listed) Agent: (not listed)
Principal claim: 1. A method for recombining incorrectly separated tokens in Natural Language Processing (NLP), the method comprising: determining whether a token from an ordered set of tokens is present in a dictionary, the dictionary being related to a corpus from which the ordered set of tokens is extracted; determining whether a compounding threshold has been reached, wherein the compounding threshold limits a number of tokens that can be agglutinated to form a compound token; agglutinating, using a processor and a memory, responsive to the token not being present in the dictionary, and responsive to the compounding threshold not having been reached, the token with a next adjacent token in the ordered set of tokens to form the compound token; determining whether the compound token is present in the dictionary; assigning a weight to the compound token responsive to the compound token being present in the dictionary; computing a confidence rating of the compound token, the confidence rating being a function of the weight; and using the compound token and the confidence rating in performing an NLP operation on the corpus.
Address: Armonk NY US
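
Illustrative sketch (not part of the patent record): the steps recited in the principal claim can be approximated in Python as shown here. The record does not say how the weight is assigned or which function maps the weight to the confidence rating, so everything numeric in this sketch is an assumption made for illustration: the weight is taken as 1 divided by the number of joined tokens, the confidence rating is taken as equal to the weight, and tokens already in the dictionary are given a weight of 1.0. The names `Candidate`, `recombine`, and `max_parts` are hypothetical, and the continued agglutination up to the threshold is one possible reading of the claim.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    text: str          # original token or recombined compound token
    parts: int         # number of original tokens joined to form `text`
    weight: float      # ASSUMPTION: 1 / parts when found, 0.0 when not found
    confidence: float  # ASSUMPTION: identical to the weight


def recombine(tokens: List[str], dictionary: Set[str], max_parts: int = 3) -> List[Candidate]:
    """Rejoin adjacent tokens that form a dictionary word.

    `max_parts` plays the role of the compounding threshold: it caps how many
    tokens may be agglutinated into one compound token.
    """
    results: List[Candidate] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in dictionary:
            # Token is already a known word; nothing to recombine.
            results.append(Candidate(token, 1, 1.0, 1.0))
            i += 1
            continue

        compound, parts, found = token, 1, False
        # Agglutinate with the next adjacent token while the threshold
        # has not been reached and tokens remain.
        while parts < max_parts and i + parts < len(tokens):
            compound += tokens[i + parts]
            parts += 1
            if compound in dictionary:
                found = True
                break

        if found:
            weight = 1.0 / parts   # ASSUMPTION: fewer joins, higher weight
            confidence = weight    # ASSUMPTION: identity function of the weight
            results.append(Candidate(compound, parts, weight, confidence))
            i += parts             # skip the tokens consumed by the compound
        else:
            # No dictionary match within the threshold; keep the token as is.
            results.append(Candidate(token, 1, 0.0, 0.0))
            i += 1
    return results


if __name__ == "__main__":
    words = {"the", "data", "base", "server"}
    for cand in recombine(["the", "data", "ba", "se", "ser", "ver"], words):
        print(cand)
```

In this sketch the fragments "ba" + "se" and "ser" + "ver" are rejoined into "base" and "server", while a run of fragments that needs more than `max_parts` tokens to form a dictionary word is left unjoined with a confidence of 0.0. A downstream NLP step could use the confidence value to decide whether to trust the recombined token.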