Title of Invention |
RECOMBINING INCORRECTLY SEPARATED TOKENS IN NATURAL LANGUAGE PROCESSING |
Abstract |
To recombine incorrectly separated tokens in natural language processing (NLP), a determination is made whether a token from an ordered set of tokens is present in a dictionary related to the corpus from which the ordered set is extracted. When the token is not present in the dictionary and a compounding threshold has not been reached, the token is agglutinated with the next adjacent token in the ordered set to form a compound token. The compounding threshold limits the number of tokens that can be agglutinated to form a compound token. A determination is then made whether the compound token is present in the dictionary. When it is, a weight is assigned to the compound token, and a confidence rating of the compound token is computed as a function of the weight. The compound token and the confidence rating are then used in NLP of the corpus. |
Publication Number |
US2016299885(A1) |
Publication Date |
2016.10.13 |
Application Number |
US201514683504 |
Filing Date |
2015.04.10 |
Applicant |
International Business Machines Corporation |
Inventors |
Emanuel Barton W.;Nassar Ahmed M.A.;Rakshit Sarbajit K.;Trim Craig M.;Wong Albert T. |
Classification Codes |
G06F17/28;G06F17/27 |
Main Classification Code |
G06F17/28 |
Agency |
|
Agent |
|
Independent Claim |
1. A method for recombining incorrectly separated tokens in Natural Language Processing (NLP), the method comprising:
determining whether a token from an ordered set of tokens is present in a dictionary, the dictionary being related to a corpus from which the ordered set of tokens is extracted;
determining whether a compounding threshold has been reached, wherein the compounding threshold limits a number of tokens that can be agglutinated to form a compound token;
agglutinating, using a processor and a memory, responsive to the token not being present in the dictionary, and responsive to the compounding threshold not having been reached, the token with a next adjacent token in the ordered set of tokens to form the compound token;
determining whether the compound token is present in the dictionary;
assigning a weight to the compound token responsive to the compound token being present in the dictionary;
computing a confidence rating of the compound token, the confidence rating being a function of the weight; and
using the compound token and the confidence rating in performing an NLP operation on the corpus. |
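The claimed steps can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and several details are assumptions made only for illustration: the function name `recombine` is hypothetical, the dictionary is modeled as a mapping from each known term to a weight, tokens are agglutinated by plain concatenation with no separator, and the confidence rating is taken to be the weight divided by the number of tokens joined. The claim requires only that the confidence rating be some function of the weight.

```python
from typing import Dict, List, Optional, Tuple


def recombine(
    tokens: List[str],
    dictionary: Dict[str, float],
    threshold: int = 3,
) -> List[Tuple[str, Optional[float]]]:
    """Return (token, confidence) pairs.

    Confidence is None for tokens that are passed through unchanged.
    `dictionary` maps each known term to a weight (an assumed representation).
    `threshold` is the maximum number of tokens that may form one compound.
    """
    results: List[Tuple[str, Optional[float]]] = []
    i = 0
    while i < len(tokens):
        # Step 1: a token already in the dictionary needs no recombination.
        if tokens[i] in dictionary:
            results.append((tokens[i], None))
            i += 1
            continue

        compound = tokens[i]
        parts = 1
        match: Optional[Tuple[str, float]] = None

        # Steps 2-4: while the compounding threshold has not been reached,
        # agglutinate the next adjacent token and look the compound up.
        while parts < threshold and i + parts < len(tokens):
            compound += tokens[i + parts]
            parts += 1
            if compound in dictionary:
                # Steps 5-6: assign the weight and derive a confidence rating.
                # Dividing by `parts` is an assumed, illustrative function.
                weight = dictionary[compound]
                match = (compound, weight / parts)
                break

        if match is not None:
            results.append(match)
            i += parts  # skip every token consumed by the compound
        else:
            results.append((tokens[i], None))  # no compound found; keep as-is
            i += 1
    return results


# Usage: "data" is unknown, so it is joined with "base" to form "database".
print(recombine(["the", "data", "base"], {"the": 1.0, "database": 0.8}))
```

In this sketch a lower threshold stops the search early: with `threshold=2`, a term split into three pieces is never reassembled and its pieces are returned unchanged.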
Address |
Armonk NY US |