Title of Invention |
Identification and Extraction of New Terms in Documents |
Abstract |
A method and apparatus for extracting new terms from documents for inclusion in a vocabulary collection are disclosed. A document may be parsed to obtain an n-gram phrase indicative of a new term; the phrase may include a plurality of words. The n-gram phrase may be decomposed into a series of bi-gram phrases, each including a first phrase part and a second phrase part of at least one word each. It may then be determined whether the first or second phrase part is in the vocabulary collection. If not, the probability that the bi-gram phrase should be in the vocabulary collection may be estimated. The bi-gram phrase may be added to the vocabulary collection if that probability exceeds a minimum threshold level.
|
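The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation. It rests on several assumptions that the abstract does not confirm: the decomposition is taken to mean splitting the n-gram at every word boundary, "if not" is read as "neither phrase part is in the vocabulary", and the probability estimator is left as a caller-supplied function because the abstract does not say how the probability is computed. The names `decompose`, `extract_new_terms`, `estimate`, and the default threshold of 0.5 are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Bigram = Tuple[str, str]


def decompose(ngram: str) -> List[Bigram]:
    """Split an n-gram at every word boundary into (first part, second part).

    Assumption: each part holds at least one word, so an n-word phrase
    yields n - 1 bi-gram phrases and a single word yields none.
    """
    words = ngram.split()
    return [(" ".join(words[:i]), " ".join(words[i:])) for i in range(1, len(words))]


def extract_new_terms(
    ngram: str,
    vocabulary: Set[str],
    estimate: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical minimum threshold level
) -> List[str]:
    """Add bi-gram phrases to `vocabulary` when the estimate exceeds `threshold`.

    `estimate` is a placeholder for whatever probability model is used;
    the abstract does not specify one.
    """
    added: List[str] = []
    for first, second in decompose(ngram):
        # Assumed reading: only estimate when neither part is already known.
        if first in vocabulary or second in vocabulary:
            continue
        phrase = f"{first} {second}"
        if phrase in vocabulary:
            continue  # already added via an earlier split of the same n-gram
        if estimate(first, second) > threshold:
            vocabulary.add(phrase)
            added.append(phrase)
    return added
```

With a toy estimator such as `lambda first, second: 0.9`, calling `extract_new_terms("deep neural network", {"machine"}, ...)` would add the phrase once, since every split of the same n-gram rejoins to the same string. A real estimator would presumably draw on corpus statistics, which this sketch deliberately leaves out.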
Publication Number |
US2013246045(A1) |
Publication Date |
2013.09.19 |
Application Number |
US201213420149 |
Filing Date |
2012.03.14 |
Applicant |
ULANOV ALEXANDER;SIMANOVSKY ANDREY;HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P. |
Inventor |
ULANOV ALEXANDER;SIMANOVSKY ANDREY |
Classification |
G06F17/27 |
Main Classification |
G06F17/27 |
Agency |
|
Agent |
|
Main Claim |
|
Address |
|