Title of invention: Identification and Extraction of New Terms in Documents
Abstract: A method and apparatus for extracting new terms from documents for inclusion in a vocabulary collection are disclosed. A document may be parsed to obtain an n-gram phrase indicative of a new term. The n-gram phrase may include a plurality of words and may be decomposed into a series of bi-gram phrases, each including a first and a second phrase part. The first and second phrase parts each include at least one word. It may then be determined whether the first or second phrase part is in the vocabulary collection. If not, the probability that the bi-gram phrase should be in the vocabulary collection may be estimated. The bi-gram phrase may be added to the vocabulary collection if that probability exceeds a minimum threshold level.
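The steps in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation. It assumes that "decomposing" an n-gram means splitting it at every word boundary into a first and a second part, and that the vocabulary collection is a plain set of strings. The names `decompose`, `extract_new_terms`, `estimate` and the default threshold of 0.5 are hypothetical; the abstract does not say how the probability is estimated, so the estimator is passed in by the caller.

```python
from typing import Callable, List, Set, Tuple

BiGram = Tuple[str, str]  # (first phrase part, second phrase part)


def decompose(words: List[str]) -> List[BiGram]:
    """Split an n-gram at every word boundary (assumed reading of 'decompose').

    Each part holds at least one word, as the abstract requires.
    """
    return [
        (" ".join(words[:i]), " ".join(words[i:]))
        for i in range(1, len(words))
    ]


def extract_new_terms(
    ngram: str,
    vocabulary: Set[str],
    estimate: Callable[[str, str], float],  # hypothetical probability estimator
    threshold: float = 0.5,                 # hypothetical minimum threshold
) -> List[BiGram]:
    """Return the splits that were accepted; accepted phrases join the vocabulary."""
    accepted: List[BiGram] = []
    for first, second in decompose(ngram.split()):
        # Following the abstract literally: estimate only when neither
        # phrase part is already in the vocabulary collection.
        if first in vocabulary or second in vocabulary:
            continue
        # "Exceeds a minimum threshold level" is read as a strict comparison.
        if estimate(first, second) > threshold:
            vocabulary.add(f"{first} {second}")
            accepted.append((first, second))
    return accepted


if __name__ == "__main__":
    vocab = {"data"}
    # A constant estimator stands in for a real statistical model.
    print(extract_new_terms("support vector machine", vocab, lambda f, s: 0.9))
    print(sorted(vocab))
```

Passing the estimator as a parameter keeps the sketch neutral about the scoring model, which could be a frequency statistic or a trained classifier.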
Publication number: US2013246045(A1) Publication date: 2013.09.19
Application number: US201213420149 Filing date: 2012.03.14
Applicant: ULANOV ALEXANDER;SIMANOVSKY ANDREY;HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P. Inventor: ULANOV ALEXANDER;SIMANOVSKY ANDREY
Classification: G06F17/27 Main classification: G06F17/27
Agency: Agent:
Main claim:
Address: