Title of invention: SYSTEM AND METHOD FOR EXTRACTING ENTITIES OF INTEREST FROM TEXT USING N-GRAM MODELS
Abstract: A document (or multiple documents) is analyzed to identify entities of interest within that document. This is accomplished by constructing n-gram or bi-gram models that correspond to different kinds of text entities, such as chemistry-related words and generic English words. The models can be constructed from training text selected to reflect a particular kind of text entity. The document is tokenized, and the tokens are run against the models to determine, for each token, which kind of text entity is most likely to be associated with that token. The entities of interest in the document can then be annotated accordingly.
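The pipeline described in the abstract (train one model per kind of text entity, tokenize the document, score each token against every model, annotate with the most likely kind) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the character-level bigram formulation, the add-one smoothing, the `alphabet_size` constant, the names `CharBigramModel`, `tokenize`, and `annotate`, and the tiny hypothetical training word lists.

```python
import math
import re
from collections import Counter


class CharBigramModel:
    """Character bigram model with add-one smoothing (illustrative choice)."""

    def __init__(self, training_words, alphabet_size=28):
        # alphabet_size is an assumed smoothing constant:
        # 26 letters plus the start (^) and end ($) markers.
        self.alphabet_size = alphabet_size
        self.pair_counts = Counter()
        self.context_counts = Counter()
        for word in training_words:
            padded = "^" + word.lower() + "$"
            for first, second in zip(padded, padded[1:]):
                self.pair_counts[(first, second)] += 1
                self.context_counts[first] += 1

    def log_prob(self, token):
        """Return the smoothed log-probability of the token under this model."""
        padded = "^" + token.lower() + "$"
        total = 0.0
        for first, second in zip(padded, padded[1:]):
            numerator = self.pair_counts[(first, second)] + 1
            denominator = self.context_counts[first] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


def tokenize(text):
    """Split text into alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def annotate(text, models):
    """Label each token with the name of the model that scores it highest."""
    return [
        (token, max(models, key=lambda name: models[name].log_prob(token)))
        for token in tokenize(text)
    ]


# Hypothetical toy training text, one list per kind of text entity.
models = {
    "chemistry": CharBigramModel(
        ["benzene", "toluene", "ethanol", "methanol",
         "propane", "butane", "acetone", "xylene"]
    ),
    "english": CharBigramModel(
        ["the", "and", "with", "from", "this",
         "that", "were", "have", "mixed", "stirred"]
    ),
}

for token, label in annotate("benzene with toluene", models):
    print(token, label)
```

With training sets this small, only tokens that resemble the training words are classified reliably; a realistic system would need far larger training text per entity kind and possibly longer n-grams.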
Publication number: US2009119235(A1) Publication date: 2009.05.07
Application number: US20080335490 Filing date: 2008.12.15
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: KANUNGO TAPAS; RHODES JAMES J.
Classification: G06F15/18; G06F17/27 Main classification: G06F15/18
Agency: Agent:
Main claim:
Address: