Title of invention |
System and method for identifying compounds through iterative analysis |
Abstract |
A system and method for identifying compounds through iterative analysis of a measure of association are disclosed. A limit on the number of tokens per compound is specified. Compounds within a text corpus are then evaluated iteratively. The number of occurrences of one or more n-grams within the text corpus is determined. Each n-gram includes up to a maximum number of tokens, each of which is provided in a vocabulary for the text corpus. Based on the number of occurrences, at least one n-gram whose number of tokens equals the limit is identified. A measure of association between the tokens in the identified n-gram is determined. Each identified n-gram with a sufficient measure of association is added to the vocabulary as a compound token, and the limit is adjusted.
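The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not name the measure of association, how the limit is adjusted, or any thresholds, so the PMI-style score, the decrement of the limit by one per pass, and the names `count_ngrams`, `association`, `find_compounds`, `min_count`, and `threshold` are all assumptions made for illustration.

```python
import math
from collections import Counter


def count_ngrams(sentences, n):
    """Count every contiguous run of n tokens across all sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def association(ngram, ngram_count, unigrams, total):
    """PMI-style score (an assumed measure, not taken from the patent):
    log of the observed n-gram probability over the probability expected
    if its tokens occurred independently."""
    p_joint = ngram_count / total
    p_independent = math.prod(unigrams[(tok,)] / total for tok in ngram)
    return math.log(p_joint / p_independent)


def find_compounds(sentences, limit, min_count=2, threshold=1.0):
    """Return the vocabulary extended with compound tokens.

    limit      -- maximum number of tokens per compound
    min_count  -- hypothetical occurrence cutoff for candidate n-grams
    threshold  -- hypothetical cutoff for a "sufficient" association
    """
    vocabulary = {tok for tokens in sentences for tok in tokens}
    unigrams = count_ngrams(sentences, 1)
    total = sum(unigrams.values())

    while limit >= 2:
        # Consider only n-grams whose length equals the current limit.
        for ngram, count in count_ngrams(sentences, limit).items():
            if count < min_count:
                continue
            if association(ngram, count, unigrams, total) >= threshold:
                vocabulary.add(" ".join(ngram))  # store as one compound token
        limit -= 1  # assumed adjustment: examine shorter n-grams next pass
    return vocabulary
```

Calling `find_compounds(sentences, limit=3)` on a list of tokenized sentences returns the original vocabulary plus any two- or three-token sequences that pass both cutoffs. Unlike the method in the abstract, this sketch does not re-tokenize the corpus with the new compound tokens between passes.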
|
Publication number |
US7555428(B1) |
Publication date |
2009.06.30 |
Application number |
US20030647203 |
Filing date |
2003.08.21 |
Applicant |
GOOGLE INC. |
Inventors |
FRANZ ALEXANDER;MILCH BRIAN |
Classification |
G06F17/21;G06F17/27;G06F17/28 |
Main classification |
G06F17/21 |
Agency |
|
Agent |
|
Independent claim |
|
Address |
|