Title of invention TWO-PASS HASH EXTRACTION OF TEXT STRINGS
Abstract Data compression and key word recognition may be provided. A first pass may walk a text string, generate terms, and calculate a hash value for each generated term. For each hash value, a hash bucket may be created in which an associated occurrence count is maintained. The hash buckets may be sorted by occurrence count, and only a few top buckets may be kept. Once those top buckets are known, a second pass may walk the text string again, generate terms, and calculate a hash value for each term. If a term's hash value matches the hash value of one of the kept buckets, the term may be considered a frequent term and may be added to a dictionary along with a corresponding frequency count. The dictionary may then be examined to remove terms that are not actually frequent but appeared because of hash collisions.
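The two passes described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the term generator (lowercased, whitespace-separated words), the CRC-32 hash, the function names, and the parameters `num_buckets`, `top_buckets`, and `min_count` are all assumptions made for illustration, since the abstract does not specify them.

```python
import zlib
from collections import Counter


def generate_terms(text):
    """Hypothetical term generator: lowercased, whitespace-separated words."""
    return text.lower().split()


def bucket_of(term, num_buckets):
    """Map a term to a bucket index (assumed hash: CRC-32 modulo num_buckets)."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() on str.
    return zlib.crc32(term.encode("utf-8")) % num_buckets


def two_pass_frequent_terms(text, num_buckets=1024, top_buckets=3, min_count=2):
    # Pass 1: keep only an occurrence count per hash bucket; no terms are stored.
    bucket_counts = Counter(
        bucket_of(term, num_buckets) for term in generate_terms(text)
    )
    # Sort buckets by count and keep the few most frequent ones.
    kept = {bucket for bucket, _ in bucket_counts.most_common(top_buckets)}

    # Pass 2: walk the text again and count only terms that fall in a kept bucket.
    dictionary = Counter(
        term for term in generate_terms(text)
        if bucket_of(term, num_buckets) in kept
    )

    # Cleanup: drop terms that landed in a kept bucket only through a collision.
    # The min_count threshold is an assumed criterion for "not frequent".
    return {term: count for term, count in dictionary.items() if count >= min_count}


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    print(two_pass_frequent_terms(sample, top_buckets=2))
```

The point of the first pass in this sketch is that memory is bounded by the number of buckets rather than by the number of distinct terms; actual term strings are stored only in the second pass, and only for the kept buckets.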
Publication No. KR20100059901(A) Publication date 2010.06.04
Application No. KR20107006410 Filing date 2008.08.28
Applicant MICROSOFT CORP. Inventor POUZIN DOMINIC
Classification G06F17/00;G06F17/21 Main classification G06F17/00
Agency Agent
Principal claim
Address