Title of invention Managing an archive for approximate string matching
Abstract In one aspect, in general, a method is described for managing an archive. The archive is used for determining approximate matches associated with strings occurring in records. The method includes processing records to determine a set of string representations that correspond to strings occurring in the records. The method also includes generating, for each of at least some of the string representations in the set, a plurality of close representations that are each generated from at least some of the same characters in the string. The method also includes storing entries in the archive. Each stored entry represents a potential approximate match between at least two strings based on their respective close representations.
Publication number US8775441(B2) Publication date 2014.07.08
Application number US200812015085 Filing date 2008.01.16
Applicant Ab Initio Technology LLC Inventor Anderson Arlen
Classification G06F17/30;G06F15/16 Main classification G06F17/30
Agency Fish & Richardson P.C. Agent Fish & Richardson P.C.
Principal claim 1. A method for managing an archive for determining approximate matches associated with strings occurring in records, the method including: determining a set of strings occurring in the records, the set of strings including a first string; generating, for each of the strings in the set, a plurality of deletion variants that are each generated by deleting one or more characters from the corresponding string; for the first string, identifying one or more potentially matching strings in the set of strings, each potentially matching string of the potentially matching strings identified in response to determining that any deletion variant of the first string matches any deletion variant of the potentially matching string; for each of the potentially matching strings, calculating a corresponding match score; for at least some of the potentially matching strings, storing a record in the archive identifying the first string, the potentially matching string, and the match score; determining a count of occurrences of the first string in the records; for each of the potentially matching strings, determining a count of occurrences of the respective potentially matching string in the records; and generating a significance value for the first string based on a sum of at least the count of occurrences of the string and the count of occurrences of each of the one or more potentially matching strings.
Address Lexington MA US
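The steps recited in the principal claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`deletion_variants`, `match_score`, `build_archive`), the `max_deletions` parameter, and the use of `difflib.SequenceMatcher` as the match score are all assumptions made for illustration, since the record does not specify how the score is calculated. The sketch also assumes that the unmodified string is kept alongside its deletion variants, so that a string and its one-character-shorter neighbor share a variant.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def deletion_variants(s, max_deletions=1):
    """Return s itself plus every string made by deleting up to
    max_deletions characters from s (keeping s is an assumption)."""
    variants = {s}
    for k in range(1, max_deletions + 1):
        if k >= len(s):
            break  # do not delete every character
        for positions in combinations(range(len(s)), k):
            drop = set(positions)
            variants.add("".join(c for i, c in enumerate(s) if i not in drop))
    return variants


def match_score(a, b):
    """Hypothetical score in (0, 1]; the claim leaves the scoring method open."""
    return SequenceMatcher(None, a, b).ratio()


def build_archive(strings, max_deletions=1):
    """Return (archive, significance) for the strings occurring in the records.

    archive: list of (first_string, potentially_matching_string, score)
    significance: first_string -> its own count plus the counts of its
                  potentially matching strings
    """
    counts = Counter(strings)

    # Index every distinct string under each of its deletion variants.
    index = defaultdict(set)
    for s in counts:
        for v in deletion_variants(s, max_deletions):
            index[v].add(s)

    archive = []
    significance = {}
    for s in counts:
        # Two strings are potential matches if they share any variant.
        candidates = set()
        for v in deletion_variants(s, max_deletions):
            candidates |= index[v]
        candidates.discard(s)

        for other in sorted(candidates):
            archive.append((s, other, match_score(s, other)))
        significance[s] = counts[s] + sum(counts[o] for o in candidates)
    return archive, significance


if __name__ == "__main__":
    records = ["smith", "smith", "smth", "smyth", "jones"]
    archive, significance = build_archive(records)
    for entry in archive:
        print(entry)
    print(significance)
```

With these sample records, "smith" and "smyth" are paired because deleting one character from each yields the shared variant "smth", and "smith" receives a significance of 4 (two occurrences of its own plus one each for "smth" and "smyth"). Indexing by variant avoids comparing every pair of strings directly; candidates are found by lookup instead.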