Title of invention |
Name search using multiple bitmap distributions |
Abstract |
Provided are a computer-implemented method, computer program product, and system for matching names. For a first bitmap distribution, it is determined whether a first bitmap signature of a query name and a second bitmap signature of a target name have a number of overlapping character n-grams that meets or exceeds a threshold, to generate a first preliminary value. For a second bitmap distribution that is different from the first bitmap distribution, it is determined whether a third bitmap signature of the query name and a fourth bitmap signature of the target name have a number of overlapping character n-grams that meets or exceeds a threshold, to generate a second preliminary value. The first preliminary value and the second preliminary value are combined, and, if the combination results in a value of true, it is determined that the query name and the target name are to be further processed. |
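The two-stage check described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of bigrams (n = 2), the representation of a bitmap distribution as a dictionary from n-gram to bit position, and the choice of logical AND as the combining operation are all assumptions made for the example. Note that the overlap is counted in shared bit positions, which only approximates the number of shared n-grams when distinct n-grams map to the same position.

```python
from typing import Dict, Set


def ngrams(name: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a lower-cased name."""
    s = name.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def signature(name: str, distribution: Dict[str, int], n: int = 2) -> int:
    """Build a bitmap signature: set the bit assigned to each known n-gram."""
    sig = 0
    for gram in ngrams(name, n):
        pos = distribution.get(gram)
        if pos is not None:  # n-grams absent from the distribution are ignored
            sig |= 1 << pos
    return sig


def passes(query: str, target: str, distribution: Dict[str, int],
           threshold: int, n: int = 2) -> bool:
    """Preliminary value: do the signatures share at least `threshold` bits?"""
    shared = signature(query, distribution, n) & signature(target, distribution, n)
    return bin(shared).count("1") >= threshold


def is_candidate(query: str, target: str,
                 dist_a: Dict[str, int], dist_b: Dict[str, int],
                 thr_a: int, thr_b: int) -> bool:
    """Combine both preliminary values (assumed here: logical AND)."""
    first = passes(query, target, dist_a, thr_a)
    second = passes(query, target, dist_b, thr_b)
    return first and second
```

Under these assumptions, a pair that clears the first threshold only because two different n-grams collide on one bit is rejected by the second distribution, where those n-grams occupy different bits.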
Publication number |
US9020911(B2) |
Publication date |
2015.04.28 |
Application number |
US201213353252 |
Filing date |
2012.01.18 |
Applicant |
International Business Machines Corporation |
Inventors |
Biesenbach David E.;Liddle Steven J.;Watjen Stephen J.;Williams Charles K. |
Classification number |
G06F7/00;G06F17/30 |
Main classification number |
G06F7/00 |
Patent agency |
Konrad, Raynes, Davda and Victor LLP |
Agent |
Davda Janaki K.;Konrad, Raynes, Davda and Victor LLP |
Principal claim |
1. A computer program product for matching names, the computer program product comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therein, wherein the computer readable program code, when executed by a processor of a computer, is configured to perform operations of:
creating a first bitmap distribution of character n-grams distributed into bitmap positions in descending order of frequency of occurrence of the character n-grams in a set of names based on bitmap positions with a lowest cumulative frequency, wherein at least two distinct character n-grams are assigned to a same bitmap position of the bitmap positions;
creating a second bitmap distribution of the character n-grams distributed into the bitmap positions so that the at least two distinct character n-grams are assigned to different bitmap positions and so that any overlapping character n-grams in the first bitmap distribution do not overlap in the second bitmap distribution;
using the first bitmap distribution, determining whether a first bitmap signature of a query name and a second bitmap signature of a target name in a set of names have a number of character n-grams overlapping that meet or exceed a first configurable threshold to generate a first preliminary value;
using the second bitmap distribution, determining whether a third bitmap signature of the query name and a fourth bitmap signature of the target name have a number of character n-grams overlapping that meet or exceed a second configurable threshold to generate a second preliminary value; and
in response to determining that a logical operation applied to the first preliminary value and the second preliminary value results in a value of true, determining that the query name and the target name are to be processed for further comparisons. |
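One way to read the two "creating" operations of the claim is sketched below. It is an interpretation under stated assumptions, not the claimed implementation: the helper names, the use of bigrams, the alphabetical tie-breaking, and the rule used to balance the second distribution (least-loaded position among those still allowed) are all hypothetical. The first function assigns n-grams in descending frequency order to the position with the lowest cumulative frequency; the second refuses to place an n-gram in a position that already holds an n-gram sharing its position in the first distribution.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Set


def count_ngrams(names: Iterable[str], n: int = 2) -> Counter:
    """Count occurrences of each character n-gram across a set of names."""
    counts: Counter = Counter()
    for name in names:
        s = name.lower()
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


def by_descending_frequency(counts: Counter):
    """Order n-grams by frequency, highest first; ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def first_distribution(counts: Counter, num_positions: int) -> Dict[str, int]:
    """Greedily assign each n-gram to the least-loaded bitmap position."""
    heap = [(0, pos) for pos in range(num_positions)]  # (cumulative freq, pos)
    heapq.heapify(heap)
    dist: Dict[str, int] = {}
    for gram, freq in by_descending_frequency(counts):
        load, pos = heapq.heappop(heap)
        dist[gram] = pos
        heapq.heappush(heap, (load + freq, pos))
    return dist


def second_distribution(counts: Counter, first: Dict[str, int],
                        num_positions: int) -> Dict[str, int]:
    """Assign n-grams so that pairs colliding in `first` never collide here."""
    loads: List[int] = [0] * num_positions
    # For each new position, the first-distribution positions already present.
    seen: List[Set[int]] = [set() for _ in range(num_positions)]
    dist: Dict[str, int] = {}
    for gram, freq in by_descending_frequency(counts):
        allowed = [p for p in range(num_positions) if first[gram] not in seen[p]]
        if not allowed:
            raise ValueError("too few bitmap positions to separate collisions")
        pos = min(allowed, key=lambda p: (loads[p], p))
        dist[gram] = pos
        loads[pos] += freq
        seen[pos].add(first[gram])
    return dist
```

In this sketch the second distribution can always be built when no position in the first distribution holds more n-grams than there are bitmap positions; otherwise the function raises `ValueError` rather than silently allowing a repeated collision.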
Address |
Armonk NY US |