Title of invention Detecting duplicate records
Abstract A method for finding duplicates by matching group of fields in records is disclosed. The method comprises standardizing data using field specific knowledge base; extracting at least part of one or more related fields of records; applying a matching attribute function to generate keys on the “comparable” field part extracted data; generating record level keys using generated field level keys; clustering the records based on generated record level keys; identifying reference record for each cluster identified; and calculating matching percentage for each record in a cluster with respect to reference record of the cluster. Devices and systems are disclosed that enable the method for finding duplicates.
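The steps named in the abstract (standardize, generate field-level and record-level keys, cluster, pick a reference record, score each record against it) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the alias table `CITY_ALIASES`, the prefix-based `field_key` function, the choice of name and city as key fields, the "most populated record" rule for the reference record, and the use of `difflib.SequenceMatcher` for the matching percentage are all assumptions made for the example.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical knowledge-base mapping; the record above does not list its contents.
CITY_ALIASES = {"nyc": "new york", "sf": "san francisco"}


def standardize(record):
    """Lower-case, collapse whitespace, map city aliases, keep phone digits only."""
    out = {k: " ".join(str(v).lower().split()) for k, v in record.items()}
    out["city"] = CITY_ALIASES.get(out.get("city", ""), out.get("city", ""))
    out["phone"] = re.sub(r"\D", "", out.get("phone", ""))
    return out


def field_key(value, length=4):
    """Assumed matching attribute function: first few alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", value)[:length]


def record_key(rec):
    """Record-level key built by joining field-level keys (name and city assumed)."""
    return "|".join(field_key(rec.get(f, "")) for f in ("name", "city"))


def cluster(records):
    """Group records that share the same record-level key."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[record_key(rec)].append(rec)
    return list(clusters.values())


def reference_record(records):
    """Assumed rule: the record with the most non-empty fields is the reference."""
    return max(records, key=lambda r: sum(1 for v in r.values() if v))


def match_percentage(rec, ref, fields=("name", "city", "email", "phone")):
    """Average per-field string similarity against the reference, as a percentage."""
    scores = [SequenceMatcher(None, rec.get(f, ""), ref.get(f, "")).ratio()
              for f in fields]
    return 100.0 * sum(scores) / len(scores)
```

In use, each raw record would be passed through `standardize`, the results grouped with `cluster`, and every record in a cluster scored with `match_percentage` against that cluster's `reference_record`. A real system would likely weight fields differently and use several keys per record.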
Publication number US8838549(B2) Publication date 2014.09.16
Application number US200812168727 Filing date 2008.07.07
Applicant Inventor Bodapati Chandra;Gunasekar Noel Vijay
Classification G06F7/00;G06F17/00;G06F17/30 Main classification G06F7/00
Agency Agent
Main claim 1. A method of detecting and eliminating duplicates in a set of records containing contact information in a database of structured records, wherein each record comprises data relating to a plurality of fields, the method comprising: normalizing field data for each record using information from a knowledge base including a pre-defined set of formats, a predefined set of mappings and a pre-defined set of rules, wherein normalizing a record comprises: standardizing said field data in said record to represent field data in said record in a standardized format based on a combination of said formats, mappings, and rules, and extracting data from one or more fields from the group of fields containing name, city, email address, city, and telephone number in said record based on a combination of said mappings and rules; transforming said record to facilitate record comparison, wherein said transforming includes filling at least one empty field in said record, wherein said at least one empty field is filled with an associated data from an existing field in said each record and by using a corresponding mapping from said knowledge base; generating one or more clusters of records based on record level keys of said records; identifying reference record for each cluster from said one or more clusters generated; calculating matching percentage for each said record in each said cluster, wherein matching percentage for a record in a cluster is calculated with respect to reference record of said cluster; detecting duplicate records in each cluster based on matching percentage obtained for each record; merging records having non-overlapping information from said detected duplicate records in each said cluster; and purging records having identical information from said detected duplicate records in each said cluster.
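Beyond the abstract, the claim adds three steps: filling an empty field from an existing field via a knowledge-base mapping, merging duplicates that carry non-overlapping information, and purging duplicates that carry identical information. A minimal sketch of those steps is given below. The area-code table `AREA_CODE_TO_CITY`, the choice of city as the field to fill, and the function names are hypothetical; the claim does not specify which mapping or fields are used, nor the matching-percentage threshold above which records count as duplicates.

```python
# Hypothetical knowledge-base mapping from telephone area code to city.
AREA_CODE_TO_CITY = {"212": "new york", "415": "san francisco"}


def fill_empty_city(rec):
    """Fill an empty city from the phone's area code when the mapping knows it."""
    filled = dict(rec)
    if not filled.get("city"):
        area_code = filled.get("phone", "")[:3]
        filled["city"] = AREA_CODE_TO_CITY.get(area_code, "")
    return filled


def merge_and_purge(reference, duplicates):
    """Fold detected duplicates into a single surviving record.

    A duplicate contributes a value only where the survivor's field is empty
    (merge); a duplicate identical to the survivor contributes nothing and is
    simply dropped (purge).
    """
    survivor = dict(reference)
    for dup in duplicates:
        for field, value in dup.items():
            if value and not survivor.get(field):
                survivor[field] = value
    return survivor
```

Here `duplicates` is assumed to be the subset of a cluster already judged duplicate of `reference`, for example by comparing each record's matching percentage with a chosen cutoff.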
Address