Main claim |
1. A method of detecting and eliminating duplicates in a set of records containing contact information in a database of structured records, wherein each record comprises data relating to a plurality of fields, the method comprising:
normalizing field data for each record using information from a knowledge base including a pre-defined set of formats, a pre-defined set of mappings and a pre-defined set of rules, wherein normalizing a record comprises:
standardizing said field data in said record to represent field data in said record in a standardized format based on a combination of said formats, mappings, and rules, and
extracting data from one or more fields from the group of fields containing name, city, email address, and telephone number in said record based on a combination of said mappings and rules;
transforming said record to facilitate record comparison, wherein said transforming includes filling at least one empty field in said record, wherein said at least one empty field is filled with associated data from an existing field in said record and by using a corresponding mapping from said knowledge base;
generating one or more clusters of records based on record level keys of said records;
identifying a reference record for each cluster from said one or more clusters generated;
calculating a matching percentage for each said record in each said cluster, wherein the matching percentage for a record in a cluster is calculated with respect to the reference record of said cluster;
detecting duplicate records in each cluster based on the matching percentage obtained for each record;
merging records having non-overlapping information from said detected duplicate records in each said cluster; and
purging records having identical information from said detected duplicate records in each said cluster. |
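
The claimed steps can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the claim: the knowledge-base contents (`CITY_ALIASES`, `AREA_CODE_TO_CITY`), the record-level key (first letter of the name plus the city), the choice of reference record (the most populated record in the cluster), the similarity measure (`difflib.SequenceMatcher`), and the 80% threshold.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical knowledge base; these mappings are illustrative only.
CITY_ALIASES = {"nyc": "new york", "sf": "san francisco"}
AREA_CODE_TO_CITY = {"212": "new york", "415": "san francisco"}
FIELDS = ("name", "city", "email", "phone")


def normalize(record):
    """Standardize field formats, then fill an empty city from the phone area code."""
    out = {f: (record.get(f) or "").strip().lower() for f in FIELDS}
    # Keep only digits and drop any country prefix beyond the last 10 digits.
    out["phone"] = "".join(ch for ch in out["phone"] if ch.isdigit())[-10:]
    out["city"] = CITY_ALIASES.get(out["city"], out["city"])
    if not out["city"] and len(out["phone"]) == 10:
        out["city"] = AREA_CODE_TO_CITY.get(out["phone"][:3], "")
    return out


def record_key(record):
    """Record-level key (assumed): first letter of the name plus the city."""
    return (record["name"][:1], record["city"])


def match_percentage(record, reference):
    """Mean string similarity over fields populated in both records, as a percentage."""
    scores = [
        SequenceMatcher(None, record[f], reference[f]).ratio()
        for f in FIELDS
        if record[f] and reference[f]
    ]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


def deduplicate(records, threshold=80.0):
    # Cluster normalized records by their record-level key.
    clusters = defaultdict(list)
    for rec in map(normalize, records):
        clusters[record_key(rec)].append(rec)

    result = []
    for members in clusters.values():
        # Reference record (assumed): the member with the most populated fields.
        reference = max(members, key=lambda r: sum(bool(r[f]) for f in FIELDS))
        merged = dict(reference)
        for rec in members:
            if rec is reference:
                continue
            if match_percentage(rec, reference) >= threshold:
                # Duplicate: copy over any field the merged record still lacks.
                # A duplicate that adds nothing is dropped, which purges it.
                for f in FIELDS:
                    if not merged[f]:
                        merged[f] = rec[f]
            else:
                result.append(rec)  # Not a duplicate; keep it as a separate record.
        result.append(merged)
    return result
```

In this sketch merging and purging fall out of the same loop: a duplicate that carries a field the reference lacks contributes that field, and a duplicate with nothing new is simply not kept. A production system would likely use per-field comparison rules and weights from the knowledge base rather than a single averaged string similarity.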