Title of Invention |
DATA DE-DUPLICATION |
Abstract |
A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein. |
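The pivot-then-merge flow described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the sample records, the column names (`home_phone`, `work_phone`), the helper names `pivot`, `similarity`, and `deduplicate`, the averaged `difflib` string ratio used as the similarity score, and the threshold of 0.9.

```python
from difflib import SequenceMatcher


def pivot(records, columns, domain):
    """Emit one row per non-empty value found in the common-domain columns."""
    pivoted = []
    for rec in records:
        rest = {k: v for k, v in rec.items() if k not in columns}
        for col in columns:
            if rec.get(col):  # skip empty cells
                pivoted.append({**rest, "source_column": col, domain: rec[col]})
    return pivoted


def similarity(a, b, fields):
    """Average string similarity over the compared fields (an assumed scoring rule)."""
    ratios = [SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
              for f in fields]
    return sum(ratios) / len(ratios)


def deduplicate(pivoted, key, domain, fields, threshold):
    """Merge rows whose primary keys differ and whose score exceeds the threshold."""
    parent = {row[key]: row[key] for row in pivoted}

    def find(k):
        # Follow parent links to the representative key of the group.
        while parent[k] != k:
            k = parent[k]
        return k

    for i, a in enumerate(pivoted):
        for b in pivoted[i + 1:]:
            if a[key] != b[key] and similarity(a, b, fields) > threshold:
                root_a, root_b = find(a[key]), find(b[key])
                parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for row in pivoted:
        group = groups.setdefault(find(row[key]), {"ids": set(), domain: set()})
        group["ids"].add(row[key])
        group[domain].add(row[domain])
    return [{"ids": sorted(g["ids"]), domain: sorted(g[domain])}
            for _, g in sorted(groups.items())]


# Hypothetical sample data: ids 1 and 2 share a number stored in different columns.
records = [
    {"id": 1, "name": "John Smith", "home_phone": "555-0100", "work_phone": "555-0199"},
    {"id": 2, "name": "Jon Smith", "home_phone": "555-0199", "work_phone": ""},
    {"id": 3, "name": "Mary Jones", "home_phone": "555-0142", "work_phone": ""},
]

pivoted = pivot(records, ["home_phone", "work_phone"], "phone")
merged = deduplicate(pivoted, "id", "phone", ["name", "phone"], threshold=0.9)
print(merged)
```

In this sketch the pivot is what makes the match visible: the shared number sits in `work_phone` for one record and in `home_phone` for the other, so a column-by-column comparison of the unpivoted rows would not line the two values up. A record with no value in any common-domain column produces no pivoted row here, so a fuller version would need to carry such records through separately.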
Publication Number |
US2016092479(A1) |
Publication Date |
2016.03.31 |
Application Number |
US201514716910 |
Filing Date |
2015.05.20 |
Applicant |
International Business Machines Corporation |
Inventors |
Kabra Namit;Saillet Yannick |
Classification Number |
G06F17/30 |
Main Classification Number |
G06F17/30 |
Agency |
|
Agent |
|
Main Claim |
1. A method, executed by a computer, for de-duplicating data, the method comprising:
receiving a dataset; receiving common domain information for the dataset, wherein the common domain information defines a set of columns having a common domain; pivoting the dataset along the set of columns having a common domain to provide a pivoted dataset; and de-duplicating the pivoted dataset to provide a de-duplicated dataset. |
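The claim takes the common domain information as an input and does not say how it is produced; the abstract names a business catalog or a data classification operation as possible sources. The following is a rough sketch of the classification route only, under stated assumptions: the regular expressions in `DOMAIN_PATTERNS`, the 80% match share, the helper names `classify_column` and `common_domain_info`, and the sample records are all hypothetical and not drawn from the patent.

```python
import re

# Hypothetical domain patterns; a real classifier or business catalog would be richer.
DOMAIN_PATTERNS = {
    "phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def classify_column(values, min_share=0.8):
    """Return the domain matched by at least min_share of non-empty values, else None."""
    filled = [v for v in values if v]
    if not filled:
        return None
    for domain, pattern in DOMAIN_PATTERNS.items():
        hits = sum(1 for v in filled if pattern.match(v))
        if hits / len(filled) >= min_share:
            return domain
    return None


def common_domain_info(records):
    """Group column names by detected domain; keep domains shared by two or more columns."""
    by_domain = {}
    for col in records[0]:
        domain = classify_column([str(r.get(col) or "") for r in records])
        if domain:
            by_domain.setdefault(domain, []).append(col)
    return {d: cols for d, cols in by_domain.items() if len(cols) > 1}


# Hypothetical sample data with two phone columns and a single email column.
records = [
    {"id": 1, "name": "John Smith", "home_phone": "555-0100",
     "work_phone": "555-0199", "email": "john@example.com"},
    {"id": 2, "name": "Jon Smith", "home_phone": "555-0199",
     "work_phone": "", "email": "jon@example.com"},
    {"id": 3, "name": "Mary Jones", "home_phone": "555-0142",
     "work_phone": "", "email": "mary@example.com"},
]

info = common_domain_info(records)
print(info)
```

Only domains detected in at least two columns are kept, since pivoting along a single column would leave the dataset unchanged. The resulting mapping is one possible shape for the "common domain information" that the claimed method receives.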
Address |
Armonk NY US |