Title of invention: DATA DE-DUPLICATION
Abstract: A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.
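The similarity-based merge step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name a similarity measure or a merge policy, so everything here is an assumption made for illustration: the function names `similarity` and `merge_similar` are hypothetical, the score is the character-level ratio from Python's standard `difflib`, records are modeled as `(primary_key, value)` pairs, and the first value seen in a group is kept as its representative.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed similarity score in [0, 1]; the patent does not specify one."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar(records, threshold=0.85):
    """Greedily merge (primary_key, value) records whose score exceeds threshold.

    Only records with different primary keys are compared, as in the abstract.
    Returns a list of (set_of_primary_keys, representative_value) tuples.
    """
    groups = []  # each entry is [set of keys, representative value]
    for key, value in records:
        for group in groups:
            keys, representative = group
            if key not in keys and similarity(representative, value) > threshold:
                keys.add(key)  # merge into the first sufficiently similar group
                break
        else:
            groups.append([{key}, value])  # no match: start a new group
    return [(keys, value) for keys, value in groups]


if __name__ == "__main__":
    sample = [(1, "555-1234"), (2, "555-1234"), (3, "555-9999")]
    for keys, value in merge_similar(sample):
        print(sorted(keys), value)
```

The threshold of 0.85 is an arbitrary placeholder for the "selected threshold value"; a greedy single pass is used only to keep the sketch short and is not claimed to be the patented procedure.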
Publication number: US2016092479(A1) Publication date: 2016.03.31
Application number: US201514716910 Filing date: 2015.05.20
Applicant: International Business Machines Corporation Inventors: Kabra Namit; Saillet Yannick
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim: 1. A method, executed by a computer, for de-duplicating data, the method comprising: receiving a dataset; receiving common domain information for the dataset, wherein the common domain information defines a set of columns having a common domain; pivoting the dataset along the set of columns having a common domain to provide a pivoted dataset; and de-duplicating the pivoted dataset to provide a de-duplicated dataset.
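One possible reading of the claimed steps is sketched below. The claim does not define "pivoting" in detail, so this sketch assumes it means turning each common-domain column value into its own row (a wide-to-long reshape), and it uses exact matching for the de-duplication step purely for brevity. The function names `pivot_along` and `deduplicate_exact`, the column names, and the sample data are all hypothetical.

```python
def pivot_along(dataset, key_column, common_domain_columns):
    """Reshape wide rows into one row per (primary key, common-domain value).

    dataset: list of dicts; common_domain_columns: the columns that the
    common domain information says share a domain (e.g. phone numbers).
    """
    pivoted = []
    for row in dataset:
        for column in common_domain_columns:
            value = row.get(column)
            if value:  # skip missing or empty values
                pivoted.append(
                    {"key": row[key_column], "source_column": column, "value": value}
                )
    return pivoted


def deduplicate_exact(pivoted):
    """Group pivoted rows by identical value, collecting their primary keys."""
    by_value = {}
    for row in pivoted:
        by_value.setdefault(row["value"], set()).add(row["key"])
    return by_value


if __name__ == "__main__":
    customers = [
        {"id": 1, "home_phone": "555-1234", "work_phone": "555-9999"},
        {"id": 2, "home_phone": "555-9999", "work_phone": None},
    ]
    rows = pivot_along(customers, "id", ["home_phone", "work_phone"])
    print(deduplicate_exact(rows))
```

Under this reading, pivoting is what lets a value stored in `work_phone` for one record be matched against the same value stored in `home_phone` for another, which a column-by-column comparison would miss.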
Address: Armonk NY US