Title: AUTOMATED DATA DUPLICATE IDENTIFICATION
Abstract: In an approach to identifying duplicates in data, one or more computer processors receive a request from a user to identify duplicates in a data set. The one or more computer processors retrieve the data set utilizing data discovery. The one or more computer processors perform data profiling on the data set. The one or more computer processors determine one or more domain types of the data set, based, at least in part, on the performed data profiling. The one or more computer processors perform data standardization on the data set, based, at least in part, on the one or more determined domain types. Responsive to performing data standardization, the one or more computer processors perform probabilistic matching on the data set. The one or more computer processors identify two or more duplicates in the data set, based, at least in part, on the probabilistic matching.
Publication number: US2016162507(A1)  Publication date: 2016.06.09
Application number: US201414561927  Filing date: 2014.12.05
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION  Inventors: Gupta Ritesh K.; Kabra Namit; Kumar Manish; Mittapalli Srinivas K.
Classification: G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Principal claim: 1. A method for identifying duplicates in a data set, the method comprising: receiving, by one or more computer processors, a request from a user to identify duplicates in a data set; retrieving, by the one or more computer processors, the data set utilizing data discovery; performing, by the one or more computer processors, data profiling on the data set; determining, by the one or more computer processors, one or more domain types of the data set, based, at least in part, on the performed data profiling; performing, by the one or more computer processors, data standardization on the data set, based, at least in part, on the one or more determined domain types; responsive to performing data standardization, performing, by the one or more computer processors, probabilistic matching on the data set; and identifying, by the one or more computer processors, two or more duplicates in the data set, based, at least in part, on the probabilistic matching.
Address: Armonk NY US
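
Illustrative sketch (not part of the patent record): the claimed sequence of profiling, domain-type determination, standardization, and probabilistic matching can be pictured with a small Python example. The record does not disclose an implementation, so everything below is an assumption made for illustration: the function names, the regex-based profiling, the three domain types (`email`, `phone`, `text`), the 80% profiling cutoff, the m/u probabilities, and the score threshold are all hypothetical. The matching step uses a simple Fellegi-Sunter-style log-likelihood weight, which is one common form of probabilistic matching and may differ from what the applicants intended. Data discovery and the user request are omitted.

```python
import itertools
import math
import re

# Hypothetical profiling patterns; a real profiler would use far richer rules.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^[\d\s()+.-]{7,}$")


def infer_domain_type(values):
    """Profile one column and return the domain type most of its values fit."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "text"
    for domain, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= 0.8:  # assumed cutoff
            return domain
    return "text"


def standardize(value, domain):
    """Normalize a value according to its column's domain type."""
    value = (value or "").strip()
    if domain == "email":
        return value.lower()
    if domain == "phone":
        return re.sub(r"\D", "", value)  # keep digits only
    return " ".join(value.lower().split())  # lowercase, collapse whitespace


# Assumed (m, u) pairs per domain type:
# m = P(field agrees | same entity), u = P(field agrees | different entities).
WEIGHTS = {"email": (0.95, 0.01), "phone": (0.90, 0.02), "text": (0.85, 0.10)}


def match_score(a, b, domains):
    """Sum log2 agreement/disagreement weights over all fields."""
    score = 0.0
    for field, domain in domains.items():
        m, u = WEIGHTS[domain]
        if a[field] and a[field] == b[field]:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def find_duplicates(records, threshold=5.0):
    """Return index pairs of records whose match score reaches the threshold."""
    fields = list(records[0].keys())
    domains = {f: infer_domain_type([r[f] for r in records]) for f in fields}
    clean = [{f: standardize(r[f], domains[f]) for f in fields} for r in records]
    pairs = []
    for i, j in itertools.combinations(range(len(clean)), 2):
        if match_score(clean[i], clean[j], domains) >= threshold:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    records = [
        {"name": "Jane  Doe", "email": "Jane.Doe@Example.com", "phone": "(555) 010-1234"},
        {"name": "jane doe", "email": "jane.doe@example.com", "phone": "555.010.1234"},
        {"name": "John Roe", "email": "john.roe@example.com", "phone": "555-010-9999"},
    ]
    print(find_duplicates(records))
```

In this sketch, standardization runs before matching so that formatting differences (letter case, phone punctuation, extra spaces) do not count as disagreements. The all-pairs comparison is quadratic in the number of records; a practical system would likely add blocking to limit candidate pairs, which the record does not describe.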