Title of invention |
Data cleaning |
Abstract |
A computer-implemented method comprising partitioning data representing an input instance of a database including multiple tuples into multiple fragments of tuples, detecting tuples which violate a data quality specification in respective ones of the fragments, selecting a data cleaning asset on the basis of characteristics of errors in detected tuples for a fragment and based on declared asset capabilities, assigning a selected data cleaning asset to the fragment, the selected data cleaning asset to provide a set of candidate corrections for the detected tuples in the fragment, and providing data representing an output instance of the database in which detected tuples are replaced with selected candidate corrections. |
Publication number |
US8805798(B2) |
Publication date |
2014.08.12 |
Application number |
US201213468938 |
Filing date |
2012.05.10 |
Applicant |
Qatar Foundation |
Inventors |
Kaldas Ihab Francis Ilyas;Beskales George;Elmagarmid Ahmed |
Classification |
G06F7/02;G06F17/30 |
Main classification |
G06F7/02 |
Agency |
Mossman Kumar & Tyler PC |
Agent |
Mossman Kumar & Tyler PC |
Principal claim |
1. A computer-implemented method comprising:
partitioning data representing an input instance of a database including multiple tuples into multiple fragments of tuples;
detecting tuples which violate a data quality specification in respective ones of the fragments;
selecting multiple data cleaning assets on the basis of characteristics of errors in detected tuples for a fragment and based on declared asset capabilities;
assigning multiple selected data cleaning assets to the fragment, the selected multiple data cleaning assets to provide sets of redundant candidate corrections for the detected tuples in the fragment;
selecting a candidate correction for a tuple with a relatively higher confidence measure from measures for the candidate corrections in the redundant sets, and wherein a confidence measure includes a measure representing a majority vote for a tuple from multiple candidate corrections for the tuple from the redundant sets; and
providing data representing an output instance of the database in which detected tuples are replaced with selected candidate corrections. |
地址 |
Doha QA |
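
The steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation. Every name in it is hypothetical: the rule "zip determines city" standing in for the data quality specification, the `fd_violation` error type, the `ASSETS` list with its declared capabilities, and the use of the vote share as the confidence measure are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input instance: each dict is one tuple of the database.
ROWS = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "New York"},
    {"id": 3, "zip": "10001", "city": "Nwe York"},
    {"id": 4, "zip": "60601", "city": "Chicago"},
]

# Hypothetical reference table used by one of the cleaning assets.
REFERENCE = {"10001": "New York", "60601": "Chicago"}

def partition(rows, key):
    """Split the input instance into fragments of tuples sharing a key value."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[key]].append(row)
    return list(fragments.values())

def detect(fragment):
    """Assumed rule: zip determines city. If a fragment holds more than one
    city, every tuple in it takes part in a violation and is flagged."""
    cities = {row["city"] for row in fragment}
    return {row["id"] for row in fragment} if len(cities) > 1 else set()

# Hypothetical cleaning assets. Each declares the error types it can handle
# and proposes a candidate city for a flagged tuple.
ASSETS = [
    {"name": "frequency", "capabilities": {"fd_violation"},
     "repair": lambda row, frag: Counter(r["city"] for r in frag).most_common(1)[0][0]},
    {"name": "lookup", "capabilities": {"fd_violation"},
     "repair": lambda row, frag: REFERENCE.get(row["zip"], row["city"])},
    {"name": "keep_as_is", "capabilities": {"fd_violation"},
     "repair": lambda row, frag: row["city"]},
    {"name": "date_fixer", "capabilities": {"date_format"},
     "repair": lambda row, frag: row["city"]},
]

def select_assets(error_type, assets):
    """Pick every asset whose declared capabilities cover the error type."""
    return [a for a in assets if error_type in a["capabilities"]]

def vote(candidates):
    """Majority vote over redundant candidates; confidence is the vote share."""
    value, count = Counter(candidates).most_common(1)[0]
    return value, count / len(candidates)

def clean(rows):
    """Produce an output instance with flagged tuples replaced by the
    candidate correction that wins the vote."""
    output = []
    for fragment in partition(rows, "zip"):
        flagged = detect(fragment)
        chosen = select_assets("fd_violation", ASSETS) if flagged else []
        for row in fragment:
            if row["id"] in flagged:
                candidates = [a["repair"](row, fragment) for a in chosen]
                best, _confidence = vote(candidates)
                output.append({**row, "city": best})
            else:
                output.append(row)
    return output

cleaned = clean(ROWS)
```

In this sketch the `date_fixer` asset is never assigned because its declared capabilities do not cover the detected error type. For tuple 3, two of the three assigned assets propose the same city, so that candidate wins the vote over the unchanged value.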