Title: METHOD AND SYSTEM FOR LARGE SCALE DATA CURATION
Abstract: An end-to-end data curation system and the various methods used in linking, matching, and cleaning large-scale data sources. The goal of this system is to provide scalable and efficient record deduplication. The system is trained by a crowd of experts. The system operator can optionally provide a set of hints to reduce the number of questions sent to the experts. The system solves the problems of schema mapping and record deduplication in a holistic way by unifying them into a single linkage problem.
Publication number: US2015278241(A1) Publication date: 2015.10.01
Application number: US201414228546 Filing date: 2014.03.28
Applicant: DataTamer, Inc. Inventors: Bates-Haus Nikolaus;Beskales George;Bruckner Daniel Meir;Ilyas Ihab F.;Pagan Alexander Richter;Stonebraker Michael Ralph
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim: 1. A data integration method for performing the traditionally separate tasks of record deduplication and schema mapping comprising the steps of: abstracting rows/records from a data source into a first set of objects; abstracting columns/fields/attributes from said data source into a second set of objects; and iteratively performing object linkage on said first set of objects and said second set of objects, wherein said object linkage performed on said first set of objects performs the task of said record deduplication, and wherein said object linkage performed on said second set of objects performs the task of said schema mapping.
Address: Cambridge MA US
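
The main claim treats rows and columns as two sets of objects and applies the same linkage step to both. The following is a minimal Python sketch of that idea, not the patented implementation. Everything in it is an assumption made for illustration: the names `jaccard`, `link_objects`, and `unified_linkage`, the use of Jaccard similarity with fixed thresholds in place of a matcher trained by experts, the union-find clustering, and the sample `crm` and `erp` sources. Linked columns determine which record values are comparable, and linked records in turn add evidence for column matches, so the two linkages are repeated until neither mapping changes.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_objects(features, threshold):
    """Generic object linkage: cluster objects with similar feature sets.

    features: dict mapping object id -> set of features.
    Returns a dict mapping each object id -> its cluster representative.
    """
    parent = {oid: oid for oid in features}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sorted(features), 2):
        if jaccard(features[a], features[b]) >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
    return {oid: find(oid) for oid in features}


def unified_linkage(sources, attr_threshold=0.3, rec_threshold=0.5, max_rounds=5):
    """sources: dict mapping source name -> list of row dicts.

    Returns (attr_map, rec_map): column clusters (schema mapping) and
    row clusters (record deduplication).
    """
    # Flatten every cell into (record id, attribute id, normalized value).
    cells = [((s, i), (s, c), str(v).strip().lower())
             for s, rows in sources.items()
             for i, row in enumerate(rows)
             for c, v in row.items()]
    rec_map = {rec: rec for rec, _, _ in cells}
    attr_map = {attr: attr for _, attr, _ in cells}

    for _ in range(max_rounds):
        # Second set of objects: columns, described by the values they hold.
        # Values from already-linked records are also tagged with the entity,
        # so record linkage feeds back into schema mapping.
        sizes = Counter(rec_map.values())
        attr_feats = {}
        for rec, attr, val in cells:
            feats = attr_feats.setdefault(attr, set())
            feats.add(val)
            if sizes[rec_map[rec]] > 1:
                feats.add((rec_map[rec], val))
        new_attr_map = link_objects(attr_feats, attr_threshold)

        # First set of objects: rows, described by (mapped column, value).
        rec_feats = {}
        for rec, attr, val in cells:
            rec_feats.setdefault(rec, set()).add((new_attr_map[attr], val))
        new_rec_map = link_objects(rec_feats, rec_threshold)

        if new_attr_map == attr_map and new_rec_map == rec_map:
            break  # fixed point: neither linkage changed
        attr_map, rec_map = new_attr_map, new_rec_map
    return attr_map, rec_map


# Hypothetical sample data: two sources with different column names.
sources = {
    "crm": [{"name": "Ann Lee", "city": "Boston"},
            {"name": "Bob Ray", "city": "Austin"}],
    "erp": [{"full_name": "Ann Lee", "town": "Boston"},
            {"full_name": "Cy Dow", "town": "Denver"}],
}
attr_map, rec_map = unified_linkage(sources)
```

In this toy run the `erp` columns `full_name` and `town` are mapped onto the `crm` columns `name` and `city`, and the two "Ann Lee" rows fall into one cluster while the other rows stay separate. The pairwise comparison in `link_objects` is quadratic and only suits small inputs; a large-scale system would need blocking or indexing, which this sketch leaves out.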