Title of invention: METHOD AND SYSTEM FOR LARGE SCALE DATA CURATION
Abstract: An end-to-end data curation system and the various methods used in linking, matching, and cleaning large-scale data sources. The goal of this system is to provide scalable and efficient record deduplication. The system uses a crowd of experts to train the system. The system operator can optionally provide a set of hints to reduce the number of questions sent to the experts. The system solves the problem of schema mapping and record deduplication in a holistic way by unifying these problems into a unified linkage problem.
Publication number: US2017075918(A1) Publication date: 2017.03.16
Application number: US201615359795 Filing date: 2016.11.23
Applicant: Tamr, Inc Inventors: Bates-Haus Nikolaus; Beskales George; Bruckner Daniel Meir; Ilyas Ihab F.; Pagan Alexander Richter; Stonebraker Michael Ralph
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Principal claim: 1. A computer implemented data integration method for performing record deduplication and schema mapping comprising: abstracting rows/records from one or more database storage sources into a first set of objects in computer memory; abstracting columns/fields/attributes from said database storage sources into a second set of objects in computer memory; and iteratively performing object linkage on said first set of objects and said second set of objects, wherein said object linkage performed on said first set of objects performs the task of said record deduplication, and wherein said object linkage performed on said second set of objects performs the task of said schema mapping.
Address: Cambridge, MA, US
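
The principal claim treats record deduplication and schema mapping as one object-linkage problem applied to two sets of objects. The following is a minimal illustrative sketch of that idea in Python, not the patented implementation. Everything in it is an assumption made for illustration: the names `jaccard`, `link`, and `tokens`, the toy `sources` data, the use of Jaccard token overlap as the similarity measure, union-find clustering, and the threshold values. It also runs a single linkage pass per object set, whereas the claim describes iterative linkage, and it omits the expert-crowd training and operator hints mentioned in the abstract.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link(objects, threshold):
    """Cluster objects whose token sets are similar enough.

    `objects` maps an object id to a set of tokens. The same routine is
    applied to records and to columns, which is the point of the sketch.
    Returns a sorted list of sorted clusters of object ids.
    """
    parent = {key: key for key in objects}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(objects, 2):
        if jaccard(objects[a], objects[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for key in objects:
        clusters.setdefault(find(key), []).append(key)
    return sorted(sorted(members) for members in clusters.values())


def tokens(value):
    """Lowercased whitespace tokens of a field value."""
    return set(str(value).lower().split())


# Hypothetical toy data: two sources with different column names.
sources = {
    "crm": [{"name": "Acme Corp", "city": "Boston"},
            {"name": "Globex Inc", "city": "Springfield"}],
    "erp": [{"company": "Acme Corp", "town": "Boston"},
            {"company": "Initech LLC", "town": "Austin"}],
}

# First set of objects: one object per row/record.
record_objects = {}
for src, rows in sources.items():
    for i, row in enumerate(rows):
        record_objects[(src, i)] = set().union(*(tokens(v) for v in row.values()))

# Second set of objects: one object per column, described by its values.
column_objects = {}
for src, rows in sources.items():
    for row in rows:
        for col, val in row.items():
            column_objects.setdefault((src, col), set()).update(tokens(val))

# Linkage over columns plays the role of schema mapping;
# linkage over records plays the role of record deduplication.
schema_mapping = link(column_objects, threshold=0.3)
duplicates = link(record_objects, threshold=0.6)

print(schema_mapping)
print(duplicates)
```

In this toy run the column clusters pair `name` with `company` and `city` with `town`, and the record clusters group the two "Acme Corp" rows together. A fuller sketch would feed the column clusters back into record comparison and repeat until the clusters stop changing.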