Title of Invention: Schema Discovery Through Statistical Transduction
Abstract: A method, system, and computer program product derive a data schema for application to a data set. One or more processors generate a directed acyclic weighted graph that encodes data types and semantic types used by a data set. One or more processors assign estimated frequencies for each component of the directed acyclic weighted graph, where the estimated frequencies predict a likelihood of a particular data schema element being used by any data set. One or more processors traverse through paths in the directed acyclic weighted graph with a predetermined portion of the data set to determine a data schema that correctly defines data from the data set and identifies any errors in the data set, and then apply the data schema to the data set to generate clean data that is properly formatted.
Publication Number: US2016357747(A1) Publication Date: 2016.12.08
Application Number: US201514730287 Filing Date: 2015.06.04
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: Parthasarathy Srinivasan; Pavuluri Venkata N.; Turaga Deepak S.
Classification: G06F17/30; G06F17/27 Main Classification: G06F17/30
Agency: Agent:
Principal Claim: 1. A method of deriving data schema for application to a data set, the method comprising: generating, by one or more processors, a directed acyclic weighted graph that encodes data types and semantic types used by a data set; assigning, by one or more processors, estimated frequencies for each component of the directed acyclic weighted graph, wherein the estimated frequencies predict a likelihood of a particular data schema element being used by any data set; traversing, by one or more processors, through paths in the directed acyclic weighted graph with a predetermined portion of the data set to determine a data schema that correctly defines data from the data set and identifies any errors in the data set; and applying, by one or more processors, the data schema to the data set to generate clean data that is properly formatted.
Address: ARMONK NY US
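
Illustrative sketch (not part of the patent record): the four claimed steps (build a weighted type graph, assign estimated frequencies, traverse it with a sample of the data, apply the resulting schema) might look roughly like the Python below for a single column of string values. Everything specific here is an assumption made for illustration: the type names, the graph edges, the frequency values, the greedy traversal rule, the `min_match` threshold, and the function names `discover_type` and `apply_schema`. The record does not disclose the actual graph or scoring method.

```python
import re
from datetime import datetime


def parse_zip(s):
    # Hypothetical semantic type: a 5-digit US ZIP code kept as text.
    if not re.fullmatch(r"\d{5}", s):
        raise ValueError(s)
    return s


# Node -> (parser, assumed estimated frequency). The frequencies are
# made-up placeholders, not values taken from the patent.
NODES = {
    "string": (str, 0.30),
    "float": (float, 0.20),
    "integer": (int, 0.25),
    "date": (lambda s: datetime.strptime(s, "%Y-%m-%d").date(), 0.15),
    "zip_code": (parse_zip, 0.10),
}

# Directed acyclic graph: edges point from a general type to a more
# specific one. "zip_code" is reachable along two different paths.
EDGES = {
    "string": ["float", "date", "zip_code"],
    "float": ["integer"],
    "integer": ["zip_code"],
    "date": [],
    "zip_code": [],
}


def match_rate(node, sample):
    """Fraction of sample values that the node's parser accepts."""
    parser = NODES[node][0]
    ok = 0
    for value in sample:
        try:
            parser(value)
            ok += 1
        except ValueError:
            pass
    return ok / len(sample)


def discover_type(sample, min_match=0.8):
    """Greedy walk from the root: at each node, move to the child with
    the highest (match rate * frequency) among children that parse at
    least min_match of the sample; stop when no child qualifies."""
    if not sample:
        return "string"
    node = "string"
    while True:
        candidates = []
        for child in EDGES[node]:
            rate = match_rate(child, sample)
            if rate >= min_match:
                candidates.append((rate * NODES[child][1], child))
        if not candidates:
            return node
        node = max(candidates)[1]


def apply_schema(values, type_name):
    """Parse every value with the chosen type; collect failures as errors."""
    parser = NODES[type_name][0]
    clean, errors = [], []
    for index, value in enumerate(values):
        try:
            clean.append(parser(value))
        except ValueError:
            errors.append((index, value))
    return clean, errors


column = ["10", "25", "abc", "7", "42", "13", "8"]
column_type = discover_type(column[:6])  # first 6 values as the sample
clean, errors = apply_schema(column, column_type)
```

In this sketch the sample tolerates a minority of malformed values (here `"abc"`), so the column is still typed as a number and the bad value is reported as an error rather than forcing the whole column back to a plain string.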