Title: Self-analyzing data processing job to determine data quality issues
Abstract: Techniques are disclosed to determine data quality issues in data processing jobs. A data processing job is received that specifies one or more processing steps designed based on one or more data schemas and that further specifies one or more desired quality metrics to measure at those processing steps. One or more state machines are provided, generated based on the quality metrics and on the data schemas. Input data to the data processing job is processed using the one or more state machines in order to generate output data and a set of data quality records characterizing a set of data quality issues identified during execution of the data processing job.
Publication No.: US9576037(B2) | Publication date: 2017.02.21
Application No.: US201414224864 | Filing date: 2014.03.25
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION | Inventors: Li Jeff J.; Li Yong
Classification: G06F7/00; G06F17/00; G06F17/30 | Main classification: G06F7/00
Agency: Patterson + Sheridan, LLP | Agent: Patterson + Sheridan, LLP
Main claim: 1. A computer-implemented method to determine data quality issues in extract, transform, and load (ETL) jobs, based on quality metrics, the computer-implemented method comprising: receiving a data processing job comprising an ETL job specifying one or more processing steps designed based on one or more data schemas including an input schema and an output schema and further specifying one or more desired quality metrics to measure at the one or more processing steps, wherein the one or more processing steps specify the input schema and the output schema and are configured to perform a desired data transformation; providing one or more state machines generated based on the quality metrics and on the data schemas, wherein each state machine corresponds to a respective processing step and has: (i) a respective plurality of nodes representing elements of a predefined markup language and (ii) transitions between the nodes based on incoming events of the predefined markup language, wherein the one or more events are processed by an ETL engine when executing the ETL job; wherein at least a first of the one or more state machines represents at least two markup language components selected from a markup element, a markup attribute, a derived element, and a derived attribute; and during execution of the ETL job, processing input data for the ETL job by operation of one or more computer processors and using the one or more state machines, in order to generate: (i) output data from executing the ETL job and (ii) a set of data quality records characterizing a set of hierarchical data quality issues pertaining to the one or more desired quality metrics and identified during execution of the ETL job; whereafter the generated set of data quality records is output; wherein the ETL job is configured to perform: (i) self-analysis in order to generate a measure of a quality of data generated by the one or more processing steps and (ii) data lineage analysis in order to determine one or more factors potentially contributing to each data quality issue of the set of data quality issues; wherein the ETL job is selected from: (i) a hierarchical data composing job for generating hierarchical data from a plurality of input sources and via one or more composer steps; and (ii) a hierarchical parsing job including a plurality of parsing steps to parse a plurality of portions of incoming hierarchical data.
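Illustrative sketch (not part of the patent text): the claim describes a state machine whose nodes represent markup elements, whose transitions are driven by incoming markup events, and which emits data quality records while the job runs. The Python below is a minimal sketch of that general idea under stated assumptions. The names QualityRecord, Node, and StepStateMachine, the "start"/"end" event kinds, and the two metrics "schema_conformance" and "completeness" are all hypothetical and are not taken from the patent. Markup attributes, derived elements, and data lineage analysis are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class QualityRecord:
    path: str    # hierarchical location of the issue, e.g. "/order/note"
    metric: str  # hypothetical metric name
    detail: str


@dataclass
class Node:
    # One state per schema element (hypothetical structure).
    name: str
    children: dict = field(default_factory=dict)  # child name -> Node
    required: set = field(default_factory=set)    # child names that must appear


class StepStateMachine:
    """Consumes start/end element events for one processing step."""

    def __init__(self, root):
        self.root = root
        self.stack = []    # (Node or None, set of child names seen so far)
        self.path = []     # element names from the root to the current node
        self.records = []  # data quality records collected so far

    def _location(self):
        return "/" + "/".join(self.path)

    def on_event(self, kind, name):
        if kind == "start":
            if not self.stack:
                node = self.root if name == self.root.name else None
            else:
                parent, seen = self.stack[-1]
                seen.add(name)
                # Transition to the child state if the schema defines one.
                node = parent.children.get(name) if parent else None
            self.path.append(name)
            if node is None:
                self.records.append(QualityRecord(
                    self._location(), "schema_conformance",
                    f"unexpected element '{name}'"))
            self.stack.append((node, set()))
        elif kind == "end":
            node, seen = self.stack.pop()
            if node is not None:
                for missing in sorted(node.required - seen):
                    self.records.append(QualityRecord(
                        self._location(), "completeness",
                        f"missing required child '{missing}'"))
            self.path.pop()


# Hypothetical schema: <order> must contain <id> and <item>.
schema = Node("order",
              children={"id": Node("id"), "item": Node("item")},
              required={"id", "item"})

sm = StepStateMachine(schema)
events = [("start", "order"), ("start", "id"), ("end", "id"),
          ("start", "note"), ("end", "note"), ("end", "order")]
for kind, name in events:
    sm.on_event(kind, name)

for record in sm.records:
    print(record)
```

In this sketch the quality records are collected as a side effect of the same event stream that would drive the transformation, which is one plausible reading of the "self-analysis" described in the claim. An actual ETL engine would attach such a machine to each processing step.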
Address: Armonk, NY, US