Title Data filtering and optimization for ETL (extract, transform, load) processes
Abstract A method and system are disclosed for use with an ETL (Extract, Transform, Load) process, comprising optimizing a filter expression to select a subset of data and evaluating the filter expression on the data after the extracting, before the loading, but not during the transforming of the ETL process. The method and system optimize the filtering using a pipelined evaluation for single-predicate filtering and an adaptive optimization for multiple-predicate filtering. The adaptive optimization includes an initial phase and a dynamic phase.
Publication No. US8744994(B2) Publication date 2014.06.03
Application No. US20080343021 Filing date 2008.12.23
Applicant International Business Machines Corporation Inventors Chen Ying;He Bin;Wang Rui
Classification G06F7/00;G06F17/30 Main classification G06F7/00
Agency Agent
Main claim 1. A computer-implemented method for use with an ETL (Extract, Transform, Load) process, comprising: operating a computer processor to optimize a predicate expression to select a subset of data; extracting the subset of data; filtering the subset of data; transforming the subset of data; loading the subset of data to a target; evaluating the predicate expression, depicted as a tree with a plurality of nodes, during the filtering of the subset of data at a time that occurs after the extracting of the subset of data, after the transforming of the subset of data, and before the loading of the subset of data to a target; storing, in each of the plurality of nodes, execution statistics of a plurality of child nodes of each of the plurality of nodes; adjusting an order of execution of the predicate expression after every n records, with n being a predetermined number of records, based on a recent execution statistic of the plurality of child nodes of the plurality of nodes and whether each of the plurality of nodes is an OR node or an AND node; and adjusting the order of execution of the predicate expression such that a child node of the plurality of child nodes with a lowest true rate is executed first, wherein the true rate corresponds to a percentage of time at which one of the plurality of nodes evaluates to true, wherein the execution statistics include the true rate.
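The reordering step recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names `Leaf` and `Node`, the window parameter `n`, the neutral rate of 0.5 for children that were never evaluated in a window, and the choice to run the highest-true-rate child first under an OR node are all assumptions made for illustration. The claim itself only states that the order depends on the node type and that the child with the lowest true rate is executed first, which is the behavior shown here for AND nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Record = Dict[str, object]


@dataclass
class Leaf:
    """A single predicate over one record (hypothetical helper)."""
    name: str
    test: Callable[[Record], bool]

    def evaluate(self, record: Record) -> bool:
        return self.test(record)


@dataclass
class Node:
    """AND/OR node that keeps per-child statistics and reorders every n records."""
    op: str                                   # "AND" or "OR"
    children: List[Union["Leaf", "Node"]]
    n: int = 100                              # reorder window (assumed parameter)
    seen: int = 0
    evals: List[int] = field(default_factory=list)
    trues: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._reset_stats()

    def _reset_stats(self) -> None:
        self.evals = [0] * len(self.children)
        self.trues = [0] * len(self.children)

    def evaluate(self, record: Record) -> bool:
        stop_value = self.op == "OR"          # OR stops on True, AND stops on False
        result = not stop_value
        for i, child in enumerate(self.children):
            value = child.evaluate(record)
            self.evals[i] += 1
            self.trues[i] += int(value)
            if value == stop_value:           # short-circuit
                result = stop_value
                break
        self.seen += 1
        if self.seen % self.n == 0:
            self._reorder()
        return result

    def _reorder(self) -> None:
        # True rate over the most recent window; 0.5 if a child was never reached
        # (an assumption, since short-circuiting can starve later children).
        rates = [t / e if e else 0.5 for t, e in zip(self.trues, self.evals)]
        # AND: lowest true rate first. OR: highest first (assumption of this sketch).
        order = sorted(range(len(self.children)),
                       key=lambda i: rates[i],
                       reverse=(self.op == "OR"))
        self.children = [self.children[i] for i in order]
        self._reset_stats()                   # keep only "recent" statistics


# Usage: a predicate that is almost always true starts first, then gets demoted.
common = Leaf("common", lambda r: r["x"] > 0)
rare = Leaf("rare", lambda r: r["x"] % 10 == 0)
root = Node("AND", [common, rare], n=10)
kept = [x for x in range(1, 11) if root.evaluate({"x": x})]
```

Under an AND node, running the child that is most often false first lets evaluation short-circuit as early as possible, which is why the rarely true predicate moves to the front once the first window of `n` records has been observed.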
Address Armonk NY US