Title of Invention Table Based Data Set Extraction From Data Clusters
Abstract A computer system and computer-implemented method for extracting a data set from data clusters that comprise rows and columns of heterogeneous data values. A plurality of random data groups is selected, each comprising at least one of a plurality of contiguous rows or a plurality of contiguous columns of data values. Each data value has a data type. A table template type is identified based on detection of a pattern between the data cells of the contiguous rows or columns. A table template header is identified that comprises a starting position, an ending position and a width. A reference row or reference column indicating the start of a table body is determined. The data cells of the subsequent rows or columns in the table body are compared to the data cells of the reference row or column to identify noise rows or columns, which are removed from the table body.
Publication No. US2016292240(A1) Publication Date 2016.10.06
Application No. US201514656557 Filing Date 2015.03.31
Applicant Informatica LLC Inventors Diwan Saurabh;P.J. Shivananda
Classification G06F17/30 Main Classification G06F17/30
Agency Agent
Principal Claim 1. A method for extracting a data set from data clusters, the method comprising: accessing a data cluster from a database, the data cluster comprising rows and columns of heterogeneous data values in a data structure including one or more comments, titles, numeric data and string data; selecting a plurality of random data groups in the data cluster, wherein a data group comprises at least one of a plurality of contiguous rows of data values or a plurality of contiguous columns of data values in the data cluster, wherein each data value has a data type; detecting patterns of changes in data values between contiguous rows or contiguous columns in the selected data groups; identifying a table template type based on the detected patterns of changes in data values; identifying a table template header comprising a starting position, an ending position and a width for the table template type; determining a reference row or reference column based on the ending position of the table template header and a subsequent row or column indicating a start of a table body; comparing the subsequent rows or columns with the reference rows or reference columns of the data cluster to identify noise rows or columns; removing the noise columns or rows from the table body; extracting a data set from the table body; and storing the extracted data set to the database.
Address Redwood City CA US
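
The header, reference-row and noise-removal steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a row-oriented cluster held as lists of strings, it skips the random data group sampling and the table template type classification, and it treats a row as noise when its per-cell data types differ from those of the reference row. The names `cell_type`, `signature`, `find_header` and `extract_table` are hypothetical and do not come from the record.

```python
from typing import List, Optional, Tuple

Row = List[str]


def cell_type(value: str) -> str:
    """Classify a cell as 'empty', 'numeric' or 'string' (assumed type set)."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        float(text)
        return "numeric"
    except ValueError:
        return "string"


def span(row: Row, start: int, end: int) -> Row:
    """Return the cells in columns start..end, padding short rows with ''."""
    padded = row + [""] * max(0, end + 1 - len(row))
    return padded[start:end + 1]


def signature(row: Row, start: int, end: int) -> Tuple[str, ...]:
    """Data types of the cells that fall under the header span."""
    return tuple(cell_type(cell) for cell in span(row, start, end))


def find_header(rows: List[Row]) -> Optional[Tuple[int, int, int]]:
    """Assumed heuristic: the header is the first row with at least two
    contiguous non-empty cells that are all strings.
    Returns (row index, starting position, ending position)."""
    for idx, row in enumerate(rows):
        filled = [i for i, cell in enumerate(row) if cell_type(cell) != "empty"]
        if len(filled) < 2:
            continue
        start, end = filled[0], filled[-1]
        contiguous = len(filled) == end - start + 1
        if contiguous and all(cell_type(row[i]) == "string" for i in filled):
            return idx, start, end
    return None


def extract_table(rows: List[Row]) -> List[Row]:
    """Drop rows whose type signature differs from the reference row."""
    found = find_header(rows)
    if found is None:
        return []
    header_idx, start, end = found
    body = rows[header_idx + 1:]
    # Reference row: first row after the header with no empty cell in the span.
    reference = next(
        (signature(r, start, end) for r in body
         if "empty" not in signature(r, start, end)),
        None,
    )
    if reference is None:
        return []
    return [span(r, start, end) for r in body
            if signature(r, start, end) == reference]


sample = [
    ["Quarterly sales report", "", ""],   # title
    ["", "", ""],
    ["Region", "Units", "Revenue"],       # header: start 0, end 2, width 3
    ["North", "10", "250.5"],             # reference row
    ["# provisional figures", "", ""],    # comment, treated as noise
    ["South", "7", "180"],
    ["Total", "", ""],                    # noise
]
table = extract_table(sample)
```

In this sketch the header width is simply `end - start + 1`. A column-oriented cluster could be handled by transposing the input before calling `extract_table`, though the record does not say how the claimed method treats that case.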