Independent claim |
1. A method for extracting a data set from data clusters, the method comprising:
accessing a data cluster from a database, the data cluster comprising rows and columns of heterogeneous data values in a data structure including one or more comments, titles, numeric data and string data;
selecting a plurality of random data groups in the data cluster, wherein a data group comprises at least one of a plurality of contiguous rows of data values or a plurality of contiguous columns of data values in the data cluster, wherein each data value has a data type;
detecting patterns of changes in data values between contiguous rows or contiguous columns in the selected data groups;
identifying a table template type based on the detected patterns of changes in data values;
identifying a table template header comprising a starting position, an ending position and a width for the table template type;
determining a reference row or reference column based on the ending position of the table template header and a subsequent row or column indicating a start of a table body;
comparing the subsequent rows or columns with the reference rows or reference columns of the data cluster to identify noise rows or columns;
removing the noise rows or columns from the table body;
extracting a data set from the table body; and
storing the extracted data set to the database. |
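The header-detection, reference-row and noise-removal steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: it handles the row-oriented case only, scans rows sequentially instead of selecting random data groups, and omits template-type identification and database storage. The helper names (`cell_type`, `signature`, `find_header`, `extract_data_set`), the header heuristic (a row of at least two non-empty string cells followed by a row of equal width with a different type pattern) and the sample `cluster` are all assumptions made for illustration.

```python
def cell_type(value):
    """Classify a cell as 'empty', 'num' or 'str' (hypothetical three-type scheme)."""
    text = "" if value is None else str(value).strip()
    if not text:
        return "empty"
    try:
        float(text)
        return "num"
    except ValueError:
        return "str"


def signature(row):
    """Data-type pattern of a row, used to detect changes between contiguous rows."""
    return tuple(cell_type(v) for v in row)


def find_header(rows):
    """Index of the first row that looks like a header, or None.

    Assumed heuristic: at least two cells, all non-empty strings, followed by
    a row of the same width whose type pattern differs.
    """
    for i in range(len(rows) - 1):
        sig = signature(rows[i])
        nxt = signature(rows[i + 1])
        if (len(sig) >= 2 and all(t == "str" for t in sig)
                and len(nxt) == len(sig) and nxt != sig):
            return i
    return None


def extract_data_set(rows):
    """Return (header, body), dropping rows whose pattern differs from the reference row."""
    header_idx = find_header(rows)
    if header_idx is None:
        return [], []
    header = [str(v).strip() for v in rows[header_idx]]
    # Reference row: the first row after the header, taken as the start of the table body.
    reference = signature(rows[header_idx + 1])
    # Noise rows are those whose type pattern does not match the reference row.
    body = [list(r) for r in rows[header_idx + 1:] if signature(r) == reference]
    return header, body


# Invented sample cluster: a title, a comment, a header, data rows and one noise row.
cluster = [
    ["Quarterly sales report", "", ""],
    ["# generated by export tool", "", ""],
    ["region", "units", "revenue"],
    ["north", "12", "340.5"],
    ["south", "7", "198.0"],
    ["subtotal", "", ""],
    ["east", "9", "251.25"],
]
header, body = extract_data_set(cluster)
print(header)
print(body)
```

In this sample the title and comment rows are skipped because they contain empty cells, the `region` row is taken as the header, and the `subtotal` row is dropped as noise because its type pattern differs from that of the reference row.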