Title of invention Method for storing a dataset
Abstract Sorting and storing a dataset, the dataset comprising at least one attribute. The method includes defining a set of data blocks and assigning to each data block a predefined maximum number of entries or a predefined maximum amount of storage, dividing the dataset into a sequence of multiple sub-datasets each having one value or a range of values of the attribute, wherein each pair of successive sub-datasets of the sequence are non-overlapping or overlapping at their respective extremum value of the attribute, for each sub-dataset of the multiple sub-datasets: in case the sub-dataset fully or partially fits into a data block of the defined data blocks storing the sub-dataset into at least the data block, the sub-dataset that partially fits into the data block comprising a number of entries that is smaller than a predefined maximum threshold.
Publication number US9442694(B1) Publication date 2016.09.13
Application number US201514944256 Filing date 2015.11.18
Applicant International Business Machines Corporation Inventors Boehme Thomas F.;Brodt Andreas;Hrle Namik;Schiller Oliver
Classification G06F17/30;G06F7/36;G06F11/14;G06F15/16 Main classification G06F17/30
Agency Agent Kelly L. Jeffrey;Kashef Mohammed
Main claim 1. A computer implemented method for sorting and storing a dataset, the dataset comprising rows, each row comprising a value associated with an attribute, the method comprising: defining a set of data blocks, each data block of the set of data blocks having a predefined maximum number of entries; defining a backup data block having a backup predefined maximum number of entries which is greater than the predefined maximum number of entries; randomly dividing the dataset into a sequence of multiple equally sized sub-datasets each comprising a different range of values associated with the attribute, wherein each pair of successive sub-datasets overlap at their respective extremum value of the attribute, and wherein each of the multiple sub-datasets comprises fewer rows than each of the predefined maximum number of entries; storing the values of each sub-dataset on a respective data block in an undefined order, wherein each value of each row is stored as a single entry on the respective data block; storing the values of each sub-dataset on the backup data block; storing subsequent sub-datasets on a subsequent data block, an attribute associated with each subsequent sub-dataset having a respective range of values overlapping at their respective extremum value immediately preceding or succeeding the previous range of values associated with the attribute of the sub-dataset; and creating for each data block an attribute value information indicating the range of values of the attribute stored on the data block for selectively processing at least part of the set of data blocks using the attribute value information.
Address Armonk NY US
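
The core idea of the main claim (splitting a dataset into sub-datasets by attribute range, storing each on its own data block, and keeping per-block range information so that blocks can be skipped during processing) can be illustrated with a small sketch. This is not the claimed method itself, only a simplified illustration under stated assumptions: the attribute values are integers, the split is done by sorting and cutting into fixed-size chunks rather than by the random division the claim recites, and the backup data block is omitted. The names `DataBlock`, `build_blocks`, `blocks_to_scan`, and `chunk_size` are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataBlock:
    """One data block plus its attribute value information (min/max)."""
    entries: List[int]   # attribute values; order inside the block is not relied upon
    min_value: int       # smallest attribute value stored on this block
    max_value: int       # largest attribute value stored on this block


def build_blocks(values: List[int], max_entries: int, chunk_size: int) -> List[DataBlock]:
    """Split values into consecutive attribute ranges, one sub-dataset per block.

    Assumption: each sub-dataset holds fewer rows than the block's maximum,
    so chunk_size must be strictly smaller than max_entries.
    """
    if not 0 < chunk_size < max_entries:
        raise ValueError("chunk_size must be positive and smaller than max_entries")
    ordered = sorted(values)
    blocks = []
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        # Duplicate values that straddle a cut make neighbouring ranges
        # overlap at their extremum value.
        blocks.append(DataBlock(chunk, chunk[0], chunk[-1]))
    return blocks


def blocks_to_scan(blocks: List[DataBlock], low: int, high: int) -> List[DataBlock]:
    """Return only the blocks whose stored range intersects [low, high]."""
    return [b for b in blocks if b.max_value >= low and b.min_value <= high]


if __name__ == "__main__":
    blocks = build_blocks([7, 3, 9, 3, 5, 1, 8, 5], max_entries=4, chunk_size=2)
    for b in blocks:
        print(b.min_value, b.max_value, b.entries)
    # Only blocks whose range can contain values between 6 and 8 are selected.
    print(len(blocks_to_scan(blocks, 6, 8)), "of", len(blocks), "blocks scanned")
```

In this sketch a range query consults only the stored minimum and maximum of each block, so blocks whose range cannot contain a match are never read. That corresponds to the "selectively processing at least part of the set of data blocks" step of the claim.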