发明名称 Sub-block partitioning for hash-based deduplication
摘要 Sub-block partitioning for hash-based deduplication is performed by defining a minimal size and maximum size of the sub-block. For each boundary start position of the sub-block, starting a search, after the minimal size of the sub-block, for a boundary position of a subsequent sub-block by using multiple search criteria to test hash values that are calculated during the search. If one of the multiple search criteria is satisfied by one of the hash values, declaring the position of the hash value as a boundary end position of the sub-block. If the maximum size of the sub-block is reached prior to satisfying one of the multiple search criteria, declaring a position of an alternative one of the hash values that is selected based upon another one of the multiple search criteria as the boundary end position of the sub-block.
申请公布号 US9164688(B2) 申请公布日期 2015.10.20
申请号 US201213541009 申请日期 2012.07.03
申请人 INTERNATIONAL BUSINESS MACHINES CORPORATION 发明人 Aronovich Lior;Hirsch Michael
分类号 G06F17/30;G06F7/02;G06F3/06 主分类号 G06F17/30
代理机构 Griffiths & Seaton PLLC 代理人 Griffiths & Seaton PLLC
主权项 1. A method for sub-block partitioning for hash-based deduplication by a processor device in a computing environment, the method comprising: defining a minimal size and maximum size of the sub-blocks; for each boundary start position of a sub-block, starting a search for a boundary position of the sub-block after the minimal size of the sub-block by using a plurality of search criteria to test a plurality of hash values that are generated during the search, wherein the plurality of hash values includes at least individual hash values and derived hash values that are derived from sets of underlying hash values; if one of the plurality of search criteria is not satisfied by one of the plurality of hash values, incrementing the current position of one of the plurality of hash values by at least one byte, and if the maximum size of the sub-block is not reached prior to satisfying one of the plurality of search criteria, calculating a next one of the plurality of hash values to test using the one of the plurality of search criteria; if one of the plurality of search criteria is satisfied by one of the plurality of hash values, declaring a position of the one of the plurality of hash values as a boundary position of the sub-block; and if the maximum size of the sub-block is reached prior to satisfying one of the plurality of search criteria, declaring a position of an alternative one of the plurality of hash values that is selected based upon an alternative one of the plurality of search criteria as the boundary position of the sub-block; wherein the one of the plurality of search criteria includes one of a first type search criteria and a second type search criteria, and the alternative one of the plurality of search criteria is a third type search criteria, and further wherein the first type search criteria is satisfied if n bits at predefined positions of a last calculated hash value are equal to one of an mth predefined different patterns of bits; andsatisfying the second type search criteria if n bits at predefined positions of a value calculated by applying a XOR operation on last calculated k hash values are equal to one of an mth predefined different patterns of bits, and applying one of a plurality of operations that combines values of the last calculated k hash values and produces alternative values whose statistical distribution being the same as the statistical distribution of the plurality of hash values being combined.
地址 Armonk NY US