Title of invention Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
Abstract Techniques for deduplicating structured datasets using hybrid chunking and header removal are described. According to one embodiment, a request is received to deduplicate a file having a plurality of data blocks, each data block having a header and a data portion. The data blocks are anchored using first anchors to indicate block boundaries based on their headers. At least one second anchor is added within the data portion of at least one data block if that data portion satisfies a predetermined condition. The data blocks are then deduplicated based on the first and second anchors.
Publication number US9183218(B1) Publication date 2015.11.10
Application number US201213538964 Filing date 2012.06.29
Applicant EMC Corporation Inventors Wallace Grant R.;Duggal Abhinav
Classification G06F7/00;G06F17/00;G06F17/30;G06F13/14 Main classification G06F7/00
Agency Blakely, Sokoloff, Taylor & Zafman LLP Agent Blakely, Sokoloff, Taylor & Zafman LLP
Main claim 1. A computer-implemented method, comprising: receiving a request at a system to deduplicate a file having a plurality of data blocks, each data block having a header and a data portion, wherein the file is received from a client application of a client device over a network to be stored in the system; scanning to search a predetermined signature embedded within a header of each data block to identify a block boundary between the header and the data portion; anchoring the data blocks using first anchors to indicate block boundaries based on the scanning of the predetermined signature, including recognizing a plurality of markers within the data portions of the data blocks, wherein the markers were inserted into the data blocks by the client application prior to receiving the file over the network, removing the recognized markers from the file, and anchoring the data blocks using the first anchors at locations of the removed markers, wherein an anchor denotes a boundary between two data blocks; adding at least one second anchor within a data portion of at least one data block that has been anchored by two of the first anchors, if the data portion of at least one data block satisfies a predetermined condition, wherein the second anchor is located between two first anchors; separating data portions of the data blocks from the headers based on the first anchors; chunking the data portion of the data blocks based on the first anchors and the at least one second anchor, generating a plurality of data chunks; and deduplicating the data chunks of the data portions of the data blocks.
Address Hopkinton MA US
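
The steps recited in the main claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The signature bytes, the fixed header length, the marker bytes, and the size threshold used as the "predetermined condition" are all hypothetical values chosen for illustration, and the second anchors are placed at fixed-size intervals only as a simple stand-in for whatever rule an actual embodiment uses.

```python
import hashlib

# All four constants are assumptions made for this sketch, not values from the patent.
SIGNATURE = b"BLKH"      # hypothetical signature at the start of each block header
HEADER_LEN = 8           # hypothetical fixed header length, signature included
MARKER = b"\x00MRK\x00"  # hypothetical marker inserted by the client application
MAX_PORTION = 16         # hypothetical condition: data portion longer than this


def strip_markers(data):
    """Remove client markers; return the cleaned bytes and the marker offsets."""
    parts = data.split(MARKER)
    offsets, pos = [], 0
    for part in parts[:-1]:
        pos += len(part)
        offsets.append(pos)  # offset in the cleaned data where a marker stood
    return b"".join(parts), offsets


def find_header_offsets(data):
    """Scan for the header signature; each hit is a block boundary."""
    offsets = []
    pos = data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + HEADER_LEN)
    return offsets


def hybrid_chunks(data):
    """Chunk data portions using first anchors plus optional second anchors."""
    clean, marker_offsets = strip_markers(data)
    # First anchors: header boundaries and locations of removed markers.
    first = set(find_header_offsets(clean)) | set(marker_offsets)
    anchors = sorted(first | {0, len(clean)})
    chunks = []
    for start, end in zip(anchors, anchors[1:]):
        segment = clean[start:end]
        # Separate the data portion from the header, if this segment has one.
        if segment.startswith(SIGNATURE):
            portion = segment[HEADER_LEN:]
        else:
            portion = segment
        if len(portion) > MAX_PORTION:
            # Condition met: add second anchors between the two first anchors.
            cuts = range(0, len(portion), MAX_PORTION)
            chunks.extend(portion[i:i + MAX_PORTION] for i in cuts)
        elif portion:
            chunks.append(portion)
    return chunks


def deduplicate(chunks):
    """Store each distinct chunk once, keyed by its SHA-256 fingerprint."""
    store, recipe = {}, []
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def make_header(seq):
    """Build a hypothetical header: signature plus a 4-byte sequence number."""
    return SIGNATURE + seq.to_bytes(4, "big")
```

In this sketch, two blocks whose headers differ only in a sequence number but whose data portions are identical yield the same chunk fingerprint, because the headers are separated out before chunking. A block-unaware chunker would see the differing header bytes inside the chunk and miss that match.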