Abstract |
A method, system, and apparatus are provided for efficient storage of small files in a segment-based deduplication scheme by allocating multiple small files to a single data segment. Also provided is a mechanism for distinguishing between large files (e.g., files on the order of the size of a segment or larger) and smaller files, and for starting a new segment at the beginning of each large file. Further provided is a file attribute-based system for identifying a small file at which to begin a new segment, and then allocating subsequent small files to that segment and to contiguous segments until the next small file having an appropriate attribute is encountered, at which point a new segment begins. In one aspect of the present invention, a filename hash is used for the file attribute analysis to determine when a new segment should begin. Using such a mechanism, multiple small files can be allocated to a data segment while large files continue to be stored efficiently in separate data segments. The file attribute analysis further increases the deduplication rate for subsequently provided copies of the small files (e.g., in a backup), since segment boundaries remain constant despite file additions or deletions.
|
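The boundary rule described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the segment size, the use of SHA-1, the modulus test on the filename hash, and the names `SEGMENT_SIZE`, `BOUNDARY_MODULUS`, `starts_segment`, and `group_files` are all hypothetical choices made for this example. The sketch also assumes that small files following a large file join the same run of contiguous segments until the next boundary file appears.

```python
import hashlib
from typing import Iterable, List, Tuple

# Assumed values for illustration only; the abstract does not specify them.
SEGMENT_SIZE = 128 * 1024   # bytes per data segment (hypothetical)
BOUNDARY_MODULUS = 8        # roughly 1 in 8 small files starts a new segment


def starts_segment(name: str, size: int) -> bool:
    """Return True if this file should begin a new segment."""
    # Large files (on the order of a segment or larger) always start a segment.
    if size >= SEGMENT_SIZE:
        return True
    # Small files: decide from a hash of the filename alone, so the decision
    # does not depend on which other files are present in the stream.
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % BOUNDARY_MODULUS == 0


def group_files(files: Iterable[Tuple[str, int]]) -> List[List[str]]:
    """Group (name, size) pairs into runs that each begin at a segment boundary."""
    groups: List[List[str]] = []
    for name, size in files:
        # The very first file opens a run; afterwards only boundary files do.
        if not groups or starts_segment(name, size):
            groups.append([])
        groups[-1].append(name)
    return groups
```

Because each boundary decision depends only on the file's own name and size, adding or deleting a non-boundary file changes only the run that contains it. Every other run keeps the same contents and can deduplicate against an earlier backup.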