Abstract |
A similarity module is arranged to identify whether a local data file is identical, in whole or in part, to existing files stored by a data hosting service, the local and existing files being hierarchically structured. The similarity module is configured to compare local file metadata with metadata of the existing files and to identify as candidate matches those existing files whose metadata matches the local file metadata to a predetermined extent. A local file checksum is then compared with the checksums of the candidate existing files and, if there is a match, the local file is identified as a duplicate of the matching candidate existing file. If there is no match, the module compares local segment checksums with existing segment checksums from the candidate existing files, wherein the segment checksums have been generated by semantically segmenting the local and existing files following the hierarchy to divide them vertically and horizontally into segments. If the checksums of a local segment and an existing file segment match, the local segment is identified as a duplicate segment. The similarity module may be located on a local computer or on the server. |
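The three-stage comparison described in the abstract (metadata filter, whole-file checksum, then segment checksums) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the source: the `FileRecord` type, the `metadata_score` and `find_duplicates` helpers, the use of SHA-256 as the checksum, the fraction-of-equal-fields metadata measure, and the 0.5 threshold standing in for the "predetermined extent". The semantic segmentation step is not implemented; segments are assumed to be already extracted and keyed by their path in the file hierarchy.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    """Checksum of a byte string (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class FileRecord:
    """Hypothetical record: metadata plus pre-segmented content.

    `segments` maps a segment's path in the hierarchy to its bytes.
    """
    name: str
    metadata: dict
    segments: dict

    @property
    def file_checksum(self) -> str:
        # Whole-file checksum over all segments in a stable (sorted) order.
        h = hashlib.sha256()
        for path in sorted(self.segments):
            h.update(path.encode("utf-8"))
            h.update(b"\x00")
            h.update(self.segments[path])
        return h.hexdigest()

    def segment_checksums(self) -> dict:
        return {path: checksum(data) for path, data in self.segments.items()}


def metadata_score(a: dict, b: dict) -> float:
    """Fraction of metadata fields (union of keys) with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    equal = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return equal / len(keys)


def find_duplicates(local: FileRecord, existing: list, threshold: float = 0.5) -> dict:
    # Stage 1: keep existing files whose metadata matches to the given extent.
    candidates = [
        f for f in existing
        if metadata_score(local.metadata, f.metadata) >= threshold
    ]

    # Stage 2: a whole-file checksum match marks the local file as a duplicate.
    local_sum = local.file_checksum
    for cand in candidates:
        if cand.file_checksum == local_sum:
            return {"duplicate_of": cand.name, "duplicate_segments": {}}

    # Stage 3: otherwise compare segment checksums against the candidates.
    known = {}
    for cand in candidates:
        for path, digest in cand.segment_checksums().items():
            known.setdefault(digest, (cand.name, path))

    duplicate_segments = {
        path: known[digest]
        for path, digest in local.segment_checksums().items()
        if digest in known
    }
    return {"duplicate_of": None, "duplicate_segments": duplicate_segments}
```

In this sketch, a local file that shares only some segments with a candidate yields `duplicate_of` set to `None` and a `duplicate_segments` mapping from each matching local segment path to the candidate file and segment it duplicates. Filtering by metadata first keeps the more expensive checksum comparisons limited to plausible candidates, which is the ordering the abstract describes.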