Abstract |
<p>Methods and systems for efficiently determining a similarity between two or more datasets. In one embodiment, the similarity is determined by comparing a subset of sorted frequency-weighted blocks from one dataset to a subset of sorted frequency-weighted blocks from another dataset. Data blocks of a dataset are converted into hash values that are frequency-weighted. These frequency-weighted hash values can be compared to the frequency-weighted hash values of another dataset to determine the similarity of the two datasets. In another embodiment, when a block in a subset of the dataset changes, the similarity value is re-determined without re-sorting or re-hashing any blocks of the dataset other than the blocks of that subset, which improves the performance of the similarity comparison. In another embodiment, blocks of a dataset are excluded based on a block-filtering rule to increase the accuracy of the similarity comparison.</p> |
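The comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the block size, the hash function, the weighting scheme, the subset size, or the similarity formula, so everything below is an assumption chosen for illustration: fixed-size blocks, SHA-256 truncated to 64 bits, weighting by hashing each block together with its occurrence index, a subset made of the `k` smallest sorted values, and a Jaccard-style overlap score. The names `weighted_hashes`, `similarity`, and `keep` are hypothetical, and the incremental re-determination embodiment is not covered.

```python
import hashlib
from collections import Counter


def weighted_hashes(data: bytes, block_size: int = 4, keep=lambda block: True) -> list[int]:
    """Return the sorted frequency-weighted hash values of a dataset.

    Assumption: a block that occurs n times contributes n distinct values,
    one per occurrence index, so repeated blocks carry more weight.
    `keep` is a hypothetical block-filtering rule; blocks for which it
    returns False are excluded before hashing.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    counts = Counter(block for block in blocks if keep(block))
    values = []
    for block, count in counts.items():
        for occurrence in range(count):
            # Hash the block together with its occurrence index.
            digest = hashlib.sha256(block + occurrence.to_bytes(4, "big")).digest()
            values.append(int.from_bytes(digest[:8], "big"))
    return sorted(values)


def similarity(a: bytes, b: bytes, k: int = 16, **options) -> float:
    """Compare the k smallest weighted hash values of two datasets.

    Assumption: the score is the Jaccard overlap of the two subsets,
    ranging from 0.0 (nothing shared) to 1.0 (identical subsets).
    """
    subset_a = set(weighted_hashes(a, **options)[:k])
    subset_b = set(weighted_hashes(b, **options)[:k])
    if not subset_a and not subset_b:
        return 1.0  # two empty datasets are treated as identical
    return len(subset_a & subset_b) / len(subset_a | subset_b)


if __name__ == "__main__":
    zero_block = b"\x00" * 4
    first = b"abcd" + zero_block
    second = b"abcd" + zero_block * 3
    print(similarity(first, second))
    # Excluding all-zero padding blocks via the filtering rule.
    print(similarity(first, second, keep=lambda block: block != zero_block))
```

In this sketch, filtering out the all-zero padding blocks makes the two example datasets compare as identical, which is one way a block-filtering rule could increase the accuracy of the comparison.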