Abstract |
Near-duplicate documents may be identified using techniques involving consistent uniform hashing. A biased bit may be placed in the leading position of a generated sequence of bits, and the sequence may subsequently be used in comparison techniques to determine near-duplicate documents. Unbiased bits may be used in the positions after the biased bit. Samples may be used collectively, rather than individually, in the generation of biased bits. Sequences of bits may thus be produced for multiple samples at once rather than for one sample at a time, thereby amortizing the cost of generating randomness across the samples. Less than one bit of randomness per sample may be used.
|
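The bit-sequence idea in the abstract can be illustrated with a minimal sketch. This is an interpretation, not the claimed method: the names `bit_sequence`, `signature`, and `similarity`, the use of SHA-256 as the consistent hash, the 1/16 bias, and the max-based min-hash style comparison are all assumptions made for illustration. The sketch gives each sample a sequence whose leading bit is rarely 1 and whose remaining bits are fair, so most samples lose a comparison on the first bit alone. It does not attempt the collective, amortized generation of biased bits across samples.

```python
import hashlib


def _hash64(key: str, salt: str) -> int:
    """Deterministic 64-bit value for (salt, key); equal inputs always agree."""
    digest = hashlib.sha256(f"{salt}\x00{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bit_sequence(key, salt="0", length=32, bias_num=1, bias_den=16):
    """Return a tuple of bits: a biased leading bit, then unbiased bits.

    The leading bit is 1 with probability roughly bias_num / bias_den
    (an assumed bias); every later bit is 1 with probability 1/2.
    """
    if not 1 <= length <= 65:
        raise ValueError("length must be between 1 and 65")
    threshold = (bias_num << 64) // bias_den
    lead = 1 if _hash64(key, salt + ":lead") < threshold else 0
    tail_word = _hash64(key, salt + ":tail")
    tail = [(tail_word >> (63 - i)) & 1 for i in range(length - 1)]
    return (lead, *tail)


def signature(samples, num_hashes=64):
    """For each salt, keep the lexicographically largest sequence.

    Tuples compare position by position, so a sample whose leading bit
    is 0 loses to any sample whose leading bit is 1 without looking at
    later bits. `samples` must be non-empty.
    """
    return [
        max(bit_sequence(s, salt=str(i)) for s in samples)
        for i in range(num_hashes)
    ]


def similarity(samples_a, samples_b, num_hashes=64):
    """Fraction of signature positions on which two documents agree."""
    sig_a = signature(samples_a, num_hashes)
    sig_b = signature(samples_b, num_hashes)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_hashes


# Hypothetical documents represented as sets of shingle strings.
doc_a = {f"shingle-{i}" for i in range(100)}
doc_b = {f"shingle-{i}" for i in range(10, 110)}
estimate = similarity(doc_a, doc_b)  # rough estimate of set overlap
```

Because every sample hashes to the same sequence regardless of which document contains it, two documents agree at a signature position when the winning sample lies in both, which is why the agreement fraction tracks their overlap. A real implementation following the abstract would presumably also generate the leading bits for groups of samples together to spend less than one random bit per sample; that step is left out here.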