Abstract |
<p>The present invention relates to an apparatus and a method for detecting duplicated documents, which determine through clustering and hashing whether big data text is duplicated. The apparatus of the present invention comprises: a language distinction portion for distinguishing the language of a document extracted from a big data source; a first hashing portion for performing a first hashing calculation on a word list, which is extracted from a character string included in the document, and extracting a first hashing value; a cut syllable analyzing portion for producing groups of N (N is a positive integer) cut syllables by dividing the word list into consecutive syllables, extracting the M (M is a positive integer) cut syllables having the highest frequency among the cut syllables, and producing a top frequency list; a second hashing portion for performing a second hashing calculation on the top frequency list and extracting a second hashing value; and a duplicated document detecting portion for categorizing documents having the same second hashing value into the same cluster, determining whether documents included in the same cluster are duplicated, and determining a comparison target document to be a duplicated document if the difference between the first hashing value of the document and that of the comparison target document is less than or equal to K bits, where K is set in advance.</p> |
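The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the hash functions, so everything concrete here is an assumption: the first hash is taken to be a SimHash-style 64-bit fingerprint (chosen only because it makes a bit-difference threshold K meaningful), the second hash is SHA-1 over the top frequency list, "cut syllables" are approximated by character N-grams within each word, and the function names (`first_hash`, `top_frequency_list`, `second_hash`, `find_duplicates`) are hypothetical. The language distinction step is omitted.

```python
import hashlib
from collections import Counter, defaultdict


def _hash64(token):
    """Stable 64-bit hash of a token (assumption: any stable hash would do)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def first_hash(words):
    """SimHash-style 64-bit fingerprint of the word list (assumed scheme)."""
    weights = [0] * 64
    for word in words:
        h = _hash64(word)
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def top_frequency_list(words, n=2, m=3):
    """Return the M most frequent groups of N consecutive characters."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:m]]


def second_hash(top_list):
    """Exact-match hash of the top frequency list, used as the cluster key."""
    return hashlib.sha1("\x1f".join(top_list).encode("utf-8")).hexdigest()


def find_duplicates(docs, n=2, m=3, k=3):
    """Cluster by second hash, then compare first hashes within each cluster."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()
        key = second_hash(top_frequency_list(words, n, m))
        clusters[key].append((doc_id, first_hash(words)))

    duplicates = []
    for members in clusters.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                (id_a, hash_a), (id_b, hash_b) = members[i], members[j]
                # Number of differing bits between the two first hashing values.
                if bin(hash_a ^ hash_b).count("1") <= k:
                    duplicates.append((id_a, id_b))
    return duplicates
```

Under these assumptions, clustering by the second hash limits the pairwise comparison of first hashing values to documents that share the same top frequency list, which is what keeps the comparison step manageable for large collections.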