Title of invention APPARATUS AND METHOD FOR DETECTING DUPLICATED DOCUMENT OF BIG DATA TEXT USING CLUSTERING AND HASHING
Abstract <p>The present invention relates to an apparatus and a method for detecting duplicated documents in big data text through clustering and hashing. The apparatus comprises: a language identification portion for identifying the language of a document extracted from a big data source; a first hashing portion for performing a first hash calculation on a word list extracted from a character string included in the document and extracting a first hash value; a cut syllable analyzing portion for producing a group of N cut syllables (N is a positive integer) by dividing the word list into consecutive syllables, extracting the M cut syllables (M is a positive integer) having the highest frequency among them, and producing a top frequency list; a second hashing portion for performing a second hash calculation on the top frequency list and extracting a second hash value; and a duplicated document detecting portion for classifying documents having the same second hash value into the same cluster, determining whether documents within the same cluster are duplicates, and determining a comparison target document to be a duplicated document if its first hash value differs from that of the document being compared by no more than a preset number K of bits.</p>
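The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is one possible reading, not the patented implementation, and rests on several assumptions that the abstract does not confirm: the first hash is modeled as a SimHash-style 64-bit fingerprint (chosen only because the abstract compares first hash values by differing bits), each "cut syllable" is approximated by a run of N consecutive characters within a word, MD5 is used as an arbitrary underlying hash, and the language identification portion is omitted. All function names and the defaults for `n`, `m`, and `k` are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

HASH_BITS = 64  # assumed fingerprint width; the abstract does not state one


def _hash64(token: str) -> int:
    """Map a string to a 64-bit integer (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def first_hash(words):
    """First hash value: a SimHash-style fingerprint of the word list (assumption)."""
    weights = [0] * HASH_BITS
    for word in words:
        h = _hash64(word)
        for bit in range(HASH_BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(HASH_BITS) if weights[bit] > 0)


def top_frequency_list(words, n=2, m=5):
    """Return the M most frequent cut syllables (character n-grams as a stand-in)."""
    grams = Counter(w[i:i + n] for w in words for i in range(len(w) - n + 1))
    # Break frequency ties lexicographically so the result is deterministic.
    ranked = sorted(grams.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:m]]


def second_hash(top_list):
    """Second hash value: an exact hash of the top frequency list."""
    return _hash64("\x00".join(top_list))


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_duplicates(docs, n=2, m=5, k=3):
    """Cluster by second hash, then compare first hashes within each cluster."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()
        key = second_hash(top_frequency_list(words, n, m))
        clusters[key].append((doc_id, first_hash(words)))
    pairs = []
    for members in clusters.values():
        for i, (id_a, fp_a) in enumerate(members):
            for id_b, fp_b in members[i + 1:]:
                if hamming(fp_a, fp_b) <= k:  # differ by at most K bits
                    pairs.append((id_a, id_b))
    return pairs


if __name__ == "__main__":
    sample = {
        "a": "big data text duplicate detection big data text",
        "b": "big data text duplicate detection big data text",
        "c": "the quick brown fox jumps over the lazy dog",
    }
    print(find_duplicates(sample))
```

In this sketch the second hash acts as an exact-match cluster key, so the pairwise bit comparison of first hash values runs only inside each cluster rather than across the whole collection, which is the cost saving the abstract appears to aim for.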
Publication number KR101545273(B1) Publication date 2015.08.20
Application number KR20140177227 Filing date 2014.12.10
Applicant WISENUT CO., LTD. Inventors PARK, HO JIN;KWON, YOUNG HYUN;LEE, HYUN WOO;YUN, DO HYUN;LEE, MYUNG HYUN
Classification G06F17/21;G06F17/27 Main classification G06F17/21
Agency Agent
Principal claim
Address