APPARATUS AND METHOD FOR DETECTING DUPLICATED DOCUMENT OF BIG DATA TEXT USING CLUSTERING AND HASHING, Application No. KR20140177227

Title of invention	APPARATUS AND METHOD FOR DETECTING DUPLICATED DOCUMENT OF BIG DATA TEXT USING CLUSTERING AND HASHING
Abstract	The present invention relates to an apparatus and a method for detecting duplicated documents in big data text by means of clustering and hashing. The apparatus comprises: a language distinction portion that identifies the language of a document extracted from a big data source; a first hashing portion that performs a first hashing calculation on a word list extracted from a character string in the document, yielding a first hash value; a cut syllable analyzing portion that produces a group of cut syllables by dividing the word list into runs of N consecutive syllables (N is a positive integer), extracts the M cut syllables (M is a positive integer) with the highest frequency, and produces a top frequency list; a second hashing portion that performs a second hashing calculation on the top frequency list, yielding a second hash value; and a duplicated document detecting portion that assigns documents with the same second hash value to the same cluster, checks for duplication among the documents in that cluster, and determines a comparison target document to be a duplicate if its first hash value differs from that of the document being compared by no more than a preset number K of bits.
Publication number	KR101545273(B1)	Publication date	2015.08.20
Application number	KR20140177227	Filing date	2014.12.10
Applicant	WISENUT CO., LTD.	Inventors	PARK, HO JIN;KWON, YOUNG HYUN;LEE, HYUN WOO;YUN, DO HYUN;LEE, MYUNG HYUN
Classification	G06F17/21;G06F17/27	Main classification	G06F17/21
Agency		Agent
Main claim
Address
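
The pipeline described in the abstract (first hash, top frequency list of cut syllables, second hash, clustering, K-bit comparison) can be illustrated with a minimal Python sketch. The record does not name the hash functions, so several choices here are assumptions rather than the patented method: the first hash is a SimHash-style 64-bit fingerprint (chosen because the abstract compares first hash values by number of differing bits), the second hash is SHA-1 over the joined top frequency list, each character is treated as one syllable, and words are obtained by whitespace splitting. All function names and the default values of N, M, and K are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


def _hash64(token):
    """Stable 64-bit hash of a token (assumption: any stable hash would do)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def first_hash(words):
    """Assumed SimHash-style 64-bit fingerprint of the word list."""
    weights = [0] * 64
    for word in words:
        h = _hash64(word)
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def top_frequency_list(words, n=2, m=5):
    """The M most frequent chunks of N consecutive characters (one character = one syllable, assumed)."""
    counts = Counter(
        word[i:i + n] for word in words for i in range(len(word) - n + 1)
    )
    # Sort by descending count, then by chunk text, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [chunk for chunk, _ in ranked[:m]]


def second_hash(top_list):
    """Assumed second hash: SHA-1 of the joined top frequency list."""
    return hashlib.sha1("\x1f".join(top_list).encode("utf-8")).hexdigest()


def find_duplicates(docs, n=2, m=5, k=3):
    """Cluster documents by second hash, then pair those whose first hashes differ by <= k bits."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()  # assumed word extraction
        key = second_hash(top_frequency_list(words, n, m))
        clusters[key].append((doc_id, first_hash(words)))

    duplicates = []
    for members in clusters.values():
        for i, (id_a, fp_a) in enumerate(members):
            for id_b, fp_b in members[i + 1:]:
                # Hamming distance between the two first hash values.
                if bin(fp_a ^ fp_b).count("1") <= k:
                    duplicates.append((id_a, id_b))
    return duplicates
```

Clustering on the second hash means the bitwise comparison runs only within each cluster rather than across every pair of documents, which is presumably what makes the approach attractive for large collections.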