Title: SYSTEM AND METHOD FOR NEAR AND EXACT DE-DUPLICATION OF DOCUMENTS
Abstract: A system, method, and computer program product for identifying near- and exact-duplicate documents in a document collection. For each document in the collection, the method includes: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting the results of the quorum search by relevancy. Based on the values of N and M, near- and exact-duplicate documents are identified in the document collection.
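The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: the "user settings" filter is modeled as a stop-word set, a quorum search is taken to mean "match at least M of the N query words", and relevancy is approximated by the number of matched words. The names `tokenize`, `top_words`, `quorum_search`, and `find_duplicates` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text, stopwords=frozenset()):
    """Lowercase the text, split it into words, and drop stop words.
    (Assumption: the 'user settings' filter is a simple stop-word set.)"""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stopwords]


def top_words(text, n, stopwords=frozenset()):
    """Return the n most frequent words of the filtered text."""
    counts = Counter(tokenize(text, stopwords))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def quorum_search(query_words, vocab_by_doc, m):
    """Return (doc_id, matched) for every document containing at least m
    of the query words, sorted by matched count (a stand-in for relevancy)."""
    hits = []
    for doc_id, vocab in vocab_by_doc.items():
        matched = sum(1 for w in query_words if w in vocab)
        if matched >= m:
            hits.append((doc_id, matched))
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits


def find_duplicates(docs, n, m, stopwords=frozenset()):
    """For each document, list the other documents that pass the quorum test."""
    vocab_by_doc = {d: set(tokenize(t, stopwords)) for d, t in docs.items()}
    result = {}
    for doc_id, text in docs.items():
        query = top_words(text, n, stopwords)
        hits = quorum_search(query, vocab_by_doc, m)
        # A document always matches itself, so leave it out of its own results.
        result[doc_id] = [h for h in hits if h[0] != doc_id]
    return result
```

In this sketch, setting M equal to N requires every one of the N frequent words to be present, which is the strictest test and the one most likely to surface exact duplicates. Lowering M below N relaxes the test, so near duplicates are also returned. A match here only marks a candidate; confirming an exact duplicate would need a further comparison that the abstract does not describe.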
Publication number: US2011191354(A1)  Publication date: 2011.08.04
Application number: US201113075792  Filing date: 2011.03.30
Applicant: MSC INTELLECTUAL PROPERTIES B.V.  Inventors: SCHOLTES JOHANNES C.; BLOEMBERGEN SIEBE
Classification: G06F7/00; G06F17/30  Main classification: G06F7/00
Agency:  Agent:
Main claim:
Address: