Title of invention: System and method for near and exact de-duplication of documents
Abstract: A system, method, and computer program product for identifying near- and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near- and exact-duplicate documents are identified in the document collection.
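The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the tokenizer, the stop-word filter standing in for "user settings", the alphabetical tie-breaking among equally frequent words, and the relevancy score (the fraction of the N query words found in a candidate) are all assumptions made for illustration, and the names `tokenize`, `top_words`, and `find_duplicates` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text, stopwords=frozenset()):
    """Lowercase the text, split it into words, and drop filtered words.

    The stop-word set is an assumed stand-in for the "user settings" filter.
    """
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stopwords]


def top_words(text, n, stopwords=frozenset()):
    """Return the n most frequent words; ties are broken alphabetically."""
    counts = Counter(tokenize(text, stopwords))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def find_duplicates(doc_id, docs, n, m, stopwords=frozenset()):
    """Quorum search: keep every other document that contains at least m of
    the n most frequent words of docs[doc_id], sorted by relevancy.

    Relevancy here is an assumed score: matched words / query words.
    """
    query = top_words(docs[doc_id], n, stopwords)
    if not query:
        return []
    results = []
    for other_id, text in docs.items():
        if other_id == doc_id:
            continue
        vocabulary = set(tokenize(text, stopwords))
        hits = sum(1 for word in query if word in vocabulary)
        if hits >= m:
            results.append((other_id, hits / len(query)))
    # Highest relevancy first; document id breaks ties deterministically.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results
```

In this sketch, setting M equal to N requires a candidate to contain every one of the N words, which is the strictest setting the sketch offers, while a lower M tolerates differences and so admits near-duplicates. Sharing the top words does not by itself prove that two documents are identical.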
Publication number: US8250079(B2) Publication date: 2012.08.21
Application number: US201113075792 Filing date: 2011.03.30
Applicant: SCHOLTES JOHANNES C.;BLOEMBERGEN SIEBE;MSC INTELLECTUAL PROPERTIES B.V. Inventor: SCHOLTES JOHANNES C.;BLOEMBERGEN SIEBE
Classification: G06F7/00;G06F17/00;G06F17/30 Main classification: G06F7/00
Agency: Agent:
Main claim:
Address: