Title of invention |
System and method for near and exact de-duplication of documents |
Abstract |
A system, method, and computer program product for identifying near-duplicate and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near-duplicate and exact-duplicate documents are identified in the document collection.
|
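The steps named in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names (`top_words`, `quorum_search`, `find_duplicates`), the regex tokenizer, the stop-word set standing in for "user settings", and the use of the matched-word count as a stand-in for "relevancy" are all assumptions made for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")  # assumed tokenizer, not from the patent


def top_words(text, n, stopwords=frozenset()):
    """Return the n most frequent words after filtering (hypothetical helper)."""
    words = [w for w in TOKEN.findall(text.lower()) if w not in stopwords]
    counts = Counter(words)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def quorum_search(query_words, docs, m):
    """Return (doc_id, matched) for documents containing at least m query words.

    Results are sorted by the number of matched words, which is only an
    assumed proxy for the relevancy ranking mentioned in the abstract.
    """
    hits = []
    for doc_id, text in docs.items():
        vocab = set(TOKEN.findall(text.lower()))
        matched = sum(1 for w in query_words if w in vocab)
        if matched >= m:
            hits.append((doc_id, matched))
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits


def find_duplicates(docs, n, m, stopwords=frozenset()):
    """For each document, list the other documents that pass the quorum search."""
    result = {}
    for doc_id, text in docs.items():
        query = top_words(text, n, stopwords)
        if not query:  # nothing left after filtering: no candidates
            result[doc_id] = []
            continue
        threshold = min(m, len(query))  # a short document cannot reach m words
        result[doc_id] = [
            hit for hit in quorum_search(query, docs, threshold)
            if hit[0] != doc_id
        ]
    return result


docs = {
    "a": "the quick brown fox jumps over the lazy dog fox fox",
    "b": "the quick brown fox jumps over the lazy dog fox fox today",
    "c": "completely unrelated text about patents and filing dates",
}
duplicates = find_duplicates(docs, n=3, m=3)
```

In this sketch, setting M equal to N requires every one of the N frequent words to be present, which is the strictest setting, while a smaller M admits looser near-duplicates. A word-set match of this kind only nominates candidates; it does not by itself prove that two documents are identical.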
Publication number |
US8250079(B2) |
Publication date |
2012.08.21 |
Application number |
US201113075792 |
Filing date |
2011.03.30 |
Applicant |
SCHOLTES JOHANNES C.;BLOEMBERGEN SIEBE;MSC INTELLECTUAL PROPERTIES B.V. |
Inventor |
SCHOLTES JOHANNES C.;BLOEMBERGEN SIEBE |
Classification number |
G06F7/00;G06F17/00;G06F17/30 |
Main classification number |
G06F7/00 |
Agency |
|
Agent |
|
Principal claim |
|
Address |
|