Title of invention |
Duplicate entry detection system and method |
Abstract |
A computer system and method for determining whether the subject matter described in a received document is substantially similar to the subject matter of other documents in a document corpus, such that the received document can be considered a duplicate document. After receiving a first document, a set of tokens for the first document is generated. A non-fielded relevance search on a token index is executed. The relevance search returns a set of candidate duplicate documents with scores corresponding to each candidate document. For each candidate document with a score above a threshold, filtering is performed on each candidate document to determine whether each candidate document is a true duplicate of the first document. A set of candidate documents with a score above the threshold that were not disqualified as candidate documents is then provided.
|
Publication number |
US8046372(B1) |
Publication date |
2011.10.25 |
Application number |
US20070754237 |
Filing date |
2007.05.25 |
Applicant |
AMAZON TECHNOLOGIES, INC. |
Inventors |
THIRUMALAI SRIKANTH;MANOHARAN ASWATH;TOMKO MARK J.;EMERY GRANT M.;MOHAN VIJAI;TERRA EGIDIO |
Classification |
G06F7/00;G06F17/30 |
Main classification |
G06F7/00 |
Agency |
|
Agent |
|
Main claim |
|
Address |
|
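The pipeline outlined in the abstract (tokenize the received document, run a non-fielded relevance search on a token index, keep candidates scoring above a threshold, then filter them) can be illustrated with a minimal Python sketch. This is not the patented implementation: the tokenization rule, the Jaccard overlap used as the relevance score, the default threshold of 0.5, and every function name below are assumptions chosen only for illustration, and the `is_true_duplicate` filter is a hypothetical placeholder for whatever disqualification logic the patent actually claims.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_token_index(corpus):
    """Map each token to the ids of the documents containing it.

    The index is non-fielded: tokens are not separated by document field.
    """
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def relevance_search(tokens, index, corpus_tokens):
    """Score every document sharing at least one token with the query.

    Jaccard overlap stands in for the relevance score (an assumption).
    """
    candidates = set()
    for token in tokens:
        candidates |= index.get(token, set())
    return {
        doc_id: len(tokens & corpus_tokens[doc_id]) / len(tokens | corpus_tokens[doc_id])
        for doc_id in candidates
    }


def find_duplicates(document, corpus, threshold=0.5, is_true_duplicate=None):
    """Return ids of candidates above the threshold that survive filtering."""
    index = build_token_index(corpus)
    corpus_tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    scores = relevance_search(tokenize(document), index, corpus_tokens)
    # Hypothetical filter hook; by default no candidate is disqualified.
    keep = is_true_duplicate or (lambda received, candidate: True)
    return sorted(
        doc_id
        for doc_id, score in scores.items()
        if score > threshold and keep(document, corpus[doc_id])
    )
```

Calling `find_duplicates(new_text, corpus)` with `corpus` as a dictionary of document id to text returns the sorted ids of the surviving candidates. Passing a stricter `is_true_duplicate` predicate narrows that list, which mirrors the abstract's separation of scoring from filtering.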