Title of invention |
Filtering invalid tokens from a document using high IDF token filtering |
Abstract |
Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
|
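The filtering step in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. Several details are assumptions made for illustration only: the IDF threshold is taken from the selected field and then applied to tokens in every field of the document (the abstract does not say which tokens are compared against it), IDF is computed as log(N / df), and similarity is decided by a Jaccard overlap with a hypothetical cutoff `min_jaccard`. All function and parameter names are hypothetical.

```python
import math


def idf_table(corpus):
    """Map each token to log(N / df), where df counts the documents containing it."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for tok in set(tokens):
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def filter_high_idf(doc, field, idf, default_idf):
    """Return the document's tokens whose IDF does not exceed the threshold.

    doc maps field names to token lists. The threshold is the highest IDF
    found in the selected field. Assumption: it is applied to the tokens of
    all fields. An empty selected field removes nothing.
    """
    threshold = max(
        (idf.get(tok, default_idf) for tok in doc[field]),
        default=float("inf"),
    )
    return {
        tok
        for tokens in doc.values()
        for tok in tokens
        if idf.get(tok, default_idf) <= threshold
    }


def is_similar(tokens_a, tokens_b, min_jaccard=0.5):
    """Assumed similarity test: Jaccard overlap of the two token sets."""
    union = tokens_a | tokens_b
    if not union:
        return False
    return len(tokens_a & tokens_b) / len(union) >= min_jaccard


# Toy corpus: "zxq9" is rare, so it has a high IDF.
corpus = [["red", "shirt"], ["blue", "shirt"], ["red", "hat"], ["zxq9", "shirt"]]
idf = idf_table(corpus)
unseen_idf = math.log(len(corpus))  # assumed IDF for tokens absent from the corpus

first_doc = {"title": ["red", "shirt"], "description": ["red", "shirt", "zxq9"]}
second_doc_tokens = {"red", "shirt"}

kept = filter_high_idf(first_doc, "title", idf, unseen_idf)
similar = is_similar(kept, second_doc_tokens)
```

In this toy run the threshold comes from the `title` field, so the rare token `zxq9` in the description is dropped before the two documents are compared.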
Publication number |
US7908279(B1) |
Publication date |
2011.03.15 |
Application number |
US20070856581 |
Filing date |
2007.09.17 |
Applicant |
AMAZON TECHNOLOGIES, INC. |
Inventors |
THIRUMALAI SRIKANTH;MANOHARAN ASWATH;TOMKO MARK J.;EMERY GRANT M.;MOHAN VIJAI;TERRA EGIDIO |
Classification codes |
G06F7/00;G06F9/445;G06F17/21;G06F17/30 |
Main classification code |
G06F7/00 |
Patent agency |
|
Agent |
|
Principal claim |
|
Address |
|