Title of invention |
Efficient Indexing of Documents with Similar Content |
Abstract |
A computer system, comprising one or more processors and memory, groups a set of documents into a plurality of clusters. Each cluster includes one or more documents of the set, and a respective cluster of the plurality of clusters includes respective cluster data corresponding to a plurality of documents, including a first document and a second document. The computer system determines that the second document includes duplicate data that duplicates corresponding data in the first document, identifies a respective subset of the respective cluster data that excludes at least a subset of the duplicate data, and generates an index of that respective subset of the respective cluster data. |
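The three steps in the abstract (cluster, detect duplicate data, index only the non-duplicate subset) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names `cluster_by_key` and `index_cluster` are hypothetical, "duplicate data" is approximated as an exact repeated line within a cluster, and the index is a plain in-memory inverted index.

```python
from collections import defaultdict


def cluster_by_key(docs, key_fn):
    """Group document ids into clusters by a caller-supplied key.

    Assumption: the abstract does not say how clusters are formed,
    so the grouping key is left to the caller.
    """
    clusters = defaultdict(list)
    for doc_id in docs:
        clusters[key_fn(doc_id)].append(doc_id)
    return dict(clusters)


def index_cluster(docs, doc_ids):
    """Index one cluster while skipping data already seen in it.

    Assumption: a line is "duplicate data" if an identical line
    appeared earlier in the same cluster. Returns the inverted index
    (term -> set of doc ids) and the skipped lines per document.
    """
    seen_lines = set()
    index = defaultdict(set)
    duplicates = defaultdict(list)
    for doc_id in doc_ids:
        for raw_line in docs[doc_id].splitlines():
            line = raw_line.strip()
            if not line:
                continue
            if line in seen_lines:
                # Exclude the duplicate data from the indexed subset.
                duplicates[doc_id].append(line)
                continue
            seen_lines.add(line)
            for term in line.lower().split():
                index[term].add(doc_id)
    return dict(index), dict(duplicates)


if __name__ == "__main__":
    docs = {
        "a.com/v1": "hello world\nshared footer",
        "a.com/v2": "hello there\nshared footer",
        "b.com/x": "other text",
    }
    clusters = cluster_by_key(docs, lambda doc_id: doc_id.split("/")[0])
    index, duplicates = index_cluster(docs, clusters["a.com"])
    print(sorted(index["footer"]))
    print(duplicates)
```

In this sketch the repeated footer line is indexed once, under the first document that contains it. A real system would also need some way to map a hit on shared data back to every document in the cluster that contains it; the abstract does not describe that mechanism, so it is left out here.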
Publication number |
US2012303622(A1) |
Publication date |
2012.11.29 |
Application number |
US201213571316 |
Filing date |
2012.08.09 |
Applicant |
DEAN JEFFREY A.;GHEMAWAT SANJAY;THAMBIDORAI GAUTHAM |
Inventor |
DEAN JEFFREY A.;GHEMAWAT SANJAY;THAMBIDORAI GAUTHAM |
Classification |
G06F17/30 |
Main classification |
G06F17/30 |
Agency |

Agent |

Principal claim |

Address |
|