Title of invention: Efficient Indexing of Documents with Similar Content
Abstract: A computer system, comprising one or more processors and memory, groups a set of documents into a plurality of clusters. Each cluster includes one or more documents of the set of documents, and a respective cluster of the plurality of clusters includes respective cluster data corresponding to a plurality of documents including a first document and a second document. The computer system determines that the second document includes duplicate data that is duplicative of corresponding data in the first document, identifies a respective subset of the respective cluster data that excludes at least a subset of the duplicate data, and generates an index of the respective subset of the respective cluster data.
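The steps named in the abstract (cluster the documents, detect data in a later document that duplicates data in an earlier one, index only the non-duplicate subset) can be illustrated with a minimal Python sketch. It is not the method claimed in the application: the abstract does not say how clusters are formed or how duplicates are detected, so the sketch assumes a caller-supplied cluster key, treats a document as a list of text chunks, and counts only exact chunk matches as duplicates. All function names (`group_into_clusters`, `build_cluster_index`, `search`) are hypothetical.

```python
from collections import defaultdict


def group_into_clusters(documents, cluster_key):
    """Group documents into clusters.

    documents: dict mapping doc_id -> list of text chunks (assumed shape).
    cluster_key: hypothetical caller-supplied function doc_id -> cluster id.
    """
    clusters = defaultdict(dict)
    for doc_id, chunks in documents.items():
        clusters[cluster_key(doc_id)][doc_id] = chunks
    return dict(clusters)


def build_cluster_index(cluster):
    """Index one cluster, skipping chunks that duplicate an earlier chunk.

    Returns (index, chunk_docs):
      index:      term -> set of chunk ids that were actually indexed
      chunk_docs: chunk id -> set of doc ids containing that chunk
    """
    seen = {}                       # chunk text -> id of the indexed copy
    chunk_docs = defaultdict(set)
    index = defaultdict(set)
    for doc_id, chunks in cluster.items():
        for chunk in chunks:
            if chunk in seen:
                # Duplicate data: record membership, but do not index again.
                chunk_docs[seen[chunk]].add(doc_id)
                continue
            chunk_id = len(seen)
            seen[chunk] = chunk_id
            chunk_docs[chunk_id].add(doc_id)
            for term in chunk.lower().split():
                index[term].add(chunk_id)
    return index, chunk_docs


def search(index, chunk_docs, term):
    """Return the sorted doc ids whose chunks contain the term."""
    docs = set()
    for chunk_id in index.get(term.lower(), ()):
        docs.update(chunk_docs[chunk_id])
    return sorted(docs)
```

In this sketch a chunk shared by two documents is indexed once, and the `chunk_docs` mapping lets a query on the shared text still resolve to both documents. Calling `build_cluster_index` on each value returned by `group_into_clusters` yields one index per cluster.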
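Publication number: US2012303622(A1)  Publication date: 2012.11.29
Application number: US201213571316  Filing date: 2012.08.09
Applicant: DEAN JEFFREY A.; GHEMAWAT SANJAY; THAMBIDORAI GAUTHAM  Inventor: DEAN JEFFREY A.; GHEMAWAT SANJAY; THAMBIDORAI GAUTHAM
Classification: G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Principal claim:
Address: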