Title of invention: Efficient indexing of documents with similar content
Abstract: A computer system, comprising one or more processors and memory, groups a set of documents into a plurality of clusters. Each cluster includes one or more documents of the set of documents, and a respective cluster of documents of the plurality of clusters includes respective cluster data corresponding to a plurality of documents including a first document and a second document. The computer system determines that the second document includes duplicate data that is duplicative of corresponding data in the first document, identifies a respective subset of the respective cluster data that excludes at least a subset of the duplicate data, and generates an index of the respective subset of the respective cluster data.
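Illustrative sketch (not part of the patent record): the flow described in the abstract can be approximated in a few lines of Python. The abstract does not say how clusters are formed, how duplicate data is detected, or how index hits are mapped back to individual documents, so everything below is an assumption made for illustration: the function names `group_into_clusters` and `build_cluster_index` are hypothetical, clusters are formed from a caller-supplied key, and "duplicate data" is taken to mean a line of text identical to one already seen earlier in the same cluster.

```python
from collections import defaultdict


def group_into_clusters(docs, key_fn):
    """Group documents into clusters.

    docs: dict mapping doc_id -> text.
    key_fn: hypothetical caller-supplied function mapping doc_id -> cluster key
            (the abstract does not specify how clusters are chosen).
    """
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        clusters[key_fn(doc_id)].append((doc_id, text))
    return dict(clusters)


def build_cluster_index(cluster):
    """Index one cluster while skipping data duplicated from earlier documents.

    Assumption: a line counts as duplicate data if an identical line was
    already seen in an earlier position within the same cluster.
    Returns (index, kept), where index maps term -> set of doc_ids and kept is
    the list of (doc_id, line) pairs that were actually indexed.
    """
    seen = set()
    kept = []
    for doc_id, text in cluster:
        for line in text.split("\n"):
            if line in seen:
                continue  # duplicate data: excluded from the indexed subset
            seen.add(line)
            kept.append((doc_id, line))

    index = defaultdict(set)
    for doc_id, line in kept:
        for term in line.lower().split():
            index[term].add(doc_id)
    return dict(index), kept


# Made-up example data: two versions of one page plus an unrelated page.
docs = {
    "a/v1": "hello world\nshared footer",
    "a/v2": "hello again\nshared footer",
    "b/v1": "other text",
}
clusters = group_into_clusters(docs, key_fn=lambda doc_id: doc_id.split("/")[0])
index_a, kept_a = build_cluster_index(clusters["a"])
```

In this sketch the line "shared footer" is indexed once, under the first document only. A real system would still need some way to report the second document for a query on "footer"; the abstract does not describe that step, so it is left out here.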
Publication number: US8554561(B2)  Publication date: 2013.10.08
Application number: US201213571316  Filing date: 2012.08.09
Applicant: DEAN JEFFREY A.;GHEMAWAT SANJAY;THAMBIDORAI GAUTHAM;GOOGLE INC.  Inventor: DEAN JEFFREY A.;GHEMAWAT SANJAY;THAMBIDORAI GAUTHAM
Classification: G10L15/06  Main classification: G10L15/06
Agency:  Agent:
Main claim:
Address: