Title of invention Identifying similarities within large collections of unstructured data
Abstract A technique for determining when documents stored in digital format in a data processing system are similar. A method compares a sparse representation of two or more documents by breaking the documents into "chunks" of data of predefined sizes. Selected subsets of the chunks are determined as being representative of data in the documents and coefficients are developed to represent such chunks. Coefficients are then combined into coefficient clusters containing coefficients that are similar according to a predetermined similarity metric. The degree of similarity between documents is then evaluated by counting clusters into which chunks of similar documents fall.
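The pipeline outlined in the abstract (chunking, selecting representative chunks, deriving coefficients, clustering coefficients, counting shared clusters) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the chunk size, the "keep every second chunk" selection rule, the mean byte value used as a coefficient, the fixed-width bucketing used as a stand-in for the similarity metric, and the Jaccard-style ratio used as the final score. All function names are hypothetical.

```python
CHUNK_SIZE = 8        # assumed chunk size in bytes (the abstract only says "predefined")
BUCKET_WIDTH = 4.0    # assumed tolerance: coefficients in the same bucket count as similar


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Break a document into consecutive full-size chunks (a trailing partial chunk is dropped)."""
    return [data[i:i + size] for i in range(0, len(data) - size + 1, size)]


def select_representative(chunks: list[bytes], keep_every: int = 2) -> list[bytes]:
    """Keep a sparse subset of chunks. Taking every n-th chunk is a placeholder rule."""
    return chunks[::keep_every]


def coefficient(chunk: bytes) -> float:
    """Reduce a chunk to a single number. The mean byte value is a placeholder feature."""
    return sum(chunk) / len(chunk)


def cluster_id(coeff: float, width: float = BUCKET_WIDTH) -> int:
    """Assign a coefficient to a cluster by fixed-width bucketing (a stand-in metric)."""
    return int(coeff // width)


def cluster_set(data: bytes) -> set[int]:
    """Return the set of clusters that a document's representative chunks fall into."""
    selected = select_representative(split_into_chunks(data))
    return {cluster_id(coefficient(c)) for c in selected}


def similarity(doc_a: bytes, doc_b: bytes) -> float:
    """Score two documents by the share of clusters they have in common (Jaccard-style)."""
    clusters_a, clusters_b = cluster_set(doc_a), cluster_set(doc_b)
    if not clusters_a or not clusters_b:
        return 0.0
    return len(clusters_a & clusters_b) / len(clusters_a | clusters_b)


if __name__ == "__main__":
    original = b"the quick brown fox jumps over the lazy dog " * 4
    edited = original.replace(b"lazy", b"late")
    print(similarity(original, original))
    print(similarity(original, edited))
```

A real system would likely replace the mean byte value with a more discriminating coefficient and the bucketing with a proper clustering step; the sketch only shows how the stages connect.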
Publication number US6947933(B2) Publication date 2005.09.20
Application number US20030738919 Filing date 2003.12.17
Applicant VERDASYS, INC. Inventor SMOLSKY MICHAEL
Classification G06F17/00;G06F17/30;(IPC1-7):G06F17/30 Main classification G06F17/00
Agency Agent
Principal claim
Address