Title: Similar document detection and electronic discovery
Abstract: Systems and methods are disclosed for performing duplicate document analyses to identify textually identical or similar documents, which may be electronic documents stored within an electronic discovery platform. A process is described which includes representing each of the documents, including a target document, as a relatively large n-tuple vector and also as a relatively small m-tuple vector, performing a series of one-dimensional searches on the set of m-tuple vectors to identify a set of documents which are near-duplicates of the target document, and then filtering the set of near-duplicate documents based upon the distance of their n-tuple vectors from that of the target document.
Publication No.: US9208219(B2)  Publication date: 2015.12.08
Application No.: US201313763253  Filing date: 2013.02.08
Applicant: STROZ FRIEDBERG, LLC  Inventors: Sperling Michael; Jin Rong; Rayvych Illya; Li Jianghong; Yi Jinfeng
Classification: G06F17/30  Main classification: G06F17/30
Agency: GTC Law Group LLP & Affiliates  Agent: GTC Law Group LLP & Affiliates
Principal claim: 1. A non-transitory storage medium having stored instructions which, when executed by a processor, cause the processor to perform actions with regard to a first dataset having a plurality of first dataset elements and which is operably accessible to the processor, each of the first dataset elements corresponding to a different document and each of the documents having one or more characteristics, the actions comprising: creating a n-tuple vector for each of a selected number of the first dataset element of the plurality of first dataset elements wherein each component of the n-tuple vector correlates to a characteristic of the relevant first dataset element; creating an m-tuple vector for each of two or more of the n-tuple vectors, wherein each of the m-tuple vectors includes as its components (a) the norm of its corresponding n-tuple vector, (b) the component sum of its corresponding n-tuple vector, and (c) a set of random projections of its corresponding n-tuple vector; selecting one of the dataset elements to be a target; selecting the m-tuple vector which corresponds to the target and at least one other of the m-tuple vectors as elements of a first candidate set; bisectionally performing a series of one-dimensional range searches starting with the first candidate set to create a second candidate set comprising one or more of the m-tuple vectors of the first candidate set; determining for each of the n-tuple vectors which corresponds to one of the m-tuple vectors of the second candidate set its distance from the target's n-tuple vector; and creating a second dataset comprising each of the first dataset elements which has a corresponding n-tuple vector which is within a selected distance from the target's n-tuple vector, wherein the actions further comprise selecting one of the one-dimensional searches to be based upon the norm of the target's m-tuple vector, wherein the one-dimensional search that is based upon the norm of the target's m-tuple vector includes setting a threshold related to a factor multiplied by the norm of the target's m-tuple vector, and wherein the factor is determined based upon the selected distance from the target's n-tuple vector, the norm of the target's n-tuple vector, and the maximum element in the target's n-tuple vector.
Address: New York NY US
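
The process described in the abstract and in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: this record does not give the formula for the norm-threshold factor, so the sketch substitutes simple search widths that follow from the triangle and Cauchy-Schwarz inequalities (d for the norm and for each unit-length random projection, sqrt(n)*d for the component sum, where d is the selected distance). The function names make_sketches, range_search_1d and near_duplicates, the parameter k (number of random projections), and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np


def make_sketches(X, k, seed=0):
    """Build one m-tuple (m = k + 2) per row of X.

    Components: Euclidean norm, component sum, and k random projections
    onto unit-length directions (hypothetical choice of projection).
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k))
    R /= np.linalg.norm(R, axis=0)  # normalise each projection direction
    norms = np.linalg.norm(X, axis=1)
    sums = X.sum(axis=1)
    return np.column_stack([norms, sums, X @ R])


def range_search_1d(values, ids, lo, hi):
    """Return the ids whose value lies in [lo, hi] via binary search.

    Sorting here is for brevity; an index kept pre-sorted per component
    would avoid re-sorting on every query.
    """
    order = np.argsort(values, kind="stable")
    sorted_vals = values[order]
    left = np.searchsorted(sorted_vals, lo, side="left")
    right = np.searchsorted(sorted_vals, hi, side="right")
    return ids[order[left:right]]


def near_duplicates(X, S, target, d):
    """Indices of rows of X within Euclidean distance d of row `target`."""
    n = X.shape[1]
    # Assumed search half-widths (not the factor recited in the claim):
    # norm and unit projections move by at most d, the sum by sqrt(n) * d.
    widths = np.full(S.shape[1], float(d))
    widths[1] = np.sqrt(n) * d
    widths = widths * (1 + 1e-9) + 1e-9  # slack for floating-point rounding

    candidates = np.arange(X.shape[0])  # first candidate set
    for j in range(S.shape[1]):  # one 1-D range search per m-tuple component
        t = S[target, j]
        candidates = range_search_1d(
            S[candidates, j], candidates, t - widths[j], t + widths[j]
        )

    # Final filter on the full n-tuple vectors of the surviving candidates.
    dist = np.linalg.norm(X[candidates] - X[target], axis=1)
    return np.sort(candidates[dist <= d])


if __name__ == "__main__":
    docs = np.array(
        [[1, 0, 2, 3], [1, 0, 2, 4], [9, 9, 9, 9], [1, 0, 2, 3]], dtype=float
    )
    sketches = make_sketches(docs, k=3)
    print(near_duplicates(docs, sketches, target=0, d=1.5))
```

Because each search width is an upper bound on how far the corresponding component can move when the n-tuple vector moves by at most d, the one-dimensional searches in this sketch never discard a true near-duplicate; the final distance check on the n-tuple vectors removes any false positives that remain.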