Title of invention |
Similar document detection and electronic discovery |
Abstract |
Systems and methods are disclosed for performing duplicate document analyses to identify textually identical or similar documents, which may be electronic documents stored within an electronic discovery platform. A process is described which includes representing each of the documents, including a target document, as a relatively large n-tuple vector and also as a relatively small m-tuple vector, performing a series of one-dimensional searches on the set of m-tuple vectors to identify a set of documents which are near-duplicates of the target document, and then filtering the set of near-duplicate documents based upon the distance of their n-tuple vectors from that of the target document. |
Publication number |
US9208219(B2) |
Publication date |
2015.12.08 |
Application number |
US201313763253 |
Filing date |
2013.02.08 |
Applicant |
STROZ FRIEDBERG, LLC |
Inventors |
Sperling Michael;Jin Rong;Rayvych Illya;Li Jianghong;Yi Jinfeng |
Classification number |
G06F17/30 |
Main classification number |
G06F17/30 |
Agency |
GTC Law Group LLP & Affiliates |
Agent |
GTC Law Group LLP & Affiliates |
Principal claim |
1. A non-transitory storage medium having stored instructions which, when executed by a processor, cause the processor to perform actions with regard to a first dataset having a plurality of first dataset elements and which is operably accessible to the processor, each of the first dataset elements corresponding to a different document and each of the documents having one or more characteristics, the actions comprising:
creating a n-tuple vector for each of a selected number of the first dataset element of the plurality of first dataset elements wherein each component of the n-tuple vector correlates to a characteristic of the relevant first dataset element;
creating an m-tuple vector for each of two or more of the n-tuple vectors, wherein each of the m-tuple vectors includes as its components (a) the norm of its corresponding n-tuple vector, (b) the component sum of its corresponding n-tuple vector, and (c) a set of random projections of its corresponding n-tuple vector;
selecting one of the dataset elements to be a target;
selecting the m-tuple vector which corresponds to the target and at least one other of the m-tuple vectors as elements of a first candidate set;
bisectionally performing a series of one-dimensional range searches starting with the first candidate set to create a second candidate set comprising one or more of the m-tuple vectors of the first candidate set;
determining for each of the n-tuple vectors which corresponds to one of the m-tuple vectors of the second candidate set its distance from the target's n-tuple vector; and
creating a second dataset comprising each of the first dataset elements which has a corresponding n-tuple vector which is within a selected distance from the target's n-tuple vector,
wherein the actions further comprise selecting one of the one-dimensional searches to be based upon the norm of the target's m-tuple vector,
wherein the one-dimensional search that is based upon the norm of the target's m-tuple vector includes setting a threshold related to a factor multiplied by the norm of the target's m-tuple vector, and
wherein the factor is determined based upon the selected distance from the target's n-tuple vector, the norm of the target's n-tuple vector, and the maximum element in the target's n-tuple vector. |
Address |
New York NY US |
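
The two-stage process recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`make_sketch`, `find_near_duplicates`), the Gaussian random projections, the Euclidean distance, and the use of binary search over sorted values for each one-dimensional range search are all assumptions made for illustration. In particular, the search tolerances below come from the reverse triangle inequality and the Cauchy-Schwarz inequality; they stand in for the claim's norm-based factor, whose exact formula is not given in this record.

```python
import math
import random
from bisect import bisect_left, bisect_right


def distance(a, b):
    """Euclidean distance between two n-tuple vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_sketch(vec, projections):
    """Build the small m-tuple: (norm, component sum, random projections...)."""
    norm = math.sqrt(sum(x * x for x in vec))
    total = sum(vec)
    dots = [sum(r * x for r, x in zip(proj, vec)) for proj in projections]
    return [norm, total] + dots


def find_near_duplicates(vectors, target_idx, max_dist, num_proj=4, seed=0):
    """Return indices of vectors within max_dist of the target vector."""
    rng = random.Random(seed)
    n = len(vectors[0])
    projections = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(num_proj)]
    sketches = [make_sketch(v, projections) for v in vectors]
    target = sketches[target_idx]

    # Tolerance per m-tuple component. If ||x - t|| <= d then:
    #   | ||x|| - ||t|| | <= d              (reverse triangle inequality)
    #   | sum(x) - sum(t) | <= sqrt(n) * d  (Cauchy-Schwarz)
    #   | r.x - r.t | <= ||r|| * d          (Cauchy-Schwarz)
    # These are illustrative bounds, not the factor described in the claim.
    eps = 1e-9  # guards against floating-point rounding at the boundary
    tolerances = [max_dist, math.sqrt(n) * max_dist]
    tolerances += [
        math.sqrt(sum(r * r for r in proj)) * max_dist for proj in projections
    ]

    # Stage 1: a series of one-dimensional range searches on the m-tuples.
    candidates = list(range(len(vectors)))
    for k, tol in enumerate(tolerances):
        candidates.sort(key=lambda i: sketches[i][k])
        keys = [sketches[i][k] for i in candidates]
        lo = bisect_left(keys, target[k] - tol - eps)
        hi = bisect_right(keys, target[k] + tol + eps)
        candidates = candidates[lo:hi]

    # Stage 2: exact distance check on the full n-tuple vectors.
    t_vec = vectors[target_idx]
    return sorted(i for i in candidates if distance(vectors[i], t_vec) <= max_dist)


if __name__ == "__main__":
    docs = [
        [3, 1, 0, 2],   # target
        [3, 1, 0, 2],   # exact duplicate
        [3, 1, 1, 2],   # near-duplicate at distance 1
        [0, 5, 4, 0],   # unrelated
        [10, 0, 0, 0],  # unrelated
    ]
    print(find_near_duplicates(docs, target_idx=0, max_dist=1.0))
```

Because each tolerance is an upper bound on how far a true near-duplicate's m-tuple component can lie from the target's, the first stage only discards vectors that the exact distance check would also reject. The second stage then removes any false positives that survived the cheap one-dimensional filters.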