Abstract |
PROBLEM TO BE SOLVED: To detect sets of files with similar content more accurately. SOLUTION: Tentative positions are determined for dividing each file into a predetermined number of equal-sized constituent segments. The data immediately before or after each tentative division position is read, and the point where a specific pattern is detected is fixed as the final division position. The file is divided at these final positions, and a hash value is calculated for each resulting constituent segment. To judge the similarity between two files, the hash values of one file's constituent segments are compared in order, segment by segment, with those of the other file, and the number or ratio of constituent segments whose hash values match is counted. The larger this number or ratio, the higher the degree of similarity. COPYRIGHT: (C)2011,JPO&INPIT
|
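
The procedure in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`split_positions`, `segment_hashes`, `similarity`), the use of a newline byte as the "specific pattern", the `window` search range, the forward-then-backward search order, and SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib


def split_positions(data: bytes, n_segments: int,
                    pattern: bytes = b"\n", window: int = 64) -> list[int]:
    """Return n_segments - 1 cut offsets.

    Each cut starts as a tentative equal-size position and is moved to
    just after a nearby occurrence of `pattern`, if one is found within
    `window` bytes (assumed search rule: forward first, then backward).
    """
    size = len(data)
    cuts = []
    prev = 0
    for i in range(1, n_segments):
        tentative = size * i // n_segments
        start = max(tentative, prev)           # never cut before the previous cut
        hi = min(size, start + window)
        lo = max(prev, tentative - window)
        idx = data.find(pattern, start, hi)    # data following the tentative cut
        if idx == -1:
            idx = data.rfind(pattern, lo, start)  # data preceding it
        cut = idx + len(pattern) if idx != -1 else start
        cuts.append(cut)
        prev = cut
    return cuts


def segment_hashes(data: bytes, n_segments: int,
                   pattern: bytes = b"\n", window: int = 64) -> list[str]:
    """Divide `data` at the final cut positions and hash each segment."""
    bounds = [0] + split_positions(data, n_segments, pattern, window) + [len(data)]
    return [hashlib.sha256(data[a:b]).hexdigest()
            for a, b in zip(bounds, bounds[1:])]


def similarity(a: bytes, b: bytes, n_segments: int = 4,
               pattern: bytes = b"\n", window: int = 64) -> float:
    """Ratio of segments, compared position by position, whose hashes match."""
    ha = segment_hashes(a, n_segments, pattern, window)
    hb = segment_hashes(b, n_segments, pattern, window)
    matches = sum(x == y for x, y in zip(ha, hb))
    return matches / n_segments


# Four 20-byte lines; the second file has one extra byte in its second line.
original = b"".join(ch * 19 + b"\n" for ch in (b"a", b"b", b"c", b"d"))
edited = original[:20] + b"b" + original[20:]
print(similarity(original, edited, n_segments=4, window=8))  # 0.75
```

In this example a purely fixed-size split would shift every segment after the inserted byte, so only the first segment would match. Moving each cut to the nearby pattern realigns the later segments, and three of the four hashes match.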