Title: Hybrid of proximity and identity similarity based deduplication in a data deduplication system
Abstract: For a hybrid of proximity and identity similarity based deduplication in a data deduplication system, color intensity is compared for additional classification enhancement of colored files grouped together by file coloring, where a preferred character is represented for the file coloring using a code selected from a multiplicity of codes that represent a variety of contexts. The original meaning of the preferred character is retained when representing the preferred character for the file coloring by the code selected from the multiplicity of codes.
Publication number: US9639549(B2); Publication date: 2017.05.02
Application number: US201414163721; Filing date: 2014.01.24
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION; Inventors: Goldberg Itzhack; Sondhi Neil
Classification: G06F17/30; Main classification: G06F17/30
Agency: Griffiths & Seaton PLLC; Agent: Griffiths & Seaton PLLC
Independent claim: 1. A method for a hybrid of proximity and identity similarity based deduplication in a data deduplication system using a processor device in a computing environment, comprising: comparing color intensity for additional classification enhancement of colored files grouped together by file coloring where a preferred character is represented for the file coloring using a code selected from a plurality of codes that represent a plurality of contexts thereby flattening B-TREE indexes when searching for duplicate data within the data deduplication system, wherein an original meaning of the preferred character is retained when representing the preferred character for the file coloring by the code selected from the plurality of codes; comparing the colored files by comparing vectors of at least two colored files; using the color intensity for comparing the colored files by measuring a ratio between an actual average distance of the colored file divided by an optimal average distance of the colored file for comparing distribution of colors in data chunks of the colored files, wherein the optimal average distance is equal to a file size divided by a total number of the file colors that appear within the colored files, wherein the color intensity includes a distribution pattern characteristic of the file coloring; and deduplicating the colored files grouped together of a same file coloring and color intensity.
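The color intensity ratio recited in the claim can be illustrated with a minimal Python sketch. The claim defines the optimal average distance as the file size divided by the total number of file colors; everything else here is an assumption made for illustration only: a colored file is modeled as a sequence of per-chunk color labels, the file size is taken as the number of chunks, and the "actual average distance" is taken as the mean gap between consecutive chunks of the same color. The function names are hypothetical and do not come from the patent.

```python
def optimal_average_distance(chunk_colors):
    """File size divided by the number of distinct colors, per the claim.

    Assumption: file size is measured in chunks.
    """
    return len(chunk_colors) / len(set(chunk_colors))


def actual_average_distance(chunk_colors):
    """Mean gap between consecutive chunks of the same color.

    Assumption: the claim does not define this quantity; this is one
    plausible reading. Returns 0.0 when no color repeats.
    """
    last_seen = {}  # color -> index of its most recent chunk
    gaps = []
    for position, color in enumerate(chunk_colors):
        if color in last_seen:
            gaps.append(position - last_seen[color])
        last_seen[color] = position
    return sum(gaps) / len(gaps) if gaps else 0.0


def color_intensity(chunk_colors):
    """Ratio of actual to optimal average distance for one colored file."""
    if not chunk_colors:
        raise ValueError("a colored file needs at least one chunk")
    return actual_average_distance(chunk_colors) / optimal_average_distance(chunk_colors)


# Two files with the same colors but different distribution patterns.
interleaved = color_intensity("ABAB")
clustered = color_intensity("AABB")
```

Under these assumptions, two files that share the same set of colors can still yield different intensity values when their colors are distributed differently across chunks, which is the kind of additional classification signal the claim describes.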
Address: Armonk NY US