Title of invention: De-duplication deployment planning
Abstract: Files are assigned to a de-duplication domain. The address space of the data files is divided into multiple containers. For each container, a file metadata scan is performed to obtain file system metadata, which is aggregated and summarized in a content feature summary. A content similarity prediction measurement between containers is computed from the generated content feature summaries, and files from each container are assigned to a de-duplication domain based on that content similarity prediction measurement.
Publication number: US9275068(B2); Publication date: 2016.03.01
Application number: US201314016268; Filing date: 2013.09.03
Applicant: International Business Machines Corporation; Inventors: Chambliss David D.; Constantinescu Mihail C.; Glider Joseph S.; Lu Maohua
Classification: G06F17/30; Main classification: G06F17/30
Agency: Lieberman & Brandsdorfer, LLC; Agent: Lieberman & Brandsdorfer, LLC
Principal claim: 1. A method comprising: dividing files corresponding to an address space into multiple containers; performing a file metadata scan, including obtaining attributes for files in each container; aggregating the file attributes into characterizations for each attribute dimension, and generating a content feature summary for each container based on a selection window and a signature list, wherein the content feature summary incorporates the characterizations and summarizes the signature list, wherein generating the content feature summary comprises computing one or more discrete file summaries, and wherein computing a discrete file summary comprises: selecting a file from a subset of files within one of the containers, and extracting one or more features from one or more attributes of the selected file; computing a signature from the one or more extracted features, wherein the signature comprises a numerical value; and adding the signature to the signature list in response to the numerical value being less than a first threshold associated with the selection window; measuring a content similarity prediction measurement between containers from the generated content feature summary; and assigning files from each container to a de-duplication domain based on the computed content similarity prediction measurement.
Address: Armonk, NY, US
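Illustrative sketch (not part of the patent record): the steps recited in claim 1 can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The record does not specify the extracted features, the hash function, how the selection window maps to a threshold, the similarity measure, or the assignment rule. Here the features are assumed to be file name and size, the signature a truncated SHA-256 value, the selection window a fraction of the hash space, the similarity measure a Jaccard index over signature sets, and the assignment a simple grouping of containers whose similarity meets a cutoff. All function and parameter names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 64  # signatures are integers in [0, 2**64)


def signature(features):
    """Compute a numerical signature from extracted features (assumed: SHA-256)."""
    data = "\x1f".join(str(f) for f in features).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def summarize_container(files, selection_fraction=0.25):
    """Build a signature list for one container.

    `files` is an iterable of dicts with hypothetical keys "name" and "size".
    A signature is kept only when it falls below the threshold derived from
    the selection window (assumed here to be a fraction of the hash space).
    """
    threshold = int(HASH_SPACE * selection_fraction)
    kept = set()
    for f in files:
        features = (f["name"], f["size"])  # assumed feature choice
        sig = signature(features)
        if sig < threshold:
            kept.add(sig)
    return kept


def similarity(a, b):
    """Jaccard index of two signature sets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_domains(summaries, min_similarity=0.5):
    """Group containers whose pairwise similarity meets the cutoff.

    Returns a dict mapping each container name to the name of the container
    that represents its de-duplication domain.
    """
    names = sorted(summaries)
    domain = {n: n for n in names}  # each container starts in its own domain

    def find(n):
        while domain[n] != n:
            n = domain[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(summaries[a], summaries[b]) >= min_similarity:
                domain[find(b)] = find(a)
    return {n: find(n) for n in names}
```

Keeping only signatures below a threshold samples the same files in every container, so containers holding similar files tend to retain overlapping signature sets while each summary stays small. With `selection_fraction=1.0` every signature is kept, which is convenient for checking the grouping logic on small inputs.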