Title: REAL-TIME IDENTIFICATION OF DATA CANDIDATES FOR CLASSIFICATION BASED COMPRESSION
Abstract: Data candidates for data processing are identified in real time by a processor device in a computing environment. Data candidates are sampled for classification-based compression. A heuristic is computed on a randomly selected data sample from each data candidate to determine whether the data candidate may benefit from the classification-based compression. The heuristic sums a ratio between the actual number of characters and the expected number of characters, then divides the result by the number of data classes that are not empty. Non-classifiable data are included in the number of data classes during the division, and a data class is not empty when characters belonging to it were observed in the input. The classification-based compression is performed on the data candidates if the resulting ratio exceeds a threshold.
Publication No.: US2015234852(A1) Publication Date: 2015.08.20
Application No.: US201514704700 Filing Date: 2015.05.05
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: AMIT Jonathan;DEMIDOV Lilia;GOLDBERG George;HALOWANI Nir;KAT Ronen I.;KOIFMAN Chaim;MARENKOV Sergey;SOTNIKOV Dmitry
Classification: G06F17/30 Main Classification: G06F17/30
Agency: Agent:
Main Claim: 1. A method for real-time identification of data candidates for data processing by a processor device in a computing environment, the method comprising: sampling data candidates for performing a classification-based compression upon the data candidates; and computing a heuristic on a randomly selected data sample from the data candidates, thereby determining if the data candidates may benefit from the classification-based compression, wherein a decision is provided for approving the classification-based compression on the data candidates according to the heuristic; wherein, if data classes are estimated by the processor to be present in the data candidates, summing a ratio between the actual number of the characters and the expected number of the characters, and then dividing the ratio by a number of the data classes that are not empty, wherein the non-classifiable data are included in the number of the data classes during the dividing, and the data classes that are not empty have characters belonging to the class that were observed in the input; and performing the classification-based compression on the data candidates if the ratio exceeds a threshold.
Address: Armonk NY US
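
The heuristic described in the abstract and claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The record does not define the data classes, how the "expected number of characters" is obtained, the sample size, or the threshold value, so everything below is an assumption made for illustration: four hypothetical byte classes plus an "other" class for non-classifiable bytes, an expected count taken from a uniform distribution over all 256 byte values, a single contiguous window as the random sample, and a threshold of 2.0.

```python
import random

# Hypothetical byte classes; the patent record does not enumerate them.
CLASSES = {
    "digit": set(b"0123456789"),
    "lower": set(b"abcdefghijklmnopqrstuvwxyz"),
    "upper": set(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "space": set(b" \t\r\n"),
}
OTHER = "other"  # non-classifiable bytes, counted as a class of their own


def random_sample(data: bytes, size: int, rng: random.Random) -> bytes:
    """Pick one contiguous window of `size` bytes at a random offset (assumed scheme)."""
    if len(data) <= size:
        return data
    start = rng.randrange(len(data) - size + 1)
    return data[start:start + size]


def heuristic_score(sample: bytes) -> float:
    """Sum actual/expected per class, then divide by the number of non-empty classes."""
    if not sample:
        return 0.0

    # Count how many sample bytes fall into each class.
    counts = {name: 0 for name in CLASSES}
    counts[OTHER] = 0
    for byte in sample:
        for name, members in CLASSES.items():
            if byte in members:
                counts[name] += 1
                break
        else:
            counts[OTHER] += 1

    # Class sizes, used for the assumed uniform baseline.
    sizes = {name: len(members) for name, members in CLASSES.items()}
    sizes[OTHER] = 256 - sum(sizes.values())

    total = 0.0
    non_empty = 0
    for name, actual in counts.items():
        if actual == 0:
            continue  # empty classes contribute nothing and are not counted
        expected = len(sample) * sizes[name] / 256  # assumption: uniform bytes
        total += actual / expected
        non_empty += 1
    return total / non_empty


def should_compress(data: bytes, threshold: float = 2.0,
                    sample_size: int = 4096, seed=None) -> bool:
    """Approve classification-based compression if the score exceeds the threshold."""
    rng = random.Random(seed)
    return heuristic_score(random_sample(data, sample_size, rng)) > threshold
```

Under these assumptions, data whose bytes are spread evenly across all 256 values scores about 1.0 and is rejected, while data concentrated in a few small classes (for example, digits only) scores well above the hypothetical threshold and is approved.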