Title of invention: GENERATING DATA FROM IMBALANCED TRAINING DATA SETS
Abstract: Injecting generated data samples into a minority data class of an imbalanced training data set is provided. In response to receiving an input to balance the imbalanced training data set, which includes a majority data class and the minority data class, a set of data samples is generated for the minority data class. A distance is calculated from each data sample in the set of generated data samples to a center of a kernel that includes a set of data samples of the majority data class. Each generated data sample is stored within a corresponding distance score bucket based on the distance calculated for that data sample. Generated data samples are then selected from a number of highest ranking distance score buckets, and the selected data samples are injected into the minority data class.
Publication number: US2015088791(A1) Publication date: 2015.03.26
Application number: US201314034797 Filing date: 2013.09.24
Applicant: International Business Machines Corporation Inventors: Lin Ching-Yung;Lin Wan-Yi;Xia Yinglong
Classification: G06N99/00 Main classification: G06N99/00
Agency: Agent:
Main claim: 1. A computer-implemented method for injecting generated data samples into a minority data class of an imbalanced training data set, the computer-implemented method comprising: responsive to a computer receiving an input to balance the imbalanced training data set that includes a majority data class and the minority data class, generating, by the computer, a set of data samples for the minority data class of the imbalanced training data set; calculating, by the computer, a distance from each data sample in the set of generated data samples to a center of a kernel that includes a set of data samples of the majority data class; storing, by the computer, each data sample in the set of generated data samples within a corresponding distance score bucket based on the calculated distance of a data sample; selecting, by the computer, generated data samples from a predetermined number of highest ranking distance score buckets; and injecting, by the computer, the generated data samples selected from the predetermined number of highest ranking distance score buckets into the minority data class to balance a size of the minority data class with a size of the majority data class.
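The steps recited in claim 1 (generate, measure distance, bucket, select, inject) can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the record does not specify: candidates are generated by interpolating between random pairs of minority samples, the kernel is taken to be the whole majority class with its centroid as the center, distance is Euclidean, buckets are equal-width, and a "higher ranking" bucket is assumed to mean one farther from the majority center. All function and parameter names (`balance`, `num_buckets`, `top_k`, `oversample`) are hypothetical.

```python
import math
import random


def centroid(points):
    """Mean point; stands in for the 'center of a kernel' (assumption)."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def generate_candidates(minority, count, rng):
    """Assumed generator: interpolate between random pairs of minority samples."""
    candidates = []
    for _ in range(count):
        a, b = rng.choice(minority), rng.choice(minority)
        t = rng.random()
        candidates.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return candidates


def bucket_by_distance(candidates, center, num_buckets):
    """Place each candidate in an equal-width bucket by Euclidean distance.

    Bucket 0 holds the nearest candidates; the last bucket holds the farthest.
    """
    distances = [math.dist(c, center) for c in candidates]
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width if all equal
    buckets = [[] for _ in range(num_buckets)]
    for c, d in zip(candidates, distances):
        idx = min(int((d - lo) / width), num_buckets - 1)
        buckets[idx].append(c)
    return buckets


def balance(majority, minority, num_buckets=5, top_k=2, oversample=4, seed=0):
    """Return the minority class plus samples drawn from the top_k buckets."""
    rng = random.Random(seed)
    needed = len(majority) - len(minority)
    if needed <= 0:
        return list(minority)
    candidates = generate_candidates(minority, needed * oversample, rng)
    buckets = bucket_by_distance(candidates, centroid(majority), num_buckets)
    selected = []
    # Assumed ranking: farthest bucket first.
    for bucket in reversed(buckets[-top_k:]):
        selected.extend(bucket)
    return list(minority) + selected[:needed]
```

Calling `balance(majority, minority)` returns the original minority samples followed by the injected ones. If the top `top_k` buckets hold fewer than the required number of candidates, the result stays smaller than the majority class; raising `oversample` or `top_k` is one way to close the gap in this sketch.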
Address: Armonk NY US