Abstract |
A method for efficiently detecting unknown malicious code. A data set is created as a collection of files comprising a first subset of malicious code files and a second subset of benign code files, the malicious and benign files being identified by an antivirus program. All files are parsed using n-gram moving windows of several lengths. The term frequency (TF) representation is computed for each n-gram in each file, and an initial set of top features is selected from all n-grams based on the document frequency (DF) measure. The number of top features is then reduced by feature selection methods to comply with the computational resources required for classifier training. The optimal number of features is determined by evaluating the detection accuracy of several sets of reduced top features. A dataset in which the proportion of benign files is greater than the proportion of malicious files is prepared, and a portion of that dataset is used for training the classifier. New malicious code within a stream of new files is automatically detected and acquired using Active Learning. |
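
The parsing and feature-selection steps named in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the function names (`byte_ngrams`, `term_frequencies`, `top_features_by_df`) are hypothetical, n-grams are taken over raw bytes with a stride of one, TF is normalized by the count of the most frequent n-gram in the file, and ties in DF are broken by byte order so that the result is deterministic. None of these choices is specified in the abstract.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> list:
    """Slide a window of n bytes over the data, moving one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def term_frequencies(data: bytes, n: int) -> dict:
    """Per-file TF for every n-gram.

    Assumption: counts are divided by the count of the most frequent
    n-gram in the same file, so values fall in (0, 1].
    """
    counts = Counter(byte_ngrams(data, n))
    if not counts:
        return {}
    peak = max(counts.values())
    return {gram: count / peak for gram, count in counts.items()}


def top_features_by_df(files: list, n: int, k: int) -> list:
    """Return the k n-grams that appear in the largest number of files."""
    df = Counter()
    for data in files:
        # A set counts each n-gram at most once per file (document frequency).
        df.update(set(byte_ngrams(data, n)))
    # Sort by descending DF; break ties by byte order (an assumption).
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


# Toy byte strings standing in for real files.
files = [b"\x01\x02\x03\x01\x02", b"\x01\x02\x04", b"\x05\x06"]
features = top_features_by_df(files, n=2, k=2)
vectors = [[term_frequencies(f, 2).get(g, 0.0) for g in features] for f in files]
```

Each row of `vectors` is the TF representation of one file restricted to the selected top features. In the described method this step would be repeated for several window lengths, followed by a further feature selection stage before classifier training.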