Title Systems and methods for spam detection using character histograms
Abstract Described spam detection techniques including string identification, pre-filtering, and character histogram and timestamp comparison steps facilitate accurate, computationally-efficient detection of rapidly-changing spam arriving in short-lasting waves. In some embodiments, a computer system extracts a target character string from an electronic communication such as a blog comment, transmits it to an anti-spam server, and receives an indicator of whether the respective electronic communication is spam or non-spam from the anti-spam server. The anti-spam server determines whether the electronic communication is spam or non-spam according to certain features of the character histogram of the target string. Some embodiments also perform an unsupervised clustering of incoming target strings into clusters, wherein all members of a cluster have similar character histograms.
Publication number US8954519(B2) Publication date 2015.02.10
Application number US201213358358 Filing date 2012.01.25
Applicant Bitdefender IPR Management Ltd. Inventors Dichiu Daniel;Lupsescu Lucian Z.
Classification G06F15/16;H04L29/06 Main classification G06F15/16
Agency Law Office of Andrei D Popovici, PC Agent Law Office of Andrei D Popovici, PC
Main claim 1. A method comprising: in response to receiving a target string forming a part of an electronic communication, employing at least one processor of a computer system to select a plurality of candidate strings from a corpus of reference strings, wherein selecting the plurality of candidate strings comprises: comparing a string length of the target string to a string length of a reference string of the corpus, and in response, selecting the reference string into the plurality of candidate strings according to a result of the comparison of string lengths; in response to selecting the plurality of candidate strings, employing the at least one processor to perform a first comparison between the target string and a candidate string of the plurality of candidate strings, and a second comparison between the target string and the candidate string; and employing the at least one processor to determine whether the electronic communication is spam or non-spam according to a result of the first comparison and the second comparison, wherein the first comparison comprises comparing, for each character of a plurality of distinct alphanumeric characters, a count of occurrences of the each character within the target string to a count of occurrences of the each character within the reference string, wherein the count of occurrences of the each character within the target string is determined without regard to a position of the each character relative to other characters within the target string, and wherein the second comparison comprises comparing a timestamp of the electronic communication to a timestamp of another electronic communication, the another electronic communication containing the candidate string.
Address Nicosia CY
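
The three steps recited in claim 1 (length-based candidate selection, position-independent character-count comparison, and timestamp comparison) can be illustrated with a minimal Python sketch. This is not the patented implementation. The names `Reference`, `select_candidates`, `histogram_distance`, and `is_spam`, the use of a sum of absolute count differences as the histogram measure, the rule that both comparisons must pass, every threshold value, and the premise that the corpus holds strings from known spam are assumptions made only for illustration; the record above does not specify any of them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Reference:
    """Hypothetical corpus entry: a reference string plus the timestamp
    (seconds) of the communication that contained it."""
    text: str
    timestamp: float


def histogram(s: str) -> Counter:
    # Count occurrences of each alphanumeric character, ignoring position.
    return Counter(ch for ch in s if ch.isalnum())


def histogram_distance(a: str, b: str) -> int:
    # Assumed measure: sum of absolute per-character count differences.
    ha, hb = histogram(a), histogram(b)
    return sum(abs(ha[c] - hb[c]) for c in ha.keys() | hb.keys())


def select_candidates(target: str, corpus: list, max_len_diff: int = 5) -> list:
    # Pre-filter: keep only references whose length is close to the target's.
    return [r for r in corpus if abs(len(r.text) - len(target)) <= max_len_diff]


def is_spam(target: str, target_ts: float, corpus: list,
            max_len_diff: int = 5, max_hist_dist: int = 4,
            max_age: float = 3600.0) -> bool:
    # Assumed decision rule: spam if some candidate is close to the target
    # both in character histogram and in arrival time.
    for ref in select_candidates(target, corpus, max_len_diff):
        close_hist = histogram_distance(target, ref.text) <= max_hist_dist
        close_time = abs(target_ts - ref.timestamp) <= max_age
        if close_hist and close_time:
            return True
    return False


# Usage: one assumed known-spam reference string seen at t = 1000 s.
corpus = [Reference("Buy cheap pills at example1 now", 1000.0)]
same_wave = is_spam("Buy cheap pills at example2 now", 1200.0, corpus)
much_later = is_spam("Buy cheap pills at example2 now", 100000.0, corpus)
```

Because the histogram ignores character order, reordering the characters of a spam string leaves the distance at zero, while the cheap length pre-filter keeps the more expensive histogram comparison away from most of the corpus.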