Title of invention |
Systems and methods for spam detection using character histograms |
Abstract |
Described spam detection techniques including string identification, pre-filtering, and character histogram and timestamp comparison steps facilitate accurate, computationally-efficient detection of rapidly-changing spam arriving in short-lasting waves. In some embodiments, a computer system extracts a target character string from an electronic communication such as a blog comment, transmits it to an anti-spam server, and receives an indicator of whether the respective electronic communication is spam or non-spam from the anti-spam server. The anti-spam server determines whether the electronic communication is spam or non-spam according to certain features of the character histogram of the target string. Some embodiments also perform an unsupervised clustering of incoming target strings into clusters, wherein all members of a cluster have similar character histograms. |
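The character-histogram and clustering ideas in the abstract can be illustrated with a minimal sketch. The function names, the case-folding step, the Manhattan distance, the greedy single-pass clustering, and the `max_distance` threshold are all assumptions chosen for illustration; the record does not specify them.

```python
from collections import Counter


def char_histogram(s: str) -> Counter:
    """Count alphanumeric characters, ignoring where they occur in the string.

    Lowercasing is an illustrative assumption, not taken from the patent.
    """
    return Counter(ch for ch in s.lower() if ch.isalnum())


def histogram_distance(a: Counter, b: Counter) -> int:
    """Sum of absolute per-character count differences (Manhattan distance)."""
    return sum(abs(a[ch] - b[ch]) for ch in a.keys() | b.keys())


def cluster_strings(strings, max_distance=3):
    """Greedy single-pass clustering (hypothetical scheme).

    Each string joins the first cluster whose representative histogram is
    within max_distance; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_histogram, members)
    for s in strings:
        h = char_histogram(s)
        for rep, members in clusters:
            if histogram_distance(h, rep) <= max_distance:
                members.append(s)
                break
        else:
            clusters.append((h, [s]))
    return [members for _, members in clusters]
```

Because the histogram ignores character order, a reshuffled or re-punctuated variant of a message such as `"pills cheap now!!"` falls in the same cluster as `"cheap pills now"`, while an unrelated string starts its own cluster.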
Publication number |
US8954519(B2) |
Publication date |
2015.02.10 |
Application number |
US201213358358 |
Filing date |
2012.01.25 |
Applicant |
Bitdefender IPR Management Ltd. |
Inventors |
Dichiu Daniel;Lupsescu Lucian Z. |
Classification |
G06F15/16;H04L29/06 |
Main classification |
G06F15/16 |
Agency |
Law Office of Andrei D Popovici, PC |
Agent |
Law Office of Andrei D Popovici, PC |
Main claim |
1. A method comprising:
in response to receiving a target string forming a part of an electronic communication, employing at least one processor of a computer system to select a plurality of candidate strings from a corpus of reference strings, wherein selecting the plurality of candidate strings comprises:
comparing a string length of the target string to a string length of a reference string of the corpus, and
in response, selecting the reference string into the plurality of candidate strings according to a result of the comparison of string lengths;
in response to selecting the plurality of candidate strings, employing the at least one processor to perform a first comparison between the target string and a candidate string of the plurality of candidate strings, and a second comparison between the target string and the candidate string; and
employing the at least one processor to determine whether the electronic communication is spam or non-spam according to a result of the first comparison and the second comparison,
wherein the first comparison comprises comparing, for each character of a plurality of distinct alphanumeric characters, a count of occurrences of the each character within the target string to a count of occurrences of the each character within the reference string,
wherein the count of occurrences of the each character within the target string is determined without regard to a position of the each character relative to other characters within the target string, and
wherein the second comparison comprises comparing a timestamp of the electronic communication to a timestamp of another electronic communication, the another electronic communication containing the candidate string. |
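The steps recited in the claim (length-based candidate selection, a position-independent character-count comparison, and a timestamp comparison) might be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the `Reference` type, the function names, all three thresholds, the decision rule that both comparisons must pass, and the premise that the corpus holds strings from known spam are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Reference:
    text: str         # reference string (assumed here to come from known spam)
    timestamp: float  # arrival time of the communication containing it, in seconds


def histogram(s: str) -> Counter:
    """Per-character counts of alphanumeric characters, independent of position."""
    return Counter(ch for ch in s if ch.isalnum())


def select_candidates(target: str, corpus, max_len_diff: int = 5):
    """Pre-filter: keep reference strings whose length is close to the target's."""
    return [r for r in corpus if abs(len(r.text) - len(target)) <= max_len_diff]


def is_spam(target: str, target_ts: float, corpus,
            max_len_diff: int = 5, max_hist_dist: int = 4,
            max_age: float = 3600.0) -> bool:
    """Flag the target if some candidate is close in both histogram and time.

    Thresholds and the "both must pass" rule are illustrative assumptions.
    """
    t_hist = histogram(target)
    for ref in select_candidates(target, corpus, max_len_diff):
        r_hist = histogram(ref.text)
        # First comparison: per-character occurrence counts.
        hist_dist = sum(abs(t_hist[c] - r_hist[c])
                        for c in t_hist.keys() | r_hist.keys())
        # Second comparison: timestamps of the two communications.
        close_in_time = abs(target_ts - ref.timestamp) <= max_age
        if hist_dist <= max_hist_dist and close_in_time:
            return True
    return False
```

The length pre-filter runs first because it is the cheapest test, so the histogram comparison is only performed on the smaller candidate set.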
Address |
Nicosia CY |