摘要 |
Described systems and methods allow a classification of electronic documents such as email messages and HTML documents, according to a document-specific text fingerprint. The text fingerprint is calculated for a text block of each target document, and comprises a sequence of characters determined according to a plurality of text tokens of the respective text block. In some embodiments, the length of the text fingerprint is forced within a pre-determined range of lengths (e.g. between 129 and 256 characters) irrespective of the length of the text block, by zooming in for short text blocks, and zooming out for long ones. Classification may include, for instance, determining whether an electronic document represents unsolicited communication (spam) or online fraud such as phishing. |