Title of invention Document Classification Using Multiscale Text Fingerprints
Abstract Described systems and methods allow a classification of electronic documents such as email messages and HTML documents, according to a document-specific text fingerprint. The text fingerprint is calculated for a text block of each target document, and comprises a sequence of characters determined according to a plurality of text tokens of the respective text block. In some embodiments, the length of the text fingerprint is forced within a pre-determined range of lengths (e.g. between 129 and 256 characters) irrespective of the length of the text block, by zooming in for short text blocks, and zooming out for long ones. Classification may include, for instance, determining whether an electronic document represents unsolicited communication (spam) or online fraud such as phishing.
Publication number US2014259157(A1) Publication date 2014.09.11
Application number US201313790636 Filing date 2013.03.08
Applicant Toma Adrian; Tibeica Marius N. Inventor Toma Adrian; Tibeica Marius N.
Classification H04L29/06 Main classification H04L29/06
Agency Agent
Main claim 1. A client computer system comprising at least one processor configured to determine a text fingerprint of a target electronic document so that a length of the text fingerprint is constrained between a lower bound and an upper bound, wherein the lower and upper bounds are predetermined, and wherein determining the text fingerprint comprises: selecting a plurality of text tokens of the target electronic document; in response to selecting the plurality of text tokens, determining a fingerprint fragment size according to the upper and lower bounds, and according to a count of the selected plurality of text tokens; determining a plurality of fingerprint fragments, each fingerprint fragment of the plurality of fingerprint fragments determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint fragment consisting of a sequence of characters, a length of the sequence chosen to equal the fingerprint fragment size; and concatenating the plurality of fingerprint fragments to form the text fingerprint.
Address Iasi RO
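
The steps recited in the main claim (select tokens, derive a fragment size from the bounds and the token count, hash each token into a fragment of that size, concatenate) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the record: whitespace tokenization, the SHAKE-256 hash rendered as hexadecimal characters, the rule `upper // token_count` for the fragment size, and the function names `fragment_size` and `text_fingerprint`. The sketch only covers token counts for which a whole-number fragment size satisfies both bounds; how very long text blocks are handled ("zooming out") is not specified in this record and is left out.

```python
import hashlib


def fragment_size(token_count, lower=129, upper=256):
    """Pick a per-token fragment length so that the total length
    token_count * size falls within [lower, upper].

    Hypothetical rule: use the largest size that does not exceed `upper`.
    """
    if token_count < 1 or token_count > upper:
        raise ValueError("token count not supported by this sketch")
    size = upper // token_count
    if token_count * size < lower:
        raise ValueError("no whole-number fragment size fits the bounds")
    return size


def text_fingerprint(text, lower=129, upper=256):
    """Concatenate one fixed-size hash fragment per token."""
    tokens = text.split()  # assumed token selection: whitespace split
    size = fragment_size(len(tokens), lower, upper)
    fragments = []
    for token in tokens:
        # shake_256 yields a digest of arbitrary length; hexdigest(n)
        # returns 2 * n hex characters, which are trimmed to `size`.
        digest = hashlib.shake_256(token.encode("utf-8"))
        fragments.append(digest.hexdigest((size + 1) // 2)[:size])
    return "".join(fragments)


if __name__ == "__main__":
    short = text_fingerprint("win a free prize now")
    longer = text_fingerprint(" ".join(f"word{i}" for i in range(100)))
    print(len(short), len(longer))
```

With the example bounds from the abstract (129 to 256 characters), a short text block gets long fragments per token and a longer block gets short ones, so both fingerprints land inside the same length range.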