Title of invention: Text, character encoding and language recognition
Abstract: A method is disclosed for recognizing whether electronic data is the digital representation of a piece of text and, if so, which character encoding was used to encode it. A fingerprint is constructed from the data. For each of a plurality of predetermined character encoding schemes, the fingerprint comprises at least one confidence value representing a confidence that the data was encoded using that scheme. The fingerprint also comprises a frequency value for each of a subset of byte values, each frequency value representing the frequency of occurrence of the respective byte value in the data. A statistical classification of the data is then performed based on the fingerprint. The method may be applied to spam classification.
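The fingerprint described in the abstract can be illustrated with a minimal Python sketch. This is not the patented method: the abstract does not say how the confidence values are computed, which encodings are tested, or which byte values are tracked. The names `CANDIDATE_ENCODINGS`, `TRACKED_BYTES`, `decode_confidence` and `build_fingerprint` are hypothetical, and the confidence used here (the fraction of the data that decodes before the first error) is only a crude stand-in.

```python
from collections import Counter

# Hypothetical choices for illustration; the abstract names neither the
# encoding schemes nor the subset of byte values.
CANDIDATE_ENCODINGS = ("ascii", "utf-8", "shift_jis")
TRACKED_BYTES = (0x00, 0x0A, 0x20, 0x65, 0xFF)


def decode_confidence(data: bytes, encoding: str) -> float:
    """Assumed stand-in confidence: fraction of bytes decoded before the first error."""
    if not data:
        return 0.0
    try:
        data.decode(encoding, errors="strict")
        return 1.0
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that failed to decode.
        return exc.start / len(data)


def build_fingerprint(data: bytes) -> list[float]:
    """Return one confidence per encoding, then one frequency per tracked byte."""
    counts = Counter(data)          # iterating over bytes yields ints 0..255
    total = len(data) or 1          # avoid division by zero on empty input
    confidences = [decode_confidence(data, enc) for enc in CANDIDATE_ENCODINGS]
    frequencies = [counts[b] / total for b in TRACKED_BYTES]
    return confidences + frequencies


if __name__ == "__main__":
    sample = "héllo wörld\n".encode("utf-8")
    print(build_fingerprint(sample))
```

The resulting vector could then be passed to any statistical classifier. The abstract does not specify which classifier is used, so none is shown here.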
Publication number: EP2506154(A2)  Publication date: 2012.10.03
Application number: EP20120162469  Filing date: 2012.03.30
Applicant: CLEARSWIFT LIMITED  Inventors: SCHOFIELD, KEVIN; BIRO, ISTVAN
Classification: G06F17/22; G06F17/27; G06Q10/10; H04L12/58  Main classification: G06F17/22
Agency:  Agent:
Main claim:
Address: