Title of invention Automatic genre classification determination of web content to which the web content belongs together with a corresponding genre probability
Abstract A mechanism is provided for automatic genre determination of web content. For each type of web content genre, a set of relevant feature types is extracted from collected training material, where genre features and non-genre features are represented by tokens and an integer count represents a frequency of appearance of the token in both a first type of training material and a second type of training material. In a classification process, fixed length tokens are extracted for relevant feature types from different text and structural elements of web content. For each relevant feature type, a corresponding feature probability is calculated. The feature probabilities are combined into an overall genre probability that the web content belongs to a specific trained web content genre. A genre classification result is then output comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability.
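The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the record does not state how token counts become a feature probability or how feature probabilities are combined, so the smoothed log-likelihood ratio in `feature_probability` and the naive-Bayes-style product in `combine` are assumptions. The token length of 5 and all function names are likewise hypothetical.

```python
from collections import Counter
from math import exp, log


def tokens(content, n=5):
    """Slide a fixed-length window over a content string (n=5 is an assumption)."""
    return [content[i:i + n] for i in range(len(content) - n + 1)]


def train(genre_docs, other_docs, n=5):
    """Build a feature database {token: (CG, CNG)} from the two kinds of
    training material (genre examples and non-genre examples)."""
    cg = Counter(t for d in genre_docs for t in tokens(d, n))
    cng = Counter(t for d in other_docs for t in tokens(d, n))
    return {t: (cg[t], cng[t]) for t in cg.keys() | cng.keys()}


def feature_probability(content, db, n=5):
    """Assumed scoring: sum add-one-smoothed log-likelihood ratios of the
    known tokens, then squash the sum to a probability with a logistic."""
    total_g = sum(c_g for c_g, _ in db.values())
    total_ng = sum(c_ng for _, c_ng in db.values())
    vocab = len(db)
    score = 0.0
    for t in tokens(content, n):
        if t not in db:
            continue  # tokens never seen in training carry no evidence
        c_g, c_ng = db[t]
        p_g = (c_g + 1) / (total_g + vocab)
        p_ng = (c_ng + 1) / (total_ng + vocab)
        score += log(p_g) - log(p_ng)
    score = max(-700.0, min(700.0, score))  # keep exp() within float range
    return 1.0 / (1.0 + exp(-score))


def combine(feature_probs):
    """Assumed combination of per-feature-type probabilities into one
    overall genre probability (naive-Bayes-style product rule)."""
    p_yes = p_no = 1.0
    for p in feature_probs:
        p_yes *= p
        p_no *= 1.0 - p
    return p_yes / (p_yes + p_no)


db = train(["online shop buy cart checkout"], ["weather forecast rain today"])
p_shop = feature_probability("buy cart", db)
p_weather = feature_probability("rain today", db)
```

With this toy database, `p_shop` comes out above 0.5 and `p_weather` below it, and content with no known tokens scores exactly 0.5. In a full system there would be one such database per feature type and per trained genre, with `combine` applied to the per-feature results.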
Publication number US9565236(B2) Publication date 2017.02.07
Application number US201314096481 Filing date 2013.12.04
Applicant International Business Machines Corporation Inventors Harz Dirk; Iffert Ralf; Keinhoerster Mark; Usher Mark
Classification H04L29/08; G06N99/00 Main classification H04L29/08
Agency Agents Lammes Francis; Walder, Jr. Stephen J.; Zarick Gail H.
Principal claim 1. A system comprising: a hardware processor; and a memory coupled to the hardware processor, wherein the memory comprises instructions which, when executed by the hardware processor, cause the hardware processor to:
  for each type of web content genre to be trained in a training process:
    collect first labeled example data representing a first type of training material reflecting the type of web content genre to be trained;
    collect second labeled example data representing a second type of training material not reflecting the type of web content genre to be trained;
    extract a set of feature types comprising genre features and non-genre features from the collected first type of training material and second type of training material, wherein the genre features and the non-genre features are represented by tokens consisting of fixed length character strings extracted from content strings of the first and second type of training material; and
    store each token in a corresponding feature database together with a first integer count (CG) representing a frequency of appearance of the token in the first type of training material and a second integer count (CNG) representing a frequency of appearance of the token in the second type of training material; and
  in a classification process:
    provide web content, wherein the web content is a HyperText Markup Language (HTML) document, which is parsed to generate HTML document object model (DOM) data providing a tree representation of the HTML document, where each tag, attribute and text data of the web content is represented as a node in the tree, wherein a first feature type is generated by joining together attribute values of all HTML meta data tags to form a single content string, wherein each attribute value is separated by a single space character, further text content from a HTML title tag and HTML anchor tags is extracted and appended to the content string, wherein characters are converted to lower case and only alpha-numeric and space characters are added to the content string and sequences of space characters are compressed to a single space character; wherein a second feature type is generated by joining together attribute values of all HTML anchor tags and all link tags to form a single content string, wherein each attribute value is separated by a single space character, and wherein characters are converted to lower case and only alpha-numeric and space characters are added to the content string and sequences of space characters are compressed to a single space character;
    extract fixed length tokens for each feature type of the set of feature types from different text and structural elements of the web content;
    look up frequencies of appearance in the corresponding feature database for each extracted token;
    calculate for each feature type of the set of feature types a corresponding feature probability that the web content belongs to a corresponding specific trained web content genre by combining probabilities of the genre features and non-genre features;
    combine the feature probabilities to an overall genre probability that the web content belongs to a specific trained web content genre; and
    output a genre classification result comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability.
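The two content strings recited in the claim could be built along the following lines. This is a hedged sketch only: it uses Python's event-based `html.parser` as a stand-in for a full DOM tree, treats "alpha-numeric" as ASCII letters and digits, and drops every other character (including tabs and newlines) rather than replacing it. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


def normalize(parts):
    """Join parts with single spaces, lower-case, keep only a-z, 0-9 and
    space (ASCII-only is an assumption), then compress runs of spaces."""
    text = " ".join(parts).lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return re.sub(r" +", " ", text)


class FeatureStrings(HTMLParser):
    """Collects the raw material for the two feature types in one pass."""

    def __init__(self):
        super().__init__()
        self.meta_values = []   # attribute values of <meta> tags
        self.text_parts = []    # text inside <title> and <a> tags
        self.link_values = []   # attribute values of <a> and <link> tags
        self._open = []         # currently open <title>/<a> tags

    def handle_starttag(self, tag, attrs):
        values = [value for _, value in attrs if value]
        if tag == "meta":
            self.meta_values += values
        elif tag in ("a", "link"):
            self.link_values += values
        if tag in ("title", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        if self._open:
            self.text_parts.append(data)


def feature_strings(html):
    """Return (first_feature_string, second_feature_string) for a page."""
    parser = FeatureStrings()
    parser.feed(html)
    parser.close()
    # First type: meta attribute values, then title and anchor text appended.
    first = normalize(parser.meta_values + parser.text_parts)
    # Second type: attribute values of anchor and link tags.
    second = normalize(parser.link_values)
    return first, second


page = (
    '<html><head><title>My Shop!</title>'
    '<meta name="keywords" content="Shoes, Boots">'
    '<link rel="stylesheet" href="/s.css"></head>'
    '<body><a href="/cart">Cart</a></body></html>'
)
first, second = feature_strings(page)
# first  == "keywords shoes boots my shop cart"
# second == "stylesheet scss cart"
```

Because punctuation is dropped rather than replaced by a space, a URL such as `/s.css` collapses to `scss`. Fixed-length tokens would then be taken from each of these strings and looked up in the corresponding feature database.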
Address Armonk, NY, US