Title of invention |
Automatic genre classification determination of web content to which the web content belongs together with a corresponding genre probability |
Abstract |
A mechanism is provided for automatic genre determination of web content. For each type of web content genre, a set of relevant feature types is extracted from collected training material, where genre features and non-genre features are represented by tokens and integer counts represent the frequency of appearance of each token in both a first type of training material and a second type of training material. In a classification process, fixed length tokens are extracted for the relevant feature types from different text and structural elements of web content. For each relevant feature type, a corresponding feature probability is calculated. The feature probabilities are combined into an overall genre probability that the web content belongs to a specific trained web content genre. A genre classification result is then output comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability. |
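The training and classification flow summarized in the abstract can be illustrated with a minimal sketch. The record does not state the token length, the smoothing, or the formula used to combine probabilities, so `TOKEN_LEN`, the add-one smoothing in `token_probability`, and the naive-Bayes-style product in `combine` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter

TOKEN_LEN = 5  # assumed value; the record only says "fixed length"


def tokens(content, n=TOKEN_LEN):
    """Return every substring of length n, sliding one character at a time."""
    return [content[i:i + n] for i in range(len(content) - n + 1)]


def train(genre_strings, non_genre_strings, n=TOKEN_LEN):
    """Build a feature database mapping token -> (CG, CNG) integer counts."""
    cg, cng = Counter(), Counter()
    for s in genre_strings:
        cg.update(tokens(s, n))
    for s in non_genre_strings:
        cng.update(tokens(s, n))
    return {t: (cg[t], cng[t]) for t in cg.keys() | cng.keys()}


def token_probability(cg, cng):
    """Assumed add-one smoothing so the result is never exactly 0 or 1."""
    return (cg + 1) / (cg + cng + 2)


def combine(probs):
    """Assumed combination: prod(p) / (prod(p) + prod(1 - p)), in log space."""
    if not probs:
        return 0.5  # no evidence either way
    log_p = sum(math.log(p) for p in probs)
    log_q = sum(math.log(1.0 - p) for p in probs)
    diff = max(min(log_q - log_p, 700.0), -700.0)  # keep exp() in range
    return 1.0 / (1.0 + math.exp(diff))


def feature_probability(content, db, n=TOKEN_LEN):
    """Look up each extracted token; tokens missing from db are skipped (assumption)."""
    probs = [token_probability(*db[t]) for t in tokens(content, n) if t in db]
    return combine(probs)


def genre_probability(feature_probs):
    """Combine per-feature-type probabilities into one overall genre probability."""
    return combine(feature_probs)
```

Calling `train` once per genre with genre and non-genre example strings yields one feature database per feature type; `feature_probability` is then evaluated per feature type and the results passed to `genre_probability`.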
Publication number |
US9565236(B2) |
Publication date |
2017.02.07 |
Application number |
US201314096481 |
Filing date |
2013.12.04 |
Applicant |
International Business Machines Corporation |
Inventors |
Harz Dirk;Iffert Ralf;Keinhoerster Mark;Usher Mark |
Classification numbers |
H04L29/08;G06N99/00 |
Main classification number |
H04L29/08 |
Agency |
|
Agents |
Lammes Francis;Walder, Jr. Stephen J.;Zarick Gail H. |
Principal claim |
1. A system comprising:
a hardware processor; and
a memory coupled to the hardware processor, wherein the memory comprises instructions which, when executed by the hardware processor, cause the hardware processor to:
for each type of web content genre to be trained in a training process:
collect first labeled example data representing a first type of training material reflecting the type of web content genre to be trained;
collect second labeled example data representing a second type of training material not reflecting the type of web content genre to be trained;
extract a set of feature types comprising genre features and non-genre features from the collected first type of training material and second type of training material, wherein the genre features and the non-genre features are represented by tokens consisting of fixed length character strings extracted from content strings of the first and second type of training material; and
store each token in a corresponding feature database together with a first integer count (CG) representing a frequency of appearance of the token in the first type of training material and a second integer count (CNG) representing a frequency of appearance of the token in the second type of training material; and
in a classification process:
provide web content, wherein the web content is a HyperText Markup Language (HTML) document, which is parsed to generate HTML document object model (DOM) data providing a tree representation of the HTML document, where each tag, attribute and text data of the web content is represented as a node in the tree, wherein a first feature type is generated by joining together attribute values of all HTML meta data tags to form a single content string, wherein each attribute value is separated by a single space character, further text content from a HTML title tag and HTML anchor tags is extracted and appended to the content string, wherein characters are converted to lower case and only alpha-numeric and space characters are added to the content string and sequences of space characters are compressed to a single space character;
wherein a second feature type is generated by joining together attribute values of all HTML anchor tags and all link tags to form a single content string, wherein each attribute value is separated by a single space character, and wherein characters are converted to lower case and only alpha-numeric and space characters are added to the content string and sequences of space characters are compressed to a single space character;
extract fixed length tokens for each feature type of the set of feature types from different text and structural elements of the web content;
look up frequencies of appearance in the corresponding feature database for each extracted token;
calculate for each feature type of the set of feature types a corresponding feature probability that the web content belongs to a corresponding specific trained web content genre by combining probabilities of the genre features and non-genre features;
combine the feature probabilities to an overall genre probability that the web content belongs to a specific trained web content genre; and
output a genre classification result comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability. |
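The construction of the two content strings described in the claim can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it swaps the DOM tree named in the claim for Python's streaming `html.parser.HTMLParser`, treats "alpha-numeric" as ASCII `a-z0-9`, and trims leading and trailing spaces. The names `FeatureExtractor`, `normalize`, and `content_strings` are hypothetical.

```python
import re
from html.parser import HTMLParser


def normalize(parts):
    """Join with single spaces, lower-case, keep [a-z0-9 ], compress spaces."""
    text = " ".join(parts).lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)  # ASCII-only filter is an assumption
    return re.sub(r" +", " ", text).strip()  # trimming is an assumption


class FeatureExtractor(HTMLParser):
    """Collects the raw pieces for the first and second feature types."""

    def __init__(self):
        super().__init__()
        self.meta_values = []   # attribute values of <meta> tags
        self.text_parts = []    # text inside <title> and <a> elements
        self.link_values = []   # attribute values of <a> and <link> tags
        self._depth = 0         # number of currently open <title>/<a> elements

    def handle_starttag(self, tag, attrs):
        values = [value for _, value in attrs if value]
        if tag == "meta":
            self.meta_values.extend(values)
        elif tag in ("a", "link"):
            self.link_values.extend(values)
        if tag in ("title", "a"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("title", "a") and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.text_parts.append(data)


def content_strings(html):
    """Return (first_feature_string, second_feature_string) for an HTML document."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    first = normalize(parser.meta_values + parser.text_parts)
    second = normalize(parser.link_values)
    return first, second
```

Each returned string would then be cut into fixed length tokens and looked up in the feature database for its feature type. Because the filter follows the claim literally, punctuation inside URLs is dropped rather than replaced, so `/css/main.css` becomes `cssmaincss`.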
Address |
Armonk NY US |