Title of Invention |
Automatic Genre Determination of Web Content |
Abstract |
A mechanism is provided for automatic genre determination of web content. For each type of web content genre, a set of relevant feature types is extracted from collected training material, where genre features and non-genre features are represented by tokens, and integer counts represent the frequency of appearance of each token in both a first type of training material and a second type of training material. In a classification process, fixed-length tokens are extracted for the relevant feature types from different text and structural elements of the web content. For each relevant feature type, a corresponding feature probability is calculated. The feature probabilities are combined into an overall genre probability that the web content belongs to a specific trained web content genre. A genre classification result is then output, comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability. |
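The token-and-count representation described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the token length of 4, the overlapping sliding window, and the names `extract_tokens` and `train_feature_database` are all hypothetical. The record itself states only that tokens are fixed-length character strings and that each token is stored with two integer counts.

```python
from collections import Counter

TOKEN_LENGTH = 4  # assumed value; the source only says the length is fixed


def extract_tokens(text, length=TOKEN_LENGTH):
    """Return every overlapping fixed-length character string in text.

    The sliding window is an assumption; strings shorter than `length`
    yield no tokens.
    """
    return [text[i:i + length] for i in range(len(text) - length + 1)]


def train_feature_database(genre_examples, non_genre_examples,
                           length=TOKEN_LENGTH):
    """Map each token to a pair of integer counts.

    The first count is the token's frequency in the genre material,
    the second its frequency in the non-genre material.
    """
    genre_counts = Counter()
    non_genre_counts = Counter()
    for text in genre_examples:
        genre_counts.update(extract_tokens(text, length))
    for text in non_genre_examples:
        non_genre_counts.update(extract_tokens(text, length))
    all_tokens = set(genre_counts) | set(non_genre_counts)
    return {tok: (genre_counts[tok], non_genre_counts[tok])
            for tok in all_tokens}


# Toy training material for a hypothetical "shop" genre.
database = train_feature_database(
    genre_examples=["add to cart", "cart total"],
    non_genre_examples=["daily news", "add a comment"],
)
```

In this toy run the token `cart` occurs only in the genre examples, while `add ` (with its trailing space) occurs once in each set, so its counts carry no evidence either way.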
Publication Number |
US2014201113(A1) |
Publication Date |
2014.07.17 |
Application Number |
US201314096481 |
Filing Date |
2013.12.04 |
Applicant |
International Business Machines Corporation |
Inventors |
Harz Dirk; Iffert Ralf; Keinhoerster Mark; Usher Mark |
Classification |
G06N99/00 |
Main Classification |
G06N99/00 |
Agency |
|
Agent |
|
Independent Claim |
1. A method, in a data processing system, for automatic genre determination of web content, the method comprising:
a training process, wherein the training process comprises, for each type of web content genre to be trained, the steps of:
collecting labeled example data representing a first type of training material reflecting the type of web content genre to be trained;
collecting labeled example data representing a second type of training material not reflecting the type of web content genre to be trained;
extracting a set of relevant feature types comprising genre features and non-genre features from the collected first and second type of training material, wherein the genre features and the non-genre features are represented by tokens consisting of fixed length character strings extracted from content strings of the first and second type of training material; and
storing each token in a corresponding feature database together with a first integer count (CG) representing a frequency of appearance of the token in the first type of training material and a second integer count (CNG) representing a frequency of appearance of the token in the second type of training material; and
a classification process, wherein the classification process comprises the steps of:
providing web content;
extracting fixed length tokens for relevant feature types from different text and structural elements of the web content;
looking up frequencies of appearance in the corresponding feature database for each extracted token;
calculating for each relevant feature type a corresponding feature probability that the web content belongs to a corresponding specific trained web content genre by combining probabilities of the genre features and non-genre features;
combining the feature probabilities into an overall genre probability that the web content belongs to a specific trained web content genre; and
outputting a genre classification result comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability. |
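The classification steps of the claim can be sketched as follows. The claim does not say how token probabilities are derived from CG and CNG or how probabilities are combined, so everything numeric here is an assumption: add-one smoothing, a naive-Bayes-style product rule, and reuse of that same rule to merge feature probabilities into the overall genre probability. The feature-type names (`title`, `body`), the token length of 4, and all function names are hypothetical.

```python
import math

TOKEN_LENGTH = 4  # assumed; the claim only requires a fixed length


def extract_tokens(text, length=TOKEN_LENGTH):
    """Overlapping fixed-length character strings (sliding window assumed)."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]


def token_probability(cg, cng, total_g, total_ng):
    """Probability that a token indicates the genre, from its CG/CNG counts.

    Add-one smoothing (an assumption) keeps the result strictly in (0, 1).
    """
    g = (cg + 1) / (total_g + 2)
    ng = (cng + 1) / (total_ng + 2)
    return g / (g + ng)


def combine(probabilities):
    """Assumed rule: prod(p) / (prod(p) + prod(1 - p)), computed via log-odds."""
    if not probabilities:
        return 0.5  # no evidence either way
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probabilities)
    log_odds = max(-700.0, min(700.0, log_odds))  # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-log_odds))


def genre_probability(content, databases):
    """content: feature type -> text; databases: feature type -> {token: (CG, CNG)}."""
    feature_probs = []
    for feature_type, text in content.items():
        db = databases.get(feature_type, {})
        total_g = sum(cg for cg, _ in db.values())
        total_ng = sum(cng for _, cng in db.values())
        # Look up each extracted token; tokens never seen in training are skipped.
        token_probs = [token_probability(*db[tok], total_g, total_ng)
                       for tok in extract_tokens(text) if tok in db]
        if token_probs:
            feature_probs.append(combine(token_probs))
    return combine(feature_probs)


def classify(content, genre_models):
    """Return (genre, probability) pairs, most probable genre first."""
    results = [(genre, genre_probability(content, dbs))
               for genre, dbs in genre_models.items()]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy feature databases for two hypothetical genres.
genre_models = {
    "shop": {"title": {"shop": (5, 0), "news": (0, 5)}},
    "news": {"title": {"shop": (0, 5), "news": (5, 0)}},
}
ranking = classify({"title": "shop"}, genre_models)
```

Working in log-odds rather than multiplying raw probabilities is a design choice made here to avoid underflow when a page yields many tokens; it gives the same result as the product rule named in the docstring.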
Address |
Armonk NY US |