Title of invention Automatic Genre Determination of Web Content
Abstract A mechanism is provided for automatic genre determination of web content. For each type of web content genre, a set of relevant feature types is extracted from collected training material, where genre features and non-genre features are represented by tokens, and integer counts represent the frequency of appearance of each token in a first type of training material and in a second type of training material. In a classification process, fixed length tokens are extracted for relevant feature types from different text and structural elements of web content. For each relevant feature type, a corresponding feature probability is calculated. The feature probabilities are combined into an overall genre probability that the web content belongs to a specific trained web content genre. A genre classification result is then output comprising at least one specific trained web content genre to which the web content belongs, together with a corresponding genre probability.
Publication number US2014201113(A1) Publication date 2014.07.17
Application number US201314096481 Filing date 2013.12.04
Applicant International Business Machines Corporation Inventors Harz Dirk; Iffert Ralf; Keinhoerster Mark; Usher Mark
Classification G06N99/00 Main classification G06N99/00
Agency Agent
Main claim 1. A method, in a data processing system, for automatic genre determination of web content, the method comprising: a training process, wherein the training process comprises, for each type of web content genre to be trained, the steps of: collecting labeled example data representing a first type of training material reflecting the type of web content genre to be trained; collecting labeled example data representing a second type of training material not reflecting the type of web content genre to be trained; extracting a set of relevant feature types comprising genre features and non-genre features from the collected first and second type of training material, wherein the genre features and the non-genre features are represented by tokens consisting of fixed length character strings extracted from content strings of the first and second type of training material; and storing each token in a corresponding feature database together with a first integer count (CG) representing a frequency of appearance of the token in the first type of training material and a second integer count (CNG) representing a frequency of appearance of the token in the second type of training material; and a classification process, wherein the classification process comprises the steps of: providing web content; extracting fixed length tokens for relevant feature types from different text and structural elements of the web content; looking up frequencies of appearance in the corresponding feature database for each extracted token; calculating for each relevant feature type a corresponding feature probability that the web content belongs to a corresponding specific trained web content genre by combining probabilities of the genre features and non-genre features; combining the feature probabilities into an overall genre probability that the web content belongs to a specific trained web content genre; and outputting a genre classification result comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability.
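The training and classification steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not state the token length, how per-token probabilities are computed, or how feature probabilities are combined, so the following are all assumptions made for illustration: a token length of 4, add-one smoothing, a naive-Bayes-style log-odds sum within each feature type, and a plain average across feature types. The names `tokens`, `FeatureDB`, and `classify` are hypothetical.

```python
import math
from collections import defaultdict

TOKEN_LEN = 4  # assumed fixed token length; the claim does not give a value


def tokens(text, n=TOKEN_LEN):
    """Slide an n-character window over the text to get fixed-length tokens."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class FeatureDB:
    """Feature database for one feature type: token -> [CG, CNG] counts."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])
        self.totals = [0, 0]  # total tokens seen in genre / non-genre material

    def train(self, text, is_genre):
        """Count tokens from one labeled training example."""
        idx = 0 if is_genre else 1
        for tok in tokens(text):
            self.counts[tok][idx] += 1
            self.totals[idx] += 1

    def probability(self, text):
        """Feature probability that the text belongs to the trained genre.

        Assumption: add-one smoothed token likelihoods combined as a sum
        of log-odds, then mapped to (0, 1) with a logistic function.
        """
        log_odds = 0.0
        for tok in tokens(text):
            cg, cng = self.counts.get(tok, (0, 0))
            p_genre = (cg + 1) / (self.totals[0] + 2)
            p_non_genre = (cng + 1) / (self.totals[1] + 2)
            log_odds += math.log(p_genre) - math.log(p_non_genre)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


def classify(content, dbs):
    """Return genre -> overall genre probability.

    content: feature type -> text taken from that element of the page.
    dbs: genre -> feature type -> FeatureDB.
    Assumption: feature probabilities are combined by a plain average.
    """
    result = {}
    for genre, by_feature in dbs.items():
        probs = [db.probability(content[ft])
                 for ft, db in by_feature.items() if ft in content]
        if probs:
            result[genre] = sum(probs) / len(probs)
    return result


# Toy training data, invented for this example only.
dbs = {"blog": {"title": FeatureDB(), "body": FeatureDB()}}
dbs["blog"]["title"].train("my weekend blog post", True)
dbs["blog"]["title"].train("quarterly earnings report", False)
dbs["blog"]["body"].train("posted by admin, leave a comment", True)
dbs["blog"]["body"].train("net revenue rose in the fiscal year", False)

result = classify(
    {"title": "another blog post", "body": "leave a comment below"}, dbs
)
print(result)
```

Keeping a separate `FeatureDB` per feature type mirrors the claim's "corresponding feature database", so tokens from a title and tokens from body text never share counts. A real system would need far more training material than this toy data, and would likely weight feature types rather than average them.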
Address Armonk NY US