Title of invention: Creating taxonomies and training data for document categorization
Abstract: Methods, apparatus, and systems are provided to generate, from a set of training documents, a set of training data and a set of features for a taxonomy of categories. In the generated taxonomy, the degree of feature overlap among categories is minimized to optimize use with a machine-based categorizer, yet the categories still make sense to a human because a human makes the decisions regarding category definitions. In an example embodiment, for each category, a plurality of training documents is selected using Web search engines, the documents are winnowed to produce a more refined set of training documents, and a set of features that is highly differentiating for that category within a set of categories (a supercategory) is extracted. This set of training documents or differentiating features is used as input to a categorizer, which determines, for a plurality of test documents, the categories to which they best belong.
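The feature-extraction and categorization steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the scoring rule (a term's document frequency in its own category minus its highest document frequency in any sibling category) is an assumption chosen only to show how overlap among categories in one supercategory might be kept low. The function names `doc_frequency`, `differentiating_features`, and `categorize` are hypothetical, and the Web-search selection and winnowing steps are omitted.

```python
from collections import Counter


def tokenize(text):
    # Lowercase, whitespace-split tokenizer; keeps purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def doc_frequency(docs):
    # Fraction of documents in which each term appears at least once.
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return {term: n / len(docs) for term, n in counts.items()}


def differentiating_features(training, top_k=3):
    # training maps each category of one supercategory to its training documents.
    # Assumed score: own-category frequency minus the highest sibling frequency,
    # so terms shared across categories score near zero and are dropped.
    freqs = {cat: doc_frequency(docs) for cat, docs in training.items()}
    result = {}
    for cat, own in freqs.items():
        scores = {}
        for term, f in own.items():
            sibling = max(
                (freqs[o].get(term, 0.0) for o in freqs if o != cat),
                default=0.0,
            )
            scores[term] = f - sibling
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cat] = [t for t in ranked[:top_k] if scores[t] > 0]
    return result


def categorize(doc, features):
    # Assign the category whose feature set overlaps the document the most.
    tokens = set(tokenize(doc))
    return max(sorted(features), key=lambda c: len(tokens & set(features[c])))


training = {
    "cats": ["cats purr and meow", "cats chase mice and purr"],
    "dogs": ["dogs bark and fetch", "dogs chase balls and bark"],
}
features = differentiating_features(training)
print(features)
print(categorize("my dogs bark loudly", features))
```

In this toy data, words that occur in both categories, such as "and" and "chase", receive a score of zero and are excluded, which mirrors the abstract's goal of minimizing feature overlap among categories.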
Publication number: US8341159(B2)  Publication date: 2012.12.25
Application number: US20070734528  Filing date: 2007.04.12
Applicant: GATES STEPHEN C.;INTERNATIONAL BUSINESS MACHINES CORPORATION  Inventor: GATES STEPHEN C.
Classification: G06F7/00;G06F17/30  Main classification: G06F7/00
Agency:  Agent:
Main claim:
Address: