Title: Creating Taxonomies And Training Data For Document Categorization
Abstract: Methods, apparatus, and systems are provided to generate, from a set of training documents, a set of training data and a set of features for a taxonomy of categories. In the generated taxonomy, the degree of feature overlap among categories is minimized to optimize use with a machine-based categorizer, yet the categories still make sense to a human because a human makes the decisions regarding category definitions. In an example embodiment, for each category, a plurality of training documents is selected using Web search engines, the documents are winnowed to produce a more refined set of training documents, and a set of features that is highly differentiating for that category within a set of categories (a supercategory) is extracted. The set of training documents or differentiating features is then used as input to a categorizer, which determines, for a plurality of test documents, the plurality of categories to which they best belong.
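The feature-extraction and categorization steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say how "highly differentiating" is measured, so the frequency-ratio test below, the names `differentiating_features` and `categorize`, the `min_ratio` and `top_k` parameters, and the toy documents are all assumptions made for illustration, not the patented method. Selection of documents via Web search and the winnowing step are not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def differentiating_features(training_docs, min_ratio=2.0, top_k=5):
    """Pick, per category, terms far more frequent there than in any sibling.

    training_docs maps a category name to a list of document strings; all
    categories are treated as members of one supercategory.
    ASSUMPTION: a term is "differentiating" when its relative frequency is at
    least min_ratio times its frequency in every other category. With
    min_ratio > 1 no term can qualify for two categories, so the feature sets
    do not overlap.
    """
    freqs = {}
    for cat, docs in training_docs.items():
        counts = Counter(tok for doc in docs for tok in tokenize(doc))
        total = sum(counts.values()) or 1
        freqs[cat] = {term: n / total for term, n in counts.items()}

    features = {}
    for cat, cat_freqs in freqs.items():
        scored = []
        for term, p in cat_freqs.items():
            # Highest relative frequency of this term in any sibling category.
            rival = max(
                (freqs[other].get(term, 0.0) for other in freqs if other != cat),
                default=0.0,
            )
            if rival == 0.0 or p / rival >= min_ratio:
                scored.append((p, term))
        # Keep the top_k most frequent qualifying terms (ties broken by name).
        scored.sort(key=lambda item: (-item[0], item[1]))
        features[cat] = {term for _, term in scored[:top_k]}
    return features


def categorize(doc, features):
    """Score a test document by counting occurrences of each category's features."""
    counts = Counter(tokenize(doc))
    scores = {cat: sum(counts[t] for t in feats) for cat, feats in features.items()}
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical): two categories in one supercategory.
training_docs = {
    "jazz": ["jazz saxophone improvisation band", "jazz trumpet solo"],
    "opera": ["opera soprano aria band", "opera tenor aria"],
}
features = differentiating_features(training_docs)
best, scores = categorize("A soprano sang an aria", features)
print(features)
print(best, scores)
```

In this toy data the term "band" occurs equally often in both categories, so it is excluded from both feature sets, which is one simple way to keep feature overlap among categories low.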
Publication number: US2007185901(A1); Publication date: 2007.08.09
Application number: US20070734528; Application date: 2007.04.12
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION; Inventor: GATES STEPHEN C.
Classification: G06F7/00; Main classification: G06F7/00
Agency: ; Agent:
Main claim:
Address: