Title: Non-linear classification of text samples
Abstract: Non-linear classifiers and dimension reduction techniques may be applied to text classification. Non-linear classifiers such as random forest, Nyström/Fisher, and others, may be used to determine criteria usable to classify text into one of a plurality of categories. Dimension reduction techniques may also be used to reduce feature space size. Machine learning techniques may be used to develop criteria (e.g., trained models) that can be used to automatically classify text. Automatic classification rates may be improved, resulting in fewer text samples being unclassifiable or being incorrectly classified. User-generated content may be classified, in some embodiments.
Publication number: US9342794(B2) Publication date: 2016.05.17
Application number: US201313837803 Filing date: 2013.03.15
Applicant: Bazaarvoice, Inc. Inventors: Mahler Daniel; Scott Eric D.; Curcic Milos; Allen Eric
Classification codes: G06F15/18; G06N99/00; G06F17/22; G06N5/02; G06N7/00 Main classification: G06F15/18
Agency: Meyertons, Hood, Kivlin, Kowert & Goetzel, P.C. Agent: Meyertons, Hood, Kivlin, Kowert & Goetzel, P.C.
Main claim: 1. A method, comprising: a computer system processing each text sample in a training set, wherein each text sample in the training set corresponds to one of a plurality of classifications, and wherein processing each text sample includes: generating a respective set of features from that text sample; populating an entry in a data structure based on results of one or more dimension reduction operations performed on ones of the respective set of features for that text sample; the computer system using a non-linear classifier on entries in the data structure to establish criteria usable to classify an unknown text sample into one of the plurality of classifications; wherein each text sample in the training set is made up of characters within a text character set having C characters, wherein each feature in the respective set of features for a given text sample in the training set has N characters, and wherein populating an entry in the data structure for the given text sample includes a dimension reduction operation on individual ones of the respective set of features such that a corresponding value for each of those features is reduced from C^N possible values to not greater than C^(N-1) possible values; wherein N is an integer greater than or equal to 3 and C is an integer greater than or equal to 20.
Address: Austin TX US
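
Illustrative sketch (not part of the patent record): the feature-generation and dimension-reduction steps of the main claim can be pictured with a short Python example. It assumes that features are character N-grams and that the reduction is a hash into a fixed number of buckets; the claim itself only requires that each feature's value space shrink from C^N to at most C^(N-1), so hashing is one possible choice, not the claimed method. The function names, the CRC32 hash, and the 27-character set (lowercase letters plus space) are all hypothetical. With C = 27 and N = 3, each trigram has 27^3 = 19683 possible values before reduction and at most 27^2 = 729 after.

```python
import zlib


def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def reduce_feature(ngram, num_buckets):
    """Map an n-gram to a bucket index in [0, num_buckets).

    CRC32 is an assumed, stable hash; the claim does not name one.
    """
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


def populate_entry(text, n=3, charset_size=27):
    """Build one sparse row (bucket index -> count) for a text sample.

    charset_size is C and n is N; the bucket count C**(N-1) is the
    upper bound stated in the claim.
    """
    num_buckets = charset_size ** (n - 1)
    entry = {}
    for gram in char_ngrams(text.lower(), n):
        bucket = reduce_feature(gram, num_buckets)
        entry[bucket] = entry.get(bucket, 0) + 1
    return entry


# One row per training sample; a labeled collection of such rows is what a
# non-linear classifier (for example a random forest) would be trained on.
row = populate_entry("great product")
```

Distinct N-grams can collide in the same bucket, which is the price of the smaller feature space. The classifier-training step of the claim is not shown here.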