Title of invention: Systems and methods for calculating category proportions
Abstract: Systems and methods are provided for classifying text based on language using one or more computer servers and storage devices. A computer-implemented method includes receiving a training set of elements, each element in the training set being assigned to one of a plurality of categories and having one of a plurality of content profiles associated therewith; receiving a population set of elements, each element in the population set having one of the plurality of content profiles associated therewith; and calculating using at least one of a stacked regression algorithm, a bias formula algorithm, a noise elimination algorithm, and an ensemble method consisting of a plurality of algorithmic methods the results of which are averaged, based on the content profiles associated with and the categories assigned to elements in the training set and the content profiles associated with the elements of the population set, a distribution of elements of the population set over the categories.
Publication number: US9483544(B2) Publication date: 2016.11.01
Application number: US201313804096 Filing date: 2013.03.14
Applicant: Crimson Hexagon, Inc. Inventors: Firat Aykut; Brooks Mitchell; Bingham Christopher; Herdagdelen Amac; King Gary
Classification: G06F17/30; G06Q50/00 Main classification: G06F17/30
Agency: Nutter McClennen & Fish LLP Agent: Nutter McClennen & Fish LLP
Main claim: 1. A computer-implemented method for categorizing digital documents, containing digital content, in aggregate, the method performed by a computer processor and comprising: (a) receiving by the computer processor a training set of digital documents each containing digital content, each digital document in the training set being assigned to one of a plurality of categories and being associated with one of a plurality of content profiles, each content profile representing existence or absence of one or more features in the digital content of the digital document; (b) receiving by the computer processor a population set of digital documents each containing digital content, each digital document in the population set having one of the plurality of content profiles associated with the digital content contained therein; (c) organizing the digital documents of the training set and the digital documents of the population set into a matrix using the plurality of content profiles, the matrix having rows corresponding to each of the digital documents and cells indicating existence or absence of the one or more features in the digital content of the digital document; (d) determining a weight for each row of the matrix using an estimated total variance for that row of the matrix; (e) determining, by the computer processor applying a stacked regression coupled with weighted regression to the matrix, the weighted regression using the weights determined for the rows of the matrix, a proportion of the digital documents in the population set belonging to each category of the plurality of categories; (f) determining one or more category proportions of the digital documents, each including the portion of the digital documents belonging to each category; and (g) categorizing the digital documents by labeling the digital document based on the category corresponding to the proportion to which the digital document belongs.
Address: Boston MA US
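
To illustrate steps (c) through (f) of the main claim, the following is a minimal Python sketch for the two-category case. It is not the patented implementation. The function names (`profile_freqs`, `estimate_proportion`), the use of feature subsets to produce the stacked regression rows, and the inverse-binomial-variance weights are all assumptions made for this example; the claim itself says only that each row's weight is derived from an estimated total variance. The sketch relies on the identity that the population's profile frequencies are a mixture of the per-category profile frequencies observed in the training set, and solves for the mixing proportion by weighted least squares.

```python
from collections import Counter
from itertools import combinations


def profile_freqs(docs, subset):
    """Relative frequency of each 0/1 profile, restricted to the features in `subset`."""
    counts = Counter(tuple(doc[i] for i in subset) for doc in docs)
    n = len(docs)
    return {profile: c / n for profile, c in counts.items()}


def estimate_proportion(train, population, subset_size=2):
    """Estimate the share of category 0 in `population` (two categories only).

    train: list of (features, label) pairs, label in {0, 1}, features a 0/1 tuple.
    population: list of 0/1 feature tuples with no labels.
    """
    n_features = len(population[0])
    n_pop = len(population)
    by_cat = {c: [f for f, label in train if label == c] for c in (0, 1)}
    num = den = 0.0
    # Hypothetical stacking: rows from every feature subset feed one weighted fit.
    for subset in combinations(range(n_features), subset_size):
        y = profile_freqs(population, subset)
        x0 = profile_freqs(by_cat[0], subset)
        x1 = profile_freqs(by_cat[1], subset)
        for profile in set(y) | set(x0) | set(x1):
            ys = y.get(profile, 0.0)
            x1s = x1.get(profile, 0.0)
            d = x0.get(profile, 0.0) - x1s
            # Assumed weight: inverse of the binomial variance of the observed
            # population frequency, plus a small constant to avoid division by zero.
            w = 1.0 / (ys * (1.0 - ys) / n_pop + 1e-6)
            # Model: ys - x1s = p0 * d, solved for p0 by weighted least squares.
            num += w * d * (ys - x1s)
            den += w * d * d
    if den == 0.0:
        raise ValueError("profiles do not distinguish the two categories")
    # Clip to a valid proportion.
    return min(1.0, max(0.0, num / den))


train = [((1, 1, 0), 0), ((1, 0, 0), 0), ((0, 0, 1), 1), ((0, 1, 1), 1)]
population = [(1, 1, 0), (1, 0, 0)] * 3 + [(0, 0, 1), (0, 1, 1)]
share_cat0 = estimate_proportion(train, population)
```

In this toy input the population is built as a 75/25 mixture of the two training categories, so `share_cat0` comes out at approximately 0.75 and the share of category 1 is its complement. Extending the sketch to more than two categories would require a constrained multivariate regression, which is omitted here.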