Title of invention |
A SYSTEM FOR ESTIMATING A DISTRIBUTION OF MESSAGE CONTENT CATEGORIES IN SOURCE DATA |
Abstract |
<p>A method of computerized content analysis that gives "approximately unbiased and statistically consistent estimates" of the distribution of elements of structured, unstructured, and partially structured source data among a set of categories. In one embodiment, this is done by analyzing the distribution of a small set of individually classified elements across a plurality of categories and then using the information determined from that analysis to extrapolate the distribution in a larger population set. This extrapolation is performed without constraining the distribution of the unlabeled elements to equal the distribution of the labeled elements, and without constraining the content distribution of elements in the labeled set (e.g., the distribution of words used by elements in the labeled set) to equal the content distribution of elements in the unlabeled set. Not being constrained in these ways allows the estimation techniques described herein to provide distinct advantages over conventional aggregation techniques.</p> |
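The abstract does not spell out the estimator, so the following is only a minimal sketch of one way such an extrapolation could work, under an assumption that is not stated in the abstract: that the probability of a content feature (e.g., a word) appearing given a category is the same in the labeled and unlabeled sets. Under that assumption, the feature frequencies in the unlabeled set are a mixture of the per-category frequencies, and the mixing weights are the category proportions being sought. The function name `estimate_category_proportions`, the binary feature encoding, and the clip-and-renormalize step are all hypothetical choices made for illustration.

```python
import numpy as np


def estimate_category_proportions(X_labeled, y_labeled, X_unlabeled, n_categories):
    """Estimate category proportions in the unlabeled set (illustrative sketch).

    X_labeled, X_unlabeled: 0/1 arrays of shape (n_documents, n_features).
    y_labeled: integer category labels in range(n_categories).
    Assumes every category has at least one labeled document.
    """
    X_labeled = np.asarray(X_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled)
    X_unlabeled = np.asarray(X_unlabeled, dtype=float)

    # P(feature present | category), estimated from the labeled set only.
    # Shape: (n_features, n_categories).
    cond = np.column_stack(
        [X_labeled[y_labeled == j].mean(axis=0) for j in range(n_categories)]
    )

    # P(feature present) in the unlabeled set; no labels are needed for this.
    marg = X_unlabeled.mean(axis=0)

    # Solve marg ~= cond @ p in the least-squares sense, with an extra
    # weighted row that pushes the proportions to sum to one.
    weight = 10.0  # assumed weight, chosen arbitrarily for this sketch
    A = np.vstack([cond, weight * np.ones((1, n_categories))])
    b = np.concatenate([marg, [weight]])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Crude projection onto valid proportions: clip negatives, renormalize.
    p = np.clip(p, 0.0, None)
    return p / p.sum()


# Toy data: two categories, two binary word features.
block_0 = [[1, 0], [1, 0], [1, 1], [0, 0]]  # documents of category 0
block_1 = [[0, 1], [0, 1], [1, 1], [0, 0]]  # documents of category 1

# Labeled set is split 50/50 between the categories.
X_lab = np.array(block_0 + block_1)
y_lab = np.array([0] * 4 + [1] * 4)

# Unlabeled set is 75% category 0 and 25% category 1 by construction.
X_unl = np.array(block_0 * 3 + block_1)

estimate = estimate_category_proportions(X_lab, y_lab, X_unl, n_categories=2)
print(estimate)
```

Note that the labeled set here has a 50/50 split while the unlabeled set does not, and the overall word frequencies also differ between the two sets. The sketch only relies on the per-category word frequencies being shared, which is why it can recover proportions that differ from those of the labeled sample.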
Application publication number |
WO2008115519(A1) |
Application publication date |
2008.09.25 |
Application number |
WO2008US03606 |
Filing date |
2008.03.19 |
Applicant |
PRESIDENT AND FELLOWS OF HARVARD COLLEGE;KING, GARY;HOPKINS, DANIEL;LU, YING |
Inventor |
KING, GARY;HOPKINS, DANIEL;LU, YING |
Classification number |
G06F19/00 |
Main classification number |
G06F19/00 |
Agency |
|
Agent |
|
Principal claim |
|
Address |
|