Title of invention A SYSTEM FOR ESTIMATING A DISTRIBUTION OF MESSAGE CONTENT CATEGORIES IN SOURCE DATA
Abstract A method of computerized content analysis that gives "approximately unbiased and statistically consistent estimates" of the distribution of elements of structured, unstructured, and partially structured source data among a set of categories. In one embodiment, this is done by analyzing the distribution of a small set of individually classified elements across a plurality of categories and then using the information determined from that analysis to extrapolate the distribution in a larger population set. This extrapolation is performed without constraining the distribution of the unlabeled elements to equal the distribution of the labeled elements, and without constraining the content distribution of elements in the labeled set (e.g., the distribution of words used by elements in the labeled set) to equal the content distribution of elements in the unlabeled set. Not being constrained in these ways allows the estimation techniques described herein to provide distinct advantages over conventional aggregation techniques.
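The extrapolation step in the abstract can be illustrated with a minimal sketch for the two-category case. This is not the claimed method. It is one possible reading, under the stated assumption that the word-presence profiles within each category are distributed the same way in the labeled and unlabeled sets. The names `profile_freqs`, `estimate_share`, `vocab`, and `positive` are hypothetical and chosen only for illustration. The sketch solves for the category share that best explains the unlabeled profile frequencies by least squares, without classifying any individual unlabeled element.

```python
from collections import Counter


def profile_freqs(docs, vocab):
    """Relative frequency of each word-presence profile (a tuple of 0/1 flags)."""
    if not docs:
        raise ValueError("need at least one document")
    counts = Counter(tuple(int(w in d.split()) for w in vocab) for d in docs)
    return {profile: c / len(docs) for profile, c in counts.items()}


def estimate_share(labeled, unlabeled, vocab, positive):
    """Estimate the share of category `positive` among the unlabeled documents.

    Assumption (illustrative): profiles within a category are distributed
    alike in both sets, so
        P_unlabeled(S) = share * P(S | positive) + (1 - share) * P(S | other).
    The share is the least-squares solution of that equation over all
    observed profiles S, clipped to the interval [0, 1].
    """
    pos = [d for d, y in labeled if y == positive]
    neg = [d for d, y in labeled if y != positive]
    p1 = profile_freqs(pos, vocab)        # P(S | positive), from labeled set
    p0 = profile_freqs(neg, vocab)        # P(S | other), from labeled set
    pu = profile_freqs(unlabeled, vocab)  # P(S) in the unlabeled set

    num = den = 0.0
    for s in set(p1) | set(p0) | set(pu):
        diff = p1.get(s, 0.0) - p0.get(s, 0.0)
        num += (pu.get(s, 0.0) - p0.get(s, 0.0)) * diff
        den += diff * diff
    if den == 0.0:
        raise ValueError("profiles do not distinguish the two categories")
    return min(1.0, max(0.0, num / den))


# Hypothetical toy data: the labeled set is split evenly between categories,
# while the unlabeled set is not.
labeled = [("good great", "pos"), ("good", "pos"),
           ("bad", "neg"), ("bad awful", "neg")]
unlabeled = ["good", "good", "good", "bad"]
share = estimate_share(labeled, unlabeled, ["good", "bad"], "pos")
```

In this toy data the labeled set is half `pos`, yet the estimate for the unlabeled set is free to differ from that, which is the property the abstract emphasizes.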
Publication number WO2008115519(A1) Publication date 2008.09.25
Application number WO2008US03606 Filing date 2008.03.19
Applicant PRESIDENT AND FELLOWS OF HARVARD COLLEGE;KING, GARY;HOPKINS, DANIEL;LU, YING Inventor KING, GARY;HOPKINS, DANIEL;LU, YING
Classification G06F19/00 Main classification G06F19/00
Agency Agent
Principal claim
Address