Title: Using simulated pseudo data to speed up statistical predictive modeling from massive data sets
Abstract: The computational cost of many statistical modeling algorithms is affected by the input/output (I/O) cost of accessing out-of-core training data. This is an important challenge for emerging data mining applications, where the amount of training data is potentially enormous. A heuristic approach to this problem is described. This approach is based on constructing a simple probability model from the large training data set, and using this model to generate simulated pseudo data for some aspects of the statistical modeling procedure. This approach is illustrated in the context of building a Naive Bayes probability model with feature selection. Here, the usual algorithms would require numerous data scans over the massive training data set, but our heuristic obtains models of comparable accuracy with just two data scans.
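The idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: the function names (scan_counts, sample_pseudo, select_features), the Laplace smoothing, the log-loss criterion, and the greedy forward search are all assumptions chosen for illustration. One pass over the real data collects Naive Bayes counts; feature selection then runs entirely on pseudo data sampled from that model, so no further scans of the large data set are needed for the search itself.

```python
import math
import random
from collections import defaultdict

def scan_counts(rows):
    """Single pass over (features, label) rows: collect Naive Bayes counts."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)   # (feature index, value, label) -> count
    values = defaultdict(set)        # feature index -> observed values
    for x, y in rows:
        class_counts[y] += 1
        for j, v in enumerate(x):
            feat_counts[(j, v, y)] += 1
            values[j].add(v)
    return class_counts, feat_counts, values

def cond_prob(model, j, v, y):
    """P(feature j = v | class y) with Laplace smoothing (an assumption)."""
    class_counts, feat_counts, values = model
    return (feat_counts.get((j, v, y), 0) + 1) / (class_counts[y] + len(values[j]))

def sample_pseudo(model, n, rng):
    """Draw n pseudo examples from the fitted Naive Bayes model."""
    class_counts, _, values = model
    labels = sorted(class_counts)
    weights = [class_counts[y] for y in labels]
    data = []
    for _ in range(n):
        y = rng.choices(labels, weights)[0]
        x = []
        for j in range(len(values)):
            vals = sorted(values[j])
            probs = [cond_prob(model, j, v, y) for v in vals]
            x.append(rng.choices(vals, probs)[0])
        data.append((tuple(x), y))
    return data

def log_loss(model, rows, subset):
    """Average negative log-likelihood of labels using only features in subset."""
    class_counts = model[0]
    total = sum(class_counts.values())
    loss = 0.0
    for x, y in rows:
        scores = {}
        for c in class_counts:
            s = math.log(class_counts[c] / total)
            for j in subset:
                s += math.log(cond_prob(model, j, x[j], c))
            scores[c] = s
        m = max(scores.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        loss -= scores[y] - log_z
    return loss / len(rows)

def select_features(model, pseudo, n_feats):
    """Greedy forward selection scored on pseudo data, not on the real data."""
    chosen, best = [], log_loss(model, pseudo, [])
    while True:
        cands = [(log_loss(model, pseudo, chosen + [j]), j)
                 for j in range(n_feats) if j not in chosen]
        if not cands:
            break
        loss, j = min(cands)
        if loss >= best:   # stop when no candidate improves the score
            break
        chosen.append(j)
        best = loss
    return chosen
```

In use, scan_counts would be fed a streaming iterator over the out-of-core training set, sample_pseudo would produce a pseudo sample small enough to hold in memory, and select_features would run on that sample. How the second scan mentioned in the abstract is spent is not shown here.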
Publication number: US6388592(B1) Publication date: 2002.05.14
Application number: US20010761589 Filing date: 2001.01.18
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventor: NATARAJAN RAMESH
Classification: H03M7/30;(IPC1-7):H03M7/00 Main classification: H03M7/30
Agency: Agent:
Main claim:
Address: