Title of Invention |
GENERAL FRAMEWORK FOR CROSS-VALIDATION OF MACHINE LEARNING ALGORITHMS USING SQL ON DISTRIBUTED SYSTEMS |
Abstract |
A general framework for cross-validation of any supervised learning algorithm on a distributed database comprises a multi-layer software architecture that implements training, prediction and metric functions in a C++ layer and iterates processing of different subsets of a data set with a plurality of different models in a Python layer. The best model is determined to be the one with the smallest average prediction error across all database segments. |
Publication Number |
US2016092794(A1) |
Publication Date |
2016.03.31 |
Application Number |
US201514963061 |
Filing Date |
2015.12.08 |
Applicant |
EMC Corporation |
Inventors |
Qian Hai;Iyer Rahul;Yang Shengwen;Welton Caleb E. |
Classification |
G06N99/00 |
Main Classification |
G06N99/00 |
Agency |
|
Agent |
|
Principal Claim |
1. A method of cross-validation of a supervised machine learning algorithm within a distributed database having a plurality of database segments in which data are stored, comprising:
partitioning a data set within said database into a training subset and a validation subset, wherein the partitioning data set comprises partitioning the data set according to randomly sorted data to create two data subsets that are independent and statistically equivalent;
determining coefficients of a first model of said supervised machine learning algorithm using the training subset;
predicting a value of a data element in said validation subset using said first model;
determining a prediction error based at least in part on a difference between said predicted value and the actual value of said data element;
successively repeating said partitioning k times to form k different partitions, wherein at least a subset of the k different partitions have different training and validation subsets;
determining corresponding k prediction errors based at least in part on iteratively determining the coefficients, predicting the value of the data element, and determining the prediction error for each of said k partitions; and
evaluating the performance of said first model using said k prediction errors. |
Address |
Hopkinton MA US |
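
The steps recited in claim 1, together with the model selection described in the abstract, can be illustrated with a minimal single-process sketch. It is not the patented implementation: the record describes SQL running on a distributed database with C++ and Python layers, whereas this sketch works on an in-memory list of (x, y) pairs. All names here (`fit_line`, `fit_mean`, `predict`, `squared_error`, `cross_validate`, `best_model`), the 80/20 split, and the two toy models are assumptions made for illustration.

```python
import random


def fit_line(rows):
    """Hypothetical training function: least-squares fit of y = a + b*x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    b = sxy / sxx if sxx else 0.0
    return mean_y - b * mean_x, b


def fit_mean(rows):
    """Hypothetical baseline model: always predict the mean of y."""
    return sum(y for _, y in rows) / len(rows), 0.0


def predict(coef, x):
    """Prediction function shared by both toy models."""
    a, b = coef
    return a + b * x


def squared_error(predicted, actual):
    """Metric function: squared difference between prediction and truth."""
    return (predicted - actual) ** 2


def cross_validate(rows, train, predict_fn, metric, k=5,
                   train_fraction=0.8, seed=0):
    """Repeat a random train/validation partition k times.

    Returns one mean validation error per partition (k values in total).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(k):
        shuffled = rows[:]
        rng.shuffle(shuffled)              # randomly sort the data
        cut = int(len(shuffled) * train_fraction)
        training, validation = shuffled[:cut], shuffled[cut:]
        coef = train(training)             # determine model coefficients
        fold_error = sum(
            metric(predict_fn(coef, x), y) for x, y in validation
        ) / len(validation)
        errors.append(fold_error)
    return errors


def best_model(rows, candidates, k=5):
    """Pick the candidate with the smallest average prediction error."""
    scores = {}
    for name, train in candidates.items():
        errs = cross_validate(rows, train, predict, squared_error, k=k)
        scores[name] = sum(errs) / len(errs)
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
    name, scores = best_model(data, {"line": fit_line, "mean": fit_mean})
    print(name, scores)
```

Passing the training, prediction, and metric functions as arguments mirrors the framework's separation of those three functions from the iteration logic, so any supervised learner exposing the same three pieces could be plugged in without changing `cross_validate`.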