Title of invention: CONSISTENT FILTERING OF MACHINE LEARNING DATA
Abstract: Consistency metadata, including a parameter for a pseudo-random number source, are determined for training-and-evaluation iterations of a machine learning model. Using the metadata, a first training set comprising records of at least a first chunk is identified from a plurality of chunks of a data set. The first training set is used to train a machine learning model during a first training-and-evaluation iteration. A first test set comprising records of at least a second chunk is identified using the metadata, and is used to evaluate the model during the first training-and-evaluation iteration.
Publication number: US2015379425(A1) Publication date: 2015.12.31
Application number: US201414460314 Filing date: 2014.08.14
Applicant: Amazon Technologies, Inc. Inventors: DIRAC LEO PARKER; LI JIN; ZHENG TIANMING; ZHUO DONGHUI
Classification: G06N99/00 Main classification: G06N99/00
Agency: Agent:
Principal claim: 1. A system, comprising: one or more computing devices configured to:
generate consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular initialization parameter value for a pseudo-random number source;
sub-divide an address space of a particular data set of the machine learning model into a plurality of chunks, including a first chunk comprising a first plurality of observation records, and a second chunk comprising a second plurality of observation records;
retrieve, from one or more persistent storage devices, observation records of the first chunk into a memory of a first server, and observation records of the second chunk into a memory of a second server;
select, using a first set of pseudo-random numbers, a first training set from the plurality of chunks, wherein the first training set includes at least a portion of the first chunk, wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations, and wherein the first set of pseudo-random numbers is obtained using the consistency metadata; and
select, using a second set of pseudo-random numbers, a first test set from the plurality of chunks, wherein the first test set includes at least a portion of the second chunk, wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration, and wherein the second set of pseudo-random numbers is obtained using the consistency metadata.
Address: Reno, NV, US
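
The selection steps recited in the abstract and claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It only shows one way that a stored seed (standing in for the "initialization parameter value for a pseudo-random number source") could let the training and test selections be computed separately while remaining disjoint. The names `ConsistencyMetadata`, `split_into_chunks`, `chunk_order`, `select_training_chunks`, and `select_test_chunks`, the way the seed is combined with the iteration number, and the 25% test fraction are all assumptions made for illustration. Retrieval of chunks into the memory of different servers is not modeled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsistencyMetadata:
    # Hypothetical container; the claim only requires that the metadata
    # include an initialization parameter for a pseudo-random number source.
    seed: int


def split_into_chunks(num_records, chunk_size):
    """Sub-divide the record index space [0, num_records) into contiguous chunks."""
    return [
        range(start, min(start + chunk_size, num_records))
        for start in range(0, num_records, chunk_size)
    ]


def chunk_order(num_chunks, metadata, iteration):
    """Return a chunk permutation that depends only on the metadata and iteration."""
    # Assumption: a fresh generator is seeded per call, so any caller holding
    # the same metadata regenerates the same pseudo-random sequence.
    rng = random.Random(metadata.seed * 1_000_003 + iteration)
    order = list(range(num_chunks))
    rng.shuffle(order)
    return order


def _num_training_chunks(num_chunks, test_fraction):
    # Keep at least one chunk for the test set.
    return num_chunks - max(1, round(num_chunks * test_fraction))


def select_training_chunks(num_chunks, metadata, iteration, test_fraction=0.25):
    """Chunk ids used to train the model in this iteration."""
    order = chunk_order(num_chunks, metadata, iteration)
    return order[:_num_training_chunks(num_chunks, test_fraction)]


def select_test_chunks(num_chunks, metadata, iteration, test_fraction=0.25):
    """Chunk ids used to evaluate the model in this iteration."""
    # The permutation is regenerated from the metadata rather than shared in
    # memory, so the test set is the exact complement of the training set.
    order = chunk_order(num_chunks, metadata, iteration)
    return order[_num_training_chunks(num_chunks, test_fraction):]


if __name__ == "__main__":
    meta = ConsistencyMetadata(seed=42)
    chunks = split_into_chunks(num_records=1000, chunk_size=125)
    print("training chunks:", select_training_chunks(len(chunks), meta, iteration=0))
    print("test chunks:", select_test_chunks(len(chunks), meta, iteration=0))
```

Because both selection functions rebuild the permutation from the same metadata, they can run at different times or on different machines without a record appearing in both sets. That property is the sense of "consistent" this sketch aims to convey.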