Invention title: Method and apparatus for improved reward-based learning using adaptive distance metrics
Abstract: The present invention is a method and an apparatus for reward-based learning of policies for managing or controlling a system or plant. In one embodiment, a method for reward-based learning includes receiving a set of one or more exemplars, where at least two of the exemplars comprise a (state, action) pair for a system, and at least one of the exemplars includes an immediate reward responsive to a (state, action) pair. A distance metric and a distance-based function approximator estimating long-range expected value are then initialized, where the distance metric computes a distance between two (state, action) pairs. The distance metric and the function approximator are adjusted such that a Bellman error measure of the function approximator on the set of exemplars is minimized. A management policy is then derived from the trained distance metric and the trained function approximator.
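As an illustration of the training criterion described in the abstract, the sketch below computes a squared Bellman error over a batch of exemplars. It is a minimal sketch, not the patented implementation: the discount factor gamma, the one-step bootstrapped target r + gamma * max_a' F(s', a'), the finite candidate-action set, and all names are assumptions introduced for the example.

```python
def bellman_error(F, exemplars, actions, gamma=0.9):
    """Mean squared Bellman error of a value estimator F over exemplars.

    F(s, a) returns the estimated long-range value of taking action a in state s.
    Each exemplar is a (state, action, reward, next_state) tuple; gamma, the
    one-step backup target, and the finite action set are illustrative assumptions.
    """
    total = 0.0
    for s, a, r, s_next in exemplars:
        # Bootstrapped one-step target: immediate reward plus discounted best next value.
        target = r + gamma * max(F(s_next, a_next) for a_next in actions)
        total += (F(s, a) - target) ** 2
    return total / len(exemplars)
```

Minimizing this quantity jointly over the parameters of the distance metric and the function approximator is the adjustment step described above.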
Publication number: US9298172 (B2)    Publication date: 2016.03.29
Application number: US200711870661    Filing date: 2007.10.11
Applicant: International Business Machines Corporation    Inventors: Tesauro Gerald J.; Weinberger Kilian Q.
Classification: G06F15/18; G05B13/02; G06K9/62; G06N5/02    Main classification: G06F15/18
Agency:    Agent: Percello Louis
Principal claim: 1. A method for learning a policy for managing a system, comprising:
receiving a set of exemplars, where each exemplar in the set of exemplars comprises a (state, action) pair for the system and an immediate reward value responsive to the (state, action) pair, wherein the state of the (state, action) pair comprises a measurement of a regulated quantity of the system, the action of the (state, action) pair comprises an adjustment to an element of the system that affects a future evolution of a state of the system, and the immediate reward value comprises a difference between the state of the (state, action) pair and a target value of the state of the (state, action) pair such that the immediate reward value is inversely proportional to the size of the difference;
initializing a distance metric D((s, a), (s′, a′)), where the distance metric is a global function that computes a distance between general pairs of exemplars (s, a) and (s′, a′) in the set of exemplars, each represented as a vector;
initializing a function approximator that estimates a value of performing a given action in a given state, wherein the function approximator is denoted by F((s, a)) and is computed as
F((s, a)) = [ Σ_{j=1}^{S} w_j((s, a)) · Q_j ] / Ω((s, a)),
wherein S denotes the number of exemplars in the set of exemplars, j denotes the jth exemplar in the set of exemplars, Q_j denotes a target long-range expected value for the jth exemplar, Ω((s, a)) is computed as Σ_{j=1}^{S} w_j((s, a)), w_j((s, a)) is computed as exp(−d_j((s, a))), and d_j((s, a)) is computed as D²((s, a), (s_j, a_j));
performing one or more training sweeps through one or more batches, each batch comprising at least a portion of the set of exemplars;
terminating the one or more training sweeps when a termination criterion is reached, wherein a trained distance metric and a trained function approximator are obtained simultaneously upon the terminating, the trained distance metric being the last-adjusted state of the distance metric and the trained function approximator being the last-adjusted state of the function approximator; and
deriving the policy from the trained distance metric and the trained function approximator,
wherein at least one of: the receiving, the initializing the distance metric, the initializing the function approximator, the performing, or the deriving is performed using a processor.
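As a rough illustration of the claimed function approximator (not the patented implementation), the Python sketch below evaluates F((s, a)) as a normalized, exponentially kernel-weighted sum of the exemplar targets Q_j. The Mahalanobis-style parameterization of the squared distance D² by a matrix A, along with all variable names and the example numbers, are assumptions introduced for this sketch; the claim only requires a global adjustable metric over exemplar pairs.

```python
import numpy as np

def make_sq_distance(A):
    """Hypothetical adjustable metric: D^2(x, y) = (x - y)^T A (x - y)."""
    def sq_dist(x, y):
        diff = np.asarray(x) - np.asarray(y)
        return float(diff @ A @ diff)
    return sq_dist

def approximate_value(sa, exemplar_sas, Q, sq_dist):
    """F((s,a)) = sum_j w_j Q_j / Omega, with w_j = exp(-d_j), d_j = D^2((s,a),(s_j,a_j))."""
    d = np.array([sq_dist(sa, sa_j) for sa_j in exemplar_sas])  # d_j per exemplar
    w = np.exp(-d)                                               # kernel weights w_j
    omega = w.sum()                                              # normalizer Omega
    return float(w @ np.asarray(Q)) / omega

# Illustrative usage with made-up (state, action) feature vectors and targets:
A = np.eye(3)                                   # start from a Euclidean metric
sq_dist = make_sq_distance(A)
exemplar_sas = [np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
Q = [2.0, -1.0]                                 # target long-range values Q_j
print(approximate_value(np.array([0.5, 0.5, 0.5]), exemplar_sas, Q, sq_dist))
```

Because both the entries of A and the targets Q_j enter F differentiably, a training sweep can adjust the metric and the approximator together against the Bellman error measure, which is how the trained metric and trained approximator are obtained simultaneously in the claim.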
Address: Armonk, NY, US