Abstract
The present invention is a method and an apparatus for reward-based learning of policies for managing or controlling a system or plant. In one embodiment, a method for reward-based learning includes receiving a set of one or more exemplars, where at least two of the exemplars comprise a (state, action) pair for a system, and at least one of the exemplars includes an immediate reward responsive to a (state, action) pair. A distance metric and a distance-based function approximator estimating long-range expected value are then initialized, where the distance metric computes a distance between two (state, action) pairs. The distance metric and function approximator are adjusted such that a Bellman error measure of the function approximator on the set of exemplars is minimized. A management policy is then derived from the trained distance metric and function approximator.
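For concreteness, the following is a minimal sketch of one common form of the Bellman error measure referenced above. The abstract does not fix the target form; this sketch assumes a standard Q-learning-style target with a discount factor `gamma` and a finite set of candidate next actions, and all names (`value_fn`, `transitions`, `next_actions`) are illustrative rather than taken from the specification.

```python
def bellman_error(value_fn, transitions, gamma=0.9):
    """Sum of squared Bellman residuals over an exemplar set.

    value_fn(s, a) -- the current distance-based function approximator
    transitions    -- iterable of (s, a, r, s_next, next_actions) tuples,
                      where next_actions are candidate actions in s_next
    gamma          -- assumed discount factor (not specified in the abstract)
    """
    total = 0.0
    for s, a, r, s_next, next_actions in transitions:
        # Target: immediate reward plus discounted best value at the next state.
        target = r + gamma * max(value_fn(s_next, a2) for a2 in next_actions)
        total += (value_fn(s, a) - target) ** 2
    return total
```

Minimizing this quantity over the exemplar set is what drives the joint adjustment of the distance metric and the function approximator described above.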
Claims
1. A method for learning a policy for managing a system, comprising:
receiving a set of exemplars, where each exemplar in the set of exemplars comprises a (state, action) pair for the system and an immediate reward value responsive to the (state, action) pair, wherein the state of the (state, action) pair comprises a measurement of a regulated quantity of the system, the action of the (state, action) pair comprises an adjustment to an element of the system that affects a future evolution of a state of the system, and the immediate reward value comprises a difference between the state of the (state, action) pair and a target value of the state of the (state, action) pair such that a value of the immediate reward value is inversely proportional to a size of the difference;

initializing a distance metric as $D(\overrightarrow{(s,a)}, \overrightarrow{(s',a')})$, where the distance metric is a global function that computes a distance between general pairs of exemplars $\overrightarrow{(s,a)}$ and $\overrightarrow{(s',a')}$ in the set of exemplars;

initializing a function approximator that estimates a value of performing a given action in a given state, wherein the function approximator is denoted by $F(\overrightarrow{(s,a)})$ and is computed as

$$F(\overrightarrow{(s,a)}) = \frac{\sum_{j=1}^{S} w_j(\overrightarrow{(s,a)})\, Q_j}{\Omega(\overrightarrow{(s,a)})},$$

wherein $S$ denotes a number of exemplars in the set of exemplars, $j$ denotes a $j$th exemplar in the set of exemplars, $Q_j$ denotes a target long-range expected value for the $j$th exemplar, $\Omega(\overrightarrow{(s,a)})$ is computed as $\sum_{j=1}^{S} w_j(\overrightarrow{(s,a)})$, $w_j(\overrightarrow{(s,a)})$ is computed as $\exp(-d_j(\overrightarrow{(s,a)}))$, and $d_j(\overrightarrow{(s,a)})$ is computed as $D^2(\overrightarrow{(s,a)}, \overrightarrow{(s_j,a_j)})$;

performing one or more training sweeps through one or more batches, each batch comprising at least a portion of the set of exemplars;

terminating the one or more training sweeps when a termination criterion is reached, wherein a trained distance metric and a trained function approximator are obtained simultaneously upon the terminating, the trained distance metric being a last-adjusted state of the distance metric and the trained function approximator being a last-adjusted state of the function approximator; and

deriving the policy from the trained distance metric and the trained function approximator,

wherein at least one of: the receiving, the initializing the distance metric, the initializing the function approximator, the performing, or the deriving is performed using a processor.
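The following is a minimal sketch of the kernel-weighted function approximator defined in claim 1. The claim only requires that $D$ be a global distance function; this sketch assumes, for illustration, a diagonal parameterization with learnable per-dimension scales, and names such as `squared_distance`, `value_estimate`, and `scales` are hypothetical rather than drawn from the specification.

```python
import numpy as np

def squared_distance(x, y, scales):
    """d_j form: D^2(x, y) for an assumed diagonal metric with per-dimension scales."""
    diff = (np.asarray(x, float) - np.asarray(y, float)) * scales
    return float(np.dot(diff, diff))

def value_estimate(x, exemplars, Q, scales):
    """F(x) = sum_j w_j(x) * Q_j / Omega(x), with
       d_j(x) = D^2(x, x_j), w_j(x) = exp(-d_j(x)), Omega(x) = sum_j w_j(x)."""
    d = np.array([squared_distance(x, xj, scales) for xj in exemplars])
    w = np.exp(-d)                       # w_j(x)
    omega = w.sum()                      # Omega(x)
    return float(np.dot(w, Q) / omega)   # F(x)

# Toy usage: three (state, action) exemplar vectors with target values Q_j.
exemplars = [np.array([0.0, 1.0]), np.array([0.5, 0.0]), np.array([1.0, 1.0])]
Q = np.array([0.2, 0.8, 0.5])
scales = np.ones(2)  # metric parameters that the training sweeps would adjust
print(value_estimate(np.array([0.4, 0.6]), exemplars, Q, scales))
```

Because the weights $w_j$ decay exponentially with squared distance under the learned metric, $F$ behaves as a normalized kernel regression over the exemplar set: adjusting the metric during the training sweeps reshapes which exemplars dominate each estimate.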