Independent claim |
1. A method of training a Q network used to select actions to be performed by an agent that interacts with an environment by receiving observations characterizing states of the environment and performing actions from a set of actions in response to the observations,
wherein the Q network is a deep neural network that is configured to receive as input an input observation and an input action and to generate an estimated future cumulative reward from the input in accordance with a set of parameters, and wherein the method comprises:
obtaining a plurality of experience tuples, wherein each experience tuple includes a training observation, an action performed by the agent in response to receiving the training observation, a reward received in response to the agent performing the action, and a next training observation that characterizes a next state of the environment; and
training the Q network on each of the experience tuples, comprising, for each experience tuple:
processing the training observation in the experience tuple and the action in the experience tuple using the Q network to determine a current estimated future cumulative reward for the experience tuple in accordance with current values of the parameters of the Q network;
selecting an action from the set of actions that, when processed in combination with the next observation by the Q network, results in the Q network generating a highest estimated future cumulative reward;
processing the next observation in the experience tuple and the selected action using a target Q network to determine a next target estimated future cumulative reward for the selected action in accordance with current values of the parameters of the target Q network, wherein the target Q network is identical to the Q network but the current values of the parameters of the target Q network are different from the current values of the parameters of the Q network;
determining an error for the experience tuple from the reward in the experience tuple, the next target estimated future cumulative reward for the selected action, and the current estimated future cumulative reward; and
using the error for the experience tuple to update the current values of the parameters of the Q network. |
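The per-tuple training steps recited in the claim (a Double DQN style update: the Q network selects the next action, the target Q network evaluates it) can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: it assumes a linear parameterization `q(obs, a) = weights[a] @ obs` in place of the deep neural network the claim specifies, a discount factor `gamma`, and a hypothetical function name `double_dqn_error`.

```python
import numpy as np

def double_dqn_error(q_weights, target_weights, obs, action, reward,
                     next_obs, gamma=0.99):
    """Compute the error for one experience tuple, per the claimed steps.

    q_weights / target_weights: (num_actions, obs_dim) arrays standing in
    for the Q network and the target Q network; the estimated future
    cumulative reward for (obs, a) is taken to be weights[a] @ obs.
    (Linear Q functions are purely illustrative; the claim recites a
    deep neural network.)
    """
    # Step 1: current estimated future cumulative reward for the stored
    # (training observation, action) pair, using the Q network.
    current_q = q_weights[action] @ obs
    # Step 2: select the action that the Q network itself scores highest
    # when processed in combination with the next observation.
    selected = int(np.argmax(q_weights @ next_obs))
    # Step 3: evaluate that selected action with the TARGET Q network
    # (same architecture, different parameter values).
    next_target_q = target_weights[selected] @ next_obs
    # Step 4: error from the reward, the next target estimate, and the
    # current estimate; a gradient step on this error would update the
    # Q network's parameters (step 5, omitted here).
    return (reward + gamma * next_target_q) - current_q
```

Decoupling action selection (Q network) from action evaluation (target Q network) is what distinguishes this update from standard Q-learning, where a single network both selects and evaluates the maximizing action.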