Title of Invention: TRAINING REINFORCEMENT LEARNING NEURAL NETWORKS
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a Q network used to select actions to be performed by an agent interacting with an environment. One of the methods includes obtaining a plurality of experience tuples and training the Q network on each of the experience tuples using the Q network and a target Q network that is identical to the Q network but with the current values of the parameters of the target Q network being different from the current values of the parameters of the Q network.
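Stated as an equation (a hedged reconstruction of the update the abstract summarizes; the symbols s, a, r, s′, the discount factor γ, and the parameter vectors θ, θ⁻ are illustrative notation and are not named in this record), the error compares the current estimate with a target built by selecting the next action with the Q network and evaluating it with the target Q network:

$$
\delta \;=\; r \;+\; \gamma \, Q\!\left(s',\ \operatorname*{arg\,max}_{a'} Q(s', a';\, \theta);\ \theta^{-}\right) \;-\; Q(s, a;\, \theta)
$$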
Publication Number: US2017076201(A1)  Publication Date: 2017.03.16
Application Number: US201615261579  Filing Date: 2016.09.09
Applicant: Google Inc.  Inventors: van Hasselt Hado Philip; Guez Arthur Clément
Classification: G06N3/08  Primary Classification: G06N3/08
Agency:  Attorney:
Main Claim: 1. A method of training a Q network used to select actions to be performed by an agent that interacts with an environment by receiving observations characterizing states of the environment and performing actions from a set of actions in response to the observations, wherein the Q network is a deep neural network that is configured to receive as input an input observation and an input action and to generate an estimated future cumulative reward from the input in accordance with a set of parameters, and wherein the method comprises:
obtaining a plurality of experience tuples, wherein each experience tuple includes a training observation, an action performed by the agent in response to receiving the training observation, a reward received in response to the agent performing the action, and a next training observation that characterizes a next state of the environment; and
training the Q network on each of the experience tuples, comprising, for each experience tuple:
processing the training observation in the experience tuple and the action in the experience tuple using the Q network to determine a current estimated future cumulative reward for the experience tuple in accordance with current values of the parameters of the Q network;
selecting an action from the set of actions that, when processed in combination with the next observation by the Q network, results in the Q network generating a highest estimated future cumulative reward;
processing the next observation in the experience tuple and the selected action using a target Q network to determine a next target estimated future cumulative reward for the selected action in accordance with current values of the parameters of the target Q network, wherein the target Q network is identical to the Q network but the current values of the parameters of the target Q network are different from the current values of the parameters of the Q network;
determining an error for the experience tuple from the reward in the experience tuple, the next target estimated future cumulative reward for the selected action, and the current estimated future cumulative reward; and
using the error for the experience tuple to update the current values of the parameters of the Q network.
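To make the sequence of steps in the claim concrete, below is a minimal PyTorch sketch of one training update for a single experience tuple. The observation size, number of actions, discount factor GAMMA, squared-error loss, and SGD optimizer are illustrative assumptions not specified in this record; only the double Q-learning structure (select the next action with the Q network, evaluate it with the target Q network, compare against the current estimate) follows the claim.

```python
# A hedged sketch of the training step in claim 1, not the patented implementation.
import copy
import torch
import torch.nn as nn

OBS_DIM, NUM_ACTIONS, GAMMA = 4, 3, 0.99   # illustrative sizes and discount factor

def one_hot(action, num_actions):
    """Encode a discrete action index as a one-hot vector."""
    return torch.eye(num_actions)[action]

class QNetwork(nn.Module):
    """Takes an observation and a (one-hot) action and returns a scalar
    estimated future cumulative reward, matching the claim's input/output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + NUM_ACTIONS, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, action_one_hot):
        return self.net(torch.cat([obs, action_one_hot], dim=-1)).squeeze(-1)

q_net = QNetwork()
# Same architecture; in practice its parameters lag behind the Q network's
# (how they are refreshed is not specified in claim 1 of this record).
target_q_net = copy.deepcopy(q_net)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def train_on_tuple(obs, action, reward, next_obs):
    """One update for an experience tuple (observation, action, reward, next observation)."""
    # Current estimated future cumulative reward Q(s, a) from the Q network.
    current_q = q_net(obs, one_hot(action, NUM_ACTIONS))

    with torch.no_grad():
        # Select the action the Q network scores highest for the next observation.
        all_actions = torch.eye(NUM_ACTIONS)
        next_obs_rep = next_obs.unsqueeze(0).expand(NUM_ACTIONS, -1)
        best_action = q_net(next_obs_rep, all_actions).argmax()
        # Evaluate that selected action with the *target* Q network.
        next_target_q = target_q_net(next_obs, one_hot(best_action, NUM_ACTIONS))
        target = reward + GAMMA * next_target_q

    # Error between target and current estimate; squared-error loss is an assumption.
    loss = (target - current_q) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a synthetic experience tuple:
train_on_tuple(torch.randn(OBS_DIM), action=1, reward=0.5, next_obs=torch.randn(OBS_DIM))
```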
Address: Mountain View, CA, US