Abstract |
<p>At step S1, a prediction operation aimed at conferring a maximum reward is carried out in a recurrent neural network using forward dynamics. At step S2, a plan is made using reverse dynamics. This yields an action plan consisting of a sequence of differential values of an action that confers the maximum reward. Steps S1 and S2 are repeated until it is judged at step S3 that a desired action plan has been made. In this way, an action plan that maximizes the reward is generated from only a few action experiences.</p> |
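
The S1 to S3 loop can be illustrated with a toy sketch. It is not the method claimed in the abstract: the one-unit recurrent model, its weights `W_H` and `W_A`, the reward peak `TARGET`, the learning rate, and the convergence test used for step S3 are all assumptions chosen for illustration. The forward pass predicts the reward of a candidate action sequence, the reverse pass propagates the reward gradient back through time to obtain differential values for each action, and the loop stops once the predicted reward no longer improves.

```python
import math

# Hypothetical fixed weights of a tiny one-unit recurrent forward model.
W_H, W_A = 0.5, 1.0
TARGET = 0.8  # hypothetical state value at which the reward peaks


def forward(actions, h0=0.0):
    """Step S1 (assumed form): roll the model forward and predict the reward."""
    states = [h0]
    for a in actions:
        states.append(math.tanh(W_H * states[-1] + W_A * a))
    reward = -sum((h - TARGET) ** 2 for h in states[1:])
    return states, reward


def backward(states):
    """Step S2 (assumed form): gradient of the reward w.r.t. each action."""
    n = len(states) - 1
    grads = [0.0] * n
    dh_next = 0.0  # gradient arriving from later time steps
    for t in reversed(range(n)):
        h = states[t + 1]
        dh = -2.0 * (h - TARGET) + dh_next  # dR/dh at time t+1
        dpre = dh * (1.0 - h * h)           # back through tanh
        grads[t] = dpre * W_A
        dh_next = dpre * W_H
    return grads


def plan(steps=5, lr=0.1, tol=1e-9, max_iters=5000):
    """Alternate S1 and S2 until the assumed S3 test says the plan is good."""
    actions = [0.0] * steps
    states, reward = forward(actions)
    for _ in range(max_iters):
        # Differential values of the actions, scaled by the learning rate.
        deltas = [lr * g for g in backward(states)]
        actions = [a + d for a, d in zip(actions, deltas)]
        states, new_reward = forward(actions)
        converged = abs(new_reward - reward) < tol  # assumed S3 criterion
        reward = new_reward
        if converged:
            break
    return actions, reward


if __name__ == "__main__":
    actions, reward = plan()
    print("planned actions:", [round(a, 3) for a in actions])
    print("predicted reward:", reward)
```

In this sketch the plan is refined purely through the model's predictions, which mirrors the idea of producing a plan from few real action experiences; a real system would first have to learn the forward model from those experiences.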