摘要 |
A system and method for online reinforcement learning is provided. In particular, a method for performing the explore-vs.-exploit tradeoff is provided. Although the method is heuristic, it can be applied in a principled manner while simultaneously learning the parameters and/or structure of the model (e.g., Bayesian network model). The system includes a model which receives an input (e.g., from a user) and provides a probability distribution associated with uncertainty regarding parameters of the model to a decision engine. The decision engine can determine whether to exploit the information known to it or to explore to obtain additional information based, at least in part, upon the explore-vs.-exploit tradeoff (e.g., Thompson strategy). A reinforcement learning component can obtain additional information (e.g., feedback from a user) and update parameter(s) and/or the structure of the model. The system can be employed in scenarios in which an influence diagram is used to make repeated decisions and maximization of long-term expected utility is desired.
|