https://classroom.udacity.com/courses/ud501/lessons/5326212698/concepts/54629888620923
hallucinate: to generate imagined experience; in Dyna, simulated experience produced from the internal model.
Dyna-Q: a hybrid of model-free and model-based learning.
For every interaction with the real world, the agent additionally performs about 100 updates using its own internal model (hallucinated experience).
T'[s,a,s']: the probability of ending up in state s' after taking action a in state s
R'[s,a]: the expected reward for taking action a in state s
T' is updated according to how many times each transition is observed in the real world.
Exercise: How To Evaluate T'?
Correction: The expression should be:

$$T'[s,a,s'] = \frac{T_c[s,a,s']}{\sum_{i} T_c[s,a,i]}$$

where $T_c[s,a,s']$ counts how many times the transition $(s,a,s')$ has been observed in the real world.
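A minimal NumPy sketch of this counting scheme (the array sizes and the tiny initialization constant are illustrative assumptions, not from the lesson):

```python
import numpy as np

# Illustrative sizes; the real state/action spaces come from the problem.
num_states, num_actions = 100, 4

# Tc[s, a, s'] holds transition counts. Initializing with a tiny constant
# avoids division by zero for (s, a) pairs that have never been visited.
Tc = np.full((num_states, num_actions, num_states), 1e-5)

def observe(s, a, s_prime):
    """Record one real transition (s, a, s')."""
    Tc[s, a, s_prime] += 1

def evaluate_T():
    """T'[s, a, s'] = Tc[s, a, s'] / sum_i Tc[s, a, i]."""
    return Tc / Tc.sum(axis=2, keepdims=True)
```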
R': the reward estimate kept in the model
r: the actual immediate reward received from the real world
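Putting R' and r together, the model's reward estimate can be refined toward each observed reward with an exponential moving average, where $\alpha$ is the learning rate (a sketch of the standard Dyna-Q model update):

$$R'[s,a] \leftarrow (1 - \alpha)\, R'[s,a] + \alpha\, r$$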
Summary
The Dyna architecture consists of a combination of:
- direct reinforcement learning from real experience tuples gathered by acting in an environment,
- updating an internal model of the environment, and
- using the model to simulate experiences.
Source: Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]
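Putting all three pieces together, below is a minimal Dyna-Q loop in Python. The hyperparameter values, array sizes, and the uniform-random choice of hallucinated (s, a) pairs are illustrative assumptions; only the overall structure (direct RL, model update, ~100 simulated updates per real step) comes from the lesson.

```python
import numpy as np

num_states, num_actions = 100, 4
alpha, gamma = 0.2, 0.9     # learning rate and discount (assumed values)
num_hallucinations = 100    # simulated updates per real interaction

Q = np.zeros((num_states, num_actions))
Tc = np.full((num_states, num_actions, num_states), 1e-5)  # transition counts
R = np.zeros((num_states, num_actions))                    # model reward R'

def q_update(s, a, s_prime, r):
    """Standard Q-learning update, used for both real and simulated experience."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime].max())

def dyna_q_step(s, a, s_prime, r):
    # 1. Direct RL from the real experience tuple (s, a, s', r).
    q_update(s, a, s_prime, r)

    # 2. Update the internal model from the real transition.
    Tc[s, a, s_prime] += 1
    R[s, a] = (1 - alpha) * R[s, a] + alpha * r

    # 3. Hallucinate: sample experiences from the model and learn from them.
    for _ in range(num_hallucinations):
        hs = np.random.randint(num_states)
        ha = np.random.randint(num_actions)
        p = Tc[hs, ha] / Tc[hs, ha].sum()      # T'[hs, ha, :]
        hs_prime = np.random.choice(num_states, p=p)
        q_update(hs, ha, hs_prime, R[hs, ha])
```

Because the hallucinated updates only touch the cheap internal model, Dyna-Q extracts many more Q-updates from each expensive real-world interaction.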
Resources
- Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, 1990. [pdf]
- Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]
- David Silver's RL course (videos, slides), Lecture 8: Integrating Learning and Planning [pdf]