Model-Free vs. Model-Based
Two broad classes of reinforcement learning algorithms. The difference: whether the agent can fully understand or learn a model of the environment.
Model-based methods build an understanding of the environment up front and can plan ahead, but if the model does not match the real scenario they perform poorly.
Model-free methods abandon model learning, so they are easier to implement and fit real-world scenarios better. They are more popular now.
Basic models and concepts
Process Model
Math Model
Use a Markov decision process (MDP) to represent the stages of an agent.
The agent alternates between two stages: observing a state and taking an action.
State (S): The current situation the agent is in.
Action (A): The decision or move the agent takes to get from a state S to another state S’.
Reward (R): The immediate gain or loss the agent receives after taking an action in a state.
Policy (π): The strategy that determines the next action A to take for the current state S.
Uncertainty
- Under different policies π, the same state S may lead to a different next action A.
- The MDP allows the environment to be stochastic, so even if the same action A is taken twice in state S, the next state S' may differ.
Our goal is to find a policy π that leads to the largest cumulative benefit up to the end (e.g., game end).
Action-Value Function (Q): the expected cumulative reward for taking action A in state S (and following the policy afterwards). Used to determine the action A that maximizes long-term return at the current state S.
State-Value Function (V): the expected cumulative reward starting from state S, taking all actions the policy might take later into account. Represents how promising the current state S is compared to other states.
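Written out (a standard formulation; γ is the discount factor that the Monte Carlo backward pass below applies to future values):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
Q_\pi(S, A) = \mathbb{E}\big[\, G_t \mid S_t = S,\ A_t = A \,\big]
V_\pi(S) = \mathbb{E}_{A \sim \pi}\big[\, Q_\pi(S, A) \,\big]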
Monte Carlo Sampling
Forward pass: pick a state S and keep executing until the terminal state S' is reached.
Backward pass: starting from the terminal state S', compute the accumulated value V of each state in reverse, applying a discount factor to the value of future states. Choose the policy whose accumulated value V is highest when it reaches the original state S.
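A minimal sketch of the backward computation, assuming one episode has already been rolled out and its rewards stored in a list (the names rewards and gamma are illustrative):

# Monte Carlo: accumulate discounted value backwards from the terminal state.
# 'rewards' holds the reward collected at each step of one finished episode.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    g = 0.0
    for r in reversed(rewards):     # walk backwards from the final state
        g = r + gamma * g           # discount the future value, add the current reward
        returns.append(g)
    returns.reverse()               # returns[t] is the accumulated value seen from step t
    return returns

# Example: a 3-step episode with a reward only at the end
print(discounted_returns([0.0, 0.0, 1.0]))  # approx. [0.9801, 0.99, 1.0]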
Monte Carlo Estimation
Adds an incremental update rule, similar to gradient descent, so the value estimate can be adjusted as each sampled return arrives instead of waiting to average over all rollouts, controlled by a learning rate alpha (α).
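A sketch of the incremental update under these assumptions (the table V, the sampled return g, and alpha are illustrative names):

# Incremental (running-average style) update: nudge V toward the sampled return.
V = {}            # state -> estimated value (illustrative lookup table)
alpha = 0.1       # learning rate

def mc_update(state, g):
    v = V.get(state, 0.0)
    V[state] = v + alpha * (g - v)   # move the estimate toward the return g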
Time Difference(TD) Estimation
The forward pass runs for at most N steps; if the state Sn reached already has a V value, that V value is included in the backward computation.
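A sketch of that n-step target (illustrative names); the difference from Monte Carlo is that the stored V of the state where the rollout stopped stands in for the rest of the episode:

# n-step TD target: discounted rewards for at most N steps, then bootstrap from V.
def n_step_target(rewards, last_state, V, gamma=0.99):
    g = V.get(last_state, 0.0)       # value already estimated for the state we stopped at
    for r in reversed(rewards):      # rewards of the (at most N) steps actually taken
        g = r + gamma * g
    return g                         # used as the target for the starting state's V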
SARSA
Uses Q values instead of V values in the computation.
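A sketch of the SARSA update under that idea, assuming the tuple (S, A, R, S', A') was produced by the current policy (the dictionary Q and the hyperparameters are illustrative):

# SARSA (on-policy): the next action a_next is the one the policy actually takes.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))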
Q-learning
An off-policy RL algorithm: it updates its Q-values using the maximum possible future reward, regardless of the action actually taken. It goes through all possible actions and chooses the best one.
By comparison, SARSA is an on-policy RL algorithm: it updates its Q-values based on the action actually taken by the policy.
e.g. The current policy has a 20% chance of choosing Action1 and an 80% chance of choosing Action2. Action1 has a larger total future reward than Action2. In the current step, the policy chooses Action2.
With SARSA, the agent updates its Q-values based on Action2, the action it actually takes in this step.
With Q-learning, the agent compares the future total reward of every action in the action space (Action1 & Action2), then updates its Q-values as if it had taken Action1 in this step.
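For comparison, a sketch of the Q-learning update: the target bootstraps from the best action over the whole action space rather than the action the policy actually took (illustrative names):

# Q-learning (off-policy): the target uses the best action in s_next,
# even if the policy actually chose a different one (e.g. Action2 instead of Action1).
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q.get((s_next, an), 0.0) for an in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))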
Epsilon Greedy
For a fraction of the time (e.g., 10%), choose the next action randomly instead of the action with the maximum future Q value. This adds exploration to the model.
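A sketch of epsilon-greedy selection (illustrative names):

import random

# Epsilon-greedy: with probability epsilon explore randomly, otherwise exploit.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                      # exploration
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploitation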
Which part is the target and which part is the prediction? In the update, r + γ·max Q(S', a) is the target and Q(S, A) is the prediction; the prediction is nudged toward the target.
DQN
Q-learning needs a lookup table, so it only suits discrete states. To handle continuous states such as speed or distance, DQN is needed.
DQN replaces the Q-table of Q-learning with a function F(S) = A (a deep neural network), which finds the action A with the largest total future reward for state S.
Deterministic Policy
Outputs an action value directly from the state rather than a probability distribution over actions (in contrast to policy-gradient methods), which helps learning in continuous action spaces.
Replay Buffer
At each step, store the state S, action A, next state S', and reward R in a buffer.
Once the buffer holds at least batch_size transitions, sample batch_size rows at each step and train on them together with the current transition (mini-batch gradient descent).
This makes the training partially off-policy.
Benefits:
- Model may converge faster.
- A large variety of inputs helps avoid overfitting.
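A minimal sketch of the buffer described above (the capacity, field layout, and batch_size handling are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)   # mini-batch for one training step

    def __len__(self):
        return len(self.buffer)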
Fixed Q-Target (Target Network)
The target in DQN contains a deep Q-network whose parameters keep changing during training.
This constant change makes the Q-network learn inefficiently and makes convergence hard.
Solution: keep the parameters of a target network (targetQ) fixed across N training steps, record the parameter updates elsewhere, and copy them into targetQ once every N steps.
The target network is usually updated with a soft update: pick a rate t (typically 0.005), take a weighted average of the old and new network parameters, and assign it to the target network, i.e.
Q_target_params = t * Q_params + (1 - t) * Q_target_params
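A sketch of this soft update in PyTorch-style code (assuming q_net and target_net share the same architecture; tau plays the role of t above):

import torch.nn as nn

# Soft update: target_params = tau * online_params + (1 - tau) * target_params
def soft_update(q_net, target_net, tau=0.005):
    for p, tp in zip(q_net.parameters(), target_net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

# Hypothetical networks with identical architecture
q_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(q_net.state_dict())   # start from the same parameters
soft_update(q_net, target_net)                    # called after each training step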
Double DQN
Policy Gradient (PG)
Actor-Critic (AC)
Combines the two families of RL algorithms: value-based (e.g., Q-Learning) and action-probability-based (e.g., Policy Gradient).
The Actor descends from Policy Gradient and can choose suitable actions in a continuous action space.
The Critic descends from Q-Learning and can update step by step, much as TD improves on Monte Carlo, whereas a traditional Policy Gradient must update per episode and learns less efficiently.
The Actor network's performance is scored by the Critic network.
Deep Deterministic Policy Gradient (DDPG)
Further combines the ideas of DQN, AC, and the fixed Q-target, using four neural networks in total: Actor, Target Actor, Critic, and Target Critic.
Experiment Environment Setup
It can be divided into three parts:
- Chain the environment (env) and the policy/actions together to form an iterative optimization loop.
A simple tutorial on importing environments with gym - custom environments
Simple tutorial
For the environment registration part, a simple project can register the environment directly inside the project code.
Assuming the environment file env1.py and the algorithm file dqn.py that uses it are in the same directory, the environment can be imported by adding the following code at the top of dqn.py:
from gym.envs.registration import register
register(
    id="qkd-v1",
    entry_point="env1:Env1",  # env1 is the file name; Env1 is the class in env1.py that inherits gym.Env
)
(When using it yourself, you can actually just directly
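Once the registration has run, the custom environment can be created like a built-in one (a usage sketch; what reset returns depends on the gym version):

import gym

env = gym.make("qkd-v1")     # id must match the one passed to register()
obs = env.reset()            # newer gym versions return (obs, info) instead of obs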
- Define a custom agent and its policy.
Write a DQN network to solve the CartPole problem.
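A minimal, untuned sketch of such a DQN in PyTorch, combining the epsilon-greedy, replay-buffer, and fixed Q-target ideas above (hyperparameters and network size are illustrative assumptions):

import random
from collections import deque

import gym
import torch
import torch.nn as nn

# Q-network: maps a 4-dimensional CartPole state to one Q-value per action.
class QNet(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v1")
q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10000)
gamma, epsilon, batch_size = 0.99, 0.1, 64

for episode in range(200):
    state = env.reset()
    state = state[0] if isinstance(state, tuple) else state   # handle old/new gym reset API
    done = False
    while not done:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

        step_out = env.step(action)
        if len(step_out) == 5:                                  # new gym API
            next_state, reward, terminated, truncated, _ = step_out
            done = terminated or truncated
        else:                                                   # old gym API
            next_state, reward, done, _ = step_out
        buffer.append((state, action, reward, next_state, done))
        state = next_state

        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            s, a, r, s2, d = map(list, zip(*batch))
            s = torch.as_tensor(s, dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
            r = torch.as_tensor(r, dtype=torch.float32)
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)

            q_pred = q_net(s).gather(1, a).squeeze(1)           # prediction: Q(S, A)
            with torch.no_grad():                               # target: R + gamma * max Q(S', a)
                q_target = r + gamma * (1 - d) * target_net(s2).max(1).values
            loss = nn.functional.mse_loss(q_pred, q_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Fixed Q-target: hard-update the target network every few episodes
    if episode % 10 == 0:
        target_net.load_state_dict(q_net.state_dict())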