Model-Free vs. Model-Based
Two broad classes of reinforcement learning algorithms. The difference: whether the agent can fully understand or learn a model of the environment.
Model-based methods build an understanding of the environment up front and can plan ahead, but if the model does not match the real scenario they perform poorly.
Model-free methods abandon model learning, so they are easier to implement and fit real-world scenarios better. They are more popular now.
Basic models and concepts
Process Model
Math Model
Use a Markov decision process (MDP) to represent the stages of an agent.
The agent alternates between two stages: observing a state and taking an action.
State (S): The current situation the agent is in.
Action (A): The decision or move the agent takes to get from a state S to another state S’.
Reward (R): The immediate gain or loss the agent receives after taking an action in a state.
Policy (π): The strategy that determines the next action A to take for the current state S.
Uncertainty
- Under different policies π, the same state S may lead to a different next action A.
- The MDP allows the environment to be stochastic, so even if the same action A is taken twice in state S, the next state S' may differ.
Our goal is to find a policy π that leads to the largest cumulative benefit up to the end (e.g., game end).
Action-Value Function (Q): the expected cumulative reward for taking action A in state S (and following the policy afterwards). Used to determine the action A that maximizes long-term return at the current state S.
State-Value Function (V): the expected cumulative reward starting from state S, taking all actions the policy might take later into account. Represents how promising the current state S is compared to other states.
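Written out (a standard formulation; γ is the discount factor that the Monte Carlo backward pass below applies to future values):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
Q_\pi(S, A) = \mathbb{E}\big[\, G_t \mid S_t = S,\ A_t = A \,\big]
V_\pi(S) = \mathbb{E}_{A \sim \pi}\big[\, Q_\pi(S, A) \,\big]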
Monte Carlo Sampling
Forward pass: pick a state S and keep executing until the terminal state S' is reached.
Backward pass: starting from the terminal state S', compute the accumulated value V of each state in reverse, applying a discount factor to the value of future states. Choose the policy whose accumulated value V is highest when it reaches the original state S.
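A minimal sketch of the backward computation, assuming one episode has already been rolled out and its rewards stored in a list (the names rewards and gamma are illustrative):

# Monte Carlo: accumulate discounted value backwards from the terminal state.
# 'rewards' holds the reward collected at each step of one finished episode.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    g = 0.0
    for r in reversed(rewards):     # walk backwards from the final state
        g = r + gamma * g           # discount the future value, add the current reward
        returns.append(g)
    returns.reverse()               # returns[t] is the accumulated value seen from step t
    return returns

# Example: a 3-step episode with a reward only at the end
print(discounted_returns([0.0, 0.0, 1.0]))  # approx. [0.9801, 0.99, 1.0]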
Monte Carlo Estimation
Adds an incremental update rule, similar to gradient descent, so the value estimate can be adjusted as each sampled return arrives instead of waiting to average over all rollouts, controlled by a learning rate alpha (α).
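A sketch of the incremental update under these assumptions (the table V, the sampled return g, and alpha are illustrative names):

# Incremental (running-average style) update: nudge V toward the sampled return.
V = {}            # state -> estimated value (illustrative lookup table)
alpha = 0.1       # learning rate

def mc_update(state, g):
    v = V.get(state, 0.0)
    V[state] = v + alpha * (g - v)   # move the estimate toward the return g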
Time Difference(TD) Estimation
The forward pass runs for at most N steps; if the state Sn reached already has a V value, that V value is included in the backward computation.
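A sketch of that n-step target (illustrative names); the difference from Monte Carlo is that the stored V of the state where the rollout stopped stands in for the rest of the episode:

# n-step TD target: discounted rewards for at most N steps, then bootstrap from V.
def n_step_target(rewards, last_state, V, gamma=0.99):
    g = V.get(last_state, 0.0)       # value already estimated for the state we stopped at
    for r in reversed(rewards):      # rewards of the (at most N) steps actually taken
        g = r + gamma * g
    return g                         # used as the target for the starting state's V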
SARSA
Uses Q values instead of V values in the computation.
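A sketch of the SARSA update under that idea, assuming the tuple (S, A, R, S', A') was produced by the current policy (the dictionary Q and the hyperparameters are illustrative):

# SARSA (on-policy): the next action a_next is the one the policy actually takes.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))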
Q-learning
An off-policy RL algorithm: it updates its Q-values using the maximum possible future reward, regardless of the action actually taken. It goes through all possible actions and chooses the best one.
By comparison, SARSA is an on-policy RL algorithm: it updates its Q-values based on the action actually taken by the policy.
e.g. The current policy has a 20% chance of choosing Action1 and an 80% chance of choosing Action2. Action1 has a larger total future reward than Action2. In the current step, the policy chooses Action2.
With SARSA, the agent updates its Q-values based on Action2, the action it actually takes in this step.
With Q-learning, the agent compares the future total reward of every action in the action space (Action1 & Action2), then updates its Q-values as if it had taken Action1 in this step.
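For comparison, a sketch of the Q-learning update: the target bootstraps from the best action over the whole action space rather than the action the policy actually took (illustrative names):

# Q-learning (off-policy): the target uses the best action in s_next,
# even if the policy actually chose a different one (e.g. Action2 instead of Action1).
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q.get((s_next, an), 0.0) for an in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))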
Epsilon Greedy
For a fraction of the time (e.g., 10%), choose the next action randomly instead of the action with the maximum future Q value. This adds exploration to the model.
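A sketch of epsilon-greedy selection (illustrative names):

import random

# Epsilon-greedy: with probability epsilon explore randomly, otherwise exploit.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                      # exploration
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploitation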
Which part is the target and which part is the prediction? In the update, r + γ·max Q(S', a) is the target and Q(S, A) is the prediction; the prediction is nudged toward the target.
DQN
Q-learning needs a lookup table, so it only suits discrete states. To handle continuous states such as speed or distance, DQN is needed.
DQN replaces the Q-table of Q-learning with a function F(S) = A (a deep neural network), which finds the action A with the largest total future reward for state S.
Deterministic Policy
Outputs an action value directly from the state rather than a probability distribution over actions (in contrast to policy-gradient methods), which helps learning in continuous action spaces.
Replay Buffer
At each step, store the state S, action A, next state S', and reward R in a buffer.
Once the buffer holds at least batch_size transitions, sample batch_size rows at each step and train on them together with the current transition (mini-batch gradient descent).
This makes the training partially off-policy.
Benefits:
- Model may converge faster.
- A large variety of inputs helps avoid overfitting.
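A minimal sketch of the buffer described above (the capacity, field layout, and batch_size handling are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)   # mini-batch for one training step

    def __len__(self):
        return len(self.buffer)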
Fixed Q-Target (Target Network)
The target in DQN contains a deep Q-network whose parameters keep changing during training.
This constant change makes the Q-network learn inefficiently and makes convergence hard.
Solution: keep the parameters of a target network (targetQ) fixed across N training steps, record the parameter updates elsewhere, and copy them into targetQ once every N steps.
The target network is usually updated with a soft update: pick a rate t (typically 0.005), take a weighted average of the old and new network parameters, and assign it to the target network, i.e.
Q_target_params = t * Q_params + (1 - t) * Q_target_params
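A sketch of this soft update in PyTorch-style code (assuming q_net and target_net share the same architecture; tau plays the role of t above):

import torch.nn as nn

# Soft update: target_params = tau * online_params + (1 - tau) * target_params
def soft_update(q_net, target_net, tau=0.005):
    for p, tp in zip(q_net.parameters(), target_net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

# Hypothetical networks with identical architecture
q_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(q_net.state_dict())   # start from the same parameters
soft_update(q_net, target_net)                    # called after each training step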
Double DQN
Policy Gradient (PG)
Actor-Critic (AC)
Combines the two families of RL algorithms: value-based (e.g., Q-Learning) and action-probability-based (e.g., Policy Gradient).
The Actor descends from Policy Gradient and can choose suitable actions in a continuous action space.
The Critic descends from Q-Learning and can update step by step, much as TD improves on Monte Carlo, whereas a traditional Policy Gradient must update per episode and learns less efficiently.
The Actor network's performance is scored by the Critic network.
Deep Deterministic Policy Gradient (DDPG)
Further combines the ideas of DQN, AC, and the fixed Q-target, using four neural networks in total: Actor, Target Actor, Critic, and Target Critic.
Experiment Environment Setup
It can be divided into three parts:
- Chain the environment (env) and the policy/actions together to form an iterative optimization loop.
A simple tutorial on importing environments with gym - custom environments
Simple tutorial
For the environment registration part, a simple project can register the environment directly inside the project code.
Assuming the environment file env1.py and the algorithm file dqn.py that uses it are in the same directory, the environment can be imported by adding the following code at the top of dqn.py:
from gym.envs.registration import register
register(
    id="qkd-v1",
    entry_point="env1:Env1",  # env1 is the file name; Env1 is the class in env1.py that inherits gym.Env
)
(When using it yourself, you can actually just directly
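Once the registration has run, the custom environment can be created like a built-in one (a usage sketch; what reset returns depends on the gym version):

import gym

env = gym.make("qkd-v1")     # id must match the one passed to register()
obs = env.reset()            # newer gym versions return (obs, info) instead of obs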
- Define a custom agent and its policy.
Write a DQN network to solve the CartPole problem.
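A minimal, untuned sketch of such a DQN in PyTorch, combining the epsilon-greedy, replay-buffer, and fixed Q-target ideas above (hyperparameters and network size are illustrative assumptions):

import random
from collections import deque

import gym
import torch
import torch.nn as nn

# Q-network: maps a 4-dimensional CartPole state to one Q-value per action.
class QNet(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v1")
q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10000)
gamma, epsilon, batch_size = 0.99, 0.1, 64

for episode in range(200):
    state = env.reset()
    state = state[0] if isinstance(state, tuple) else state   # handle old/new gym reset API
    done = False
    while not done:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

        step_out = env.step(action)
        if len(step_out) == 5:                                  # new gym API
            next_state, reward, terminated, truncated, _ = step_out
            done = terminated or truncated
        else:                                                   # old gym API
            next_state, reward, done, _ = step_out
        buffer.append((state, action, reward, next_state, done))
        state = next_state

        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            s, a, r, s2, d = map(list, zip(*batch))
            s = torch.as_tensor(s, dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
            r = torch.as_tensor(r, dtype=torch.float32)
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)

            q_pred = q_net(s).gather(1, a).squeeze(1)           # prediction: Q(S, A)
            with torch.no_grad():                               # target: R + gamma * max Q(S', a)
                q_target = r + gamma * (1 - d) * target_net(s2).max(1).values
            loss = nn.functional.mse_loss(q_pred, q_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Fixed Q-target: hard-update the target network every few episodes
    if episode % 10 == 0:
        target_net.load_state_dict(q_net.state_dict())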