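Reinforcement learning has two important families of methods: Policy Gradients and Q-learning. Policy Gradients methods directly predict the action to take in a given state, while Q-learning methods predict the expected value (the Q-value) of every action in a given state. Generally speaking, Q-learning is only suited to environments with a small number of discrete actions, whereas Policy Gradients is also suited to environments with continuous actions. Combined with deep learning, these two algorithms become the Policy Network and the DQN (Deep Q-Network).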
Policy Gradient: Policy gradient methods for reinforcement learning with function approximation
DQN: Playing Atari with Deep Reinforcement Learning
Nature DQN: Human-level control through deep reinforcement learning
- Python 3.6
- Tensorflow-gpu 1.8.0
- Keras 2.2.2
- Gym 0.10.8
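Gym is a toolkit released by OpenAI for developing and comparing reinforcement learning algorithms. With it, an AI agent can be made to do many things, such as walking, running, and playing a variety of games. This demo uses the Cart-Pole game.
The Cart-Pole world consists of a cart that moves along a horizontal axis and a pole attached to the cart. At every time step you can observe the cart's position (x) and velocity (x_dot) and the pole's angle (theta) and angular velocity (theta_dot). These four values make up the observable state of the world. In any state the cart has only two possible actions: move left or move right. In other words, Cart-Pole's state space has four continuous dimensions, and its action space has one dimension with two discrete values.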
pip install gym
# -*- coding: utf-8 -*-
import gym
import numpy as np


def try_gym():
    # Create a CartPole environment with gym.
    # The environment accepts an action and returns the observation after
    # the action, the reward, and whether the game is over.
    env = gym.make('CartPole-v0')
    # Reset the game environment
    env.reset()
    # Number of episodes played so far
    random_episodes = 0
    # Sum of the rewards in the current episode
    reward_sum = 0
    count = 0
    while random_episodes < 10:
        # Render the game
        env.render()
        # Pick a random action, i.e. move left or move right,
        # then receive the feedback from executing that action
        observation, reward, done, _ = env.step(np.random.randint(0, 2))
        reward_sum += reward
        count += 1
        # When the episode ends, print the reward sum and reset the game
        if done:
            random_episodes += 1
            print("Reward for this episode was: {}, turns was: {}".format(reward_sum, count))
            reward_sum = 0
            count = 0
            env.reset()


if __name__ == '__main__':
    try_gym()
Reward for this episode was: 20.0, turns was: 20
Reward for this episode was: 26.0, turns was: 26
Reward for this episode was: 18.0, turns was: 18
Reward for this episode was: 25.0, turns was: 25
Reward for this episode was: 25.0, turns was: 25
Reward for this episode was: 23.0, turns was: 23
Reward for this episode was: 29.0, turns was: 29
Reward for this episode was: 17.0, turns was: 17
Reward for this episode was: 13.0, turns was: 13
Reward for this episode was: 27.0, turns was: 27
If you are using Anaconda 3, the following error may occur:
raise NotImplementedError('abstract')
NotImplementedError: abstract
Reinstalling pyglet at version 1.2.4 works around it:
pip uninstall pyglet
pip install pyglet==1.2.4
Policy Network
The Policy Gradient method proposed by R. Sutton in 2000 is a classic RL approach for learning continuous control policies. It represents the policy at each step with a probability distribution πθ(st|θπ), and at each step it obtains the current action by sampling from that distribution: at ~ πθ(st|θπ). Generating an action is therefore inherently a random process, and the policy that is finally learned is a stochastic policy.
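As a rough illustration of sampling at ~ πθ(st|θπ), the sketch below uses a hand-written logistic policy over the four Cart-Pole observations. The weights and the function names `policy_prob` and `sample_action` are made up for this example; they are not learned values and not part of the implementation later in this post.

```python
import math
import random


def policy_prob(state, weights):
    """Probability of choosing action 1 under a logistic policy (illustrative only)."""
    z = sum(w * s for w, s in zip(weights, state))
    return 1.0 / (1.0 + math.exp(-z))


def sample_action(state, weights, rng):
    """Draw an action from the Bernoulli distribution defined by the policy."""
    return 1 if rng.random() < policy_prob(state, weights) else 0


rng = random.Random(0)
weights = [0.0, 0.5, 2.0, 1.0]    # hypothetical policy parameters
state = [0.0, 0.1, 0.05, -0.2]    # x, x_dot, theta, theta_dot
p = policy_prob(state, weights)
actions = [sample_action(state, weights, rng) for _ in range(1000)]
print("P(action=1) = {:.3f}, empirical frequency = {:.3f}".format(
    p, sum(actions) / len(actions)))
```

The same state can produce different actions on different draws, which is what makes the policy stochastic; over many draws the frequency of action 1 approaches the probability the policy assigns to it.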
The Policy Network is a typical Monte Carlo method: it learns from the discounted reward once an episode has finished. The procedure is as follows:
(2) When an episode ends (the game is won or lost), env is reset, so the observation returns to its initial state. On each subsequent loop iteration the observation is fed in and a probability p0 is output. An action is chosen according to p0 and passed to the environment, which returns a new observation and reward. [observation, action, reward] is recorded as data for later training.
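To make "learning from the discounted reward at the end of an episode" concrete, here is a small pure-Python sketch of the computation. The helper names `discounted_returns` and `normalize` are chosen for this example; gamma = 0.95 matches the value used in the implementation in this post.

```python
import math


def discounted_returns(rewards, gamma):
    """Walk backwards through the episode, accumulating the discounted return."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns


def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


rewards = [1.0, 1.0, 1.0]          # Cart-Pole gives a reward of 1 per step
returns = discounted_returns(rewards, gamma=0.95)
normalized = normalize(returns)
print([round(r, 4) for r in returns])
print([round(n, 4) for n in normalized])
```

For three rewards of 1 and gamma = 0.95, the returns are approximately 2.8525, 1.95 and 1.0: early steps are credited with everything that followed, while the last step only gets its own reward. After normalization, steps with above-average return get a positive weight and the rest get a negative one.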
The Policy Network implemented with Keras is shown below:
# -*- coding: utf-8 -*-
import os
import gym
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
class PG:
    def __init__(self):
        self.model = self.build_model()
        # Resume from previously saved weights if they exist
        if os.path.exists('pg.h5'):
            self.model.load_weights('pg.h5')
        self.env = gym.make('CartPole-v0')
        self.gamma = 0.95
    def build_model(self):
        """Basic network: 4 observations in, probability of action 1 out."""
        inputs = Input(shape=(4,), name='ob_input')
        x = Dense(16, activation='relu')(inputs)
        x = Dense(16, activation='relu')(x)
        x = Dense(1, activation='sigmoid')(x)
        model = Model(inputs=inputs, outputs=x)
        return model
    def loss(self, y_true, y_pred):
        """Loss function.
        Arguments:
            y_true: (action, reward)
            y_pred: action_prob
        Returns:
            loss: reward loss
        """
        action_pred = y_pred
        action_true, discount_episode_reward = y_true[:, 0], y_true[:, 1]
        # Binary cross-entropy loss
        action_true = K.reshape(action_true, (-1, 1))
        loss = K.binary_crossentropy(action_true, action_pred)
        # Multiply by the discounted reward. Both factors are flattened so the
        # product is element-wise rather than an (N, 1) * (N,) broadcast to (N, N).
        loss = K.flatten(loss) * K.flatten(discount_episode_reward)
        return loss

    def discount_reward(self, rewards):
        """Discount reward.
        Arguments:
            rewards: the rewards collected in one episode
        """
        # Compute the discounted reward of one episode, walking backwards in time
        discount_rewards = np.zeros_like(rewards, dtype=np.float32)
        cumulative = 0.
        for i in reversed(range(len(rewards))):
            cumulative = cumulative * self.gamma + rewards[i]
            discount_rewards[i] = cumulative
        # Normalization helps control the variance of the gradient
        discount_rewards -= np.mean(discount_rewards)
        # True division here: floor division (//=) would round the values to whole numbers
        discount_rewards /= np.std(discount_rewards)
        return list(discount_rewards)
    def train(self, episode, batch):
        """Training.
        Arguments:
            episode: number of episodes to play
            batch: number of episodes per batch; the gradient is updated once per batch
        Returns:
            history: training record
        """
        self.model.compile(loss=self.loss, optimizer=Adam(lr=0.01))
        history = {'episode': [], 'Batch_reward': [], 'Episode_reward': [], 'Loss': []}
        episode_reward = 0
        states = []
        actions = []
        rewards = []
        discount_rewards = []
        for i in range(episode):
            observation = self.env.reset()
            erewards = []
            while True:
                x = observation.reshape(-1, 4)
                prob = self.model.predict(x)[0][0]
                # Sample the action according to the predicted probability
                action = np.random.choice(np.array(range(2)), size=1, p=[1 - prob, prob])[0]
                observation, reward, done, _ = self.env.step(action)
                # Record the data produced within one episode
                states.append(x[0])
                actions.append(action)
                erewards.append(reward)
                rewards.append(reward)
                if done:
                    # Compute the discounted rewards once the episode is over
                    discount_rewards.extend(self.discount_reward(erewards))
                    break
            # Keep the data of `batch` episodes and use it to update the model
            if i != 0 and i % batch == 0:
                batch_reward = sum(rewards)
                episode_reward = batch_reward / batch
                # X holds the states; y holds the actions and discount_rewards,
                # which are compared with the predicted prob to compute the loss
                X = np.array(states)
                y = np.array(list(zip(actions, discount_rewards)))
                loss = self.model.train_on_batch(X, y)
                history['episode'].append(i)
                history['Batch_reward'].append(batch_reward)
                history['Episode_reward'].append(episode_reward)
                history['Loss'].append(loss)
                print('Episode: {} | Batch reward: {} | Episode reward: {} | loss: {:.3f}'.format(i, batch_reward, episode_reward, loss))
                episode_reward = 0
                states = []
                actions = []
                rewards = []
                discount_rewards = []
        # Save the weights so a later run can pick them up in __init__
        self.model.save_weights('pg.h5')
        return history
    def play(self):
        """Play 10 episodes with the trained model, always taking the more likely action."""
        observation = self.env.reset()
        count = 0
        reward_sum = 0
        random_episodes = 0
        while random_episodes < 10:
            x = observation.reshape(-1, 4)
            prob = self.model.predict(x)[0][0]
            action = 1 if prob > 0.5 else 0
            observation, reward, done, _ = self.env.step(action)
            count += 1
            reward_sum += reward
            if done:
                print("Reward for this episode was: {}, turns was: {}".format(reward_sum, count))
                random_episodes += 1
                reward_sum = 0
                count = 0
                observation = self.env.reset()
if __name__ == '__main__':
    model = PG()
    history = model.train(5000, 5)
    model.play()
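The training and test results are shown below. As training proceeds, the reward the Policy Network model earns in the game keeps increasing, and the magnitude of the loss shrinks (from about -0.33 to about -0.22 in the log). When the model is tested after 5000 training episodes, it reaches a reward of 200, far above random play. Since the game also ends once the reward reaches 200, the Policy Network can be said to have solved this problem.
In my experiments, however, training the Policy Network was not stable. The initialization of the model parameters has a large effect on the outcome, so several attempts are needed. Sometimes the reward converges for a while and then drops quickly, showing a periodic pattern; the instability is also visible in the training record.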
Episode: 5 | Batch reward: 120.0 | Episode reward: 24.0 | loss: -0.325
Episode: 10 | Batch reward: 67.0 | Episode reward: 13.4 | loss: -0.300
Episode: 15 | Batch reward: 128.0 | Episode reward: 25.6 | loss: -0.326
Episode: 20 | Batch reward: 117.0 | Episode reward: 23.4 | loss: -0.332
Episode: 25 | Batch reward: 122.0 | Episode reward: 24.4 | loss: -0.330
Episode: 30 | Batch reward: 97.0 | Episode reward: 19.4 | loss: -0.339
Episode: 35 | Batch reward: 120.0 | Episode reward: 24.0 | loss: -0.331
Episode: 4960 | Batch reward: 973.0 | Episode reward: 194.6 | loss: -0.228
Episode: 4965 | Batch reward: 1000.0 | Episode reward: 200.0 | loss: -0.224
Episode: 4970 | Batch reward: 881.0 | Episode reward: 176.2 | loss: -0.238
Episode: 4975 | Batch reward: 1000.0 | Episode reward: 200.0 | loss: -0.213
Episode: 4980 | Batch reward: 974.0 | Episode reward: 194.8 | loss: -0.229
Episode: 4985 | Batch reward: 862.0 | Episode reward: 172.4 | loss: -0.235
Episode: 4990 | Batch reward: 914.0 | Episode reward: 182.8 | loss: -0.233
Episode: 4995 | Batch reward: 737.0 | Episode reward: 147.4 | loss: -0.254
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
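DQN is a typical temporal-difference method. Unlike the Policy Network, DQN learns from the data at step n and step n+1, so its variance is lower than that of a Monte Carlo method. The commonly used DQN algorithm is Nature DQN, proposed in 2015, which is used as the example here.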
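The one-step idea can be sketched as follows: the target for a transition is the immediate reward plus the discounted best Q-value of the next state, and terminal transitions do not bootstrap. The function name `td_target` and the Q-values are made up for illustration.

```python
def td_target(reward, next_q_values, done, gamma=0.95):
    """One-step temporal-difference target for Q-learning."""
    if done:
        # No future reward after a terminal step
        return reward
    return reward + gamma * max(next_q_values)


next_q = [1.2, 2.0]   # hypothetical Q-values of the next state, one per action
print(td_target(1.0, next_q, done=False))
print(td_target(1.0, next_q, done=True))
```

With these numbers the non-terminal target is 1 + 0.95 × 2.0 = 2.9, while the terminal target is just the reward, 1.0. Only one sampled reward enters each target, which is why the variance is lower than when a whole episode of rewards is summed.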
DQN uses a single network both to select actions and to compute the target Q-value. Nature DQN uses two networks: a current main network that selects actions and whose parameters are updated, and a target network that computes the target Q-value. The two networks have exactly the same structure. The target network's parameters are not updated iteratively; instead they are copied from the main network at intervals (a delayed update), which reduces the correlation between the target Q-value and the current Q-value. Apart from using a second network of identical structure to compute the target Q-value, Nature DQN is essentially the same as DQN.
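The delayed update can be illustrated without any deep learning library. In the sketch below, `TinyQ` is a hypothetical stand-in for a network (just a list of weights), the "gradient step" is a fake increment, and the sync interval of 20 steps is an assumption taken from the training loop in this post.

```python
import copy


class TinyQ:
    """Hypothetical stand-in for a Q-network: only holds a list of weights."""

    def __init__(self, weights):
        self.weights = list(weights)


def sync_target(online, target):
    """Copy the main network's weights into the target network."""
    target.weights = copy.deepcopy(online.weights)


online = TinyQ([0.0, 0.0])
target = TinyQ([0.0, 0.0])
SYNC_EVERY = 20
lag_seen = []
for step in range(1, 61):
    # Pretend gradient update: only the main network changes every step
    online.weights = [w + 0.1 for w in online.weights]
    if step % SYNC_EVERY == 0:
        sync_target(online, target)
    lag_seen.append(online.weights[0] - target.weights[0])

print("largest lag between networks: {:.1f}".format(max(lag_seen)))
print("lag right after the last sync: {:.1f}".format(lag_seen[-1]))
```

Between syncs the target network stays frozen while the main network moves, so the targets it produces do not shift with every update.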
The Nature DQN procedure is as follows:
(2) When an episode ends (the game is won or lost), env is reset, so the observation returns to its initial state. An action is selected with the ε-greedy method. Executing the chosen action yields next_observation, the reward, and the game status. [observation, action, reward, next_observation, done] is put into the experience pool (replay buffer). The pool has a fixed capacity, so old data is discarded.
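The two mechanisms in this step, ε-greedy selection and a bounded experience pool, can be sketched with the standard library alone. The Q-values, the capacity of 5, and the function name `egreedy` are illustrative assumptions.

```python
import random
from collections import deque


def egreedy(q_values, epsilon, rng):
    """With probability epsilon explore at random, otherwise take the best action."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
buffer = deque(maxlen=5)   # tiny capacity so the eviction is easy to see
for t in range(8):
    action = egreedy([0.1, 0.4], epsilon=0.1, rng=rng)
    # (observation, action, reward, next_observation, done) with dummy observations
    buffer.append((t, action, 1.0, t + 1, False))

batch = rng.sample(list(buffer), 3)
print(len(buffer), [item[0] for item in buffer])
```

After eight insertions only the five most recent transitions (observations 3 to 7) remain, because a `deque` with `maxlen` drops the oldest entries automatically. Sampling a random batch from the pool, rather than training on consecutive steps, breaks up the correlation between neighbouring transitions.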
The Nature DQN implemented with Keras is shown below:
# -*- coding: utf-8 -*-
import os
import gym
import random
import numpy as np
from collections import deque
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
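class DQN:
    def __init__(self):
        self.model = self.build_model()
        self.target_model = self.build_model()
        # Resume from previously saved weights if they exist
        if os.path.exists('dqn.h5'):
            self.model.load_weights('dqn.h5')
        # Start with the target network in sync with the main network
        self.update_target_model()
        # Experience pool (replay buffer)
        self.memory_buffer = deque(maxlen=2000)
        # Discount rate of the Q-value, used to discount future rewards
        self.gamma = 0.95
        # Degree of random exploration in the ε-greedy method
        self.epsilon = 1.0
        # Decay rate of epsilon
        self.epsilon_decay = 0.995
        # Minimum probability of random exploration
        self.epsilon_min = 0.01
        self.env = gym.make('CartPole-v0')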
    def build_model(self):
        """Basic network: 4 observations in, one Q-value per action out."""
        inputs = Input(shape=(4,))
        x = Dense(16, activation='relu')(inputs)
        x = Dense(16, activation='relu')(x)
        x = Dense(2, activation='linear')(x)
        model = Model(inputs=inputs, outputs=x)
        return model
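    def update_target_model(self):
        """Copy the main network's weights into the target network."""
        self.target_model.set_weights(self.model.get_weights())

    def egreedy_action(self, state):
        """ε-greedy action selection.
        Arguments:
            state: state
        Returns:
            action: action
        """
        if np.random.rand() <= self.epsilon:
            return random.randint(0, 1)
        q_values = self.model.predict(state)[0]
        return np.argmax(q_values)

    def remember(self, state, action, reward, next_state, done):
        """Add a transition to the experience pool.
        Arguments:
            state: state
            action: action
            reward: reward
            next_state: next state
            done: game-over flag
        """
        item = (state, action, reward, next_state, done)
        self.memory_buffer.append(item)

    def update_epsilon(self):
        """Decay epsilon until it falls below epsilon_min."""
        if self.epsilon >= self.epsilon_min:
            self.epsilon *= self.epsilon_decay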
    def process_batch(self, batch):
        """Build one training batch from the experience pool.
        Arguments:
            batch: batch size
        Returns:
            X: states
            y: [Q_value1, Q_value2]
        """
        # Randomly sample a batch from the experience pool
        data = random.sample(self.memory_buffer, batch)
        # Build Q_target
        states = np.array([d[0] for d in data])
        next_states = np.array([d[3] for d in data])
        y = self.model.predict(states)
        q = self.target_model.predict(next_states)
        for i, (_, action, reward, _, done) in enumerate(data):
            target = reward
            if not done:
                target += self.gamma * np.amax(q[i])
            # Only the Q-value of the action actually taken is changed
            y[i][action] = target
        return states, y
    def train(self, episode, batch):
        """Training.
        Arguments:
            episode: number of episodes to play
            batch: batch size
        Returns:
            history: training record
        """
        self.model.compile(loss='mse', optimizer=Adam(1e-3))
        history = {'episode': [], 'Episode_reward': [], 'Loss': []}
        count = 0
        for i in range(episode):
            observation = self.env.reset()
            reward_sum = 0
            loss = np.inf
            done = False
            while not done:
                # Select the action with the ε-greedy method
                x = observation.reshape(-1, 4)
                action = self.egreedy_action(x)
                observation, reward, done, _ = self.env.step(action)
                # Add the transition to the experience pool
                reward_sum += reward
                self.remember(x[0], action, reward, observation, done)
                if len(self.memory_buffer) > batch:
                    # Train
                    X, y = self.process_batch(batch)
                    loss = self.model.train_on_batch(X, y)
                    count += 1
                    # Decay the epsilon of ε-greedy
                    self.update_epsilon()
                    # Update target_model every fixed number of training steps
                    if count != 0 and count % 20 == 0:
                        self.update_target_model()
            if i % 5 == 0:
                history['episode'].append(i)
                history['Episode_reward'].append(reward_sum)
                history['Loss'].append(loss)
                print('Episode: {} | Episode reward: {} | loss: {:.3f} | e:{:.2f}'.format(i, reward_sum, loss, self.epsilon))
        # Save the weights so a later run can pick them up in __init__
        self.model.save_weights('dqn.h5')
        return history
    def play(self):
        """Play 10 episodes with the trained model, always taking the greedy action."""
        observation = self.env.reset()
        count = 0
        reward_sum = 0
        random_episodes = 0
        while random_episodes < 10:
            x = observation.reshape(-1, 4)
            q_values = self.model.predict(x)[0]
            action = np.argmax(q_values)
            observation, reward, done, _ = self.env.step(action)
            count += 1
            reward_sum += reward
            if done:
                print("Reward for this episode was: {}, turns was: {}".format(reward_sum, count))
                random_episodes += 1
                reward_sum = 0
                count = 0
                observation = self.env.reset()
if __name__ == '__main__':
    model = DQN()
    history = model.train(600, 32)
    model.play()
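The training and test results are shown below. As training proceeds, the reward the DQN model earns in the game keeps increasing, and the loss decreases. Tested after 500 training episodes with batch=32, DQN also performs well; with further training it should reach the same result as the Policy Network.
Compared with the Policy Network, DQN training is somewhat more stable. However, DQN has a problem: it does not guarantee that the Q-network converges. In other words, we may not obtain converged Q-network parameters, which can leave the trained model performing poorly, so repeated attempts are again needed to pick the best model.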
Episode: 0 | Episode reward: 11.0 | loss: inf | e:1.00
Episode: 5 | Episode reward: 23.0 | loss: 0.816 | e:0.67
Episode: 10 | Episode reward: 18.0 | loss: 2.684 | e:0.46
Episode: 15 | Episode reward: 11.0 | loss: 3.662 | e:0.34
Episode: 20 | Episode reward: 16.0 | loss: 2.702 | e:0.23
Episode: 25 | Episode reward: 10.0 | loss: 4.092 | e:0.18
Episode: 30 | Episode reward: 12.0 | loss: 3.734 | e:0.13
Episode: 460 | Episode reward: 111.0 | loss: 6.325 | e:0.01
Episode: 465 | Episode reward: 180.0 | loss: 0.046 | e:0.01
Episode: 470 | Episode reward: 141.0 | loss: 0.136 | e:0.01
Episode: 475 | Episode reward: 169.0 | loss: 0.110 | e:0.01
Episode: 480 | Episode reward: 200.0 | loss: 0.095 | e:0.01
Episode: 485 | Episode reward: 200.0 | loss: 0.024 | e:0.01
Episode: 490 | Episode reward: 200.0 | loss: 0.066 | e:0.01
Episode: 495 | Episode reward: 146.0 | loss: 0.022 | e:0.01
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 196.0, turns was: 196
Reward for this episode was: 198.0, turns was: 198
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 199.0, turns was: 199
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 193.0, turns was: 193
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 189.0, turns was: 189
Reward for this episode was: 200.0, turns was: 200
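The e column in the training log above reflects the ε schedule set in the code: ε starts at 1.0, is multiplied by 0.995, and stops decaying once it falls below 0.01. The sketch below counts how many decay steps that takes, assuming ε is decayed once per training step as the comment in the training loop suggests; `decay_steps` is a name chosen for this example.

```python
def decay_steps(epsilon=1.0, decay=0.995, epsilon_min=0.01):
    """Count multiplicative decays until epsilon drops below epsilon_min."""
    steps = 0
    while epsilon >= epsilon_min:
        epsilon *= decay
        steps += 1
    return steps, epsilon


steps, final_epsilon = decay_steps()
print("decay steps: {}, final epsilon: {:.4f}".format(steps, final_epsilon))
```

Under that assumption exploration bottoms out after roughly 900 training steps, which is only a few dozen episodes early in training. That is consistent with the log showing e at 0.13 by episode 30 and at 0.01 in the later episodes.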
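(1) The Policy Network can handle continuous actions, while DQN can only handle discrete ones by enumerating them; continuous actions have to be discretized before DQN can process them.
(2) The Policy Network samples an action at random according to the action probabilities it outputs, while DQN selects actions with the ε-greedy method.
(3) DQN updates on one reward at a time, i.e. the current reward is only tied to the adjacent step. The Policy Network stores all the rewards of an episode, corrects them by discounting, normalizes them, and then performs the update.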