Chapter 3: Finite Markov Decision Processes

Basic Definitions

A Markov decision process (MDP) is the most basic formulation of a sequential decision problem under the Markov property, i.e., the assumption that the future depends on the past only through the current state.

  1. State: The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
  2. Action: The actions are the choices made by the agent.
  3. Reward: The reward defines what we want to achieve instead of how we want to achieve it.
  4. Dynamics: p(s', r | s, a), the probability of transitioning to state s' and receiving reward r after taking action a in state s.
  5. Return: The return is defined as some specific function of the reward sequence
    For episodic tasks, we have G_t = R_{t+1} + \cdots + R_T
    For continuing tasks, we have G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    They can be unified as G_t = \sum_{k=0}^\infty \gamma^{k} R_{t+k+1} by treating termination of an episodic task as entry into an absorbing state that yields only zero reward (with \gamma = 1 allowed only in the episodic case)
    The recursive form of the return is G_t = R_{t+1} + \gamma G_{t+1}, which forms the basis of the Bellman equations (see the sketch right after this list)
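
As a quick sanity check on the return definitions above, here is a minimal Python sketch (the reward sequence and discount factor are made up for illustration) that computes G_t both as the discounted sum and via the backward recursion G_t = R_{t+1} + \gamma G_{t+1}; the two agree.

    # Minimal sketch: computing returns from a finite reward sequence.
    # rewards[k] plays the role of R_{t+k+1}; gamma is the discount factor.

    def discounted_return(rewards, gamma):
        """G_t = sum_k gamma^k R_{t+k+1}, computed directly."""
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    def returns_by_recursion(rewards, gamma):
        """All of G_t, G_{t+1}, ... via G_t = R_{t+1} + gamma * G_{t+1}."""
        returns = [0.0] * (len(rewards) + 1)   # G_T = 0 at the absorbing state
        for k in reversed(range(len(rewards))):
            returns[k] = rewards[k] + gamma * returns[k + 1]
        return returns[:-1]

    rewards = [1.0, 0.0, -1.0, 2.0]   # hypothetical episode rewards
    gamma = 0.9
    print(discounted_return(rewards, gamma))        # direct discounted sum
    print(returns_by_recursion(rewards, gamma)[0])  # same value via the recursion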

Further notes:

  1. In the RL book, the reward obtained from taking action A_t in state S_t at time step t is denoted R_{t+1} rather than R_t.
  2. RL beyond the MDP assumption is an important research topic (also discussed in the RL book).
  3. The representation of states and actions has a great influence on the learning process, but it is beyond the scope of the RL book (many recent works actually focus on this topic).
  4. The RL book focuses on a scalar reward signal, but some recent works consider multi-objective reward signals in vector form.

Policies and Value Functions

A policy is a mapping from states to the probability of selecting each possible action.
A value function gives the expected return of a state or a state-action pair under a given policy.
Value functions are defined w.r.t. particular policies, i.e., v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s], \quad q_\pi (s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a]
Based on the simple relationships

    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a),
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big],

we can derive the Bellman equations, which express the relationship between the value of a state (or state-action pair) and the values of its successor states (or state-action pairs):

    v_\pi (s) = \sum_a \pi (a | s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big],
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \sum_{a'} \pi (a' | s') q_\pi (s', a') \big].

The value function v_\pi is the unique solution to its Bellman equation, obtainable by solving a system of |\mathcal{S}| linear equations. Note that this assumes the system dynamics p(s', r | s, a) are known.
Another useful tool for visualizing these recursive relationships is the backup diagram.
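
To make the linear-system view concrete, here is a minimal Python/NumPy sketch under simplifying assumptions: a made-up 2-state, 2-action MDP specified by transition probabilities P[a, s, s'] and expected rewards R[a, s] (the expectation over r in p(s', r | s, a)), plus an arbitrary stochastic policy. It evaluates v_\pi exactly by solving the |\mathcal{S}| linear equations v = r_\pi + \gamma P_\pi v.

    import numpy as np

    # Hypothetical dynamics: P[a, s, s2] transition probabilities,
    # R[a, s] expected immediate reward for taking action a in state s.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
                  [[0.5, 0.5], [0.0, 1.0]]])   # action 1
    R = np.array([[1.0, 0.0],
                  [2.0, -1.0]])
    gamma = 0.9

    # An arbitrary fixed stochastic policy pi[s, a].
    pi = np.array([[0.7, 0.3],
                   [0.4, 0.6]])

    # Policy-averaged dynamics P_pi[s, s2] and rewards r_pi[s].
    P_pi = np.einsum('sa,ast->st', pi, P)
    r_pi = np.einsum('sa,as->s', pi, R)

    # Bellman equation in matrix form, v = r_pi + gamma * P_pi v,
    # solved exactly as a linear system.
    v_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

    # Action values via q_pi(s, a) = E[r] + gamma * E[v_pi(s')].
    q_pi = R + gamma * np.einsum('ast,t->as', P, v_pi)
    print(v_pi)      # state values under pi
    print(q_pi.T)    # action values, indexed [s, a]

For large state spaces this direct solve becomes impractical, which is exactly where the approximate methods of later chapters come in.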

Optimal Policies and Optimal Value Functions

Definition of a "better" policy: \pi \geq \pi' if and only if v_\pi (s) \geq v_{\pi'} (s) for all s \in \mathcal{S}.
For finite MDPs, there always exist optimal value functions v_*(s) and q_*(s, a), together with corresponding optimal policies (potentially more than one). Intuitively, if a policy is not optimal, we can improve the value of some state s by changing the policy at that state alone. The improvement in the value of s then propagates to the values of all states that can reach s in the state-transition graph. By repeating this argument, we can always obtain a better policy and gradually reach an optimal one.
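
The improvement argument above is the policy-improvement step made explicit. Here is a small sketch (same made-up MDP and policy shapes as in the evaluation example; none of the numbers come from the book) that evaluates a policy, constructs the policy that is greedy with respect to q_\pi, and re-evaluates it so you can check that the values are no worse in every state.

    import numpy as np

    def evaluate(P, R, gamma, pi):
        """Exact v_pi by solving the Bellman linear system."""
        P_pi = np.einsum('sa,ast->st', pi, P)
        r_pi = np.einsum('sa,as->s', pi, R)
        return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

    def greedy_improvement(P, R, gamma, pi):
        """Return a deterministic policy that is greedy w.r.t. q_pi."""
        v = evaluate(P, R, gamma, pi)
        q = R + gamma * np.einsum('ast,t->as', P, v)   # q_pi[a, s]
        greedy = np.zeros_like(pi)
        greedy[np.arange(pi.shape[0]), q.argmax(axis=0)] = 1.0
        return greedy

    # Hypothetical MDP and an arbitrary starting policy.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    R = np.array([[1.0, 0.0], [2.0, -1.0]])
    gamma = 0.9
    pi = np.array([[0.7, 0.3], [0.4, 0.6]])

    pi_new = greedy_improvement(P, R, gamma, pi)
    # Policy improvement theorem: v_{pi_new}(s) >= v_pi(s) for all s.
    print(evaluate(P, R, gamma, pi))
    print(evaluate(P, R, gamma, pi_new))
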
Based on the following simple relations,

    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a)  \rightarrow  v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a),
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big]  \rightarrow  q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big],

we obtain the Bellman optimality equations, which make no reference to any specific policy:

    v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big],
    q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big].
An optimal policy can then be derived by acting greedily: a one-step lookahead with respect to v_*, or simply an argmax over q_*(s, a).
Solving the Bellman optimality equation requires solving |\mathcal{S}| nonlinear equations, assuming fully known system dynamics and the Markov property. Even if these two assumptions are satisfied, solving the equations exactly is computationally infeasible when the state space is very large. Consequently, different RL methods mainly focus on how to solve the Bellman optimality equation approximately.
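
One standard way to solve the Bellman optimality equation approximately when the dynamics are known is value iteration: sweep the Bellman optimality backup until the values stop changing, then act greedily. A minimal sketch on the same made-up MDP as above:

    import numpy as np

    P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s2], action 0
                  [[0.5, 0.5], [0.0, 1.0]]])   # action 1
    R = np.array([[1.0, 0.0], [2.0, -1.0]])    # R[a, s] expected rewards
    gamma = 0.9

    v = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup: q(s, a), then max over actions.
        q = R + gamma * np.einsum('ast,t->as', P, v)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < 1e-8:
            break
        v = v_new

    greedy_policy = q.argmax(axis=0)   # one optimal action per state
    print(v_new)           # approximate v_*
    print(greedy_policy)   # greedy policy extracted from q_*

Later chapters of the RL book relax the requirement that p(s', r | s, a) be known by learning from sampled experience instead.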

Further notes:
The MDP formulation of RL makes it closely related to (stochastic) optimal control.

Reinforcement learning adds to MDPs a focus on approximation and incomplete information for realistically large problems.

The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states.
