《Budget Allocation using Weakly Coupled, Constrained Markov Decision Processes 》論文導(dǎo)讀

本文研究了在涉及大量同時運行的子進(jìn)程的順序決策問題中預(yù)算（或其他資源）的分配問題，其中這些子進(jìn)程之間唯一的交互是通過它們逐漸消耗預(yù)算（或所涉及的資源）。我們以廣告商與大量目標(biāo)客戶的交互為主要動機(jī)，考慮了兩個互補(bǔ)的問題，包括計算MDP價值函數(shù)作為可用預(yù)算的函數(shù)，以及構(gòu)建一個政策，規(guī)定給定系統(tǒng)的聯(lián)合狀態(tài)時跨所有子進(jìn)程采取的聯(lián)合政策屎蜓，受全局預(yù)算約束。

Abstract

We consider the problem of budget (or other resource) allocation in sequential decision prob- lems involving a large number of concurrently running sub-processes, whose only interaction is through their gradual consumption of budget (or the resource in question). We use the case of an advertiser interacting with a large population of target customers as a primary motivation. We consider two complementary problems that comprise our key contributions.
Our first contribution addresses the problem of computing MDP value functions as a function of the available budget. In contrast to standard constrained MDPs—which find optimal value functions given a fixed expected budget constraint—our aim is to assess the tradeoff between expected budget spent and the value attained when using that budget optimally. We show that optimal value functions are concave in budget. More importantly, in the finite-horizon case, we show there are a finite number of useful budget levels. This gives rise to piecewise-linear, concave value functions (piecewise-constant if we restrict to deterministic policies) with an representation that can be computed readily via dynamic programming. This representation also supports natural approximations. Our model not only allows the assessment of budget/value tradeoffs (e.g., to find the “sweet spot” in spend), but plays an important role in the allocation of budget across competing subprocesses.
Our second contribution is a method for constructing a policy that prescribes the joint policy to be taken across all sub-processes given the joint state of the system, subject to a global budget constraint. We cast the problem as a weakly coupled MDP in which budget is allocated online to the individual subprocesses based on its observed (initial) state and the subprocess-specific value function. We show that the budget allocation process can be cast as a multi-item, multiple-choice knapsack problem (MCKP), which admits an efficient greedy algorithm to determine optimal al- locations. We also discuss the possibility of online, per-stage re-allocation of budget to adaptively satisfy strict rather than expected budget constraints.

這篇論文主要研究在涉及大量同時運行的子進(jìn)程的順序決策問題中的預(yù)算（或其他資源）分配問題钥勋，這些子進(jìn)程之間唯一的交互是通過它們逐漸消耗預(yù)算（或所涉及的資源）炬转。作者以廣告商與大量目標(biāo)客戶互動的情況為主要動機(jī)。作者提出了兩個互補(bǔ)的問題笔诵，這是作者的主要貢獻(xiàn)返吻。

作者的第一個貢獻(xiàn)解決了計算MDP值函數(shù)的問題，這是一個關(guān)于可用預(yù)算的函數(shù)的問題乎婿。與標(biāo)準(zhǔn)的約束MDP不同测僵，標(biāo)準(zhǔn)約束MDP是在給定固定預(yù)算約束的情況下找到最優(yōu)值函數(shù)，而我們的目標(biāo)是評估在使用該預(yù)算時預(yù)期的預(yù)算支出和獲得的價值之間的權(quán)衡谢翎。我們證明了最優(yōu)值函數(shù)在預(yù)算方面是凸的捍靠。更重要的是，在有限時間內(nèi)森逮，我們證明有一定數(shù)量的有用預(yù)算水平榨婆。這產(chǎn)生了分段線性、凸的價值函數(shù)（如果我們限制為確定性策略褒侧，則為分段常數(shù)）良风，其表示可以通過動態(tài)規(guī)劃輕松計算。這種表示也支持自然的近似闷供。我們的模型不僅允許評估預(yù)算/價值的權(quán)衡（例如烟央，找到支出的“甜點”），而且在競爭子進(jìn)程之間分配預(yù)算方面起著重要作用歪脏。

作者的第二個貢獻(xiàn)是一種構(gòu)建策略的方法疑俭，該策略規(guī)定了在給定系統(tǒng)的聯(lián)合狀態(tài)下，對所有子進(jìn)程采取的聯(lián)合策略婿失，受全局預(yù)算約束的限制钞艇。我們將問題作為弱耦合MDP啄寡，其中預(yù)算根據(jù)其觀察到的（初始）狀態(tài)和子進(jìn)程特定的值函數(shù)在線分配給各個子進(jìn)程。我們證明了預(yù)算分配過程可以被視為多項哩照、多選背包問題（MCKP）挺物，該問題具有有效的貪心算法來確定最優(yōu)分配。我們還討論了在線葡秒、逐階段重新分配預(yù)算以適應(yīng)嚴(yán)格而不是預(yù)期預(yù)算約束的可能性姻乓。

總之嵌溢，這篇論文提出了一種新的方法來解決順序決策問題中的預(yù)算分配問題眯牧，這對于廣告商與大量目標(biāo)客戶互動的情況非常有用。作者提出了兩個互補(bǔ)的問題赖草，一個是計算MDP值函數(shù)学少，另一個是構(gòu)建策略。作者的方法可以幫助我們評估預(yù)算/價值的權(quán)衡秧骑，以及在競爭子進(jìn)程之間分配預(yù)算版确。舉個例子，這種方法可以應(yīng)用于廣告商在不同的廣告平臺上投放廣告的決策中乎折，以找到最優(yōu)的預(yù)算分配方案绒疗。

Introduction

Markov decision processes (MDPs) have found wide application throughout AI and provide firm foundations for decision-theoretic planning [10] and reinforcement learning [27]. In many domains, however, actions have costs or otherwise consume limited resources—in such cases, optimal policies must be constructed subject to resource constraints, a problem often best formulated using constrained MDPs (CMDPs) [2]. The complexity of these problems is exacerbated in domains where multiple independent or semi-independent MDPs must be controlled jointly, since the state and action spaces for the joint process typically consists of the cross-product of those from the individual subprocesses [32].
A key setting where both of these characteristics arise is online advertising. A number of models of online user engagement have been proposed that treat user behavior as a Markov process [15, 7]. Archak et al. [6], in particular, propose a constrained MDP model for the optimal allocation of an advertiser’s budget over an extended horizon that captures the sequential, cumulative, stochastic effect of multiple advertising touch points on a user’s behavior. Their model assumes that a fixed budget is predetermined for each user, and focuses on how to optimally advertise to a user, subject to that budget constraint. However, this formulation is not well-suited to advertisers who (a) have a global budget constraint (rather than a per-user constraint); and (b) are free to allocate that budget differentially and adaptively across users of different types, or users in different states of the underlying MDP. We instead model this as a joint optimization problem in which a joint policy across all users must be constructed. In particular, we consider the optimal, global, dynamic allocation of budget across all users in the target market, removing the need to pre-specify a per-user budget.
Using this problem as motivation, we consider a natural decomposition of this type of resource allocation problem as a weakly coupled MDP [32]. In this model, each “l(fā)ocal” MDP (corresponding to a customer or customer type) is solved separately, and the solutions composed into a joint pol- icy. Within this setting, our first contribution relates to the solution of the local MDPs themselves. One standard approach to dealing with resource constraints, such as budget limits, is CMDPs [2], in which actions consume one or more resources (e.g., budget, power) and optimal policies are sought that use no more than some pre-defined level of each resource in expectation. CMDPs have been applied in online advertising (see above), robotics, and other domains [12, 19, 18]. While CMDPs are valuable models, one weakness is their required a priori specification of fixed “budget” or limit on each resource—in our example, this would be, say, a daily monetary budget cap on ad spend for each customer or across a specific ad network [7, 6, 4]. However, in many cases, determining a suitable budget depends on the tradeoff between ad spend and expected value. In our motivating example, in particular, the budget to be allocated to customers of a specific type, or in a specific MDP state, must be traded off against that allocated to other types or states.

這篇論文主要研究在涉及大量同時運行的子進(jìn)程的順序決策問題中的預(yù)算（或其他資源）分配問題，這些子進(jìn)程之間唯一的交互是通過它們逐漸消耗預(yù)算（或所涉及的資源）骂澄。作者以廣告商與大量目標(biāo)客戶互動的情況為主要動機(jī)吓蘑。作者提出了兩個互補(bǔ)的問題，一個是計算MDP值函數(shù)坟冲，另一個是構(gòu)建策略磨镶。作者的方法可以幫助我們評估預(yù)算/價值的權(quán)衡，以及在競爭子進(jìn)程之間分配預(yù)算健提。舉個例子琳猫，這種方法可以應(yīng)用于廣告商在不同的廣告平臺上投放廣告的決策中，以找到最優(yōu)的預(yù)算分配方案私痹。

在介紹中脐嫂，作者指出了在許多領(lǐng)域中，行動具有成本或以其他方式消耗有限資源紊遵，因此必須在資源約束下構(gòu)建最優(yōu)策略账千，這通常最好使用約束MDP（CMDP）進(jìn)行制定。當(dāng)多個獨立或半獨立的MDP必須聯(lián)合控制時癞蚕，這些問題的復(fù)雜性會加劇蕊爵。作者以在線廣告為例，提出了一種弱耦合MDP的自然分解方法桦山，其中每個“本地”MDP（對應(yīng)于客戶或客戶類型）都單獨解決攒射，并將解決方案組合成聯(lián)合策略醋旦。作者的第一個貢獻(xiàn)是解決本地MDP的解決方案，其中行動消耗一個或多個資源（例如会放，預(yù)算饲齐、功率），并尋求使用每個資源的預(yù)定義水平的最優(yōu)策略咧最。作者的第二個貢獻(xiàn)是一種構(gòu)建策略的方法捂人，該策略規(guī)定了在給定系統(tǒng)的聯(lián)合狀態(tài)下，對所有子進(jìn)程采取的聯(lián)合策略矢沿，受全局預(yù)算約束的限制滥搭。

這篇論文主要介紹了在涉及有限資源的情況下，如何構(gòu)建最優(yōu)策略的問題捣鲸。在許多領(lǐng)域中瑟匆，行動具有成本或以其他方式消耗有限資源，因此必須在資源約束下構(gòu)建最優(yōu)策略栽惶，這通常最好使用約束MDP（CMDP）進(jìn)行制定愁溜。當(dāng)多個獨立或半獨立的MDP必須聯(lián)合控制時，這些問題的復(fù)雜性會加劇外厂。作者以在線廣告為例冕象，提出了一種弱耦合MDP的自然分解方法，其中每個“本地”MDP（對應(yīng)于客戶或客戶類型）都單獨解決汁蝶，并將解決方案組合成聯(lián)合策略渐扮。作者的第一個貢獻(xiàn)是解決本地MDP的解決方案，其中行動消耗一個或多個資源（例如穿仪，預(yù)算席爽、功率），并尋求使用每個資源的預(yù)定義水平的最優(yōu)策略啊片。作者的第二個貢獻(xiàn)是一種構(gòu)建策略的方法只锻，該策略規(guī)定了在給定系統(tǒng)的聯(lián)合狀態(tài)下，對所有子進(jìn)程采取的聯(lián)合策略紫谷，受全局預(yù)算約束的限制齐饮。

在廣告投放領(lǐng)域，已經(jīng)有很多關(guān)于用戶行為的馬爾可夫過程模型笤昨，但這些模型都是基于用戶行為的預(yù)算限制祖驱，而不是廣告商的全局預(yù)算限制。因此瞒窒，作者提出了一種聯(lián)合優(yōu)化問題的模型捺僻，該模型需要構(gòu)建一個適用于所有用戶的聯(lián)合策略，而不是預(yù)先指定每個用戶的預(yù)算。作者的方法可以幫助我們評估預(yù)算/價值的權(quán)衡匕坯，以及在競爭子進(jìn)程之間分配預(yù)算束昵。

作者提出的方法可以應(yīng)用于廣告商在不同的廣告平臺上投放廣告的決策中，以找到最優(yōu)的預(yù)算分配方案葛峻。在許多領(lǐng)域中锹雏，行動具有成本或以其他方式消耗有限資源，因此必須在資源約束下構(gòu)建最優(yōu)策略术奖，這通常最好使用約束MDP（CMDP）進(jìn)行制定礁遵。當(dāng)多個獨立或半獨立的MDP必須聯(lián)合控制時，這些問題的復(fù)雜性會加劇采记。

We propose a model called budgeted MDPs (BMDPs) in which, like CMDPs, actions consume resources and can only be taken when the required resources are available. Unlike CMDPs, we do not place an a priori limit on resources, but instead solve the for all possible resource levels, allowing one to explore the tradeoff between (optimal) expected value and allocated resources. Note that we could formulate this problem by embedding the available resource levels in the state space, creating a “hybrid state” with one continuous dimension per resource. However, typically one chooses a budget(or appropriate resource level) when initiating a policy—in our approach, after solving the budgeted MDP to assess value/budget tradeoffs. As a consequence, we instead treat the budget as a separate parameter of the value function (VF) and policy, and analyze the structure of optimal VFs w.r.t. budget. Specifically, for finite state and action BMDPs, we show that for any fixed state s: (a) the optimal VF V (s, b) is concave in available budget b; and (b) the optimal VF for any finite horizon is piecewise- linear and concave (PWLC) in budget, and is defined by a finite set of useful resource levels (the VF is piecewise constant if policies are deterministic). We present a dynamic programming (DP) algorithm to compute this PWLC VF representation, one that readily supports approximation. This approach is well-suited not only to multi-task problems, like our ad domain, but more generally to any constrained MDP where the appropriate level of resource to allocate depends on the value obtainable (e.g., the “sweet spot” in spend) and may not be easily determined a priori. Using the VF to inform the selection of an appropriate budget, one automatically determines the optimal policy for that (expected) spend level—as if one had solved the corresponding CMDP for that budget.
Our second contribution is a method for piecing together the solutions of the individual BMDPs to determine a joint policy, subject to a global resource/budget constraint. Since the MDP is weakly coupled [32]—specifically, the individual customer MDPs evolve independently, linked only through the consumption of shared budget—our primary aim is to determine an appropriate allocation of budget to each customer, which is turn dictates the optimal policy for that customer. We show that, given the form of BMDP optimal value functions, the budget allocation problem can be formulated as a multiple-choice knapsack problem (MCKP) [40], for which a straightforward greedy optimization can be used to construct the optimal online allocation of budget. We also discuss circumstances in which the dynamic, per-stage reallocation of budget may be valuable—in such cases, the offline solution of the underlying BMDPs and the greedy nature of the online allocation algorithm admit fast, real-time response.
The paper is organized as follows. We discuss related work on ad budget optimization and con- strained and weakly coupled MDPs in Sec. 2. We outline our basic constrained weakly coupled MDP model in Sec. 3. In Sec. 4, we introduce budgeted MDPs (BMDPs) as a model for the local (e.g., single-customer) MDPs that make up the weakly coupled model. Unlike CMDPs, the aim is to de- termine VFs and policies as a function of available budget. In this section we describe the PWLC structure of optimal VFs for BMDPs; introduce a DP algorithm for computing VFs and policies; de- scribe methods for approximation and provide an analysis of approximation error; describe how to compute the variance in expected spend induced by the optimal policy; and provide empirical results on both synthetic MDPs and MDPs derived from the data of a large advertiser. In Sec. 5, we describe how to exploit BMDP solutions to allocate a global budget to multiple independent local MDPs, linked only by their budget consumption. We formulate the problem as a multiple-choice knapsack problem (MCKP) and describe a simple greedy algorithm for optimal budget allocation. We show that this way of exploiting BMDP solutions provides much better results than adopting a static or fixed per- user budget. We also propose a dynamic budget reallocation method that reallocates budget as each CMDP process evolves. This reduces variance in the global spend, ensures guaranteed budget con- straint satisfaction rather than expected satisfaction and, in settings with tightly constrained budgets, can offer better expected value through budget redistribution.

本文提出了一種名為預(yù)算馬爾可夫決策過程（Budgeted MDPs佣耐，BMDPs）的模型，類似于約束馬爾可夫決策過程（Constrained MDPs挺庞，CMDPs）晰赞，動作會消耗資源，只有在所需資源可用時才能執(zhí)行选侨。與CMDPs不同的是，我們不會事先限制資源然走，而是解決所有可能的資源水平援制，以探索（最優(yōu)）期望價值和分配資源之間的權(quán)衡。請注意芍瑞，我們可以通過將可用資源水平嵌入狀態(tài)空間中晨仑，創(chuàng)建一個具有每個資源的連續(xù)維度的“混合狀態(tài)”，來制定此問題拆檬。然而洪己，通常在啟動策略時選擇預(yù)算（或適當(dāng)?shù)馁Y源水平）- 在我們的方法中，在解決預(yù)算MDP以評估價值/預(yù)算權(quán)衡之后竟贯。因此答捕，我們將預(yù)算視為價值函數(shù)（VF）和策略的單獨參數(shù)，并分析最優(yōu)VF的結(jié)構(gòu)與預(yù)算有關(guān)屑那。具體而言拱镐，對于有限狀態(tài)和動作的BMDPs，我們證明對于任何固定狀態(tài)s：（a）可用預(yù)算b的最優(yōu)VF V（s持际，b）是凸的沃琅；（b）任何有限時間段的最優(yōu)VF都是分段線性和凸的（PWLC），并且由一組有用的資源水平定義（如果策略是確定性的蜘欲，則VF是分段常數(shù)）益眉。我們提出了一種動態(tài)規(guī)劃（DP）算法來計算這種PWLC VF表示，該算法可以輕松支持近似。這種方法不僅非常適合多任務(wù)問題郭脂，例如我們的廣告領(lǐng)域空繁，而且更普遍地適用于任何約束MDP，其中分配的適當(dāng)資源水平取決于可獲得的價值（例如朱庆，花費的“甜點”）并且可能不容易事先確定盛泡。使用VF來指導(dǎo)選擇適當(dāng)?shù)念A(yù)算，可以自動確定該（預(yù)期）支出水平的最優(yōu)策略-就像已經(jīng)解決了相應(yīng)的CMDP一樣娱颊。

我們的第二個貢獻(xiàn)是一種方法傲诵，將單個BMDP的解決方案拼接在一起，以確定受全局資源/預(yù)算約束的聯(lián)合策略箱硕。由于MDP是弱耦合的[32] -具體而言拴竹，單個客戶MDP獨立演變，僅通過共享預(yù)算的消耗相連-我們的主要目標(biāo)是確定每個客戶的預(yù)算分配剧罩，這反過來決定了該客戶的最優(yōu)策略栓拜。我們展示了，鑒于BMDP最優(yōu)價值函數(shù)的形式惠昔，預(yù)算分配問題可以被制定為多項選擇背包問題（MCKP）[40]幕与，可以使用簡單的貪心優(yōu)化來構(gòu)建預(yù)算的最優(yōu)在線分配。我們還討論了動態(tài)的镇防，每個階段的預(yù)算重新分配可能有價值的情況-在這種情況下啦鸣，基于離線BMDP的解決方案和在線分配算法的貪心性質(zhì)，可以實現(xiàn)快速的實時響應(yīng)来氧。

本文的結(jié)構(gòu)如下诫给。我們在第2節(jié)中討論了廣告預(yù)算優(yōu)化和約束和弱耦合MDP的相關(guān)工作。我們在第3節(jié)中概述了我們的基本約束弱耦合MDP模型啦扬。在第4節(jié)中中狂，我們介紹了預(yù)算MDPs（BMDPs）作為組成弱耦合模型的本地（例如單個客戶）MDPs的模型。與CMDPs不同扑毡，目的是根據(jù)可用預(yù)算確定VF和策略胃榕。在本節(jié)中，我們描述了BMDP最優(yōu)VF的PWLC結(jié)構(gòu)僚楞；介紹了計算VF和策略的DP算法勤晚；描述了逼近方法并提供逼近誤差分析；描述了如何計算最優(yōu)策略引起的預(yù)期支出方差泉褐；并提供了關(guān)于合成MDP和大型廣告商數(shù)據(jù)的合成MDP的實證結(jié)果赐写。在第5節(jié)中，我們描述了如何利用BMDP解決方案將全局預(yù)算分配給多個獨立的本地MDP膜赃，僅通過其預(yù)算消耗相連挺邀。我們將問題制定為多項選擇背包問題（MCKP），并提出了一種貪心優(yōu)化算法來構(gòu)建預(yù)算的最優(yōu)在線分配。我們還討論了動態(tài)的端铛，每個階段的預(yù)算重新分配的潛在好處泣矛，以及如何使用BMDP解決方案有效地實現(xiàn)它。最后禾蚕，在第6節(jié)中您朽，我們總結(jié)了本文，并討論了未來的研究方向换淆。

Background and Related Work

Budget Optimization in Online Advertising

Optimal allocation of advertising budgets is addressed at length in the marketing literature. Tradi- tional concerns include how to allocate budget to different segments of the target market [41], and exploiting the different stochastic responses predicted from such segments [26]. Budget allocation is often modeled as the control of a continuous dynamical system [21], where audience response (e.g., purchases) to control actions (e.g., ad spending) are used to justify policies such as pulsing. User behavior is also sometimes modeled as a discrete Markov process (reflecting consumer demand, ad response, ad carryover effects, etc.) [12, 34].
A variety of mechanisms to support online advertising have been studied (and successfully de- ployed), including auctions for sponsored search [44, 20]. In this context, advertiser budget optimiza- tion is critical to maximizing impact and ROI in keyword auctions, leading to a variety of propos- als for optimal bidding strategies under various models and constraints, including budget constraints [8, 23, 33, 24, 17, 28]. Even more expressive methods (e.g., for entire ad campaigns) have been proposed for both search and display advertising in online settings [11, 29].
By contrast, relatively little work has considered budget optimization in an online setting that is sensitive to user behavior as it extends over time. Charikar et al. [15] assume the web surfing behav- ior of different user types can be modeled as a Markov process and construct policies to optimally allocate ad spend at different states (e.g., web sites) to maximize return (e.g., conversions), possibly as a function of user history. Archak et al. [6] tackle a related problem, offering a somewhat different model, but still assuming Markovian user behavior (here in a sponsored search setting). They propose a constrained MDP model that can be used to determine the optimal ad spend for a given user, assum- ing an a priori fixed budget for that user. It is this model we adopt for our approach below, though we do not fix a per-user budget. Interestingly, the same authors [7] analyze user behavior empiri- cally, demonstrating that users in a search context exhibit what might be called a “general to specific” search behavior that is approximately Markovian, and suggest that myopic click-through optimiza- tion will fail to optimize spend (hence motivating their MDP approach [6]). Amin et al. [4] develop an MDP model of the budget optimization process itself, formulating the problem facing the online advertiser—who is uncertain about the winning bid distribution—as one of learning under censored observations. Unlike the models above (and unlike our work), this work does not model Markovian user behavior.

這篇論文主要討論了如何在在線廣告投放中哗总，通過考慮用戶行為的影響，進(jìn)行預(yù)算的優(yōu)化分配倍试。傳統(tǒng)的廣告預(yù)算分配問題主要關(guān)注如何將預(yù)算分配到目標(biāo)市場的不同細(xì)分領(lǐng)域讯屈，并利用這些領(lǐng)域的不同隨機(jī)響應(yīng)來制定策略。廣告預(yù)算分配通常被建模為連續(xù)動態(tài)系統(tǒng)的控制县习，其中觀眾對控制行為（例如廣告支出）的響應(yīng)（例如購買）被用來證明脈沖等策略的合理性涮母。用戶行為有時也被建模為離散的馬爾可夫過程（反映消費者需求、廣告響應(yīng)躁愿、廣告延續(xù)效應(yīng)等）叛本。

在網(wǎng)絡(luò)廣告中，已經(jīng)研究（并成功部署）了各種支持機(jī)制攘已，包括贊助搜索的拍賣炮赦。在這種情況下，廣告主預(yù)算優(yōu)化對于在關(guān)鍵字拍賣中最大化影響和投資回報至關(guān)重要样勃，因此提出了各種在不同模型和約束條件下的最優(yōu)出價策略，包括預(yù)算約束性芬。對于整個廣告活動峡眶，也提出了更具表現(xiàn)力的方法，包括在線搜索和展示廣告植锉。

與此相比辫樱，相對較少的工作考慮了在線環(huán)境中的預(yù)算優(yōu)化問題，該問題對用戶行為的敏感性隨時間延伸俊庇。Charikar等人假設(shè)不同用戶類型的網(wǎng)絡(luò)瀏覽行為可以建模為馬爾可夫過程狮暑，并構(gòu)建策略以在不同狀態(tài)（例如網(wǎng)站）上最優(yōu)地分配廣告支出，以最大化回報（例如轉(zhuǎn)化）辉饱，可能作為用戶歷史的函數(shù)搬男。Archak等人解決了一個相關(guān)的問題，提供了一個略微不同的模型彭沼，但仍然假設(shè)馬爾可夫用戶行為（這里是在贊助搜索設(shè)置中）缔逛。他們提出了一個受限制的MDP模型，可用于確定給定用戶的最佳廣告支出，假設(shè)預(yù)先確定了該用戶的預(yù)算褐奴。這就是我們下面采用的模型按脚，盡管我們沒有固定每個用戶的預(yù)算。有趣的是敦冬，同樣的作者通過實證分析用戶行為辅搬，證明了在搜索環(huán)境中用戶表現(xiàn)出近似馬爾可夫的“從一般到特定”的搜索行為，并建議短視的點擊率優(yōu)化將無法優(yōu)化支出（從而推動他們的MDP方法）脖旱。Amin等人開發(fā)了一個預(yù)算優(yōu)化過程的MDP模型堪遂，將在線廣告商面臨的問題（不確定獲勝出價分布）形式化為在被審查的觀察下學(xué)習(xí)的問題。與上述模型（以及我們的工作）不同夯缺，這項工作沒有對馬爾可夫用戶行為進(jìn)行建模蚤氏。

Li et al. [30] also consider Markov models in behavioral and contextual advertising. More recent work has considered the application of reinforcement learning methods to determine appropriate ad- vertising and offer strategies to large populations of customers, learning optimal policies from data generated by online interactions with users [38, 43]. These methods are similar in their underlying user modeling assumptions, but attempt to learn policies from data and do not account for budget constraints or tradeoffs, whereas we assume an underlying transition model is available and optimize policies relative to these budget tradeoffs.

Li等人在行為和上下文廣告中也考慮了馬爾可夫模型。更近期的工作考慮了將強(qiáng)化學(xué)習(xí)方法應(yīng)用于確定適當(dāng)?shù)膹V告和優(yōu)惠策略踊兜，以大規(guī)母捅酰客戶群體為目標(biāo)，從與用戶的在線交互生成的數(shù)據(jù)中學(xué)習(xí)最優(yōu)策略捏境。這些方法在其基本的用戶建模假設(shè)上類似于游，但試圖從數(shù)據(jù)中學(xué)習(xí)策略，而不考慮預(yù)算約束或權(quán)衡垫言，而我們假設(shè)一個基本的轉(zhuǎn)移模型可用贰剥，并相對于這些預(yù)算權(quán)衡優(yōu)化策略。

Constrained MDPs

An important extension of the standard MDP model is offered by constrained MDPs (CMDPs) [2], in which actions consume one or more resources (e.g., monetary budget, power) and optimal policies are sought that use no more than some pre-defined level of each resource in expectation. CMDPs have been applied in online advertising, mobile robotics, and other domains within AI and OR [12, 19, 18], and techniques that exploit the structure of CMDPs for effective solution have received some attention [2, 19, 18]. We briefly define MDPs and the basic (one-dimensional) constrained extension.

這篇論文講的是一種叫做約束馬爾可夫決策過程（CMDP）的模型筷频，它是標(biāo)準(zhǔn)馬爾可夫決策過程（MDP）模型的一種重要擴(kuò)展蚌成。在CMDP中，每個行動都會消耗一個或多個資源凛捏，比如說預(yù)算担忧、能量等等。我們希望找到最優(yōu)策略坯癣，使得每種資源的使用量不超過預(yù)先設(shè)定的閾值瓶盛。CMDP已經(jīng)被應(yīng)用于在線廣告、移動機(jī)器人等人工智能和運籌學(xué)領(lǐng)域的研究中示罗。一些技術(shù)已經(jīng)開始利用CMDP的結(jié)構(gòu)來有效地解決問題惩猫。在論文中，作者還簡要介紹了MDP和基本的（一維）約束擴(kuò)展蚜点。

Weakly Coupled MDPs

The decomposition of MDPs into independent or semi-independent processes can often be used to mitigate the curse of dimensionality by reducing the size of the MDP state and action space. Chal- lenges lie in discovering a suitable decomposition structure and in determining how best to use the sub-process solutions to construct a (generally approximate) global solution.
There have been a number of approaches developed that have this flavor, in both standard MDPs and decentralized (DEC) MDPs and POMDPs [39, 39, 1, 42, 31, 19, 45, 3, 35, 14]. The approach most related to ours is the decomposition method for weakly-coupled MDPs of Meuleau et al. [32]. In their model, a joint fully observable MDP is comprised of a number of almost completely independent subprocesses, each itself a fully observable MDP (a local MDP). Each MDP reflects the task or ob- jectives of a specific agent, but the local policies themselves require resources, both consumable (e.g., fuel or money, that, once used in one local policy, are unavailable for future use in that or any other local MDP), and non-consumable (e.g., equipment, that can be reused multiple times and swapped across local MDPs, but can only be used in one local MDP at a time). The method of Meuleau et al. [32] solves the local MDPs independently to produce local value functions and policies that are parameterized by the resources available. these local value functions then inform a simple greedy algorithm that assigns resources to each local MDP. Finally, unconsumed resources at each stage of the process are reassigned depending on the updated observed state of the local MDPs.
Our approach to budget decomposition is similar, but has some key differences. We use the more standard expected budget constraint, rather than a guaranteed budget constraint to determine the optimal policy—this requires the development of new DP techniques (as opposed to the use of standard value or policy iteration [32]). We also show that our resource (budget) allocation algorithm is optimal. Our dynamic budget reallocation scheme is derived from the reallocation mechanism of Meuleau et al. [32].

這段內(nèi)容主要講的是將MDP分解成獨立或半獨立的子過程轧房，可以減小MDP狀態(tài)和動作空間的大小，以緩解維度災(zāi)難禽额。但是锯厢，如何找到合適的分解結(jié)構(gòu)皮官，并確定如何最好地使用子過程的解來構(gòu)建（通常是近似的）全局解，是一個挑戰(zhàn)实辑。已經(jīng)有一些方法被開發(fā)出來捺氢，既適用于標(biāo)準(zhǔn)MDP，也適用于分散式（DEC）MDP和POMDP [39, 39, 1, 42, 31, 19, 45, 3, 35, 14]剪撬。與我們最相關(guān)的方法是Meuleau等人提出的弱耦合MDP的分解方法[32]摄乒。在他們的模型中，一個聯(lián)合完全可觀察的MDP由許多幾乎完全獨立的子過程組成残黑，每個子過程本身都是一個完全可觀察的MDP（本地MDP）馍佑。每個MDP反映了特定代理的任務(wù)或目標(biāo)，但本地策略本身需要資源梨水，包括可消耗的資源（例如燃料或資金拭荤，在一個本地策略中使用后，在該本地MDP或任何其他本地MDP中都無法再次使用）和不可消耗的資源（例如設(shè)備疫诽，可以多次重復(fù)使用并在本地MDP之間交換舅世，但一次只能在一個本地MDP中使用）。Meuleau等人的方法[32]獨立解決本地MDP奇徒，以產(chǎn)生由可用資源參數(shù)化的本地值函數(shù)和策略雏亚。這些本地值函數(shù)然后通知一個簡單的貪婪算法，將資源分配給每個本地MDP摩钙。最后罢低，在過程的每個階段重新分配未消耗的資源，取決于本地MDP的更新觀察狀態(tài)胖笛。我們的預(yù)算分解方法類似网持，但有一些關(guān)鍵的差異。我們使用更標(biāo)準(zhǔn)的預(yù)期預(yù)算約束长踊，而不是保證預(yù)算約束來確定最優(yōu)策略翎碑，這需要開發(fā)新的DP技術(shù)（而不是使用標(biāo)準(zhǔn)的值或策略迭代[32]）。我們還展示了我們的資源（預(yù)算）分配算法是最優(yōu)的之斯。我們的動態(tài)預(yù)算重新分配方案是從Meuleau等人的重新分配機(jī)制中推導(dǎo)出來的[32]。

MDPs for Large-scale Budget Allocation

We begin by outlining a simple MDP model for engagement with users from a large, heterogeneous population, using a sequence of potentially costly actions. We use direct-response advertising as our main motivating example, though our techniques apply to control of any large, distributed collection of fully observable MDPs where actions consume limited resources. We note that our model abstracts away a number of factors that arise in realistic ad domains (e.g., partial or intermittent observability of user responses, hidden user state, coarse-grained control and lagging effects, incentives and game- theoretic interactions) in order to focus on the essence of budget allocation.
We assume an advertiser has a fixed budget with which to influence a target population of users though the use of various advertising actions. Each action taken by the advertiser has a specific cost and is targeted to a specific user (e.g., a textual or graphical ad in response to a search query, an in-app or web page graphical/video ad, direct mail to a household).3
Users respond to advertising actions stochastically, and in a way that may depend not only on user features (e.g., demographics), but also on past actions to which they have been subjected (e.g., previous ad exposures or solicitations) and past responses (e.g., click through or purchase behavior). Following Archak et al. [6], we model this situation as an MDP in the following fashion. We assume a finite set of users i ≤ M. Generally speaking, users may be segmented into types that reflect static, observable characteristics that influence their response to advertising. For ease of exposition, we assume all users have the same type; but the extension to multiple types is straightforward, and aspects of the model requiring modification for multiple types will be indicated as they arise.
We assume a finite set S of user states j ≤ N. At any time, user i is some state s[i] ∈ S. Let S = SM be the joint state space, with the joint state of all users denoted s = ?s[1],...,s[M]? ∈ S. Intuitively, the user state captures all relevant user characteristics and history that could alter her re- sponse to an advertising action. S could be relatively small or may be quite large in certain contexts (e.g., the most recent keyword search on which the advertiser bid, or some sufficient statistics summa- rizing historical interactions and user responses). A finite set A of advertising actions is available: for ease of presentation, we assume that any a ∈ A can be applied to a user in any state (subject to cost constraints, discussed below). Accommodating state-dependent constraints on the feasible action set is straightforward (as in any MDP model). At each stage, the advertiser selects an action a[i] to apply to user i. Letting A = AM, a joint action is a = ?a[1],...,a[M]? ∈ A.4
The stochastic response of a user to an action is captured by the transition model P : S × A → ?(S ), with P (s, a, t) denoting the probability that a user in state s moves to state t when subjected to action a.5 The reward model R(s, a) reflects the costs and payoffs when an advertiser applies action a to a user in state s. We decompose reward as R(s, a) = U (s) ? C (s, a): cost model C reflects action costs (e.g., cost of placing an ad, or potential annoyance factor and associated lost revenue); and utility function U reflects benefits/payoffs (e.g., revenue from sales, value of brand exposure, etc.).

本文首先提出了一個簡單的MDP模型遣铝，用于處理與大型異構(gòu)人群的用戶互動佑刷，使用一系列可能具有成本的行動。我們以直接響應(yīng)廣告為主要的動機(jī)示例酿炸，盡管我們的技術(shù)適用于控制任何大型瘫絮、分布式的完全可觀察的MDP集合，其中行動消耗有限的資源填硕。我們注意到麦萤，我們的模型抽象了現(xiàn)實廣告領(lǐng)域中出現(xiàn)的許多因素（例如鹿鳖，用戶響應(yīng)的部分或間歇性可觀察性、隱藏的用戶狀態(tài)壮莹、粗粒度控制和滯后效應(yīng)翅帜、激勵和博弈理論交互），以便專注于預(yù)算分配的本質(zhì)命满。

我們假設(shè)廣告商有一個固定的預(yù)算涝滴，用于通過使用各種廣告行動來影響目標(biāo)用戶群體。廣告商采取的每個行動都有特定的成本胶台，并針對特定的用戶（例如歼疮，響應(yīng)搜索查詢的文本或圖形廣告、應(yīng)用程序或網(wǎng)頁圖形/視頻廣告诈唬、直接郵寄到家庭）韩脏。用戶隨機(jī)地對廣告行動做出反應(yīng)，這種反應(yīng)可能不僅取決于用戶特征（例如铸磅，人口統(tǒng)計學(xué)）赡矢，還取決于他們曾經(jīng)遭受過的過去行動（例如，以前的廣告曝光或征求）和過去的反應(yīng)（例如愚屁，點擊或購買行為）济竹。根據(jù)Archak等人的方法，我們將這種情況建模為MDP霎槐。我們假設(shè)有一個有限的用戶集合i≤M送浊。一般來說，用戶可以被分成類型丘跌，反映出影響他們對廣告反應(yīng)的靜態(tài)袭景、可觀察特征。為了便于闡述闭树，我們假設(shè)所有用戶都具有相同的類型耸棒；但是，擴(kuò)展到多個類型是簡單的报辱，并且需要修改的模型方面將在出現(xiàn)時指出与殃。

我們假設(shè)有一個用戶狀態(tài)的有限集合S，其中j≤N碍现。在任何時候幅疼，用戶i是某個狀態(tài)s[i]∈S。讓S=SM成為聯(lián)合狀態(tài)空間昼接，所有用戶的聯(lián)合狀態(tài)表示為s=?s[1]爽篷，...，s[M]?∈S慢睡。直觀地說逐工，用戶狀態(tài)捕捉所有相關(guān)的用戶特征和歷史铡溪，這些特征和歷史可能會改變她對廣告行動的反應(yīng)。在某些情況下泪喊，S可能相對較小棕硫，或者可能相當(dāng)大（例如，廣告商出價的最新關(guān)鍵字搜索窘俺，或者總結(jié)歷史互動和用戶反應(yīng)的足夠統(tǒng)計數(shù)據(jù)）饲帅。有一個廣告行動的有限集合A可用：為了方便起見，我們假設(shè)任何a∈A都可以應(yīng)用于任何狀態(tài)的用戶（受成本約束的限制討論如下）瘤泪。適應(yīng)狀態(tài)依賴性約束的可行行動集合是簡單的（如任何MDP模型中一樣）灶泵。在每個階段，廣告商選擇一個要應(yīng)用于用戶i的行動a[i]对途。讓A=AM成為聯(lián)合行動赦邻，聯(lián)合行動為a=?a[1]，...实檀，a[M]?∈A惶洲。

用戶對行動的隨機(jī)響應(yīng)由轉(zhuǎn)移模型P：S×A→?（S）捕獲，其中P（s膳犹，a恬吕，t）表示當(dāng)受到行動a影響時，處于狀態(tài)s的用戶移動到狀態(tài)t的概率须床。獎勵模型R（s铐料，a）反映了廣告商將行動a應(yīng)用于狀態(tài)s的用戶時的成本和回報。我們將獎勵分解為R（s豺旬，a）=U（s）-C（s钠惩，a）：成本模型C反映了行動成本（例如，放置廣告的成本族阅，或潛在的煩惱因素和相關(guān)的失去收入）篓跛；效用函數(shù)U反映了收益/回報（例如，銷售收入坦刀、品牌曝光的價值等）愧沟。

Solving Budgeted MDPs

We first formulate budgeted MDPs, a variant of CMDPs in which budgets (or other resources) are modeled so that decisions can be made readily about the level of budget to provide. The budget is implicitly treated as a component of the state, and value functions and policies are constructed that vary with both the state and available budget. We first outline the model, and then develop key intuitions regarding VF structure by considering deterministic policies. We move on to discuss stochastic policies—where randomization is much more natural than in the CMDP model—and show that these give rise to value functions that have computationally useful structural properties. This structure can be exploited in a simple dynamic programming (DP) algorithm, and gives rise to natural approximations. We conclude the section with some computational experiments to show the efficacy of our DP method.

這篇論文首先提出了一種叫做“budgeted MDPs”的CMDPs變體，其中預(yù)算（或其他資源）被建模鲤遥，以便可以輕松地做出關(guān)于提供預(yù)算水平的決策央渣。預(yù)算被隱式地視為狀態(tài)的一個組成部分，并構(gòu)建了隨著狀態(tài)和可用預(yù)算變化的價值函數(shù)和策略渴频。我們首先概述了模型，然后通過考慮確定性策略來發(fā)展關(guān)于VF結(jié)構(gòu)的關(guān)鍵直覺北启。我們繼續(xù)討論隨機(jī)策略——在CMDP模型中卜朗，隨機(jī)化比在這里更自然——并展示了這些策略產(chǎn)生的價值函數(shù)具有計算上有用的結(jié)構(gòu)性質(zhì)拔第。這種結(jié)構(gòu)可以在一個簡單的動態(tài)規(guī)劃（DP）算法中被利用，并產(chǎn)生自然的近似值场钉。我們通過一些計算實驗來結(jié)束這一部分蚊俺，以展示我們DP方法的功效。

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末逛万，一起剝皮案震驚了整個濱河市泳猬，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌宇植，老刑警劉巖得封，帶你破解...
沈念sama閱讀 219,188評論 6贊 508
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異指郁，居然都是意外死亡忙上，警方通過查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 93,464評論 3贊 395
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門闲坎，熙熙樓的掌柜王于貴愁眉苦臉地迎上來疫粥，“玉大人，你說我怎么就攤上這事腰懂」４” “怎么了？”我有些...
開封第一講書人閱讀 165,562評論 0贊 356
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵绣溜，是天一觀的道長慷彤。經(jīng)常有香客問我，道長涮毫，這世上最難降的妖魔是什么瞬欧？我笑而不...
開封第一講書人閱讀 58,893評論 1贊 295
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮罢防，結(jié)果婚禮上艘虎，老公的妹妹穿的比我還像新娘。我一直安慰自己咒吐，他們只是感情好野建，可當(dāng)我...
茶點故事閱讀 67,917評論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著恬叹，像睡著了一般候生。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上绽昼，一...
開封第一講書人閱讀 51,708評論 1贊 305
城市分裂傳說
那天唯鸭，我揣著相機(jī)與錄音，去河邊找鬼硅确。笑死目溉，一個胖子當(dāng)著我的面吹牛明肮，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播缭付，決...
沈念sama閱讀 40,430評論 3贊 420
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼柿估，長吁一口氣：“原來是場噩夢啊……” “哼！你這毒婦竟也來了陷猫？” 一聲冷哼從身側(cè)響起秫舌，我...
開封第一講書人閱讀 39,342評論 0贊 276
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤，失蹤者是張志新（化名）和其女友劉穎绣檬，沒想到半個月后足陨，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,801評論 1贊 317
?護(hù)林員之死
正文獨居荒郊野嶺守林人離奇死亡河咽，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 37,976評論 3贊 337
?白月光啟示錄
正文我和宋清朗相戀三年钠右，在試婚紗的時候發(fā)現(xiàn)自己被綠了梭姓。大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片牵囤。...
茶點故事閱讀 40,115評論 1贊 351
活死人
序言：一個原本活蹦亂跳的男人離奇死亡，死狀恐怖葱轩，靈堂內(nèi)的尸體忽然破棺而出媚值，到底是詐尸還是另有隱情狠毯，我是刑警寧澤，帶...
沈念sama閱讀 35,804評論 5贊 346
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布褥芒，位于F島的核電站嚼松，受9級特大地震影響，放射性物質(zhì)發(fā)生泄漏锰扶。R本人自食惡果不足惜献酗，卻給世界環(huán)境...
茶點故事閱讀 41,458評論 3贊 331
男人毒藥：我在死后第九天來索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望坷牛。院中可真熱鬧罕偎，春花似錦、人聲如沸京闰。這莊子的主人今日做“春日...
開封第一講書人閱讀 32,008評論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽蹂楣。三九已至俏站，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間痊土，已是汗流浹背肄扎。一陣腳步聲響...
開封第一講書人閱讀 33,135評論 1贊 272
情欲美人皮
我被黑心中介騙來泰國打工，沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留，地道東北人反浓。一個月前我還...
沈念sama閱讀 48,365評論 3贊 373
代替公主和親
正文我出身青樓萌丈，卻偏偏與公主長得像，于是被迫代替她去往敵國和親雷则。傳聞我的和親對象是個殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點故事閱讀 45,055評論 2贊 355