Started [[2025-03-02]]

---

A student finds themselves in the following little world. Today is a sunny Saturday and it is very tempting to spend the day playing outside. But they have an exam on Monday. Studying today will yield good exam results, and playing instead of studying will lead to bad exam results. The chances of a good life thereafter depend on the exam. What does the student do?

![[sunnySaturday01.png|400]]

We now have **five states**:

- $s_1$: _Saturday, sunny outside_ → The decision point.
- $s_2$: _Exam Day (High Prep)_ → If the teenager studied.
- $s_3$: _Exam Day (Low Prep)_ → If the teenager played.
- $s_4$: _Good Life_ → A great future, higher probability if well prepared.
- $s_5$: _Bad Life_ → A struggling future, higher probability if underprepared.

In $s_1$ the student has two possible actions: play outside ($a_1$) or study ($a_2$). If we study, we will be well prepared for the exam: $P(s_2|s_1,a_2) = 1$. If, on the other hand, we choose to play, we will be unprepared for the exam: $P(s_3|s_1,a_1) = 1$. After the exam, being well prepared gives a 90% chance of a good life, while low preparation offers only a 30% chance.

![[sunnySaturday02.png|400]]

What is our life situation on Saturday? We have a lifetime of choices and fates in front of us. Assuming we make good choices all along the way, what is the "value" of today? The Bellman equation gives us the utility of state $s_1$:

$U(s_{1}) = \max_{a \in A(s_{1})} \sum_{s'}P(s'|s_{1},a)[R(s_{1},a,s') + \gamma U(s')]$

Let's unpack this, starting from the left. First of all, what does "the utility of today" mean? Answer: the best future you can expect if you always make the right choices from today forward.

The expression $\max_{a \in A(s_{1})}$ means that we will evaluate the rest of the equation for each of the actions you can take in state $s_1$ - that is, we will compare the life that can unfold if you choose to study today to the life that can unfold if you choose to play today. The value of today will be whichever is greater. Literally, $a \in A(s_{1})$ means "for all the actions $a$ in the set of actions available in state $s_1$."

$U(\text{today})= \max \{U(\text{your life}(\text{if you study})),\ U(\text{your life}(\text{if you play}))\}$

To compare only the actions you can choose today, we assume that you make all the right decisions in the rest of your life. But actions don't always lead to the expected state. For this reason, we compute $U(\text{your life}(\text{if you study}))$ as the weighted sum of the various lives that could unfold following a given action choice. That's why the sum ($\sum_{s'}$) is "over" $s'$, all the states that a chosen action can lead to. The weighting in the sum is what $P(s'|s_1,a)$ does: the first factor is the probability that action $a$ in state $s_1$ leads to state $s'$. The second factor is the reward that follows.

The reward term has two parts: the immediate reward and the rest of your life.

$\underset{\text{immediate reward}}{R(s_{1},a,s')}\hspace{1em}+\hspace{1em} \underset{\text{rest of your life}}{\gamma U(s')}$

$R(s_1,a,s')$ is the immediate reward you get from the action ($a$) you choose today ($s_1$). If you choose play, it's fun (say, +25); if you choose study, it's not fun (say, -5). $U(s')$ is the utility of the rest of your life starting from here, assuming you make all the right decisions. We multiply this by $\gamma$, $0 \le \gamma \le 1$, because, in general, we tend to discount the future. High values of $\gamma$ indicate that we take the future strongly into account, while low values suggest we are living more in the moment.
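Putting the pieces back together, the whole backup for a single state is short enough to write out directly. Below is a minimal Python sketch (an illustration, not part of the original note); `P`, `R`, and `U` are placeholder dictionaries standing in for the transition model, rewards, and successor utilities described above, and we plug in the example's actual numbers further below.

```python
# Minimal sketch (not from the note): the Bellman backup for a single state s.
# P[a][s_next] plays the role of P(s'|s,a), R[a][s_next] of R(s,a,s'),
# U[s_next] of U(s'), and gamma is the discount factor.
def bellman_backup(actions, P, R, U, gamma):
    """Return (best_action, best_value): the argmax and max over a of
    the sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * U(s'))."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        value = sum(P[a][s_next] * (R[a][s_next] + gamma * U[s_next])
                    for s_next in P[a])
        if value > best_value:
            best_action, best_value = a, value
    return best_action, best_value
```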
$P(s'|s_1,a)$ is the probability that you end up in state $s'$ if you take action $a$ in state $s_1$. In our example we have made this simple: if you study you will transition to state $s_2 = \text{Exam Day (High Prep)}$ with certainty, and if you play you will transition to $s_3 = \text{Exam Day (Low Prep)}$ with certainty. This means the term is always equal to 1.

$\begin{equation}\begin{split} &P(\text{high prep}|\text{today},\text{study}) = 1 \\&P(\text{low prep}|\text{today},\text{play}) = 1 \end{split} \end{equation}$

Because each action we can take today leads to exactly one $s'$, Bellman's equation becomes a bit simpler:

$U(s_{1}) = \max \{ [R(s_{1},a_{\text{study}},s_2) + \gamma U(s_{2})],\ [R(s_{1},a_{\text{play}},s_3) + \gamma U(s_{3})] \}$

Above we interpreted this as

$\begin{equation}\begin{split} U(\text{today})= \max \{&\text{pain of study} + \text{discounted }U(\text{life after}(\text{high prep exam})), \\ &\text{joy of playing} + \text{discounted }U(\text{life after}(\text{low prep exam}))\} \end{split} \end{equation}$

And then we need to compute the $U(\text{life after})$ terms. There is only one action in states $s_2$ and $s_3$: live the rest of your life. This action might lead to a good life or to a bad life. (Good life and bad life are terminal states here, so there is no further $\gamma U(s')$ term.) So we can compute the utilities of high prep and low prep like this:

$U(s_{2}) = P(s_{4}|s_{2},\text{live})\, R(s_{2},\text{live},s_{4}) + P(s_{5}|s_{2},\text{live})\, R(s_{2},\text{live},s_{5})$

$U(s_{3}) = P(s_{4}|s_{3},\text{live})\, R(s_{3},\text{live},s_{4}) + P(s_{5}|s_{3},\text{live})\, R(s_{3},\text{live},s_{5})$

With a good life worth 100 and a bad life worth 30:

$\begin{equation}\begin{split} &U(s_{2}) = 0.9 \times 100 + 0.1 \times 30 = 93 \\ &U(s_{3}) = 0.3 \times 100 + 0.7 \times 30 = 51 \end{split}\end{equation}$

So where does this leave us?

$\begin{equation}\begin{split} U(\text{today})&= \max \{&&\text{pain of study} + \text{discounted }U(\text{life after}(\text{high prep exam})), \\& &&\text{joy of playing} + \text{discounted }U(\text{life after}(\text{low prep exam}))\} \\ & = \max \{&&(-5 + \gamma \times 93),\ (25 + \gamma \times 51)\} \end{split}\end{equation}$

Recall that $\gamma$ here is the factor by which we "discount" future rewards. Our parents might say, "You had better study today or you won't have a big pension when you are 65 years old," and our teenage brain will multiply "big pension" by a very small gamma; mom's argument is not making studying a more attractive option.

If $\gamma = 0.4$ we get $U(\text{today}) = \max \{32.2,\ 45.4\}$ and playing is the action choice that maximizes expected utility given the student's discounting of the future. On the other hand, if $\gamma = 0.9$, we get $U(\text{today}) = \max \{78.7,\ 70.9\}$ and studying is the action choice that maximizes expected utility.

#### Why are we here again?

Our goal is to explore the analogy between reinforcement learning and behavioral situations we encounter in policy conversations. For example, an intervention program provides a resource/support/program known to make a difference in young people's lives, but uptake/compliance/participation is hard to motivate. We want to think about ways that our agent is "tempted" to choose not to use a resource (in our example, to study) even though, "objectively," they ought to (that is, using it is part of the optimal policy).
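To make this flip concrete, here is a minimal Python sketch (an illustration, not part of the original note) that plugs the example's numbers into the simplified backup above and compares the two values of $\gamma$.

```python
# Minimal sketch (not from the note): the sunny-Saturday numbers plugged into
# the simplified Bellman backup for s1.
GOOD_LIFE, BAD_LIFE = 100, 30                     # terminal rewards
U_HIGH_PREP = 0.9 * GOOD_LIFE + 0.1 * BAD_LIFE    # U(s2) = 93
U_LOW_PREP = 0.3 * GOOD_LIFE + 0.7 * BAD_LIFE     # U(s3) = 51
R_STUDY, R_PLAY = -5, 25                          # immediate rewards in s1

def best_choice(gamma):
    """Compare the two deterministic action values and return the better one."""
    study = R_STUDY + gamma * U_HIGH_PREP
    play = R_PLAY + gamma * U_LOW_PREP
    return ("study", study) if study >= play else ("play", play)

for gamma in (0.4, 0.9):
    action, value = best_choice(gamma)
    print(f"gamma = {gamma}: U(today) = {value:.1f}, best action: {action}")
# gamma = 0.4 -> play (45.4 vs. 32.2); gamma = 0.9 -> study (78.7 vs. 70.9)
```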
##### How do we model temptation?

One way is to assume that the agent has **a biased reward function**. Instead of using the "true" utility function, the teenager **perceives rewards differently** due to cognitive biases.

###### **Option A: Temporal Discounting Distortion**

- A human **over-discounts future rewards**, making the **long-term payoff from studying feel smaller than it really is**.
- Instead of a discount factor $\gamma = 0.9$, the teenager **perceives** a lower $\gamma^{*}$, like 0.6.
- This means they think the future reward of a good exam score is **less valuable in the moment** than it actually is.

###### **Option B: Overweighting Immediate Reward**

- The teenager **may feel like playing outside is worth more than it really is**.
- Instead of the true $R(s_1, \text{play}) = 25$, they mentally perceive it as **35 or 40**.
- This inflates the short-term appeal of playing, making it more tempting.

###### **Option C: Beliefs (Correct or Not) of Attenuated Upstream Opportunity**

- The teenager might believe they face different probabilities of a good or bad life conditional on preparation for the exam, or, equivalently, might assign different levels of reward to "good life" and "bad life."
- Analogy: a young person may doubt that a resource or program will actually change their long-run prospects, even when it would.

###### **Option D: Decision Noise (Softmax Policy)**

- Instead of following the **strictly best action**, the agent **sometimes picks the wrong choice due to noise**.
- We could model this with a **softmax** decision function, where the teenager sometimes chooses to play even when studying is better.

##### Using Reward Shaping to Encourage "Better" Behavior

#reinforcement_learning #reward_shaping
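The note stops at this heading, but the idea it names can be sketched with the numbers already on the table. Reward shaping adds a small extra immediate reward to the action we want to encourage, without changing the underlying transition probabilities or goals. The sketch below is an illustration, not the author's worked content: it uses a hypothetical +15 bonus for choosing to study and shows that, with the teenager's myopic $\gamma = 0.4$, the bonus is enough to flip the decision. In the policy analogy, the bonus stands for an immediate incentive attached to participation.

```python
# Minimal reward-shaping sketch (illustration, not from the note): add a
# hypothetical bonus to the immediate reward for studying and check whether
# the myopic teenager's choice flips.
U_HIGH_PREP, U_LOW_PREP = 93, 51   # utilities of the two exam-day states (computed above)
R_STUDY, R_PLAY = -5, 25           # immediate rewards in s1
GAMMA = 0.4                        # the teenager's heavy discounting of the future

def choice_with_bonus(study_bonus):
    """Return the preferred action after shaping the study reward by study_bonus."""
    study = R_STUDY + study_bonus + GAMMA * U_HIGH_PREP
    play = R_PLAY + GAMMA * U_LOW_PREP
    return "study" if study > play else "play", round(study, 1), round(play, 1)

print(choice_with_bonus(0))    # ('play', 32.2, 45.4)  -- unshaped: temptation wins
print(choice_with_bonus(15))   # ('study', 47.2, 45.4) -- the bonus flips the choice
```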