Model-Based Reinforcement Learning

>[!quote] In a Nutshell
>Learn a [[Static, Dynamic and Stochastic Systems|dynamics model]] (and often the reward as well) from data and optimize the [[Policy|policy]] using the approximate dynamics model by minimizing a cost function / maximizing a reward (as in optimal control).
>- Good when there is a well-understood underlying physical model / domain knowledge
>- Bad when the physical model is very complex / not well understood or the environment is too complex to model (use [[Model-Free Reinforcement Learning|MFRL]] instead)

---

A **full model** $\hat{\mathcal{M}}$ in [[- Reinforcement Learning -|- Reinforcement Learning -]] can be stochastic or deterministic and consists of a [[Markov Decision Process]]. In model-based RL, instead of directly incorporating experience into our [[Value Functions|value function]] / [[Policy|policy]], we first try to learn
- An approximated [[Markov Decision Process|reward function]] $\hat{\mathcal{R}}\approx \mathcal{R}$
- An approximated [[Markov Decision Process|transition dynamics]] $\hat{\mathcal{P}}\approx \mathcal{P}$

![[Pasted image 20230717095355.png|center|300]]

In this context, **planning** is the process of searching through the state space for an optimal policy by using the model.

---

#### Performance Bias
- Reasons
    - The policy optimizer (e.g. [[Gradient Descent - From Vanilla to Adam|gradient descent]]) can lead to problems, e.g. through poorly conditioned gradients
    - Modeling errors due to model misspecification or limited data can be fatal
    - Sim-to-real gap: the model never perfectly reflects reality
- Improvements
    - Better choice of optimizers
    - More accurate / involved models
    - [[Domain Randomization|Domain Randomization]]
        - Artificially add noise / uncertainty to the model

---

#### Examples and Categories
**Dynamic Programming**
Closely related to divide and conquer approaches.
In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner ([[Bellman's Principle and Operator]]).
- For model-based reinforcement learning, two commonly used approaches that rely on [[Dynamic Programming - Policy and Value Iteration|dynamic programming]] are
    - [[Dynamic Programming - Policy and Value Iteration|Value Iteration]]
    - [[Dynamic Programming - Policy and Value Iteration|Policy Iteration]]

- Two **dimensions -> four categories**
    - Are we using the structure of the model or are we sampling from it?
    - Do we use the model during the rollout (**online**) or between rollouts (**offline**)?
- **Structured Models**
    - **Offline** Usage of Structured Models
        - **Optimal Control**, e.g. [[LQG Regulator|LQR]], [[Non-Linear Systems|iLQR]]
        - **Why ?**
            - No heavy computation since entirely offline
            - Scales well to high-dimensional action spaces
        - **Problems**
            - Only valid along the nominal trajectory
    - **Online** Usage of Structured Models
        - Replan every step online
        - **[[MPC - Model Predictive Control]]**
            - Solve a small-horizon problem at every single timestep using e.g. [[Non-Linear Systems|iLQR]]
            - Use the first step of the solution for the next step in the global problem
        - **Why ?**
            - React to disturbances, deviation from the nominal trajectory
            - Scales well to high-dimensional action spaces
        - **Problems**
            - Very costly
            - Often acts overly greedy
            - Requires a structured model
- **Sample-Based Methods**
    - Why?
        - Most structured models use local linearizations, which makes them susceptible to local optima
        - Model errors get amplified in the gradient with local linearizations
    - Sampling-Based **Online** Methods
        - Sample points around the current solution, which makes it possible to wiggle out of local minima
        - **[[CEM - Cross-Entropy Method]]**
        - **Why ?**
            - No structured model required
            - Performs better regarding local minima
            - React to disturbances, deviation from the nominal trajectory
        - **Problems**
            - Does not scale well to high dimensions
            - Costly, high sample complexity
    - Sampling-Based **Offline** Methods
        - **[[MBPO - Model-Based Policy Optimization]]**
        - **Why ?**
            - No heavy computations during the rollout
            - Policy is not bound to the nominal trajectory
            - No structured model necessary
        - **Problems**
            - Computations between rollouts are heavy

---
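
The two-stage idea from the top of this note (first learn $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$, then plan with dynamic programming) can be sketched for a small tabular problem. This is a minimal illustration, not a reference implementation: the function names `fit_tabular_model` and `value_iteration`, the count-based estimator, the self-loop fallback for unvisited state-action pairs, and the two-state toy data are all assumptions made for this example.

```python
def fit_tabular_model(transitions, n_states, n_actions):
    """Estimate P_hat[s][a][s'] and R_hat[s][a] from (s, a, r, s') tuples by counting."""
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in transitions:
        counts[s][a][s_next] += 1
        reward_sum[s][a] += r

    P_hat = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    R_hat = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            n = sum(counts[s][a])
            if n == 0:
                # Assumption: unvisited (s, a) pairs become zero-reward self-loops.
                P_hat[s][a][s] = 1.0
                continue
            R_hat[s][a] = reward_sum[s][a] / n
            for s_next in range(n_states):
                P_hat[s][a][s_next] = counts[s][a][s_next] / n
    return P_hat, R_hat


def q_value(P_hat, R_hat, V, s, a, gamma):
    """One-step lookahead under the learned model."""
    return R_hat[s][a] + gamma * sum(p * v for p, v in zip(P_hat[s][a], V))


def value_iteration(P_hat, R_hat, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Plan on the learned model by repeatedly applying the Bellman optimality backup."""
    n_states, n_actions = len(P_hat), len(P_hat[0])
    V = [0.0] * n_states
    for _ in range(max_iters):
        V_new = [
            max(q_value(P_hat, R_hat, V, s, a, gamma) for a in range(n_actions))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = [
        max(range(n_actions), key=lambda a: q_value(P_hat, R_hat, V, s, a, gamma))
        for s in range(n_states)
    ]
    return V, policy


# Toy data (hypothetical): action 0 = stay, action 1 = switch state.
# Only staying in state 1 yields reward.
data = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]
P_hat, R_hat = fit_tabular_model(data, n_states=2, n_actions=2)
V, policy = value_iteration(P_hat, R_hat)
```

On this toy problem the planner should prefer switching out of state 0 and staying in state 1. Any error in the counted model carries over directly into `V`, which is the modeling-error issue listed under Performance Bias.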
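
For the offline structured case, the LQR entry can be illustrated with a backward Riccati recursion. The sketch below assumes a scalar linear model $x_{t+1} = a x_t + b u_t$ with stage cost $q x_t^2 + r u_t^2$; the function name `lqr_gains` and all numeric values are chosen for illustration only. iLQR would apply a similar backward pass to a local linearization around a nominal trajectory, which is why its solution is only valid near that trajectory.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for x' = a*x + b*u with stage cost q*x^2 + r*u^2.

    Returns feedback gains k_t such that u_t = -k_t * x_t.
    """
    p = q  # cost-to-go weight at the final step
    gains = []
    for _ in range(horizon):
        k = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * k
        gains.append(k)
    gains.reverse()  # gains[t] is the gain to use at time step t
    return gains


# Hypothetical unstable system (|a| > 1), stabilized by the precomputed gains.
a, b = 1.2, 1.0
gains = lqr_gains(a, b, q=1.0, r=0.1, horizon=20)

x = 1.0
for k in gains:
    u = -k * x          # all heavy computation happened offline in lqr_gains
    x = a * x + b * u   # rollout only needs one multiplication per step
```

All gains are computed before the rollout, so the online cost is a single multiplication per step. This matches the "no heavy computation" advantage listed for the offline structured case.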
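
The online sampling-based case combines the MPC loop (replan at every step, apply only the first action) with CEM as the optimizer. The following is a rough sketch under simplifying assumptions: `model_step` stands in for a learned dynamics model and is just a 1D integrator here, the "real" system is identical to the model, and the horizon, sample counts, and cost weights are arbitrary illustrative choices.

```python
import random


def model_step(x, a, dt=0.1):
    """Stand-in for a learned dynamics model (assumption: 1D single integrator)."""
    return x + dt * a


def rollout_cost(x0, actions, target):
    """Accumulated cost of an action sequence when simulated with the model."""
    x, cost = x0, 0.0
    for a in actions:
        x = model_step(x, a)
        cost += (x - target) ** 2 + 0.01 * a ** 2
    return cost


def cem_plan(x0, target, rng, horizon=5, n_samples=100, n_elites=10, n_iters=5):
    """Cross-entropy method over action sequences with actions clipped to [-1, 1]."""
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(n_iters):
        # Sample candidate sequences around the current solution.
        samples = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(n_samples)
        ]
        # Keep the lowest-cost sequences and refit the sampling distribution.
        samples.sort(key=lambda seq: rollout_cost(x0, seq, target))
        elites = samples[:n_elites]
        for t in range(horizon):
            column = [seq[t] for seq in elites]
            mean[t] = sum(column) / n_elites
            std[t] = (sum((a - mean[t]) ** 2 for a in column) / n_elites) ** 0.5
    return mean


def mpc_control(x0, target, n_steps=40, seed=0):
    """Replan with CEM at every step and apply only the first planned action."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        plan = cem_plan(x, target, rng)
        x = model_step(x, plan[0])  # assumption: real system equals the model
    return x


x_final = mpc_control(x0=0.0, target=1.0)
```

No gradients or linearizations of the model are needed, only forward simulation. The price is visible in the loop counts: every control step evaluates `n_iters * n_samples` rollouts, which is the high sample cost noted under Problems.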