>[!quote] In a Nutshell
>Learn a [[Static, Dynamic and Stochastic Systems|dynamics model]] (and often the reward as well) from data and optimize the [[Policy|policy]] using the approximate dynamics model by minimizing a cost function / maximizing a reward (as in optimal control).
>- Good when there is a well-understood underlying physical model / domain knowledge
>- Bad when physical model is very complex / not well understood or environment is too complex to model (use [[Model-Free Reinforcement Learning|MFRL]] instead)
---
A **full model** $\hat{\mathcal{M}}$ in [[- Reinforcement Learning -|- Reinforcement Learning -]] is an approximation of the environment's [[Markov Decision Process]] and can be stochastic or deterministic. In model-based RL, instead of incorporating experience directly into our [[Value Functions|value function]] / [[Policy|policy]], we first try to learn:
- An approximated [[Markov Decision Process|reward function]] $\hat{\mathcal{R}}\approx \mathcal{R}$
- An approximated [[Markov Decision Process|transition dynamics]] $\hat{\mathcal{P}}\approx \mathcal{P}$
![[Pasted image 20230717095355.png|center|300]]
In this context, **planning** is the process of searching through the state space for an optimal policy using the learned model instead of further interaction with the real environment.
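
As a minimal sketch of these two steps (learn $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$, then plan), the example below assumes a small tabular MDP with integer states and actions. The function names `fit_tabular_model` and `plan_value_iteration`, the toy transitions, and the uniform fallback for unvisited state-action pairs are assumptions made for illustration, not part of the original note.

```python
import numpy as np

def fit_tabular_model(transitions, n_states, n_actions):
    """Estimate P_hat[s, a, s'] and R_hat[s, a] from (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)
    safe_visits = np.maximum(visits, 1)
    # Assumption: unvisited (s, a) pairs get a uniform next-state
    # distribution and zero reward.
    P_hat = np.where(visits[..., None] > 0,
                     counts / safe_visits[..., None],
                     1.0 / n_states)
    R_hat = reward_sum / safe_visits
    return P_hat, R_hat

def plan_value_iteration(P_hat, R_hat, gamma=0.9, tol=1e-8):
    """Plan on the learned model only; no environment interaction here."""
    V = np.zeros(P_hat.shape[0])
    while True:
        Q = R_hat + gamma * (P_hat @ V)  # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new

# Toy data: action 1 switches the state, action 0 stays.
# Staying in state 1 is the only rewarded transition.
transitions = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]
P_hat, R_hat = fit_tabular_model(transitions, n_states=2, n_actions=2)
policy, V = plan_value_iteration(P_hat, R_hat)
```

With this toy data the planner should choose to move to state 1 and then stay there. In practice the model is usually a function approximator rather than a count table, but the split between model fitting and planning stays the same.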
---
#### Performance Bias
- Reasons
- The policy optimizer (e.g. [[Gradient Descent - From Vanilla to Adam|gradient descent]]) can itself cause problems, e.g. through poorly conditioned gradients
- Modeling errors due to model misspecification or limited data; these can be fatal
- Sim-to-real gap: the model never perfectly reflects reality
- Improvements
- Better choice of optimizers
- More accurate / involved models
- [[Domain Randomization|Domain Randomization]]
- Artificially add noise / uncertainty to model
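
The last two improvements can be sketched together: instead of tuning a controller on a single nominal model, evaluate it on many randomly perturbed models with injected noise and pick the setting that works best on average. Everything below is a hypothetical illustration: the 1-D model, the mass range, the noise level, and the names `rollout_cost` and `randomized_cost` are assumptions, not taken from the note.

```python
import numpy as np

def rollout_cost(gain, mass, noise_std, rng, horizon=50, dt=0.1):
    """Cost of the proportional controller u = -gain * x on a toy 1-D model."""
    x, cost = 1.0, 0.0
    for _ in range(horizon):
        u = -gain * x
        # Model step with artificially injected process noise
        x = x + dt * u / mass + noise_std * rng.normal()
        cost += x**2 + 0.01 * u**2
    return cost

def randomized_cost(gain, n_models=64, seed=0):
    """Average cost over models whose mass is drawn at random."""
    # Same seed for every gain, so all gains see the same perturbed models
    rng = np.random.default_rng(seed)
    masses = rng.uniform(0.5, 2.0, size=n_models)
    return np.mean([rollout_cost(gain, m, 0.01, rng) for m in masses])

gains = np.linspace(0.5, 15.0, 30)
costs = [randomized_cost(g) for g in gains]
best_gain = gains[int(np.argmin(costs))]
```

A gain that is tuned aggressively for one mass can destabilize the system for a lighter one; averaging over the randomized models penalizes such brittle choices.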
---
#### Examples and Categories
**Dynamic Programming**
Closely related to divide-and-conquer approaches: in both contexts, a complicated problem is simplified by recursively breaking it down into simpler sub-problems ([[Bellman's Principle and Operator]]).
- For model-based reinforcement learning, two commonly used approaches that rely on [[Dynamic Programming - Policy and Value Iteration|dynamic programming]] are
- [[Dynamic Programming - Policy and Value Iteration|Value Iteration]]
- [[Dynamic Programming - Policy and Value Iteration|Policy Iteration]]
- Two **dimensions -> four categories**
- Are we using the structure of the model, or are we sampling from it?
- Do we use the model during the rollout (**online**) or between rollouts (**offline**)?
- **Structured Models**
- **Offline** Usage of Structured Models
- **Optimal Control**, e.g. [[LQG Regulator|LQR]], [[Non-Linear Systems|iLQR]]
- **Why?**
- No heavy online computation, since the solution is computed entirely offline
- Scales well to high-dimensional action spaces
- **Problems**
- Only valid near the nominal trajectory
- **Online** Usage of Structured Models
- Replan every step online
- **[[MPC - Model Predictive Control]]**
- Solve a short-horizon problem at every timestep, e.g. using [[Non-Linear Systems|iLQR]]
- Apply only the first action of the solution as the next step of the global problem, then replan
- **Why?**
- Reacts to disturbances and deviations from the nominal trajectory
- Scales well to high-dimensional action spaces
- **Problems**
- Very costly, since an optimization problem is solved at every timestep
- Often acts overly greedily because of the short planning horizon
- Requires a structured model
- **Sample-Based Methods**
- **Why?**
- Most structured models use local linearizations, which makes them susceptible to local optima
- With local linearizations, model errors are amplified in the gradients
- Sampling-Based **Online** Methods
- Sample points around the current solution, which makes it possible to escape local minima
- **[[CEM - Cross-Entropy Method]]**
- **Why?**
- No structured model required
- Performs better with respect to local minima
- Reacts to disturbances and deviations from the nominal trajectory
- **Problems**
- Does not scale well to high dimensions
- Costly, high sample complexity
- Sampling-Based **Offline** Methods
- **[[MBPO - Model-Based Policy Optimization]]**
- **Why?**
- No heavy computation during the rollout
- Policy is not bound to a nominal trajectory
- No structured model necessary
- **Problems**
- Computations between rollouts are heavy
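
To make the sampling-based online category concrete, the sketch below combines receding-horizon replanning (as in MPC) with the cross-entropy method as the trajectory optimizer. It is a simplified illustration under stated assumptions: `model_step` is a hypothetical stand-in for a learned model (a 1-D double integrator), the quadratic cost and all hyperparameters are invented for the example, and the "real" system is taken to be identical to the model.

```python
import numpy as np

def model_step(state, action, dt=0.1):
    """Hypothetical learned model: 1-D double integrator (position, velocity)."""
    pos, vel = state[..., 0], state[..., 1]
    return np.stack([pos + dt * vel, vel + dt * action], axis=-1)

def sequence_cost(state, actions):
    """Cost of each candidate action sequence; actions has shape (N, H)."""
    states = np.repeat(state[None, :], actions.shape[0], axis=0)
    cost = np.zeros(actions.shape[0])
    for t in range(actions.shape[1]):
        states = model_step(states, actions[:, t])
        cost += states[:, 0] ** 2 + 0.1 * states[:, 1] ** 2 + 0.01 * actions[:, t] ** 2
    return cost

def cem_plan(state, rng, horizon=15, n_samples=200, n_elite=20, n_iters=5):
    """Cross-entropy method over open-loop action sequences."""
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        # Sample around the current solution, then refit to the best samples
        samples = rng.normal(mean, std, size=(n_samples, horizon))
        samples = np.clip(samples, -2.0, 2.0)
        elite = samples[np.argsort(sequence_cost(state, samples))[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean

rng = np.random.default_rng(0)
state = np.array([1.0, 0.0])
for _ in range(60):
    action = cem_plan(state, rng)[0]   # replan, apply only the first action
    state = model_step(state, action)  # assumption: real system == model
```

No gradients or linearizations of the model are needed, only forward evaluations, which is why no structured model is required. The price is visible in the loop: every control step evaluates 200 candidate sequences in each of 5 iterations.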
---