To answer this question, let's revisit the components of an MDP, the most typical decision-making framework for RL.

An MDP is typically defined by a 4-tuple $(S, A, R, T)$ where

$S$ is the state/observation space of an environment.

$A$ is the set of actions the agent can choose between.

$R(s, a)$ is a function that returns the reward received for taking action $a$ in state $s$.

$T(s' \mid s, a)$ is a transition probability function, specifying the probability that the environment will transition to state $s'$ if the agent takes action $a$ in state $s$.

Our goal is to find a policy $\pi$ that maximizes the expected future (discounted) reward.
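One common way to write that objective down, as a sketch rather than the only formulation, is the expected discounted return. The discount factor $\gamma \in [0, 1)$ and the time index $t$ are assumptions introduced here; they are not part of the 4-tuple above.

$$
\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\middle|\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim T(\cdot \mid s_t, a_t) \right]
$$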

Now, if we know what all those elements of an MDP are, we can just compute the solution before ever actually executing an action in the environment. In AI, we typically call computing the solution to a decision-making problem before executing an actual decision *planning*. Some classic planning algorithms for MDPs include Value Iteration, Policy Iteration, and a whole lot more.
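To make planning concrete, here is a minimal sketch of Value Iteration for a small tabular MDP where $T$ and $R$ are fully known. The array layout (`T[s, a, s2]`, `R[s, a]`), the discount `gamma`, the tolerance `tol`, and the two-state example MDP are all assumptions made up for illustration.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Plan with a known model.

    T[s, a, s2] is the probability of landing in s2 after taking a in s.
    R[s, a] is the immediate reward for taking a in s.
    Returns the state values and a greedy policy (one action index per state).
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup for every (state, action) pair at once.
        Q = R + gamma * (T @ V)          # shape: (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical 2-state, 2-action MDP (numbers invented for this example).
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
    [[1.0, 0.0], [0.0, 1.0]],   # state 1: action 0 moves to state 0, action 1 stays
])
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],                 # reward only for taking action 1 in state 1
])

V, policy = value_iteration(T, R)
print(policy)
```

With these made-up numbers the greedy policy picks action 1 in both states. Notice that no action is ever executed in an environment: everything is computed from `T` and `R`.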

But the RL problem isn't so kind to us. What makes a problem an RL problem, rather than a planning problem, is that the agent does *not* know all the elements of the MDP, which precludes it from planning a solution. Specifically, the agent does not know how the world will change in response to its actions (the transition function $T$), nor what immediate reward it will receive for doing so (the reward function $R$). The agent will simply have to try taking actions in the environment, observe what happens, and somehow find a good policy from doing so.

So, if the agent knows neither the transition function $T$ nor the reward function $R$, and therefore cannot plan out a solution, how can it find a good policy? Well, it turns out there are lots of ways!

One approach that might immediately strike you, after framing the problem like this, is for the agent to learn a **model** of how the environment works from its observations and then plan a solution using that model. That is, if the agent is currently in state $s_1$, takes action $a_1$, and then observes the environment transition to state $s_2$ with reward $r_2$, that information can be used to improve its estimates of $T(s_2 \mid s_1, a_1)$ and $R(s_1, a_1)$, which can be done using supervised learning approaches. Once the agent has adequately modelled the environment, it can use a planning algorithm with its learned model to find a policy. RL solutions that follow this framework are *model-based RL algorithms*.
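For a small, discrete environment, the simplest version of this model learning is plain counting: estimate the transition probabilities from observed frequencies and the reward from the observed average. The sketch below assumes hashable states and actions; the class name `TabularModel` and its methods are hypothetical, and a larger problem would typically swap the counts for a learned function approximator.

```python
from collections import defaultdict

class TabularModel:
    """Count-based estimates of T(s' | s, a) and R(s, a) from observed transitions."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s, a) -> summed reward
        self.visits = defaultdict(int)                            # (s, a) -> times tried

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a) -> s_next with reward r."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def T(self, s_next, s, a):
        """Estimated probability of reaching s_next after taking a in s."""
        n = self.visits[(s, a)]
        return self.next_counts[(s, a)][s_next] / n if n else 0.0

    def R(self, s, a):
        """Estimated (average) reward for taking a in s."""
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0

# Made-up experience: action a1 in s1 led to s2 three times and back to s1 once.
model = TabularModel()
for _ in range(3):
    model.update("s1", "a1", 1.0, "s2")
model.update("s1", "a1", 0.0, "s1")

print(model.T("s2", "s1", "a1"))  # 0.75
print(model.R("s1", "a1"))        # 0.75
```

Once estimates like these are good enough, they can be handed to any planning algorithm in place of the true $T$ and $R$.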

As it turns out though, we don't have to learn a model of the environment to find a good policy. One of the most classic examples is *Q-learning*, which directly estimates the optimal *Q*-values of each action in each state (roughly, the utility of each action in each state), from which a policy may be derived by choosing the action with the highest Q-value in the current state. *Actor-critic* and *policy search* methods directly search over policy space to find policies that result in better reward from the environment. Because these approaches do not learn a model of the environment, they are called *model-free algorithms*.
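As an illustration of the model-free idea, here is a minimal sketch of tabular Q-learning. The `step(s, a)` interface, the hyperparameters (`alpha`, `gamma`, `epsilon`), and the four-state chain environment are all assumptions invented for this example. Note that the agent never stores or estimates $T$ or $R$; it only updates Q-values from the transitions it experiences.

```python
import random
from collections import defaultdict

def q_learning(step, start_state, actions, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    step(s, a) must return (next_state, reward, done). It stands in for the
    environment; the agent only ever sees its outputs.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.choice(actions)                      # explore
            else:
                best = max(Q[(s, b)] for b in actions)       # exploit, ties broken randomly
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next, r, done = step(s, a)
            # Bootstrapped target: observed reward plus discounted best next Q-value.
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            s = s_next
    return Q

def chain_step(s, a):
    """Hypothetical chain of states 0..3; reaching state 3 pays 1 and ends the episode."""
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done

actions = ["left", "right"]
Q = q_learning(chain_step, start_state=0, actions=actions)
greedy = {s: max(actions, key=lambda b: Q[(s, b)]) for s in range(3)}
print(greedy)
```

The derived greedy policy heads right toward the rewarding state, yet at no point can the agent say which state or reward will follow an action; it only knows how good each action has turned out to be.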

So if you want a way to check whether an RL algorithm is model-based or model-free, ask yourself this question: after learning, can the agent make predictions about what the next state and reward will be before it takes each action? If it can, then it's a model-based RL algorithm. If it cannot, it's a model-free algorithm.

This same idea may also apply to decision-making processes other than MDPs.
