IERG 5330 Reinforcement Learning
January 27, 2026
Notes for Mathematical Foundations of Reinforcement Learning.
Basic Concepts
State and action
- State: the agent’s status with respect to the environment.
- The set of all the states is called the state space, denoted as $\mathcal{S}$.
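- Action: what the agent can do at a state. The set of all actions available at state $s$ is called the action space, denoted as $\mathcal{A}(s)$.
- Different states can have different action spaces.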
State transition
When taking an action, the agent may move from one state to another. Such a process is called state transition. For example, if the agent is in state $s_1$ and selects action $a_2$, then the agent moves to state $s_2$. Such a process can be expressed as
\[s_1 \xrightarrow{a_2} s_2\]
In general, state transitions can be stochastic and must be described by conditional probability distributions.
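As a rough illustration of a stochastic state transition, the sketch below samples the next state from a conditional distribution $p(s' \mid s, a)$. The table `TRANSITIONS`, its probabilities, and the helper `sample_next_state` are assumptions made up for this example; they are not part of the notes.

```python
import random

# Hypothetical transition model p(s' | s, a). The state names, action names,
# and probabilities are invented for illustration only.
TRANSITIONS = {
    ("s1", "a2"): {"s2": 0.8, "s1": 0.2},  # stochastic: a2 sometimes fails
    ("s2", "a3"): {"s5": 1.0},             # deterministic special case
}


def sample_next_state(state, action, rng):
    """Draw the next state s' from p(. | state, action)."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_next_state("s1", "a2", rng) for _ in range(10_000)]
freq_s2 = samples.count("s2") / len(samples)
print(f"empirical p(s2 | s1, a2) = {freq_s2:.3f}")
```

A deterministic transition such as $s_1 \xrightarrow{a_2} s_2$ is the special case in which one next state has probability 1.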
Policy
- A policy tells the agent which actions to take at every state.
- Following a policy, the agent can generate a trajectory starting from an initial state.
- Mathematically, a policy is described by conditional probabilities, denoted as $\pi(a \mid s)$: for every state $s$, $\pi(\cdot \mid s)$ is a probability distribution over the actions available at $s$.
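The following sketch shows one possible way to store a tabular policy $\pi(a \mid s)$ and sample an action from it. The dictionary `POLICY` and the helpers `check_policy` and `sample_action` are hypothetical names chosen for illustration.

```python
import random

# Hypothetical tabular policy pi(a | s); the entries are illustrative only.
POLICY = {
    "s1": {"a2": 0.5, "a3": 0.5},  # stochastic choice at s1
    "s2": {"a3": 1.0},             # deterministic choice at s2
}


def check_policy(policy, tol=1e-9):
    """Verify that pi(. | s) is a probability distribution for every state."""
    for state, dist in policy.items():
        if any(p < 0 for p in dist.values()):
            raise ValueError(f"negative probability at {state}")
        if abs(sum(dist.values()) - 1.0) > tol:
            raise ValueError(f"pi(. | {state}) does not sum to 1")


def sample_action(policy, state, rng):
    """Draw an action a from pi(. | state)."""
    dist = policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


check_policy(POLICY)
rng = random.Random(0)
action_at_s2 = sample_action(POLICY, "s2", rng)
action_at_s1 = sample_action(POLICY, "s1", rng)
```

A deterministic policy is simply a table in which each row puts probability 1 on a single action.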
Reward
- After executing an action at a state, the agent obtains a reward, denoted as $r$, as feedback from the environment.
- The reward is a function of the state $s$ and action $a$. Hence, it is also denoted as $r(s, a)$.
- A reward can be interpreted as a human-machine interface, with which we can guide the agent to behave as we expect.
- Designing appropriate rewards is an important step in reinforcement learning. This step is, however, nontrivial for complex tasks since it may require the user to understand the given problem well.
- To determine a good policy, we must consider the total reward obtained in the long run. An action with the greatest immediate reward may not lead to the greatest total reward.
- A general approach is to use conditional probabilities $p(r \mid s, a)$ to describe reward processes.
Trajectories, returns, and episodes
A trajectory is a state-action-reward chain:
\[s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9\]
The return of this trajectory is defined as the sum of all the rewards collected along the trajectory:
\[\text{return} = 0+0+0+1=1\]
Returns are also called total rewards or cumulative rewards.
Returns can be used to evaluate policies.
A return consists of an immediate reward and future rewards. Here, the immediate reward is the reward obtained after taking an action at the initial state; the future rewards refer to the rewards obtained after leaving the initial state.
Returns can also be defined for infinitely long trajectories.
Since the undiscounted sum of rewards along an infinitely long trajectory may diverge, we introduce the discounted return:
\[\text{discounted return} = 0 + \gamma 0 + \gamma^2 0 + \gamma^3 1 + \gamma^4 1 + \gamma^5 1 + \dots\]
where $\gamma \in (0, 1)$ is called the discount rate.
The introduction of the discount rate is useful for the following reasons.
- First, it removes the stop criterion and allows for infinitely long trajectories.
- Second, the discount rate can be used to adjust the emphasis placed on near- or far-future rewards.
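As a small numerical check of the discounted return above, the sketch below sums a long but finite prefix of the reward sequence $0, 0, 0, 1, 1, 1, \dots$ and compares it with the geometric-series closed form $\gamma^3 / (1 - \gamma)$. The function name `discounted_return`, the value $\gamma = 0.9$, and the truncation length are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards: G = r_0 + gamma * (r_1 + gamma * (r_2 + ...)).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


gamma = 0.9  # illustrative choice of discount rate

# Undiscounted return of the finite trajectory in the text: 0 + 0 + 0 + 1.
finite_return = discounted_return([0, 0, 0, 1], gamma=1.0)

# Truncate the infinite sequence 0, 0, 0, 1, 1, 1, ... after 1000 ones.
truncated = discounted_return([0, 0, 0] + [1] * 1000, gamma)

# Geometric series: gamma^3 * (1 + gamma + gamma^2 + ...) = gamma^3 / (1 - gamma).
closed_form = gamma**3 / (1 - gamma)

print(finite_return, truncated, closed_form)
```

A smaller $\gamma$ shrinks the contribution of the far-future rewards, which is the emphasis adjustment mentioned in the second point.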
When interacting with the environment by following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks. However, some tasks may have no terminal states, meaning that the process of interacting with the environment will never end. Such tasks are called continuing tasks.
Markov decision processes
A Markov decision process (MDP) is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.
- Sets:
  - State space: the set of all states, denoted as $\mathcal{S}$.
  - Action space: a set of actions, denoted as $\mathcal{A}(s)$, associated with each state $s \in \mathcal{S}$.
  - Reward set: a set of rewards, denoted as $\mathcal{R}(s, a)$, associated with each state-action pair $(s, a)$.
- Models:
  - State transition probability: In state $s$, when taking action $a$, the probability of transitioning to state $s'$ is $p(s' \mid s, a)$. It holds that $\sum_{s' \in \mathcal{S}} p(s' \mid s, a) = 1$ for any $(s, a)$.
  - Reward probability: In state $s$, when taking action $a$, the probability of obtaining reward $r$ is $p(r \mid s, a)$. It holds that $\sum_{r \in \mathcal{R}(s, a)} p(r \mid s, a) = 1$ for any $(s, a)$.
- Policy: In state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$. It holds that $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$ for any $s \in \mathcal{S}$.
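- Markov property: The Markov property refers to the memoryless property of a stochastic process. Mathematically, it means that
\[p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t),\]
\[p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1} \mid s_t, a_t),\]
where $t$ represents the current time step and $t + 1$ represents the next time step. This indicates that the next state or reward depends merely on the current state and action and is independent of the previous ones.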
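The ingredients listed above can be collected in a small data structure. The sketch below is only one possible layout, assuming a finite (tabular) MDP; the class name `TabularMDP`, its fields, and the two-state example are hypothetical. The `validate` method checks the normalization conditions on $p(s' \mid s, a)$ and $p(r \mid s, a)$.

```python
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """Hypothetical container for the model part of a finite MDP."""
    transition: dict  # (s, a) -> {s_next: p(s_next | s, a)}
    reward: dict      # (s, a) -> {r: p(r | s, a)}

    def validate(self, tol=1e-9):
        """Check that both models are defined on the same (s, a) pairs
        and that every conditional distribution sums to 1."""
        if set(self.transition) != set(self.reward):
            raise ValueError("transition and reward models cover different (s, a) pairs")
        for name, model in (("transition", self.transition), ("reward", self.reward)):
            for sa, dist in model.items():
                total = sum(dist.values())
                if abs(total - 1.0) > tol:
                    raise ValueError(f"{name} probabilities at {sa} sum to {total}")


# Illustrative two-state example; all numbers are made up.
mdp = TabularMDP(
    transition={
        ("s1", "a1"): {"s1": 0.1, "s2": 0.9},
        ("s2", "a1"): {"s2": 1.0},
    },
    reward={
        ("s1", "a1"): {0: 1.0},
        ("s2", "a1"): {0: 0.5, 1: 0.5},
    },
)
mdp.validate()
```

Because the models are indexed only by the current pair $(s, a)$, this layout has the Markov property built in: no earlier history is available to condition on.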
Q&A
Is the reward a function of the next state?
Strictly speaking, $r$ depends on $s$, $a$, and $s'$. However, since $s'$ also depends on $s$ and $a$, we can marginalize over $s'$ and equivalently write $r$ as a function of $s$ and $a$: $p(r \mid s, a) = \sum_{s'} p(r \mid s, a, s') p(s' \mid s, a)$.
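The marginalization in this answer can be checked numerically. In the sketch below, the distributions `p_next` and `p_reward_given_next` for a single fixed pair $(s, a)$ are invented numbers, and `marginal_reward_distribution` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical numbers for one fixed state-action pair (s, a).
p_next = {"s1": 0.3, "s2": 0.7}   # p(s' | s, a)
p_reward_given_next = {           # p(r | s, a, s')
    "s1": {-1: 1.0},
    "s2": {0: 0.5, 1: 0.5},
}


def marginal_reward_distribution(p_next, p_reward_given_next):
    """Compute p(r | s, a) = sum over s' of p(r | s, a, s') * p(s' | s, a)."""
    p_reward = defaultdict(float)
    for s_next, p_s in p_next.items():
        for r, p_r in p_reward_given_next[s_next].items():
            p_reward[r] += p_r * p_s
    return dict(p_reward)


p_reward = marginal_reward_distribution(p_next, p_reward_given_next)
print(p_reward)
```

The result is again a valid distribution over rewards, so nothing is lost by writing the reward model as $p(r \mid s, a)$.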
State Values and Bellman Equation
- State value: the expected (average) return that an agent obtains when it starts from a given state and follows a given policy.
- The greater the state value is, the better the corresponding policy is.
- The Bellman equation describes the relationships between the values of all states and is an important tool for analyzing state values.
- By solving the Bellman equation, we can obtain the state values. This process is called policy evaluation.
- The core idea of the Bellman equation: the return obtained by starting from one state depends on those obtained when starting from other states (the idea of bootstrapping, which is to obtain the values of some quantities from themselves).
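To make the bootstrapping idea concrete, the sketch below uses a hypothetical deterministic two-state cycle, $s_1 \to s_2 \to s_1 \to \dots$, with made-up rewards. The return from each state equals the immediate reward plus $\gamma$ times the return from the other state, so the two values are defined in terms of each other. Repeated substitution is just one simple way to solve such a system and is an assumption of this example, not necessarily the method used later in the notes.

```python
GAMMA = 0.9  # illustrative discount rate

# Hypothetical deterministic cycle: s1 -> s2 (reward 0), s2 -> s1 (reward 1).
NEXT_STATE = {"s1": "s2", "s2": "s1"}
REWARD = {"s1": 0.0, "s2": 1.0}


def solve_by_iteration(num_iters=1000):
    """Repeatedly apply v(s) <- r(s) + gamma * v(next(s)) from a zero guess."""
    v = {s: 0.0 for s in NEXT_STATE}
    for _ in range(num_iters):
        v = {s: REWARD[s] + GAMMA * v[NEXT_STATE[s]] for s in NEXT_STATE}
    return v


v = solve_by_iteration()

# Closed form from substituting one equation into the other:
# v1 = r1 + gamma * (r2 + gamma * v1)  =>  v1 = (r1 + gamma * r2) / (1 - gamma^2).
v1_exact = (REWARD["s1"] + GAMMA * REWARD["s2"]) / (1 - GAMMA**2)
v2_exact = REWARD["s2"] + GAMMA * v1_exact
print(v, v1_exact, v2_exact)
```

Each value is obtained from the other, which is exactly the sense in which the quantities are computed "from themselves".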
State values
- Why introduce the concept of state value?
- Returns can be used to evaluate policies. However, they are inapplicable to stochastic systems because starting from one state may lead to different returns.
Consider a sequence of time steps $t=0,1,2, \dots$. At time $t$, the agent is in state $S_{t}$, and the action taken following a policy $\pi$ is $A_{t}$. The next state is $S_{t+1}$, and the immediate reward obtained is $R_{t+1}$. This process can be expressed concisely as
\[S_{t} \xrightarrow{A_{t}} S_{t+1}, R_{t+1}\]
Note that $S_{t}, S_{t+1}, A_{t}, R_{t+1}$ are all random variables. Moreover, $S_{t}, S_{t+1} \in \mathcal{S}, A_{t} \in \mathcal{A}\left(S_{t}\right)$, and $R_{t+1} \in \mathcal{R}\left(S_{t}, A_{t}\right)$.
Starting from $t$, we can obtain a state-action-reward trajectory:
\[S_{t} \xrightarrow{A_{t}} S_{t+1}, R_{t+1} \xrightarrow{A_{t+1}} S_{t+2}, R_{t+2} \xrightarrow{A_{t+2}} S_{t+3}, R_{t+3} \ldots\]
By definition, the discounted return along the trajectory is
\[G_{t} \doteq R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots,\]
where $\gamma \in(0,1)$ is the discount rate. Note that $G_{t}$ is a random variable since $R_{t+1}, R_{t+2}, \ldots$ are all random variables.
Since $G_{t}$ is a random variable, we can calculate its expected value (also called the expectation or mean):
\[v_{\pi}(s) \doteq \mathbb{E}\left[G_{t} \mid S_{t}=s\right].\]
Here, $v_{\pi}(s)$ is called the state-value function or simply the state value of $s$.
- $v_{\pi}(s)$ depends on $s$. This is because its definition is a conditional expectation with the condition that the agent starts from $S_{t}=s$.
- $v_{\pi}(s)$ depends on $\pi$. This is because the trajectories are generated by following the policy $\pi$. For a different policy, the state value may be different.
- $v_{\pi}(s)$ does not depend on $t$. The index $t$ merely marks the current time step as the agent moves in the state space; the value of $v_{\pi}(s)$ is determined once the policy is given.
The relationship between state values and returns:
- When both the policy and the system model are deterministic, starting from a state always leads to the same trajectory. In this case, the return obtained starting from a state is equal to the value of that state. By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories. In this case, the returns of different trajectories may differ, and the state value is the mean of these returns.
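The statement that the state value is the mean of the returns can be illustrated by simulation. The sketch below assumes a hypothetical one-state problem: from state $s$, each step either terminates with reward 1 (probability 0.5) or returns to $s$ with reward 0 (probability 0.5). Averaging many sampled returns approximates $v_{\pi}(s)$, which for this toy problem satisfies $v = 0.5 \cdot 1 + 0.5 \cdot \gamma v$. All names and numbers are assumptions for illustration.

```python
import random

GAMMA = 0.9  # illustrative discount rate


def sample_return(rng):
    """Sample one discounted return G_t starting from the hypothetical state s."""
    discount = 1.0
    while True:
        if rng.random() < 0.5:
            # Terminate with reward 1, discounted by the steps taken so far.
            return discount * 1.0
        # Otherwise collect reward 0 and return to s one step later.
        discount *= GAMMA


rng = random.Random(0)
returns = [sample_return(rng) for _ in range(20_000)]

# Different trajectories give different returns; the state value is their mean.
estimate = sum(returns) / len(returns)

# Exact value from v = 0.5 * 1 + 0.5 * GAMMA * v.
exact = 0.5 / (1 - 0.5 * GAMMA)
print(f"Monte Carlo estimate: {estimate:.4f}, exact: {exact:.4f}")
```

With a deterministic policy and model, every sampled return would be identical and the average would coincide with any single return.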