IERG 5330 Reinforcement Learning

January 27, 2026

Notes for Mathematical Foundations of Reinforcement Learning.

Basic Concepts

State and action

State transition

When taking an action, the agent may move from one state to another. Such a process is called state transition. For example, if the agent is in state $s_1$ and selects action $a_2$, then the agent moves to state $s_2$. Such a process can be expressed as

\[s_1 \xrightarrow{a_2} s_2\]

In general, state transitions can be stochastic and are therefore described by conditional probability distributions, e.g., $p(s_2 \mid s_1, a_2) = 1$ for the deterministic transition above.
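As an illustration, a stochastic transition model can be stored as a table of conditional distributions and sampled from. Below is a minimal Python sketch; the states, actions, and probabilities are made up for this example:

```python
import random

# Illustrative transition model: (state, action) -> {next_state: probability}.
# All states, actions, and probabilities here are assumptions for this sketch.
P = {
    ("s1", "a2"): {"s2": 1.0},              # deterministic: always moves to s2
    ("s2", "a3"): {"s5": 0.8, "s2": 0.2},   # stochastic transition
}

def step(state, action):
    """Sample the next state from the conditional distribution p(s' | s, a)."""
    dist = P[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(step("s1", "a2"))  # always 's2'
print(step("s2", "a3"))  # 's5' with probability 0.8, 's2' with probability 0.2
```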

Policy

Reward

Trajectories, returns, and episodes

A trajectory is a state-action-reward chain:

\[s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9\]

The return of this trajectory is defined as the sum of all the rewards collected along the trajectory:

\[\text{return} = 0+0+0+1=1\]

Returns are also called total rewards or cumulative rewards.

Returns can be used to evaluate policies: starting from the same state, different policies generate different trajectories and hence different returns, and a policy yielding a greater return is better.

A return consists of an immediate reward and future rewards. Here, the immediate reward is the reward obtained after taking an action at the initial state; the future rewards refer to the rewards obtained after leaving the initial state.
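For the example trajectory above, this decomposition reads

\[\text{return} = \underbrace{0}_{\text{immediate reward}} + \underbrace{0+0+1}_{\text{future rewards}} = 1.\]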

A return can also be defined for infinitely long trajectories.

The undiscounted sum of rewards may then diverge; to avoid this, we introduce the discounted return:

\[\text{discounted return} = 0 + \gamma \cdot 0 + \gamma^{2} \cdot 0 + \gamma^{3} \cdot 1 + \gamma^{4} \cdot 1 + \gamma^{5} \cdot 1 + \dots\]

where $\gamma \in (0, 1)$ is called the discount rate.
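For the trajectory above, this sum can be evaluated in closed form via the geometric series:

\[\text{discounted return} = \gamma^{3}\left(1+\gamma+\gamma^{2}+\dots\right) = \frac{\gamma^{3}}{1-\gamma}.\]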

The introduction of the discount rate is useful for two reasons. First, it makes the infinite sum converge to a finite value. Second, it balances the weight given to near- and far-future rewards: a $\gamma$ close to $0$ makes the agent short-sighted, emphasizing immediate rewards, while a $\gamma$ close to $1$ makes the agent far-sighted, emphasizing long-term rewards.
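As a quick check, here is a minimal Python sketch that computes both the plain and the discounted return of the example trajectory; the truncation horizon used to approximate the infinite case is an assumption of the sketch:

```python
# Rewards along the example trajectory above: r = 0, 0, 0, 1.
rewards = [0, 0, 0, 1]

def plain_return(rewards):
    """Return: the sum of all rewards collected along a trajectory."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return: sum over t of gamma**t * (t-th reward)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(plain_return(rewards))             # 1
print(discounted_return(rewards, 0.9))   # 0.9**3 = 0.729

# Approximating the infinite trajectory that keeps collecting reward 1:
# a long truncation approaches the closed form gamma**3 / (1 - gamma).
gamma = 0.9
print(discounted_return([0, 0, 0] + [1] * 1000, gamma))  # close to 7.29
```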

When interacting with the environment by following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).

An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks. However, some tasks may have no terminal states, meaning that the process of interacting with the environment will never end. Such tasks are called continuing tasks.

Markov decision processes

A Markov decision process (MDP) is a general framework for describing stochastic dynamical systems. A key ingredient of an MDP is the Markov property, the memoryless property of a stochastic process:

\[\begin{aligned} p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) &= p(s_{t+1} \mid s_t, a_t), \\ p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) &= p(r_{t+1} \mid s_t, a_t), \end{aligned}\]

where $t$ represents the current time step and $t + 1$ represents the next time step. These equations indicate that the next state and reward depend only on the current state and action and are independent of all earlier states and actions.
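The Markov property is what allows a simulator to carry only the current state: sampling $S_{t+1}$ and $R_{t+1}$ requires nothing but $(S_t, A_t)$. A minimal sketch, with an illustrative model and policy (all names and numbers are assumptions):

```python
import random

# Illustrative model of p(s', r | s, a): (state, action) -> [(s', r, prob), ...].
model = {
    ("s1", "a1"): [("s1", 0, 0.3), ("s2", 0, 0.7)],
    ("s2", "a2"): [("s2", 1, 1.0)],
}
policy = {"s1": "a1", "s2": "a2"}  # a fixed deterministic policy

def step(state, action):
    """Sample (S_{t+1}, R_{t+1}) given only (S_t, A_t); by the Markov
    property, no earlier states or actions are needed (or stored)."""
    outcomes = model[(state, action)]
    weights = [p for (_, _, p) in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

# Roll out a short trajectory: the loop carries only the current state.
state = "s1"
for t in range(5):
    action = policy[state]
    state, reward = step(state, action)
    print(t, action, state, reward)
```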

Q&A

Is the reward a function of the next state?

The answer is that $r$ depends on $s$, $a$, and $s'$. However, since $s'$ also depends on $s$ and $a$, we can equivalently write $r$ as a function of $s$ and $a$: $p(r \mid s, a) = \sum_{s'} p(r \mid s, a, s') p(s' \mid s, a)$.
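A small numeric check of this marginalization, with made-up distributions for one fixed $(s, a)$:

```python
# Made-up distributions for one fixed (s, a):
p_s_next = {"s2": 0.7, "s3": 0.3}            # p(s' | s, a)
p_r_given_s_next = {                          # p(r | s, a, s')
    "s2": {0: 1.0},          # landing in s2 always yields reward 0
    "s3": {0: 0.5, 1: 0.5},  # landing in s3 yields reward 0 or 1 equally often
}

# Marginalize: p(r | s, a) = sum over s' of p(r | s, a, s') * p(s' | s, a).
p_r = {}
for s_next, p_sn in p_s_next.items():
    for r, p_cond in p_r_given_s_next[s_next].items():
        p_r[r] = p_r.get(r, 0.0) + p_cond * p_sn

print(p_r)  # {0: 0.85, 1: 0.15}
```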

State Values and Bellman Equation

State values

Consider a sequence of time steps $t=0,1,2, \dots$. At time $t$, the agent is in state $S_{t}$, and the action taken following a policy $\pi$ is $A_{t}$. The next state is $S_{t+1}$, and the immediate reward obtained is $R_{t+1}$. This process can be expressed concisely as

\[S_{t} \xrightarrow{A_{t}} S_{t+1}, R_{t+1}\]

Note that $S_{t}, S_{t+1}, A_{t}, R_{t+1}$ are all random variables. Moreover, $S_{t}, S_{t+1} \in \mathcal{S}, A_{t} \in \mathcal{A}\left(S_{t}\right)$, and $R_{t+1} \in \mathcal{R}\left(S_{t}, A_{t}\right)$.

Starting from $t$, we can obtain a state-action-reward trajectory:

\[S_{t} \xrightarrow{A_{t}} S_{t+1}, R_{t+1} \xrightarrow{A_{t+1}} S_{t+2}, R_{t+2} \xrightarrow{A_{t+2}} S_{t+3}, R_{t+3} \ldots\]

By definition, the discounted return along the trajectory is

\[G_{t} \doteq R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots,\]

where $\gamma \in(0,1)$ is the discount rate. Note that $G_{t}$ is a random variable since $R_{t+1}, R_{t+2}, \ldots$ are all random variables.

Since $G_{t}$ is a random variable, we can calculate its expected value (also called the expectation or mean):

\[v_{\pi}(s) \doteq \mathbb{E}\left[G_{t} \mid S_{t}=s\right] .\]

Here, $v_{\pi}(s)$ is called the state-value function or simply the state value of $s$. Note that it is defined for a particular policy $\pi$: different policies generally yield different state values.
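Because $v_{\pi}(s)$ is an expectation, it can be approximated by averaging sampled returns. Below is a minimal Monte Carlo sketch for an illustrative two-state MDP with a fixed policy baked into the dynamics; all names, numbers, and the truncation horizon are assumptions:

```python
import random

gamma = 0.9

# Illustrative two-state MDP with a fixed policy baked into the dynamics:
#   state -> [(next_state, reward, probability), ...]
dynamics = {
    "s1": [("s1", 0, 0.5), ("s2", 0, 0.5)],
    "s2": [("s2", 1, 1.0)],
}

def sample_return(state, horizon=200):
    """Sample one (truncated) discounted return G_t starting from `state`."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        outcomes = dynamics[state]
        weights = [p for (_, _, p) in outcomes]
        state, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
        g += discount * reward
        discount *= gamma
    return g

def v_hat(state, n=10000):
    """Monte Carlo estimate of v_pi(state) = E[G_t | S_t = state]."""
    return sum(sample_return(state) for _ in range(n)) / n

print(v_hat("s2"))  # close to 1 / (1 - gamma) = 10
print(v_hat("s1"))  # lower (about 8.18), since rewards only begin after reaching s2
```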

The relationship between state values and returns is as follows: while a return is computed for a single trajectory, the state value of $s$ is the mean of the returns over all possible trajectories starting from $s$. If the policy, state transitions, and rewards are all deterministic, starting from $s$ always generates the same trajectory, and the state value equals the return of that trajectory.