Fundamentals of Reinforcement Learning

Sahana B
11 min read · Oct 15, 2020

In RL, the main goal is to learn from interaction. We want agents to learn a behavior, a way of selecting actions in given situations, to achieve some goal. The main difference from classical programming or planning is that we do not want to explicitly code the decision-making software ourselves: doing so would require great effort, and it can be very inefficient or even impossible. The RL discipline was born precisely for this reason.

RL agents usually start with no idea of what to do. They typically do not know the goal, the rules of the game, or the dynamics of the environment, that is, how their actions influence its state.

There are three main components of RL: perception, actions, and goals.

Agents should be able to perceive the current environment state to deal with a task. This perception, also called observation, might be different from the actual environment state, can be subject to noise, or can be partial.

For example, think of a robot moving in an unknown environment. For robotic applications, usually, the robot perceives the environment using cameras. Such a perception does not represent the environment state completely; it can be subject to occlusions, poor lighting, or adverse conditions. The system should be able to deal with this incomplete representation and learn a way of moving in the environment.

The other main component of an agent is the ability to act; the agent should be able to take actions that affect the environment state or the agent’s state.

Agents should also have a goal defined through the environment state. Goals are described using high-level concepts such as winning a game, moving in an environment, or driving correctly.

One of the challenges of RL, a challenge that does not arise in other types of learning, is the exploration-exploitation trade-off. In order to improve, the agent has to exploit its knowledge; it should prefer actions that have proven useful in the past. There is a problem here: to discover better actions, the agent must keep exploring, trying actions it has never taken before. To estimate the effect of an action reliably, an agent has to perform each action many times. The critical thing to notice is that neither exploration nor exploitation alone is enough to learn a task; the agent has to balance the two.

The aforementioned is very similar to the challenges we face as babies when we have to learn how to walk. At first, we try different types of movement, and we start from a simple movement yielding satisfactory results: crawling. Then, we want to improve our behavior to become more efficient. To learn a new behavior, we have to do movements we never did before: we try to walk. At first, we perform different actions yielding unsatisfactory results: we fall many times. Once we discover the correct way of moving our legs and balancing our body, we become more efficient in walking. If we did not explore further and we stopped at the first behavior that yields satisfactory results, we would crawl forever. By exploring, we learn that there can be different behaviors that are more efficient. Once we learn how to walk, we can stop exploring, and we can start exploiting our knowledge.
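To make the trade-off concrete, here is a minimal sketch of one common strategy for balancing exploration and exploitation, epsilon-greedy action selection. This strategy is not described in this section, and the action names and value estimates below are illustrative assumptions tied to the walking analogy.

```python
import random

# Illustrative estimated values of each action, learned from past experience
# (the numbers are assumptions for this sketch).
action_values = {"crawl": 1.0, "walk": 0.0}

def epsilon_greedy(values, epsilon=0.1):
    """With probability epsilon, explore a random action;
    otherwise, exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(values))  # explore
    return max(values, key=values.get)      # exploit

# Early on, a larger epsilon encourages trying 'walk' even though 'crawl'
# currently looks better; later, a small epsilon mostly exploits.
print(epsilon_greedy(action_values, epsilon=0.5))
```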

Elements of RL

Let’s introduce the main elements of the RL framework intuitively.

Agent

In RL, the agent is the abstract concept of the entity that moves in the world, takes actions, and achieves goals. An agent can be a piece of autonomous driving software, a chess player, a Go player, an algorithmic trader, or a robot. The agent is everything that can perceive and influence the state of the environment and, therefore, can be used to accomplish goals.

Actions

An agent can perform actions based on the current situation. Actions can assume different forms depending on the specific task.

In an autonomous driving context, actions can be steering, pushing the accelerator pedal, or pushing the brake pedal. In a chess context, examples of actions include moving the knight to h5 or moving the king to a5.

Actions can be low-level, such as controlling the voltage of a vehicle's motors, but they can also be high-level planning actions, such as deciding where to go. The choice of action level is the responsibility of the algorithm's designer. Actions that are too high-level can be challenging to implement, as they might require extensive planning at lower levels; at the same time, actions that are too low-level make the learning problem harder.
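As a small illustration of this design choice (the action names and the voltage range are assumptions, not taken from the text), the same driving agent could expose either a handful of high-level actions or a continuous low-level command:

```python
from dataclasses import dataclass

# High-level, discrete planning actions (assumed names for illustration).
HIGH_LEVEL_ACTIONS = ["go_to_parking_lot", "overtake", "keep_lane"]

# Low-level, continuous control action: a motor voltage within assumed bounds.
@dataclass
class MotorCommand:
    voltage: float  # e.g., between -12.0 and 12.0 volts (assumed range)

# A high-level action hides a lot of lower-level planning; a low-level action
# is easy to execute but makes the learning problem much larger.
action_hl = "keep_lane"
action_ll = MotorCommand(voltage=3.2)
print(action_hl, action_ll)
```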

Environment

The environment represents the context in which the agent moves and takes decisions. An environment is composed of three main elements: states, dynamics, and rewards. They can be explained as follows:

  • State: This represents all of the information describing the environment at a particular timestep. The state is available to the agent through observations, which can be a partial or full representation.
  • Dynamics: The dynamics of an environment describe how actions influence its state. The environment dynamics are usually very complex or unknown. An RL algorithm that uses information about the environment dynamics to learn how to achieve a goal belongs to the category of model-based RL, where the model is the mathematical description of the environment. Most of the time, the environment dynamics are not available to the agent; in this case, the algorithm belongs to the model-free category. Even when the true model is unavailable, too complicated, or too approximate, the agent can learn a model of the environment during training; in this case, too, the algorithm is said to be model-based.
  • Rewards: Rewards are scalar values associated with each timestep that describe the agent's goal. Rewards can also be seen as environmental feedback, providing information to the agent about its behavior; they are, therefore, necessary for making learning possible. If the agent receives a high reward, it means that it performed a good move, one bringing it closer to its goal.

Policy

A policy describes the behavior of the agent. Agents select actions by following their policies. Mathematically, a policy is a function mapping states to actions. What does this mean? Well, it means that the input of the policy is the current state, and its output is the action to take. A policy can have different forms. It can be a simple set of rules, a lookup table, a neural network, or any function approximator. A policy is the core of the RL framework, and the goal of all RL algorithms (implicit or explicit) is to improve the agent’s policy to maximize the agent’s performance on a task (or on a set of tasks). A policy can be stochastic, involving a distribution over actions, or it can be deterministic. In the latter case, the selected action is uniquely determined by the environment’s state.
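As a concrete illustration (separate from the exercise that follows), here is a minimal sketch of two policy forms for a small, hypothetical environment with three states and two actions: a deterministic lookup-table policy and a stochastic policy represented as a distribution over actions. The states, actions, and probabilities are assumptions for this sketch.

```python
import random

# Hypothetical state and action spaces, for illustration only.
STATES = [1, 2, 3]
ACTIONS = ["A", "B"]

# Deterministic policy: a lookup table mapping each state to one action.
deterministic_policy = {1: "A", 2: "B", 3: "A"}

def act_deterministic(state):
    """Return the single action prescribed for this state."""
    return deterministic_policy[state]

# Stochastic policy: a distribution over actions for each state.
stochastic_policy = {
    1: {"A": 0.9, "B": 0.1},
    2: {"A": 0.5, "B": 0.5},
    3: {"A": 0.2, "B": 0.8},
}

def act_stochastic(state):
    """Sample an action according to the state's action distribution."""
    probs = stochastic_policy[state]
    return random.choices(list(probs.keys()), weights=list(probs.values()))[0]

print(act_deterministic(1))  # always 'A'
print(act_stochastic(1))     # 'A' about 90% of the time
```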

An Example of an Autonomous Driving Environment

To better understand the environment’s role and its characteristics in the RL framework, let’s formalize an autonomous driving environment, as shown in the following figure:

Figure 1.10: An autonomous driving scenario

Considering the preceding figure, let’s now look at each of the components of the environment:

  • State: The state can be represented by the 360-degree image of the street around our car. In this case, the state is an image, that is, a matrix of pixels. It can also be represented by a series of images covering the whole space around the car. Another possibility is to describe the state using features and not images. The state can be the current velocity and acceleration of our vehicle, the distance from other cars, or the distance from the street border. In this case, we are using preprocessed information to represent the state more easily. These features can be extracted from images or other types of sensors (for example, Light Detection and Ranging — LIDAR).
  • Dynamics: The dynamics of the environment in an autonomous car scenario are represented by the equations describing how the system changes when the car accelerates, brakes, or steers. For instance, suppose the vehicle is going at 30 km/h and the next vehicle is 100 meters away. The state is represented by the car's speed and the proximity information concerning the next vehicle. If the car accelerates, the speed changes according to the car's properties (included in the environment dynamics). The proximity information also changes, since the next vehicle can get closer or further away (depending on the relative speed). In this situation, at the next timestep, the car's speed can be 35 km/h, and the next vehicle can be closer, for example, only 90 meters away.
  • Reward: The reward can represent how well the agent is driving. It is not easy to formalize a reward function. A natural reward function should reward states in which the car is aligned with the street and should penalize states in which the car crashes or goes off the road. The definition of the reward function is an open problem, and researchers are putting effort into developing algorithms where the reward function is not needed (self-motivation or curiosity-driven agents), where the agent learns from demonstrations (imitation learning), and where the agent recovers the reward function from demonstrations (Inverse Reinforcement Learning or IRL). A toy sketch of the dynamics and reward described here is shown after this list.
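To make the dynamics and reward descriptions above concrete, here is a minimal, purely illustrative sketch of a simplified state update and reward for the speed/distance state. The time step, acceleration values, leading-vehicle behavior, and thresholds are assumptions, not part of a real vehicle model.

```python
DT = 1.0  # timestep in seconds (assumed)

def dynamics(speed_kmh, distance_m, action):
    """Toy state update: how our speed and the distance to the next vehicle change."""
    if action == "accelerate":
        speed_kmh += 5.0
    elif action == "brake":
        speed_kmh = max(0.0, speed_kmh - 5.0)
    # Assume the leading vehicle keeps a constant 30 km/h (assumption).
    relative_speed_ms = (speed_kmh - 30.0) / 3.6
    distance_m -= relative_speed_ms * DT
    return speed_kmh, distance_m

def reward(distance_m):
    """Toy reward: penalize crashes, reward keeping a safe distance."""
    if distance_m <= 0.0:
        return -100.0  # crash
    if distance_m < 10.0:
        return -1.0    # dangerously close
    return 1.0         # safe driving

state = (30.0, 100.0)                  # 30 km/h, next vehicle 100 m ahead
state = dynamics(*state, "accelerate")
print(state, reward(state[1]))         # e.g., (35.0, ~98.6) and reward 1.0
```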

We are now ready to design and implement our first environment class using Python. We will demonstrate how to implement the state, the dynamics, and the reward of a toy problem in the following exercise.

Exercise 1.01: Implementing a Toy Environment Using Python

In this exercise, we will implement a simple toy environment using Python. The environment is illustrated in Figure 1.11. It is composed of three states (1, 2, 3) and two actions (A and B). The initial state is state 1. States are represented by nodes. Edges represent transitions between states. On the edges, we have an action causing the transition and the associated reward.

The representation of the environment in Figure 1.11 is the standard environment representation in the context of RL. In this exercise, we will become acquainted with the concept of the environment and its implementation:

Figure 1.11: A toy environment composed of three states (1, 2, 3) and two actions (A and B)

In the preceding figure, the reward is associated with each state-action pair.

The goal of this exercise is to implement an Environment class with a step() method that takes the agent's action as input and returns a (next state, reward) pair. In addition to this, we will write a reset() method that restores the environment's initial state:
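Since the transitions and rewards drawn in Figure 1.11 are not reproduced here, the transition table and reward values in the following sketch are assumptions chosen for illustration; the structure of the class (three states, two actions, step() and reset()) follows the exercise description.

```python
class Environment:
    """A toy environment with three states (1, 2, 3) and two actions ('A', 'B').

    The transition table and rewards below are illustrative placeholders,
    since the exact values shown in Figure 1.11 are not reproduced here.
    """

    def __init__(self):
        # (state, action) -> (next state, reward); assumed values.
        self.transitions = {
            (1, "A"): (2, 1.0),
            (1, "B"): (3, -1.0),
            (2, "A"): (3, 0.0),
            (2, "B"): (1, 1.0),
            (3, "A"): (1, 0.0),
            (3, "B"): (2, -1.0),
        }
        self.initial_state = 1
        self.state = self.initial_state

    def reset(self):
        """Restore the initial state and return it."""
        self.state = self.initial_state
        return self.state

    def step(self, action):
        """Apply the action and return the (next state, reward) pair."""
        next_state, reward = self.transitions[(self.state, action)]
        self.state = next_state
        return next_state, reward


env = Environment()
state = env.reset()
print(env.step("A"))  # e.g., (2, 1.0) with the assumed table above
```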

The Agent-Environment Interface

RL considers sequential decision-making problems. In this context, we can refer to the agent as the "decision-maker." In sequential decision-making problems, actions taken by the decision-maker do not only influence the immediate reward and the environment's immediate state; they also affect future rewards and states. Markov Decision Processes (MDPs) are a natural way of formalizing sequential decision-making problems. In MDPs, an agent interacts with an environment through actions and receives rewards based on the action, on the current state of the environment, and on the environment's dynamics. The goal of the decision-maker is to maximize the cumulative sum of rewards over a horizon (which is possibly infinite). The task the agent has to learn is defined through the rewards it receives, as you can see in the following figure:

Figure 1.12: The Agent-Environment interface

In RL, an episode is divided into a sequence of discrete timesteps: t = 0, 1, 2, ..., T. Here, T represents the horizon length, which is possibly infinite. The interaction between the agent and the environment happens at each timestep. At each timestep, the agent receives a representation of the current environment's state, S_t. Based on this state, it selects an action, A_t, belonging to the action space given the current state, A(S_t). The action affects the environment. As a result, the environment changes its state, transitioning to the next state, S_t+1, according to its dynamics. At the same time, the agent receives a scalar reward, R_t+1, quantifying how good the action taken in that state was.

Let’s now try to understand the mathematical notations used in the preceding example:

  • Time horizon, T: If a task has a finite time horizon, then T is an integer representing the maximum duration of an episode. In infinite tasks, T can also be ∞.
  • Action, A_t: This is the action taken by the agent at timestep t. The action belongs to the action space, A(S_t), defined by the current state, S_t.
  • State, S_t: This is the representation of the environment's state received by the agent at time t. It belongs to the state space, S, defined by the environment. It can be represented by an image, a sequence of images, or a simple vector assuming different shapes. Note that the actual environment state can be different from, and more complex than, the state perceived by the agent.
  • Reward, R_t+1: This is a real number describing how good the taken action was. A high reward corresponds to a good action. The reward is fundamental for the agent to understand how to achieve a goal.

In episodic RL, the agent-environment interaction is divided into episodes; the agent has to achieve the goal within the episode. The purpose of the interaction is to learn a better behavior. After several episodes, the agent can decide to update its behavior by incorporating the knowledge acquired from past interactions. Based on the effect of its actions on the environment and the rewards received, the agent will more frequently perform actions that yield higher rewards. An interaction loop of this kind is sketched below.
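As a minimal sketch of this interaction, the following loop reuses the toy Environment class from the Exercise 1.01 sketch above and a random policy; the number of episodes, the horizon length, and the random action selection are assumptions for illustration rather than part of the text.

```python
import random

NUM_EPISODES = 5     # assumed, for illustration
MAX_TIMESTEPS = 10   # horizon length T (assumed finite here)

env = Environment()  # the toy Environment class sketched in Exercise 1.01

for episode in range(NUM_EPISODES):
    state = env.reset()                          # S_0
    episode_return = 0.0
    for t in range(MAX_TIMESTEPS):
        action = random.choice(["A", "B"])       # A_t, from a random policy
        next_state, reward = env.step(action)    # S_t+1 and R_t+1
        episode_return += reward
        state = next_state
    print(f"Episode {episode}: return = {episode_return}")
```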
