
Commit f30dbc2

migrate tips
1 parent f8ed470 commit f30dbc2


7 files changed: +305 −316 lines changed


units/en/unit1/deep-rl.mdx

Lines changed: 20 additions & 21 deletions
@@ -1,21 +1,20 @@
-# The “Deep” in Reinforcement Learning [[deep-rl]]
-
-<Tip>
-What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?
-</Tip>
-
-Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.
-
-For instance, in the next unit, we’ll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
-
-You’ll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
-
-In the second approach, **we will use a Neural Network** (to approximate the Q value).
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Value based RL"/>
-<figcaption>Schema inspired by the Q learning notebook by Udacity
-</figcaption>
-</figure>
-
-If you are not familiar with Deep Learning you should definitely watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
+# The “Deep” in Reinforcement Learning [[deep-rl]]
+
+> [!TIP]
+> What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?
+
+Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.
+
+For instance, in the next unit, we’ll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
+
+You’ll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
+
+In the second approach, **we will use a Neural Network** (to approximate the Q value).
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Value based RL"/>
+<figcaption>Schema inspired by the Q learning notebook by Udacity
+</figcaption>
+</figure>
+
+If you are not familiar with Deep Learning you should definitely watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
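
deep-rl.mdx contrasts a Q table (classic Q-Learning) with a neural network that approximates Q-values (Deep Q-Learning). The sketch below illustrates that distinction only; the grid size, network shape, and function names are assumptions for the example, not code from the course:

```python
import numpy as np

# Classic Q-Learning: one table row per discrete state, one column per action.
n_states, n_actions = 16, 4                  # illustrative sizes, e.g. a tiny grid world
q_table = np.zeros((n_states, n_actions))

def act_from_table(state: int) -> int:
    """Look up the stored Q-values for this state and pick the best action."""
    return int(np.argmax(q_table[state]))

# Deep Q-Learning: a small network approximates Q(s, a) instead of storing a table,
# so large or continuous state spaces no longer need one row per state.
rng = np.random.default_rng(0)
state_dim, hidden = 8, 32                    # illustrative dimensions
w1, b1 = rng.normal(size=(state_dim, hidden)) * 0.1, np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, n_actions)) * 0.1, np.zeros(n_actions)

def q_network(state_vec: np.ndarray) -> np.ndarray:
    """Approximate the Q-value of every action for a state vector."""
    h = np.maximum(state_vec @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

def act_from_network(state_vec: np.ndarray) -> int:
    return int(np.argmax(q_network(state_vec)))

print(act_from_table(3))
print(act_from_network(rng.normal(size=state_dim)))
```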

units/en/unit1/rl-framework.mdx

Lines changed: 143 additions & 144 deletions
@@ -1,144 +1,143 @@
-# The Reinforcement Learning Framework [[the-reinforcement-learning-framework]]
-
-## The RL Process [[the-rl-process]]
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
-<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
-<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
-</figure>
-
-To understand the RL process, let’s imagine an agent learning to play a platform game:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
-
-- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
-- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
-- The environment goes to a **new** **state \\(S_1\\)** — new frame.
-- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
-
-This RL loop outputs a sequence of **state, action, reward and next state.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="State, Action, Reward, Next State" width="100%">
-
-The agent's goal is to _maximize_ its cumulative reward, **called the expected return.**
-
-## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]
-
-⇒ Why is the goal of the agent to maximize the expected return?
-
-Because RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward).
-
-That’s why in Reinforcement Learning, **to have the best behavior,** we aim to learn to take actions that **maximize the expected cumulative reward.**
-
-
-## Markov Property [[markov-property]]
-
-In papers, you’ll see that the RL process is called a **Markov Decision Process** (MDP).
-
-We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** they took before.
-
-## Observations/States Space [[obs-space]]
-
-Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.
-
-There is a differentiation to make between *observation* and *state*, however:
-
-- *State s*: is **a complete description of the state of the world** (there is no hidden information). In a fully observed environment.
-
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/chess.jpg" alt="Chess">
-<figcaption>In chess game, we receive a state from the environment since we have access to the whole check board information.</figcaption>
-</figure>
-
-In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.
-
-- *Observation o*: is a **partial description of the state.** In a partially observed environment.
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
-<figcaption>In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.</figcaption>
-</figure>
-
-In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.
-
-In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**
-
-<Tip>
-In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.
-</Tip>
-
-To recap:
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/obs_space_recap.jpg" alt="Obs space recap" width="100%">
-
-
-## Action Space [[action-space]]
-
-The Action space is the set of **all possible actions in an environment.**
-
-The actions can come from a *discrete* or *continuous space*:
-
-- *Discrete space*: the number of possible actions is **finite**.
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
-<figcaption>In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).</figcaption>
-
-</figure>
-
-Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.
-
-- *Continuous space*: the number of possible actions is **infinite**.
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/self_driving_car.jpg" alt="Self Driving Car">
-<figcaption>A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21,1°, 21,2°, honk, turn right 20°…
-</figcaption>
-</figure>
-
-To recap:
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/action_space.jpg" alt="Action space recap" width="100%">
-
-Taking this information into consideration is crucial because it will **have importance when choosing the RL algorithm in the future.**
-
-## Rewards and the discounting [[rewards]]
-
-The reward is fundamental in RL because it’s **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not.**
-
-The cumulative reward at each time step **t** can be written as:
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards">
-<figcaption>The cumulative reward equals the sum of all rewards in the sequence.
-</figcaption>
-</figure>
-
-Which is equivalent to:
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_2.jpg" alt="Rewards">
-<figcaption>The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1)+ rt+2 (rt+k+1 = rt+1+1 = rt+2) + ...
-</figcaption>
-</figure>
-
-However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.
-
-Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is **to eat the maximum amount of cheese before being eaten by the cat.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_3.jpg" alt="Rewards" width="100%">
-
-As we can see in the diagram, **it’s more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is).
-
-Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we’re not really sure we’ll be able to eat it.
-
-To discount the rewards, we proceed like this:
-
-1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.95 and 0.99**.
-- The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
-- On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short term reward (the nearest cheese).**
-
-2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.**
-
-Our discounted expected cumulative reward is:
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" alt="Rewards" width="100%">
+# The Reinforcement Learning Framework [[the-reinforcement-learning-framework]]
+
+## The RL Process [[the-rl-process]]
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
+<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
+<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
+</figure>
+
+To understand the RL process, let’s imagine an agent learning to play a platform game:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
+
+- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
+- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
+- The environment goes to a **new** **state \\(S_1\\)** — new frame.
+- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
+
+This RL loop outputs a sequence of **state, action, reward and next state.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="State, Action, Reward, Next State" width="100%">
+
+The agent's goal is to _maximize_ its cumulative reward, **called the expected return.**
+
+## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]
+
+⇒ Why is the goal of the agent to maximize the expected return?
+
+Because RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward).
+
+That’s why in Reinforcement Learning, **to have the best behavior,** we aim to learn to take actions that **maximize the expected cumulative reward.**
+
+
+## Markov Property [[markov-property]]
+
+In papers, you’ll see that the RL process is called a **Markov Decision Process** (MDP).
+
+We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** they took before.
+
+## Observations/States Space [[obs-space]]
+
+Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.
+
+There is a differentiation to make between *observation* and *state*, however:
+
+- *State s*: is **a complete description of the state of the world** (there is no hidden information). In a fully observed environment.
+
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/chess.jpg" alt="Chess">
+<figcaption>In chess game, we receive a state from the environment since we have access to the whole check board information.</figcaption>
+</figure>
+
+In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.
+
+- *Observation o*: is a **partial description of the state.** In a partially observed environment.
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
+<figcaption>In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.</figcaption>
+</figure>
+
+In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.
+
+In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**
+
+> [!TIP]
+> In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.
+
+To recap:
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/obs_space_recap.jpg" alt="Obs space recap" width="100%">
+
+
+## Action Space [[action-space]]
+
+The Action space is the set of **all possible actions in an environment.**
+
+The actions can come from a *discrete* or *continuous space*:
+
+- *Discrete space*: the number of possible actions is **finite**.
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
+<figcaption>In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).</figcaption>
+
+</figure>
+
+Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.
+
+- *Continuous space*: the number of possible actions is **infinite**.
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/self_driving_car.jpg" alt="Self Driving Car">
+<figcaption>A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21,1°, 21,2°, honk, turn right 20°…
+</figcaption>
+</figure>
+
+To recap:
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/action_space.jpg" alt="Action space recap" width="100%">
+
+Taking this information into consideration is crucial because it will **have importance when choosing the RL algorithm in the future.**
+
+## Rewards and the discounting [[rewards]]
+
+The reward is fundamental in RL because it’s **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not.**
+
+The cumulative reward at each time step **t** can be written as:
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards">
+<figcaption>The cumulative reward equals the sum of all rewards in the sequence.
+</figcaption>
+</figure>
+
+Which is equivalent to:
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_2.jpg" alt="Rewards">
+<figcaption>The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1)+ rt+2 (rt+k+1 = rt+1+1 = rt+2) + ...
+</figcaption>
+</figure>
+
+However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.
+
+Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is **to eat the maximum amount of cheese before being eaten by the cat.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_3.jpg" alt="Rewards" width="100%">
+
+As we can see in the diagram, **it’s more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is).
+
+Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we’re not really sure we’ll be able to eat it.
+
+To discount the rewards, we proceed like this:
+
+1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.95 and 0.99**.
+- The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
+- On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short term reward (the nearest cheese).**
+
+2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.**
+
+Our discounted expected cumulative reward is:
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" alt="Rewards" width="100%">
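
rl-framework.mdx distinguishes discrete action spaces (a finite set of moves, like Mario's four directions) from continuous ones (infinitely many values, like steering angles). A hedged sketch of how that distinction is usually expressed with Gym-style space objects, assuming the `gymnasium` package and made-up bounds that are not part of this file:

```python
import gymnasium as gym
import numpy as np

# Discrete space: a finite number of actions, e.g. the 4 moves described for Super Mario Bros.
discrete_actions = gym.spaces.Discrete(4)            # actions are the integers 0..3
print(discrete_actions.n, discrete_actions.sample())

# Continuous space: infinitely many actions, e.g. a steering angle plus a throttle
# for a self-driving-car-like agent (bounds here are illustrative).
continuous_actions = gym.spaces.Box(
    low=np.array([-25.0, 0.0], dtype=np.float32),    # [steering in degrees, throttle]
    high=np.array([25.0, 1.0], dtype=np.float32),
)
print(continuous_actions.shape, continuous_actions.sample())
```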
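
The discounting passage in rl-framework.mdx corresponds to the standard discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..., and a few lines of Python make the effect of gamma concrete (the reward sequence below is only an example, not course code):

```python
# Discounted return: G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
# with gamma between 0 and 1, typically 0.95–0.99.

def discounted_return(rewards, gamma=0.99):
    """Sum a finite reward sequence, discounting the k-th reward by gamma**k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A large gamma keeps distant rewards (the cheese near the cat) almost fully valued;
# a small gamma makes the agent care mostly about the nearby cheese.
print(discounted_return([1, 1, 1, 1], gamma=0.99))  # ~3.94
print(discounted_return([1, 1, 1, 1], gamma=0.50))  # 1.875
```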
