Commit 695d9ac, committed Mar 21, 2025
Upload chapter10 reinforcement learning
1 parent 1de1542


54 files changed: +835 −0 lines changed
 
@@ -0,0 +1,27 @@
# Chapter Summary

1. RL systems usually consist of an agent, an environment, and a
   policy. A reward function is also essential, as it defines the
   objective that the policy is optimized for in RL.

2. The single-agent, single-node RL system is a relatively simple
   framework among RL systems, yet it is still composed of several key
   components.

3. Distributed RL systems are more complex than single-node systems,
   but they offer the benefit of speeding up the policy optimization
   process.

4. Multi-agent RL involves more than one agent interacting with each
   other and with the environment. Multi-agent RL systems can therefore
   be more complicated than single-agent RL systems, and the agents may
   even have different objectives.

5. Due to the particularities of reinforcement learning problem
   settings (e.g., sampling through interaction with the environment),
   the related algorithms place stricter requirements on the computing
   system. This raises a couple of questions: How can we better balance
   sample collection and policy training while also making even use of
   different compute hardware such as CPUs and GPUs? And how can
   reinforcement learning agents be deployed in a large-scale
   distributed system? To find the answers to these questions, we must
   deeply understand the design and use of computer systems.
@@ -0,0 +1,141 @@
# Distributed Reinforcement Learning System

The distributed reinforcement learning system is more powerful than the
single-node reinforcement learning system we discussed earlier. It can
run multiple models in multiple environments in parallel, meaning it can
update multiple models on multiple computer systems at the same time. As
such, it significantly accelerates the learning process and improves the
overall performance of the reinforcement learning system. This section
focuses on common algorithms and systems in distributed reinforcement
learning.

## Distributed RL Algorithm -- A3C

Asynchronous Advantage Actor-Critic (A3C) was proposed by DeepMind
researchers in 2016. This algorithm can update networks on multiple
computing devices in parallel. Unlike the single-node reinforcement
learning system, A3C creates a group of workers, allocates the workers
to different computing devices, and creates an interactive environment
for each worker, thereby implementing parallel sampling and model
updates. In addition, it uses a master node to update the actor networks
(policy networks) and critic networks (value networks), which correspond
to the policy and value functions in reinforcement learning,
respectively. Such a design allows each worker to send the gradients
computed on its collected samples to the master node in real time in
order to update the parameters held there. The updated parameters are
then transferred back to each worker in real time for model
synchronization. Each worker can perform its computation on a GPU, so
the entire algorithm updates the model in parallel on a GPU cluster.
Figure :numref:`ch011/ch11-a3c` depicts the algorithm structure.
Research shows that in addition to accelerating model learning,
distributed reinforcement learning helps stabilize learning performance,
because the gradients are computed from samples collected in
environments on multiple nodes.

![A3C distributed algorithm architecture](../img/ch11/ch11-a3c.pdf)
:label:`ch011/ch11-a3c`
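
To make the interaction between workers and the master node concrete,
below is a minimal, hypothetical PyTorch sketch of an A3C-style
asynchronous update. The network, loss, and "environment" are
placeholders rather than a real actor-critic implementation: each worker
keeps a local copy of the shared network, computes gradients on its own
samples, hands them to the shared parameters, and applies the update
without waiting for the other workers (lock-free, Hogwild-style).

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp


def worker(global_net, lr, steps):
    # Local copy of the policy/value network owned by this worker.
    local_net = nn.Linear(4, 2)
    # The optimizer updates the *shared* global parameters.
    opt = torch.optim.SGD(global_net.parameters(), lr=lr)
    for _ in range(steps):
        # Synchronize the local copy with the master parameters.
        local_net.load_state_dict(global_net.state_dict())
        # Placeholder for environment interaction and the actor-critic loss.
        obs = torch.randn(8, 4)
        loss = local_net(obs).pow(2).mean()
        local_net.zero_grad()
        loss.backward()
        # Hand the locally computed gradients to the shared parameters,
        # then apply the update asynchronously (no locking, as in A3C).
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp.grad = lp.grad
        opt.step()


if __name__ == "__main__":
    global_net = nn.Linear(4, 2)
    global_net.share_memory()  # keep the master parameters in shared memory
    workers = [mp.Process(target=worker, args=(global_net, 1e-2, 100))
               for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```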

## Distributed RL Algorithm -- IMPALA

Importance Weighted Actor-Learner Architecture (IMPALA) is a
reinforcement learning framework proposed by Lasse Espeholt et al. in
2018 to implement clustered multi-machine training. Figure
:numref:`ch011/ch11-impala` depicts this architecture. Like A3C,
IMPALA enables gradient computation on multiple GPUs in parallel. In
IMPALA, multiple actors and learners run in parallel. Each actor holds a
policy network and collects samples by interacting with its own
environment instance. The collected sample trajectories are sent by the
actors to their respective learners for gradient computation. Among the
learners, there is a master learner that communicates with the other
learners to gather their computed gradients and update its model. The
updated model is then delivered to the other learners and the actors for
a new round of sampling and gradient computation. As a distributed
computing architecture, IMPALA has proved to be more efficient than A3C.
It benefits from a specially designed gradient computation function in
the learners and from the V-trace target, which uses importance weights
to stabilize off-policy training. Because the V-trace technique is not
related to our area of focus here, we will not elaborate on it;
interested readers can learn more from the original paper.

![IMPALA distributed algorithm architecture](../img/ch11/ch11-impala.pdf)
:label:`ch011/ch11-impala`
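
The essence of the actor-learner split can be sketched in a few lines.
The following single-machine sketch is hypothetical and simplified:
actor threads roll out trajectories with a possibly stale copy of the
policy and push them, together with the behaviour policy's outputs
(which V-trace would use for importance weighting), into a queue, while
a single learner consumes them and updates the central policy. The
network, loss, and "rollout" are placeholders, and real IMPALA
distributes actors and learners across machines.

```python
import queue
import threading

import torch
import torch.nn as nn

policy = nn.Linear(4, 2)            # central (learner-side) policy network
traj_queue = queue.Queue(maxsize=64)
param_lock = threading.Lock()


def actor(num_trajectories):
    local_policy = nn.Linear(4, 2)  # the actor's own copy of the policy
    for _ in range(num_trajectories):
        with param_lock:
            # Pull the latest (possibly slightly stale) parameters.
            local_policy.load_state_dict(policy.state_dict())
        obs = torch.randn(16, 4)    # placeholder environment rollout
        behaviour_logits = local_policy(obs).detach()
        # Ship the whole trajectory plus the behaviour outputs that an
        # off-policy correction such as V-trace would need.
        traj_queue.put((obs, behaviour_logits))


def learner(num_batches):
    opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
    for _ in range(num_batches):
        obs, behaviour_logits = traj_queue.get()
        loss = policy(obs).pow(2).mean()  # stands in for the V-trace loss
        opt.zero_grad()
        loss.backward()
        with param_lock:
            opt.step()


actors = [threading.Thread(target=actor, args=(50,)) for _ in range(4)]
for t in actors:
    t.start()
learner(num_batches=4 * 50)         # consume everything the actors produce
for t in actors:
    t.join()
```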

## Other Algorithms

Apart from A3C and IMPALA, researchers have proposed other algorithms in
recent studies, for example, SEED and Ape-X. These algorithms are more
effective in distributed reinforcement learning. Readers can find out
more about these algorithms from the corresponding papers. Next, we move
on to some typical distributed reinforcement learning algorithm
libraries.

## Distributed RL System -- RLlib

RLlib is an open-source reinforcement learning framework oriented to
industrial applications. It is built on Ray, a distributed computing
framework initiated by researchers from UC Berkeley. RLlib contains a
library of reinforcement learning algorithms and is convenient for users
who are not very experienced in reinforcement learning.

Figure :numref:`ch011/ch11-rllib-arch` shows the architecture of RLlib.
Its bottom layer is built on Ray's basic components for distributed
computing and communication. On top of them, reinforcement learning
oriented components such as Trainer, Environment, and Policy are
abstracted at the Python layer. There are built-in implementations of
these abstract components, and users can extend them according to their
algorithm requirements. With these built-in and customized algorithm
components, researchers can quickly implement specific reinforcement
learning algorithms.

![RLlib architecture](../img/ch11/ch11-rllib-arch.png)
:label:`ch011/ch11-rllib-arch`

RLlib supports distributed reinforcement learning training of different
paradigms. Figure :numref:`ch011/ch11-rllib-distributed` shows the
distributed training architecture of a reinforcement learning algorithm
based on synchronous sampling. Each rollout worker is an independent
process that interacts with its own environment to collect experience,
and multiple rollout workers can interact with their environments in
parallel. The trainer is responsible for coordinating the rollout
workers, optimizing the policy, and synchronizing the updated policy
back to the rollout workers.

![RLlib distributed training](../img/ch11/ch11-rllib-distributed.png)
:label:`ch011/ch11-rllib-distributed`

Reinforcement learning is usually based on deep neural networks. For
distributed learning based on such networks, we can combine RLlib with a
deep learning framework such as PyTorch or TensorFlow. In this approach,
the deep learning framework is responsible for training and updating the
policy network, while RLlib takes over the computation of the
reinforcement learning algorithm. RLlib also supports interaction with
parallel vectorized environments and pluggable simulators, as well as
offline reinforcement learning.
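
As an illustration, the sketch below trains PPO on a Gym environment
with several parallel rollout workers, which corresponds to the
synchronous-sampling setup described above. It is written against the
Ray 2.x style configuration API; names such as `PPOConfig`,
`num_rollout_workers`, and the keys of the result dictionary vary
between RLlib versions, so treat this as a sketch rather than a
version-exact recipe.

```python
import ray
from ray.rllib.algorithms.ppo import PPOConfig

ray.init()

# Configure PPO with several rollout workers; each worker is a separate
# process that samples from its own copy of the environment.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=4)
)

algo = config.build()
for i in range(10):
    result = algo.train()  # one iteration: parallel sampling + policy update
    print(i, result.get("episode_reward_mean"))

algo.stop()
ray.shutdown()
```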

## Distributed RL System -- Reverb and Acme

When it comes to managing the experience replay buffer, Reverb is an
unavoidable topic. At the beginning of this chapter, we introduced
concepts such as state, action, and reward in reinforcement learning.
The data used for training in real-world applications comes from the
data samples stored in the experience buffer, and the operations
performed on the data may vary depending on the data format. Common data
operations include concatenation, truncation, product, transposition,
partial product, and taking the mean or extreme values. These operations
may be performed on different dimensions of the data, which poses a
challenge for existing reinforcement learning frameworks. To allow data
of different formats to be used flexibly in reinforcement learning
training, Reverb introduces the concept of a *chunk*: all data used for
training is stored as chunks in the buffer for management and
scheduling. This design takes advantage of the data being
multidimensional tensors and makes data usage faster and more flexible.
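
To give a feel for how Reverb is used in practice, the minimal sketch
below follows the Server/Table/Client interface from Reverb's
documentation; the table name, the fields stored in each item, and the
sizes are arbitrary choices for illustration. It starts a replay server
with a uniformly sampled, FIFO-evicted table, inserts one transition,
and samples it back.

```python
import numpy as np
import reverb

# Start a replay server with one table: uniform sampling, FIFO eviction.
server = reverb.Server(tables=[
    reverb.Table(
        name="replay_buffer",
        sampler=reverb.selectors.Uniform(),
        remover=reverb.selectors.Fifo(),
        max_size=100_000,
        rate_limiter=reverb.rate_limiters.MinSize(1),
    )
])

client = reverb.Client(f"localhost:{server.port}")

# Insert a single (observation, action, reward) transition with priority 1.0.
obs, action, reward = np.zeros(4, dtype=np.float32), 1, 0.5
client.insert([obs, action, reward], priorities={"replay_buffer": 1.0})

# Sample one item back from the table.
for sample in client.sample("replay_buffer", num_samples=1):
    print(sample)
```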

DeepMind recently proposed a distributed reinforcement learning
framework called Acme, which is also designed for academic research and
industrial applications. It provides a faster distributed reinforcement
learning solution based on a distributed sampling structure and Reverb's
sample buffer management. Reverb solves the efficiency problem of data
management and transfer, allowing Acme to fully leverage the efficiency
made possible by distributed computing. Researchers have used Acme to
achieve significant speed gains in many reinforcement learning benchmark
tests.
