Releases: kengz/SLM-Lab
upgrade plotly, replace orca with kaleido
What's Changed
Full Changelog: v4.2.3...v4.2.4
fix GPU installation and assignment issue
What's Changed
- Added Algorithms config files for VideoPinball-v0 game by @dd-iuonac in #488
- fix build for new RTX GPUs by @kengz and @Karl-Grantham in #496
- remove the `reinforce_pong.json` spec to prevent confusion in #499
New Contributors
- @dd-iuonac made their first contribution in #488
- @Karl-Grantham for help with debugging #496
Full Changelog: v4.2.2...v4.2.3
Improve Installation / Colab notebook
Improve Installation Stability
🙌 Thanks to @Nickfagiano for help with debugging.
- #487 update installation to work with macOS Big Sur
- #487 improve setup with a Conda path guard
- #487 lock `atari-py` to version 0.2.6 for safety
Google Colab/Jupyter
🙌 Thanks to @piosif97 for helping.
Windows setup
🙌 Thanks to @vladimirnitu and @steindaian for providing the PDF.
Update installation
Dependencies and systems around SLM Lab have changed and caused some breakages. This release fixes these installation issues.
- #461, #476 update to `homebrew/cask` (thanks @ben-e, @amjadmajid)
- #463 add pybullet to dependencies (thanks @rafapi)
- #483 fix missing install command in Arch Linux setup (thanks @sebimarkgraf)
- #485 update GitHub Actions CI to v2
- #485 fix demo spec to use strict json
Resume mode, Plotly and PyTorch update, OnPolicyCrossEntropy memory
Resume mode
- #455 adds the `train@` resume mode and refactors the `enjoy` mode. See the PR for detailed info.
`train@` usage example
Specify the train mode as `train@{predir}`, where `{predir}` is the data directory of the last training run, or simply use `latest` to use the latest run, e.g.:
```
python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole train
# terminate run before its completion
# optionally edit the spec file in a past-future-consistent manner
# run resume with either of the commands:
python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole train@latest
# or to use a specific run folder
python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole train@data/reinforce_cartpole_2020_04_13_232521
```
`enjoy` mode refactor
The `train@` resume mode API allows the `enjoy` mode to be refactored; both share a similar syntax. Continuing with the example above, to enjoy a trained model, we now use:
```
python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole enjoy@data/reinforce_cartpole_2020_04_13_232521/reinforce_cartpole_t0_s0_spec.json
```
Plotly and PyTorch update
- #453 updates Plotly to 4.5.4 and PyTorch to 1.3.1.
- #454 explicitly shuts down Plotly orca server after plotting to prevent zombie processes
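For reference, here is a minimal sketch of the kind of cleanup #454 describes, assuming Plotly 4.x's `plotly.io.orca.shutdown_server()`; the actual helper name and call site in SLM Lab may differ:

```python
import plotly.graph_objects as go
import plotly.io as pio

def save_figure(fig, filepath):
    """Write a static image via orca, then shut the orca server down.

    Shutting the server down after plotting prevents zombie orca
    processes from accumulating across many sessions.
    """
    try:
        pio.write_image(fig, filepath)  # uses the orca server under the hood
    finally:
        pio.orca.shutdown_server()  # terminate the background orca process

fig = go.Figure(data=go.Scatter(y=[1, 3, 2]))
save_figure(fig, 'trial_graph.png')
```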
PPO batch size optimization
- #453 adds chunking to allow PPO to run with larger batch sizes by breaking up the forward loop.
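As an illustration of the chunking idea (not SLM Lab's actual API; the helper name and chunk size below are made up), the forward pass over a large rollout batch can be split into slices whose outputs are concatenated:

```python
import torch

def chunked_forward(net, batch, chunk_size=2048):
    """Run net over a large batch in smaller slices to bound memory use."""
    outputs = []
    for start in range(0, batch.size(0), chunk_size):
        outputs.append(net(batch[start:start + chunk_size]))
    return torch.cat(outputs, dim=0)

# e.g. computing action logits for a big PPO rollout without one giant forward pass:
# logits = chunked_forward(policy_net, states, chunk_size=2048)
```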
New OnPolicyCrossEntropy memory
Discrete SAC benchmark update
Env. \ Alg. | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO | SAC |
---|---|---|---|---|---|---|
Breakout | 80.88 | 182 | 377 | 398 | 443 | 3.51* |
Pong | 18.48 | 20.5 | 19.31 | 19.56 | 20.58 | 19.87* |
Seaquest | 1185 | 4405 | 1070 | 1684 | 1715 | 171* |
Qbert | 5494 | 11426 | 12405 | 13590 | 13460 | 923* |
LunarLander | 192 | 233 | 25.21 | 68.23 | 214 | 276 |
UnityHallway | -0.32 | 0.27 | 0.08 | -0.96 | 0.73 | 0.01 |
UnityPushBlock | 4.88 | 4.93 | 4.68 | 4.93 | 4.97 | -0.70 |
Episode score at the end of training attained by SLM Lab implementations on discrete-action control problems. Reported episode scores are the average over the last 100 checkpoints, and then averaged over 4 Sessions. A Random baseline with score averaged over 100 episodes is included. Results marked with `*` were trained using the hybrid synchronous/asynchronous version of SAC to parallelize and speed up training time. For SAC, Breakout, Pong and Seaquest were trained for 2M frames instead of 10M frames.
For the full Atari benchmark, see Atari Benchmark
RAdam+Lookahead optim, TensorBoard, Full Benchmark Upload
This marks a stable release of SLM Lab with full benchmark results.
RAdam+Lookahead optimizer
- The Lookahead + RAdam optimizer significantly improves the performance of some RL algorithms (A2C (n-step), PPO) on continuous-domain problems, but does not improve others (A2C (GAE), SAC). #416
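For intuition, here is a minimal sketch of the Lookahead mechanism wrapping a base optimizer. This is illustrative only, not SLM Lab's optimizer class; an RAdam implementation would stand in for Adam where available.

```python
import torch

class Lookahead:
    """Minimal Lookahead wrapper (illustrative, not SLM Lab's implementation).

    The base optimizer (e.g. Adam, or RAdam where available) updates the
    "fast" weights; every k steps the "slow" weights move a fraction alpha
    toward the fast weights, and the fast weights are reset to them.
    """

    def __init__(self, base_optimizer, k=5, alpha=0.5):
        self.optimizer = base_optimizer
        self.k, self.alpha, self.steps = k, alpha, 0
        self.slow_weights = [
            [p.clone().detach() for p in group['params']]
            for group in base_optimizer.param_groups
        ]

    def zero_grad(self):
        self.optimizer.zero_grad()

    def step(self):
        self.optimizer.step()  # fast-weight update
        self.steps += 1
        if self.steps % self.k == 0:
            for group, slow_group in zip(self.optimizer.param_groups, self.slow_weights):
                for fast, slow in zip(group['params'], slow_group):
                    slow.add_(self.alpha * (fast.detach() - slow))
                    fast.data.copy_(slow)

# usage sketch:
# net = torch.nn.Linear(4, 2)
# optim = Lookahead(torch.optim.Adam(net.parameters(), lr=1e-3), k=5, alpha=0.5)
```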
TensorBoard
- Add TensorBoard to the agent body to auto-log summary variables, the graph, network parameter histograms, and the action histogram. To launch TensorBoard, run `tensorboard --logdir=data` after a session/trial is completed.
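Conceptually, the auto-logging is similar to writing scalars with PyTorch's `SummaryWriter` into the `data` directory, as in this hedged sketch (the log directory and variable here are illustrative, not what SLM Lab writes):

```python
from torch.utils.tensorboard import SummaryWriter

# illustrative only: SLM Lab's Body wires this up automatically
writer = SummaryWriter(log_dir='data/tb_demo')
for frame in range(0, 10000, 1000):
    mean_return = frame / 1000.0  # placeholder for a real summary variable
    writer.add_scalar('mean_return', mean_return, global_step=frame)
writer.close()
# then run: tensorboard --logdir=data
```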
Full Benchmark Upload
Plot Legend
Discrete Benchmark
Env. \ Alg. | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO | SAC |
---|---|---|---|---|---|---|
Breakout | 80.88 | 182 | 377 | 398 | 443 | - |
Pong | 18.48 | 20.5 | 19.31 | 19.56 | 20.58 | 19.87* |
Seaquest | 1185 | 4405 | 1070 | 1684 | 1715 | - |
Qbert | 5494 | 11426 | 12405 | 13590 | 13460 | 214* |
LunarLander | 192 | 233 | 25.21 | 68.23 | 214 | 276 |
UnityHallway | -0.32 | 0.27 | 0.08 | -0.96 | 0.73 | - |
UnityPushBlock | 4.88 | 4.93 | 4.68 | 4.93 | 4.97 | - |
Episode score at the end of training attained by SLM Lab implementations on discrete-action control problems. Reported episode scores are the average over the last 100 checkpoints, and then averaged over 4 Sessions. Results marked with `*` were trained using the hybrid synchronous/asynchronous version of SAC to parallelize and speed up training time.
For the full Atari benchmark, see Atari Benchmark
Continuous Benchmark
Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | SAC |
---|---|---|---|---|
RoboschoolAnt | 787 | 1396 | 1843 | 2915 |
RoboschoolAtlasForwardWalk | 59.87 | 88.04 | 172 | 800 |
RoboschoolHalfCheetah | 712 | 439 | 1960 | 2497 |
RoboschoolHopper | 710 | 285 | 2042 | 2045 |
RoboschoolInvertedDoublePendulum | 996 | 4410 | 8076 | 8085 |
RoboschoolInvertedPendulum | 995 | 978 | 986 | 941 |
RoboschoolReacher | 12.9 | 10.16 | 19.51 | 19.99 |
RoboschoolWalker2d | 280 | 220 | 1660 | 1894 |
RoboschoolHumanoid | 99.31 | 54.58 | 2388 | 2621* |
RoboschoolHumanoidFlagrun | 73.57 | 178 | 2014 | 2056* |
RoboschoolHumanoidFlagrunHarder | -429 | 253 | 680 | 280* |
Unity3DBall | 33.48 | 53.46 | 78.24 | 98.44 |
Unity3DBallHard | 62.92 | 71.92 | 91.41 | 97.06 |
Episode score at the end of training attained by SLM Lab implementations on continuous control problems. Reported episode scores are the average over the last 100 checkpoints, and then averaged over 4 Sessions. Results marked with `*` require 50M-100M frames, so we use the hybrid synchronous/asynchronous version of SAC to parallelize and speed up training time.
Atari Benchmark
- Upload PR #427
- Dropbox data: DQN
- Dropbox data: DDQN+PER
- Dropbox data: A2C (GAE)
- Dropbox data: A2C (n-step)
- Dropbox data: PPO
- Dropbox data: all Atari graphs
Env. \ Alg. | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO |
---|---|---|---|---|---|
Adventure | -0.94 | -0.92 | -0.77 | -0.85 | -0.3 |
AirRaid | 1876 | 3974 | 4202 | 3557 | 4028 |
Alien | 822 | 1574 | 1519 | 1627 | 1413 |
Amidar | 90.95 | 431 | 577 | 418 | 795 |
Assault | 1392 | 2567 | 3366 | 3312 | 3619 |
Asterix | 1253 | 6866 | 5559 | 5223 | 6132 |
Asteroids | 439 | 426 | 2951 | 2147 | 2186 |
Atlantis | 68679 | 644810 | 2747371 | 2259733 | 2148077 |
BankHeist | 131 | 623 | 855 | 1170 | 1183 |
BattleZone | 6564 | 6395 | 4336 | 4533 | 13649 |
v4.0.1: Soft Actor-Critic
This release adds a new algorithm: Soft Actor-Critic (SAC).
Soft Actor-Critic
- implement the original paper: "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" https://arxiv.org/abs/1801.01290 #398
- implement the improvements from the follow-up paper: "Soft Actor-Critic Algorithms and Applications" https://arxiv.org/abs/1812.05905 #399
- extend SAC to work directly on discrete environments using a custom `GumbelSoftmax` distribution
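To illustrate the discrete-action idea (this is not SLM Lab's custom `GumbelSoftmax` class), PyTorch's built-in `gumbel_softmax` yields a differentiable, approximately one-hot action sample from the policy logits, so the reparameterized actor update carries over to discrete actions:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 6, requires_grad=True)  # e.g. policy logits over 6 discrete actions

# soft, differentiable relaxation of a categorical sample
soft_sample = F.gumbel_softmax(logits, tau=1.0, hard=False)
# straight-through version: one-hot in the forward pass, soft gradients in the backward pass
hard_sample = F.gumbel_softmax(logits, tau=1.0, hard=True)
action = hard_sample.argmax(dim=-1)  # integer action to send to the env

# a log-prob for the entropy term can come from the underlying categorical
log_prob = F.log_softmax(logits, dim=-1).gather(-1, action.unsqueeze(-1))
```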
Roboschool (continuous control) Benchmark
Note that the Roboschool reward scales are different from MuJoCo's.
Env. \ Alg. | SAC |
---|---|
RoboschoolAnt | 2451.55 |
RoboschoolHalfCheetah | 2004.27 |
RoboschoolHopper | 2090.52 |
RoboschoolWalker2d | 1711.92 |
LunarLander (discrete control) Benchmark
Trial graph | Moving average |
v4.0.0: Algorithm Benchmark, Analysis, API simplification
This release corrects and optimizes all the algorithms based on Atari benchmarking. New metrics are introduced, and the lab's API is also redesigned for simplicity.
Benchmark
- full algorithm benchmark on 4 core Atari environments #396
- LunarLander benchmark #388 and BipedalWalker benchmark #377
This benchmark table is pulled from PR #396. See the full benchmark results here.
Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | DQN | DDQN+PER |
---|---|---|---|---|---|
Breakout | 389.99 | 391.32 | 425.89 | 65.04 | 181.72 |
Pong | 20.04 | 19.66 | 20.09 | 18.34 | 20.44 |
Qbert | 13,328.32 | 13,259.19 | 13,691.89 | 4,787.79 | 11,673.52 |
Seaquest | 892.68 | 1,686.08 | 1,583.04 | 1,118.50 | 3,751.34 |
Algorithms
- correct and optimize all algorithms with benchmarking #315 #327 #328 #361
- introduce "shared" and "synced" Hogwild modes for distributed training #337 #340
- streamline and optimize agent components too
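The sketch below shows only the general Hogwild pattern that the "shared" mode builds on (lock-free workers updating shared-memory parameters); SLM Lab's actual distributed session code is more involved, and the network and worker here are made up:

```python
import torch
import torch.multiprocessing as mp

def worker(shared_net, steps=100):
    """Each worker computes gradients locally and applies them to the shared params."""
    optimizer = torch.optim.SGD(shared_net.parameters(), lr=1e-2)
    for _ in range(steps):
        x = torch.randn(8, 4)
        loss = shared_net(x).pow(2).mean()  # stand-in for an RL loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # lock-free update on the shared parameters

if __name__ == '__main__':
    net = torch.nn.Linear(4, 2)
    net.share_memory()  # "shared" mode: all workers update the same parameters
    procs = [mp.Process(target=worker, args=(net,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```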
The full list of algorithms is now:
- SARSA
- DQN, distributed-DQN
- Double-DQN, Dueling-DQN, PER-DQN
- REINFORCE
- A2C, A3C (N-step & GAE)
- PPO, distributed-PPO
- SIL (A2C, PPO)
All the algorithms can also be run in distributed mode; in some cases the distributed versions have their own names (mentioned above).
Environments
- implement vector environments #302
- implement more environment wrappers for preprocessing. Some replay memories are retired. #303 #330 #331 #342
- make Lab Env wrapper interface identical to gym #304, #305, #306, #307
API
- retire all the Space objects (AgentSpace, EnvSpace, AEBSpace, InfoSpace) in favor of a much simpler interface #335 #348
- major API simplification throughout
Analysis
- rework analysis, introduce new metrics: strength, sample efficiency, training efficiency, stability, consistency #347 #349
- fast evaluation using a vectorized env for `rigorous_eval` #390, and using inference for fast eval #391
Search
Improve installation, functions, add out layer activation
Improve installation
- #288 split out yarn installation as extra step
Improve functions
- #283 #284 redesign fitness slightly
- #281 simplify PER sample index
- #287 #290 improve the DQN polyak update and network switching (a soft-update sketch follows this list)
- #291 refactor advantage functions
- #295 #296 refactor various utils, fix PyTorch inplace ops
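A minimal sketch of a polyak (soft) target-network update, for reference; the function and coefficient names are illustrative, not SLM Lab's exact API:

```python
import torch

def polyak_update(src_net, tgt_net, beta=0.995):
    """Soft-update target network params: tgt <- beta * tgt + (1 - beta) * src."""
    with torch.no_grad():
        for src, tgt in zip(src_net.parameters(), tgt_net.parameters()):
            tgt.mul_(beta).add_((1.0 - beta) * src)

# usage sketch: polyak_update(q_net, target_q_net, beta=0.995)
```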
Add out layer activation
- #300 add out layer activation