Hi, thanks very much for sharing this inspiring work!
I tried to reproduce the reported results by running the following scripts:
python train_rl_il.py --env CartPole --algo ppo --steps 1000000
python collect_demo.py --env CartPole --agent-path ./logs/CartPole/ppo/seed0_20240808225352
python train_clf.py --env CartPole --buffer ./demos/CartPole/size20000_std0.1_bias0.1_reward11.pkl
However, the rewards I obtained are significantly lower than those reported in the paper. For example, here are my training logs for the CartPole environment:
python train_rl_il.py --env CartPole --algo ppo --steps 1000000
> Training with cuda
> No pre-training
> Training PPO...
Num steps: 20000 , Return: 27.9 , Min/Max Return: 20.4 /31.7
Num steps: 40000 , Return: 38.3 , Min/Max Return: 27.3 /52.4
Num steps: 60000 , Return: 30.4 , Min/Max Return: 25.3 /36.7
Num steps: 80000 , Return: 37.1 , Min/Max Return: 29.6 /45.1
Num steps: 100000, Return: 19.2 , Min/Max Return: 18.2 /20.8
Num steps: 120000, Return: 21.7 , Min/Max Return: 19.2 /26.5
Num steps: 140000, Return: 29.6 , Min/Max Return: 24.2 /39.9
Num steps: 160000, Return: 38.9 , Min/Max Return: 31.3 /48.5
Num steps: 180000, Return: 30.1 , Min/Max Return: 26.5 /34.3
Num steps: 200000, Return: 51.7 , Min/Max Return: 32.7 /94.9
Num steps: 220000, Return: 43.5 , Min/Max Return: 37.8 /54.0
Num steps: 240000, Return: 31.6 , Min/Max Return: 28.3 /33.9
Num steps: 260000, Return: 36.2 , Min/Max Return: 29.2 /42.1
Num steps: 280000, Return: 30.3 , Min/Max Return: 27.9 /35.6
Num steps: 300000, Return: 31.1 , Min/Max Return: 27.5 /36.1
Num steps: 320000, Return: 19.7 , Min/Max Return: 18.6 /20.9
Num steps: 340000, Return: 46.6 , Min/Max Return: 34.7 /64.5
Num steps: 360000, Return: 40.5 , Min/Max Return: 30.2 /53.2
Num steps: 380000, Return: 28.8 , Min/Max Return: 24.9 /32.1
Num steps: 400000, Return: 48.6 , Min/Max Return: 30.7 /76.2
Num steps: 420000, Return: 55.4 , Min/Max Return: 43.2 /78.2
Num steps: 440000, Return: 31.7 , Min/Max Return: 29.0 /35.4
Num steps: 460000, Return: 39.0 , Min/Max Return: 30.2 /43.9
Num steps: 480000, Return: 34.8 , Min/Max Return: 29.2 /41.2
Num steps: 500000, Return: 39.7 , Min/Max Return: 34.2 /44.9
Num steps: 520000, Return: 50.4 , Min/Max Return: 40.1 /64.2
Num steps: 540000, Return: 34.2 , Min/Max Return: 27.9 /43.6
Num steps: 560000, Return: 35.3 , Min/Max Return: 29.2 /43.6
Num steps: 580000, Return: 43.1 , Min/Max Return: 36.1 /46.3
Num steps: 600000, Return: 41.8 , Min/Max Return: 34.1 /49.6
Num steps: 620000, Return: 46.7 , Min/Max Return: 34.6 /58.7
Num steps: 640000, Return: 38.3 , Min/Max Return: 30.2 /41.1
Num steps: 660000, Return: 46.7 , Min/Max Return: 39.5 /56.8
Num steps: 680000, Return: 41.1 , Min/Max Return: 32.2 /54.3
Num steps: 700000, Return: 40.6 , Min/Max Return: 31.7 /53.0
Num steps: 720000, Return: 41.5 , Min/Max Return: 30.8 /56.2
Num steps: 740000, Return: 35.5 , Min/Max Return: 30.2 /44.1
Num steps: 760000, Return: 49.5 , Min/Max Return: 38.3 /69.2
Num steps: 780000, Return: 55.8 , Min/Max Return: 42.3 /80.5
Num steps: 800000, Return: 59.2 , Min/Max Return: 34.4 /109.7
Num steps: 820000, Return: 40.1 , Min/Max Return: 34.6 /54.1
Num steps: 840000, Return: 41.6 , Min/Max Return: 28.8 /53.7
Num steps: 860000, Return: 40.9 , Min/Max Return: 31.1 /56.4
Num steps: 880000, Return: 45.9 , Min/Max Return: 38.9 /50.8
Num steps: 900000, Return: 50.6 , Min/Max Return: 33.9 /108.5
Num steps: 920000, Return: 56.2 , Min/Max Return: 35.7 /123.4
Num steps: 940000, Return: 31.3 , Min/Max Return: 27.7 /35.2
Num steps: 960000, Return: 50.9 , Min/Max Return: 35.4 /63.1
Num steps: 980000, Return: 38.2 , Min/Max Return: 31.3 /58.3
Num steps: 1000000, Return: 45.5 , Min/Max Return: 35.2 /76.3
100%|████████████████████████████████| 1000000/1000000 [42:34<00:00, 391.46it/s]
> Done in 2555s
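For reference, this is how I summarize the return curve from the log above. It is only a minimal sketch that assumes the console output was saved verbatim to a text file (ppo_train.log is a hypothetical name); the regular expression relies solely on the "Num steps: ..., Return: ..." format shown above.

```python
import re

# Hypothetical filename: the PPO console output above, saved verbatim to disk.
LOG_PATH = "ppo_train.log"

# Matches lines like "Num steps: 20000 , Return: 27.9 , Min/Max Return: 20.4 /31.7"
pattern = re.compile(r"Num steps:\s*(\d+)\s*,\s*Return:\s*([\d.]+)")

steps, returns = [], []
with open(LOG_PATH) as f:
    for line in f:
        m = pattern.search(line)
        if m:
            steps.append(int(m.group(1)))
            returns.append(float(m.group(2)))

best = max(returns)
print(f"final return @ {steps[-1]} steps: {returns[-1]:.1f}")
print(f"best return: {best:.1f} @ {steps[returns.index(best)]} steps")
print(f"mean of last 10 evaluations: {sum(returns[-10:]) / len(returns[-10:]):.1f}")
```

With the log above, this shows the return plateauing around 40-50 over the last evaluations rather than continuing to improve.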
python collect_demo.py --env CartPole --agent-path ./logs/CartPole/ppo/seed0_20240808225352
epi: 0, reward: 11, steps: 20
epi: 1, reward: 21, steps: 73
epi: 2, reward: 8, steps: 33
epi: 3, reward: 10, steps: 50
epi: 4, reward: 17, steps: 27
epi: 5, reward: 6, steps: 15
epi: 6, reward: 22, steps: 52
epi: 7, reward: 10, steps: 24
epi: 8, reward: 11, steps: 18
epi: 9, reward: 3, steps: 9
epi: 10, reward: 17, steps: 35
epi: 11, reward: 15, steps: 32
epi: 12, reward: 10, steps: 18
epi: 13, reward: 7, steps: 17
epi: 14, reward: 4, steps: 11
epi: 15, reward: 14, steps: 30
epi: 16, reward: 11, steps: 21
epi: 17, reward: 8, steps: 17
epi: 18, reward: 12, steps: 24
epi: 19, reward: 10, steps: 56
epi: 20, reward: 13, steps: 28
epi: 21, reward: 6, steps: 12
epi: 22, reward: 14, steps: 26
epi: 23, reward: 11, steps: 17
epi: 24, reward: 12, steps: 23
epi: 25, reward: 18, steps: 39
epi: 26, reward: 11, steps: 25
epi: 27, reward: 8, steps: 52
epi: 28, reward: 17, steps: 32
...
epi: 666, reward: 19, steps: 73
epi: 667, reward: 9, steps: 19
epi: 668, reward: 6, steps: 12
epi: 669, reward: 4, steps: 9
epi: 670, reward: 11, steps: 65
epi: 671, reward: 9, steps: 19
epi: 672, reward: 11, steps: 29
epi: 673, reward: 7, steps: 15
epi: 674, reward: 13, steps: 25
epi: 675, reward: 6, steps: 13
epi: 676, reward: 12, steps: 37
epi: 677, reward: 14, steps: 24
epi: 678, reward: 9, steps: 30
epi: 679, reward: 17, steps: 35
epi: 680, reward: 11, steps: 22
epi: 681, reward: 15, steps: 46
epi: 682, reward: 6, steps: 34
epi: 683, reward: 24, steps: 45
epi: 684, reward: 18, steps: 52
epi: 685, reward: 14, steps: 27
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20000/20000 [01:24<00:00, 236.97it/s]
> buffer saved, mean reward: 11.35, std: 6.36
> video saved
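To rule out a problem with the demonstrations themselves, I also inspect the saved buffer. Since I am not sure of the exact structure collect_demo.py pickles, the sketch below is intentionally format-agnostic (only the path, printed above, is taken from the run):

```python
import pickle

import numpy as np

# Path as produced by collect_demo.py above.
BUFFER_PATH = "./demos/CartPole/size20000_std0.1_bias0.1_reward11.pkl"

with open(BUFFER_PATH, "rb") as f:
    buffer = pickle.load(f)

print(type(buffer))
if isinstance(buffer, dict):
    # If the buffer is a dict of arrays/tensors, report each field's shape.
    for key, value in buffer.items():
        shape = np.shape(value) if hasattr(value, "__len__") else value
        print(f"{key}: {type(value).__name__}, shape or value = {shape}")
else:
    # Otherwise just list the public attributes of the buffer object.
    print([name for name in dir(buffer) if not name.startswith("_")])
```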
python train_clf.py --env CartPole --buffer ./demos/CartPole/size20000_std0.1_bias0.1_reward11.pkl
> Training with cuda
> Using tuned hyper-parameters
> Training BC policy...
iter: 0, loss: 5.90e-01
iter: 500, loss: 4.12e-01
iter: 1000, loss: 3.98e-01
iter: 1500, loss: 3.48e-01
iter: 2000, loss: 3.62e-01
iter: 2500, loss: 4.65e-01
iter: 3000, loss: 3.31e-01
iter: 3500, loss: 3.60e-01
iter: 4000, loss: 3.94e-01
iter: 4500, loss: 3.55e-01
iter: 4999, loss: 3.23e-01
100%|██████████████████████████████████████| 5000/5000 [00:34<00:00, 146.67it/s]
> Done
> Evaluating BC policy
average reward: 20.88, average length: 28.40
> Done
> ----------- Outer iter: 1 -----------
> Training environment model...
iter: 0, loss: 1.11e+01
iter: 500, loss: 3.52e-01
iter: 1000, loss: 1.24e+00
iter: 1500, loss: 4.83e-01
...
> ----------- Outer iter: 2 -----------
> Training environment model...
iter: 0, loss: 2.45e-01
iter: 500, loss: 9.09e-02
iter: 1000, loss: 1.77e-01
iter: 1500, loss: 9.78e-02
iter: 2000, loss: 2.33e-01
iter: 2500, loss: 2.25e-01
iter: 3000, loss: 8.33e-02
iter: 3500, loss: 1.53e-01
iter: 4000, loss: 1.63e-01
iter: 4500, loss: 9.96e-02
iter: 4999, loss: 1.57e-01
100%|██████████████████████████████████████| 5000/5000 [00:28<00:00, 174.13it/s]
> Done
> Training CLF controller...
iter: 200, loss: 3.63e+01
iter: 400, loss: 2.67e+01
iter: 600, loss: 2.71e+01
iter: 800, loss: 2.68e+01
iter: 1000, loss: 2.09e+01
iter: 1200, loss: 1.99e+01
iter: 1400, loss: 1.89e+01
iter: 1600, loss: 1.81e+01
iter: 1800, loss: 2.69e+01
iter: 1999, loss: 2.06e+01
100%|███████████████████████████████████████| 1999/1999 [09:43<00:00, 3.42it/s]
> Evaluating policy...
average reward: 26.57, average length: 38.80
> Done
> Collecting demos...
mean reward: 17.36
> Done
> Final demo saved
> Iter time: 637s, total time: 1288s
...
> ----------- Outer iter: 48 -----------
> Training environment model...
iter: 0, loss: 5.08e-03
iter: 500, loss: 2.13e-03
iter: 1000, loss: 3.25e-01
iter: 1500, loss: 7.04e-03
iter: 2000, loss: 1.06e-02
iter: 2500, loss: 1.67e-03
iter: 3000, loss: 5.57e-02
iter: 3500, loss: 2.24e-03
iter: 4000, loss: 2.24e-03
iter: 4500, loss: 1.56e-01
iter: 4999, loss: 9.13e-03
100%|██████████████████████████████████████| 5000/5000 [00:23<00:00, 211.62it/s]
> Done
> Training CLF controller...
iter: 200, loss: 5.55e+00
iter: 400, loss: 5.22e+00
iter: 600, loss: 6.63e+00
iter: 800, loss: 5.17e+00
iter: 1000, loss: 5.09e+00
iter: 1200, loss: 4.63e+00
iter: 1400, loss: 8.60e+00
iter: 1600, loss: 5.35e+00
iter: 1800, loss: 4.95e+00
iter: 1999, loss: 5.37e+00
100%|███████████████████████████████████████| 1999/1999 [01:47<00:00, 18.52it/s]
> Evaluating policy...
average reward: 69.63, average length: 97.40
> Done
> Collecting demos...
mean reward: 34.59
> Done
> Final demo saved
> Iter time: 146s, total time: 10898s
> ----------- Outer iter: 49 -----------
> Training environment model...
iter: 0, loss: 7.01e-02
iter: 500, loss: 9.32e-02
iter: 1000, loss: 5.55e-02
iter: 1500, loss: 7.36e-03
iter: 2000, loss: 1.15e-02
iter: 2500, loss: 1.96e-03
iter: 3000, loss: 1.28e-02
iter: 3500, loss: 3.23e-03
iter: 4000, loss: 2.97e-03
iter: 4500, loss: 1.22e-03
iter: 4999, loss: 3.33e-03
100%|██████████████████████████████████████| 5000/5000 [00:24<00:00, 207.89it/s]
> Done
> Training CLF controller...
iter: 200, loss: 5.38e+00
iter: 400, loss: 6.41e+00
iter: 600, loss: 5.36e+00
iter: 800, loss: 5.71e+00
iter: 1000, loss: 5.38e+00
iter: 1200, loss: 4.79e+00
iter: 1400, loss: 5.99e+00
iter: 1600, loss: 5.13e+00
iter: 1800, loss: 4.94e+00
iter: 1999, loss: 5.89e+00
100%|███████████████████████████████████████| 1999/1999 [01:47<00:00, 18.66it/s]
> Evaluating policy...
average reward: 74.99, average length: 101.80
> Done
> Collecting demos...
mean reward: 34.93
> Done
> Final demo saved
> Iter time: 145s, total time: 11043s
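Analogously to the PPO summary above, I track the LYGE policy's progress by pulling the evaluation rewards out of the train_clf output (again assuming the console output was saved to a text file; train_clf.log is a hypothetical name):

```python
import re

# Hypothetical filename: the train_clf console output above, saved verbatim.
LOG_PATH = "train_clf.log"

# One "average reward: ..." line follows the BC evaluation and each
# "> Evaluating policy..." step (one per outer iteration).
pattern = re.compile(r"average reward:\s*([\d.]+)")

with open(LOG_PATH) as f:
    rewards = [float(m.group(1)) for m in pattern.finditer(f.read())]

print(f"BC policy eval:       {rewards[0]:.2f}")
print(f"last outer iteration: {rewards[-1]:.2f}")
print(f"best over all iters:  {max(rewards):.2f}")
```

For the run above this gives roughly 20.88 for the BC policy and around 75 after 49 outer iterations.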
For both the PPO and LYGE policies, the returns I obtain are substantially lower than those reported in Figure 5 of the paper.
For the first step (training the initial PPO policy), I have also experimented with other environments, such as InvertedPendulum, F16GCAS, and NeuralLander, each for 1 million steps. In all of them, the rewards were notably below the values presented in Figure 5 of the paper (InvertedPendulum: approximately 100-200, F16GCAS: approximately 100, NeuralLander: approximately 2000).
Could you point out any missteps I may have made, or any misinterpretation of the paper or code on my part? Thanks very much for your attention and time!