AtariWrapper does not use recommended defaults #635

Open · RyanNavillus opened this issue Oct 27, 2021 · 24 comments
Labels: enhancement (New feature or request)

@RyanNavillus

The current AtariWrapper by default has terminate_on_life_loss set to True. This goes against the recommendations of Revisiting the Arcade Learning Environment (https://arxiv.org/pdf/1709.06009.pdf). I believe this should be set to False by default. They also recommend using sticky actions instead of noop resets, but I think that problem is outside the scope of this wrapper.
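
For reference, the default can already be overridden per environment; a minimal sketch (note that recent SB3 releases spell the keyword terminal_on_life_loss, so check the installed version):

import gym

from stable_baselines3.common.atari_wrappers import AtariWrapper

# Sketch: only end episodes on game over, per Machado et al. (2018).
# The keyword name may differ between SB3 versions.
env = gym.make("BreakoutNoFrameskip-v4")
env = AtariWrapper(env, terminal_on_life_loss=False)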

@Miffyli
Collaborator

Miffyli commented Oct 27, 2021

That is set to True by default to follow the original baselines implementation, which had terminate_on_life_loss enabled here. While I agree it would have been better not to use it (to reflect the "real" end of the game as intended by the game developers, and to pick the simpler of the two options), changing it at this point would cause major hidden changes in results. @araffin ?

@RyanNavillus
Author

RyanNavillus commented Oct 27, 2021

I just realized that I should have put this issue in the actual Stable-Baselines3 repo, but I guess it's relevant here as well. I definitely understand the trade-off between using newer recommendations and preserving fair comparisons to previous work.

@jkterry1
Contributor

My concern, and I'm sure @JesseFarebro (the maintainer of the ALE) would agree, is that the default settings for the Gym Atari environments were never carefully chosen to begin with, and that people doing future work with them should use what have been the recommended practices for years. This actually caused an issue for us when working with Atari games for an ICML paper, which is why Ryan created this issue.

@Miffyli
Collaborator

Miffyli commented Oct 27, 2021

Right, I totally agree with the point :). We could consider changing the default setting in the zoo and SB3, and leave a big warning there to indicate this change. It would be bad if a popular library hindered progress just by sticking to "old stuff" for the weak reason of "that's what has been done before".

@araffin
Member

araffin commented Oct 28, 2021

Hello,

This goes against the recommendations of Revisiting the Arcade Learning Environment (https://arxiv.org/pdf/1709.06009.pdf).

Yes, I'm aware of that.
We kept it to be able to compare results against SB2.
Looking at the actual paragraph (see below) and going by my personal experience, this does not affect results much (your experience is probably different, otherwise there would not be an issue).

But we should at least update the doc and put a warning there with the recommended settings (this could be a flag in make_atari_env, deactivated at first but with a warning, and then activated in a future version of SB3); see the sketch after the quoted paragraph below.

Episode termination. In the initial ALE benchmark results (Bellemare et al., 2013),
episodes terminate when the game is over. However, in some games the player has a number
of “lives” which are lost one at a time. Terminating only when the game is over often makes
it difficult for agents to learn the significance of losing a life. Mnih et al. (2015) terminated
training episodes when the agent lost a life, rather than when the game is over (evaluation
episodes still lasted for the entire game). While this approach has the potential to teach an
agent to avoid “death,” Bellemare et al. (2016b) noted that it can in fact be detrimental
to an agent’s performance. Currently, both approaches are still common in the literature.
We often see episodes terminating when the game is over (e.g., Hausknecht et al., 2014;
Liang et al., 2016; Lipovetzky et al., 2015; Martin et al., 2017), as well as when the agent
loses a life (e.g., Nair et al., 2015; Schaul et al. 2016; van Hasselt et al., 2016). Considering
the ideal of minimizing the use of game-specific information and the questionable utility of
termination using the “lives” signal, we recommend that only the game over signal be used
for termination.
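
For illustration, the flag idea mentioned above could look like the following sketch (make_atari_env_with_flag is a hypothetical helper, not SB3 API; it relies on make_atari_env's existing wrapper_kwargs argument and on the terminal_on_life_loss keyword of AtariWrapper):

import warnings

from stable_baselines3.common.env_util import make_atari_env

# Hypothetical helper sketching the proposal above; not part of SB3.
def make_atari_env_with_flag(env_id, recommended_defaults=False, **kwargs):
    if not recommended_defaults:
        warnings.warn(
            "Using legacy Atari defaults (episodes end on life loss). "
            "Pass recommended_defaults=True for the Machado et al. (2018) settings; "
            "this may become the default in a future version."
        )
        return make_atari_env(env_id, **kwargs)
    # Recent SB3 versions spell the wrapper keyword `terminal_on_life_loss`.
    wrapper_kwargs = dict(kwargs.pop("wrapper_kwargs", None) or {})
    wrapper_kwargs.setdefault("terminal_on_life_loss", False)
    return make_atari_env(env_id, wrapper_kwargs=wrapper_kwargs, **kwargs)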

@araffin araffin transferred this issue from DLR-RM/rl-baselines3-zoo Oct 28, 2021
@araffin araffin added the enhancement New feature or request label Oct 28, 2021
@araffin
Member

araffin commented May 1, 2022

The current AtariWrapper by default has terminate_on_life_loss set to True. This goes against the recommendations of Revisiting the Arcade Learning Environment (https://arxiv.org/pdf/1709.06009.pdf). I believe this should be set to False by default. They also recommend using sticky actions instead of noop resets, but I think that problem is outside the scope of this wrapper.

I did quick experiments on Breakout with PPO, with and without terminal on life loss activated, and the current default seems to have a good impact on performance:
https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/Atari-defaults-PPO---VmlldzoxOTA4NjUz

Results are not significant at all yet, only 3 runs, but it should be investigated further.

@qgallouedec
Collaborator

What about sticky actions?

@araffin
Member

araffin commented Jan 2, 2023

What about sticky actions?

you mean its influence on performance?

I don't know. I think the main issue is that the changes were made without a benchmark (in the paper, the comparison is only partial), and it looks like one is still missing.
@RyanNavillus @JesseFarebro @pseudo-rnd-thoughts am I wrong?

@qgallouedec
Collaborator

qgallouedec commented Jan 3, 2023

I meant, is it implemented? After digging into the code, I realize that sticky actions are enabled by default directly by ALE, but only for v0 and v5 (not for v4, which is used by SB3); see the gym documentation and the ALE source code. It seems to me that sticky actions are used by most works, so shouldn't we provide a StickyActionWrapper for v4 (sketched below)?
EDIT: Or just add a default repeat_action_probability=0.25 to env_kwargs in the common.env_util.make_atari_env function.

Somehow related: DLR-RM/rl-baselines3-zoo#133 (comment)
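
For illustration, such a StickyActionWrapper could be sketched as follows (hypothetical class, not an SB3 wrapper; unlike ALE's per-frame implementation, this simpler version applies stickiness once per agent step):

import gym
import numpy as np

class StickyActionWrapper(gym.Wrapper):
    """With probability repeat_action_probability, repeat the previous action
    instead of the newly chosen one (Machado et al., 2018, section 5.2)."""

    def __init__(self, env, repeat_action_probability=0.25):
        super().__init__(env)
        self.repeat_action_probability = repeat_action_probability
        self._rng = np.random.default_rng()
        self._last_action = 0

    def reset(self, **kwargs):
        self._last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        # With probability p, ignore the new action and repeat the last one.
        if self._rng.random() < self.repeat_action_probability:
            action = self._last_action
        self._last_action = action
        return self.env.step(action)

The env_kwargs route from the EDIT above should also work, since gym.make forwards the keyword to the ALE constructor, e.g. make_atari_env(env_id, env_kwargs={"repeat_action_probability": 0.25}).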

@pseudo-rnd-thoughts
Contributor

I believe the primary difference between v0 and v4 is whether sticky actions are enabled. This is a confusing change that ALE made.

https://gymnasium.farama.org/environments/atari/#version-history-and-naming-schemes

@RyanNavillus
Author

@araffin I think the goal of sticky actions was not to improve performance w.r.t. reward but to produce a robust policy instead of a deterministic action sequence. They show that training with sticky actions causes DQN to have roughly the same performance whether you evaluate with sticky actions on or off, while methods designed to exploit determinism in the environment perform much worse when evaluated with sticky actions on. They argue that sticky actions are better than other methods for various reasons in section 5.3.

TL;DR, sticky actions are the recommended way to prevent agents from abusing determinism, not a way to improve rewards.

@RyanNavillus
Author

RyanNavillus commented Jan 8, 2023

There is a much weaker argument in the paper that we should not use terminate_on_life_loss because that's environment-specific knowledge, so algorithms evaluated with that setting will overfit more to Atari. They also argue that terminate_on_life_loss has a debatable effect on performance, but it seems like your experiments show that it can help.

@araffin
Member

araffin commented Jan 9, 2023

TL;DR, sticky actions are the recommended way to prevent agents from abusing determinism, not a way to improve rewards.

thanks for your answer =)
My question was not about checking that it improves performance, but rather about knowing its impact (positive or negative) on performance.
Because if you don't have a baseline (I think the paper mostly benchmarked DQN), then it's hard to compare things.
My point is more that the changes were made to the Atari defaults, but we still have no idea what the difference in performance is or what the baseline results for those defaults are. As shown by my quick experiments, there might be significant changes, but we currently don't know...

@RyanNavillus
Author

we still have no idea what is the difference in performance and what are baselines results for those defaults

I might not be following exactly: why do we care about performance if the point of these changes is to prevent the agent from "cheating"? I assume you would expect some drop in performance, because both changes make the task harder for the agent.

Is the issue that people are using existing SB3 results in their papers, and might mistakenly attribute the charts that you have now (without sticky actions) to agents trained with sticky actions if you change the defaults? I personally don't use results from a repository without rerunning the baselines myself, but I see why that could be a concern.

I think you're correct that the paper only benchmarks DQN/Sarsa on Atari games when evaluating sticky actions (in Tables 8/9 they test their recommended settings on every Atari environment for those two algorithms). I'm sure there are other papers that use sticky actions for Atari baselines, but it would be a lot of work to find and verify those results. Probably faster to just run them yourself at that point. On that note, how much compute would it take to just run these baselines yourself?

@araffin
Member

araffin commented Jan 9, 2023

Is the issue that people are using existing SB3 results in their papers, and might mistakenly attribute the charts that you have now

yes, that's the issue. And not only about SB3 results but all results published so far. You want to be able to compare apples to apples.

I personally don't use results from a repository without rerunning the baselines myself, but I see why that could be a concern.

this is indeed good practice, but not everybody can afford to do so (students/independent researchers/small labs/countries where compute is much more expensive/...).

On that note, how much compute would it take to just run these baselines yourself?

if we want to run on all Atari games, a lot... (you need 10 experiments x n_games x n_algorithms, and this grows quickly...)
@qgallouedec is currently running a full benchmark of the RL Zoo, and we are using a subset of Atari games (openrlbenchmark/openrlbenchmark#7); all runs are logged using W&B and help is welcome =)

@RyanNavillus
Author

RyanNavillus commented Jan 9, 2023

this is indeed good practice but not everybody can afford doing so (students/independent researchers/small labs/countries where compute is actually much more expensive/...).

Sorry, I didn't intend for that to sound like a moral argument, I meant to say that I haven't worked on any papers that required it. I agree that having reliable baselines available is important for those reasons.

if we want to run on all atari games, a lot... (you need 10 experiments x n_games x n_algorithms and this grows quickly...)

That is certainly a lot. It's unfortunate, but without baselines like this it's hard for the community to adopt the recommendations from the paper. Would it be possible to set up a system for people to contribute individual runs? I certainly can't afford all that compute myself, but I'd be willing to contribute runs where I can. Aside from sticky actions, I feel like that would be helpful for a bunch of things. If it's possible to export runs on wandb, it shouldn't be too hard to do.

Aside from that, would it make sense to change the defaults and put a disclaimer on the existing Atari results that they were run with sticky actions disabled and terminate on life loss enabled? That way new results will use the good defaults, but if people want to compare to old results they'll see the disclaimer and change their settings accordingly. Not perfect, but it might be a step in the right direction.

@araffin
Member

araffin commented Jan 9, 2023

Would it be possible to set up a system for people to contribute individual runs?

yes =D, that's the whole point of the openrl benchmark initiative by @vwxyzjn (best is probably to open an issue there to keep track of things and ask for permission to add runs to the sb3 namespace).

Aside from that, would it make sense to change the defaults and put a disclaimer on the existing Atari results that they were run with sticky actions and terminate on life loss disabled? That way new results will use the good defaults,

my whole point for not changing it is that I want the results first, and to change the default once we have the baselines (I also wouldn't call them "good defaults" but "new defaults").
But we can later tag all Atari runs done with v4 (see discussion in DLR-RM/rl-baselines3-zoo#336).

@RyanNavillus
Author

yes =D, that's the whole point of the openrl benchmark initiative by @vwxyzjn (best is probably to open an issue there to keep track of things and ask for permission to add runs to the sb3 namespace).

Oh awesome, I didn't realize anyone could contribute to that. I'll look into contributing my limited compute where I can :)

I also wouldn't call them "good defaults" but "new defaults"

I think it's safe to call them good from the perspective of evaluation fairness and policy robustness, at least according to the (only?) paper discussing the impact of different ALE settings. Probably not for performance.

I'm not sure I agree with waiting to change the defaults, I think a disclaimer on the current results would be enough, but that's just my 2 cents. You know how this would impact SB3 users better than me.

@qgallouedec
Collaborator

I want the result first and change the default once we got the baselines

Agree, it's a good idea to have both in the benchmark (even if it doubles the number of runs to do).

I'm not sure I agree with waiting to change the defaults, I think a disclaimer on the current results would be enough

Also agree. I would recommend adding a "sticky actions" param as soon as possible, set to False at first so as not to change the default setting. But I agree with @RyanNavillus: in the shortest possible term, I think we should change the default to sticky actions = True, probably with a warning in the doc and in the code.

@vwxyzjn
Contributor

vwxyzjn commented Jan 9, 2023

FWIW, here are the default settings used by openai/baselines (first block) and the recommended settings (second block):

envs = envpool.make(
    "Breakout-v5",
    env_type="gym",
    num_envs=8,
    episodic_life=True,
    repeat_action_probability=0.0,
    noop_max=30,
    full_action_space=False,
    # https://github.com/openai/gym/blob/a7b6462136ebaa610c8941e4da8a9c92155b04d1/gym/envs/__init__.py#L799
    # max_frames = 100000 * 4 = 400000
    max_episode_steps=100000,
    reward_clip=True,
    img_height=84,
    img_width=84,
    stack_num=4,
    gray_scale=True,
    frame_skip=4,
)
envs = envpool.make(
    "Breakout-v5",
    env_type="gym",
    num_envs=8,
    episodic_life=False,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 6
    repeat_action_probability=0.25,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 12
    noop_max=1,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 12 (no-op is deprecated in favor of sticky action, right?)
    full_action_space=True,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) Tab. 5
    max_episode_steps=int(108000 / 4),  # Hessel et al. 2018 (Rainbow DQN), Table 3, Max frames per episode
    reward_clip=True,
    img_height=84,
    img_width=84,
    stack_num=4,
    gray_scale=True,
    frame_skip=4,
)

Maybe we need to double-check whether noop_max should be set to 0, but when I do, it gives me an error... @Trinkle23897

With ALE/Breakout-v5, it should reproduce the recommended setting with full_action_space=True and the existing wrappers (without the no-op wrapper and episodic life wrapper) because 1) sticky actions are automatically applied, 2) the max episode frames is already set to 108000.
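
A minimal sketch of that (assuming the v5 registration described above):

import gym

# v5 already enables sticky actions and the frame limit mentioned above,
# so only the full action space needs to be requested explicitly.
env = gym.make("ALE/Breakout-v5", full_action_space=True)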

@qgallouedec
Collaborator

qgallouedec commented Jan 9, 2023

(no-op is deprecated in favor of sticky action, right?)

No. In fact, some papers use no-ops (NGU, Agent57, QR-DQN, R2D2), some use sticky actions (DreamerV2, Muesli), and some use both, either combining them (Go-Explore) or reporting results for both settings separately (MuZero).

IMO, the best is to use both as default. Justification from Ecoffet et al. 2021 (Go-Explore):

As it is possible that no-ops add some challenges not encountered with just sticky actions, we ensure that Go-Explore is evaluated under conditions that are at least as difficult as those presented in recent papers by evaluating Go-Explore with both sticky actions and no-ops.

@araffin According to Schrittwieser et al. 2021 (MuZero), sticky actions have a low impact on the score.

EDIT: added Muesli after #635 (comment)
EDIT2: EnvPool uses no-ops by default but not sticky actions; ping DLR-RM/rl-baselines3-zoo#307
EDIT3: Fix EDIT2
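
For illustration, combining both kinds of stochasticity (as in the Go-Explore quote above) might look like the following sketch; NoopResetEnv is the existing SB3 wrapper, and the sticky-action part assumes gym forwards repeat_action_probability to the ALE constructor:

import gym

from stable_baselines3.common.atari_wrappers import NoopResetEnv

# Sketch: sticky actions via the ALE kwarg plus random no-op starts.
env = gym.make("BreakoutNoFrameskip-v4", repeat_action_probability=0.25)
env = NoopResetEnv(env, noop_max=30)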

@vwxyzjn
Contributor

vwxyzjn commented Jan 9, 2023

@qgallouedec, thanks for these resources. Note that in Muesli (Table 4) no-ops are disabled... Bottom line is, I think no-ops only affect the first 30 frames and should not impact the results that much, whereas sticky actions are a more persistent way to add stochasticity.

reporting results for both settings separately (MuZero).

Could you point me to the page or table number for both settings?

@araffin According to Schrittwieser et al. 2021 (MuZero), sticky actions have a low impact on the score.

To add a data point: in my experiments, sticky actions (along with other recommended changes, but still using the minimal action space) make the median HNS go down from 97.83% to 73.47%. See (data and calculation)

@araffin
Member

araffin commented Jan 9, 2023

Linking discussion with @JesseFarebro as it's relevant to that issue: #572 (comment)
(also relevant: #734)

EDIT: it seems that the gymnasium documentation is outdated

@qgallouedec
Collaborator

qgallouedec commented Jan 9, 2023

reporting results for both settings separately (MuZero).

Could you point me to the page or table number for both settings?

Table 12 and section 7. After reading it a second time, it is not clear whether they use no-op starts in "MuZero sticky actions". Anyway, it doesn't matter.
