Official implementation of the NeurIPS'23 paper "Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets"
Abstract: Offline policy learning is aimed at learning decision-making policies using existing datasets of trajectories without collecting additional data. The primary motivation for using reinforcement learning (RL) instead of supervised learning techniques such as behavior cloning is to find a policy that achieves a higher average return than the trajectories constituting the dataset. However, we empirically find that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset. We argue this is due to an assumption made by current offline RL algorithms of staying close to the trajectories in the dataset. If the dataset primarily consists of suboptimal trajectories, this assumption forces the policy to mimic the suboptimal actions. We overcome this issue by proposing a sampling strategy that enables the policy to be constrained only to "good data" rather than all actions in the dataset (i.e., uniform sampling). We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms. Our evaluation demonstrates significant performance gains on 72 imbalanced datasets based on D4RL, across three different offline RL algorithms.
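At a high level, the idea is to sample the data used for the policy constraint non-uniformly so that it concentrates on higher-quality trajectories. The Python sketch below is a minimal illustration of that idea only, assuming a simple return-weighted softmax over trajectories; it is not the weighting scheme proposed in the paper, and the class, parameter, and key names are made up for illustration.

```python
# Minimal illustration only: return-weighted trajectory sampling for an offline RL buffer.
# This is NOT the paper's algorithm; the softmax weighting and temperature are assumptions.
import numpy as np

class ReturnWeightedSampler:
    def __init__(self, trajectories, temperature=1.0):
        # trajectories: list of dicts with "observations", "actions", "rewards" arrays
        self.trajectories = trajectories
        returns = np.array([np.sum(t["rewards"]) for t in trajectories], dtype=np.float64)
        # Normalize returns, then softmax: high-return trajectories are sampled more often,
        # so the policy constraint is anchored to "good data" instead of the whole dataset.
        z = (returns - returns.mean()) / (returns.std() + 1e-8)
        weights = np.exp(z / temperature)
        self.probs = weights / weights.sum()

    def sample_batch(self, batch_size):
        # Sample trajectories by weight, then one random transition from each.
        traj_idx = np.random.choice(len(self.trajectories), size=batch_size, p=self.probs)
        obs, actions = [], []
        for i in traj_idx:
            t = self.trajectories[i]
            j = np.random.randint(len(t["actions"]))
            obs.append(t["observations"][j])
            actions.append(t["actions"][j])
        return np.stack(obs), np.stack(actions)
```

A sampler like this could replace the uniform transition sampler in an offline RL training loop, leaving the underlying algorithm (e.g., CQL, IQL, TD3BC) unchanged, which is what makes the approach usable as a plug-and-play module.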
WIP
- Install `python>=3.8`, or use `conda`.
- Install `suboptimal_offline_datasets`.
- Install the requirements: `pip install -r requirements.txt`
- Install the customized `d3rlpy` (only required for TD3BC):
  - `cd TD3BC/d3rlpy`
  - `pip install -e .`
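If the installation succeeded, a quick sanity check is to load one of the D4RL datasets. The snippet below assumes D4RL and MuJoCo were installed correctly by the steps above; the environment name is only an example.

```python
# Optional sanity check (assumes D4RL and MuJoCo are installed; the env name is an example).
import gym
import d4rl  # noqa: F401  # importing d4rl registers its environments with gym

env = gym.make("hopper-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, terminals, ...
print(dataset["observations"].shape, dataset["actions"].shape)
```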
Set up local dependencies by running `source setup.sh`, and then launch the experiments using the scripts in `experiments`. The experiment launching scripts are structured as follows:
- `./experiments`: all experiment launching scripts
  - `d4rl`: experiment launching scripts for D4RL datasets
    - `run_cql.py`: CQL experiments
    - `run_iql.py`: IQL experiments
    - `run_td3bc.py`: TD3BC experiments
You may launch an experiment with the following command:

`python experiments/d4rl/run_{cql | iql | td3bc}.py --mode {mode} --gpus {GPU ID list} --n_jobs {Number of parallel jobs}`

The above command generates a set of jobs, and each job consists of multiple commands that run the experiments on the devices you specify. The following explains the purpose of each argument:
- `--mode`: Each launching script generates the commands to run each experiment. You may choose the `mode` among the following options to execute those commands:
  - `local`: Execute each job in the local session.
  - `screen`: Execute each job in a new `screen` session.
  - `sbatch`: Execute each job as an `sbatch` job submission. (You may need to customize the Slurm job template for your cluster in `experiment/__init__.py`; see `SUPERCLOUD_SLURM_GPU_BASE_SCRIPT` in `experiment/__init__.py`.)
  - `bash`: Only generate the bash commands for running the experiments. This is useful when you want to dump the commands and customize them for small experiments. We suggest using this with redirection (e.g., `python experiments/d4rl/run_cql.py --mode bash --gpus 0 > commands.sh`).
  - `gcp`: Launch jobs on GCP with `jaynes`. Note that I have not been maintaining this feature for a while; you may need to figure out the configuration details.
- `--gpus`: A list of GPU IDs on which you would like to run jobs. For example, if you want to run jobs on GPUs 0, 1, 2, 3, and 4, use `--gpus 0 1 2 3 4`, and the jobs will be evenly distributed across those GPUs.
- `--n_jobs`: Number of jobs executed in parallel. Each job consists of multiple commands that will be executed sequentially. (A simplified sketch of this scheduling is shown below.)
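For clarity, the sketch below illustrates the scheduling behavior described above: jobs are assigned to the listed GPUs in round-robin fashion, at most `n_jobs` run in parallel, and each job executes its commands sequentially. It is a simplified, hypothetical illustration, not the repository's actual launcher code, and all names in it are assumptions.

```python
# Simplified, hypothetical illustration of the scheduling described above
# (NOT the repository's launcher): round-robin GPU assignment, n_jobs in parallel,
# and sequential commands within each job.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_job(commands, gpu_id):
    # Each job pins all of its commands to one GPU and runs them one after another.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    for cmd in commands:
        subprocess.run(cmd, shell=True, check=True, env=env)

def launch(jobs, gpus, n_jobs):
    # jobs: list of command lists; gpus: e.g. [0, 1, 2, 3, 4]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(run_job, commands, gpus[i % len(gpus)])  # round-robin
                   for i, commands in enumerate(jobs)]
        for f in futures:
            f.result()  # surface any failures
```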
WIP
@inproceedings{
  hong2023dw,
  title={Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets},
  author={Zhang-Wei Hong and Aviral Kumar and Sathwik Karnik and Abhishek Bhandwaldar and Akash Srivastava and Joni Pajarinen and Romain Laroche and Abhishek Gupta and Pulkit Agrawal},
  booktitle={Proceedings of the 37th Conference on Neural Information Processing Systems},
  year={2023},
  url={https://arxiv.org/pdf/2310.04413.pdf}
}