Official codebase for "Going Beyond Heuristics by Imposing Policy Improvement As A Constraint", NeurIPS'24
Paper link: https://openreview.net/pdf?id=vBGMbFgvsX
Abstract: In many reinforcement learning (RL) applications, incorporating heuristic rewards alongside the task reward is crucial for achieving desirable performance. Heuristics encode prior human knowledge about how a task should be done, providing valuable hints for RL algorithms. However, such hints may not be optimal, limiting the performance of learned policies. The currently established way of using heuristics is to modify the heuristic reward in a manner that ensures that the optimal policy learned with it remains the same as the optimal policy for the task reward (i.e., optimal policy invariance). However, these methods often fail in practical scenarios with limited training data. We found that while optimal policy invariance ensures convergence to the best policy based on task rewards, it doesn't guarantee better performance than policies trained with biased heuristics under a finite data regime, which is impractical. In this paper, we introduce a new principle tailored for finite data settings. Instead of enforcing optimal policy invariance, we train a policy that combines task and heuristic rewards and ensures it outperforms the heuristic-trained policy. As such, we prevent policies from merely exploiting heuristic rewards without improving the task reward. Our experiments on robotic locomotion, helicopter control, and manipulation tasks demonstrate that our method consistently outperforms the heuristic policy, regardless of the heuristic rewards' quality.
1. Clone this repository:
   ```bash
   git clone https://github.com/Improbable-AI/hepo
   ```
2. Install the dependencies:
   ```bash
   cd hepo
   conda env create -f environment.yaml
   pip install -e .
   ```
3. Install IsaacGym:
   - Download and install Isaac Gym Preview 4 from https://developer.nvidia.com/isaac-gym
   - Unzip the file:
     ```bash
     tar -xf IsaacGym_Preview_4_Package.tar.gz
     ```
   - Install the Python package:
     ```bash
     cd isaacgym/python && pip install -e .
     ```
   - Verify the installation by running an example:
     ```bash
     python examples/1080_balls_of_solitude.py
     ```
   - For troubleshooting, check the documentation at `isaacgym/docs/index.html`.
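To quickly check that the `hepo` environment and IsaacGym work together, a short training run can serve as a smoke test. The command below is a minimal sketch that reuses the run command shown later in this README with a deliberately small `max_iterations`; it assumes such small values are accepted and is only meant to exercise the pipeline, not to train a useful policy.

```bash
# Smoke test (sketch): a very short PPO run on Ant with the task reward only.
# Assumes a small max_iterations value is accepted by train.py.
python train.py ext_scheme='success' task=Ant max_iterations=10 seed=0
```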
The following commands run the experiments for the methods `H-only`, `J-only`, and `HEPO` (see our paper for details). In short, `H-only` is PPO trained with the heuristic reward provided with the benchmark task, `J-only` is PPO trained with the task reward only, and `HEPO` is trained with both the heuristic and task rewards.

```bash
# J-only
python train.py ext_scheme='success' task=$task max_iterations=$max_iterations seed=$seed

# H-only
python train.py ext_scheme='total' task=$task max_iterations=$max_iterations seed=$seed

# HEPO
python train.py lmbd=1. alpha=0. update_alpha_gae=True use_hepo=True ext_scheme='success' int_scheme='total' alpha_lr=0.0001 task=$task max_iterations=$max_iterations seed=$seed
```
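For convenience, the three commands above can be wrapped in a small launcher script. The sketch below loops over a few seeds for a single task; `TASK` and `MAX_ITER` are placeholder variables to be filled in from the table below, and the script simply replays the commands shown above.

```bash
#!/usr/bin/env bash
# Sketch: run J-only, H-only, and HEPO for several seeds on one task.
# TASK and MAX_ITER are placeholders; pick values from the table below.
TASK=Ant
MAX_ITER=3000

for SEED in 0 1 2; do
  # J-only: PPO with the task reward only
  python train.py ext_scheme='success' task=$TASK max_iterations=$MAX_ITER seed=$SEED
  # H-only: PPO with the heuristic reward shipped with the benchmark
  python train.py ext_scheme='total' task=$TASK max_iterations=$MAX_ITER seed=$SEED
  # HEPO: both heuristic and task rewards
  python train.py lmbd=1. alpha=0. update_alpha_gae=True use_hepo=True \
    ext_scheme='success' int_scheme='total' alpha_lr=0.0001 \
    task=$TASK max_iterations=$MAX_ITER seed=$SEED
done
```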
For `$task` and `$max_iterations`, please substitute values from the table below into the commands above. The available IsaacGym and Bi-Dex tasks and their suggested `max_iterations` are the following:
Task | max_iterations |
---|---|
Ant | 3000 |
Anymal | 1500 |
Quadcopter | 3000 |
Ingenuity | 800 |
Humanoid | 10000 |
FrankaCabinet | 1500 |
FrankaCubeStack | 5000 |
AllegroHand | 15000 |
ShadowHand | 15000 |
ShadowHandSpin | 10000 |
ShadowHandUpsideDown | 10000 |
ShadowHandBlockStack | 10000 |
ShadowHandBottleCap | 10000 |
ShadowHandCatchAbreast | 10000 |
ShadowHandCatchOver2Underarm | 10000 |
ShadowHandCatchUnderarm | 10000 |
ShadowHandDoorCloseInward | 10000 |
ShadowHandDoorCloseOutward | 10000 |
ShadowHandDoorOpenInward | 10000 |
ShadowHandDoorOpenOutward | 10000 |
ShadowHandGraspAndPlace | 10000 |
ShadowHandKettle | 10000 |
ShadowHandLiftUnderarm | 10000 |
ShadowHandPen | 10000 |
ShadowHandOver | 10000 |
ShadowHandPushBlock | 10000 |
ShadowHandReOrientation | 10000 |
ShadowHandScissors | 10000 |
ShadowHandSwingCup | 10000 |
ShadowHandSwitch | 10000 |
ShadowHandTwoCatchUnderarm | 10000 |
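To sweep several tasks from the table, one option is a small bash associative array that maps each task to its suggested `max_iterations`. The script below is an illustrative sketch (bash 4+ for `declare -A`); only a few table entries are included, and it launches the `H-only` baseline as an example.

```bash
#!/usr/bin/env bash
# Sketch: map tasks to their suggested max_iterations (subset of the table above).
declare -A MAX_ITERS=(
  [Ant]=3000
  [Anymal]=1500
  [Ingenuity]=800
  [FrankaCabinet]=1500
  [ShadowHand]=15000
)

SEED=0
for TASK in "${!MAX_ITERS[@]}"; do
  # H-only baseline; swap in the J-only or HEPO command as needed.
  python train.py ext_scheme='total' task=$TASK max_iterations=${MAX_ITERS[$TASK]} seed=$SEED
done
```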
For logging, we use Weights & Biases (W&B). Please add the following arguments to the command if you'd like to log the experimental results:

```
wandb_activate=True
wandb_project=${W&B project name}
experiment=${W&B run name}
```
The average task reward of `H-only` and `J-only` is stored at `rewards/ppo_metric`, and that of `HEPO` is stored at `rewards/hepo_metric`.
One example command with logging looks like:

```bash
python train.py ext_scheme='success' task=Ant max_iterations=3000 seed=0 wandb_activate=True wandb_project=Ant experiment=J_only
```
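Similarly, a logged HEPO run on Ant would combine the HEPO command above with the same logging flags (the project and run names below are arbitrary examples, not required values); its average task reward then appears under `rewards/hepo_metric`:

```bash
python train.py lmbd=1. alpha=0. update_alpha_gae=True use_hepo=True \
  ext_scheme='success' int_scheme='total' alpha_lr=0.0001 \
  task=Ant max_iterations=3000 seed=0 \
  wandb_activate=True wandb_project=Ant experiment=HEPO
```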
To cite this work:

```bibtex
@inproceedings{
  lee2024going,
  title={Going Beyond Heuristics by Imposing Policy Improvement as a Constraint},
  author={Chi-Chang Lee* and Zhang-Wei Hong* and Pulkit Agrawal},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=vBGMbFgvsX}
}
```