A reinforcement learning project exploring different RL algorithms on a custom-built grid-world environment (BoxEnv).
We implement model-free RL (MFRL) algorithms such as Q-Learning, DQN, and PPO, extend the project to model-based RL (MBRL) approaches such as TreeQN and SAVE, and go further to an MBPO-style approach: DQN with Monte Carlo Tree Search (MCTS) for planning-enhanced decision making.
The BoxEnv environment is a grid-world style task where an agent must navigate through walls and obstacles to reach the goal. The sparse reward structure makes it a challenging environment for standard RL algorithms.
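As a rough illustration, an episode loop over BoxEnv might look like the sketch below. This assumes a Gym-style interface; the import path, `reset`/`step` signatures, and action-space API are assumptions based on common conventions, not the exact interface in `box_env/`.

```python
# Minimal interaction loop with BoxEnv, assuming a Gym-style API.
# The import path and the reset()/step() signatures are illustrative
# assumptions, not the exact interface shipped in box_env/.
from box_env import BoxEnv  # hypothetical import path

env = BoxEnv()
state = env.reset()
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()  # random policy, just for illustration
    state, reward, done, info = env.step(action)
    total_reward += reward

print(f"Episode return: {total_reward}")
```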
```
.
├── box_env/          # Custom environment and wrappers
├── config/           # Algorithm hyperparameter YAML files
├── models/           # Implementations of Q-Learning, DQN, PPO
├── scripts/          # Training & evaluation scripts
├── results/          # Checkpoints, CSV logs, plots
├── utils/            # Helper functions, plotting, save/load utilities
├── main.py           # CLI entrypoint for training
└── pyproject.toml    # Project metadata
```
- Clone the repository:

  ```bash
  git clone https://github.com/Manohara-Ai/RL_Experiment.git
  cd RL_Experiment
  ```

- Install dependencies:

  ```bash
  pip install -e .
  ```

Use the main.py entrypoint to run training with the algorithm of your choice:

```bash
python3 main.py --train qlearning
python3 main.py --train dqn
python3 main.py --train ppo
```

This will:

- Load hyperparameters from the corresponding YAML config in config/ (see the sketch after this list)
- Initialize the BoxEnv environment
- Train the chosen agent
- Save results in results/ (checkpoints, logs, and plots)
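As a sketch of the config-loading step, the snippet below reads a YAML file with PyYAML. The file name and keys shown are illustrative assumptions, not the repo's actual schema.

```python
# Hypothetical sketch of how main.py might load a hyperparameter config.
# Assumes PyYAML; the file name and keys below are illustrative only.
import yaml

with open("config/dqn.yaml") as f:
    cfg = yaml.safe_load(f)

# cfg might contain entries such as:
# {"learning_rate": 5e-4, "gamma": 0.99, "batch_size": 64, "num_episodes": 2000}
learning_rate = cfg.get("learning_rate", 5e-4)
gamma = cfg.get("gamma", 0.99)
```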
All training logs and evaluation metrics are saved under results/. Here we present the key findings from our experiments.
The Q-Learning agent shows limited learning capability. Its average reward fluctuates around a negative value throughout training and fails to converge toward a consistent optimal policy. This stagnation is expected given the large state space of BoxEnv—tabular Q-learning cannot effectively explore or generalize in this environment. The agent frequently gets stuck in loops, colliding with walls or obstacles, which results in persistently negative rewards.
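For reference, the tabular update rule this agent relies on is sketched below. This is a generic implementation of epsilon-greedy Q-learning, not the exact code in models/, and the action set is a placeholder.

```python
# Generic tabular Q-learning -- illustrative, not the repo's exact code.
from collections import defaultdict
import random

Q = defaultdict(float)          # maps (state, action) -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]          # placeholder action set, e.g. up/down/left/right

def select_action(state):
    # epsilon-greedy over the current table
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```

Because the table holds one entry per (state, action) pair, rarely visited cells in a large grid never receive useful updates, which is exactly the stagnation seen in the reward curve.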
The DQN agent demonstrates clearer learning progress. Starting from negative rewards, its performance steadily improves as training progresses. By leveraging a deep neural network with experience replay, DQN successfully generalizes across unseen states and avoids the pitfalls of tabular Q-learning. The reward curve becomes smoother and trends upward, though evaluation results reveal that it still struggles to consistently solve the environment.
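A minimal sketch of the DQN update with experience replay is shown below, assuming PyTorch. The network sizes, buffer capacity, and hyperparameters are illustrative assumptions, not the values used in this repo.

```python
# Sketch of a DQN training step with experience replay (PyTorch assumed).
# Network architecture, buffer size, and hyperparameters are illustrative only.
import random
from collections import deque

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=5e-4)
replay = deque(maxlen=50_000)   # stores (state, action, reward, next_state, done)
gamma, batch_size = 0.99, 64

def train_step():
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, d = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # bootstrap from the slowly updated target network
        target = r + gamma * (1 - d) * target_net(s2).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```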
The PPO agent performs poorly in this environment. Its reward curve consistently decreases, reflecting the difficulty of sparse reward tasks for on-policy algorithms. PPO fails to gather enough positive experiences to update its policy effectively. Unlike DQN, PPO cannot rely on off-policy replay, which further limits its ability to bootstrap from rare successful trajectories.
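For context, the clipped surrogate objective that PPO optimizes is sketched below; this is the generic loss form (PyTorch assumed), not the repo's exact implementation.

```python
# Generic PPO clipped surrogate loss -- illustrative, not the repo's exact code.
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # probability ratio r_t = pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # maximize the surrogate, so minimize its negation
    return -torch.min(unclipped, clipped).mean()
```

When the advantages are almost always near zero or negative, as in a sparse-reward grid world, this objective gives the policy little signal to improve on.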
The TreeQN algorithm achieves a significantly higher average reward compared to Q-Learning, indicating that it learns a better policy that yields greater cumulative returns per episode. However, Q-Learning displays lower variance and more stable performance with fewer extreme outcomes. Thus, while TreeQN demonstrates superior learning capability and higher overall rewards, Q-Learning remains more consistent. In typical reinforcement learning terms—where maximizing average return is the main objective—TreeQN is considered the better-performing agent.
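Conceptually, TreeQN backs up values through a shallow lookahead tree built with a learned model. The sketch below shows a heavily simplified recursive backup of that kind; `transition_fn`, `reward_fn`, and `value_fn` are placeholders standing in for learned model components, not the repo's actual networks.

```python
# Highly simplified TreeQN-style value backup over a depth-d lookahead tree.
# transition_fn, reward_fn, and value_fn are placeholders for learned model
# components; they are not the repo's actual modules.
def tree_backup(state, depth, actions, transition_fn, reward_fn, value_fn, gamma=0.99):
    """Return the backed-up value of `state` after expanding `depth` levels."""
    if depth == 0:
        return value_fn(state)
    q_values = []
    for a in actions:
        next_state = transition_fn(state, a)   # predicted next (latent) state
        r = reward_fn(state, a)                # predicted immediate reward
        q_values.append(r + gamma * tree_backup(next_state, depth - 1, actions,
                                                transition_fn, reward_fn, value_fn, gamma))
    return max(q_values)                       # max over actions, as in Q-learning
```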
The SAVE algorithm demonstrates strong learning, with the average reward consistently rising from a negative starting point (around −1 to −2) to a final value near 5 to 6. This significant upward trend confirms the agent is successfully learning a better policy over 2000 episodes. However, the performance is marked by high variance, as shown by the wide shaded region.
The incorporation of Value-Guided Search significantly improves the stability and performance of the base DQN agent. Standard DQN, despite its potential for high reward, is highly unstable and inconsistent on this task. The SAVE mechanism provides some stabilization over plain DQN but is outperformed by the explicit Value-Guided Search planning component.
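As a rough illustration of how a learned Q-network can guide search, the sketch below scores actions at a tree node by combining value estimates with a visit-count bonus. This is a generic UCT-style selection rule; the function and its arguments are hypothetical and not the repo's exact planner.

```python
# Generic sketch of value-guided action selection at a search-tree node.
# q_estimate and the visit counts are placeholders; this is a UCT-style rule,
# not the repo's exact MCTS implementation.
import math

def select_search_action(node_visits, child_visits, q_estimate, actions, c_uct=1.0):
    """Pick the action maximizing Q(s, a) plus an exploration bonus."""
    def score(a):
        bonus = c_uct * math.sqrt(math.log(node_visits + 1) / (child_visits[a] + 1))
        return q_estimate[a] + bonus
    return max(actions, key=score)
```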
This project is licensed under the terms described in the LICENSE file.






