A reinforcement learning project exploring different RL algorithms on a custom-built grid-world environment (BoxEnv).
We implement model-free RL (MFRL) algorithms such as Q-Learning, DQN, and PPO, extend the project to model-based RL (MBRL) approaches such as TreeQN and SAVE, and go further to an MBPO-style approach: DQN with Monte Carlo Tree Search (MCTS) for planning-enhanced decision making.
The BoxEnv environment is a grid-world style task where an agent must navigate through walls and obstacles to reach the goal. The sparse reward structure makes it a challenging environment for standard RL algorithms.
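As a rough illustration, an episode loop over BoxEnv might look like the sketch below. This assumes a Gym-style interface; the import path, `reset`/`step` signatures, and action-space API are assumptions based on common conventions, not the exact interface in `box_env/`.

```python
# Minimal interaction loop with BoxEnv, assuming a Gym-style API.
# The import path and the reset()/step() signatures are illustrative
# assumptions, not the exact interface shipped in box_env/.
from box_env import BoxEnv  # hypothetical import path

env = BoxEnv()
state = env.reset()
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()  # random policy, just for illustration
    state, reward, done, info = env.step(action)
    total_reward += reward

print(f"Episode return: {total_reward}")
```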
```
.
├── box_env/          # Custom environment and wrappers
├── config/           # Algorithm hyperparameter YAML files
├── models/           # Implementations of Q-Learning, DQN, PPO
├── scripts/          # Training & evaluation scripts
├── results/          # Checkpoints, CSV logs, plots
├── utils/            # Helper functions, plotting, save/load utilities
├── main.py           # CLI entrypoint for training
└── pyproject.toml    # Project metadata
```
- Clone the repository:

  ```bash
  git clone https://github.com/Manohara-Ai/RL_Experiment.git
  cd RL_Experiment
  ```

- Install dependencies:

  ```bash
  pip install -e .
  ```

Use the main.py entrypoint to run training with the algorithm of your choice:

```bash
python3 main.py --train qlearning
python3 main.py --train dqn
python3 main.py --train ppo
```

This will:

- Load hyperparameters from the corresponding YAML config in config/ (see the sketch after this list)
- Initialize the BoxEnv environment
- Train the chosen agent
- Save results in results/ (checkpoints, logs, and plots)
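As a sketch of the config-loading step, the snippet below reads a YAML file with PyYAML. The file name and keys shown are illustrative assumptions, not the repo's actual schema.

```python
# Hypothetical sketch of how main.py might load a hyperparameter config.
# Assumes PyYAML; the file name and keys below are illustrative only.
import yaml

with open("config/dqn.yaml") as f:
    cfg = yaml.safe_load(f)

# cfg might contain entries such as:
# {"learning_rate": 5e-4, "gamma": 0.99, "batch_size": 64, "num_episodes": 2000}
learning_rate = cfg.get("learning_rate", 5e-4)
gamma = cfg.get("gamma", 0.99)
```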
All training logs and evaluation metrics are saved under results/. Here we present the key findings from our experiments.
The Q-Learning agent shows limited learning capability. Its average reward fluctuates around a negative value throughout training and fails to converge toward a consistent optimal policy. This stagnation is expected given the large state space of BoxEnv—tabular Q-learning cannot effectively explore or generalize in this environment. The agent frequently gets stuck in loops, colliding with walls or obstacles, which results in persistently negative rewards.
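For reference, the tabular update rule this agent relies on is sketched below. This is a generic implementation of epsilon-greedy Q-learning, not the exact code in models/, and the action set is a placeholder.

```python
# Generic tabular Q-learning -- illustrative, not the repo's exact code.
from collections import defaultdict
import random

Q = defaultdict(float)          # maps (state, action) -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]          # placeholder action set, e.g. up/down/left/right

def select_action(state):
    # epsilon-greedy over the current table
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```

Because the table holds one entry per (state, action) pair, rarely visited cells in a large grid never receive useful updates, which is exactly the stagnation seen in the reward curve.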
The DQN agent demonstrates clearer learning progress. Starting from negative rewards, its performance steadily improves as training progresses. By leveraging a deep neural network with experience replay, DQN successfully generalizes across unseen states and avoids the pitfalls of tabular Q-learning. The reward curve becomes smoother and trends upward, though evaluation results reveal that it still struggles to consistently solve the environment.
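A minimal sketch of the DQN update with experience replay is shown below, assuming PyTorch. The network sizes, buffer capacity, and hyperparameters are illustrative assumptions, not the values used in this repo.

```python
# Sketch of a DQN training step with experience replay (PyTorch assumed).
# Network architecture, buffer size, and hyperparameters are illustrative only.
import random
from collections import deque

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=5e-4)
replay = deque(maxlen=50_000)   # stores (state, action, reward, next_state, done)
gamma, batch_size = 0.99, 64

def train_step():
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, d = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # bootstrap from the slowly updated target network
        target = r + gamma * (1 - d) * target_net(s2).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```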
The PPO agent performs poorly in this environment. Its reward curve consistently decreases, reflecting the difficulty of sparse reward tasks for on-policy algorithms. PPO fails to gather enough positive experiences to update its policy effectively. Unlike DQN, PPO cannot rely on off-policy replay, which further limits its ability to bootstrap from rare successful trajectories.
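For context, the clipped surrogate objective that PPO optimizes is sketched below; this is the generic loss form (PyTorch assumed), not the repo's exact implementation.

```python
# Generic PPO clipped surrogate loss -- illustrative, not the repo's exact code.
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # probability ratio r_t = pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # maximize the surrogate, so minimize its negation
    return -torch.min(unclipped, clipped).mean()
```

When the advantages are almost always near zero or negative, as in a sparse-reward grid world, this objective gives the policy little signal to improve on.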
The TreeQN algorithm achieves a significantly higher average reward compared to Q-Learning, indicating that it learns a better policy that yields greater cumulative returns per episode. However, Q-Learning displays lower variance and more stable performance with fewer extreme outcomes. Thus, while TreeQN demonstrates superior learning capability and higher overall rewards, Q-Learning remains more consistent. In typical reinforcement learning terms—where maximizing average return is the main objective—TreeQN is considered the better-performing agent.
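Conceptually, TreeQN backs up values through a shallow lookahead tree built with a learned model. The sketch below shows a heavily simplified recursive backup of that kind; `transition_fn`, `reward_fn`, and `value_fn` are placeholders standing in for learned model components, not the repo's actual networks.

```python
# Highly simplified TreeQN-style value backup over a depth-d lookahead tree.
# transition_fn, reward_fn, and value_fn are placeholders for learned model
# components; they are not the repo's actual modules.
def tree_backup(state, depth, actions, transition_fn, reward_fn, value_fn, gamma=0.99):
    """Return the backed-up value of `state` after expanding `depth` levels."""
    if depth == 0:
        return value_fn(state)
    q_values = []
    for a in actions:
        next_state = transition_fn(state, a)   # predicted next (latent) state
        r = reward_fn(state, a)                # predicted immediate reward
        q_values.append(r + gamma * tree_backup(next_state, depth - 1, actions,
                                                transition_fn, reward_fn, value_fn, gamma))
    return max(q_values)                       # max over actions, as in Q-learning
```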
The SAVE algorithm demonstrates strong learning, with the average reward consistently rising from a negative starting point (around −1 to −2) to a final value near 5 to 6. This significant upward trend confirms the agent is successfully learning a better policy over 2000 episodes. However, the performance is marked by high variance, as shown by the wide shaded region.
The incorporation of Value-Guided Search significantly improves the stability and performance of the base DQN agent. Standard DQN, despite its potential for high reward, is highly unstable and inconsistent on this task. The SAVE mechanism provides some stabilization over plain DQN but is outperformed by the explicit Value-Guided Search planning component.
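As a rough illustration of how a learned Q-network can guide search, the sketch below scores actions at a tree node by combining value estimates with a visit-count bonus. This is a generic UCT-style selection rule; the function and its arguments are hypothetical and not the repo's exact planner.

```python
# Generic sketch of value-guided action selection at a search-tree node.
# q_estimate and the visit counts are placeholders; this is a UCT-style rule,
# not the repo's exact MCTS implementation.
import math

def select_search_action(node_visits, child_visits, q_estimate, actions, c_uct=1.0):
    """Pick the action maximizing Q(s, a) plus an exploration bonus."""
    def score(a):
        bonus = c_uct * math.sqrt(math.log(node_visits + 1) / (child_visits[a] + 1))
        return q_estimate[a] + bonus
    return max(actions, key=score)
```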
This project is licensed under the terms described in the LICENSE file.






