Disclaimer: Udacity provided some starter code, but the implementations of these concepts are my own. Please contact [email protected] with any questions.
Note: Please refer to the instructions on how to download the dependencies for these projects here.
https://confirm.udacity.com/XLGDCKNX
Deep reinforcement learning (deep RL) is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms can take in very large inputs (e.g. every pixel rendered to the screen in a video game) and decide what actions to perform to optimize an objective (e.g. maximizing the game score). Deep reinforcement learning has been used for a diverse set of applications including, but not limited to, robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare [1].
The following projects focus on model-free reinforcement learning, where the agent does not use a model of how its current action will affect its next state. In more technical terms, this family of algorithms does not use the transition probability distribution associated with the Markov decision process.
In this segment, I was introduced to concepts such as Markov decision processes, Monte Carlo methods, temporal-difference methods and RL in continuous spaces. I had the chance to solve some OpenAI Gym environments, such as the Cliff Walking and Taxi-v2 environments.
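As a flavour of the temporal-difference methods covered here, the sketch below shows one-step Q-learning on Taxi-v2 using the classic Gym API. The hyperparameters and episode count are illustrative, not the values used in the exercises.

```python
import gym
import numpy as np

# Minimal sketch: tabular one-step Q-learning (a temporal-difference method) on Taxi-v2.
env = gym.make("Taxi-v2")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1  # illustrative hyperparameters

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < eps:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        next_state, reward, done, _ = env.step(action)
        # TD(0) update: move Q(s, a) toward the bootstrapped target
        td_target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state
```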
This project focuses on the use of Deep Q-Learning (DQN) to train an agent to collect yellow bananas while avoiding the purple ones. More information on the training algorithm and project instructions can be found here.
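The core of DQN is a learning step that regresses a local Q-network toward a bootstrapped target computed by a slowly updated target network. The sketch below assumes the Banana environment's sizes (37-dimensional state, 4 actions) and simple fully connected networks; names and hyperparameters are illustrative rather than the repository's own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Simple fully connected Q-network (illustrative architecture)."""
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size))

    def forward(self, state):
        return self.net(state)

def dqn_update(q_local, q_target, optimizer, batch, gamma=0.99, tau=1e-3):
    """One learning step from a minibatch sampled from the replay buffer."""
    states, actions, rewards, next_states, dones = batch  # actions: LongTensor (N, 1)
    # Bootstrapped target uses the target network to stabilise learning
    with torch.no_grad():
        q_next = q_target(next_states).max(dim=1, keepdim=True)[0]
        targets = rewards + gamma * q_next * (1 - dones)
    q_expected = q_local(states).gather(1, actions)
    loss = F.mse_loss(q_expected, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft-update the target network toward the local network
    for t_param, l_param in zip(q_target.parameters(), q_local.parameters()):
        t_param.data.copy_(tau * l_param.data + (1 - tau) * t_param.data)
```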
This project focuses on the use of Deep Deterministic Policy Gradient (DDPG) to train a robotic arm to keep its end effector on a moving target position. More information on the training algorithm and project instructions can be found here.
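DDPG pairs a deterministic actor with a Q-value critic and trains both from replayed experience. The sketch below shows one learning step under the assumption that the actor/critic networks, their target copies, and their optimizers already exist; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, actor_target, critic, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=1e-3):
    states, actions, rewards, next_states, dones = batch

    # Critic update: regress Q(s, a) toward the bootstrapped target
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets = rewards + gamma * critic_target(next_states, next_actions) * (1 - dones)
    critic_loss = F.mse_loss(critic(states, actions), q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the critic's estimate of Q(s, actor(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update both target networks
    for target, local in ((actor_target, actor), (critic_target, critic)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(tau * l_param.data + (1 - tau) * t_param.data)
```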
In addition, the Distributed Distributional Deterministic Policy Gradients (D4PG) method was applied in a multi-arm simulation environment. D4PG combines distributional value estimation, n-step returns, prioritized experience replay (PER) and distributed K-actor exploration for fast and stable learning. The PER component is omitted here, as the original paper suggests it does not improve training speed or stability.
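One of D4PG's ingredients, the n-step return, is easy to isolate: the critic target is built from n consecutive rewards before bootstrapping from the target networks. The helper below is a minimal sketch with illustrative names; it does not cover the distributional (categorical) value projection.

```python
def n_step_return(rewards, gamma=0.99, n=5):
    """Discounted sum of up to n rewards, plus the discount to apply to the
    bootstrapped value at the n-th next state.

    The caller completes the target as:
        G + (gamma ** n) * Q_target(s_{t+n}, actor_target(s_{t+n}))
    """
    n = min(n, len(rewards))
    g = sum((gamma ** i) * r for i, r in enumerate(rewards[:n]))
    return g, gamma ** n
```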
Finally, the Proximal Policy Optimization (PPO) method was applied in a multi-crawler simulation environment. This implementation of PPO combines a clipped surrogate actor loss, a critic loss and an entropy loss. Generalized Advantage Estimation (GAE) was used to stabilize training by balancing bias and variance. Twelve agents learn concurrently in the same environment, sharing experiences and network weights to achieve stable training.
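The combined PPO objective can be sketched as follows, assuming the advantages and returns have already been computed (e.g. with GAE) and that the policy exposes log-probabilities and an entropy term. Function and coefficient names are illustrative, not the repository's own.

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    # Probability ratio between the current policy and the policy that collected the data
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective keeps each policy update within a trust region
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    actor_loss = -surrogate.mean()
    # Critic loss: regress value estimates toward the empirical returns
    critic_loss = (returns - values).pow(2).mean()
    # Entropy bonus encourages exploration
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy.mean()
```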
This project focuses on the use of Multi-Agent Deep Deterministic Policy Gradient (MADDPG) to train two tennis rackets to cooperate with each other, keeping the ball in the air for as long as possible. More information on the training algorithm and project instructions can be found here.
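The key MADDPG idea for this two-agent task is that each agent keeps its own decentralized actor while its critic is centralized, conditioning on the observations and actions of both agents. The sketch below shows only the centralized critic update for one agent; the class layout and names are illustrative assumptions, not the repository's structure.

```python
import torch
import torch.nn.functional as F

def maddpg_critic_update(agent_idx, agents, batch, gamma=0.99):
    obs, actions, rewards, next_obs, dones = batch  # lists indexed by agent
    agent = agents[agent_idx]

    # Target actions for every agent, produced by their target actors
    with torch.no_grad():
        next_actions = [a.actor_target(next_obs[i]) for i, a in enumerate(agents)]
        q_next = agent.critic_target(torch.cat(next_obs, dim=1),
                                     torch.cat(next_actions, dim=1))
        q_target = rewards[agent_idx] + gamma * q_next * (1 - dones[agent_idx])

    # Centralized critic evaluates the joint observation-action pair
    q_expected = agent.critic(torch.cat(obs, dim=1), torch.cat(actions, dim=1))
    critic_loss = F.mse_loss(q_expected, q_target)
    agent.critic_opt.zero_grad()
    critic_loss.backward()
    agent.critic_opt.step()
```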
Additionally, the Proximal Policy Optimization (PPO) method was applied in a soccer simulation environment, with each team comprising a striker and a goalie. This implementation of PPO combines a clipped surrogate actor loss, a critic loss and an entropy loss. Generalized Advantage Estimation (GAE) was used to stabilize training by balancing bias and variance. By the end of training, both agents had specialized in their individual roles, winning 100% of matches against opponents taking random actions.
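The GAE step used in both PPO projects can be sketched as a backward pass over a rollout, accumulating discounted TD residuals; the lambda parameter trades bias against variance. Tensor shapes and names below are illustrative assumptions.

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, lam=0.95):
    """rewards, values, dones: per-timestep tensors for one rollout."""
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at step t
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        # Exponentially weighted sum of residuals balances bias and variance
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values
    return advantages, returns
```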