This is a mini research project to learn, apply, and demonstrate the use of reinforcement learning to create a controller for a simulated quadrotor navigating a 3D environment. The project uses the MuJoCo physics engine to simulate the quadrotor and trains the controller with Proximal Policy Optimization (PPO), based on the CleanRL implementation.
Before we get into the details, let's take a look at some of the results. Currently, the model is trained with a fixed start and goal location, and the quadrotor is able to navigate to the goal and hover over it.
- Custom Gymnasium environment based on the MuJoCo Skydio X2 model.
- Parallel environments for higher training throughput
- Train using PPO
- Obstacle avoidance based on optimization approach (in progress)
- Formulation and implementation of the Euler-Lagrange quadrotor dynamics in canonical (control-affine) form `x_dot = f(x) + g(x)u`
- Dynamic environment generation during training episode rollouts using MjSpec (see the sketch below)
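A minimal sketch of what per-episode scene generation with MjSpec could look like, assuming a recent MuJoCo release (3.2+) where `MjSpec.from_file` is available. The file path, obstacle count, and placement ranges below are placeholders, not the project's actual values:

```python
import mujoco
import numpy as np

def build_episode_model(base_xml_path, n_obstacles=5, rng=None):
    """Hypothetical helper: load the base quadrotor scene and add randomly
    placed box obstacles, then compile a fresh MjModel for this episode."""
    rng = rng if rng is not None else np.random.default_rng()
    spec = mujoco.MjSpec.from_file(base_xml_path)
    for i in range(n_obstacles):
        body = spec.worldbody.add_body(
            name=f"obstacle_{i}",
            pos=rng.uniform(low=[-3.0, -3.0, 0.5], high=[3.0, 3.0, 2.0]).tolist(),
        )
        body.add_geom(
            type=mujoco.mjtGeom.mjGEOM_BOX,
            size=[0.2, 0.2, 0.5],
            rgba=[0.8, 0.2, 0.2, 1.0],
        )
    return spec.compile()
```

In the environment's `reset()`, the newly compiled model can replace the previous one (along with a fresh `MjData`), so each episode sees a different obstacle layout.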
- The navigation problem is formulated as a finite-horizon (episodic), continuous-time MDP
- The quadrotor state is defined as x = [p, q, v, w], where p is the position, q is the orientation quaternion, v is the linear velocity, and w is the angular velocity.
- Reward shaping with wandb hyperparameter sweeps to determine the optimal reward function
- Termination conditions tuned to support efficient data collection
- Supports start/goal location randomization during training to improve generalization; this requires goal-conditioned training, which is not yet implemented.
- Dynamic generation of obstacles using the MjSpec API. Can generate new obstacle configurations for each episode during training.
- Reset noise to improve generalization
- Wrapped the Gymnasium observation normalization wrapper to save normalization statistics learned during training. During evaluation on a new environment, it is important to load these normalization statistics and not calculate new ones.
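A minimal sketch of how such a wrapper might look, assuming a Gymnasium version where `NormalizeObservation` exposes its running statistics via `obs_rms`. The class name `SaveableNormalizeObservation` and the file format are illustrative placeholders, not the project's actual implementation:

```python
import numpy as np
from gymnasium.wrappers import NormalizeObservation

class SaveableNormalizeObservation(NormalizeObservation):
    """Hypothetical wrapper: persists the running mean/var learned during
    training so evaluation reuses them instead of recomputing from scratch."""

    def save_stats(self, path: str) -> None:
        np.savez(path, mean=self.obs_rms.mean, var=self.obs_rms.var,
                 count=self.obs_rms.count)

    def load_stats(self, path: str) -> None:
        stats = np.load(path)
        self.obs_rms.mean = stats["mean"]
        self.obs_rms.var = stats["var"]
        self.obs_rms.count = float(stats["count"])
```

During evaluation, `load_stats()` is called right after constructing the wrapper; recent Gymnasium versions also expose an `update_running_mean` flag that can freeze the statistics, if available in the installed release.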
- Reinforcement learning algorithms estimate the solution to an unconstrained optimization problem, and as such, do not provide mathematical guarantees for safety
- I've used a Control Barrier Function (CBF) formulation based on this paper to ensure safety. Determining the control input that satisfies the CBF constraints requires solving a Quadratic Program (QP) at each time step.
- The safety layer acts as a compensator on the potentially unsafe RL control input, so it may modify the RL action before it is applied. If the policy and value networks are updated using the original (unmodified) actions while the environment executes the modified ones, the mismatch can make the learning process difficult.
- The authors recommend using a differentiable QP solver. To that end, I use the qpth differentiable QP solver library to define a QP layer and integrate it with the actor network (a sketch is shown after this list).
- The CBF formulation should make the training process independent of obstacle locations. This is a work in progress.
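A sketch of what the QP layer could look like with qpth, assuming the CBF constraints have already been assembled into the affine form G u <= h at the current state. The function name and shapes are illustrative, not the exact implementation:

```python
import torch
from qpth.qp import QPFunction

def cbf_qp_layer(u_rl, G, h):
    """Hypothetical CBF-QP safety layer (sketch).
    Projects the RL action u_rl onto the safe set {u : G u <= h}, where G and h
    encode the CBF constraints at the current state, and stays differentiable.
    u_rl: (batch, m) raw actions from the actor network
    G:    (batch, k, m) constraint matrix,  h: (batch, k) constraint bounds
    """
    batch, m = u_rl.shape
    # Objective: min_u 0.5 u^T Q u + p^T u with Q = 2I, p = -2 u_rl,
    # i.e. minimize ||u - u_rl||^2 up to a constant (minimal modification).
    Q = 2.0 * torch.eye(m, device=u_rl.device).expand(batch, m, m)
    p = -2.0 * u_rl
    e = torch.empty(0, device=u_rl.device)  # no equality constraints
    # Note: qpth tends to be more robust in double precision.
    u_safe = QPFunction(verbose=False)(Q, p, G, h, e, e)
    return u_safe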
- Quadrotors are underactuated systems with nonlinear dynamics. They have 6 degrees of freedom (3 position, 3 orientation), and only 4 direct control inputs. This makes control design challenging.
- I've implemented the dynamics in canonical form in order to compute the CBF constraints. The implementation supports batched inputs for parallel environments (see the sketch below).
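For illustration, a minimal batched sketch of the control-affine terms f(x) and g(x), using scalar-first quaternions. The mass and inertia values are placeholders rather than the Skydio X2 parameters, and the actual implementation may differ:

```python
import torch

def quat_mul(q1, q2):
    # Hamilton product of scalar-first quaternions, batched (batch, 4)
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2], dim=-1)

def body_z_axis(q):
    # Third column of the rotation matrix R(q): the body z-axis in the world frame
    w, x, y, z = q.unbind(-1)
    return torch.stack([2 * (x * z + w * y),
                        2 * (y * z - w * x),
                        1 - 2 * (x * x + y * y)], dim=-1)

def quad_f_g(x, mass=1.0, J=None, grav=9.81):
    """Control-affine quadrotor dynamics x_dot = f(x) + g(x) u, batched.
    x: (batch, 13) = [p(3), q(4), v(3), w(3)]; u = [thrust, tau_x, tau_y, tau_z].
    mass and J are placeholder values, not the Skydio X2 parameters."""
    if J is None:
        J = torch.diag(torch.tensor([0.01, 0.01, 0.02]))
    J_inv = torch.inverse(J)
    batch = x.shape[0]
    q, v, w = x[:, 3:7], x[:, 7:10], x[:, 10:13]

    # Drift term f(x): kinematics, gravity, and the gyroscopic torque (u = 0)
    f = torch.zeros(batch, 13)
    f[:, 0:3] = v                                              # p_dot = v
    omega = torch.cat([torch.zeros(batch, 1), w], dim=1)
    f[:, 3:7] = 0.5 * quat_mul(q, omega)                       # q_dot = 0.5 q (x) [0, w]
    f[:, 9] = -grav                                            # gravity on v_z
    f[:, 10:13] = -torch.cross(w, w @ J.T, dim=1) @ J_inv.T    # -J^-1 (w x Jw)

    # Control matrix g(x): thrust along the body z-axis, torques through J^-1
    g = torch.zeros(batch, 13, 4)
    g[:, 7:10, 0] = body_z_axis(q) / mass
    g[:, 10:13, 1:4] = J_inv.expand(batch, 3, 3)
    return f, g
```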