An RL-based robust controller for a quadrotor, using Proximal Policy Optimization (PPO). This project is part of the course XAI 601 Applications in Deep Learning at Korea University.
This repository contains a MATLAB-based simulation of a quadrotor controlled by a PPO agent.
Main Contributors: Soowon Kim ([email protected])
Affiliation: Korea University
Slides and Report: The slides and the report contain explanations of the key ideas, including the selection of the states, actions, and rewards, as well as the particular choices of the networks.
Tested on MATLAB R2021a. The Reinforcement Learning Toolbox is required.
The observation (state) consists of the errors between the reference trajectory and the current position, velocity, and acceleration. In addition, the state contains the orientation (quaternion) and the angular velocity of the agent. The state can then be expressed as follows:
numObs = 16;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'Quad States';
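The 16 observations therefore break down as 3 position-error, 3 velocity-error, and 3 acceleration-error components, plus a 4-element quaternion and 3 angular-velocity components. A sketch of how such a state vector could be assembled is shown below; the variable names other than pos_error are illustrative, not necessarily those used in myStep.m.
% Illustrative assembly of the 16-element observation vector
state = [pos_error;   % 3x1 position error w.r.t. the reference trajectory
         vel_error;   % 3x1 velocity error
         acc_error;   % 3x1 acceleration error
         quat;        % 4x1 orientation quaternion
         omega];      % 3x1 body angular velocity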
The action consists of the four thrusts produced by the rotors, in newtons.
numAct = 4;
actInfo = rlNumericSpec([numAct 1]);
actInfo.Name = 'Quad Action';
actInfo.LowerLimit = [0,0,0,0]';
The reward functions are inspired by Peng's work. You can modify them in myStep.m.
% Rewards: each term has the form exp(-(error/scale)^2), equal to 1 at zero error
r_pos = exp(-(1/0.5*pos_l2).^2);           % position error, scale 0.5
r_vel = exp(-(1/1*vel_l2).^2);             % velocity error, scale 1
r_acc = exp(-(1/1*acc_l2).^2);             % acceleration error, scale 1
r_yaw = exp(-(1/(5/180*pi)*yaw_error).^2); % yaw error, scale 5 degrees
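The individual terms are combined into a scalar reward inside myStep.m; the sketch below uses purely illustrative weights, not the repository's actual values.
% Hypothetical weighting for illustration only; see myStep.m for the actual combination
reward = sum([0.5 0.2 0.2 0.1] .* [r_pos r_vel r_acc r_yaw]);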
In the declaration of the actor, you can choose either a single or a dual tanh activation for the mean output of the actions.
% Define environment
env = rlFunctionEnv(obsInfo,actInfo,'myStep','myReset');
% Actor network
actor = actorNetwork(obsInfo, actInfo,'single');
% Critic network
critic = criticNetwork(obsInfo);
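actorNetwork and criticNetwork are helper functions defined in this repository. As a rough illustration of what a single-tanh mean output can look like for a continuous-action stochastic actor, a mean path might be built as sketched below; only the mean path is shown, and the layer sizes and maxThrust are assumptions, not the repository's actual architecture.
% Illustrative mean path ending in a single tanh, rescaled to [0, maxThrust]
maxThrust = 5;  % hypothetical per-rotor thrust limit in newtons
meanPath = [
    featureInputLayer(numObs,'Name','obs')
    fullyConnectedLayer(128,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(numAct,'Name','fc_mean')
    tanhLayer('Name','tanh_mean')
    scalingLayer('Name','mean','Scale',maxThrust/2,'Bias',maxThrust/2)];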
Now you can create the PPO agent. You can experiment with different parameters, but this setup worked for me.
agentOpts = rlPPOAgentOptions('SampleTime',0.01,...
    'ExperienceHorizon',2^11,...
    'MiniBatchSize',2^11,...
    'UseDeterministicExploitation',false,...
    'EntropyLossWeight',0.4);
agent = rlPPOAgent(actor,critic,agentOpts);
To train the agent, simply use train.
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000000, ...
    'MaxStepsPerEpisode',1000000, ...
    'Verbose',false, ...
    'Plots','training-progress',...
    'ScoreAveragingWindowLength',100,...
    'StopTrainingCriteria',"AverageReward",...
    'UseParallel',true,...
    'StopTrainingValue',10000000);
trainingStats = train(agent,env,trainOpts);
To test the agent, run the Test section in PPO.m. You can create your own trajectory by modifying the myReset.m function and declaring your own path and tau_vec. You may also want to modify some of the termination conditions used during training. For example, the following block in myStep.m generates a new random trajectory (and a new random external force) whenever the current reference trajectory is complete:
if Time >= total_time+1 && ~IsDone
    % IsDone = true;
    State = State(:,end);
    State(1:3) = pos_error;
    Time = 0;
    tau_vec = 9;
    vel_min = 1; vel_max = 5;
    vel = (vel_max-vel_min)*rand + vel_min;          % random average speed
    dist = vel*tau_vec;
    target = dist * random_unit_vector;              % random target point
    path = [zeros(1,3); target'];
    Traj = MinimumSnapTrajectory(tau_vec, path);     % new reference trajectory
    Fext = 1 * rand * random_unit_vector;            % new random external disturbance
end
For testing, comment out all of the statements inside this if block and enable IsDone = true instead.
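In other words, the test-time version of this condition would reduce to something like:
if Time >= total_time+1 && ~IsDone
    IsDone = true;   % end the episode once the reference trajectory is complete
end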
path consists of the x, y, z coordinates of the waypoints. The trajectory always starts and ends with zero velocity, acceleration, jerk, and snap. You then have to declare the time intervals between the waypoints. For example, you can set path = [0,0,0; 1,1,1];. Since this path only has a start and an end point, a single time interval such as tau_vec = 10; is enough. A minimal example is sketched below; the resulting trajectory would look something like this.
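A minimal trajectory definition in myReset.m could then look like the following sketch (MinimumSnapTrajectory is the generator already used in myStep.m):
path = [0,0,0; 1,1,1];                        % waypoints: start at the origin, end at (1,1,1)
tau_vec = 10;                                 % 10 s for the single segment
Traj = MinimumSnapTrajectory(tau_vec, path);  % minimum-snap reference trajectory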
You can also use a time-allocation technique to set the time intervals tau_vec automatically, using timeAllocation with a γ parameter. For example,
path = [0,0,0;1,1,0;2,0,0;1,-1,0;0,0,0;-1,1,0;-2,0,0;-1,-1,0;0,0,0];
tau_vec = timeAllocation(path, 100)';
The greater γ is, the higher the time penalty, resulting in a faster trajectory.
To try a trained model, you can load 0615_FM.mat from the results folder.
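For example, from the repository root (the variable names stored in the .mat file depend on how the agent was saved):
load('results/0615_FM.mat');   % restores the saved agent and training results into the workspace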
To train an agent to perform aerobatic maneuvers, declare a trajectory using loopTheLoop, splitS, or canopyRoll in myStep.m. For example,
[tau_vec, path] = splitS();
Then modify the reward functions so that the body x-axis follows the direction of the velocity and the body z-axis follows the direction of the acceleration of the reference trajectory. For example,
% Angles between the body axes and the reference velocity/acceleration directions
x_cos = acos(getCosineSimilarity(xb,desired_state.vel));
z_cos = acos(getCosineSimilarity(zb,desired_state.acc));
r_xb = exp(-(1/(30/180*pi)*x_cos).^2);   % body x-axis alignment, scale 30 degrees
r_zb = exp(-(1/(30/180*pi)*z_cos).^2);   % body z-axis alignment, scale 30 degrees
rewards = [0.6 0.1 0.1 0.1 0.1] .* [r_pos r_vel r_acc r_xb r_zb];
Each aerobatic maneuver has its own specific shape of reward function.
PPO agent trained with random disturbances. Below, the agent is demonstrated with a sinusoidal vertical force applied.
PPO agent trained to perform Inside loop
PPO agent trained to perform Split S
PPO agent trained to perform Canopy roll