For this project, the first task was to generate a path for a rover to follow using either dynamic programming or reinforcement learning. To choose between the two, both methods were implemented independently to generate paths for the robot to track, and the results were compared to determine which method would be used to complete the project.
Regardless of the chosen method, the generated path was required to accomplish the following items:
Once a smooth trajectory had been achieved, an observer was designed and tested; the resulting control was also analyzed to determine whether it is optimal.
As mentioned in the introduction, two methods of path generation can be used for this task: dynamic programming (DP) and reinforcement learning (RL). While both methods are built on Markov decision processes (MDPs), they are fundamentally different in how the path algorithm is determined from the robot's environment [1]. Both approaches were initially applied to the task so they could be directly compared, but the reinforcement learning algorithm was unable to find a path through the environment in a reasonable amount of time once it contained the desired number of obstacles. For this reason, dynamic programming was used to solve the task.
Prior to testing the path-generating algorithms, the initial environment, or "course," that the robot is to traverse had to be created. As per the project requirements, this meant constructing a 10x10 grid with several obstacles for the robot to avoid. Obstacles were placed so that the robot would have to navigate around them to reach the end point, in order to verify that the chosen algorithm worked properly.
The environment that was built can be seen in the image below; it consists of seven obstacles of varying size, which were initialized on the grid using MATLAB.
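For reference, a minimal MATLAB sketch of how such a 10x10 course can be initialized is shown below; the obstacle locations and variable names (`grid_map`, `start_pos`, `goal_pos`) are illustrative placeholders, not the exact layout used in the project.

```matlab
% Illustrative 10x10 grid setup; obstacle locations are placeholders,
% not the exact layout used in the project.
N = 10;                          % grid dimension
grid_map = zeros(N, N);          % 0 = free cell, 1 = obstacle

% Example obstacle blocks (row and column ranges chosen arbitrarily)
grid_map(3:4, 2:3) = 1;
grid_map(6:8, 5)   = 1;
grid_map(2, 7:9)   = 1;

start_pos = [1, 1];              % initial position of the rover
goal_pos  = [9, 9];              % end point of the trajectory

% Visualize the course: obstacles dark, free space light
imagesc(grid_map); axis equal tight; colormap(flipud(gray));
hold on;
plot(start_pos(2), start_pos(1), 'go', goal_pos(2), goal_pos(1), 'rx');
```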
The algorithm used to generate the path for the rover follows the concepts of dynamic programming; the MATLAB implementation was based on code authored by Dr. Vivek Yadav [2] and modified to allow for a larger grid, as well as the appropriate initial position [1,1] and end position [9,9]. The initial position was corrected by setting x_agent and y_agent to [1] and [1], respectively, and the end point was set by correcting x_goal and y_goal to [9] and [9], respectively. The larger grid also required modifying the function check_ind.m so that the algorithm iterates over all index points, not just those set by the initial code; the initial values would have stopped the iteration after 41 steps, since the original grid was 4x4. The new 10x10 grid required 101 iteration steps to properly determine the path. Finally, the obstacle penalty values (labeled as Act in the program) had to be increased from their initial values to compensate for the larger grid and properly generate the obstacles.
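Dr. Yadav's code is not reproduced here, but the sketch below illustrates the same cost-to-go (value iteration) idea on the hypothetical grid defined above; the penalty value, step cost, and variable names are assumptions rather than the project's actual values.

```matlab
% Simplified cost-to-go (value iteration) sketch using the hypothetical
% grid_map, N, and goal_pos defined above; not the original project code.
obstacle_penalty = 1000;          % analogous to increasing Act for the larger grid
step_cost = 1;                    % cost of moving to an adjacent cell
V = inf(N, N);                    % cost-to-go for every cell
V(goal_pos(1), goal_pos(2)) = 0;  % the goal has zero cost-to-go

moves = [1 0; -1 0; 0 1; 0 -1];   % the four grid actions
for iter = 1:101                  % 101 sweeps, matching the 10x10 grid discussed above
    for r = 1:N
        for c = 1:N
            if grid_map(r, c) == 1
                V(r, c) = obstacle_penalty;   % obstacles keep a high cost-to-go
                continue;
            end
            for m = 1:4
                rn = r + moves(m, 1);
                cn = c + moves(m, 2);
                if rn >= 1 && rn <= N && cn >= 1 && cn <= N
                    V(r, c) = min(V(r, c), step_cost + V(rn, cn));
                end
            end
        end
    end
end
```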
Applying these changes produced the cost-to-go plot shown below; all the obstacles have a high cost-to-go so that the algorithm avoids them as it generates the path. As the robot approaches the goal, the associated cost-to-go decreases until the goal is reached.
Once the cost-to-go of each index was generated, the algorithm used these stored values to determine the path the rover should follow in order to guarantee it avoids the obstacles. The path generated by the dynamic programming algorithm can be seen below. Due to the way the algorithm indexes the waypoints, the actual path begins at [1.1, 1] and ends at [9.1, 9].
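One common way to recover the waypoints from such a cost-to-go table is to repeatedly step to the lowest-cost neighboring cell; the sketch below (reusing the hypothetical variables from the previous sketches) illustrates this idea and does not reproduce the project's exact indexing scheme.

```matlab
% Greedy descent over the cost-to-go table V (reusing the hypothetical
% variables above); the project code indexes waypoints slightly differently,
% which is why its reported path starts at [1.1, 1] rather than [1, 1].
path = start_pos;                 % list of visited cells, starting at [1, 1]
current = start_pos;
while ~isequal(current, goal_pos)
    best_cost = inf;
    best_next = current;
    for m = 1:4
        cand = current + moves(m, :);
        if all(cand >= 1) && all(cand <= N) && V(cand(1), cand(2)) < best_cost
            best_cost = V(cand(1), cand(2));
            best_next = cand;
        end
    end
    current = best_next;
    path = [path; current];       % append the next waypoint
end
```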
While the generated series of waypoints does achieve the objective of avoiding obstacles and getting the rover from the start point to the end point, the path contains multiple sharp turns. In practice, these sudden directional changes would be difficult for certain vehicles, such as UAVs, to physically execute. In light of this, the generated path must be smoothed so that all turns follow a gentle curvature. This is accomplished here using gradient descent, along with a buffer around the obstacles to ensure the rover still avoids them when it follows the new path.
The trajectory generated for the rover to follow can be smoothed by performing an optimization which minimizes the following general equation:

$$\min_{\hat{XY}} \; \sum_{i} \left( XY_{i} - \hat{XY}_{i} \right)^2 + \alpha \sum_{i} \left( \hat{XY}_{i+1} - \hat{XY}_{i} \right)^2$$
In this equation, ${XY}_i$ represents the original $XY$ values at a certain index, $i$, and $\hat{XY}_i$ represents the new $XY$ values for the smoothed trajectory at the same index point, $i$. The first term is used to minimize the distance between the original trajectory generated by dynamic programming and the optimized (smoothed) trajectory. The second term is weighted by a factor, $\alpha$, and is used to penalize sharp turns in the optimized trajectory. For gradient descent, the general equation is converted to the form:
$$\hat{XY}_{i} = \hat{XY}_{i} + \beta (XY_{i}-\hat{XY}_{i}) + \gamma (\hat{XY}_{i-1} - 2 \hat{XY}_{i} + \hat{XY}_{i+1})$$
where $\beta = 0.5$ and $\gamma = 0.1$.
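A minimal sketch of this gradient-descent update is given below, assuming the DP waypoints are stored in an N-by-2 array `XY`; the iteration limit and convergence tolerance are assumptions, and the obstacle-buffer constraints discussed next are omitted here.

```matlab
% Gradient-descent smoothing sketch; XY (the N-by-2 DP waypoint list), the
% iteration limit, and the stopping tolerance are assumptions for illustration.
beta  = 0.5;      % pull toward the original waypoints
gamma = 0.1;      % smoothness (second-difference) weight
XY_hat = XY;      % initialize the smoothed path at the original path

for iter = 1:1000
    XY_prev = XY_hat;
    for i = 2:size(XY, 1) - 1    % interior points only; endpoints stay fixed
        XY_hat(i, :) = XY_hat(i, :) ...
            + beta  * (XY(i, :) - XY_hat(i, :)) ...
            + gamma * (XY_hat(i-1, :) - 2*XY_hat(i, :) + XY_hat(i+1, :));
    end
    if norm(XY_hat - XY_prev) < 1e-6   % stop once the update stalls
        break;
    end
end
```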
The new path generated for this particular task is subject to boundary constraints; the constraints here are the start point, the end point, and the obstacle buffers. The starting point is defined as $\hat{XY}_1$, which is equal to the start point of the original trajectory, ${XY}_1 = [1,1]$. Similarly, the ending point of the smoothed trajectory, defined as $\hat{XY}_N$, must coincide with the end point of the original trajectory, ${XY}_N = [9,9]$. All remaining waypoints of the smoothed trajectory (in the range $i = 2:N-1$, where $N$ is the number of points in the original trajectory) are determined using the gradient descent equation and the constraints set by the obstacle buffers. Obstacle buffers are analyzed between 0.1 and 0.3 units (with a step of 0.1 units), and are required because nothing else prevents the optimized path from overlapping the defined obstacles.
Prior to setting the buffer constraint on the path, the gradient descent algorithm was tested to ensure that it worked as desired. As previously mentioned, this used $\beta = 0.5$ and $\gamma = 0.1$, corresponding to a smoothing weight of $\alpha = 0.1$. The result can be seen below, where the red line is the smoothed trajectory without buffers and the blue '*' points are the original trajectory. In addition to the overall figure, a second image is provided below showing a magnified view of a turning point in the trajectory.
Here, the trajectory is still fairly angular, which is due to the value of the smoothing constant, $\alpha$. In this scenario $\alpha = 0.1$, so the smoothing is not that dramatic. In the figure below, $\alpha = 0.5$ to demonstrate the effect that varying this parameter has on the final path.
![title](Final Images/alpha_05.png)

As previously mentioned, "buffers" are required here to ensure that the smoothed path does not inadvertently collide with any of the defined obstacles. The buffers were implemented in the program by defining new, virtual obstacles that the initial trajectory must avoid, so the buffers are used when determining the base waypoints (prior to path smoothing). Following this procedure with a correct buffer size guarantees that the smoothed trajectory will not collide with the actual obstacles.
The buffer size was varied from 0.1 units to 0.3 units, with a step size of 0.1 units, to determine which size would produce the best results. The ideal size must ensure that the path does not cross obstacles and that the original path does not change significantly. The results of this analysis can be seen in the figures below, along with the value plot associated with each buffer size.
When the buffer size was increased to 0.2, the path the rover was tracking changed completely. As the buffer size grew, the space between the obstacles shrank until there were no longer pathways for the rover to travel through. This is not ideal, so a buffer size of 0.1 units was chosen, which provides full obstacle clearance while closely following the original path.
Using the knowledge that the maximum velocity of the rover is 1.5 units/second, the smoothed path was converted into a trajectory with respect to time. Since the distance traveled between each pair of points is known, the velocity can be used to determine the time it takes to travel between each waypoint. Once these travel times were determined, spline interpolation was used to space the waypoints equidistantly along the trajectory. The results of this conversion are shown in the plot following the sketch below.
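The sketch below illustrates this conversion, assuming the smoothed waypoints are stored in `XY_hat` and that the rover travels at its maximum speed along the whole path; the resampling interval and variable names are placeholders.

```matlab
% Time parameterization of the smoothed path XY_hat, assuming travel at the
% maximum speed of 1.5 units/s; dt and the variable names are placeholders.
v_max   = 1.5;
seg_len = sqrt(sum(diff(XY_hat).^2, 2));     % distance between consecutive waypoints
t_way   = [0; cumsum(seg_len / v_max)];      % arrival time at each waypoint

dt     = 0.01;                               % resampling interval (assumed)
t_traj = (0:dt:t_way(end))';
x_traj = spline(t_way, XY_hat(:, 1), t_traj);   % spline interpolation in time
y_traj = spline(t_way, XY_hat(:, 2), t_traj);

plot(t_traj, x_traj, t_traj, y_traj);
xlabel('time (s)'); ylabel('position');
```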
![title](Final Images/trajectory_time1.png)

Standard convention for controlling a system uses the system dynamics equation $\dot{X} = AX + Bu$, where $X$ is the vector of system states, $A$ is the state matrix, $B$ is the input matrix, and $u$ is the vector of input variables. The measurement equation, $y = CX$, accompanies the dynamics equation, where $y$ is the vector of measurements resulting from $X$, and $C$ is the output matrix.
For this system, the actual states are not known, which poses a significant problem for designing a controller to track the desired path for the rover. Standard pole placement techniques require knowledge of the system states, which are not provided here. The only known outputs are the position coordinates, $x$ and $y$, so in order to actually track the desired path, the separation principle must be implemented.
To apply the separation principle, it is first necessary to determine the $A$, $B$, and $C$ matrices, which can be done by assuming the rover behaves as a point mass so that the overall governing equation can be described as:
$$F = ma, \ddot{x}=u_x, \ddot{y}=u_y$$
Here the total acceleration, $a$, comprises two components: the acceleration in the x-direction, $a_x = \ddot{x} = u_x$, and the acceleration in the y-direction, $a_y = \ddot{y} = u_y$. These equations, together with the overall governing equation of motion, can be used to define the state-space representation of the system as follows:
Let:

$$q_1 = x, \quad q_2 = y, \quad q_3 = \dot{x}, \quad q_4 = \dot{y}$$

System Dynamics:

$$\dot{q} =
\left[ \begin{array}{cccc}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\ \end{array} \right]
\left[ \begin{array}{c}
q_1 \\ q_2 \\ q_3 \\ q_4 \\ \end{array} \right]
+
\left[ \begin{array}{cc}
0 & 0 \\
0 & 0 \\
1 & 0 \\
0 & 1 \\ \end{array} \right]
\left[ \begin{array}{c}
u_x \\ u_y \\ \end{array} \right]$$

Measurements:

$$y =
\left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \end{array} \right]
\left[ \begin{array}{c} q_1 \\ q_2 \\ q_3 \\ q_4 \\ \end{array} \right]$$
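These matrices can be entered directly in MATLAB; the controllability and observability checks below are a sanity check added for illustration rather than a step taken from the original report.

```matlab
% Point-mass (double-integrator) model of the rover in state-space form.
A = [0 0 1 0;
     0 0 0 1;
     0 0 0 0;
     0 0 0 0];
B = [0 0;
     0 0;
     1 0;
     0 1];
C = [1 0 0 0;
     0 1 0 0];

% Sanity checks added for illustration: full rank (4) means the system is
% controllable and observable, so the LQR and observer designs below are well posed.
rank(ctrb(A, B))
rank(obsv(A, C))
```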
In order to ensure the system dynamics processed by the controller do not change before the observer's estimates converge to the true state values, the eigenvalues of $A-LC$, where $L$ is the observer gain, must be more negative than the eigenvalues of $A-BK$, where $K$ is the controller gain. Therefore, the controller gain must be determined before the observer gain. The gain matrix $K$ can be determined by picking poles which force all eigenvalues of $A-BK$ to have negative real parts, but this does not guarantee optimal control. Optimal control can only be achieved when an optimal gain matrix is used, which is why LQR was used here to determine $K$.
Using LQR requires choosing two weighting matrices, $Q$ and $R$, which define the relative weighting between state and control costs. $R$ penalizes the control effort and $Q$ penalizes the state (trajectory) error, so choosing $norm(Q) >> norm(R)$ will penalize trajectory error much more than control effort, and vice versa. The $Q$ and $R$ matrices for this system were selected based on this property of LQR controllers, due to the desire to follow the trajectory as closely as possible. For this system, $Q$ and $R$ were defined as follows:
$$Q = \left[ \begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\ \end{array} \right],
\quad
R = \left[ \begin{array}{cc} 0.01 & 0 \\ 0 & 0.01 \\ \end{array} \right]$$
Selection of the $R$ matrix was not arbitrary and required trial and error; initially $R$ was selected with much smaller values, but this resulted in a very high controller gain matrix, so the values of $R$ were increased to keep the gain matrix in a reasonable range. Increasing the values on the diagonal of the $Q$ matrix produced the same result, so while $norm(Q)$ is not as much greater than $norm(R)$ as it could be, this choice produces a viable gain matrix and the trajectory error is still weighted more heavily than the control error. Using these values, the following gain matrix was produced:
$$K_{optimal} = \left[ \begin{array}{cccc} 10.00 & 0.00 & 10.95 & 0.00 \\ 0.00 & 10.00 & 0.00 & 10.95 \\ \end{array} \right]$$
To check the viability of this gain matrix, the eigenvalues of $A-BK_{optimal}$ were calculated, resulting in the values shown below. All of the eigenvalues are negative, so this is a suitable controller gain matrix.
$$\lambda(A-BK_{optimal}) = \left[ \begin{array}{c}
-1.0051 \\
-9.9494 \\
-1.0051 \\
-9.9494 \\ \end{array} \right]$$
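The gain above can be reproduced with MATLAB's `lqr` function using the $Q$ and $R$ matrices given earlier; the snippet below is a sketch of that calculation.

```matlab
% LQR design with the Q and R matrices chosen above; lqr returns the optimal gain.
Q = eye(4);
R = 0.01 * eye(2);
K_optimal = lqr(A, B, Q, R);    % should reproduce the gain matrix shown above
eig(A - B*K_optimal)            % closed-loop eigenvalues; all real parts negative
```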
Once the eigenvalues of the controller gain matrix and the system dynamics are known, the observer gain can be determined by appropriate pole placement. As mentioned previously, the eigenvalues of $A-LC$ must be more negative than the eigenvalues of $A-BK_{optimal}$, so poles more negative than the controller eigenvalues were applied to the system. Unlike the controller gain, the values of the observer gain can be comparatively large; this is because the observer does not depend on physical components, such as actuators, which saturate as the gain becomes too high. A high observer gain also allows rapid convergence to the system states, but it leads to a larger initial error from the observer (peaking), so care must be taken with this approach. This procedure produced the following observer gain matrix:
$$L = \left[ \begin{array}{cc} 35.00 & 0.00 \\ 0.00 & 31.00 \\ 306.00 & 0.00 \\ 0.00 & 240.00 \\ \end{array} \right]$$
To check that the observer gain produces eigenvalues more negative than those produced by the controller gain matrix, the eigenvalues of $A-LC$ were calculated, resulting in the values shown below. These eigenvalues are more negative than those from the controller, so the observer will converge to the states faster than the system dynamics change.
$$\lambda(A-LC) = \left[ \begin{array}{c}
-18.00 \\
-17.00 \\
-16.00 \\
-15.00 \\ \end{array} \right]$$
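A sketch of computing such an observer gain with MATLAB's `place` function on the dual system is shown below; note that `place` may pair the chosen poles with the two measurement channels differently than the gain matrix reported above, but the eigenvalues of $A-LC$ will match the chosen locations.

```matlab
% Observer design by pole placement on the dual system (A', C').
observer_poles = [-18 -17 -16 -15];     % more negative than eig(A - B*K_optimal)
L = place(A', C', observer_poles)';     % transpose back to get the observer gain
eig(A - L*C)                            % should match the chosen pole locations
```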
[1] Babuška, Robert, and Frans C. A. Groen. "Approximate Dynamic Programming and Reinforcement Learning." Interactive Collaborative Information Systems. Berlin: Springer-Verlag, 2010. 3-44. Print.
This project utilized one main MATLAB script, and a variety of functions to complete the assigned tasks. In order, the main script implements the following functions and completes the task as follows:
As mentioned, the observer designed here is a full state observer, not a reduced-order observer; this design choice was made because a full state observer is simpler to implement than a reduced-order observer, even though this comes at the expense of accuracy. Based on the plots, the system error is driven to zero within approximately 1.75 seconds for position and within 2 seconds for velocity, which is deemed acceptable for this particular system. At the onset of the trajectory, the rover does not encounter any obstacles for the first 1.5 seconds, so the relatively small position error from 0-1 seconds will not impact the rover's ability to avoid obstacles. The error and fluctuation in the velocity over the first 2 seconds of rover motion is more concerning; the fluctuation from the desired value is dramatic, as is the initial error, but it does eventually reach zero error and stability, so the observer was used as is. Future iterations of this project should use a reduced-order observer to make state estimation error less of an issue.
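As a rough illustration of the separation principle at work, the sketch below simulates the regulation case (no reference trajectory) with the controller and observer gains from above and plots the estimation error; the initial conditions are assumptions chosen only to show the error converging to zero.

```matlab
% Separation-principle sketch: simulate the true state x and the estimation
% error e = x - xhat together and plot the error. The regulation case (no
% reference trajectory) and the initial conditions are assumptions chosen
% only to visualize how quickly the estimates converge.
x0    = [1; 1; 0; 0];          % true initial state (start of the path, at rest)
xhat0 = zeros(4, 1);           % observer starts with no knowledge of the state

% With u = -K_optimal*xhat, the augmented dynamics of [x; e] are block triangular:
A_cl = [A - B*K_optimal,  B*K_optimal;
        zeros(4),         A - L*C];
sys_cl = ss(A_cl, zeros(8, 1), eye(8), 0);

[y_cl, t_cl] = initial(sys_cl, [x0; x0 - xhat0], 0:0.01:3);
plot(t_cl, y_cl(:, 5:8));      % estimation error components
xlabel('time (s)'); ylabel('estimation error');
```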