MiniProject 1 - Robot Path Planning

Courtney L. Armstrong

MEC 560: Advanced Control Systems taught by Dr. Vivek Yadav

Introduction

For this project, the first task was to generate a path for a rover to follow utilizing either dynamic programming or reinforcement learning. In order to choose the "best" method to use, both methods were implemented independently to generate paths for the robot to track, and the results were compared to determine which method would be used to complete the project.

Regardless of chosen method, the generated path was required to accomplish the following items:

  • The path must take the robot from the starting point [1,1] to the end point [9,9]
  • The path must avoid any and all obstacles placed on the 10x10 grid
  • The trajectory must be smooth and avoid sharp ($90^{\circ}$) turns

    Once a smooth trajectory had been achieved, an observer was designed and tested; the resulting control was also analyzed to determine whether it was optimal.

    Path Generation

    As mentioned in the introduction, two methods of path generation can be used for this task; the first being dynamic programming (DP) and the second being reinforcement learning (RL). While these two methods are both Markov decision processes (MDP), they are fundamentally different in terms of how the path algorithm is actually determined based on the environment of the robot [1]. Both approaches were applied to the task at first so they could be directly compared, but the algorithm utilized by reinforcement learning was unable to determine the path to navigate the environment in a reasonable amount of time when it contained the desired number of obstacles. For this reason, dynamic programming was used to solve the task.

    Environment Setup

    Prior to testing the path-generation algorithms, the initial environment, or "course", that the robot is to traverse had to be created. As per the requirements of the project, this required the construction of a 10x10 grid with several obstacles for the robot to avoid. Obstacles were placed in such a way that the robot would have to interact with them to get to the end-point, in order to verify that the chosen algorithm worked properly.

    The environment that was built can be seen in the image below; it consists of seven obstacles of varying size, which were initialized on the grid using MATLAB.

    Figure 1 - Environment Layout
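
    The sketch below illustrates how such an environment might be initialized in MATLAB. The variable names (grid_x, x_agent, x_goal, etc.) follow the conventions described in the appendix, but the obstacle coordinates shown here are illustrative, not the exact seven obstacles of Figure 1.

    ```matlab
    % Illustrative environment setup; the obstacle coordinates below are
    % examples, not the exact seven obstacles shown in Figure 1.
    grid_x = 10; grid_y = 10;          % 10x10 grid
    x_agent = 1;  y_agent = 1;         % start point [1,1]
    x_goal  = 9;  y_goal  = 9;         % end point [9,9]

    % Each row defines one rectangular obstacle: [x_min x_max y_min y_max]
    obstacles = [2 3 2 4;
                 5 6 1 3;
                 4 5 6 8];

    % Mark obstacle cells in a grid map (1 = obstacle), used for plotting
    % and for assigning the obstacle penalty during path generation
    grid_map = zeros(grid_y, grid_x);
    for k = 1:size(obstacles, 1)
        o = obstacles(k, :);
        grid_map(o(3):o(4), o(1):o(2)) = 1;
    end
    ```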

    Path Creation via Dynamic Programming

    The algorithm utilized to generate the path for the rover follows the concepts of dynamic programming; the MATLAB code implemented was based on the code authored by Dr. Vivek Yadav [2] and modified to allow for a larger grid, as well as the appropriate initial [1,1] and end [9,9] positions. Correction of the initial position was achieved by setting the values of x_agent and y_agent to [1] and [1], respectively, and the end point was set by correcting the values of x_goal and y_goal to [9] and [9], respectively. The larger grid also required modification of the function check_ind.m so that the algorithm iterated over all index points, not just those set by the initial code; the initial values would have stopped the iteration after 41 steps, due to the original grid being 4x4. The new 10x10 grid required 101 iteration steps in order to properly determine the path. Finally, the obstacle penalty values (labeled as Act in the program) had to be increased from the initial values, to compensate for the larger grid and properly generate the obstacles.
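
    A minimal sketch of the underlying value-iteration loop is given below. It assumes a unit step cost and the grid_map obstacle mask from the environment sketch above; the course code [2] structures its indexing and actions differently, so this is illustrative only.

    ```matlab
    % Minimal value-iteration sketch for the cost-to-go computation; the
    % course code [2] handles indexing and actions differently.
    N = 10;                          % grid dimension
    V = inf(N, N);                   % cost-to-go value for every cell
    obstacle_penalty = 1000;         % high cost so the path avoids obstacles
    V(9, 9) = 0;                     % zero cost at the goal [9,9]

    for iter = 1:101                 % 101 sweeps for the 10x10 grid
        for ix = 1:N
            for iy = 1:N
                if ix == 9 && iy == 9, continue; end   % goal stays at zero
                if grid_map(iy, ix) == 1               % mask from the sketch above
                    V(iy, ix) = obstacle_penalty;
                    continue
                end
                % Check the four neighboring cells, each move costing one unit
                moves = [ix+1 iy; ix-1 iy; ix iy+1; ix iy-1];
                for m = 1:4
                    jx = moves(m, 1); jy = moves(m, 2);
                    if jx >= 1 && jx <= N && jy >= 1 && jy <= N
                        V(iy, ix) = min(V(iy, ix), 1 + V(jy, jx));
                    end
                end
            end
        end
    end
    ```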

    Application of these changes provided the cost-to-go plot shown below; here all the obstacles have a high cost to go so that the algorithm avoids them as it generates the path. As the robot approaches the goal, the associated cost to go is lowered until the goal is reached.


    Figure 2 - "Cost to Go" Value Plot Associated with Chosen Obstacles

    Once the cost to go of each index was generated, the algorithm utilized these stored values to determine the path the rover should follow in order to guarantee it will avoid the obstacles. The path generated by the dynamic programming algorithm can be seen below. Due to the way the algorithm indexes the waypoints, the actual path begins at [1.1, 1] and ends at [9.1, 9].


    Figure 3 - Robot Path Generated Based on "Cost to Go" Values
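
    A sketch of this extraction step is given below, reusing V and N from the value-iteration sketch above: at each step the rover simply moves to the neighboring cell with the lowest cost to go.

    ```matlab
    % Greedy descent over the stored cost-to-go values V (from the sketch above)
    path = [1 1];                             % start at [1,1]
    while ~isequal(path(end, :), [9 9])
        ix = path(end, 1); iy = path(end, 2);
        moves = [ix+1 iy; ix-1 iy; ix iy+1; ix iy-1];
        best = inf; next = [ix iy];
        for m = 1:4
            jx = moves(m, 1); jy = moves(m, 2);
            if jx >= 1 && jx <= N && jy >= 1 && jy <= N && V(jy, jx) < best
                best = V(jy, jx); next = [jx jy];
            end
        end
        path(end+1, :) = next;                %#ok<AGROW> append the next waypoint
    end
    ```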

    Path Smoothing

    While the generated series of waypoints does achieve the objective of avoiding obstacles and getting the rover from the start point to the end point, the path contains multiple sharp turns. In practice, these sudden and sharp directional changes would be difficult for certain vehicles, such as UAVs, to physically implement. In light of this, the generated path must be smoothed so that all the turns follow a gentle curvature. This is accomplished here using gradient descent, along with a buffer around the obstacles to ensure the rover successfully avoids them when it follows the new path.

    Gradient Descent

    The trajectory generated for the rover to follow can be smoothed by performing an optimization which minimizes the following general equation:


    $$ J(\hat{XY}) =\sum_{i=2}^{N-1} J(\hat{XY}_i) = \sum_{i=2}^{N-1} (XY_i - \hat{XY}_i )^2 + \alpha \sum_{i=2}^{N-1} (\hat{XY}_i - \hat{XY}_{i+1} )^2$$

    In this equation, ${XY}_i$ represents the original $XY$ values at a certain index, $i$, and $\hat{XY}_i$ represents the new $XY$ values for the smoothed trajectory at the same index point, $i$. The first term is used to minimize the distance between the original trajectory generated by dynamic programming and the optimized (smoothed) trajectory. The second term is weighted by a factor, $\alpha$, and is used to penalize sharp turns in the optimized trajectory. For gradient descent, the general equation is converted to the form of:

    $$\hat{XY}_{i} = \hat{XY}_{i} + \beta (XY_{i}-\hat{XY}_{i}) + \gamma (\hat{XY}_{i-1} - 2 \hat{XY}_{i} + \hat{XY}_{i+1})$$

    where, $\beta = 0.5$ and $\gamma = 0.1$.

    The new path generated for this particular task is subject to boundary constraints; the constraints here are the start point, the end point, and obstacle buffers. The starting point is defined as $\hat{XY}_1$, which is equal to the start point of the original trajectory, ${XY}_1 = [1,1]$. Similarly, the ending point of the smoothed trajectory, defined as $\hat{XY}_N$, must end at the same point as the original trajectory, ${XY}_N = [9,9]$. All remaining waypoints (in the range of $i = 2:N-1$, where $N$ is the number of points in the original trajectory) within the smoothed trajectory are determined using the gradient descent equation and the constraints set by the obstacle buffers. Obstacle buffers will be analyzed between 0.1 and 0.3 units (with a step of 0.1 units), and are required here because nothing else prevents the optimized path from overlapping with the defined obstacles.
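
    A compact sketch of this update is shown below; path is the $N \times 2$ waypoint list produced by dynamic programming, and the fixed iteration count stands in for a proper convergence check on the change per sweep.

    ```matlab
    % Gradient-descent smoothing of the DP waypoints ('path', N x 2),
    % using the update above with beta = 0.5 and gamma = 0.1
    beta  = 0.5;                       % pull toward the original waypoints
    gamma = 0.1;                       % penalize curvature (sharp turns)
    XY_hat = path;                     % initialize with the original path
    for iter = 1:500                   % fixed count in place of a convergence check
        for i = 2:size(path, 1) - 1    % endpoints [1,1] and [9,9] stay fixed
            XY_hat(i, :) = XY_hat(i, :) ...
                + beta  * (path(i, :) - XY_hat(i, :)) ...
                + gamma * (XY_hat(i-1, :) - 2*XY_hat(i, :) + XY_hat(i+1, :));
        end
    end
    ```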

    Prior to setting the buffer constraint on the path, the gradient descent algorithm was tested to ensure that it worked as desired. As previously mentioned, $\beta = 0.5$ and $\gamma = 0.1$; the smoothing weight here is $\alpha = 0.1$. The result of this can be seen below, where the red line is the smoothed trajectory without buffers, and the blue '*' points are the original trajectory. In addition to the overall figure, a second image is provided below, which shows a magnified view of a turning point in the trajectory.

    Figure 4 - Smoothed Trajectory using Gradient Descent ($\alpha = 0.1$; Without Buffers)
    ![title](Final Images/alpha_01_turn.png)
    Figure 5 - Smoothed Turn ($\alpha = 0.1$; Without Buffers)

    Here, the trajectory is still fairly angular, which is due to the value of the smoothing constant, $\alpha$. In this scenario, $\alpha = 0.1$, so the smoothing is not dramatic. In the figures below, $\alpha = 0.5$ to demonstrate the effect that varying this parameter has on the final path.

    ![title](Final Images/alpha_05.png)
    Figure 6 - Smoothed Trajectory using Gradient Descent ($\alpha = 0.5$; Without Buffers)
    ![title](Final Images/alpha_05_turn.png)
    Figure 7 - Smoothed Turn ($\alpha = 0.5$; Without Buffers)

    Buffers

    As previously mentioned, "buffers" are required here to ensure that the smoothed path does not inadvertently collide with any of the defined obstacles. The buffers were implemented in the program by defining new, virtual obstacles that the initial trajectory must avoid, so the buffers are used to determine the base waypoints (prior to path smoothing). Following this procedure with a correct buffer size guarantees that the smoothed trajectory will not collide with the actual obstacles.
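
    One simple way to realize these virtual obstacles, assuming the [x_min x_max y_min y_max] obstacle rows from the environment sketch, is to grow each rectangle by the buffer size before generating the path:

    ```matlab
    % Grow each obstacle rectangle by the buffer size before path
    % generation, so the smoothed path keeps clearance from the real
    % obstacles (uses implicit expansion, MATLAB R2016b or later)
    buffer = 0.1;                                    % buffer size, grid units
    obstacles_buffered = obstacles + buffer * [-1 1 -1 1];
    ```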

    The buffer size was set at values ranging from 0.1 units to 0.3 units, with a step size of 0.1 units, to determine what size would generate ideal results. The ideal size would have to ensure that the path does not cross obstacles, and that the original path does not change significantly. The results from this analysis can be seen in the figures below, along with the value plot associated with each buffer size.

    Buffer Size = 0.1 Units

    Figure 8 - Value Plot (Buffer = 0.1 Units)
    Figure 9 - Base Trajectory (Buffer = 0.1 Units)
    Figure 10 - Smoothed Trajectory (Buffer = 0.1 Units)

    Buffer Size = 0.2 Units

    Figure 11 - Value Plot (Buffer = 0.2 Units)
    Figure 12 - Base Trajectory (Buffer = 0.2 Units)
    Figure 13 - Smoothed Trajectory (Buffer = 0.2 Units)

    Buffer Size = 0.3 Units

    Figure 14 - Value Plot (Buffer = 0.3 Units)
    Figure 15 - Base Trajectory (Buffer = 0.3 Units)
    Figure 16 - Smoothed Trajectory (Buffer = 0.3 Units)

    When the buffer size was increased to 0.2 units, the path the rover was tracking changed completely. As the buffer size grew, the space between the obstacles grew smaller, until there were no longer pathways for the rover to travel through. This is not ideal, so a buffer size of 0.1 units was chosen, which provides full obstacle clearance while closely following the original path.

    Utilizing the knowledge that the maximum velocity of the rover is 1.5 units/second, the smoothed path was converted into a trajectory with respect to time. Since the distance traveled between each pair of points is known, the velocity can be used to determine the amount of time it takes to travel between each waypoint. Once the time required to travel between each point was determined, spline interpolation was used to equidistantly space waypoints on the trajectory. The results of this conversion can be seen on the plot below.

    Figure 17 - Time vs. Position of Rover (x & y)
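
    A sketch of this conversion, assuming the smoothed waypoints XY_hat from earlier and a constant speed of 1.5 units/s along the path, is given below.

    ```matlab
    % Convert the smoothed waypoints (XY_hat) to a time trajectory at a
    % constant speed of 1.5 units/s, then re-sample at uniform times
    v_max = 1.5;                               % rover speed, units/s
    seg   = sqrt(sum(diff(XY_hat).^2, 2));     % distance between waypoints
    t     = [0; cumsum(seg) / v_max];          % arrival time at each waypoint
    t_uniform = linspace(0, t(end), 200)';     % equidistant time samples
    x_ref = interp1(t, XY_hat(:, 1), t_uniform, 'spline');
    y_ref = interp1(t, XY_hat(:, 2), t_uniform, 'spline');
    ```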

    Observer & Controller Design via Separation Principle

    Standard convention for determining how to control a system utilizes the system dynamics equation $\dot{X} = AX + Bu$, where $X$ is the vector of the states of the system, $A$ is the state matrix, $B$ is the input matrix, and $u$ is the vector of input variables. The measurement equation, $y = CX$, accompanies the dynamics equation, where $y$ defines the resulting measurements of the system due to $X$, and $C$ is the output matrix.

    For this system, the actual states are not known, which poses a significant problem for designing a controller to track the desired path for the rover. Standard pole placement techniques require knowledge of the system states, which are not provided for this system. Here, the only known quantities are the output position coordinates, $x$ and $y$, so in order to actually track the desired path for this system, the separation principle must be implemented, which will be discussed in further detail later.

    To apply the separation principle, it is first necessary to determine the $A$, $B$, and $C$ matrices, which can be done by assuming the rover behaves as a point mass so that the overall governing equation can be described as:
    $$F = ma, \ddot{x}=u_x, \ddot{y}=u_y$$

    where the total acceleration, $a$, comprises two components: the acceleration in the x-direction, $a_x = \ddot{x} = u_x$, and the acceleration in the y-direction, $a_y = \ddot{y} = u_y$. These equations, as well as the overall governing equation of motion, can be used to define the state-space representation of the system as follows:

    Let:

    $q_1 = x$
    $q_2 = y$
    $q_3 = \dot{x}$
    $q_4 = \dot{y}$

    System Dynamics:


    $$\left[ \begin{array}{c} \dot{q}_1 \\ \dot{q}_2 \\ \dot{q}_3 \\ \dot{q}_4 \end{array} \right] = \left[ \begin{array}{cccc} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{array} \right] \left[ \begin{array}{c} q_1 \\ q_2 \\ q_3 \\ q_4 \end{array} \right] + \left[ \begin{array}{cc} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{array} \right] \left[ \begin{array}{c} u_x \\ u_y \end{array} \right]$$



    Measurements:

    $$ y = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \end{array} \right] \left[ \begin{array}{cccc} q_1 \\ q_2 \\ q_3 \\ q_4 \\ \end{array} \right]$$
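
    In MATLAB, these system matrices are entered directly:

    ```matlab
    % Double-integrator (point-mass) model in state-space form
    A = [0 0 1 0;
         0 0 0 1;
         0 0 0 0;
         0 0 0 0];
    B = [0 0;
         0 0;
         1 0;
         0 1];
    C = [1 0 0 0;
         0 1 0 0];
    ```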

    Controller Gain Matrix

    In order to ensure the system dynamics processed by the controller do not change before the observer's estimates converge to the state values, the eigenvalues resulting from $A-LC$, where $L$ is the observer gain, must be more negative than the eigenvalues resulting from $A-BK$, where $K$ is the controller gain. Therefore, the controller gain must be determined before the observer gain. The gain matrix, $K$, can be determined by picking arbitrary poles which force all eigenvalues of $A-BK$ to have negative real parts, but this does not guarantee optimal control. Optimal control can only be achieved when an optimal gain matrix is utilized, which is why LQR was used here to determine $K$.

    Utilization of LQR requires the choice of two weighting matrices, $Q$ and $R$, which define the relative costs of state error and control effort. $R$ indicates the control cost and $Q$ indicates the state cost, so choosing $norm(Q) >> norm(R)$ will penalize trajectory error much more than control effort, and vice versa. The chosen $Q$ and $R$ matrices for this system were selected based on this property of LQR controllers, due to the desire to follow the trajectory as closely as possible. For this system, $Q$ and $R$ were defined as follows:


    $ Q = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{array} \right]$

    $ R = \left[ \begin{array}{cccc} .01 & 0 \\ 0 & .01\\ \end{array} \right]$

    Selection of the $R$ matrix was not arbitrary and required trial and error; initially $R$ was selected with much smaller values, but this resulted in a very high controller gain matrix, so the value of $R$ was increased to maintain a reasonable range for the gain matrix. Increasing the values of the diagonal of the $Q$ matrix produced the same result, so while the $norm(Q)$ is not as much greater than the $norm(R)$ as it could be, it produces a viable gain matrix and the trajectory error is still weighted more than the control error. Using these values, the following gain matrix was produced:


    $ K = \left[ \begin{array}{cccc} 10.00 & 0.00 & 10.95 & 0.00 \\ 0.00 & 10.00 & 0.00 & 10.95 \\ \end{array} \right]$

    To check the viability of this gain matrix, the eigenvalues of $A-BK_{optimal}$ were calculated, resulting in the values shown below. The resulting eigenvalues all have negative real parts, so this is a suitable controller gain matrix.


    $eig(A-BK_{optimal}) = \left[ \begin{array}{cccc} -1.0051\\ -9.9494 \\ -1.0051\\ -9.9494\\ \end{array} \right]$
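
    The gain and its eigenvalue check come directly from MATLAB's lqr function:

    ```matlab
    % Optimal gain via LQR with the Q and R chosen above; eig(A - B*K)
    % verifies that all closed-loop poles have negative real parts
    Q = eye(4);
    R = 0.01 * eye(2);
    K = lqr(A, B, Q, R);   % approx. [10 0 10.95 0; 0 10 0 10.95]
    eig(A - B*K)           % approx. -1.005 and -9.949, each repeated
    ```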
    Observer Gain Matrix

    Once the eigenvalues of the controller gain matrix and the system dynamics are known, the observer gain can be determined by appropriate pole placement. As mentioned previously, the eigenvalues of $A-LC$ must be more negative than the eigenvalues of $A-BK_{optimal}$, so poles more negative than the controller's eigenvalues were applied to the system. Unlike the controller gain, the values of the observer gain can be comparatively very high; this is because the observer does not depend on physical system components, like actuators, which can saturate if the gain gets too high. A high observer gain also allows for rapid convergence on the system states, but this does lead to a larger initial error from the observer (peaking), so care must be taken with this approach. This procedure produced the following observer gain matrix:


    $ L = \left[ \begin{array}{cccc} 35.00 & 0.00 \\ 0.00 & 31.00 \\ 306.00 & 0.00 \\ 0.00 & 240.00 \\ \end{array} \right]$

    To check that the observer gain produces eigenvalues more negative than those produced by the controller gain matrix, the eigenvalues of $A-LC$ were calculated, resulting in the values shown below. The resulting eigenvalues are more negative than those from the controller gain, so the observer will converge on the states more quickly than the system dynamics change.


    $eig(A-LC) = \left[ \begin{array}{cccc} -18.00\\ -17.00 \\ -16.00\\ -15.00\\ \end{array} \right]$
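
    The observer gain follows from pole placement on the dual system:

    ```matlab
    % Observer gain via pole placement; the poles are chosen more negative
    % than eig(A - B*K) so the estimates converge faster than the plant
    obs_poles = [-18 -17 -16 -15];
    L_gain = place(A', C', obs_poles)';   % transpose back to observer form
    eig(A - L_gain*C)                     % confirms the assigned poles
    ```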
    The Separation Principle

    The separation principle is a method utilized when the full state of the system is not available, as is the case with this system, that allows for the controller of the system to be designed using the estimated states, $\hat{X}$. The separation principle also allows for the observer to be designed independent of the actions of the controller; essentially this allows for the design of the controller and observer to be completely separate from one another. Conventionally, this is expressed by the fact that the eigenvalues of the combined closed-loop system are simply the union of the controller eigenvalues, $eig(A-BK)$, and the observer eigenvalues, $eig(A-LC)$, so each set can be placed independently.

    The end goal of this system is not to drive the states to zero, but instead to track the desired path. This goal requires the definition of a new variable, $e_{ref}$, which is used to define the error between the current state and the desired state. For this system, $e_{ref}$ is defined as follows:


    $e_{ref} = \left[ \begin{array}{cccc} q_1 - q_{1,des} \\ q_2 - q_{2,des} \\ q_3 \\ q_4 \\ \end{array} \right]$

    Where $q_{1,des}$ is the desired x-position from the path trajectory, $q_{2,des}$ is the desired y-position from the path, and $q_3, q_4$ are the x- and y-velocities, respectively. The primary job of the observer is now to drive the error between $e_{ref}$ and its estimate, denoted as $\hat{e}_{ref}$, to zero, which will implicitly drive the estimated state values to the desired values. Utilizing $e_{ref}$ redefines the system as follows:


    $\dot{e_{ref}} = \left[ \begin{array}{cccc} \dot{q_1} - \dot{q_{1,des}} \\ \dot{q_2} - \dot{q_{2,des}} \\ \dot{q_3} \\ \dot{q_4} \\ \end{array} \right] = \left[ \begin{array}{cccc} \dot{q_1} - 0 \\ \dot{q_2} - 0 \\ \dot{q_3} \\ \dot{q_4} \\ \end{array} \right] = \dot{q}$

    By substituting $e_{ref}$ for $q$, the system dynamics and measurements become:

    $\dot{e}_{ref} = Ae_{ref}+Bu$
    $y = Ce_{ref} + \left[ \begin{array}{c} q_{1,des} \\ q_{2,des} \\ \end{array} \right]$

    Similarly, the observer equation and resulting control input are redefined by $e_{ref}$ as:

    $ \dot{\hat{e_{ref}}} = A \hat{e_{ref}} + Bu + L \left( y-C \hat{e_{ref}}- \left[ \begin{array}{cccc} q_{1,des} \\ q_{2,des} \\ \end{array} \right] \right) $
    $u = -K \hat{e_{ref}} $
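
    The sketch below shows how the plant and observer might be integrated together with ode45; ref is a hypothetical handle returning the desired $[x; y]$ position at time $t$ from the spline trajectory, and the observer estimate is initialized at zero, which reproduces the initial mismatch discussed below.

    ```matlab
    % Closed-loop sketch: the true state q (z(1:4)) and the observer
    % estimate e_hat (z(5:8)) are integrated together; 'ref' is a
    % hypothetical lookup into the spline trajectory from earlier
    ref = @(t) [interp1(t_uniform, x_ref, t); interp1(t_uniform, y_ref, t)];

    rhs = @(t, z) [ ...
        A*z(1:4) + B*(-K*z(5:8));                        % plant, u = -K*e_hat
        A*z(5:8) + B*(-K*z(5:8)) ...
          + L_gain*(C*z(1:4) - ref(t) - C*z(5:8))];      % observer update
    z0 = [1; 1; 0; 0; zeros(4, 1)];                      % rover at [1,1]; e_hat = 0
    [t_sim, z] = ode45(rhs, [0, t_uniform(end)], z0);
    ```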

    These new system dynamics can now be utilized to design an observer and controller to track the desired path of the rover. A full state observer is utilized here, primarily due to its ease of implementation compared to a reduced order observer, as well as the fact that the error converges quickly enough for this system that the rover is able to fully avoid all obstacles. The error converges to zero in under 1 second (as shown in the plots below), and the first obstacle is not encountered until 1.5 seconds have passed. However, there are small fluctuations in the convergence as time passes, and peaking phenomena can be observed initially, therefore a reduced order observer should be implemented in future iterations.

    Figure 18 - Observer State Estimates

    Figure 19 - Observer Error

    In order to verify the output from the observer is correct, the estimated measurements were plotted for comparison with the trajectory. If the estimated values mimic the desired trajectory, the observer is functioning as intended and will provide accurate data to the controller. As seen in the plot below, once the position error from the observer converges shortly after 1 second, the estimated measurements behave as desired, so the observer is of sound design. It should be noted that the measurement estimates do not start at the desired initial x and y location, but instead start at [0,0]. Implementation of a reduced order observer would better address this issue, further confirming that future iterations should incorporate a reduced order observer.

    Figure 20 - Estimated Trajectory

    Once the estimate of $e_{ref}$, denoted as $\hat{e}_{ref}$, is available, a simple controller can be implemented to actually drive the rover along the desired path via the control law, $u = -K \hat{e}_{ref}$. In this law, the gain matrix, $K$, is the optimal gain matrix, $K_{opt}$, as determined by LQR. Applying the control law to the system dynamics using the estimated states, $\hat{e}_{ref}$, resulted in the following plots.

    Figure 21 - Control Input

    Figure 22 - Final trajectory followed by the robot (in black)

    As can be seen from the plot of the control input versus time, the initial control input required is larger than desired and is unattainable by the system; this is due to the initial error in the observer's estimates before convergence. Once the observer error has converged, the control signal stays within the desired range of $-1 \leq u \leq 1$. This error could be partially corrected by implementing a reduced order observer, which would have no error in the position estimates, as well as by adjusting the controller and observer gain values until the desired control input is attained.

    References

    [1] Babuška, Robert, and Frans C. A. Groen. "Approximate Dynamic Programming and Reinforcement Learning." Interactive Collaborative Information Systems. Berlin: Springer-Verlag, 2010. 3-44. Print.

    [2] Yadav, V. (2016). DynProg_Obs source code (Version 1.0) [Source code]. https://github.com/mec560sbu/mec560sbu.github.io/commits/master/mec560_MATLAB_codes/DynProg_Obs

    [3] [mouhknowsbest]. (2013, June 24). 4.8 Practical Considerations. [Video File]. Retrieved from https://www.youtube.com/watch?v=kpWGdbACL1M.

    [4] Yadav, V. (2016, October 1). Output feedback control, Observability and Observer design. Retrieved October 17, 2016, from https://mec560sbu.github.io/2016/10/01/Observability_and_Observer_Synthesis/

    [5] Yadav, V. (2016, September 20). Control Synthesis. Retrieved October 21, 2016, from https://mec560sbu.github.io/2016/09/19/Control_synthesis/

    [6] Yadav, V. (2016, September 27). Optimal Control. Retrieved October 19, 2016, from https://mec560sbu.github.io/2016/09/25/Opt_control/

    [7] Yadav, V. (2016). MS_lqr_mintime source code (Version 1.0) [Source code]. https://github.com/mec560sbu/mec560sbu.github.io/tree/master/mec560_MATLAB_codes/MS_lqr_mintime

    Appendix

    Function & Script Overview

    This project utilized one main MATLAB script, and a variety of functions to complete the assigned tasks. In order, the main script implements the following functions and completes the task as follows:

    1. Path Generation - In order to generate the path, the user must define the grid size (grid_x & grid_y), the x & y starting locations, the x & y ending positions, the obstacle locations, and the buffer size. These variables are utilized by the "genPath" function to determine the path the robot should take to avoid obstacles.
    2. Path Smoothing - The path generated by the "genPath" function is not smooth and would be difficult for an autonomous vehicle to follow in practice. To achieve a smooth path via gradient descent, the "smoothPath" function is applied. This function utilizes the waypoints generated by the "genPath" function and produces a new path with smoothed curves to allow the robot to follow the path easily. Buffers are included around the obstacles in the "genPath" function, which ensures that the smoothed path will not cross the obstacles.
    3. Time Trajectory - The smoothed path is no more than a series of repositioned waypoints derived from the original path, so a controller cannot track it without first converting it to a time trajectory. Utilizing the knowledge that the total velocity of the rover is 1.5 units/s, the generated waypoints are converted to time-dependent position vectors by the "pointToTrajectory" function. Spline interpolation is implemented to redefine the position and time vectors so they are equidistantly spaced.
    4. Observer & Controller Design - To determine the optimal gain for the controller, the function "LQR_K" is utilized, which applies the Riccati equation to the user-defined A, B, Q, and R matrices to determine the optimal gain matrix. The function "fullStateObs" is used to design a full state observer, based on the user-defined poles and various system matrices, as well as the desired path coordinates. The function calculates the observer gain matrix and provides the state estimation needed for the controller.