8 Final Projects

Genome size of crustaceans

In the repository you will find a dataset with the genome size (measured in picograms of DNA per haploid cell) of two large groups of crustaceans (decapods and isopods). The cause of variation in genome size has been a puzzle for a long time; we’ll use these data to answer the biological question of whether some groups of crustaceans have different genome sizes than others.

  • Read the data and make a histogram and a boxplot. Do you think the data is normally distributed?
  • Create a new data frame with log10-transformed data and repeat the histogram and the boxplot. How does the data look now?
  • Compute and compare mean, standard deviation and variance of the data before and after the transformation. Can you see any significant differences between the decapods and isopods?
  • Extra: Perform a statistical test to answer the proposed biological question. You can perform a t-test, but be sure that the input data you use is normally distributed and the variances of the groups are equal.
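As a rough sketch of the transformation and of the test in the last step, the snippet below uses synthetic log-normal numbers in place of the repository dataset. The column names group and genome_size are assumptions and will need to be adapted to the real file. It checks normality with a Shapiro-Wilk test and equality of variances with Levene's test before running the t-test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the course dataset (column names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": ["decapod"] * 60 + ["isopod"] * 40,
    "genome_size": np.concatenate([
        rng.lognormal(mean=1.5, sigma=0.6, size=60),
        rng.lognormal(mean=0.8, sigma=0.6, size=40),
    ]),
})

# Work on a copy so the raw measurements stay untouched.
df_log = df.copy()
df_log["genome_size"] = np.log10(df["genome_size"])

# Per-group summary statistics after the transformation.
summary = df_log.groupby("group")["genome_size"].agg(["mean", "std", "var"])
print(summary)

decapods = df_log.loc[df_log["group"] == "decapod", "genome_size"]
isopods = df_log.loc[df_log["group"] == "isopod", "genome_size"]

# Check the two t-test assumptions: normality and equal variances.
for name, values in [("decapod", decapods), ("isopod", isopods)]:
    _, p_normal = stats.shapiro(values)
    print(name, "Shapiro-Wilk p-value:", p_normal)
_, p_equal_var = stats.levene(decapods, isopods)
print("Levene p-value:", p_equal_var)

# Student's t-test on the log-transformed values.
t_stat, p_value = stats.ttest_ind(decapods, isopods, equal_var=True)
print("t =", t_stat, "p =", p_value)
```

Small p-values in the Shapiro-Wilk or Levene checks would suggest that the assumptions do not hold, in which case passing equal_var=False (Welch's test) is one possible alternative.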

Visualizing the Stock Market

Load some stock data with from bokeh.sampledata.stocks import AAPL, FB, GOOG, IBM, MSFT and plot the highest value of each day for all of them using bokeh (the line command will be very useful). This might seem simple given your acquired expertise, but here are a few points to take into account:

  • Note that the x axis is time, so you will have to use the command pd.to_datetime to transform the date columns into something plottable. You will also need to set the x_axis_type to "datetime" in the figure environment to keep the dates in the axis ticks.
  • Make the mouse show the company, the date and the value when it hovers over any point. You will need to add a new hover tool for each line with the command add_tools, and specify in HoverTool which line it applies to. A tour through the documentation might be very useful at this point.
  • Finally, play with legend.click_policy so that clicking on a company in the legend dims or hides its line.

As a side note, do not worry too much about BokehDeprecationWarning signs that might appear, but feel free to ask if you want to know why they appear.
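The three notes above can be sketched as follows. To stay independent of the sample-data download, this example builds two small synthetic dictionaries with date and high entries; their values, the colors and the tooltip layout are all assumptions made for illustration, not the expected solution.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Synthetic stand-ins with "date" and "high" keys (values are made up).
dates = list(pd.date_range("2020-01-01", periods=5, freq="D").strftime("%Y-%m-%d"))
stocks = {
    "AAPL": {"date": dates, "high": [75.1, 75.4, 74.9, 76.0, 77.2]},
    "IBM": {"date": dates, "high": [135.9, 134.8, 135.3, 136.0, 137.1]},
}
colors = {"AAPL": "navy", "IBM": "firebrick"}

# x_axis_type="datetime" keeps readable dates on the axis ticks.
p = figure(x_axis_type="datetime", title="Daily high")

for name, data in stocks.items():
    frame = pd.DataFrame({
        "date": pd.to_datetime(data["date"]),  # strings -> timestamps
        "high": data["high"],
        "company": name,
    })
    source = ColumnDataSource(frame)
    renderer = p.line("date", "high", source=source,
                      color=colors[name], legend_label=name)
    # One hover tool per line, restricted to that line's renderer.
    hover = HoverTool(
        renderers=[renderer],
        tooltips=[("company", "@company"),
                  ("date", "@date{%F}"),
                  ("high", "@high{0.00}")],
        formatters={"@date": "datetime"},
    )
    p.add_tools(hover)

# Clicking a legend entry hides the line; "mute" would dim it instead.
p.legend.click_policy = "hide"
```

Calling show(p) from bokeh.plotting (after output_notebook() in a notebook) would display the figure.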

Store sales

In this exercise we will analyze the sales history of a particular store. All the information needed, for every day of a year, can be found here.

  • Download the dataset and load it in a Jupyter Notebook. Get an idea of which information is being stored with the commands head() and describe().
  • You will note that in the Amount column all values come with a $\$ $ sign. Also, for large quantities, the thousands are separated by commas, so you have quantities like $\$1,041.80$ that cannot be converted automatically into numbers. Create a function that takes a string as input, removes any $\$ $ and , signs (this can be done with the .replace() method), and returns the result as a floating-point number. Then, apply this function to the whole column.
  • Also, you see that there is only one store, so the column BranchName is a bit irrelevant. Eliminate that column.
  • What has been the net balance of the year? Has the store made or lost money? Are customers spending more money in cash or in card transactions?
  • Make a 3x4 plot where in every subplot you show the sales of every month as a function of the day of the month. Which is the best month (that with highest earnings)?
  • Make a 1x2 plot in which you plot, as a function of the hour of the day, the total sales on one plot and the average sales on the other. Make an order-4 regression for the first case, and a linear regression for the second one. For doing so, use the regplot function in Seaborn.
  • For every day of the week, plot the total number of transactions made. Ideally, the plot would look like this one. Is there a difference between cash and card transactions?
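A minimal sketch of the cleaning steps, using a tiny made-up table instead of the real file. Amount and BranchName are the column names mentioned above; the Date column, its format and all values are assumptions.

```python
import pandas as pd


def parse_amount(text):
    """Convert a string such as '$1,041.80' into the float 1041.80."""
    return float(text.replace("$", "").replace(",", ""))


# Made-up rows standing in for the downloaded dataset.
sales = pd.DataFrame({
    "Date": ["2019-01-05 10:15", "2019-01-05 17:40", "2019-02-11 09:05"],
    "Amount": ["$1,041.80", "$12.50", "$230.00"],
    "BranchName": ["Main", "Main", "Main"],
})

# Clean the amounts and drop the column that carries no information.
sales["Amount"] = sales["Amount"].apply(parse_amount)
sales = sales.drop(columns="BranchName")

# With real timestamps, grouping by month, hour or weekday is one line each.
sales["Date"] = pd.to_datetime(sales["Date"])
monthly_total = sales.groupby(sales["Date"].dt.month)["Amount"].sum()
print(monthly_total)
```

The same groupby pattern with sales["Date"].dt.hour or sales["Date"].dt.dayofweek gives the per-hour and per-weekday aggregates needed for the later plots.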

Breast cancer

This problem is about getting information about a dataset with different properties of breast tumors.

  • Download the dataset. It can be found here. However, it comes without column names. When using read_csv(), pass the following list as the names option:
    names = ['id_number', 'diagnosis', 'radius_mean', 
           'texture_mean', 'perimeter_mean', 'area_mean', 
           'smoothness_mean', 'compactness_mean', 'concavity_mean',
           'concave_points_mean', 'symmetry_mean', 
           'fractal_dimension_mean', 'radius_se', 'texture_se', 
           'perimeter_se', 'area_se', 'smoothness_se', 
           'compactness_se', 'concavity_se', 'concave_points_se', 
           'symmetry_se', 'fractal_dimension_se', 
           'radius_worst', 'texture_worst', 'perimeter_worst',
           'area_worst', 'smoothness_worst', 
           'compactness_worst', 'concavity_worst', 
           'concave_points_worst', 'symmetry_worst', 
           'fractal_dimension_worst']
    
    Also, we will use the first column as the index. To set this, use the option index_col=0 as well.
  • The diagnosis column specifies whether the tumor was benign (B) or malignant (M). B and M may be difficult to interpret, so make a function that replaces all Bs with Benign and all Ms with Malignant. What is the percentage of benign and malignant tumors?
  • You may have seen that there are mean and worst values for all properties (radius, texture, perimeter, area, smoothness, compactness, concavity, concave_points, symmetry and fractal dimension) of each tumor. Compute the difference between the mean and the worst values for the texture and the symmetry. Plot histograms of these differences for benign and malignant tumors. Can you tell something about benign and malignant tumors in these cases?
  • In a pairplot, plot the scatters of the following properties: ['concave_points_worst', 'concavity_mean', 'perimeter_worst', 'radius_worst', 'area_worst']. Give different colors for benign and malignant tumors. This can be done with the option hue='diagnosis'. With this visualization, what can you say about the differences between benign and malignant tumors?
  • You may have noticed that the properties 'perimeter_worst', 'area_worst' and 'radius_worst' are highly correlated. To see how correlated they are, make a plot with two subplots. On the left, do a jointplot of 'perimeter_worst' vs. 'radius_worst', and on the right, do the same with 'area_worst' and 'radius_worst'. Setting the options kind='reg' and order will plot regression curves and correlation data.
  • Extra: Using the function numpy.polyfit(), perform a linear regression between the properties 'radius_worst' and 'perimeter_worst'. With this information, can you find an approximate value for $\pi$? Do the same for a quadratic regression between 'radius_worst' and 'area_worst'.
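The idea behind the extra question can be sketched on synthetic data: for perfect circles the perimeter is $2\pi r$ and the area is $\pi r^2$, so the fitted coefficients carry an estimate of $\pi$. The radii and noise levels below are invented; with the real tumor measurements the estimate will only be approximate because the shapes are not circles.

```python
import numpy as np

# Synthetic "tumors": noisy circles with made-up radii.
rng = np.random.default_rng(1)
radius = rng.uniform(8.0, 30.0, size=200)
perimeter = 2 * np.pi * radius + rng.normal(0.0, 1.0, size=200)
area = np.pi * radius**2 + rng.normal(0.0, 5.0, size=200)

# Linear fit: perimeter ~ slope * radius + intercept, with slope ~ 2*pi.
slope, intercept = np.polyfit(radius, perimeter, 1)
pi_from_perimeter = slope / 2

# Quadratic fit: area ~ quad * radius**2 + lin * radius + const, with quad ~ pi.
quad, lin, const = np.polyfit(radius, area, 2)
pi_from_area = quad

print("pi from perimeter fit:", pi_from_perimeter)
print("pi from area fit:", pi_from_area)
```

Note that numpy.polyfit() returns the coefficients from the highest power down to the constant term.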

FitzHugh-Nagumo Model

The FitzHugh-Nagumo equations represent a simplified model of neuronal excitation. A particular form of the equations can be stated as follows:

$$ \frac{dV}{dt} = -V(V - a)(V - 1) - W + I_{app} $$

$$ \frac{dW}{dt} = \epsilon (V - \gamma W) $$

  1. Implement this system of two coupled equations as a function, and solve them for the following parameter values: $I_{app} = -0.1 $, $\epsilon = 0.1$, $a = 0.1$, $ \gamma = 0.25 $, under a range of initial conditions.
  2. The parameter $I_{app}$ represents an external input current or stimulation of sorts. Increasing it beyond a certain threshold excites the neuron. Find the approximate threshold value of $I_{app}$ at which the system starts oscillating continuously. What happens as you continue increasing $I_{app}$?
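One possible way to set up both steps is sketched below with scipy.integrate.solve_ivp. The function names, the integration time, the initial condition and the tolerances are choices made for this example. The helper late_amplitude measures the peak-to-peak size of $V$ after the transient, which is close to zero at a stable rest state and clearly positive for sustained oscillations.

```python
import numpy as np
from scipy.integrate import solve_ivp


def fitzhugh_nagumo(t, state, i_app, eps, a, gamma):
    """Right-hand side of the two coupled equations."""
    v, w = state
    dv = -v * (v - a) * (v - 1.0) - w + i_app
    dw = eps * (v - gamma * w)
    return [dv, dw]


def simulate(i_app, v0=0.0, w0=0.0, t_end=400.0, eps=0.1, a=0.1, gamma=0.25):
    """Integrate from (v0, w0) and return time, V and W on a regular grid."""
    t_eval = np.linspace(0.0, t_end, 4001)
    sol = solve_ivp(fitzhugh_nagumo, (0.0, t_end), [v0, w0],
                    args=(i_app, eps, a, gamma),
                    t_eval=t_eval, rtol=1e-8, atol=1e-10)
    return sol.t, sol.y[0], sol.y[1]


def late_amplitude(i_app):
    """Peak-to-peak amplitude of V over the second half of the run."""
    t, v, _ = simulate(i_app)
    tail = v[t > 200.0]
    return tail.max() - tail.min()


# Rest state for the parameters given in the exercise.
print("I_app = -0.1:", late_amplitude(-0.1))
# A larger current, chosen here only as an illustration.
print("I_app =  1.0:", late_amplitude(1.0))
```

Evaluating late_amplitude over a grid of $I_{app}$ values is one way to bracket the threshold, and calling simulate with different v0 and w0 covers the range of initial conditions.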

Double Pendulum

If you connect a pendulum of length $l_1$ and mass $m_1$ to a second pendulum with length $l_2$ and mass $m_2$ you get a double pendulum. The coordinates of the two masses are given by

$$ \begin{eqnarray*} x_{1} & = & l_{1}\sin\theta_{1}\\ y_{1} & = & -l_{1}\cos\theta_{1}\\ x_{2} & = & l_{1}\sin\theta_{1}+l_{2}\sin\theta_{2}\\ y_{2} & = & -l_{1}\cos\theta_{1}-l_{2}\cos\theta_{2}\end{eqnarray*} $$

and the angles evolve according to the following differential equations

$$ \begin{eqnarray*} \ddot{\theta}_{1} & = & \frac{-g(2m_{1}+m_{2})\sin\theta_{1}-m_{2}g\sin(\theta_{1}-2\theta_{2})-2\sin(\theta_{1}-\theta_{2})m_{2}(\dot{\theta}_{2}^{2}l_{2}+\dot{\theta}_{1}^{2}l_{1}\cos(\theta_{1}-\theta_{2}))}{l_{1}(2m_{1}+m_{2}-m_{2}\cos(2\theta_{1}-2\theta_{2}))}\\ \ddot{\theta}_{2} & = & \frac{2\sin(\theta_{1}-\theta_{2})(\dot{\theta}_{1}^{2}l_{1}(m_{1}+m_{2})+g(m_{1}+m_{2})\cos\theta_{1}+\dot{\theta}_{2}^{2}l_{2}m_{2}\cos(\theta_{1}-\theta_{2}))}{l_{2}(2m_{1}+m_{2}-m_{2}\cos(2\theta_{1}-2\theta_{2}))}\end{eqnarray*}$$

where $g=9.81\,\mathrm{m/s^{2}}$.

Show numerically that such a system is chaotic, i.e. tiny changes in the initial conditions can lead to completely different evolutions over time.
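A minimal sketch of such a numerical experiment: two runs that differ by $10^{-8}$ rad in $\theta_1$ are integrated with a hand-written fourth-order Runge-Kutta scheme and their separation is tracked. Unit lengths and masses, the initial angles, the step size and the choice of integrator are all assumptions made for this example.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2


def derivatives(state, l1=1.0, l2=1.0, m1=1.0, m2=1.0):
    """Time derivative of (theta1, omega1, theta2, omega2)."""
    th1, om1, th2, om2 = state
    delta = th1 - th2
    den = 2 * m1 + m2 - m2 * np.cos(2 * delta)
    dom1 = (-G * (2 * m1 + m2) * np.sin(th1)
            - m2 * G * np.sin(th1 - 2 * th2)
            - 2 * np.sin(delta) * m2
            * (om2**2 * l2 + om1**2 * l1 * np.cos(delta))) / (l1 * den)
    dom2 = (2 * np.sin(delta)
            * (om1**2 * l1 * (m1 + m2)
               + G * (m1 + m2) * np.cos(th1)
               + om2**2 * l2 * m2 * np.cos(delta))) / (l2 * den)
    return np.array([om1, dom1, om2, dom2])


def rk4(state0, dt=0.002, t_end=20.0):
    """Classical fourth-order Runge-Kutta integration with a fixed step."""
    steps = int(round(t_end / dt))
    traj = np.empty((steps + 1, 4))
    traj[0] = state0
    for k in range(steps):
        s = traj[k]
        k1 = derivatives(s)
        k2 = derivatives(s + 0.5 * dt * k1)
        k3 = derivatives(s + 0.5 * dt * k2)
        k4 = derivatives(s + dt * k3)
        traj[k + 1] = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj


# Both pendulums start at rest at 2 rad; the second run is nudged slightly.
start = np.array([2.0, 0.0, 2.0, 0.0])
nudged = start + np.array([1e-8, 0.0, 0.0, 0.0])

traj_a = rk4(start)
traj_b = rk4(nudged)

# Absolute difference in theta1 between the two runs over time.
separation = np.abs(traj_a[:, 0] - traj_b[:, 0])
print("initial separation:", separation[0])
print("largest separation:", separation.max())
```

Plotting separation on a logarithmic axis against time makes the roughly exponential growth visible; repeating the experiment with small initial angles gives a useful non-chaotic comparison.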

N-Body Problem

The force $\vec{F}_{ij}$ between two bodies $i$ and $j$ is

$$ \vec{F}_{ij}=-G\frac{m_{i}m_{j}}{|\vec{r}_{ij}|^{3}}\cdot\vec{r}_{ij} $$

where $\vec{r}_{ij}=\vec{x}_{i}-\vec{x}_{j}$ and $G$ is Newton's constant. Integrate

$$ m_i\ddot{\vec{x_i}}=\vec{F_i} $$

where $\vec{F_i}=\sum_{j\neq i}\vec{F}_{ij}$ is the sum of the pairwise forces acting on body $i$, to simulate a moon orbiting a planet that in turn orbits a sun.
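The integration can be sketched in two dimensions with a velocity Verlet scheme. The units ($G=1$), the masses, the distances and the time step below are illustrative assumptions, chosen so that the moon sits well inside the planet's sphere of influence.

```python
import numpy as np

G = 1.0  # gravitational constant in the arbitrary units of this example


def accelerations(pos, mass):
    """Acceleration of every body from the summed pairwise forces."""
    r = pos[:, None, :] - pos[None, :, :]   # r[i, j] = x_i - x_j
    dist = np.linalg.norm(r, axis=-1)
    np.fill_diagonal(dist, np.inf)          # removes the self-interaction
    # a_i = sum_j -G m_j r_ij / |r_ij|^3
    return -G * np.sum(mass[None, :, None] * r / dist[:, :, None]**3, axis=1)


def integrate(pos, vel, mass, dt, steps):
    """Velocity Verlet integration; returns the positions at every step."""
    pos, vel = pos.copy(), vel.copy()
    traj = np.empty((steps + 1,) + pos.shape)
    traj[0] = pos
    acc = accelerations(pos, mass)
    for k in range(steps):
        vel += 0.5 * dt * acc
        pos += dt * vel
        acc = accelerations(pos, mass)
        vel += 0.5 * dt * acc
        traj[k + 1] = pos
    return traj


# Sun, planet, moon (order matters for the indexing below).
mass = np.array([1.0, 1e-3, 1e-5])
d_planet, d_moon = 1.0, 0.01
v_planet = np.sqrt(G * mass[0] / d_planet)   # circular speed around the sun
v_moon = np.sqrt(G * mass[1] / d_moon)       # circular speed around the planet

pos = np.array([[0.0, 0.0], [d_planet, 0.0], [d_planet + d_moon, 0.0]])
vel = np.array([[0.0, 0.0], [0.0, v_planet], [0.0, v_planet + v_moon]])
# Move to the centre-of-mass frame so the system does not drift.
vel -= (mass[:, None] * vel).sum(axis=0) / mass.sum()

dt = 1e-3
steps = int(2 * np.pi / dt)                  # roughly one planetary orbit
traj = integrate(pos, vel, mass, dt, steps)

planet_sun = np.linalg.norm(traj[:, 1] - traj[:, 0], axis=1)
moon_planet = np.linalg.norm(traj[:, 2] - traj[:, 1], axis=1)
print("planet-sun distance range:", planet_sun.min(), planet_sun.max())
print("moon-planet distance range:", moon_planet.min(), moon_planet.max())
```

Velocity Verlet is used here because it conserves energy well over many orbits; a general-purpose ODE solver would also work.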

Turing Pattern

Alan Turing found that in a system with two diffusing morphogens $u$ and $v$

$$ \begin{eqnarray*} \dot u & = & D_1 \Delta u + f(u, v) \\ \dot v & = & D_2 \Delta v + g(u, v) \end{eqnarray*}$$

diffusion can drive instabilities to pattern tissues if

$$ \begin{eqnarray*} f_u + g_v & < & 0 \\ f_u g_v - f_v g_u & > & 0 \\ D_1 g_v + D_2 f_u & > & 0. \end{eqnarray*}$$

One Turing system is the Schnakenberg model

$$ \begin{eqnarray*} f(u, v) & = & k_1 - k_2u + k_3u^2v \\ g(u, v) & = & k_4 - k_3u^2v. \end{eqnarray*}$$

Find parameters $k_i$ satisfying the conditions and solve the system to get some waves (1D) or spots and stripes (2D).
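A possible starting point in 1D is sketched below, assuming that $v$ diffuses with coefficient $D_2$ and reacts according to $g(u, v)$. The function turing_conditions evaluates the three inequalities at the homogeneous steady state $u^* = (k_1 + k_4)/k_2$, $v^* = k_4/(k_3 u^{*2})$, and simulate integrates the Schnakenberg model with explicit Euler steps on a periodic grid. The parameter values, grid and time step are one illustrative choice, not the only valid one; in practice $D_2$ also has to be sufficiently larger than $D_1$ for a pattern to grow.

```python
import numpy as np


def turing_conditions(k1, k2, k3, k4, d1, d2):
    """Evaluate the three inequalities at the homogeneous steady state."""
    u = (k1 + k4) / k2
    v = k4 / (k3 * u**2)
    f_u, f_v = -k2 + 2 * k3 * u * v, k3 * u**2
    g_u, g_v = -2 * k3 * u * v, -k3 * u**2
    return (f_u + g_v < 0,
            f_u * g_v - f_v * g_u > 0,
            d1 * g_v + d2 * f_u > 0)


def laplacian(field, dx):
    """Second difference with periodic boundary conditions."""
    return (np.roll(field, 1) + np.roll(field, -1) - 2 * field) / dx**2


def simulate(k1=0.1, k2=1.0, k3=1.0, k4=0.9, d1=1.0, d2=40.0,
             n=100, dx=1.0, dt=0.01, steps=10000, seed=0):
    """Explicit Euler integration starting from a noisy steady state."""
    rng = np.random.default_rng(seed)
    u_star = (k1 + k4) / k2
    v_star = k4 / (k3 * u_star**2)
    u = u_star + 0.01 * rng.standard_normal(n)
    v = v_star + 0.01 * rng.standard_normal(n)
    for _ in range(steps):
        reaction = k3 * u**2 * v
        du = d1 * laplacian(u, dx) + k1 - k2 * u + reaction
        dv = d2 * laplacian(v, dx) + k4 - reaction
        u = u + dt * du
        v = v + dt * dv
    return u, v


print("conditions satisfied:", turing_conditions(0.1, 1.0, 1.0, 0.9, 1.0, 40.0))
u, v = simulate()
print("pattern amplitude in u:", u.max() - u.min())
```

The explicit scheme is only stable while $D_2\,dt/dx^2$ stays below $1/2$, so a finer grid needs a smaller time step. Extending laplacian to two np.roll directions gives the 2D version for spots and stripes.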