Exercises


In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
diamonds = pd.read_csv('diamonds.csv',index_col=0)

Task 1: Basics

  • have a look at the diamonds dataset: how many rows does it have, and what are the different columns?
  • create a DataFrame consisting only of the x, y and z columns
  • access rows 5 to 15 in diamonds
  • create a DataFrame consisting only of rows 5 to 15 and name the rows "A" to "K" (hint: each DataFrame has an .index attribute which can be modified)
  • access row "C" in the DataFrame you just created
  • get the price of the 500th diamond; the mixed access operator .ix was removed in pandas 1.0, so use .iloc (by position) or .loc (by label) instead
  • group the diamonds by color and compute the mean of the price
  • find all the diamonds with more than 2 carat and plot their price distribution in a histogram
  • compute and plot the standard deviation of the x dimension for the different cuts
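
The access patterns used in the tasks above can be sketched on a small stand-in table. This is only an illustration, not a solution: the DataFrame `toy` and its values are made up, and only the column names carat, color and price are borrowed from the diamonds dataset.

```python
import pandas as pd

# Toy stand-in for the diamonds table (values are invented for illustration)
toy = pd.DataFrame(
    {
        "carat": [0.3, 2.1, 0.7, 2.5],
        "color": ["E", "J", "E", "J"],
        "price": [400, 15000, 2100, 17000],
    },
    index=[1, 2, 3, 4],
)

# Select a subset of columns by passing a list of column names
subset = toy[["carat", "price"]]

# Slice rows by position with .iloc, then relabel them through .index
part = toy.iloc[1:4].copy()
part.index = list("ABC")
row_b = part.loc["B"]  # label-based access to a single row

# Single value: row at position 2, column "price"
third_price = toy.iloc[2]["price"]

# Boolean mask and group-wise aggregation
heavy = toy[toy["carat"] > 2]
mean_price = toy.groupby("color")["price"].mean()
```

Note that .iloc counts from 0 and excludes the end of a slice, whereas .loc works on index labels and includes both ends.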

In [ ]:

Task 2: Gene expression data

In this exercise we will work with a realistic gene expression data set. Buettner et al. used single-cell RNA-sequencing data to identify sub-populations of cells with similar gene-expression profiles. RNA sequencing (RNA-seq) uses next-generation sequencing techniques to quantify RNA abundance.

Go to the link given above and download the dataset "Corrected and uncorrected expression values for T-cell data." from the supplementary information section. Have a look at the dataset with LibreOffice and import the second sheet (Cell-cycle corrected gene expr) as a pandas DataFrame. Use the read_excel() function together with its sheet_name parameter (called sheetname in older pandas versions).


In [ ]:

In the imported table, each column represents a gene, while each row represents a single cell. This table structure is great for an overview, but it makes it hard, for example, to plot a histogram of the expression of all genes in all cells. For this it would be better to have all expression values in a single column.

Convert the table to a table with 3 columns: cell_id, gene and expression.

Example:

Original table:

        Gnai3   Cdc45   Narf     Klf6
Cell 1  3.2322  3.1981  0.29411  1.7343
Cell 2  1.9832  1.173   0.49389  3.8505
Cell 3  2.2482  3.1705  1.6279   1.6306

Converted table:

cell_id  gene   expression
Cell 1   Gnai3  3.2322
Cell 1   Cdc45  3.1981
Cell 1   Narf   0.29411
Cell 1   Klf6   1.7343
Cell 2   Gnai3  1.9832
Cell 2   Cdc45  1.173
...      ...    ...

This conversion can be done using the method DataFrame.stack(), which for a table like this one yields a Series with a MultiIndex (cell, gene). This multi-indexed Series can be converted to a conventional DataFrame using the method reset_index().
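
A minimal sketch of this conversion, using the three example cells and four genes from the table above rather than the real data set. The variable names `wide`, `stacked` and `long` are chosen here for illustration only.

```python
import pandas as pd

# The small example table from the text: rows are cells, columns are genes
wide = pd.DataFrame(
    {
        "Gnai3": [3.2322, 1.9832, 2.2482],
        "Cdc45": [3.1981, 1.173, 3.1705],
        "Narf": [0.29411, 0.49389, 1.6279],
        "Klf6": [1.7343, 3.8505, 1.6306],
    },
    index=["Cell 1", "Cell 2", "Cell 3"],
)

# stack() moves the column labels into a second index level
stacked = wide.stack()

# reset_index() turns both index levels into ordinary columns
long = stacked.reset_index()

# The automatically generated column names are replaced with descriptive ones
long.columns = ["cell_id", "gene", "expression"]
```

With 3 cells and 4 genes, `long` has 12 rows, one per (cell, gene) pair.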


In [ ]:

In the original paper, each cell was assigned to a cluster based on a principal component analysis of its gene expression profile. The cluster each cell belongs to is given in the third sheet of the Excel file (Cluster Assignment). Read in this sheet as well and combine the two tables so that expression and cluster assignment are in one table.
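
One possible way to combine two such tables is DataFrame.merge(). The sketch below uses invented toy data; the column names `cell_id` and `cluster` are assumptions and may not match the headers in the actual Excel sheets.

```python
import pandas as pd

# Long-format expression table (toy values)
expr = pd.DataFrame(
    {
        "cell_id": ["Cell 1", "Cell 1", "Cell 2", "Cell 2"],
        "gene": ["Gnai3", "Cdc45", "Gnai3", "Cdc45"],
        "expression": [3.2322, 3.1981, 1.9832, 1.173],
    }
)

# One row per cell with its cluster label (hypothetical layout)
clusters = pd.DataFrame({"cell_id": ["Cell 1", "Cell 2"], "cluster": [0, 1]})

# A left merge keeps every expression row and attaches the matching cluster
combined = expr.merge(clusters, on="cell_id", how="left")
```

If the cell identifiers are stored in the index rather than in a column, merge() also accepts left_index=True or right_index=True.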


In [ ]:

Create a boxplot of the expression of all genes for both clusters.


In [ ]:

For the gene Gata3 only, draw a boxplot of its expression across the cells in each cluster.


In [ ]:

Create a scatter-plot of the two principal components for all Gata3 measurements.


In [ ]:

Color the points in the scatter plot according to whether they belong to cluster 0 or 1.
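
Coloring by group can be sketched by drawing one scatter call per cluster, so that matplotlib assigns each group its own color. The data and the column names `pc1`, `pc2` and `cluster` below are made up for illustration and are not taken from the Excel file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy principal-component coordinates with a cluster label per point
points = pd.DataFrame(
    {
        "pc1": [-1.2, -0.8, 1.1, 0.9],
        "pc2": [0.3, -0.4, 0.5, -0.2],
        "cluster": [0, 0, 1, 1],
    }
)

fig, ax = plt.subplots()

# One scatter call per cluster: each call gets the next color in the cycle
for cluster_id, group in points.groupby("cluster"):
    ax.scatter(group["pc1"], group["pc2"], label=f"cluster {cluster_id}")

ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
```

An alternative is a single scatter call with the c= argument set to the cluster column, which maps the labels through a colormap instead.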


In [ ]:

