In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
diamonds = pd.read_csv('diamonds.csv',index_col=0)
In [ ]:
In this exercise we will work with a realistic gene expression data set. Buettner et al. used single-cell RNA-sequencing data to identify sub-populations of cells with similar gene expression profiles. RNA sequencing (RNA-seq) uses next-generation sequencing techniques to quantify RNA abundance.
Go to the link given above and download the dataset "Corrected and uncorrected expression values for T-cell data." from the supplementary information section. Have a look at the dataset with LibreOffice and import the second sheet (Cell-cycle corrected gene expr) as a pandas DataFrame. Use the `read_excel()` function together with the `sheet_name` parameter (called `sheetname` in pandas versions before 0.21).
In [ ]:
In the imported table, each column represents a gene and each row represents a single cell. This layout is good for an overview, but it makes some tasks awkward, for example plotting a histogram of the expression of all genes in all cells. For that, it is better to have all expression values in a single column.
Convert the table to a table with 3 columns: `cell_id`, `gene` and `expression`.
Example:
Original table:
| | Gnai3 | Cdc45 | Narf | Klf6 |
|---|---|---|---|---|
| Cell 1 | 3.2322 | 3.1981 | 0.29411 | 1.7343 |
| Cell 2 | 1.9832 | 1.173 | 0.49389 | 3.8505 |
| Cell 3 | 2.2482 | 3.1705 | 1.6279 | 1.6306 |
Converted table:
cell_id | gene | expression |
---|---|---|
Cell 1 | Gnai3 | 3.2322 |
Cell 1 | Cdc45 | 3.1981 |
Cell 1 | Narf | 0.29411 |
Cell 1 | Klf6 | 1.7343 |
Cell 2 | Gnai3 | 1.9832 |
Cell 2 | Cdc45 | 1.173 |
... | ... | ... |
This conversion can be done using the method `DataFrame.stack()`, which for a table like this one yields a multi-indexed Series with one entry per (cell, gene) pair. This multi-indexed Series can be converted to a conventional DataFrame using the method `reset_index()`.
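The conversion described above can be sketched on the small example table. This is only an illustration under assumptions: the variable names `wide`, `stacked` and `long_df` are made up here, and the real data set has many more cells and genes.

```python
import pandas as pd

# Toy wide table copied from the example above: rows are cells, columns are genes.
wide = pd.DataFrame(
    {
        "Gnai3": [3.2322, 1.9832, 2.2482],
        "Cdc45": [3.1981, 1.173, 3.1705],
        "Narf": [0.29411, 0.49389, 1.6279],
        "Klf6": [1.7343, 3.8505, 1.6306],
    },
    index=["Cell 1", "Cell 2", "Cell 3"],
)

# stack() moves the column labels into a second index level,
# so there is one expression value per (cell, gene) pair.
stacked = wide.stack()

# Name the two index levels so that reset_index() turns them
# into columns with the desired names.
stacked.index.names = ["cell_id", "gene"]
long_df = stacked.reset_index(name="expression")

print(long_df.head())
```

Naming the index levels before calling `reset_index()` avoids the default column names `level_0` and `level_1`; renaming the columns afterwards would work just as well.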
In [ ]:
In the original paper, each cell was assigned to a cluster based on a principal component analysis of its gene expression profile. The cluster each cell belongs to is given in the third sheet of the Excel file (Cluster Assignment). Also read in this sheet and combine the two tables so that expression and cluster assignment are in one table.
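One possible way to combine the two tables is `DataFrame.merge()`. The sketch below uses toy data; the column name `cluster` and the cluster values are assumptions made for illustration, and the layout of the real Cluster Assignment sheet may differ.

```python
import pandas as pd

# Long-format expression table (a few rows from the example above).
long_df = pd.DataFrame(
    {
        "cell_id": ["Cell 1", "Cell 1", "Cell 2", "Cell 2"],
        "gene": ["Gnai3", "Cdc45", "Gnai3", "Cdc45"],
        "expression": [3.2322, 3.1981, 1.9832, 1.173],
    }
)

# Hypothetical cluster table: one row per cell. The column name
# "cluster" and the assignments are invented for this sketch.
clusters = pd.DataFrame(
    {
        "cell_id": ["Cell 1", "Cell 2"],
        "cluster": [0, 1],
    }
)

# A left merge keeps every expression row and attaches the cluster of its cell.
# validate="many_to_one" raises an error if a cell appears twice in `clusters`.
merged = long_df.merge(clusters, on="cell_id", how="left", validate="many_to_one")

print(merged)
```

If the cell identifiers are stored in the index of the cluster table rather than in a column, `merge()` also accepts `right_index=True` in combination with `left_on="cell_id"`.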
In [ ]:
Create a boxplot of the expression of all genes for both clusters.
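A grouped boxplot can be drawn directly from a long-format table with `DataFrame.boxplot()` and its `by` parameter. The following is a minimal sketch on invented numbers; the table `merged` and its columns `expression` and `cluster` are assumed to exist in this form.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Toy long-format table; all values are invented for illustration.
merged = pd.DataFrame(
    {
        "cell_id": ["Cell 1", "Cell 1", "Cell 2", "Cell 2", "Cell 3", "Cell 3"],
        "gene": ["Gnai3", "Cdc45", "Gnai3", "Cdc45", "Gnai3", "Cdc45"],
        "expression": [3.2, 3.1, 1.9, 1.1, 2.2, 3.1],
        "cluster": [0, 0, 1, 1, 1, 1],
    }
)

fig, ax = plt.subplots()

# One box per cluster, built from all expression values in that cluster.
merged.boxplot(column="expression", by="cluster", ax=ax)
ax.set_xlabel("cluster")
ax.set_ylabel("expression")
```

Selecting rows with a boolean mask first, for example `merged[merged["gene"] == "Gnai3"]`, restricts the same plot to a single gene.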
In [ ]:
Draw a boxplot of the expression of the gene Gata3 only, across the cells in each cluster.
In [ ]:
Create a scatter plot of the two principal components for all Gata3 measurements.
In [ ]:
Color the points in the scatter plot according to whether they belong to cluster 0 or 1.
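One way to color points by cluster is to draw each cluster with its own `scatter()` call, so that matplotlib assigns a different color and legend entry to each group. The sketch below assumes hypothetical column names `PC1`, `PC2` and `cluster` and uses invented coordinates.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Toy table with one row per cell; all values and the column
# names "PC1", "PC2" and "cluster" are invented for illustration.
cells = pd.DataFrame(
    {
        "PC1": [-1.2, -0.8, 0.9, 1.4],
        "PC2": [0.3, -0.5, 0.7, -0.2],
        "cluster": [0, 0, 1, 1],
    }
)

fig, ax = plt.subplots()

# Plot each cluster separately; every call uses the next color
# of the current style's color cycle.
for cluster_id, group in cells.groupby("cluster"):
    ax.scatter(group["PC1"], group["PC2"], label=f"cluster {cluster_id}")

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```

Passing the cluster column through the `c` argument of a single `scatter()` call is an alternative, but the loop makes it straightforward to get one legend entry per cluster.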
In [ ]: