Something you may want to do in the future is compare your results to papers that came before you. Today we'll go through how to find these data and how to analyze them.
At the end of most recent papers, they'll put a section called "Accession Codes" or "Accession Numbers" which will list a uniquely identifying number and letter combination.
In the US, the Gene Expression Omnibus (GEO) is a website funded by the NIH to store the expression data associated with papers. Many journals require you to submit your data to GEO to be able to publish.
Let's do this for the Shalek2013 paper.
Note: For some "older" papers (pre-2014), the accession code may not be in the PDF version of the paper, but only in the online version. What I usually do then is search for the title of the paper and go to the journal website.
For your homework, you'll need to find another dataset to use and the expression matrix that you want may not be on a database, but rather posted in supplementary data on the journal's website.
If you search for the database and the accession number, the first result will usually be the database with the paper info and the deposited data! Below is an example search for "Array Express E-MTAB-2805."
Search for its database and accession number and you should get to a page that looks like this:
Lately, for many papers, the authors provide a processed expression matrix in the accession database that you can use directly. Luckily for us, that's exactly what the authors of the Shalek 2013 dataset did. If you look at the bottom of the page, there's a table of Supplementary files, and one of them is called "GSE41265_allGenesTPM.txt.gz". The link below is the "(ftp)" link, downloaded with the command "wget", which I think of as short for "web-get": it lets you download files from the internet with the command line.
In addition to the gene expression file "GSE41265_allGenesTPM.txt.gz", we'll also look at the metadata in the "Series Matrix" file. All the "Series" file formats contain the same information in different formats; I find the matrix one the easiest to understand.
Open the "Series Matrix" in Excel (or equivalent) on your laptop, and look at the format and what's described. What line does the actual matrix of metadata start? You can find it where it says in the first column ,"!!Sample_title." It's after an empty line.
Follow this link to jump directly to the GEO page for this data. Scroll down to the Supplementary files at the bottom and download the file called GSE41265_allGenesTPM.txt.gz.
We also need the link to the metadata. It is here. Download the file called GSE41265_series_matrix.txt.gz.
Where did those files go on your computer? Maybe you moved them somewhere. Figure out the full paths of those files, because we'll read them in directly below.
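If you're not sure where the files landed, here's one quick way to check from Python (a sketch - the Downloads folder is just a guess for your system):
In [ ]:
import os

# Look for the downloaded files in a likely location (adjust for your system)
downloads = os.path.expanduser('~/Downloads')
for name in os.listdir(downloads):
    if name.startswith('GSE41265'):
        # Print the full path so we can paste it into read_table below
        print(os.path.join(downloads, name))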
To read the gene expression matrix, we'll use "pandas", a Python package for "Panel Data Analysis" (as in panels of data). It's a fantastic library for working with dataframes and is Python's answer to R's dataframes. We'll take this opportunity to import ALL of the Python libraries that we'll use today.
We'll be using several additional libraries in Python:
- matplotlib - The base plotting library in Python.
- numpy - (pronounced "num-pie") The basis for most scientific packages. It's basically a nice-looking Python interface to C code, and it's very fast.
- pandas - "DataFrames in Python" (like R's nice dataframes). They're a super convenient form that's based on numpy, so they're fast, and you can do convenient things like calculate mean and variance very easily.
- scipy - (pronounced "sigh-pie") "Scientific Python." Contains statistical methods and calculations.
- seaborn - Statistical plotting library. To be completely honest, R's plotting and graphics capabilities are much better than Python's. However, Python is a really nice language to learn and use, it's very memory efficient, it can be parallelized well, and it has a very robust machine learning library, scikit-learn, which has a very nice and consistent interface. So seaborn is Python's answer to ggplot2 (a very popular R library for plotting): it tries to make plotting in Python nicer looking and statistical plots easier to do.
In [ ]:
# Alphabetical order is standard
# We're doing "import superlongname as abbrev" for our laziness - this way we don't have to type out the whole thing each time.
# Python plotting library
import matplotlib.pyplot as plt
# Numerical python library (pronounced "num-pie")
import numpy as np
# Dataframes in Python
import pandas as pd
# Statistical plotting library we'll use
import seaborn as sns
# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline
We'll read in the data using pandas and look at the first 5 rows of the dataframe with the dataframe-specific function .head(). Whenever I read a new table or modify a dataframe, I ALWAYS look at it to make sure it was correctly imported and read in, and I want you to get into the same habit.
In [62]:
# Read the data table
# You may need to change the path to the file (what's in quotes below) relative
# to where you downloaded the file and where this notebook is
shalek2013_expression = pd.read_table('/home/ecwheele/cshl2017/GSE41265_allGenesTPM.txt.gz',
# Sets the first (Python starts counting from 0 not 1) column as the row names
index_col=0,
# Tells pandas to decompress the gzipped file
compression='gzip')
print(shalek2013_expression.shape)
shalek2013_expression.head()
Out[62]:
That's kind of annoying ... we don't see all the samples. We have 21 columns, but pandas by default shows a maximum of 20, so let's change the setting so we can see ALL of the samples instead of just skipping single cell 11 (S11). Let's change it to 50 for good measure.
In [ ]:
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
shalek2013_expression.head()
Now we can see all the samples!
Let's take a look at the full size of the matrix with .shape:
In [ ]:
shalek2013_expression.shape
In [63]:
shalek2013_metadata = pd.read_table('/home/ecwheele/cshl2017/GSE41265_series_matrix.txt.gz',
                                    # Tells pandas to decompress the gzipped file
                                    compression='gzip',
                                    # Skip the header lines so the table starts at the "!Sample_title" row
                                    skiprows=33,
                                    # Sets the first column as the row names
                                    index_col=0)
print(shalek2013_metadata.shape)
shalek2013_metadata
Out[63]:
Let's transpose this matrix so the samples are the rows and the features are the columns. We'll do that with .T.
In [64]:
shalek2013_metadata = shalek2013_metadata.T
shalek2013_metadata
Out[64]:
Now we'll do some mild data cleaning. Notice that the columns have an exclamation point at the beginning; let's get rid of that. In computer science, you keep letters between quotes, and you call those "strings." Let's talk about the string function .strip(), which removes the given characters from the outer edges of the string. For example, let's take the string "Whoooo!!!!!!!"
In [65]:
"Whoooo!!!!!!!"
Out[65]:
Now let's remove the exclamation points:
In [66]:
'Whoooo!!!!!!!'.strip('!')
Out[66]:
In [67]:
# YOUR CODE HERE
In [68]:
'Whoooo!!!!!!!'.strip('o')
Out[68]:
In [69]:
'Whoooo!!!!!!!'.replace("o","")
Out[69]:
We can access the column names with dataframe.columns, like below:
In [70]:
shalek2013_metadata.columns
Out[70]:
We can map the stripping function to every item of the columns. In Python, the square brackets ([ and ]) show that we're making a list. What we're doing below is called a "list comprehension."
In [71]:
[x.strip('!') for x in shalek2013_metadata.columns]
Out[71]:
In pandas, we can do the same thing by map-ping a lambda, which is a small, anonymous function that does one thing. It's called "anonymous" because it doesn't have a name. map runs the function on every element of the columns.
In [72]:
shalek2013_metadata.columns.map(lambda x: x.strip('!'))
Out[72]:
The above lambda is the same as if we had written a named function called remove_exclamation, as below.
In [73]:
def remove_exclamation(x):
    return x.strip('!')
shalek2013_metadata.columns.map(remove_exclamation)
Out[73]:
Now we can assign the new column names to our matrix:
In [74]:
shalek2013_metadata.columns = shalek2013_metadata.columns.map(lambda x: x.strip('!'))
shalek2013_metadata.head()
Out[74]:
In [75]:
shalek2013_expression.head()
Out[75]:
To specify a certain number of rows, put a number between the parentheses.
In [76]:
shalek2013_expression.head(8)
Out[76]:
In [77]:
# YOUR CODE HERE
In [78]:
shalek2013_expression.head(17)
Out[78]:
Let's get a sense of this data by plotting the distributions using boxplot from seaborn. To save the output, we'll need to get access to the current figure, which we save to a variable using plt.gcf(). Then we'll save the figure with fig.savefig("filename.pdf"). You can use other extensions (e.g. ".png" or ".tiff") and it'll automatically save in that format.
In [79]:
sns.boxplot(shalek2013_expression)
# gcf = Get current figure
fig = plt.gcf()
fig.savefig('shalek2013_expression_boxplot.pdf')
Notice the 140,000 maximum ... Oh right, we have expression data and the scales are enormous... Let's add 1 to all values and take the log2 of the data. We add one because log(0) is undefined, and this way all our logged values start from zero too. This "$\log_2(TPM + 1)$" is a very common transformation of expression data that makes it easier to analyze.
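Before transforming the whole matrix, here's a quick numeric check of why we add 1 (made-up values, not from the dataset):
In [ ]:
# log2(0) is undefined, but log2(0 + 1) = 0, so zeros stay at zero
print(np.log2(0 + 1))       # 0.0
print(np.log2(1 + 1))       # 1.0
print(np.log2(140000 + 1))  # ~17.1 -- the huge maximum becomes manageable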
In [80]:
expression_logged = np.log2(shalek2013_expression+1)
expression_logged.head()
Out[80]:
In [81]:
sns.boxplot(expression_logged)
# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_logged_boxplot.pdf')
YOUR ANSWER HERE
In [83]:
at_most_2 = expression_logged < 2
at_most_2
Out[83]:
What's nice about booleans is that False is 0 and True is 1, so we can sum them to count the number of "Trues." This is a simple, clever way to filter the data on a count. We could instead use this boolean dataframe to filter our original dataframe directly, but then we lose information: everywhere the condition is False (values of 2 or greater), it puts in a "not a number" - "NaN."
In [ ]:
expression_at_most_2 = expression_logged[expression_logged < 2]
print(expression_at_most_2.shape)
expression_at_most_2.head()
In [ ]:
# YOUR CODE HERE
In [ ]:
expression_logged.head()
In [ ]:
expression_greater_than_5 = expression_logged[expression_logged > 5]
expression_greater_than_5.head()
The crude filtering above is okay, but we're smarter than that. We want to use the filtering in the paper:
... discarded genes that were not appreciably expressed (transcripts per million (TPM) > 1) in at least three individual cells, retaining 6,313 genes for further analysis.
We want to do THAT, but first we need a couple more concepts. The first one is summing booleans.
Remember that booleans are really 0s (False) and 1s (True)? This turns out to be VERY convenient and we can use this concept in clever ways.
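For instance (a throwaway example, not from the dataset):
In [ ]:
# True behaves as 1 and False as 0 in arithmetic
print(True + True + False)                   # 2
print(pd.Series([True, False, True]).sum())  # 2 -- counts the Trues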
We can use .sum() on a boolean matrix to get the number of genes with expression greater than 10 in each sample:
In [ ]:
(expression_logged > 10).sum()
pandas is column-oriented and, by default, it will give you a sum for each column. But we want a sum for each row. How do we do that?
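Here's a toy boolean dataframe (made up, not from the dataset) to show the difference:
In [ ]:
toy = pd.DataFrame([[True, False], [True, True]],
                   index=['gene1', 'gene2'], columns=['S1', 'S2'])
print(toy.sum())        # default (axis=0): one count per column (per sample)
print(toy.sum(axis=1))  # axis=1: one count per row (per gene)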
We can sum the boolean matrix we created with "expression_logged > 10" along axis=1 (along the samples) to get, for each gene, how many samples have expression greater than 10. In pandas, a single column like this is called a "Series" because it has only one dimension - its length. Internally, pandas stores dataframes as a bunch of columns - specifically, a bunch of these Series. As you'll see, not many genes pass this cutoff.
In [ ]:
(expression_logged > 10).sum(axis=1)
Now we can apply ANOTHER filter and find genes that are "present" (expression greater than 10) in at least 5 samples. We'll save this as the variable genes_of_interest. Notice that the output below isn't genes_of_interest but rather the list at the bottom. This is because what you see under a code cell is the output of the last thing you called. The "hash mark"/"number sign" "#" is called a comment character and makes Python ignore the rest of the line after it. To see genes_of_interest, "uncomment" its line by removing the hash sign, and comment out the list [1, 2, 3].
In [ ]:
genes_of_interest = (expression_logged > 10).sum(axis=1) >= 5
#genes_of_interest
[1, 2, 3]
Now we have some genes that we want to use - how do you pick just those? This is also called "subsetting," and in pandas it has the technical name "indexing." In pandas, to get the rows (genes) you want, using either their names (gene symbols) or a boolean matrix, you use .loc[rows_you_want]. Check it out below.
In [ ]:
expression_filtered = expression_logged.loc[genes_of_interest]
print(expression_filtered.shape) # shows (nrows, ncols) - like in manhattan you do the Street then the Avenue
expression_filtered.head()
Wow, our matrix is very small - 197 genes! We probably don't want to filter THAT much... I'd say a range of 5,000-15,000 genes after filtering is a good ballpark: not so big that it's impossible to work with, but not so small that you can't do any statistics.
Now we'll get closer to the expression data created by the paper. Remember that they filtered on genes that had expression greater than 1 in at least 3 single cells. We'll filter for expression greater than 1 in at least 3 samples for now - we'll get to the single-cell part in a bit. For now, we'll filter on all samples.
Create a dataframe called expression_filtered_by_all_samples that consists only of genes that have expression greater than 1 in at least 3 samples.
IndexingError: Unalignable boolean Series key provided
If you're getting this error, double-check your .sum() command. Did you remember to specify that you want the number of cells (columns) that express each gene (row)? Remember that .sum() by default gives you the sum over columns, but since genes are the rows ... how do you get the sum over rows?
In [ ]:
# YOUR CODE HERE
print(expression_filtered_by_all_samples.shape)
expression_filtered_by_all_samples.head()
In [ ]:
genes_of_interest = (expression_logged > 1).sum(axis=1) >= 3
expression_filtered_by_all_samples = expression_logged.loc[genes_of_interest]
print(expression_filtered_by_all_samples.shape)
expression_filtered_by_all_samples.head()
Just for fun, let's see how the distributions in our expression matrix have changed. If you want to save the figure, you can do it like before:
In [ ]:
sns.boxplot(expression_filtered_by_all_samples)
# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_filtered_by_all_samples_boxplot.pdf')
In the next exercise, we'll get just the single cells, but first we're going to pull out just the pooled samples - which are conveniently labeled as "P#". We'll do this using a list comprehension: we'll create a new list based on the items in shalek2013_expression.columns and whether or not they start with the letter 'P'.
In Python, things in square brackets ([]) are lists unless indicated otherwise. We are using a list comprehension here instead of a map, because we only want a subset of the columns, rather than all of them.
In [ ]:
pooled_ids = [x for x in expression_logged.columns if x.startswith('P')]
pooled_ids
We'll access the columns we want using this bracket notation (note that this only works for columns, not rows):
In [ ]:
pooled = expression_logged[pooled_ids]
pooled.head()
We could do the same thing using .loc, but we would need to put a colon ":" in the "rows" position (the first one) to show that we want "all rows."
In [ ]:
expression_logged.loc[:, pooled_ids].head()
In [ ]:
# YOUR CODE HERE
print(singles.shape)
singles.head()
In [ ]:
single_ids = [x for x in expression_logged.columns if x.startswith('S')]
singles = expression_logged[single_ids]
print(singles.shape)
singles.head()
Now we'll actually do the filtering done by the paper. Using the singles dataframe you just created, get the genes that have expression greater than 1 in at least 3 single cells, and use that to filter expression_logged. Call this dataframe expression_filtered_by_singles.
In [ ]:
# YOUR CODE HERE
print(expression_filtered_by_singles.shape)
expression_filtered_by_singles.head()
In [ ]:
rows = (singles > 1).sum(axis=1) >= 3
expression_filtered_by_singles = expression_logged.loc[rows]
print(expression_filtered_by_singles.shape)
expression_filtered_by_singles.head()
Let's make a boxplot again to see how the data has changed.
In [ ]:
sns.boxplot(expression_filtered_by_singles)
fig = plt.gcf()
fig.savefig('expression_filtered_by_singles_boxplot.pdf')
This is much nicer because now we don't have so many zeros and each sample has a reasonable dynamic range.
You may be wondering: we did all this work to remove some zeros ... so what? Let's take a look at how this affects the relationships between samples using sns.jointplot from seaborn, which will plot a correlation scatterplot. This also calculates the Pearson correlation, a linear correlation metric.
Let's first do this on the unlogged data.
In [ ]:
sns.jointplot(shalek2013_expression['S1'], shalek2013_expression['S2'])
Pretty funky looking huh? That's why we logged it :)
Now let's try this on the logged data.
In [ ]:
sns.jointplot(expression_logged['S1'], expression_logged['S2'])
Hmm, our Pearson correlation increased from 0.62 to 0.64. Why could that be?
Let's look at this same plot using the filtered data.
In [ ]:
sns.jointplot(expression_filtered_by_singles['S1'], expression_filtered_by_singles['S2'])
YOUR ANSWER HERE