Downloading public data

Something you'll often want to do is compare your results to papers that came before you. Today we'll go through how to find these data and how to analyze them.


1. Find the database and accession codes

At the end of most recent papers, there's a section called "Accession Codes" or "Accession Numbers," which lists a unique letter-and-number combination identifying the deposited data.

In the US, the Gene Expression Omnibus (GEO) is a website funded by the NIH to store the expression data associated with papers. Many journals require you to submit your data to GEO to be able to publish.

[Figure: example data accession section from a Cell paper]

[Figure: example data accession section from a Nature Biotech paper]

Let's do this for the Macosko2015 paper.

Note: For some older papers (pre-2014), the accession code may not be in the PDF version of the paper but only in the online version. What I usually do then is search for the title of the paper and go to the journal website.

For your homework, you'll need to find another dataset to use, and the expression matrix you want may not be in a database but rather posted as supplementary data on the journal's website.

  • What database was the data deposited to?
  • What is its accession number?

2. Go to the data in the database

If you search for the database and the accession number, the first result will usually be the database with the paper info and the deposited data! Below is an example search for "Array Express E-MTAB-2805."

[Screenshot: the GEO overview page for the Macosko2015 paper]

3. Find the gene expression matrix

These days, many papers provide a processed expression matrix directly in the accession database that you can use. Luckily for us, that's exactly what the authors of the Macosko 2015 dataset did. If you look at the bottom of the page, there's a HUGE (150GB!!) supplementary file called GSE63473_RAW.tar. We don't want all of that, so click "(custom)" to get a menu of all the files in the archive (.tar is a format that bundles many files into one),

and one of them is called "GSM1626793_P14Retina_1.digital_expression.txt.gz".

  1. Download GSM1626793_P14Retina_1.digital_expression.txt.gz to your computer (or fetch it programmatically, as sketched after the folder tree below)
  2. Move this file into the data/00_original folder:
cshl-singlecell-2017/
├── LICENSE
├── README.md
├── data/00_original <---- Put the GSM1626793_P14Retina_1.digital_expression.txt.gz file into this folder
├── environment_in_jupyter_notebook_kernels.png
├── notebooks
├── robust-pca
└── tshirt
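
If you'd rather script the download, here's a minimal sketch using only the Python standard library. The URL follows GEO's usual pattern for supplementary files, but it's an assumption here, so double-check it against the download link on the GEO page:


In [ ]:
import os
import urllib.request

# Assumed GEO supplementary-file URL -- verify against the GEO page before running
url = ('https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1626nnn/GSM1626793/'
       'suppl/GSM1626793_P14Retina_1.digital_expression.txt.gz')

# Save into the data/00_original folder shown in the tree above
# (assumes you're running this from inside the notebooks/ folder)
destination = os.path.join('..', 'data', '00_original',
                           'GSM1626793_P14Retina_1.digital_expression.txt.gz')
urllib.request.urlretrieve(url, destination)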

All the "Series" file formats contain the same information in different formats. I find the matrix one is the easiest to understand.

Open the "Series Matrix" in Excel (or equivalent) on your laptop, and look at the format and what's described. What line does the actual matrix of metadata start? You can find it where it says in the first column ,"!!Sample_title." It's after an empty line.

4. Reading in the data file

To read the gene expression matrix, we'll use "pandas," a Python package whose name comes from "panel data analysis" (as in panels of data). It's a fantastic library for working with dataframes and is Python's answer to R's dataframes. We'll take this opportunity to import ALL of the Python libraries that we'll use today.

We'll be using several additional libraries in Python:

  1. matplotlib - The base plotting library in Python.
  2. numpy - (pronounced "num-pie") The basis for most scientific packages. It's essentially a nice Python interface to C code, so it's very fast.
  3. pandas - "DataFrames in Python" (like R's nice dataframes). They're a super convenient structure built on numpy, so they're fast, and you can do convenient things like calculate mean and variance very easily.
  4. scipy - (pronounced "sigh-pie") "Scientific Python" - Contains statistical methods and calculations.
  5. seaborn - Statistical plotting library. To be completely honest, R's plotting and graphics capabilities are much better than Python's. However, Python is a really nice language to learn and use, it's very memory efficient, it can be parallelized well, and it has a very robust machine learning library, scikit-learn, with a nice and consistent interface. seaborn is Python's answer to ggplot2 (a very popular R plotting library): it tries to make plotting in Python nicer looking and statistical plots easier to do.

We'll read in the data using pandas and look at the first 5 rows of the dataframe with the dataframe-specific function .head(). Whenever I read a new table or modify a dataframe, I ALWAYS look at it to make sure it was correctly imported and read in, and I want you to get into the same habit.


In [6]:
# Import libraries that are standard in Python first
import os

# Alphabetical order is standard
# We're doing "import superlongname as abbrev" for 
# our laziness - this way we don't have to type out the whole thing each time.

# Python plotting library
import matplotlib.pyplot as plt

# Numerical python library (pronounced "num-pie")
import numpy as np

# Dataframes in Python
import pandas as pd

# Statistical plotting library we'll use
import seaborn as sns

# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline

# --- Read the data table ---

# Glue together the folders and filename that we need
filename = os.path.join('..', 'data', '00_original', 'GSM1626793_P14Retina_1.digital_expression.txt.gz')
filename


Out[6]:
'../data/00_original/GSM1626793_P14Retina_1.digital_expression.txt.gz'

In [7]:
# We'll use "read_table" because the data is separated by tabs, 
# but we could also use 'read_csv' for comma-separated data
expression = pd.read_table(filename,
                               
                           # Sets the first (Python starts counting from 0 not 1) column as the row names
                           index_col=0, 
 
                           # Tells pandas to decompress the gzipped file
                           compression='gzip')
expression.head()


Out[7]:
GGCCGCAGTCCG CTTGTGCGGGAA GCGCAACTGCTC GATTGGGAGGCA CCTCCTAGTTGG AGTCAAGCCCTC GTGCCGCCTCTC CCTGTGACACAC AATCTCGTTAAT GATTTCCTCTGA ... GATTTAATGGTA TGTAAGGATCCG GAGTGGCTTGAT GCATCTTTCAGG ACACGAGTTTGG CACCCAGTTTCG CCTGGAGAGTTT TCTTCACTCTTA GCCGTCTTACTA GACCAAACTAAT
gene
10:100015630-100100413:Kitl 0 0 1 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
10:100443902-100487350:Tmtc3 3 0 0 0 2 0 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
10:100488289-100573655:Cep290 1 3 0 2 1 18 10 3 4 3 ... 4 0 3 0 0 0 0 2 0 1
10:100572274-100589259:4930430F08Rik 2 1 2 0 1 1 0 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
10:100592386-100618391:1700017N19Rik 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 6600 columns

That's kind of annoying ... we don't see all the samples.

The dataframe has 6,600 columns, but pandas by default displays a maximum of 20, so let's change the setting so we can see more of the samples:


In [8]:
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
expression.head()


Out[8]:
GGCCGCAGTCCG CTTGTGCGGGAA GCGCAACTGCTC GATTGGGAGGCA CCTCCTAGTTGG AGTCAAGCCCTC GTGCCGCCTCTC CCTGTGACACAC AATCTCGTTAAT GATTTCCTCTGA TGTCCATGCGTA GAAGGCTGGAAC TTCCGGCTGTAC TCGACTTCTTCT TGTGGATCTGCT GGGCCACGATCA GATTACTGAGCT CGAGATATAGCA GGGTGTCAGTGG TGCGAGAGCTTG GCTTCTTGTCCT GTTTATATGCGC TCTTCACTGGCT CTCCACGCCTTT ATCTTTCCCGTG ... GCATGCGTGGGT AACGAGGCATCC CGTGCCATAAAT GTAATGGTAATG TGCTCTCAGCCC CAACCGGGAAGG CCGACAAGTACC AACCATTTTTAC TGCGAGCAGCGA TGATCTATATAC CCAGTTATGGGT TATGGGGGCATA GATTCGTAATGA TTGGTCCTAAGA AATTGTTAGGCC GATTTAATGGTA TGTAAGGATCCG GAGTGGCTTGAT GCATCTTTCAGG ACACGAGTTTGG CACCCAGTTTCG CCTGGAGAGTTT TCTTCACTCTTA GCCGTCTTACTA GACCAAACTAAT
gene
10:100015630-100100413:Kitl 0 0 1 0 0 0 0 1 0 0 1 3 0 0 2 0 2 1 0 0 2 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10:100443902-100487350:Tmtc3 3 0 0 0 2 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 2 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10:100488289-100573655:Cep290 1 3 0 2 1 18 10 3 4 3 0 3 1 3 1 0 5 4 0 10 1 3 0 1 5 ... 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 4 0 3 0 0 0 0 2 0 1
10:100572274-100589259:4930430F08Rik 2 1 2 0 1 1 0 1 1 1 2 0 0 1 1 1 0 1 0 1 0 0 0 2 0 ... 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10:100592386-100618391:1700017N19Rik 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 7 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 6600 columns

Now we can see more of the samples!

Let's take a look at the full size of the matrix with .shape:


In [ ]:
expression.shape

Wow, ~28k rows! Those must be the genes, while the 6,600 columns are the single-cell samples. We'll do some filtering in the next few steps.

5. Reading in the metadata

From here on, we'll work with the Shalek 2013 dataset (GEO accession GSE41265), which has 18 single cells (S1-S18) and 3 pooled samples (P1-P3) as its columns. Let's start by reading in its "Series Matrix" metadata file, assuming it's been downloaded from GEO into your Downloads folder.


In [ ]:
# skiprows=33 skips the series-level metadata, so the table
# starts at the "!Sample_title" row we found above
shalek2013_metadata = pd.read_table('~/Downloads/GSE41265_series_matrix.txt', 
                                    skiprows=33, 
                                    index_col=0)
shalek2013_metadata
shalek2013_metadata

Let's transpose this matrix so the samples are the rows, and the features are the columns. We'll do that with .T


In [ ]:
shalek2013_metadata = shalek2013_metadata.T
shalek2013_metadata

Now we'll do some mild data cleaning. Notice that the column names begin with an exclamation point, so let's get rid of that. In programming, text kept between quotes is called a "string." Let's talk about the string method .strip(), which removes the given characters from the beginning and end of a string (but not the middle). For example, let's take the string "Whoooo!!!!!!!"


In [ ]:
"Whoooo!!!!!!!"

Now let's remove the exclamation points:


In [ ]:
'Whoooo!!!!!!!'.strip('!')

Exercise 1: Stripping strings

What happens if you try to remove the 'o's?


In [ ]:
# YOUR CODE HERE


In [ ]:
'Whoooo!!!!!!!'.strip('o')

We can access the column names with dataframe.columns, like below:


In [ ]:
shalek2013_metadata.columns

We can map the stripping function to every item of the columns. In Python, the square brackets ([ and ]) show that we're making a list. What we're doing below is called a "list comprehension."


In [ ]:
[x.strip('!') for x in shalek2013_metadata.columns]

In pandas, we can do the same thing by map-ping a lambda, which is a small, anonymous function that does one thing. It's called "anonymous" because it doesn't have a name. map runs the function on every element of the columns.


In [ ]:
shalek2013_metadata.columns.map(lambda x: x.strip('!'))

The above lambda is the same as if we had written a named function called remove_exclamation, as below.


In [ ]:
def remove_exclamation(x):
    return x.strip('!')

shalek2013_metadata.columns.map(remove_exclamation)

Now we can assign the new column names to our matrix:


In [ ]:
shalek2013_metadata.columns = shalek2013_metadata.columns.map(lambda x: x.strip('!'))
shalek2013_metadata.head()

Okay, now we're ready to do some analysis!
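
Note: the analysis below uses the Shalek 2013 expression matrix, shalek2013_expression. If you don't already have it loaded from a previous notebook, here's a sketch of how you might read it in; the supplementary filename (GSE41265_allGenesTPM.txt.gz) is our assumption, so double-check it on the GSE41265 page:


In [ ]:
# Assumed supplementary file name from GSE41265 -- verify on the GEO page
shalek2013_expression = pd.read_table('~/Downloads/GSE41265_allGenesTPM.txt.gz',
                                      compression='gzip',
                                      index_col=0)
shalek2013_expression.head()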

We've looked at the top of the dataframe by using head(). By default, this shows the first 5 rows.


In [ ]:
shalek2013_expression.head()

To specify a certain number of rows, put a number between the parentheses.


In [ ]:
shalek2013_expression.head(8)

Exercise 2: using .head()

Show the first 17 rows of shalek2013_expression


In [ ]:
# YOUR CODE HERE


In [ ]:
shalek2013_expression.head(17)

Let's get a sense of this data by plotting the distributions using boxplot from seaborn. To save the output, we'll need to get access to the current figure, which we'll save to a variable using plt.gcf(). Then we'll save the figure with fig.savefig("filename.pdf"). You can use other extensions (e.g. ".png" or ".tiff") and it'll automatically save in that format.


In [ ]:
sns.boxplot(shalek2013_expression)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('shalek2013_expression_boxplot.pdf')

Notice the 140,000 maximum ... oh right, we have expression data and the scales are enormous. Let's add 1 to all values and take the log2 of the data. We add one because log(0) is undefined, and this way all our logged values start from zero too. This "$\log_2(TPM + 1)$" is a very common transformation of expression data that makes it easier to analyze.


In [ ]:
expression_logged = np.log2(shalek2013_expression+1)
expression_logged.head()

In [ ]:
sns.boxplot(expression_logged)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_logged_boxplot.pdf')

Exercise 3: Interpreting distributions

Now that these are more or less on the same scale ...

Q: What do you notice about the pooled samples (P1, P2, P3) that is different from the single cells?

YOUR ANSWER HERE

Filtering expression data

Seems like a lot of genes are near zero, which means we need to filter our genes.

We can ask which genes have log2 expression values less than 10 (weird example, I know - stay with me). This creates a dataframe of boolean True/False values.


In [ ]:
expression_logged < 10

What's nice about booleans is that False is 0 and True is 1, so we can sum them to count the number of "Trues." This is a simple, clever way to filter the data on a count. We could use this boolean dataframe to filter our original dataframe, but then we lose information: wherever the condition is False (values of 10 or greater), pandas fills in a "not a number" - "NaN."


In [ ]:
expression_at_most_10 = expression_logged[expression_logged < 10]
print(expression_at_most_10.shape)
expression_at_most_10.head()

Exercise 4: Crude filtering on expression data

Create a dataframe called "expression_greater_than_5" which contains only values that are greater than 5 from expression_logged.


In [ ]:
# YOUR CODE HERE


In [ ]:
expression_greater_than_5 = expression_logged[expression_logged > 5]
print(expression_greater_than_5.shape)
expression_greater_than_5.head()

The crude filtering above is okay, but we're smarter than that. We want to use the filtering in the paper:

... discarded genes that were not appreciably expressed (transcripts per million (TPM) > 1) in at least three individual cells, retaining 6,313 genes for further analysis.

We want to do THAT, but first we need a couple more concepts. The first one is summing booleans.

A smarter way to filter

Remember that booleans are really 0s (False) and 1s (True)? This turns out to be VERY convenient and we can use this concept in clever ways.

We can use .sum() on a boolean matrix to get the number of genes with expression greater than 10 for each sample:


In [ ]:
(expression_logged > 10).sum()

pandas is column-oriented, so by default it will give you a sum for each column. But we want a sum for each row. How do we do that?

We can sum the boolean matrix we created with "expression_logged > 10" along axis=1 (along the samples) to get, for each gene, how many samples have expression greater than 10. In pandas, the result is called a "Series" because it has only one dimension - its length. Internally, pandas stores dataframes as a bunch of columns - specifically, these Seriesssssss.

This turns out to be not that many.


In [ ]:
(expression_logged > 10).sum(axis=1)

Now we can apply ANOTHER filter and find genes that are "present" (expression greater than 10) in at least 5 samples. We'll save this as the variable genes_of_interest. Notice that the output below isn't genes_of_interest but rather the list at the bottom. This is because what you see under a code cell is the output of the last thing you called. The "hash mark"/"number sign" "#" is called a comment character; Python ignores the rest of a line after it.

Exercise 5: Commenting and uncommenting

To see genes_of_interest, "uncomment" the line by removing the hash sign, and commenting out the list [1, 2, 3].


In [ ]:
genes_of_interest = (expression_logged > 10).sum(axis=1) >= 5
# genes_of_interest
[1, 2, 3]

In [ ]:
assert isinstance(_, pd.Series)

Getting only rows that you want (aka subsetting)

Now we have some genes that we want to use - how do you pick just those? This is also called "subsetting," and in pandas it has the technical name indexing.

In pandas, to get just the rows (genes) you want, using their names (gene symbols) or a boolean Series, you use .loc[rows_you_want]. Check it out below.


In [ ]:
expression_filtered = expression_logged.loc[genes_of_interest]
print(expression_filtered.shape)  # shows (nrows, ncols) - like in Manhattan, you give the Street and then the Avenue
expression_filtered.head()

Wow, our matrix is very small - 197 genes! We probably don't want to filter THAT much... I'd say a range of 5,000-15,000 genes after filtering is a good ballpark. Not too big so it's impossible to work with but not too small that you can't do any statistics.

We'll get closer to the expression data created by the paper. Remember that they filtered on genes that had expression greater than 1 in at least 3 single cells. For now, we'll filter for expression greater than 1 in at least 3 of all samples - we'll get to the single-cell-only filtering in a bit.

Exercise 6: Filtering on the presence of genes

Create a dataframe called expression_filtered_by_all_samples that consists only of genes that have expression greater than 1 in at least 3 samples.

Hint for IndexingError: Unalignable boolean Series key provided

If you're getting this error, double-check your .sum() command. Did you remember to specify that you want to get the number of cells (columns) that express each gene (row)? Remember that .sum() by default gives you the sum over columns, but since genes are the rows .... How do you get the sum over rows?


In [ ]:
# YOUR CODE HERE

print(expression_filtered_by_all_samples.shape)
expression_filtered_by_all_samples.head()
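
One possible solution (a sketch, following the crude-filtering pattern above):


In [ ]:
# For each gene (row), count how many samples have expression > 1,
# then keep the genes where that count is at least 3
expression_filtered_by_all_samples = expression_logged.loc[
    (expression_logged > 1).sum(axis=1) >= 3]

print(expression_filtered_by_all_samples.shape)
expression_filtered_by_all_samples.head()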

In [ ]:
assert expression_filtered_by_all_samples.shape == (9943, 21)

Just for fun, let's see how the distributions in our expression matrix have changed. If you want to save the figure, you can:


In [ ]:
sns.boxplot(expression_filtered_by_all_samples)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_filtered_by_all_samples_boxplot.pdf')

Discussion

  1. How did the gene expression distributions change? Why?
  2. Were the single and pooled samples' distributions affected differently? Why or why not?

Getting only the columns you want

In the next exercise, we'll get just the single cells

For the next step, we're going to pull out just the pooled samples - which are conveniently labeled as "P#". We'll do this using a list comprehension, which means we'll create a new list based on the items in expression_logged.columns and whether or not they start with the letter 'P'.

In Python, things in square brackets ([]) are lists unless indicated otherwise. We are using a list comprehension here instead of a map, because we only want a subset of the columns, rather than all of them.


In [ ]:
pooled_ids = [x for x in expression_logged.columns if x.startswith('P')]
pooled_ids

We'll access the columns we want using this bracket notation (note that this only works for columns, not rows)


In [ ]:
pooled = expression_logged[pooled_ids]
pooled.head()

We could do the same thing using .loc but we would need to put a colon ":" in the "rows" section (first place) to show that we want "all rows."


In [ ]:
expression_logged.loc[:, pooled_ids].head()

Exercise 7: Make a dataframe of only single samples

Use a list comprehension to make a list called single_ids that consists only of the single-cell ids, and use that list to subset expression_logged and create a dataframe called singles. (Hint - how are the single-cell ids different from the pooled ids?)


In [ ]:
# YOUR CODE HERE

print(singles.shape)
singles.head()
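
One possible solution (a sketch; the single cells are the columns whose ids start with "S"):


In [ ]:
# Single-cell sample ids start with "S" (e.g. S1, S2, ...),
# while pooled samples start with "P"
single_ids = [x for x in expression_logged.columns if x.startswith('S')]
singles = expression_logged[single_ids]

print(singles.shape)
singles.head()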

In [ ]:
assert singles.shape == (27723, 18)

Using two different dataframes for filtering

Exercise 8: Filter the full dataframe using the singles dataframe

Now we'll actually do the filtering done by the paper. Using the singles dataframe you just created, get the genes that have expression greater than 1 in at least 3 single cells, and use that to filter expression_logged. Call this dataframe expression_filtered_by_singles.


In [ ]:
# YOUR CODE HERE

print(expression_filtered_by_singles.shape)
expression_filtered_by_singles.head()
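
One possible solution (a sketch): count the expression in the singles dataframe, but use the resulting boolean Series to subset the rows of the full expression_logged:


In [ ]:
# Genes with expression > 1 in at least 3 single cells ...
genes_detected_in_singles = (singles > 1).sum(axis=1) >= 3

# ... used to filter the rows of the full dataframe (all 21 samples)
expression_filtered_by_singles = expression_logged.loc[genes_detected_in_singles]

print(expression_filtered_by_singles.shape)
expression_filtered_by_singles.head()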

In [ ]:
assert expression_filtered_by_singles.shape == (6312, 21)

Let's make a boxplot again to see how the data has changed.


In [ ]:
sns.boxplot(expression_filtered_by_singles)

fig = plt.gcf()
fig.savefig('expression_filtered_by_singles_boxplot.pdf')

This is much nicer because now we don't have so many zeros and each sample has a reasonable dynamic range.

Why did this filtering even matter?

You may be wondering: we did all this work to remove some zeros ... so what? Let's take a look at how this affects the relationships between samples using sns.jointplot from seaborn, which will plot a correlation scatterplot. This also calculates the Pearson correlation, a linear correlation metric.

Let's first do this on the unlogged data.


In [ ]:
sns.jointplot(shalek2013_expression['S1'], shalek2013_expression['S2'])

Pretty funky looking huh? That's why we logged it :)

Now let's try this on the logged data.


In [ ]:
sns.jointplot(expression_logged['S1'], expression_logged['S2'])

Hmm, our Pearson correlation increased from 0.62 to 0.64. Why could that be?

Let's look at this same plot using the filtered data.


In [ ]:
sns.jointplot(expression_filtered_by_singles['S1'], expression_filtered_by_singles['S2'])

And now our correlation went DOWN!? Why would that be?

Exercise 9: Discuss changes in correlation

Take 2-5 sentences to explain why the correlation changed between the different datasets.

YOUR ANSWER HERE