Downloading public data

Something you may want to do in the future is compare your results to those of papers that came before you. Today we'll go through how to find these data and how to analyze them.

Reading list

1. Find the database and accession codes

Most recent papers include a section near the end called "Accession Codes" or "Accession Numbers", which lists a uniquely identifying combination of letters and numbers for the deposited data.

In the US, the Gene Expression Omnibus (GEO) is a website funded by the NIH to store the expression data associated with papers. Many journals require you to submit your data to GEO to be able to publish.

Example data accession section from a Cell paper

Example data accession section from a Nature Biotech paper

Let's do this for the Shalek2013 paper.

Note: For some "older" papers (pre 2014), the accession code may not be on the PDF version of the paper but on the online version only. What I usually do then is search for the title of the paper and go to the journal website.

For your homework, you'll need to find another dataset to use. The expression matrix that you want may not be in a database, but rather posted as supplementary data on the journal's website.

  • What database was the data deposited to?
  • What is its accession number?

2. Go to the data in the database

If you search for the database and the accession number, the first result will usually be the database with the paper info and the deposited data! Below is an example search for "Array Express E-MTAB-2805."

Search for its database and accession number and you should get to a page that looks like this:

3. Find the gene expression matrix

Many recent papers provide a processed expression matrix in the accession database that you can use directly. Luckily for us, that's exactly what the authors of the Shalek 2013 dataset did. If you look at the bottom of the page, there's a table of Supplementary files, and one of them is called "GSE41265_allGenesTPM.txt.gz". The link below is the "(ftp)" link, used with the command "wget", which I think of as short for "web-get": it lets you download files from the internet on the command line.
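If you'd rather stay in Python than use "wget", here is a minimal sketch of the same idea. The helper names `geo_supplementary_url` and `download` are made up for this example, and the URL layout (series grouped into folders like "GSE41nnn" on NCBI's download server) is an assumption about how GEO organizes its files, so check the result against the "(ftp)" link on the GEO page before relying on it.

```python
import urllib.request
from pathlib import Path


def geo_supplementary_url(accession, filename):
    """Build a download URL for a GEO series supplementary file.

    ASSUMPTION: GEO groups series into folders where the last three
    digits of the accession are replaced by "nnn" (GSE41265 -> GSE41nnn).
    """
    folder = accession[:-3] + "nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{folder}/{accession}/suppl/{filename}"
    )


def download(url, destination):
    """Download url to destination unless the file is already there."""
    destination = Path(destination)
    if not destination.exists():
        urllib.request.urlretrieve(url, str(destination))
    return destination


url = geo_supplementary_url("GSE41265", "GSE41265_allGenesTPM.txt.gz")
print(url)

# Uncomment to actually fetch the file into the current folder:
# download(url, "GSE41265_allGenesTPM.txt.gz")
```

The download call is commented out so that running the cell only prints the URL; that way you can compare it to the link on the GEO page first.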

In addition to the gene expression file, we'll also look at the metadata in the "Series Matrix" file.

  1. Download the "Series Matrix" file to your laptop, and
  2. Download the "GSE41265_allGenesTPM.txt.gz" file.

All the "Series" file formats contain the same information in different formats. I find the matrix one is the easiest to understand.

Open the "Series Matrix" in Excel (or equivalent) on your laptop, and look at the format and what's described. On what line does the actual matrix of metadata start? You can find it where the first column says "!Sample_title". It's after an empty line.
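If you don't want to count lines by eye in Excel, a few lines of Python can find the start of the metadata matrix for you. This is only a sketch: `find_first_line` is a name invented here, and the tiny gzipped file it runs on is a stand-in written to a temporary folder, not the real series matrix. Point the function at your own downloaded file to get the real answer.

```python
import gzip
import os
import tempfile


def find_first_line(path, prefix="!Sample_title"):
    """Return the 0-based number of the first line starting with prefix.

    Returns None if no line in the gzipped text file starts with it.
    """
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line.startswith(prefix):
                return line_number
    return None


# Write a tiny stand-in file (contents made up for illustration)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_series_matrix.txt.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write('!Series_title\t"Demo series"\n')
    handle.write('!Series_geo_accession\t"GSE0"\n')
    handle.write("\n")
    handle.write('!Sample_title\t"Cell A"\t"Cell B"\n')

start = find_first_line(demo_path)
print(start)
```

Because the count starts at 0, the number returned is also the number of lines that come before the metadata matrix, which is the kind of value you can try for the `skiprows` argument when reading the file with pandas.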

Get the data the easy way:

Follow this link to jump directly to the GEO page for this data. Scroll down to the supplementary files at the bottom of the page and download the file called GSE41265_allGenesTPM.txt.gz.

We also need the link to the metadata. It is here. Download the file called GSE41265_series_matrix.txt.gz.

Where did those files go on your computer? Maybe you moved them somewhere. Figure out what the full paths of those files are, and we will read them in directly below.

4. Reading in the data file

To read the gene expression matrix, we'll use "pandas", a Python package whose name comes from "Panel Data Analysis" (as in panels of data). It's a fantastic library for working with dataframes, and is Python's answer to R's dataframes. We'll take this opportunity to import ALL of the Python libraries that we'll use today.

We'll be using several additional libraries in Python:

  1. matplotlib - This is the base plotting library in Python.
  2. numpy - (pronounced "num-pie") which is the basis for most scientific packages. It's basically a nice-looking Python interface to C code. It's very fast.
  3. pandas - This is the "DataFrames in Python" library (like R's nice dataframes). DataFrames are a super convenient form that's based on numpy, so they're fast. And you can do convenient things like calculate mean and variance very easily.
  4. scipy - (pronounced "sigh-pie") "Scientific Python" - Contains statistical methods and calculations.
  5. seaborn - Statistical plotting library. To be completely honest, R's plotting and graphics capabilities are much better than Python's. However, Python is a really nice language to learn and use, it's very memory efficient, can be parallelized well, and has a very robust machine learning library, scikit-learn, which has a very nice and consistent interface. Seaborn is Python's answer to ggplot2 (a very popular R library for plotting): it tries to make plots in Python nicer looking and statistical plots easier to do.
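To make the "mean and variance" point about pandas concrete, here is a minimal sketch on a made-up table. The gene names, sample names, and values are all invented for illustration; they are not from the Shalek2013 data.

```python
import pandas as pd

# Toy expression table: rows are genes, columns are samples (values made up)
toy = pd.DataFrame(
    {"sample_a": [0.0, 2.0, 4.0], "sample_b": [1.0, 1.0, 1.0]},
    index=["gene_1", "gene_2", "gene_3"],
)

# Mean and variance of each column (one number per sample)
column_means = toy.mean()
column_variances = toy.var()  # sample variance (ddof=1) is the pandas default

# Mean of each row (one number per gene)
row_means = toy.mean(axis=1)

print(column_means)
print(column_variances)
print(row_means)
```

By default these methods work down each column; passing `axis=1` works across each row instead.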

In [ ]:
# Alphabetical order is standard
# We're doing "import superlongname as abbrev" for our laziness - this way we don't have to type out the whole thing each time.

# Python plotting library
import matplotlib.pyplot as plt

# Numerical python library (pronounced "num-pie")
import numpy as np

# Dataframes in Python
import pandas as pd

# Statistical plotting library we'll use
import seaborn as sns

# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline

We'll read in the data using pandas and look at the first 5 rows of the dataframe with the dataframe-specific function .head(). Whenever I read a new table or modify a dataframe, I ALWAYS look at it to make sure it was correctly imported and read in, and I want you to get into the same habit.


In [62]:
# Read the data table
# You may need to change the path to the file (what's in quotes below) relative
# to where you downloaded the file and where this notebook is
shalek2013_expression = pd.read_table(
    '/home/ecwheele/cshl2017/GSE41265_allGenesTPM.txt.gz',

    # Sets the first (Python starts counting from 0 not 1) column as the row names
    index_col=0,

    # Tells pandas to decompress the gzipped file
    compression='gzip')

print(shalek2013_expression.shape)
shalek2013_expression.head()


(27723, 21)
Out[62]:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.019906 0.000000
AB338584 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
B3GAT2 0.000000 0.000000 0.023441 0.000000 0.000000 0.029378 0.000000 0.055452 0.000000 0.029448 0.024137 0.000000 0.000000 0.031654 0.000000 0.000000 0.000000 42.150208 0.680327 0.022996 0.110236
NPL 72.008590 0.000000 128.062012 0.095082 0.000000 0.000000 112.310234 104.329122 0.119230 0.000000 0.000000 0.000000 0.116802 0.104200 0.106188 0.229197 0.110582 0.000000 7.109356 6.727028 14.525447
T2 0.109249 0.172009 0.000000 0.000000 0.182703 0.076012 0.078698 0.000000 0.093698 0.076583 0.000000 0.693459 0.010137 0.081936 0.000000 0.000000 0.086879 0.068174 0.062063 0.000000 0.050605

That's kind of annoying ... we don't see all the samples.

We have 21 columns, but by default pandas shows a maximum of 20, so it skips single cell 11 (S11) in the middle. Let's change the setting so we can see ALL of the samples. We'll set it to 50 for good measure.


In [ ]:
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
shalek2013_expression.head()

Now we can see all the samples!
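Setting `pd.options.display.max_columns` changes the display for the rest of the session. If you only want a wider view for one table, pandas also has `pd.option_context`, which changes an option temporarily. Here is a small sketch on a stand-in dataframe (the zeros are made up; only the column names mimic our samples).

```python
import pandas as pd

# Stand-in table with the same 21 column names as the expression matrix
column_names = [f"S{i}" for i in range(1, 19)] + ["P1", "P2", "P3"]
demo = pd.DataFrame([[0.0] * len(column_names)], columns=column_names)

setting_before = pd.get_option("display.max_columns")

# Inside the "with" block, up to 50 columns are shown
with pd.option_context("display.max_columns", 50):
    setting_inside = pd.get_option("display.max_columns")
    print(demo)

# Outside the block, the previous setting is back
setting_after = pd.get_option("display.max_columns")
print(setting_before, setting_inside, setting_after)
```

This is handy when you want to peek at one wide table without changing how every other table in the notebook is displayed.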

Let's take a look at the full size of the matrix with .shape:


In [ ]:
shalek2013_expression.shape

Wow, ~28k rows! Those must be the genes, while the columns are the 18 single-cell samples and 3 pooled samples. We'll do some filtering in the next few steps.
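Rather than counting columns by eye, you can ask pandas which column names start with "S" (single cells) and which start with "P" (pooled samples). The sketch below uses a tiny stand-in table with made-up genes and values, and it assumes the naming pattern we saw above (S1 to S18, P1 to P3). The same two `startswith` lines should work on `shalek2013_expression.columns`.

```python
import pandas as pd

# Stand-in table: 2 made-up genes, same column names as the real matrix
column_names = [f"S{i}" for i in range(1, 19)] + ["P1", "P2", "P3"]
demo_expression = pd.DataFrame(
    0.0, index=["gene_1", "gene_2"], columns=column_names
)

# .shape is (number of rows, number of columns)
n_genes, n_samples = demo_expression.shape

# Boolean arrays marking which columns are single cells or pooled samples
is_single = demo_expression.columns.str.startswith("S")
is_pooled = demo_expression.columns.str.startswith("P")

print(n_genes, n_samples)
print(is_single.sum(), "single cells,", is_pooled.sum(), "pooled samples")
```

The boolean arrays can also be used to pull out just one group, for example `demo_expression.loc[:, is_single]`.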

5. Reading in the metadata


In [63]:
shalek2013_metadata = pd.read_table('/home/ecwheele/cshl2017/GSE41265_series_matrix.txt.gz',
                                    compression='gzip',
                                    skiprows=33, 
                                    index_col=0)
print(shalek2013_metadata.shape)
shalek2013_metadata


(49, 24)
Out[63]:
Single cell S1 Single cell S2 Single cell S3 Single cell S4 Single cell S5 Single cell S6 Single cell S7 Single cell S8 Single cell S9 Single cell S10 Single cell S11 Single cell S12 Single cell S13 Single cell S14 Single cell S15 Single cell S16 Single cell S17 Single cell S18 10,000 cell population P1 10,000 cell population P2 10,000 cell population P3 Molecular barcode single cell MB1 Molecular barcode single cell MB2 Molecular barcode single cell MB3
!Sample_title
!Sample_geo_accession GSM1012777 GSM1012778 GSM1012779 GSM1012780 GSM1012781 GSM1012782 GSM1012783 GSM1012784 GSM1012785 GSM1012786 GSM1012787 GSM1012788 GSM1012789 GSM1012790 GSM1012791 GSM1012792 GSM1012793 GSM1012794 GSM1012795 GSM1012796 GSM1012797 GSM1110889 GSM1110890 GSM1110891
!Sample_status Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013 Public on May 19 2013
!Sample_submission_date Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Oct 01 2012 Mar 29 2013 Mar 29 2013 Mar 29 2013
!Sample_last_update_date May 19 2013 May 19 2013 May 19 2013 May 19 2013 May 19 2013 May 19 2013 May 19 2013 May 19 2013 May 19 2013 May 19 2013 May 19 2013 Dec 23 2013 Dec 23 2013 Dec 23 2013 Dec 23 2013 Dec 23 2013 Dec 23 2013 Dec 23 2013 May 19 2013 May 19 2013 May 19 2013 May 21 2013 May 21 2013 May 21 2013
!Sample_type SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA SRA
!Sample_channel_count 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
!Sample_source_name_ch1 BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim) BMDC (4h LPS stim)
!Sample_organism_ch1 Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus
!Sample_characteristics_ch1 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6 strain: C57BL/6
!Sample_characteristics_ch1 cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ... cell type: Bone Marrow-derived Dendritic Cell ...
!Sample_characteristics_ch1 treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation treatment: LPS-stimulation
!Sample_characteristics_ch1 cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 1 cell cell count: 10,000 cells cell count: 10,000 cells cell count: 10,000 cells cell count: 1 cell cell count: 1 cell cell count: 1 cell
!Sample_characteristics_ch1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN protocol: molecular barcodes (MB) using a modi... protocol: molecular barcodes (MB) using a modi... protocol: molecular barcodes (MB) using a modi...
!Sample_growth_protocol_ch1 Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as... Cells were cultured and stimulated with LPS as...
!Sample_molecule_ch1 polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA polyA RNA
!Sample_extract_protocol_ch1 cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA synthesis and amplification: cDNA from each sample was prepared using the S... cDNA from each sample was prepared using the S... cDNA from each sample was prepared using the S...
!Sample_extract_protocol_ch1 We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... We used the SMARTer Ultra Low RNA Kit (Clontec... For the three molecular barcode (MB) libraries... For the three molecular barcode (MB) libraries... For the three molecular barcode (MB) libraries...
!Sample_extract_protocol_ch1 We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... We created Illumina sequencing libraries from ... NaN NaN NaN
!Sample_extract_protocol_ch1 cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: cDNA shearing and library construction: NaN NaN NaN
!Sample_extract_protocol_ch1 We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... We added the purification buffer (Clontech) to... NaN NaN NaN
!Sample_extract_protocol_ch1 We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... We prepared indexed paired-end libraries for I... NaN NaN NaN
!Sample_taxid_ch1 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090
!Sample_description S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19_10k S20_10k S21_10k MB1 MB2 MB3
!Sample_data_processing We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn... We created a Bowtie index based on the UCSC kn...
!Sample_data_processing Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter... Next, we ran RSEM v1.11 with default parameter...
!Sample_data_processing Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9 Genome_build: mm9
!Sample_data_processing Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a... Supplementary_files_format_and_content: File a...
!Sample_platform_id GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112 GPL13112
!Sample_contact_name Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija Rahul,,Satija
!Sample_contact_email rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org rsatija@nygenome.org
!Sample_contact_phone 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468 6177022468
!Sample_contact_laboratory Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab Satija Lab
!Sample_contact_institute New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center New York Genome Center
!Sample_contact_address 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas 101 Avenue of the Americas
!Sample_contact_city New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City New York City
!Sample_contact_state NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY NY
!Sample_contact_zip/postal_code 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013 10013
!Sample_contact_country USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA USA
!Sample_data_row_count 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
!Sample_instrument_model Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000 Illumina HiSeq 2000
!Sample_library_selection cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA cDNA
!Sample_library_source transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic transcriptomic
!Sample_library_strategy RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq RNA-Seq
!Sample_relation SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...
!Sample_relation BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam... BioSample: https://www.ncbi.nlm.nih.gov/biosam...
!Sample_supplementary_file_1 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta...
!series_matrix_table_begin NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ID_REF GSM1012777 GSM1012778 GSM1012779 GSM1012780 GSM1012781 GSM1012782 GSM1012783 GSM1012784 GSM1012785 GSM1012786 GSM1012787 GSM1012788 GSM1012789 GSM1012790 GSM1012791 GSM1012792 GSM1012793 GSM1012794 GSM1012795 GSM1012796 GSM1012797 GSM1110889 GSM1110890 GSM1110891
!series_matrix_table_end NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Let's transpose this matrix so that the samples are the rows and the features are the columns. We'll do that with the dataframe attribute .T


In [64]:
shalek2013_metadata = shalek2013_metadata.T
shalek2013_metadata


Out[64]:
!Sample_title !Sample_geo_accession !Sample_status !Sample_submission_date !Sample_last_update_date !Sample_type !Sample_channel_count !Sample_source_name_ch1 !Sample_organism_ch1 !Sample_characteristics_ch1 !Sample_characteristics_ch1 !Sample_characteristics_ch1 !Sample_characteristics_ch1 !Sample_characteristics_ch1 !Sample_growth_protocol_ch1 !Sample_molecule_ch1 !Sample_extract_protocol_ch1 !Sample_extract_protocol_ch1 !Sample_extract_protocol_ch1 !Sample_extract_protocol_ch1 !Sample_extract_protocol_ch1 !Sample_extract_protocol_ch1 !Sample_taxid_ch1 !Sample_description !Sample_data_processing !Sample_data_processing !Sample_data_processing !Sample_data_processing !Sample_platform_id !Sample_contact_name !Sample_contact_email !Sample_contact_phone !Sample_contact_laboratory !Sample_contact_institute !Sample_contact_address !Sample_contact_city !Sample_contact_state !Sample_contact_zip/postal_code !Sample_contact_country !Sample_data_row_count !Sample_instrument_model !Sample_library_selection !Sample_library_source !Sample_library_strategy !Sample_relation !Sample_relation !Sample_supplementary_file_1 !series_matrix_table_begin ID_REF !series_matrix_table_end
Single cell S1 GSM1012777 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S1 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012777 NaN
Single cell S2 GSM1012778 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S2 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012778 NaN
Single cell S3 GSM1012779 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S3 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012779 NaN
Single cell S4 GSM1012780 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S4 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012780 NaN
Single cell S5 GSM1012781 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S5 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012781 NaN
Single cell S6 GSM1012782 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S6 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012782 NaN
Single cell S7 GSM1012783 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S7 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012783 NaN
Single cell S8 GSM1012784 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S8 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012784 NaN
Single cell S9 GSM1012785 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S9 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012785 NaN
Single cell S10 GSM1012786 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S10 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012786 NaN
Single cell S11 GSM1012787 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S11 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012787 NaN
Single cell S12 GSM1012788 Public on May 19 2013 Oct 01 2012 Dec 23 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S12 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012788 NaN
Single cell S13 GSM1012789 Public on May 19 2013 Oct 01 2012 Dec 23 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S13 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012789 NaN
Single cell S14 GSM1012790 Public on May 19 2013 Oct 01 2012 Dec 23 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S14 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012790 NaN
Single cell S15 GSM1012791 Public on May 19 2013 Oct 01 2012 Dec 23 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S15 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012791 NaN
Single cell S16 GSM1012792 Public on May 19 2013 Oct 01 2012 Dec 23 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S16 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012792 NaN
Single cell S17 GSM1012793 Public on May 19 2013 Oct 01 2012 Dec 23 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S17 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012793 NaN
Single cell S18 GSM1012794 Public on May 19 2013 Oct 01 2012 Dec 23 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S18 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012794 NaN
10,000 cell population P1 GSM1012795 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 10,000 cells NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S19_10k We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012795 NaN
10,000 cell population P2 GSM1012796 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 10,000 cells NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S20_10k We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012796 NaN
10,000 cell population P3 GSM1012797 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 10,000 cells NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S21_10k We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012797 NaN
Molecular barcode single cell MB1 GSM1110889 Public on May 19 2013 Mar 29 2013 May 21 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell protocol: molecular barcodes (MB) using a modi... Cells were cultured and stimulated with LPS as... polyA RNA cDNA from each sample was prepared using the S... For the three molecular barcode (MB) libraries... NaN NaN NaN NaN 10090 MB1 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1110889 NaN
Molecular barcode single cell MB2 GSM1110890 Public on May 19 2013 Mar 29 2013 May 21 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell protocol: molecular barcodes (MB) using a modi... Cells were cultured and stimulated with LPS as... polyA RNA cDNA from each sample was prepared using the S... For the three molecular barcode (MB) libraries... NaN NaN NaN NaN 10090 MB2 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1110890 NaN
Molecular barcode single cell MB3 GSM1110891 Public on May 19 2013 Mar 29 2013 May 21 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell protocol: molecular barcodes (MB) using a modi... Cells were cultured and stimulated with LPS as... polyA RNA cDNA from each sample was prepared using the S... For the three molecular barcode (MB) libraries... NaN NaN NaN NaN 10090 MB3 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1110891 NaN
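
To see what .T does at a smaller scale, here is a minimal sketch on a toy table. The sample names GSM_A and GSM_B and the values in the table are made up for illustration; they are not taken from the Shalek2013 data.

```python
import pandas as pd

# Hypothetical toy table in the same orientation as the series matrix file:
# one row per metadata field, one column per sample.
toy = pd.DataFrame(
    {
        "GSM_A": ["Single cell", "Mus musculus", "RNA-Seq"],
        "GSM_B": ["Population", "Mus musculus", "RNA-Seq"],
    },
    index=["!Sample_title", "!Sample_organism_ch1", "!Sample_library_strategy"],
)

# .T swaps rows and columns, so each sample becomes a row
# and each metadata field becomes a column.
toy_t = toy.T

print(toy.shape)    # 3 fields x 2 samples
print(toy_t.shape)  # 2 samples x 3 fields
print(toy_t)
```

Note that .T returns a new dataframe, which is why the cell above assigns the result back to shalek2013_metadata.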

Now we'll do some mild data cleaning. Notice that the column names start with an exclamation point, so let's get rid of that. In programming, text written between quotes is called a "string." Let's talk about the string method .strip(). Given one or more characters, it removes them from the outer edges (the beginning and the end) of the string and leaves everything in the middle alone. For example, let's take the string "Whoooo!!!!!!!"


In [65]:
"Whoooo!!!!!!!"


Out[65]:
'Whoooo!!!!!!!'

Now let's remove the exclamation points:


In [66]:
'Whoooo!!!!!!!'.strip('!')


Out[66]:
'Whoooo'

Exercise 1: Stripping strings

What happens if you try to remove the 'o's?


In [67]:
# YOUR CODE HERE


In [68]:
'Whoooo!!!!!!!'.strip('o')


Out[68]:
'Whoooo!!!!!!!'

Nothing changed, because .strip() only removes characters from the outer edges of the string, and the 'o's are in the middle. To remove a character wherever it appears, use .replace() instead:

In [69]:
'Whoooo!!!!!!!'.replace("o","")


Out[69]:
'Wh!!!!!!!'
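
To make the difference concrete, the sketch below runs the edge-trimming methods and .replace() side by side on the same string. .rstrip() is a standard Python string method that this tutorial doesn't otherwise use; the variable names are only for illustration.

```python
s = "Whoooo!!!!!!!"

# .strip(chars) trims from both ends and stops on each side at the
# first character that is not in chars.
both_ends = s.strip("!")

# 'o' is not at either edge of the string, so nothing is removed.
no_change = s.strip("o")

# .rstrip() trims only the right-hand end.
right_only = s.rstrip("!")

# chars is treated as a set of characters, not as a literal suffix:
# the '!'s are trimmed first, which exposes the 'o's to trimming too.
as_set = s.strip("o!")

# .replace() works anywhere in the string, not just at the edges.
anywhere = s.replace("o", "")

print(both_ends, no_change, right_only, as_set, anywhere)
```

For the column names, .strip('!') is the safer choice because it only touches a leading or trailing exclamation point and would leave any '!' inside a name alone.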

We can access the column names with dataframe.columns, like below:


In [70]:
shalek2013_metadata.columns


Out[70]:
Index(['!Sample_geo_accession', '!Sample_status', '!Sample_submission_date',
       '!Sample_last_update_date', '!Sample_type', '!Sample_channel_count',
       '!Sample_source_name_ch1', '!Sample_organism_ch1',
       '!Sample_characteristics_ch1', '!Sample_characteristics_ch1',
       '!Sample_characteristics_ch1', '!Sample_characteristics_ch1',
       '!Sample_characteristics_ch1', '!Sample_growth_protocol_ch1',
       '!Sample_molecule_ch1', '!Sample_extract_protocol_ch1',
       '!Sample_extract_protocol_ch1', '!Sample_extract_protocol_ch1',
       '!Sample_extract_protocol_ch1', '!Sample_extract_protocol_ch1',
       '!Sample_extract_protocol_ch1', '!Sample_taxid_ch1',
       '!Sample_description', '!Sample_data_processing',
       '!Sample_data_processing', '!Sample_data_processing',
       '!Sample_data_processing', '!Sample_platform_id',
       '!Sample_contact_name', '!Sample_contact_email',
       '!Sample_contact_phone', '!Sample_contact_laboratory',
       '!Sample_contact_institute', '!Sample_contact_address',
       '!Sample_contact_city', '!Sample_contact_state',
       '!Sample_contact_zip/postal_code', '!Sample_contact_country',
       '!Sample_data_row_count', '!Sample_instrument_model',
       '!Sample_library_selection', '!Sample_library_source',
       '!Sample_library_strategy', '!Sample_relation', '!Sample_relation',
       '!Sample_supplementary_file_1', '!series_matrix_table_begin', 'ID_REF',
       '!series_matrix_table_end'],
      dtype='object', name='!Sample_title')

We can map the stripping function to every item of the columns. In Python, the square brackets ([ and ]) show that we're making a list. What we're doing below is called a "list comprehension."


In [71]:
[x.strip('!') for x in shalek2013_metadata.columns]


Out[71]:
['Sample_geo_accession',
 'Sample_status',
 'Sample_submission_date',
 'Sample_last_update_date',
 'Sample_type',
 'Sample_channel_count',
 'Sample_source_name_ch1',
 'Sample_organism_ch1',
 'Sample_characteristics_ch1',
 'Sample_characteristics_ch1',
 'Sample_characteristics_ch1',
 'Sample_characteristics_ch1',
 'Sample_characteristics_ch1',
 'Sample_growth_protocol_ch1',
 'Sample_molecule_ch1',
 'Sample_extract_protocol_ch1',
 'Sample_extract_protocol_ch1',
 'Sample_extract_protocol_ch1',
 'Sample_extract_protocol_ch1',
 'Sample_extract_protocol_ch1',
 'Sample_extract_protocol_ch1',
 'Sample_taxid_ch1',
 'Sample_description',
 'Sample_data_processing',
 'Sample_data_processing',
 'Sample_data_processing',
 'Sample_data_processing',
 'Sample_platform_id',
 'Sample_contact_name',
 'Sample_contact_email',
 'Sample_contact_phone',
 'Sample_contact_laboratory',
 'Sample_contact_institute',
 'Sample_contact_address',
 'Sample_contact_city',
 'Sample_contact_state',
 'Sample_contact_zip/postal_code',
 'Sample_contact_country',
 'Sample_data_row_count',
 'Sample_instrument_model',
 'Sample_library_selection',
 'Sample_library_source',
 'Sample_library_strategy',
 'Sample_relation',
 'Sample_relation',
 'Sample_supplementary_file_1',
 'series_matrix_table_begin',
 'ID_REF',
 'series_matrix_table_end']

In pandas, we can do the same thing by map-ping a lambda, which is a small, anonymous function that does one thing. It's called "anonymous" because it doesn't have a name. map runs the function on every element of the columns.


In [72]:
shalek2013_metadata.columns.map(lambda x: x.strip('!'))


Out[72]:
Index(['Sample_geo_accession', 'Sample_status', 'Sample_submission_date',
       'Sample_last_update_date', 'Sample_type', 'Sample_channel_count',
       'Sample_source_name_ch1', 'Sample_organism_ch1',
       'Sample_characteristics_ch1', 'Sample_characteristics_ch1',
       'Sample_characteristics_ch1', 'Sample_characteristics_ch1',
       'Sample_characteristics_ch1', 'Sample_growth_protocol_ch1',
       'Sample_molecule_ch1', 'Sample_extract_protocol_ch1',
       'Sample_extract_protocol_ch1', 'Sample_extract_protocol_ch1',
       'Sample_extract_protocol_ch1', 'Sample_extract_protocol_ch1',
       'Sample_extract_protocol_ch1', 'Sample_taxid_ch1', 'Sample_description',
       'Sample_data_processing', 'Sample_data_processing',
       'Sample_data_processing', 'Sample_data_processing',
       'Sample_platform_id', 'Sample_contact_name', 'Sample_contact_email',
       'Sample_contact_phone', 'Sample_contact_laboratory',
       'Sample_contact_institute', 'Sample_contact_address',
       'Sample_contact_city', 'Sample_contact_state',
       'Sample_contact_zip/postal_code', 'Sample_contact_country',
       'Sample_data_row_count', 'Sample_instrument_model',
       'Sample_library_selection', 'Sample_library_source',
       'Sample_library_strategy', 'Sample_relation', 'Sample_relation',
       'Sample_supplementary_file_1', 'series_matrix_table_begin', 'ID_REF',
       'series_matrix_table_end'],
      dtype='object', name='!Sample_title')

The above lambda is the same as if we had written a named function called remove_exclamation, as below.


In [73]:
def remove_exclamation(x):
    return x.strip('!')

shalek2013_metadata.columns.map(remove_exclamation)


Out[73]:
Index(['Sample_geo_accession', 'Sample_status', 'Sample_submission_date',
       'Sample_last_update_date', 'Sample_type', 'Sample_channel_count',
       'Sample_source_name_ch1', 'Sample_organism_ch1',
       'Sample_characteristics_ch1', 'Sample_characteristics_ch1',
       'Sample_characteristics_ch1', 'Sample_characteristics_ch1',
       'Sample_characteristics_ch1', 'Sample_growth_protocol_ch1',
       'Sample_molecule_ch1', 'Sample_extract_protocol_ch1',
       'Sample_extract_protocol_ch1', 'Sample_extract_protocol_ch1',
       'Sample_extract_protocol_ch1', 'Sample_extract_protocol_ch1',
       'Sample_extract_protocol_ch1', 'Sample_taxid_ch1', 'Sample_description',
       'Sample_data_processing', 'Sample_data_processing',
       'Sample_data_processing', 'Sample_data_processing',
       'Sample_platform_id', 'Sample_contact_name', 'Sample_contact_email',
       'Sample_contact_phone', 'Sample_contact_laboratory',
       'Sample_contact_institute', 'Sample_contact_address',
       'Sample_contact_city', 'Sample_contact_state',
       'Sample_contact_zip/postal_code', 'Sample_contact_country',
       'Sample_data_row_count', 'Sample_instrument_model',
       'Sample_library_selection', 'Sample_library_source',
       'Sample_library_strategy', 'Sample_relation', 'Sample_relation',
       'Sample_supplementary_file_1', 'series_matrix_table_begin', 'ID_REF',
       'series_matrix_table_end'],
      dtype='object', name='!Sample_title')
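As a compact recap, the sketch below runs the three approaches side by side on a tiny made-up table (the dataframe toy and its column names are assumptions for illustration, not the real metadata). It also shows the pandas .str accessor, which is an alternative not used above that applies string methods to every element without writing a function.

```python
import pandas as pd

# Toy table with GEO-style column names (made up for illustration)
toy = pd.DataFrame([[1, 2, 3]],
                   columns=["!Sample_status", "!Sample_type", "ID_REF"])

# 1. List comprehension: returns a plain Python list
via_comprehension = [x.strip("!") for x in toy.columns]

# 2. map with a lambda: returns a pandas Index
via_lambda = toy.columns.map(lambda x: x.strip("!"))

# 3. Vectorized string method via the .str accessor: also returns an Index
via_str_accessor = toy.columns.str.strip("!")

print(via_comprehension)
print(list(via_lambda) == via_comprehension)
print(list(via_str_accessor) == via_comprehension)
```

All three give the same names; which one to use is mostly a matter of taste.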

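Now we can assign the new column names to our dataframe. Note that the label of the column index itself (!Sample_title) is stored separately as the name of the columns, so it keeps its exclamation point: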


In [74]:
shalek2013_metadata.columns = shalek2013_metadata.columns.map(lambda x: x.strip('!'))
shalek2013_metadata.head()


Out[74]:
!Sample_title Sample_geo_accession Sample_status Sample_submission_date Sample_last_update_date Sample_type Sample_channel_count Sample_source_name_ch1 Sample_organism_ch1 Sample_characteristics_ch1 Sample_characteristics_ch1 Sample_characteristics_ch1 Sample_characteristics_ch1 Sample_characteristics_ch1 Sample_growth_protocol_ch1 Sample_molecule_ch1 Sample_extract_protocol_ch1 Sample_extract_protocol_ch1 Sample_extract_protocol_ch1 Sample_extract_protocol_ch1 Sample_extract_protocol_ch1 Sample_extract_protocol_ch1 Sample_taxid_ch1 Sample_description Sample_data_processing Sample_data_processing Sample_data_processing Sample_data_processing Sample_platform_id Sample_contact_name Sample_contact_email Sample_contact_phone Sample_contact_laboratory Sample_contact_institute Sample_contact_address Sample_contact_city Sample_contact_state Sample_contact_zip/postal_code Sample_contact_country Sample_data_row_count Sample_instrument_model Sample_library_selection Sample_library_source Sample_library_strategy Sample_relation Sample_relation Sample_supplementary_file_1 series_matrix_table_begin ID_REF series_matrix_table_end
Single cell S1 GSM1012777 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S1 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012777 NaN
Single cell S2 GSM1012778 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S2 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012778 NaN
Single cell S3 GSM1012779 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S3 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012779 NaN
Single cell S4 GSM1012780 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S4 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012780 NaN
Single cell S5 GSM1012781 Public on May 19 2013 Oct 01 2012 May 19 2013 SRA 1 BMDC (4h LPS stim) Mus musculus strain: C57BL/6 cell type: Bone Marrow-derived Dendritic Cell ... treatment: LPS-stimulation cell count: 1 cell NaN Cells were cultured and stimulated with LPS as... polyA RNA cDNA synthesis and amplification: We used the SMARTer Ultra Low RNA Kit (Clontec... We created Illumina sequencing libraries from ... cDNA shearing and library construction: We added the purification buffer (Clontech) to... We prepared indexed paired-end libraries for I... 10090 S5 We created a Bowtie index based on the UCSC kn... Next, we ran RSEM v1.11 with default parameter... Genome_build: mm9 Supplementary_files_format_and_content: File a... GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468 Satija Lab New York Genome Center 101 Avenue of the Americas New York City NY 10013 USA 0 Illumina HiSeq 2000 cDNA transcriptomic RNA-Seq SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX... BioSample: https://www.ncbi.nlm.nih.gov/biosam... ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-insta... NaN GSM1012781 NaN

Okay, now we're ready to do some analysis!

We've looked at the top of the dataframe by using head(). By default, this shows the first 5 rows.


In [75]:
shalek2013_expression.head()


Out[75]:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.019906 0.000000
AB338584 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
B3GAT2 0.000000 0.000000 0.023441 0.000000 0.000000 0.029378 0.000000 0.055452 0.000000 0.029448 0.024137 0.000000 0.000000 0.031654 0.000000 0.000000 0.000000 42.150208 0.680327 0.022996 0.110236
NPL 72.008590 0.000000 128.062012 0.095082 0.000000 0.000000 112.310234 104.329122 0.119230 0.000000 0.000000 0.000000 0.116802 0.104200 0.106188 0.229197 0.110582 0.000000 7.109356 6.727028 14.525447
T2 0.109249 0.172009 0.000000 0.000000 0.182703 0.076012 0.078698 0.000000 0.093698 0.076583 0.000000 0.693459 0.010137 0.081936 0.000000 0.000000 0.086879 0.068174 0.062063 0.000000 0.050605

To specify a certain number of rows, put a number between the parentheses.


In [76]:
shalek2013_expression.head(8)


Out[76]:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.019906 0.000000
AB338584 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
B3GAT2 0.000000 0.000000 0.023441 0.000000 0.000000 0.029378 0.000000 0.055452 0.000000 0.029448 0.024137 0.000000 0.000000 0.031654 0.000000 0.000000 0.000000 42.150208 0.680327 0.022996 0.110236
NPL 72.008590 0.000000 128.062012 0.095082 0.000000 0.000000 112.310234 104.329122 0.119230 0.000000 0.000000 0.000000 0.116802 0.104200 0.106188 0.229197 0.110582 0.000000 7.109356 6.727028 14.525447
T2 0.109249 0.172009 0.000000 0.000000 0.182703 0.076012 0.078698 0.000000 0.093698 0.076583 0.000000 0.693459 0.010137 0.081936 0.000000 0.000000 0.086879 0.068174 0.062063 0.000000 0.050605
T 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
PDE10A 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.018610 0.011152
1700010I14RIK 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.806956 0.000000 0.000000

Exercise 2: using .head()

Show the first 17 rows of shalek2013_expression


In [77]:
# YOUR CODE HERE


In [78]:
shalek2013_expression.head(17)


Out[78]:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.019906 0.000000
AB338584 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
B3GAT2 0.000000 0.000000 0.023441 0.000000 0.000000 0.029378 0.000000 0.055452 0.000000 0.029448 0.024137 0.000000 0.000000 0.031654 0.000000 0.000000 0.000000 42.150208 0.680327 0.022996 0.110236
NPL 72.008590 0.000000 128.062012 0.095082 0.000000 0.000000 112.310234 104.329122 0.119230 0.000000 0.000000 0.000000 0.116802 0.104200 0.106188 0.229197 0.110582 0.000000 7.109356 6.727028 14.525447
T2 0.109249 0.172009 0.000000 0.000000 0.182703 0.076012 0.078698 0.000000 0.093698 0.076583 0.000000 0.693459 0.010137 0.081936 0.000000 0.000000 0.086879 0.068174 0.062063 0.000000 0.050605
T 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
PDE10A 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.018610 0.011152
1700010I14RIK 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.806956 0.000000 0.000000
6530411M01RIK 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
PABPC6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
AK019626 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
AK020722 0.000000 0.192712 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 74.367926 0.000000 0.000000 0.000000 0.000000 0.067214 0.064752 0.000000 0.554724 0.285452 0.529103
QK 153.234931 64.586547 45.892336 0.069079 26.273491 0.121005 40.690376 14.644237 0.143838 1.139030 0.935433 0.000000 28.841057 8.919043 0.351372 33.732921 7.898052 0.041772 46.899358 39.084183 38.324132
B930003M22RIK 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.183993 0.820872 0.298553
RGS8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
PACRG 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
AK038428 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

Let's get a sense of this data by plotting the distributions using boxplot from seaborn. To save the output, we'll need to get access to the current figure and save it to a variable using plt.gcf(). Then we'll save this figure with fig.savefig("filename.pdf"). You can use other extensions (e.g. ".png", ".tiff") and it'll automatically save in that format.


In [79]:
# Pass the dataframe with the data= keyword, as the seaborn boxplot API expects
sns.boxplot(data=shalek2013_expression)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('shalek2013_expression_boxplot.pdf')

Notice the 140,000 maximum ... Oh right, we have expression data and the scales are enormous... Let's add 1 to all values and take the log2 of the data. We add one because log(0) is undefined, and since log2(1) = 0, a TPM of zero stays at zero after the transformation, so all our logged values start from zero too. This "$\log_2(TPM + 1)$" is a very common transformation of expression data that makes it easier to analyze.


In [80]:
expression_logged = np.log2(shalek2013_expression+1)
expression_logged.head()


Out[80]:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.028436 0.000000
AB338584 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
B3GAT2 0.000000 0.000000 0.033427 0.000000 0.000000 0.041774 0.000000 0.077861 0.000000 0.041871 0.034409 0.000000 0.000000 0.044960 0.000000 0.000000 0.000000 5.431296 0.748742 0.032801 0.150866
NPL 6.189994 0.000000 7.011921 0.131039 0.000000 0.000000 6.824134 6.718761 0.162507 0.000000 0.000000 0.000000 0.159374 0.143002 0.145597 0.297716 0.151316 0.000000 3.019587 2.949914 3.956563
T2 0.149583 0.228984 0.000000 0.000000 0.242088 0.105695 0.109290 0.000000 0.129215 0.106459 0.000000 0.759973 0.014551 0.113616 0.000000 0.000000 0.120192 0.095146 0.086869 0.000000 0.071221
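To see what the transformation does to individual values, here is a small worked sketch on made-up TPM values (toy_tpm is hypothetical, chosen so the results come out as whole numbers). It also shows that the transformation can be undone with 2**x - 1.

```python
import numpy as np
import pandas as pd

# Toy TPM values chosen so that TPM + 1 is a power of two
toy_tpm = pd.DataFrame({"S1": [0.0, 1.0, 3.0],
                        "S2": [7.0, 1023.0, 0.0]},
                       index=["geneA", "geneB", "geneC"])

# log2(TPM + 1): 0 -> 0, 1 -> 1, 3 -> 2, 7 -> 3, 1023 -> 10
toy_logged = np.log2(toy_tpm + 1)
print(toy_logged)

# The transformation is reversible: 2**x - 1 recovers the original TPM
toy_back = 2 ** toy_logged - 1
print(np.allclose(toy_back, toy_tpm))
```

Notice how a 1000-fold range in TPM becomes a range of only 10 on the log scale, which is why the boxplots become readable.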

In [81]:
# Pass the dataframe with the data= keyword, as the seaborn boxplot API expects
sns.boxplot(data=expression_logged)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_logged_boxplot.pdf')

Exercise 3: Interpreting distributions

Now that these are more on the same scale ...

Q: What do you notice about the pooled samples (P1, P2, P3) that is different from the single cells?

YOUR ANSWER HERE

Filtering expression data

Seems like a lot of genes are near zero, which means we need to filter our genes.

We can ask which genes have log2 expression values less than 2 (weird example I know - stay with me). This creates a dataframe of boolean True/False values. Note that "<" means strictly less than, even though the variable below is named at_most_2.


In [83]:
at_most_2 = expression_logged < 2
at_most_2


Out[83]:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 P1 P2 P3
GENE
XKR4 True True True True True True True True True True True True True True True True True True True True True
AB338584 True True True True True True True True True True True True True True True True True True True True True
B3GAT2 True True True True True True True True True True True True True True True True True False True True True
NPL False True False True True True False False True True True True True True True True True True False False False
T2 True True True True True True True True True True True True True True True True True True True True True
T True True True True True True True True True True True True True True True True True True True True True
PDE10A True True True True True True True True True True True True True True True True True True True True True
1700010I14RIK True True True True True True True True True True True True True True True True True True True True True
6530411M01RIK True True True True True True True True True True True True True True True True True True True True True
PABPC6 True True True True True True True True True True True True True True True True True True True True True
AK019626 True True True True True True True True True True True True True True True True True True True True True
AK020722 True True True True True True True True True True False True True True True True True True True True True
QK False False False True False True False False True True True True False False True False False True False False False
B930003M22RIK True True True True True True True True True True True True True True True True True True True True True
RGS8 True True True True True True True True True True True True True True True True True True True True True
PACRG True True True True True True True True True True True True True True True True True True True True True
AK038428 True True True True True True True True True True True True True True True True True True True True True
AK163153 True False True True True True True True True True True False False True True False True True True True True
PARK2 True True True True True True True True True True True False True True True True True True True True True
AK080902 True True True True True True True True True True True True True True True True True True True True True
AGPAT4 True False False True True False True False False True False True True False True True False True False False False
MAP3K4 True True True True True True True True True True True True True True True True True True True True True
AK029100 True True True True True True True True True True True True True True True True True True True True True
PLG True True True True True True True True True True True True True True True True True True True True True
SLC22A3 True True True True True True True True True True True True True True True True True True True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
DYNLT1C False True True True True False False False False True True True True False True False True False False False False
AK178082 True True True True True True True True True True True True True True True True True True True True True
TMEM181C-PS True True True True True True True True True True True False True True True True True True True True True
EZR False True True False True False True True False True True True True False False False True False False False False
AK037830 True True True True True True True True True True True True True True True True True True True True True
MIR692-1 False False False False False False False False False False False True False False False False False False False False False
AK007238 True True True True True True True True True True True True True True True True True True True True True
RSPH3B True False False True True True True False True False True False False False True False False True False False False
TAGAP1 True True True False False False True True True False True True True True True True True True False False False
1700012A16RIK True True True True True True True True True True True True True True True True True True True True True
RNASET2A False True True False True False False False False False False True True False True True True False False False False
GM1604B False True True True True True True True True True True True True True True True True True True False True
RPS6KA2 True True False False True False False False True True False True True True False False True True False False False
TCP10B True True True True True True True True True True True True True True True True True True True True True
GM9992 True True True True True True True True True False True True True True True True True True True True True
AK085062 True True True True True True True True True True True True True True True True True True True True True
DHX9 True True True True True True True False True False True True True False True False True False False False False
RNASET2B False True True True True False False True False False True True True False False False True False False False False
FGFR1OP True True True True False True False False False False False True True False False True True True False False False
CCR6 True True True True True True True True True True True True True True True True True True True True True
BRP44L False False False False False False False False False False False False False False False False False False False False False
AK014435 True True True True True True True True True True True True True True True True True True True True True
AK015714 True True True True True True True True True True True True True False True True True True True True True
SFT2D1 False False True True False True False False False True True True False True True True False False False False False
PRR18 True True True True True True True True True True True True True True True True True True True True True

27723 rows × 21 columns

What's nice about booleans is that False is 0 and True is 1, so we can sum to get the number of "Trues." This is a simple, clever way that we can filter on a count for the data. We could use this boolean dataframe to filter our original dataframe directly, but then we lose information: the result keeps its original shape, and every value that is not less than 2 (that is, greater than or equal to 2) is replaced with a "not a number" - "NaN."


In [ ]:
expression_at_most_2 = expression_logged[expression_logged < 2]
print(expression_at_most_2.shape)
expression_at_most_2.head()
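The masking behavior can be seen on a tiny made-up table (toy is hypothetical). In this sketch, indexing a dataframe with a boolean dataframe keeps the shape and blanks out the False positions with NaN.

```python
import pandas as pd

# Toy logged expression values (made up for illustration)
toy = pd.DataFrame({"S1": [0.5, 3.0],
                    "S2": [2.0, 1.5]},
                   index=["geneA", "geneB"])

mask = toy < 2       # True where the value is strictly less than 2
masked = toy[mask]   # same shape as toy; False positions become NaN

n_missing = int(masked.isna().sum().sum())  # how many values were blanked out
print(masked)
print(masked.shape, n_missing)
```

The value 2.0 is not strictly less than 2, so it becomes NaN along with 3.0.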

Exercise 4: Crude filtering on expression data

Create a dataframe called "expression_greater_than_5" which contains only values that are greater than 5 from expression_logged.


In [ ]:
# YOUR CODE HERE


In [ ]:
expression_logged.head()

In [ ]:
expression_greater_than_5 = expression_logged[expression_logged > 5]
expression_greater_than_5.head()

The crude filtering above is okay, but we're smarter than that. We want to use the filtering in the paper:

... discarded genes that were not appreciably expressed (transcripts per million (TPM) > 1) in at least three individual cells, retaining 6,313 genes for further analysis.

We want to do THAT, but first we need a couple more concepts. The first one is summing booleans.

A smarter way to filter

Remember that booleans are really 0s (False) and 1s (True)? This turns out to be VERY convenient and we can use this concept in clever ways.

We can use .sum() on a boolean matrix to get the number of genes with expression greater than 10 for each sample:


In [ ]:
(expression_logged > 10).sum()

pandas is column-oriented and by default, it will give you a sum for each column. But we want a sum for each row. How do we do that?

We can sum the boolean matrix we created with "expression_logged > 10" along axis=1 (along the samples) to get, for each gene, how many samples have expression greater than 10. The result is a single column, which pandas calls a "Series" because it has only one dimension - its length. Internally, pandas stores dataframes as a bunch of columns - specifically these Series.

This turns out to be not that many.


In [ ]:
(expression_logged > 10).sum(axis=1)
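The two directions of summing are easy to mix up, so here is a small sketch on a made-up table (toy and the names is_high, per_sample and per_gene are hypothetical). It counts True values per column with the default, and per row with axis=1.

```python
import pandas as pd

# Toy logged expression: 3 genes (rows) by 3 samples (columns)
toy = pd.DataFrame({"S1": [12.0, 0.0, 11.0],
                    "S2": [15.0, 3.0, 0.0],
                    "S3": [10.5, 0.0, 1.0]},
                   index=["geneA", "geneB", "geneC"])

is_high = toy > 10               # boolean dataframe, True counts as 1

per_sample = is_high.sum()       # default: one count per column (sample)
per_gene = is_high.sum(axis=1)   # axis=1: one count per row (gene)

print(per_sample)   # S1: 2, S2: 1, S3: 1
print(per_gene)     # geneA: 3, geneB: 0, geneC: 1
```

The index of the result tells you which direction you summed in: sample names for the default, gene names for axis=1.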

Now we can apply ANOTHER filter and find genes that are "present" (expression greater than 10) in at least 5 samples. We'll save this as the variable genes_of_interest. Notice that the cell below doesn't show genes_of_interest but rather the list at the bottom. This is because what you see under a code cell is the output of the last line of the cell. The "hash mark"/"number sign" "#" is called a comment character: Python ignores the rest of the line after it.

Exercise 5: Commenting and uncommenting

To see genes_of_interest, "uncomment" the line by removing the hash sign, and comment out the list [1, 2, 3].


In [ ]:
genes_of_interest = (expression_logged > 10).sum(axis=1) >= 5
#genes_of_interest
[1, 2, 3]

Getting only rows that you want (aka subsetting)

Now we have some genes that we want to use - how do you pick just those? This can also be called "subsetting" and in pandas has the technical name "indexing".

In pandas, to get the rows (genes) you want using their names (gene symbols) or a boolean Series like genes_of_interest, you use .loc[rows_you_want]. Check it out below.


In [ ]:
expression_filtered = expression_logged.loc[genes_of_interest]
print(expression_filtered.shape)  # shows (nrows, ncols) - like in manhattan you do the Street then the Avenue
expression_filtered.head()
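
To see both ways of using .loc side by side, here is a small sketch on made-up data. The names toy, keep, by_mask and by_name are invented for this example.

```python
import pandas as pd

# Toy expression matrix (made-up values): rows are genes, columns are samples
toy = pd.DataFrame(
    {"S1": [12.0, 0.0, 11.0], "S2": [11.5, 3.0, 0.0], "P1": [13.0, 10.5, 2.0]},
    index=["GeneA", "GeneB", "GeneC"],
)

# Boolean Series indexed by gene: True if the gene is above 10 in at least 2 samples
keep = (toy > 10).sum(axis=1) >= 2

# Subset rows with the boolean Series: only rows where keep is True survive
by_mask = toy.loc[keep]

# Subset rows by name: pass a list of row labels
by_name = toy.loc[["GeneA"]]

print(by_mask.shape)  # (nrows, ncols)
print(by_mask)
```

In this toy case both routes pick the same single gene, so the two results are identical.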

Wow, our matrix is very small - 197 genes! We probably don't want to filter THAT much... I'd say a range of 5,000-15,000 genes after filtering is a good ballpark. Not too big so it's impossible to work with but not too small that you can't do any statistics.

We'll get closer to the expression data created by the paper. Remember that they filtered on genes that had expression greater than 1 in at least 3 single cells. For now, we'll filter for expression greater than 1 in at least 3 samples of any kind - we'll get to the single-cell part in a bit.

Exercise 6: Filtering on the presence of genes

Create a dataframe called expression_filtered_by_all_samples that consists only of genes that have expression greater than 1 in at least 3 samples.

Hint for IndexingError: Unalignable boolean Series key provided

If you're getting this error, double-check your .sum() command. Did you remember to specify that you want to get the number of cells (columns) that express each gene (row)? Remember that .sum() by default gives you the sum over columns, but since genes are the rows .... How do you get the sum over rows?
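
Here is a rough sketch of where that error comes from, using a tiny made-up dataframe (toy, wrong_mask and right_mask are invented names, and the threshold of 2 is just for the toy data). The error is caught generically so the cell keeps running. The exception name printed is what I'd expect from pandas, but treat it as an assumption and check it on your own install.

```python
import pandas as pd

# Toy expression matrix (made-up values): rows are genes, columns are samples
toy = pd.DataFrame(
    {"S1": [2.0, 0.0], "S2": [3.0, 0.0], "P1": [5.0, 1.5]},
    index=["GeneA", "GeneB"],
)

# Default sum runs down each column, so this mask is labeled by SAMPLE
wrong_mask = (toy > 1).sum() >= 2

# axis=1 sums across each row, so this mask is labeled by GENE
right_mask = (toy > 1).sum(axis=1) >= 2

# The sample-labeled mask can't be lined up with the gene rows
try:
    toy.loc[wrong_mask]
    error_name = None
except Exception as err:
    error_name = type(err).__name__
print(error_name)

# The gene-labeled mask lines up with the rows, so this works
filtered = toy.loc[right_mask]
print(filtered)
```

Printing wrong_mask.index and toy.index next to each other is a quick way to spot this kind of mismatch.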


In [ ]:
# YOUR CODE HERE

print(expression_filtered_by_all_samples.shape)
expression_filtered_by_all_samples.head()


In [ ]:
genes_of_interest = (expression_logged > 1).sum(axis=1) >= 3

expression_filtered_by_all_samples = expression_logged.loc[genes_of_interest]
print(expression_filtered_by_all_samples.shape)
expression_filtered_by_all_samples.head()

Just for fun, let's see how the distributions in our expression matrix have changed. If you want to save the figure, you can:


In [ ]:
sns.boxplot(expression_filtered_by_all_samples)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_filtered_by_all_samples_boxplot.pdf')

Discussion

  1. How did the gene expression distributions change? Why?
  2. Were the single and pooled samples' distributions affected differently? Why or why not?

Getting only the columns you want

In the next exercise, we'll get just the single cells.

But first, we're going to pull out just the pooled samples - which are conveniently labeled as "P#". We'll do this using a list comprehension, which means we'll create a new list based on the items in expression_logged.columns and whether or not they start with the letter 'P'.

In Python, things in square brackets ([]) are lists unless indicated otherwise. We are using a list comprehension here instead of a map, because we only want a subset of the columns, rather than all of them.


In [ ]:
pooled_ids = [x for x in expression_logged.columns if x.startswith('P')]
pooled_ids
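
If list comprehensions are new to you, here is a minimal sketch of the same idea on a plain made-up list of column names, next to the longer for loop it replaces. The names columns, pooled and pooled_loop are invented for this example.

```python
# Made-up column names: "S" for single cells, "P" for pooled samples
columns = ["S1", "S2", "S3", "P1", "P2"]

# List comprehension: keep each name only if it starts with "P"
pooled = [name for name in columns if name.startswith("P")]

# The same thing written as an explicit loop
pooled_loop = []
for name in columns:
    if name.startswith("P"):
        pooled_loop.append(name)

print(pooled)
print(pooled == pooled_loop)
```

The "if" at the end of the comprehension is what lets us drop items, which is why it fits better here than a plain map over every column.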

We'll access the columns we want using this bracket notation (note that this only works for columns, not rows)


In [ ]:
pooled = expression_logged[pooled_ids]
pooled.head()

We could do the same thing using .loc but we would need to put a colon ":" in the "rows" section (first place) to show that we want "all rows."


In [ ]:
expression_logged.loc[:, pooled_ids].head()
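
As a quick sanity check that the two notations agree, here is a sketch on a made-up dataframe (toy and toy_pooled_ids are invented names). It also shows that .loc can pick rows and columns at the same time.

```python
import pandas as pd

# Toy expression matrix (made-up values): rows are genes, columns are samples
toy = pd.DataFrame(
    {"S1": [2.0, 0.0], "S2": [3.0, 0.0], "P1": [5.0, 1.5], "P2": [4.0, 2.5]},
    index=["GeneA", "GeneB"],
)
toy_pooled_ids = ["P1", "P2"]

# Bracket notation: columns only
by_brackets = toy[toy_pooled_ids]

# .loc notation: ":" in the rows slot means "all rows"
by_loc = toy.loc[:, toy_pooled_ids]

# .loc with both slots filled: some rows AND some columns
one_gene_pooled = toy.loc[["GeneA"], toy_pooled_ids]

print(by_brackets.equals(by_loc))
print(one_gene_pooled)
```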

Exercise 7: Make a dataframe of only single samples

Use a list comprehension to make a list called single_ids that consists only of single cells, and use that list to subset expression_logged and create a dataframe called singles. (Hint - how are the single-cell ids different from the pooled ids?)


In [ ]:
# YOUR CODE HERE

print(singles.shape)
singles.head()


In [ ]:
single_ids = [x for x in expression_logged.columns if x.startswith('S')]
singles = expression_logged[single_ids]
print(singles.shape)
singles.head()

Using two different dataframes for filtering

Exercise 8: Filter the full dataframe using the singles dataframe

Now we'll actually do the filtering done by the paper. Using the singles dataframe you just created, get the genes that have expression greater than 1 in at least 3 single cells, and use that to filter expression_logged. Call this dataframe expression_filtered_by_singles.


In [ ]:
# YOUR CODE HERE

print(expression_filtered_by_singles.shape)
expression_filtered_by_singles.head()


In [ ]:
# "at least 3" single cells means 3 or more, so use >= rather than >
rows = (singles > 1).sum(axis=1) >= 3

expression_filtered_by_singles = expression_logged.loc[rows]
print(expression_filtered_by_singles.shape)
expression_filtered_by_singles.head()
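
The trick of counting in one dataframe and filtering another works because both share the same gene labels on their rows. Here is a minimal sketch on made-up data. The names toy, toy_singles, n_cells, keep and toy_filtered are invented for this example.

```python
import pandas as pd

# Toy expression matrix (made-up values): four single cells and one pooled sample
toy = pd.DataFrame(
    {
        "S1": [2.0, 0.0, 4.0],
        "S2": [3.0, 0.0, 0.0],
        "S3": [1.5, 0.0, 2.0],
        "S4": [0.0, 0.0, 0.0],
        "P1": [5.0, 6.0, 3.0],
    },
    index=["GeneA", "GeneB", "GeneC"],
)

# Count using ONLY the single-cell columns
single_cols = [c for c in toy.columns if c.startswith("S")]
toy_singles = toy[single_cols]
n_cells = (toy_singles > 1).sum(axis=1)

# "At least 3 single cells" means >= 3
keep = n_cells >= 3

# Filter the FULL dataframe, so the pooled column comes along too
toy_filtered = toy.loc[keep]

print(n_cells)
print(toy_filtered)
```

GeneA is expressed in exactly 3 single cells here, so it only survives because the comparison is >= and not >. GeneB is high in the pooled sample but is still dropped, since the pooled column never entered the count.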

Let's make a boxplot again to see how the data has changed.


In [ ]:
sns.boxplot(expression_filtered_by_singles)

fig = plt.gcf()
fig.savefig('expression_filtered_by_singles_boxplot.pdf')

This is much nicer because now we don't have so many zeros and each sample has a reasonable dynamic range.

Why did this filtering even matter?

You may be wondering, we did all this work to remove some zeros..... so the FPKM what? Let's take a look at how this affects the relationships between samples using sns.jointplot from seaborn, which will plot a correlation scatterplot. This also calculates the Pearson correlation, a linear correlation metric.
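
Depending on your seaborn version, the plot may not print the correlation value on the figure. As a fallback, here is a minimal sketch of computing the Pearson correlation directly with pandas, using two short made-up Series (a and b are invented for this example, not real samples).

```python
import numpy as np
import pandas as pd

# Two made-up "samples" measured over the same five genes
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
b = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Series.corr uses the Pearson method by default
r = a.corr(b)

# Cross-check with numpy's correlation matrix (off-diagonal entry)
r_numpy = np.corrcoef(a, b)[0, 1]

print(r, r_numpy)
```

On the real data, the same call would look like expression_logged['S1'].corr(expression_logged['S2']).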

Let's first do this on the unlogged data.


In [ ]:
# Pass x and y as keywords; newer seaborn versions no longer accept them positionally
sns.jointplot(x=shalek2013_expression['S1'], y=shalek2013_expression['S2'])

Pretty funky looking huh? That's why we logged it :)

Now let's try this on the logged data.


In [ ]:
# Pass x and y as keywords; newer seaborn versions no longer accept them positionally
sns.jointplot(x=expression_logged['S1'], y=expression_logged['S2'])

Hmm, our Pearson correlation increased from 0.62 to 0.64. Why could that be?

Let's look at this same plot using the filtered data.


In [ ]:
# Pass x and y as keywords; newer seaborn versions no longer accept them positionally
sns.jointplot(x=expression_filtered_by_singles['S1'], y=expression_filtered_by_singles['S2'])

And now our correlation went DOWN!? Why would that be?

Exercise 9: Discuss changes in correlation

Take 2-5 sentences to explain why the correlation changed between the different datasets.

YOUR ANSWER HERE