Applying classifiers to Macaulay2016

We're going to use the classifier knowledge that we've learned so far and apply it to the shalek2013 and macaulay2016 datasets.

For the GO analysis, we'll need a few other packages:

  • mygene for looking up the gene ontology categories of genes
  • goatools for performing gene ontology enrichment analysis
  • fishers_exact_test for goatools

Use the following commands at your terminal to install the packages. Some of them are on Github so it's important to get the whole command right.

$ pip install mygene
$ pip install git+git://github.com/olgabot/goatools.git
$ pip install git+https://github.com/brentp/fishers_exact_test.git

In [1]:
# Alphabetical order is standard
# We're doing "import superlongname as abbrev" for our laziness - this way we don't have to type out the whole thing each time.

# From python standard library
import collections

# Python plotting library
import matplotlib.pyplot as plt

# Numerical python library (pronounced "num-pie")
import numpy as np

# Dataframes in Python
import pandas as pd

# Statistical plotting library we'll use
import seaborn as sns
sns.set(style='whitegrid')

# Matrix decomposition
from sklearn.decomposition import PCA, FastICA

# Matrix decomposition
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Manifold learning
from sklearn.manifold import MDS, TSNE

# Gene ontology
import goatools
import mygene

# Initialize the "mygene.info" (http://mygene.info/) interface
mg = mygene.MyGeneInfo()


# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline


/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

In [2]:
shalek2013_metadata = pd.read_csv('../data/shalek2013/metadata.csv', 
                               
                                     # Sets the first (Python starts counting from 0 not 1) column as the row names
                                      index_col=0)
shalek2013_expression = pd.read_csv('../data/shalek2013/expression.csv', 
                               
                                     # Sets the first (Python starts counting from 0 not 1) column as the row names
                                      index_col=0)
shalek2013_expression_feature = pd.read_csv('../data/shalek2013/expression_feature.csv', 
                               
                                     # Sets the first (Python starts counting from 0 not 1) column as the row names
                                      index_col=0)

Gene Ontology (GO) Enrichment

Gene ontology is a tree (directed acyclic graph) of gene annotations. The topmost node is the most general, and the bottommost nodes are the most specific. Here is an example GO graph

Perform GO enrichment analysis (GOEA)

GOEA Step 1: Download GO graph file of "obo" type (same for all species)

This will download the file "go-basic.obo" if it doesn't already exist. This only needs to be done once.


In [3]:
from goatools.base import download_go_basic_obo
obo_fname = download_go_basic_obo()

# Show the filename
obo_fname


  EXISTS: go-basic.obo
Out[3]:
'go-basic.obo'

GOEA Step 2: Create the GO graph (same for all species)


In [4]:
obo_dag = goatools.obo_parser.GODag(obo_file='go-basic.obo')


load obo file go-basic.obo
go-basic.obo: format-version(1.2) data-version(releases/2016-06-15)
46823 nodes imported

GOEA Step 3: Get gene ID to GO id mapping (species-specific and experiment-specific)

Here we are establishing the background for our GOEA. Defining your background is very important because, for example, tehre are lots of neural genes so if you use all human genes as background in your study of which genes are upregulated in Neuron Type X vs Neuron Type Y, you'll get a bunch of neuron genes (which is true) but not the smaller differences between X and Y. Typicall, you use all expressed genes as the background.

For our data, we can access all expressed genes very simply by getting the column names in the dataframe: shalek2013_expression.columns.


In [ ]:
GO_KEYS = 'go.BP', 'go.MF', 'go.CC'


mygene_output = mg.querymany(shalek2013_expression.columns, 
                             scopes='symbol', fields=GO_KEYS, species='mouse', 
                             returnall=True)

def parse_mygene_output(mygene_output):
    """Convert mygene.querymany output to a gene id to go term mapping
    
    Parameters
    ----------
    mygene_output : dict or list
        Dictionary (returnall=True) or list (returnall=False) of 
        output from mygene.querymany
        
    Output
    ------
    gene_name_to_go : dict
        Mapping of gene name to a set of GO ids    
    """
    # if "returnall=True" was specified, need to get just the "out" key
    if isinstance(mygene_output, dict):
        mygene_output = mygene_output['out']

    gene_name_to_go = collections.defaultdict(set)

    for line in mygene_output:
        gene_name = line['query']
        for go_key in GO_KEYS:
            try:
                go_terms = line[go_key]
            except KeyError:
                continue
            if isinstance(go_terms, dict):
                go_ids = set([go_terms['id']])
            else:
                go_ids = set(x['id'] for x in go_terms)
            gene_name_to_go[gene_name] |= go_ids
    return gene_name_to_go

gene_name_to_go = parse_mygene_output(mygene_output)

GOEA Step 4: Create a GO enrichment calculator object go_enricher (species- and experiment-specific)

In this step, we are using the two objects we've created (obo_dag from Step 2 and gene_name_to_go from Step 3) plus the gene ids to create a go_enricher object


In [7]:
go_enricher = goatools.GOEnrichmentStudy(shalek2013_expression.columns, gene_name_to_go, obo_dag)


Propagating term counts to parents ..
 5,425 out of  6,312 population items found in association

In [54]:
# cl = gene_annotation.sort(col, ascending=False)[gene_annotation[col] > 5e-4].index

go_enricher = goatools.GOEnrichmentStudy(genes, gene_name_to_go, obo_dag)

results = go.run_study(genes[:5])

go_enrichment = pd.DataFrame([r.__dict__ for r in results])
go_enrichment


Propagating term counts to parents ..
     9 out of     10 population items found in association
Calculating uncorrected p-values using fisher
     4 out of      5 study items found in association
Running multitest correction: local bonferroni
Running multitest correction: local sidak
Running multitest correction: local holm
1,004 GO terms are associated with 4 of 5 study items in a population of 10
Out[54]:
GO NS _methods enrichment goterm name p_bonferroni p_holm p_sidak p_uncorrected pop_count pop_items pop_n ratio_in_pop ratio_in_study study_count study_items study_n
0 GO:0044710 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0044710\tlevel-02\tdepth-02\tsingle-organis... single-organism metabolic process 1.0 1.0 1.0 0.047619 4 {PARK2, AGPAT4, NPL, QK} 10 (4, 10) (4, 5) 4 {PARK2, AGPAT4, NPL, QK} 5
1 GO:0045597 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0045597\tlevel-05\tdepth-05\tpositive regul... positive regulation of cell differentiation 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
2 GO:0044707 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0044707\tlevel-02\tdepth-02\tsingle-multice... single-multicellular organism process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
3 GO:0009890 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0009890\tlevel-05\tdepth-05\tnegative regul... negative regulation of biosynthetic process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
4 GO:0031344 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0031344\tlevel-05\tdepth-05\tregulation of ... regulation of cell projection organization 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
5 GO:0044255 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0044255\tlevel-03\tdepth-04\tcellular lipid... cellular lipid metabolic process 1.0 1.0 1.0 0.444444 2 {AGPAT4, QK} 10 (2, 10) (2, 5) 2 {AGPAT4, QK} 5
6 GO:0009892 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0009892\tlevel-04\tdepth-04\tnegative regul... negative regulation of metabolic process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
7 GO:0006082 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0006082\tlevel-03\tdepth-04\torganic acid m... organic acid metabolic process 1.0 1.0 1.0 0.444444 2 {NPL, QK} 10 (2, 10) (2, 5) 2 {NPL, QK} 5
8 GO:0010976 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0010976\tlevel-07\tdepth-09\tpositive regul... positive regulation of neuron projection devel... 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
9 GO:0044248 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0044248\tlevel-03\tdepth-03\tcellular catab... cellular catabolic process 1.0 1.0 1.0 0.444444 2 {PARK2, NPL} 10 (2, 10) (2, 5) 2 {PARK2, NPL} 5
10 GO:0051172 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0051172\tlevel-05\tdepth-05\tnegative regul... negative regulation of nitrogen compound metab... 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
11 GO:0050767 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0050767\tlevel-06\tdepth-06\tregulation of ... regulation of neurogenesis 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
12 GO:0051094 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0051094\tlevel-04\tdepth-04\tpositive regul... positive regulation of developmental process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
13 GO:2000113 BP [(local, bonferroni, bonferroni), (local, sida... e GO:2000113\tlevel-07\tdepth-07\tnegative regul... negative regulation of cellular macromolecule ... 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
14 GO:0051960 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0051960\tlevel-05\tdepth-05\tregulation of ... regulation of nervous system development 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
15 GO:0044281 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0044281\tlevel-03\tdepth-03\tsmall molecule... small molecule metabolic process 1.0 1.0 1.0 0.444444 2 {NPL, QK} 10 (2, 10) (2, 5) 2 {NPL, QK} 5
16 GO:0051248 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0051248\tlevel-06\tdepth-06\tnegative regul... negative regulation of protein metabolic process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
17 GO:0010629 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0010629\tlevel-06\tdepth-06\tnegative regul... negative regulation of gene expression 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
18 GO:0043436 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0043436\tlevel-04\tdepth-05\toxoacid metabo... oxoacid metabolic process 1.0 1.0 1.0 0.444444 2 {NPL, QK} 10 (2, 10) (2, 5) 2 {NPL, QK} 5
19 GO:0032269 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0032269\tlevel-06\tdepth-07\tnegative regul... negative regulation of cellular protein metabo... 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
20 GO:0032501 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0032501\tlevel-01\tdepth-01\tmulticellular ... multicellular organismal process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
21 GO:1901575 BP [(local, bonferroni, bonferroni), (local, sida... e GO:1901575\tlevel-03\tdepth-03\torganic substa... organic substance catabolic process 1.0 1.0 1.0 0.444444 2 {PARK2, NPL} 10 (2, 10) (2, 5) 2 {PARK2, NPL} 5
22 GO:0010605 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0010605\tlevel-05\tdepth-05\tnegative regul... negative regulation of macromolecule metabolic... 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
23 GO:0044712 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0044712\tlevel-03\tdepth-03\tsingle-organis... single-organism catabolic process 1.0 1.0 1.0 0.444444 2 {PARK2, NPL} 10 (2, 10) (2, 5) 2 {PARK2, NPL} 5
24 GO:0050769 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0050769\tlevel-06\tdepth-07\tpositive regul... positive regulation of neurogenesis 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
25 GO:0050793 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0050793\tlevel-03\tdepth-03\tregulation of ... regulation of developmental process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
26 GO:0051246 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0051246\tlevel-05\tdepth-05\tregulation of ... regulation of protein metabolic process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
27 GO:0045664 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0045664\tlevel-07\tdepth-07\tregulation of ... regulation of neuron differentiation 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
28 GO:0006629 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0006629\tlevel-03\tdepth-03\tlipid metaboli... lipid metabolic process 1.0 1.0 1.0 0.444444 2 {AGPAT4, QK} 10 (2, 10) (2, 5) 2 {AGPAT4, QK} 5
29 GO:0043066 BP [(local, bonferroni, bonferroni), (local, sida... e GO:0043066\tlevel-07\tdepth-07\tnegative regul... negative regulation of apoptotic process 1.0 1.0 1.0 0.444444 2 {PARK2, QK} 10 (2, 10) (2, 5) 2 {PARK2, QK} 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
974 GO:0004871 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0004871\tlevel-01\tdepth-01\tsignal transdu... signal transducer activity 1.0 1.0 1.0 1.000000 1 {IGF2R} 10 (1, 10) (0, 5) 0 {} 5
975 GO:0003735 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0003735\tlevel-02\tdepth-02\tstructural con... structural constituent of ribosome 1.0 1.0 1.0 1.000000 1 {MRPL18} 10 (1, 10) (0, 5) 0 {} 5
976 GO:0019901 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0019901\tlevel-05\tdepth-05\tprotein kinase... protein kinase binding 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
977 GO:0016879 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0016879\tlevel-03\tdepth-03\tligase activit... ligase activity, forming carbon-nitrogen bonds 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
978 GO:0019900 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0019900\tlevel-04\tdepth-04\tkinase binding... kinase binding 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
979 GO:0004521 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0004521\tlevel-06\tdepth-06\tendoribonuclea... endoribonuclease activity 1.0 1.0 1.0 1.000000 1 {RNASEL} 10 (1, 10) (0, 5) 0 {} 5
980 GO:0016881 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0016881\tlevel-04\tdepth-04\tacid-amino aci... acid-amino acid ligase activity 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
981 GO:0005520 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0005520\tlevel-04\tdepth-04\tinsulin-like g... insulin-like growth factor binding 1.0 1.0 1.0 1.000000 1 {IGF2R} 10 (1, 10) (0, 5) 0 {} 5
982 GO:0042826 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0042826\tlevel-04\tdepth-04\thistone deacet... histone deacetylase binding 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
983 GO:0032296 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0032296\tlevel-06\tdepth-06\tdouble-strande... double-stranded RNA-specific ribonuclease acti... 1.0 1.0 1.0 1.000000 1 {RNASEL} 10 (1, 10) (0, 5) 0 {} 5
984 GO:0001664 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0001664\tlevel-04\tdepth-04\tG-protein coup... G-protein coupled receptor binding 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
985 GO:0099600 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0099600\tlevel-03\tdepth-03\ttransmembrane ... transmembrane receptor activity 1.0 1.0 1.0 1.000000 1 {IGF2R} 10 (1, 10) (0, 5) 0 {} 5
986 GO:0004672 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0004672\tlevel-05\tdepth-05\tprotein kinase... protein kinase activity 1.0 1.0 1.0 1.000000 1 {RNASEL} 10 (1, 10) (0, 5) 0 {} 5
987 GO:0018169 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0018169\tlevel-06\tdepth-06\tribosomal S6-g... ribosomal S6-glutamic acid ligase activity 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
988 GO:0016893 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0016893\tlevel-06\tdepth-06\tendonuclease a... endonuclease activity, active with either ribo... 1.0 1.0 1.0 1.000000 1 {RNASEL} 10 (1, 10) (0, 5) 0 {} 5
989 GO:0031406 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0031406\tlevel-04\tdepth-04\tcarboxylic aci... carboxylic acid binding 1.0 1.0 1.0 1.000000 1 {IGF2R} 10 (1, 10) (0, 5) 0 {} 5
990 GO:0004842 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0004842\tlevel-04\tdepth-04\tubiquitin-prot... ubiquitin-protein transferase activity 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
991 GO:0003729 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0003729\tlevel-06\tdepth-06\tmRNA binding [... mRNA binding 1.0 1.0 1.0 1.000000 1 {QK} 10 (1, 10) (1, 5) 1 {QK} 5
992 GO:0019838 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0019838\tlevel-03\tdepth-03\tgrowth factor ... growth factor binding 1.0 1.0 1.0 1.000000 1 {IGF2R} 10 (1, 10) (0, 5) 0 {} 5
993 GO:0016408 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0016408\tlevel-05\tdepth-05\tC-acyltransfer... C-acyltransferase activity 1.0 1.0 1.0 1.000000 1 {ACAT2} 10 (1, 10) (0, 5) 0 {} 5
994 GO:0003690 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0003690\tlevel-05\tdepth-05\tdouble-strande... double-stranded DNA binding 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
995 GO:0008747 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0008747\tlevel-05\tdepth-05\tN-acetylneuram... N-acetylneuraminate lyase activity 1.0 1.0 1.0 1.000000 1 {NPL} 10 (1, 10) (1, 5) 1 {NPL} 5
996 GO:0016874 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0016874\tlevel-02\tdepth-02\tligase activit... ligase activity 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
997 GO:0033293 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0033293\tlevel-05\tdepth-05\tmonocarboxylic... monocarboxylic acid binding 1.0 1.0 1.0 1.000000 1 {IGF2R} 10 (1, 10) (0, 5) 0 {} 5
998 GO:0070739 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0070739\tlevel-05\tdepth-05\tprotein-glutam... protein-glutamic acid ligase activity 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
999 GO:0016411 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0016411\tlevel-06\tdepth-06\tacylglycerol O... acylglycerol O-acyltransferase activity 1.0 1.0 1.0 1.000000 1 {AGPAT4} 10 (1, 10) (1, 5) 1 {AGPAT4} 5
1000 GO:0003677 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0003677\tlevel-04\tdepth-04\tDNA binding [m... DNA binding 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
1001 GO:0097602 MF [(local, bonferroni, bonferroni), (local, sida... e GO:0097602\tlevel-03\tdepth-03\tcullin family ... cullin family protein binding 1.0 1.0 1.0 1.000000 1 {PARK2} 10 (1, 10) (1, 5) 1 {PARK2} 5
1002 GO:0016453 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0016453\tlevel-06\tdepth-06\tC-acetyltransf... C-acetyltransferase activity 1.0 1.0 1.0 1.000000 1 {ACAT2} 10 (1, 10) (0, 5) 0 {} 5
1003 GO:0043177 MF [(local, bonferroni, bonferroni), (local, sida... p GO:0043177\tlevel-03\tdepth-03\torganic acid b... organic acid binding 1.0 1.0 1.0 1.000000 1 {IGF2R} 10 (1, 10) (0, 5) 0 {} 5

1004 rows × 18 columns


In [6]:
from sklearn.svm import SVC

In [ ]:


In [10]:
classifier = SVC(kernel='linear')

In [11]:
def plot_svc_decision_function(clf, ax=None):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    Y, X = np.meshgrid(y, x)
    P = np.zeros_like(X)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            P[i, j] = clf.decision_function([[xi, yj]])
    # plot the margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])

In [12]:
shalek2013_singles = shalek2013_expression.loc[[x for x in shalek2013_expression.index if x.startswith('S')]]
shalek2013_singles.shape


Out[12]:
(18, 6312)

In [13]:
shalek2013_singles = shalek2013_singles.loc[:, (shalek2013_singles > 1).sum() >= 3]
shalek2013_singles.shape


Out[13]:
(18, 6013)

In [14]:
mature = 'S12', 'S13', 'S16'
target = [1 if x in mature else 0 for x in shalek2013_singles.index]
target


Out[14]:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0]

In [15]:
classifier.fit(shalek2013_singles, target)


Out[15]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [16]:
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE, MDS


smusher = FastICA(n_components=4)
reduced_data = smusher.fit_transform(shalek2013_singles+1)


/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/sklearn/decomposition/fastica_.py:116: UserWarning: FastICA did not converge. Consider increasing tolerance or the maximum number of iterations.
  warnings.warn('FastICA did not converge. Consider increasing '

In [17]:
reduced_data_df = pd.DataFrame(reduced_data)

In [18]:
reduced_data_df.min()


Out[18]:
0   -0.556400
1   -0.643007
2   -0.299692
3   -0.382223
dtype: float64

In [19]:
reduced_data_df.max()


Out[19]:
0    0.212303
1    0.387360
2    0.864751
3    0.332152
dtype: float64

In [20]:
reduced_data_df.quantile(1)


Out[20]:
0    0.212303
1    0.387360
2    0.864751
3    0.332152
dtype: float64

In [21]:
low_d_space = pd.DataFrame(reduced_data).apply(lambda x: pd.Series(np.linspace(x.min(), x.max(), 50)))

In [22]:
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
X = np.linspace(x_min, x_max, 50)
Y = np.linspace(y_min, y_max, 50)
xx, yy = np.meshgrid(X, Y)

# Get the decision boundary
# Z = classifier.decision_function(np.c_[xx.ravel(), yy.ravel()])

# two_d_space = np.c_[xx.ravel(), yy.ravel()]
# two_d_space
low_d_space = np.concatenate([])


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-4abe8f52bb01> in <module>()
     10 # two_d_space = np.c_[xx.ravel(), yy.ravel()]
     11 # two_d_space
---> 12 low_d_space = np.concatenate([])

ValueError: need at least one array to concatenate

In [ ]:
xx.shape

In [ ]:
np.concatenate([xx.ravel(), yy.ravel()]).reshape(50**2, 2)

In [ ]:
two_d_space.shape

In [ ]:
plt.scatter(two_d_space[:, 0], two_d_space[:, 1], color='black')

In [ ]:
import goatools

In [ ]:
shalek2013_expression.index[:10]

In [31]:
# Get http://geneontology.org/ontology/go-basic.obo
from goatools.base import download_go_basic_obo
obo_fname = download_go_basic_obo()
obo_fname


  EXISTS: go-basic.obo
Out[31]:
'go-basic.obo'

In [ ]:
# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
from goatools.base import download_ncbi_associations
gene2go = download_ncbi_associations()

In [32]:
obo_dag = goatools.obo_parser.GODag(obo_file=obo_fname)

import goatools

# cl = gene_annotation.sort(col, ascending=False)[gene_annotation[col] > 5e-4].index
g = goatools.GOEnrichmentStudy(list(gene_annotation.index), assoc_dict, obo_dag, study=list(cl))


load obo file go-basic.obo
go-basic.obo: format-version(1.2) data-version(releases/2016-06-15)
46823 nodes imported
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-32-183177986a6c> in <module>()
      4 
      5 # cl = gene_annotation.sort(col, ascending=False)[gene_annotation[col] > 5e-4].index
----> 6 g = goatools.GOEnrichmentStudy(list(gene_annotation.index), assoc_dict, obo_dag, study=list(cl))

NameError: name 'gene_annotation' is not defined

In [ ]:


In [3]:
# assoc = pd.read_table('danio-rerio-gene-ontology.txt').dropna()
# assoc_df = assoc.groupby('Ensembl Gene ID').agg(lambda s: ';'.join(s))
# assoc_s = assoc_df['GO Term Accession'].apply(lambda s: set(s.split(';')))
# assoc_dict = assoc_s.to_dict()


load obo file ../data/macaulay2016/go-basic.obo
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-ec6a9e36b6e5> in <module>()
----> 1 obo_dag = goatools.obo_parser.GODag(obo_file='../data/macaulay2016/go-basic.obo')
      2 
      3 # assoc = pd.read_table('danio-rerio-gene-ontology.txt').dropna()
      4 # assoc_df = assoc.groupby('Ensembl Gene ID').agg(lambda s: ';'.join(s))
      5 # assoc_s = assoc_df['GO Term Accession'].apply(lambda s: set(s.split(';')))

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/goatools/obo_parser.py in __init__(self, obo_file, optional_attrs)
    386 
    387     def __init__(self, obo_file="go-basic.obo", optional_attrs=None):
--> 388         self.load_obo_file(obo_file, optional_attrs)
    389 
    390     def load_obo_file(self, obo_file, optional_attrs):

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/goatools/obo_parser.py in load_obo_file(self, obo_file, optional_attrs)
    391 
    392         print("load obo file %s" % obo_file, file=sys.stderr)
--> 393         reader = OBOReader(obo_file, optional_attrs)
    394         for rec in reader:
    395             self[rec.id] = rec

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/goatools/obo_parser.py in __init__(self, obo_file, optional_attrs)
     40             # GOTerm attributes that are necessary for any operations:
     41         else:
---> 42             raise Exception("download obo file first\n "
     43                             "[http://geneontology.org/ontology/"
     44                             "go-basic.obo]")

Exception: download obo file first
 [http://geneontology.org/ontology/go-basic.obo]

In [19]:
import goatools

# cl = gene_annotation.sort(col, ascending=False)[gene_annotation[col] > 5e-4].index
g = goatools.GOEnrichmentStudy(list(gene_annotation.index), assoc_dict, obo_dag, study=list(cl))

In [6]:



/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

In [ ]:
for r in g.results[:25]:
    print r.goterm.id, '{:.2}'.format(r.p_bonferroni), r.ratio_in_study, r.goterm.name, r.goterm.namespace

In [21]:
unsmushed = smusher.inverse_transform(two_d_space)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-6cebabe76618> in <module>()
----> 1 unsmushed = smusher.inverse_transform(two_d_space)

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/sklearn/decomposition/fastica_.py in inverse_transform(self, X, copy)
    566 
    567         X = check_array(X, copy=(copy and self.whiten), dtype=FLOAT_DTYPES)
--> 568         X = fast_dot(X, self.mixing_.T)
    569         if self.whiten:
    570             X += self.mean_

ValueError: shapes (2500,2) and (4,6013) not aligned: 2 (dim 1) != 4 (dim 0)

In [57]:
Z = classifier.decision_function(unsmushed)
Z = Z.reshape(xx.shape)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-57-2bcd1d6f9dd1> in <module>()
----> 1 Z = classifier.decision_function(unsmushed)
      2 Z = Z.reshape(xx.shape)

NameError: name 'unsmushed' is not defined

In [ ]:
fig, ax = plt.subplots()
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=target, cmap='Dark2')
ax.contour(X, Y, Z, colors='k',
           levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])

Exercise 1

  1. Try the same analysis, but use ICA instead of PCA.
    1. How are the results different?
    2. Are the cells closer or farther from the decision boundary?
    3. Is that a "better" or "worse" classification? Why?
    4. Why does the reduction algorithm affect the visualization of the classification?
  2. Try the same analysis, but use MDS or t-SNE instead of PCA.
    1. How are the results different?
    2. Are the cells closer or farther from the decision boundary?
    3. Is that a "better" or "worse" classification? Why?
  3. Try the same analysis, but use the "LPS Response" genes and a dimensionality reduction algorithm of your choice.
    1. How are the results different?
    2. Are the cells closer or farther from the decision boundary?
    3. Is that a "better" or "worse" classification? Why?

Decision trees


In [ ]:
def visualize_tree(estimator, X, y, smusher, boundaries=True,
                   xlim=None, ylim=None):
    estimator.fit(X, y)
    smushed = smusher.fit_transform(X)

    if xlim is None:
        xlim = (smushed[:, 0].min() - 0.1, smushed[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (smushed[:, 1].min() - 0.1, smushed[:, 1].max() + 0.1)

    x_min, x_max = xlim
    y_min, y_max = ylim
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    two_d_space = np.c_[xx.ravel(), yy.ravel()]
    unsmushed = smusher.inverse_transform(two_d_space)
    Z = estimator.predict(unsmushed)

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='Paired')
    plt.clim(y.min(), y.max())

    # Plot also the training points
    plt.scatter(smushed[:, 0], smushed[:, 1], c=y, s=50, cmap='Paired')
    plt.axis('off')

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)        
    plt.clim(y.min(), y.max())
    
    # Plot the decision boundaries
    def plot_boundaries(i, xlim, ylim):
        if i < 0:
            return

        tree = estimator.tree_
        
        if tree.feature[i] == 0:
            plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
            plot_boundaries(tree.children_left[i],
                            [xlim[0], tree.threshold[i]], ylim)
            plot_boundaries(tree.children_right[i],
                            [tree.threshold[i], xlim[1]], ylim)
        
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])
            
    if boundaries:
        plot_boundaries(0, plt.xlim(), plt.ylim())

In [ ]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()


from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE, MDS


smusher = PCA(n_components=2)
# reduced_data = smusher.fit_transform(shalek2013_singles+1)

visualize_tree(classifier, shalek2013_singles, np.array(target), smusher)

In [ ]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
classifier = RandomForestClassifier()

smusher = PCA(n_components=2)
# reduced_data = smusher.fit_transform(shalek2013_singles+1)

visualize_tree(classifier, shalek2013_singles, np.array(target), smusher, boundaries=False)

In [ ]:
classifier = ExtraTreesClassifier()

smusher = PCA(n_components=2)
# reduced_data = smusher.fit_transform(shalek2013_singles+1)

visualize_tree(classifier, shalek2013_singles, np.array(target), smusher, boundaries=False)

Macaulay2016


In [ ]:
pd.options.display.max_columns = 50

In [ ]:
macaulay2016_metadata = pd.read_csv('../4._Case_Study/macaulay2016/sample_info_qc.csv', index_col=0)
macaulay2016_metadata.head()

In [ ]:
macaulay2016_cluster_names = tuple(sorted(macaulay2016_metadata['cluster'].unique()))
macaulay2016_cluster_names

In [ ]:
macaulay2016_target = macaulay2016_metadata['cluster'].map(lambda x: macaulay2016_cluster_names.index(x))
macaulay2016_target

In [ ]:
macaulay2016_expression = pd.read_csv('../4._Case_Study/macaulay2016/gene_expression_s.csv', index_col=0).T

In [ ]:
macaulay2016_expression.head()

In [ ]:
macaulay2016_expression_filtered = macaulay2016_expression[[x for x in macaulay2016_expression if x.startswith("ENS")]]
macaulay2016_expression_filtered.shape

In [ ]:
macaulay2016_expression_filtered = macaulay2016_expression_filtered.loc[macaulay2016_metadata.index]

In [ ]:
macaulay2016_expression_filtered = 1e6*macaulay2016_expression_filtered.divide(macaulay2016_expression_filtered.sum(axis=1), axis=0)
macaulay2016_expression_filtered.head()

In [ ]:
macaulay2016_expression_filtered = np.log10(macaulay2016_expression_filtered+1)
macaulay2016_expression_filtered.head()

In [ ]:
macaulay2016_expression_filtered = macaulay2016_expression_filtered.loc[:, (macaulay2016_expression_filtered > 1).sum() >=3]
macaulay2016_expression_filtered.shape

In [ ]:
# classifier = SVC(kernel='linear')
# classifier = DecisionTreeClassifier(max_depth=10)
classifier = ExtraTreesClassifier(n_estimators=1000)
classifier.fit(macaulay2016_expression_filtered, macaulay2016_target)

In [ ]:
smusher = FastICA(n_components=2, random_state=0)
smushed_data = smusher.fit_transform(macaulay2016_expression_filtered)

x_min, x_max = smushed_data[:, 0].min(), smushed_data[:, 0].max()
y_min, y_max = smushed_data[:, 1].min(), smushed_data[:, 1].max()
delta_x = 0.05 * abs(x_max - x_min)
delta_y = 0.05 * abs(x_max - x_min)

x_min -= delta_x
x_max += delta_x
y_min -= delta_y
y_max += delta_y

X = np.linspace(x_min, x_max, 100)
Y = np.linspace(y_min, y_max, 100)
xx, yy = np.meshgrid(X, Y)

two_d_space = np.c_[xx.ravel(), yy.ravel()]
two_d_space

In [ ]:
high_dimensional_space = smusher.inverse_transform(two_d_space)

In [ ]:
# Get the class boundaries
Z = classifier.predict(high_dimensional_space)

In [ ]:
import matplotlib as mpl
macaulay2016_metadata['cluster_color_hex'] = macaulay2016_metadata['cluster_color'].map(lambda x: mpl.colors.rgb2hex(eval(x)))

In [ ]:
int_to_cluster_name = dict(zip(range(len(macaulay2016_cluster_names)), macaulay2016_cluster_names))
int_to_cluster_name

In [ ]:
cluster_name_to_color = dict(zip(macaulay2016_metadata['cluster'], macaulay2016_metadata['cluster_color_hex']))
cluster_name_to_color

In [ ]:
macaulay2016_palette = [mpl.colors.hex2color(cluster_name_to_color[int_to_cluster_name[i]]) 
                        for i in range(len(macaulay2016_cluster_names))]
macaulay2016_palette

In [ ]:
cmap = mpl.colors.ListedColormap(macaulay2016_palette)
cmap

In [ ]:
x_min, x_max

In [ ]:
y = macaulay2016_target
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap=cmap)
plt.clim(y.min(), y.max())

# Plot also the training points
plt.scatter(smushed_data[:, 0], smushed_data[:, 1], s=50, color=macaulay2016_metadata['cluster_color_hex'], 
            edgecolor='k') #c=macaulay2016_target, s=50, cmap='Set2')
plt.axis('off')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)        
plt.clim(y.min(), y.max())

In [ ]:
smusher = FastICA(n_components=4, random_state=354)
smushed_data = pd.DataFrame(smusher.fit_transform(macaulay2016_expression_filtered))

# x_min, x_max = smushed_data[:, 0].min(), smushed_data[:, 0].max()
# y_min, y_max = smushed_data[:, 1].min(), smushed_data[:, 1].max()
# delta_x = 0.05 * abs(x_max - x_min)
# delta_y = 0.05 * abs(x_max - x_min)

# x_min -= delta_x
# x_max += delta_x
# y_min -= delta_y
# y_max += delta_y

# X = np.linspace(x_min, x_max, 100)
# Y = np.linspace(y_min, y_max, 100)
# xx, yy = np.meshgrid(X, Y)

# low_dimensional_space = np.c_[xx.ravel(), yy.ravel()]
# low_dimensional_space

In [ ]:
smushed_data.max() - smushed_data.min()

In [ ]:
grid = smushed_data.apply(lambda x: pd.Series(np.linspace(x.min(), x.max(), 50)))
grid.head()
# grid = [x.ravel() for x in grid]
# grid
# low_dimensional_space = np.concatenate(grid, axis=0)
# low_dimensional_space.shape
# # low_dimensional_space = low_dimensional_space.reshape(shape)

In [ ]:
x1, x2, x3, x4 = np.meshgrid(*[grid[col] for col in grid])
low_dimensional_space = np.c_[x1.ravel(), x2.ravel(), x3.ravel(), x4.ravel()]

In [ ]:
high_dimensional_space = smusher.inverse_transform(low_dimensional_space)

In [ ]:
smushed_data['hue'] = macau

In [ ]:
sns.pairplot(smushed_data)