Applying classifiers to Shalek2013

We're going to use the classifier knowledge that we've learned so far and apply it to the shalek2013 and macaulay2016 datasets.

For the GO analysis, we'll need a few other packages:

mygene for looking up the gene ontology categories of genes
goatools for performing gene ontology enrichment analysis
fishers_exact_test for goatools

Use the following commands at your terminal to install the packages. Some of them are on Github so it's important to get the whole command right.

$ pip install mygene
$ pip install git+git://github.com/olgabot/goatools.git
$ pip install git+https://github.com/brentp/fishers_exact_test.git



In [1]:

    
# Alphabetical order is standard
# We're doing "import superlongname as abbrev" for our laziness - this way we don't have to type out the whole thing each time.

# From python standard library
import collections

# Python plotting library
import matplotlib.pyplot as plt

# Numerical python library (pronounced "num-pie")
import numpy as np

# Dataframes in Python
import pandas as pd

# Statistical plotting library we'll use
import seaborn as sns
sns.set(style='whitegrid')

# Label processing
from sklearn import preprocessing

# Matrix decomposition
from sklearn.decomposition import PCA, FastICA

# Matrix decomposition
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Manifold learning
from sklearn.manifold import MDS, TSNE

# Gene ontology
import goatools
import mygene

# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline









    



/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Utility functions for gene ontology and SVM decision boundary plotting



In [2]:

    
def plot_svc_decision_function(clf, ax=None):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    Y, X = np.meshgrid(y, x)
    P = np.zeros_like(X)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            P[i, j] = clf.decision_function([[xi, yj]])
    # plot the margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
GO_KEYS = 'go.BP', 'go.MF', 'go.CC'

def parse_mygene_output(mygene_output):
    """Convert mygene.querymany output to a gene id to go term mapping (dictionary)
    
    Parameters
    ----------
    mygene_output : dict or list
        Dictionary (returnall=True) or list (returnall=False) of 
        output from mygene.querymany
        
    Output
    ------
    gene_name_to_go : dict
        Mapping of gene name to a set of GO ids    
    """
    # if "returnall=True" was specified, need to get just the "out" key
    if isinstance(mygene_output, dict):
        mygene_output = mygene_output['out']

    gene_name_to_go = collections.defaultdict(set)

    for line in mygene_output:
        gene_name = line['query']
        for go_key in GO_KEYS:
            try:
                go_terms = line[go_key]
            except KeyError:
                continue
            if isinstance(go_terms, dict):
                go_ids = set([go_terms['id']])
            else:
                go_ids = set(x['id'] for x in go_terms)
            gene_name_to_go[gene_name] |= go_ids
    return gene_name_to_go

Read in the Shalek2013 data



In [3]:

    
metadata = pd.read_csv('../data/shalek2013/metadata.csv', 
                               
                                     # Sets the first (Python starts counting from 0 not 1) column as the row names
                                      index_col=0)
expression = pd.read_csv('../data/shalek2013/expression.csv', 
                               
                                     # Sets the first (Python starts counting from 0 not 1) column as the row names
                                      index_col=0)
expression_feature = pd.read_csv('../data/shalek2013/expression_feature.csv', 
                               
                                     # Sets the first (Python starts counting from 0 not 1) column as the row names
                                      index_col=0)

# creating new column indicating color
metadata['color'] = metadata['maturity'].map(
    lambda x: 'MediumTurquoise' if x == 'immature' else 'Teal')
metadata.loc[metadata['pooled'], 'color'] = 'black'

# Create a column indicating both maturity and pooled for coloring with seaborn, e.g. sns.pairplot
metadata['group'] = metadata['maturity']
metadata.loc[metadata['pooled'], 'group'] = 'pooled'

# Create a palette and ordering for using with sns.pairplot
palette = ['MediumTurquoise', 'Teal', 'black']
order = ['immature', 'mature', 'pooled']

metadata









    Out[3]:






  
    
      
      phenotype
      pooled
      outlier
      maturity
      color
      group
    
  
  
    
      S1
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S2
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S3
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S4
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S5
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S6
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S7
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S8
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S9
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S10
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S11
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S12
      BDMC
      False
      False
      mature
      Teal
      mature
    
    
      S13
      BDMC
      False
      False
      mature
      Teal
      mature
    
    
      S14
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S15
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S16
      BDMC
      False
      False
      mature
      Teal
      mature
    
    
      S17
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      S18
      BDMC
      False
      False
      immature
      MediumTurquoise
      immature
    
    
      P1
      BDMC
      True
      False
      immature
      black
      pooled
    
    
      P2
      BDMC
      True
      False
      immature
      black
      pooled
    
    
      P3
      BDMC
      True
      False
      immature
      black
      pooled

Side note: getting LPS response genes using `query`

Get the "LPS response genes" using a query:



In [4]:

    
subset = expression_feature.query('gene_category == "LPS Response"')
subset.head()









    Out[4]:






  
    
      
      gene_category
    
  
  
    
      1110018G07RIK
      LPS Response
    
    
      1110032F04RIK
      LPS Response
    
    
      1110038F14RIK
      LPS Response
    
    
      1190002H23RIK
      LPS Response
    
    
      1200009I06RIK
      LPS Response

Assign the variable lps_response_genes based on the gene ids pulled out from this subset:



In [5]:

    
lps_response_genes = subset.index
lps_response_genes









    Out[5]:





Index(['1110018G07RIK', '1110032F04RIK', '1110038F14RIK', '1190002H23RIK',
       '1200009I06RIK', '1600014C10RIK', '1810029B16RIK', '2010002M12RIK',
       '2200002D01RIK', '2210009G21RIK',
       ...
       'ZCCHC2', 'ZCCHC6', 'ZDHHC21', 'ZFP36', 'ZFP558', 'ZFP800', 'ZFP811',
       'ZHX2', 'ZNFX1', 'ZUFSP'],
      dtype='object', length=945)

For this analysis We want to compare the difference between the "mature" and "immature" cells in the Shalek2013 data.



In [6]:

    
singles_ids = [x for x in expression.index if x.startswith('S')]
singles = expression.loc[singles_ids]
singles.shape









    Out[6]:





(18, 6312)

Use only the genes that are substantially expressed in single cells



In [7]:

    
singles = singles.loc[:, (singles > 1).sum() >= 3]
singles.shape









    Out[7]:





(18, 6013)

Now because computers only understand numbers, we'll convert the category label of "mature" and "immature" into integers to a using a LabelEncoder. Let's look at that column again, only for mature cells:



In [8]:

    
singles_maturity = metadata.loc[singles.index, 'maturity']
singles_maturity









    Out[8]:





S1     immature
S2     immature
S3     immature
S4     immature
S5     immature
S6     immature
S7     immature
S8     immature
S9     immature
S10    immature
S11    immature
S12      mature
S13      mature
S14    immature
S15    immature
S16      mature
S17    immature
S18    immature
Name: maturity, dtype: object



In [9]:

    
# Instantiate the encoder
encoder = preprocessing.LabelEncoder()

# Get number of categories and transform "mature"/"immature" to numbers
target = encoder.fit_transform(singles_maturity)
target









    Out[9]:





array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0])

Run the classifier!!

Yay so now we can run a classifier!



In [10]:

    
from sklearn.svm import SVC

classifier = SVC(kernel='linear')
classifier.fit(singles, target)









    Out[10]:





SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

We'll use PCA or ICA to reduce our data for visualizing the SVM decision boundary. Stick to 32 or fewer components because the next steps will die if you use more than 32. Also, this n_components variable will be used later so pay attention :)



In [63]:

    
n_components = 3

smusher = PCA(n_components=n_components)
smushed = pd.DataFrame(smusher.fit_transform(singles), index=singles.index)
print(smushed.shape)
smushed.head()



In [117]:

    
singles.head()









    Out[117]:






  
    
      
      NPL
      QK
      AK163153
      AGPAT4
      IGF2R
      MRPL18
      TCP1
      RNASEL
      ACAT2
      WTAP
      ...
      MIR692-1
      RSPH3B
      TAGAP1
      RNASET2A
      RPS6KA2
      DHX9
      RNASET2B
      FGFR1OP
      BRP44L
      SFT2D1
    
  
  
    
      S1
      4.290577
      5.038477
      1.249363
      0.095072
      2.648209
      5.476280
      5.711141
      0.065396
      4.138114
      1.558020
      ...
      4.165290
      0.705881
      0.000000
      4.700306
      0.033376
      0.573596
      3.199527
      0.000000
      6.134085
      3.699883
    
    
      S2
      0.000000
      4.183371
      1.947622
      5.868557
      0.000000
      0.000000
      1.072934
      0.000000
      5.346876
      1.275351
      ...
      3.646853
      3.288222
      0.000000
      0.000000
      1.061873
      0.312905
      1.060520
      0.000000
      6.715250
      1.786313
    
    
      S3
      4.860293
      3.847854
      1.082463
      4.137252
      3.350293
      4.428491
      4.795181
      0.120361
      0.184545
      0.627058
      ...
      2.179880
      2.155733
      0.055715
      0.313902
      3.044962
      0.223030
      0.272703
      0.154241
      6.911104
      0.000000
    
    
      S4
      0.090829
      0.066797
      1.119633
      0.066015
      0.287966
      5.351368
      5.050281
      0.099992
      0.153751
      4.563653
      ...
      3.476627
      1.325198
      4.680847
      2.846205
      3.190581
      0.303179
      0.000000
      0.634990
      7.116205
      0.123292
    
    
      S5
      0.000000
      3.305915
      1.267464
      0.000000
      0.000000
      5.644345
      5.809713
      0.000000
      3.807793
      0.494250
      ...
      3.598966
      0.505812
      2.457638
      1.172225
      0.000000
      0.038511
      0.000000
      4.883386
      7.387554
      4.569395
    
  

5 rows × 6013 columns

Let's add the group identifier here for plotting:



In [64]:

    
smushed_with_group = smushed.join(metadata['group'])
smushed_with_group

And plot our components in



In [65]:

    
sns.pairplot(smushed_with_group, hue='group', palette=palette,
             hue_order=order, plot_kws=dict(s=100,  edgecolor='white', linewidth=2))









    Out[65]:





<seaborn.axisgrid.PairGrid at 0x118bd3978>






    



/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "

Now we'll make a dataframe of 20 equally spaced intervals to show the full range of the data:



In [125]:

    
n_intervals = 10
smushed_intervals = pd.DataFrame(smushed).apply(lambda x: pd.Series(np.linspace(x.min(), x.max(), n_intervals)))
print(smushed_intervals.shape)
smushed_intervals.head()

Add zero to the top (the reason for this will make sense later)



In [126]:

    
smushed_intervals = pd.concat([pd.DataFrame([[0, 0, 0]]), smushed_intervals], ignore_index=True)
print(smushed_intervals.shape)
smushed_intervals.head()

You'll notice that the top (head()) has the minimum values and the bottom (tail()) has the maximum values:



In [127]:

    
smushed_intervals.tail()

Just to convince ourselves that this actually shows the range of all values, lets plot the smushed intervals and the smushed data in teh same spot:



In [128]:

    
fig, ax = plt.subplots()

ax.scatter(smushed_intervals[0], smushed_intervals[1], color='pink')
ax.scatter(smushed[0], smushed[1], color=metadata['color'])









    Out[128]:





<matplotlib.collections.PathCollection at 0x1194e3278>






    



/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "

Now we'll make a 2-dimensional grid of the whole space so we can plot the decision boundary



In [129]:

    
low_d_grid = np.meshgrid(*[smushed_intervals[col] for col in smushed_intervals])
print(len(low_d_grid))
print([x.shape for x in low_d_grid])









    



3
[(11, 11, 11), (11, 11, 11), (11, 11, 11)]

Let's plot the grid so we can see it!



In [130]:

    
fig, ax = plt.subplots()

ax.scatter(low_d_grid[0], low_d_grid[1], color='pink')
ax.scatter(smushed[0], smushed[1], color=metadata['color'])









    Out[130]:





<matplotlib.collections.PathCollection at 0x117e15630>






    



/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "



In [131]:

    
new_nrows = smushed_intervals.shape[0]**n_components
new_ncols = n_components

low_dimensional_vectors = pd.DataFrame(
    np.concatenate([x.flatten() for x in low_d_grid]).reshape(new_nrows, new_ncols, order='F'))
print(low_dimensional_vectors.shape)
low_dimensional_vectors.head()

Convert the smushed area into unsmushed high dimensional space



In [135]:

    
smusher =









    



  File "<ipython-input-135-138a212e45c5>", line 1
    smusher =
              ^
SyntaxError: invalid syntax



In [132]:

    
smusher.components_.shape









    Out[132]:





(3, 20)



In [133]:

    
high_dimensional_vectors = smusher.inverse_transform(low_dimensional_vectors)
high_dimensional_vectors.shape









    Out[133]:





(1331, 20)

Get the surface of the decision function



In [123]:

    
low_d_grid[0].shape









    Out[123]:





(12, 12, 12)



In [124]:

    
decision_surface = classifier.decision_function(high_dimensional_vectors)
print(decision_surface.shape)
decision_surface = decision_surface.reshape(low_d_grid[0].shape, order='F')
decision_surface.shape









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-124-b85de00344e2> in <module>()
----> 1 decision_surface = classifier.decision_function(high_dimensional_vectors)
      2 print(decision_surface.shape)
      3 decision_surface = decision_surface.reshape(low_d_grid[0].shape, order='F')
      4 decision_surface.shape

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/sklearn/svm/base.py in decision_function(self, X)
    540             n_classes)
    541         """
--> 542         dec = self._decision_function(X)
    543         if self.decision_function_shape is None and len(self.classes_) > 2:
    544             warnings.warn("The decision_function_shape default value will "

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/sklearn/svm/base.py in _decision_function(self, X)
    397         # NOTE: _validate_for_predict contains check for is_fitted
    398         # hence must be placed before any other attributes are used.
--> 399         X = self._validate_for_predict(X)
    400         X = self._compute_kernel(X)
    401 

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/sklearn/svm/base.py in _validate_for_predict(self, X)
    472             raise ValueError("X.shape[1] = %d should be equal to %d, "
    473                              "the number of features at training time" %
--> 474                              (n_features, self.shape_fit_[1]))
    475         return X
    476 

ValueError: X.shape[1] = 20 should be equal to 6013, the number of features at training time



In [98]:

    
.shape









    



  File "<ipython-input-98-de6688d32422>", line 1
    .shape
    ^
SyntaxError: invalid syntax



In [23]:

    
import itertools



In [28]:

    
low_d_grid[0].shape









    Out[28]:





(20, 20, 20)



In [85]:

    
pairgrid = sns.pairplot(smushed_with_group, hue='group')

for i, j in itertools.permutations(range(n_components), 2):
    ax = pairgrid.axes[i, j]
    
    # Commands to get decision surface
    z_coords = [':' if x in (i, j) else -1 for x in range(n_components)]
    z_command = 'Z[{}]'.format(', '.join(map(str, z_coords)))
    Z_subset = eval(z_command)
    print('Z_subset.shape', Z_subset.shape)
#     Z_smushed = pd.DataFrame(smusher.fit_transform(Z_subset))
#     print('Z_smushed.shape', Z_smushed.shape)
#     print('Z_smushed.head()', Z_smushed.head())
    
    ax.contour(smushed_intervals[i], smushed_intervals[j], Z_subset, colors='k',
           levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])









    



Z_subset.shape (20, 20)






    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-85-c2ab7213075e> in <module>()
     15     ax.contour(smushed_intervals[i], smushed_intervals[j], Z_subset, colors='k',
     16            levels=[-1, 0, 1], alpha=0.5,
---> 17            linestyles=['--', '-', '--'])

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/__init__.py in inner(ax, *args, **kwargs)
   1810                     warnings.warn(msg % (label_namer, func.__name__),
   1811                                   RuntimeWarning, stacklevel=2)
-> 1812             return func(ax, *args, **kwargs)
   1813         pre_doc = inner.__doc__
   1814         if pre_doc is None:

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/axes/_axes.py in contour(self, *args, **kwargs)
   5642             self.cla()
   5643         kwargs['filled'] = False
-> 5644         return mcontour.QuadContourSet(self, *args, **kwargs)
   5645     contour.__doc__ = mcontour.QuadContourSet.contour_doc
   5646 

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/contour.py in __init__(self, ax, *args, **kwargs)
   1422         are described in QuadContourSet.contour_doc.
   1423         """
-> 1424         ContourSet.__init__(self, ax, *args, **kwargs)
   1425 
   1426     def _process_args(self, *args, **kwargs):

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/contour.py in __init__(self, ax, *args, **kwargs)
    861         self._transform = kwargs.get('transform', None)
    862 
--> 863         self._process_args(*args, **kwargs)
    864         self._process_levels()
    865 

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/contour.py in _process_args(self, *args, **kwargs)
   1443                 self._corner_mask = mpl.rcParams['contour.corner_mask']
   1444 
-> 1445             x, y, z = self._contour_args(args, kwargs)
   1446 
   1447             _mask = ma.getmask(z)

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/contour.py in _contour_args(self, args, kwargs)
   1526             args = args[1:]
   1527         elif Nargs <= 4:
-> 1528             x, y, z = self._check_xyz(args[:3], kwargs)
   1529             args = args[3:]
   1530         else:

/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/contour.py in _check_xyz(self, args, kwargs)
   1573 
   1574             if nx != Nx:
-> 1575                 raise TypeError("Length of x must be number of columns in z.")
   1576 
   1577             if ny != Ny:

TypeError: Length of x must be number of columns in z.





    



/Users/olga/anaconda3/envs/single-cell-bioinformatics/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "



In [ ]:

    
fig, ax = plt.subplots()
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=target, cmap='Dark2')
ax.contour(X, Y, Z, colors='k',
           levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])



In [ ]:

    
np.reshape?

Evaluating classifiers through Gene Ontology (GO) Enrichment

Gene ontology is a tree (aka directed acyclic graph or "dag") of gene annotations. The topmost node is the most general, and the bottommost nodes are the most specific. Here is an example GO graph.

Three GO Domains:

Cellular Component (CC)
Molecular Function (MF)
Biological Process (BP)

Perform GO enrichment analysis (GOEA)

GOEA Step 1: Download GO graph file of "obo" type (same for all species)

This will download the file "go-basic.obo" if it doesn't already exist. This only needs to be done once.



In [ ]:

    
from goatools.base import download_go_basic_obo
obo_fname = download_go_basic_obo()

# Show the filename
obo_fname

GOEA Step 2: Create the GO graph (same for all species)



In [ ]:

    
obo_dag = goatools.obo_parser.GODag(obo_file='go-basic.obo')

GOEA Step 3: Get gene ID to GO id mapping (species-specific and experiment-specific)

Here we are establishing the background for our GOEA. Defining your background is very important because, for example, tehre are lots of neural genes so if you use all human genes as background in your study of which genes are upregulated in Neuron Type X vs Neuron Type Y, you'll get a bunch of neuron genes (which is true) but not the smaller differences between X and Y. Typicall, you use all expressed genes as the background.

For our data, we can access all expressed genes very simply by getting the column names in the dataframe: expression.columns.



In [ ]:

    
# Initialize the "mygene.info" (http://mygene.info/) interface
mg = mygene.MyGeneInfo()

mygene_output = mg.querymany(expression.columns, 
                             scopes='symbol', fields=['go.BP', 'go.MF', 'go.CC'], species='mouse', 
                             returnall=True)

gene_name_to_go = parse_mygene_output(mygene_output)

GOEA Step 4: Create a GO enrichment calculator object `go_enricher` (species- and experiment-specific)

In this step, we are using the two objects we've created (obo_dag from Step 2 and gene_name_to_go from Step 3) plus the gene ids to create a go_enricher object



In [ ]:

    
go_enricher = goatools.GOEnrichmentStudy(expression.columns, gene_name_to_go, obo_dag)

GOEA Step 5: Calculate go enrichment!!! (species- and experiment-specific)

Now we are ready to run go enrichment!! Let's take our enriched genes of interest and



In [ ]:

    
genes_of_interest = 

results = go.run_study(genes[:5])

go_enrichment = pd.DataFrame([r.__dict__ for r in results])
go_enrichment.head()



In [ ]:



In [ ]:

    
import pandas.util.testing as pdt



In [ ]:

    
pdt.assert_numpy_array_equal(two_d_space_v1, two_d_space_v2)



In [ ]:

    
two_d_space.shape



In [ ]:

    
plt.scatter(two_d_space[:, 0], two_d_space[:, 1], color='black')



In [ ]:

    
expression.index[:10]



In [ ]:

    
clf = ExtraTreesClassifier(n_estimators=100000, n_jobs=-1, verbose=1)



In [ ]:

    
expression.index.duplicated()



In [ ]:

    
expression.drop_duplicates()



In [ ]:

    
# assoc = pd.read_table('danio-rerio-gene-ontology.txt').dropna()
# assoc_df = assoc.groupby('Ensembl Gene ID').agg(lambda s: ';'.join(s))
# assoc_s = assoc_df['GO Term Accession'].apply(lambda s: set(s.split(';')))
# assoc_dict = assoc_s.to_dict()



In [ ]:

    
import goatools

# cl = gene_annotation.sort(col, ascending=False)[gene_annotation[col] > 5e-4].index
g = goatools.GOEnrichmentStudy(list(gene_annotation.index), assoc_dict, obo_dag, study=list(cl))



In [ ]:



In [ ]:

    
for r in g.results[:25]:
    print r.goterm.id, '{:.2}'.format(r.p_bonferroni), r.ratio_in_study, r.goterm.name, r.goterm.namespace



In [ ]:

    
unsmushed = smusher.inverse_transform(two_d_space)

Z = classifier.decision_function(unsmushed)
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots()
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=target, cmap='Dark2')
ax.contour(X, Y, Z, colors='k',
           levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])

Exercise 1

Try the same analysis, but use ICA instead of PCA.
1. How does that change the classification?
2. How does it change the enriched genes?
3. Are the cells closer or farther from the decision boundary?
4. Is that a "better" or "worse" classification? Why?
5. Why does the reduction algorithm affect the visualization of the classification?
Could you use MDS or t-SNE for plotting of the classifier boundary? Why or why not?
Try the same analysis, but use the "LPS Response" genes and a dimensionality reduction algorithm of your choice. (... how do you subset only certain columns out of the dataframe?)
1. How does that change the classification?
2. How does it change the enriched genes?
3. Are the cells closer or farther from the decision boundary?
4. Is that a "better" or "worse" classification? Why?
For (1) and (2) above, also fry using radial basis kernel (kernel="rbf") for SVC.
1. How does that change the classification?
2. How does it change the enriched genes?
3. Are the cells closer or farther from the decision boundary?
4. Is that a "better" or "worse" classification? Why?

Decision trees



In [ ]:

    
def visualize_tree(estimator, X, y, smusher, boundaries=True,
                   xlim=None, ylim=None):
    estimator.fit(X, y)
    smushed = smusher.fit_transform(X)

    if xlim is None:
        xlim = (smushed[:, 0].min() - 0.1, smushed[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (smushed[:, 1].min() - 0.1, smushed[:, 1].max() + 0.1)

    x_min, x_max = xlim
    y_min, y_max = ylim
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    two_d_space = np.c_[xx.ravel(), yy.ravel()]
    unsmushed = smusher.inverse_transform(two_d_space)
    Z = estimator.predict(unsmushed)

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='Paired')
    plt.clim(y.min(), y.max())

    # Plot also the training points
    plt.scatter(smushed[:, 0], smushed[:, 1], c=y, s=50, cmap='Paired')
    plt.axis('off')

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)        
    plt.clim(y.min(), y.max())
    
    # Plot the decision boundaries
    def plot_boundaries(i, xlim, ylim):
        if i < 0:
            return

        tree = estimator.tree_
        
        if tree.feature[i] == 0:
            plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
            plot_boundaries(tree.children_left[i],
                            [xlim[0], tree.threshold[i]], ylim)
            plot_boundaries(tree.children_right[i],
                            [tree.threshold[i], xlim[1]], ylim)
        
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])
            
    if boundaries:
        plot_boundaries(0, plt.xlim(), plt.ylim())



In [ ]:

    
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()


from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE, MDS


smusher = PCA(n_components=2)
# reduced_data = smusher.fit_transform(singles+1)

visualize_tree(classifier, singles, np.array(target), smusher)



In [ ]:

    
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
classifier = RandomForestClassifier()

smusher = PCA(n_components=2)
# reduced_data = smusher.fit_transform(singles+1)

visualize_tree(classifier, singles, np.array(target), smusher, boundaries=False)



In [ ]:

    
classifier = ExtraTreesClassifier()

smusher = PCA(n_components=2)
# reduced_data = smusher.fit_transform(singles+1)

visualize_tree(classifier, singles, np.array(target), smusher, boundaries=False)

Macaulay2016



In [ ]:

    
pd.options.display.max_columns = 50



In [ ]:

    
macaulay2016_metadata = pd.read_csv('../4._Case_Study/macaulay2016/sample_info_qc.csv', index_col=0)
macaulay2016_metadata.head()



In [ ]:

    
macaulay2016_cluster_names = tuple(sorted(macaulay2016_metadata['cluster'].unique()))
macaulay2016_cluster_names



In [ ]:

    
macaulay2016_target = macaulay2016_metadata['cluster'].map(lambda x: macaulay2016_cluster_names.index(x))
macaulay2016_target



In [ ]:

    
macaulay2016_expression = pd.read_csv('../4._Case_Study/macaulay2016/gene_expression_s.csv', index_col=0).T



In [ ]:

    
macaulay2016_expression.head()



In [ ]:

    
macaulay2016_expression_filtered = macaulay2016_expression[[x for x in macaulay2016_expression if x.startswith("ENS")]]
macaulay2016_expression_filtered.shape



In [ ]:

    
macaulay2016_expression_filtered = macaulay2016_expression_filtered.loc[macaulay2016_metadata.index]



In [ ]:

    
macaulay2016_expression_filtered = 1e6*macaulay2016_expression_filtered.divide(macaulay2016_expression_filtered.sum(axis=1), axis=0)
macaulay2016_expression_filtered.head()



In [ ]:

    
macaulay2016_expression_filtered = np.log10(macaulay2016_expression_filtered+1)
macaulay2016_expression_filtered.head()



In [ ]:

    
macaulay2016_expression_filtered = macaulay2016_expression_filtered.loc[:, (macaulay2016_expression_filtered > 1).sum() >=3]
macaulay2016_expression_filtered.shape



In [ ]:

    
# classifier = SVC(kernel='linear')
# classifier = DecisionTreeClassifier(max_depth=10)
classifier = ExtraTreesClassifier(n_estimators=1000)
classifier.fit(macaulay2016_expression_filtered, macaulay2016_target)



In [ ]:

    
smusher = FastICA(n_components=2, random_state=0)
smushed_data = smusher.fit_transform(macaulay2016_expression_filtered)

x_min, x_max = smushed_data[:, 0].min(), smushed_data[:, 0].max()
y_min, y_max = smushed_data[:, 1].min(), smushed_data[:, 1].max()
delta_x = 0.05 * abs(x_max - x_min)
delta_y = 0.05 * abs(x_max - x_min)

x_min -= delta_x
x_max += delta_x
y_min -= delta_y
y_max += delta_y

X = np.linspace(x_min, x_max, 100)
Y = np.linspace(y_min, y_max, 100)
xx, yy = np.meshgrid(X, Y)

two_d_space = np.c_[xx.ravel(), yy.ravel()]
two_d_space



In [ ]:

    
high_dimensional_space = smusher.inverse_transform(two_d_space)



In [ ]:

    
# Get the class boundaries
Z = classifier.predict(high_dimensional_space)



In [ ]:

    
import matplotlib as mpl
macaulay2016_metadata['cluster_color_hex'] = macaulay2016_metadata['cluster_color'].map(lambda x: mpl.colors.rgb2hex(eval(x)))



In [ ]:

    
int_to_cluster_name = dict(zip(range(len(macaulay2016_cluster_names)), macaulay2016_cluster_names))
int_to_cluster_name



In [ ]:

    
cluster_name_to_color = dict(zip(macaulay2016_metadata['cluster'], macaulay2016_metadata['cluster_color_hex']))
cluster_name_to_color



In [ ]:

    
macaulay2016_palette = [mpl.colors.hex2color(cluster_name_to_color[int_to_cluster_name[i]]) 
                        for i in range(len(macaulay2016_cluster_names))]
macaulay2016_palette



In [ ]:

    
cmap = mpl.colors.ListedColormap(macaulay2016_palette)
cmap



In [ ]:

    
x_min, x_max



In [ ]:

    
y = macaulay2016_target
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap=cmap)
plt.clim(y.min(), y.max())

# Plot also the training points
plt.scatter(smushed_data[:, 0], smushed_data[:, 1], s=50, color=macaulay2016_metadata['cluster_color_hex'], 
            edgecolor='k') #c=macaulay2016_target, s=50, cmap='Set2')
plt.axis('off')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)        
plt.clim(y.min(), y.max())



In [ ]:

    
smusher = FastICA(n_components=4, random_state=354)
smushed_data = pd.DataFrame(smusher.fit_transform(macaulay2016_expression_filtered))

# x_min, x_max = smushed_data[:, 0].min(), smushed_data[:, 0].max()
# y_min, y_max = smushed_data[:, 1].min(), smushed_data[:, 1].max()
# delta_x = 0.05 * abs(x_max - x_min)
# delta_y = 0.05 * abs(x_max - x_min)

# x_min -= delta_x
# x_max += delta_x
# y_min -= delta_y
# y_max += delta_y

# X = np.linspace(x_min, x_max, 100)
# Y = np.linspace(y_min, y_max, 100)
# xx, yy = np.meshgrid(X, Y)

# low_dimensional_space = np.c_[xx.ravel(), yy.ravel()]
# low_dimensional_space



In [ ]:

    
smushed_data.max() - smushed_data.min()



In [ ]:

    
grid = smushed_data.apply(lambda x: pd.Series(np.linspace(x.min(), x.max(), 50)))
grid.head()
# grid = [x.ravel() for x in grid]
# grid
# low_dimensional_space = np.concatenate(grid, axis=0)
# low_dimensional_space.shape
# # low_dimensional_space = low_dimensional_space.reshape(shape)



In [ ]:

    
x1, x2, x3, x4 = np.meshgrid(*[grid[col] for col in grid])
low_dimensional_space = np.c_[x1.ravel(), x2.ravel(), x3.ravel(), x4.ravel()]



In [ ]:

    
high_dimensional_space = smusher.inverse_transform(low_dimensional_space)



In [ ]:

    
smushed_data['hue'] = macau



In [ ]:

    
sns.pairplot(smushed_data)

	0	1	2
S1	-50.895164	51.051727	-0.475594
S2	-6.621843	-16.845237	-53.126960
S3	-27.129347	-13.530018	38.690930
S4	-23.018061	-34.946561	12.429219
S5	-29.971542	-1.820914	25.019142

	0	1	2	group
S1	-50.895164	51.051727	-0.475594	immature
S2	-6.621843	-16.845237	-53.126960	immature
S3	-27.129347	-13.530018	38.690930	immature
S4	-23.018061	-34.946561	12.429219	immature
S5	-29.971542	-1.820914	25.019142	immature
S6	-24.764453	14.574389	28.669515	immature
S7	-20.374686	-11.174974	47.193818	immature
S8	-29.401500	9.302269	-6.327138	immature
S9	-20.940892	84.294129	-64.999387	immature
S10	-2.446060	-51.355672	0.803399	immature
S11	-4.313523	-31.422904	-34.798855	immature
S12	120.603174	6.673175	-3.347440	mature
S13	96.241872	31.074948	40.545691	mature
S14	-40.987292	6.963253	-3.442883	immature
S15	8.534140	-16.796547	-56.050497	immature
S16	42.654368	55.552508	14.577843	mature
S17	37.197595	-73.955499	-28.271309	immature
S18	-24.366786	-7.638072	42.910508	immature

	0	1	2
0	-50.895164	-73.955499	-64.999387
1	-31.839793	-56.372207	-52.533475
2	-12.784422	-38.788915	-40.067563
3	6.270949	-21.205623	-27.601652
4	25.326320	-3.622331	-15.135740

	0	1	2
0	0.000000	0.000000	0.000000
1	-50.895164	-73.955499	-64.999387
2	-31.839793	-56.372207	-52.533475
3	-12.784422	-38.788915	-40.067563
4	6.270949	-21.205623	-27.601652

	0	1	2
6	44.381690	13.960961	-2.669829
7	63.437061	31.544253	9.796083
8	82.492432	49.127545	22.261995
9	101.547803	66.710837	34.727906
10	120.603174	84.294129	47.193818

	phenotype	pooled	outlier	maturity	color	group
S1	BDMC	False	False	immature	MediumTurquoise	immature
S2	BDMC	False	False	immature	MediumTurquoise	immature
S3	BDMC	False	False	immature	MediumTurquoise	immature
S4	BDMC	False	False	immature	MediumTurquoise	immature
S5	BDMC	False	False	immature	MediumTurquoise	immature
S6	BDMC	False	False	immature	MediumTurquoise	immature
S7	BDMC	False	False	immature	MediumTurquoise	immature
S8	BDMC	False	False	immature	MediumTurquoise	immature
S9	BDMC	False	False	immature	MediumTurquoise	immature
S10	BDMC	False	False	immature	MediumTurquoise	immature
S11	BDMC	False	False	immature	MediumTurquoise	immature
S12	BDMC	False	False	mature	Teal	mature
S13	BDMC	False	False	mature	Teal	mature
S14	BDMC	False	False	immature	MediumTurquoise	immature
S15	BDMC	False	False	immature	MediumTurquoise	immature
S16	BDMC	False	False	mature	Teal	mature
S17	BDMC	False	False	immature	MediumTurquoise	immature
S18	BDMC	False	False	immature	MediumTurquoise	immature
P1	BDMC	True	False	immature	black	pooled
P2	BDMC	True	False	immature	black	pooled
P3	BDMC	True	False	immature	black	pooled

	gene_category
1110018G07RIK	LPS Response
1110032F04RIK	LPS Response
1110038F14RIK	LPS Response
1190002H23RIK	LPS Response
1200009I06RIK	LPS Response

	NPL	QK	AK163153	AGPAT4	IGF2R	MRPL18	TCP1	RNASEL	ACAT2	WTAP	...	MIR692-1	RSPH3B	TAGAP1	RNASET2A	RPS6KA2	DHX9	RNASET2B	FGFR1OP	BRP44L	SFT2D1
S1	4.290577	5.038477	1.249363	0.095072	2.648209	5.476280	5.711141	0.065396	4.138114	1.558020	...	4.165290	0.705881	0.000000	4.700306	0.033376	0.573596	3.199527	0.000000	6.134085	3.699883
S2	0.000000	4.183371	1.947622	5.868557	0.000000	0.000000	1.072934	0.000000	5.346876	1.275351	...	3.646853	3.288222	0.000000	0.000000	1.061873	0.312905	1.060520	0.000000	6.715250	1.786313
S3	4.860293	3.847854	1.082463	4.137252	3.350293	4.428491	4.795181	0.120361	0.184545	0.627058	...	2.179880	2.155733	0.055715	0.313902	3.044962	0.223030	0.272703	0.154241	6.911104	0.000000
S4	0.090829	0.066797	1.119633	0.066015	0.287966	5.351368	5.050281	0.099992	0.153751	4.563653	...	3.476627	1.325198	4.680847	2.846205	3.190581	0.303179	0.000000	0.634990	7.116205	0.123292
S5	0.000000	3.305915	1.267464	0.000000	0.000000	5.644345	5.809713	0.000000	3.807793	0.494250	...	3.598966	0.505812	2.457638	1.172225	0.000000	0.038511	0.000000	4.883386	7.387554	4.569395