Branching GP Regression on hematopoietic data

Alexis Boukouvalas, 2017

Branching GP regression with Gaussian noise on the hematopoiesis data described in the paper "BGP: Gaussian processes for identifying branching dynamics in single cell data".

This notebook shows how to build a BGP model and plot the posterior model fit and posterior branching times.


In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

Read the hematopoiesis data. This has been simplified to a small subset of 23 genes found to be branching.

We have also performed Monocle2 (version 2.1) - DDRTree on this data. The results loaded include the Monocle estimated pseudotime, branching assignment (state) and the DDRTree latent dimensions.


In [2]:
Y = pd.read_csv('singlecelldata/hematoData.csv', index_col=[0])
monocle = pd.read_csv('singlecelldata/hematoMonocle.csv', index_col=[0])

In [3]:
Y.head()


Out[3]:
CAR2 APOE MPO PRTN3 CTSG CEBPA FOXP1 RUNX1 CEBPE GFI1 ... MEF2C MEIS1 GATA2 NFE2 TAL1 GATA1 KLF1 FLT3 SATB1 ELANE
W29956 0.0 14.272981 7.136490 2.378830 4.757660 0.0 0.000000 0.000000 0.0 0.000000 ... 0.000000 0.0 0.000000 2.378830 0.000000 3.568245 0.0 0.000000 0.0 0.000000
W30018 0.0 0.000000 14.331045 18.239511 14.331045 0.0 0.000000 0.000000 0.0 0.000000 ... 0.000000 0.0 0.000000 2.605644 0.000000 0.000000 0.0 1.302822 0.0 0.000000
W30161 0.0 8.191847 0.000000 4.095923 4.095923 0.0 4.095923 0.000000 0.0 0.000000 ... 4.095923 0.0 4.095923 12.287769 4.095923 0.000000 0.0 0.000000 0.0 0.000000
W30258 0.0 8.889219 11.428996 11.428996 0.000000 0.0 0.000000 1.269889 0.0 0.000000 ... 0.000000 0.0 1.269889 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.000000
W30296 0.0 0.000000 64.317993 27.564854 15.313807 0.0 0.000000 0.000000 0.0 1.020921 ... 0.000000 0.0 0.000000 1.020921 0.000000 0.000000 0.0 0.000000 0.0 7.146443

5 rows × 23 columns


In [4]:
monocle.head()


Out[4]:
State StretchedPseudotime DDRTreeDim1 DDRTreeDim2
0 3.0 0.312208 -0.728425 -4.253484
1 2.0 0.256831 -3.131191 -2.867801
2 3.0 0.435655 3.106984 -4.869838
3 3.0 0.413354 0.881536 -5.958776
4 2.0 0.274771 -3.366619 -2.444610

In [5]:
# Plot Monocle DDRTree space
genelist = ['FLT3','KLF1','MPO']
f, ax = plt.subplots(1, len(genelist), figsize=(10, 5), sharex=True, sharey=True)
for ig, g in enumerate(genelist):
        y = Y[g].values
        yt = np.log(1+y/y.max())
        yt = yt/yt.max()
        h = ax[ig].scatter(monocle['DDRTreeDim1'], monocle['DDRTreeDim2'],
                       c=yt, s=50, alpha=1.0, vmin=0, vmax=1)
        ax[ig].set_title(g)



In [6]:
import BranchedGP, time
def PlotGene(label, X, Y, s=3, alpha=1.0, ax=None):
    fig = None
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    for li in np.unique(label):
        idxN = (label == li).flatten()
        ax.scatter(X[idxN], Y[idxN], s=s, alpha=alpha, label=int(np.round(li)))
    return fig, ax


/Users/mqbssaby/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

Fit BGP model

Notice the cell assignment uncertainty is higher for cells close to the branching point.


In [8]:
def FitGene(g, ns=20): # for quick results subsample data
    t = time.time()
    Bsearch = list(np.linspace(0.05, 0.95, 5)) + [1.1]  # set of candidate branching points
    GPy = (Y[g].iloc[::ns].values - Y[g].iloc[::ns].values.mean())[:, None]  # remove mean from gene expression data
    GPt = monocle['StretchedPseudotime'].values[::ns]
    globalBranching = monocle['State'].values[::ns].astype(int)
    d = BranchedGP.FitBranchingModel.FitModel(Bsearch, GPt, GPy, globalBranching)
    print(g, 'BGP inference completed in %.1f seconds.' %  (time.time()-t))
    # plot BGP
    fig,ax=BranchedGP.VBHelperFunctions.PlotBGPFit(GPy, GPt, Bsearch, d, figsize=(10,10))
    # overplot data
    f, a=PlotGene(monocle['State'].values, monocle['StretchedPseudotime'].values, Y[g].values-Y[g].iloc[::ns].values.mean(), 
                  ax=ax[0], s=10, alpha=0.5)
    # Calculate Bayes factor of branching vs non-branching
    bf = BranchedGP.VBHelperFunctions.CalculateBranchingEvidence(d)['logBayesFactor']

    fig.suptitle('%s log Bayes factor of branching %.1f' % (g, bf))    
    return d, fig, ax
d, fig, ax = FitGene('MPO')


INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 1921.177079
  Number of iterations: 57
  Number of functions evaluations: 69
INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 2013.111825
  Number of iterations: 45
  Number of functions evaluations: 49
INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 2616.968572
  Number of iterations: 83
  Number of functions evaluations: 98
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 4256.654645
  Number of iterations: 100
  Number of functions evaluations: 116
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 3889.340729
  Number of iterations: 100
  Number of functions evaluations: 126
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 4332.965766
  Number of iterations: 100
  Number of functions evaluations: 113
MPO BGP inference completed in 174.2 seconds.

In [9]:
d_c, fig_c, ax_c = FitGene('CTSG')


INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 974.759460
  Number of iterations: 100
  Number of functions evaluations: 122
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 1019.107789
  Number of iterations: 100
  Number of functions evaluations: 115
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 1163.893313
  Number of iterations: 100
  Number of functions evaluations: 128
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 1602.213731
  Number of iterations: 100
  Number of functions evaluations: 116
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: 1772.048462
  Number of iterations: 100
  Number of functions evaluations: 130
INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 1772.288783
  Number of iterations: 32
  Number of functions evaluations: 37
CTSG BGP inference completed in 244.3 seconds.

In [ ]: