A stochastic model of stem cell proliferation, based on the growth of spleen colony-forming cells

Reading Course 2015

Session 2

Michael Strasser

29/04/2015

Institute of Computational Biology, Helmholtz Zentrum Muenchen



In [7]:

    
import  scipy.stats as stats
from scipy.stats import binom
from __future__ import division



In [8]:

    
%pylab
%matplotlib inline









    



Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib



In [9]:

    
import seaborn as sns
plt.plot([1,2,3], [2,3,5])
pylab.rcParams['figure.figsize'] = 12, 6

Hematopoiesis

best studied stem cell system
easy access
clean (celltypes are known)

Current view

Orkin, S. H. & Zon, L. I. Hematopoiesis: an evolving paradigm for stem cell biology. Cell 132, 631–44 (2008).

Hematopoiesis in 1960

About the paper

Till, J. E., McCulloch, E. a. & Siminovitch, L. A stochastic model of stem cell proliferation, based on the growth of spleen colony-forming cells . Proc. Natl. Acad. Sci. U. S. A. 51, 29–36 (1964).

First evidence for stem cells by Till/McCulloch around 1960

The Radiation Sensitivity of Normal Mouse Bone Marrow Cells, Determined by Quantitative Marrow Transplantation into Irradiated Mice
This one is a follow up, analysing data from

Siminovitch, L., McCulloch, E. & Till, J. The distribution of colony forming cells among spleen colonies'

Evidence for stem cells

Inject bone marrow cells into irradiated mice
Marrow cells form colonies in spleen (german: 'Milz', part of blood/lymph-system):

Colony forming cells/units (CFU)

Analysis of the colonies/CFUs

Rare celltype: 1/10000 cells
Colonies contain different kinds of blood cells
- multipotent
Colonies contain CFUs again (secondary transplants)
- self renewal
Capable of rapid proliferation: $10^6$ cell progeny in 10 days
What's the number of divisions and the division time?



In [10]:

    
nCells = 1e6
nDays  = 10
nDivisions = np.log2(nCells)
time_per_division = nDays * 24 / nDivisions
print "#Generations/Divisions: %d\n\nAverage generation time %.2f (hours)" %(round(nDivisions), time_per_division)









    



#Generations/Divisions: 20

Average generation time 12.04 (hours)

Defining properties of stem cells

multipotent and able to differentiate
- visual inspection of spleen colonies
- spleen colonies contain diff. cells
are capable of self-renewal
- secondary "transplants"
- cells within the spleen colonies still have stem cell props

BASIC DEFINITION OF STEM CELLS EVEN TODAY

Basic Questions

how is this balance between self-renewal and differentation regulated
Theories
- deterministic : tighly controlled, each cell knows what to do
- stochastic: no control, seemingly random

The data

The distribution of colony forming cells among spleen colonies, Siminovitch 1963

Table 2,3 of Siminovitch
Number of CFU per colony (i.e. #stem cells per colony)



In [11]:

    
colonies_per_spleen = dict()
colonies_per_spleen["exp1"] = np.concatenate([np.repeat(0,24),
                                      np.repeat(1,6),
                                      np.array([2,6,19,21,23,36])
                                      ])

colonies_per_spleen["exp2"] = np.concatenate([np.repeat(0,3),
                                      np.repeat(1,3),
                                      np.repeat(2,2),
                                      np.array([3,3, 20,20, 32])
                                      ])

colonies_per_spleen["exp3"]= np.concatenate([np.repeat(0,12),
                                      np.repeat(1,8),
                                      np.repeat(2,5),
                                      np.repeat(3,2),
                                      np.array([4,5,5,7,8,8,11,13,20,23,29,46])
                                      ])

Efficiency of the CFUs

only a fraction $f$ of stem cells makes it into the spleen
measure $f$:
1. inject known amount of CFUs into mouse
2. remove spleen 2h after
3. disolve spleen, reinject, count

Kind of circular argument (3. is again a function of f)



In [12]:

    
fraction = 0.17 # the fraction of CFUs that actually go to the spleen

Converting to actual CFU numbers

due to loss from injection to spleen

colonies_per_spleen = f * trueCFUs



In [13]:

    
#just divide each dict entry by 'fraction'
tmpFunc = lambda x: np.ceil(x/fraction)

# apply this function to each dictionary entry
trueCFUs = {k: tmpFunc(v) for k,v in colonies_per_spleen.items()}

print "#Colonies per spleen"
print colonies_per_spleen["exp1"]

print "Estimated #CFU per spleen"
print trueCFUs["exp1"]









    



#Colonies per spleen
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
  1  1  1  1  1  2  6 19 21 23 36]
Estimated #CFU per spleen
[   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    6.    6.    6.    6.    6.    6.   12.   36.  112.  124.  136.  212.]

What's the catch here?

Statistical uncertainty

e.g. all the spleens with 0 colonies: hard to say if 0,1 or 6 stem cells were injected and all lost

Binomial sampling

$N$ stem cells injected: unknown
$k$ colonies observed
prob of colony formation: $f=0.17$

What range of $N$ is plausible for given $k$, $f$ ?

Usually

Binomial distribution

$P(k|N,\theta) = \binom{N}{k} \cdot \theta^k (1-\theta)^{N-k}$
usually, we estimate the posterior $P(\theta|k,N)$
HOW?

Bayes' Law

$$P(\theta|k,N)=\frac{L(k,N|\theta) \cdot \pi(\theta)}{\int L(k,N|\theta) \cdot \pi(\theta)\;d\theta}$$



In [14]:

    
#observations
N = 100
k = 10

#inference
tmpP = np.linspace(0,1,100)
like_theta = binom.pmf(k,n=N,p=tmpP)
prior_theta = 1 # simply constant
posterior_theta = like_theta * prior_theta/ np.sum(like_theta * prior_theta)

plot(tmpP,posterior_theta)
xlabel('theta'), ylabel('Posterior(theta|k)'), xlim([0,0.4])









    Out[14]:





(<matplotlib.text.Text at 0x7f99b5cc25d0>,
 <matplotlib.text.Text at 0x7f99b5dc3110>,
 (0, 0.4))

Instead

also works the other way around:
estimate the posterior of $N$ for observed $f$ and $k$:
$$P(N|k,f)=\frac{L(k,f|N) \cdot \pi(N)}{\int L(k,f|N) \cdot \pi(N)\;dN}$$



In [15]:

    
n = np.arange(300)
data = unique(colonies_per_spleen["exp1"]) #just considering the first experiment
posteriorArray = np.zeros((len(data), len(n)))
for i,d in enumerate(data):
    likelihood_n = binom.pmf(d, n, fraction)
    prior_n = 1 #constant
    posterior_n = likelihood_n * prior_n / np.sum(likelihood_n*prior_n)
    posteriorArray[i,:] = posterior_n

plot(n,posteriorArray.T)
xlabel('N'), ylabel('Posterior(N|k,f)'), legend(data),  ylim([0, 0.1])









    Out[15]:





(<matplotlib.text.Text at 0x7f99b5dce810>,
 <matplotlib.text.Text at 0x7f99b5cdec10>,
 <matplotlib.legend.Legend at 0x7f99b5d460d0>,
 (0, 0.1))

Conclusion: careful with uncertainties, ignorance makes you overconfident

How to include uncertainty in $f = 0.17 \pm 0.02$ ?

A first look at the data

lets ignore those uncertainties now for simplicity and use the MLEs (counts/0.17)
uncertainty propagation in the end (maybe)

Histograms



In [16]:

    
#normal hists
binning = np.arange(0,300,step=5)
plt.figure()
for exp,data in trueCFUs.iteritems():
    plt.hist(data, bins=binning, histtype="stepfilled", alpha=.3)
plt.legend(trueCFUs.keys())
xlabel('#CFU'), ylabel('Frequency')









    Out[16]:





(<matplotlib.text.Text at 0x7f99b5cab850>,
 <matplotlib.text.Text at 0x7f99b5d30450>)

Cumulative distribution function

Definition: $P(X\le x) = \sum_{i=-\infty}^x P(X=x)$



In [17]:

    
def ecdf(x):
    """ calculate the empirical distribution function of data x """
    sortedX=np.sort( x )
    yvals=np.arange(len(sortedX))/float(len(sortedX))
    return sortedX, yvals


# plot the CDF for each experiment individually
for exp,data in trueCFUs.iteritems():
    x,y = ecdf(data)
    plt.plot(x,y)
xlabel('#CFUs'), ylabel('cumulative density')

#put all three together
trueCFUs_all = np.concatenate(trueCFUs.values())
x,y = ecdf(trueCFUs_all)
plt.plot(x,y, linewidth=5)

leg= trueCFUs.keys()
leg.append('all')
plt.legend(leg)









    Out[17]:





<matplotlib.legend.Legend at 0x7f99b58b3a90>

Figure 4 from the paper



In [18]:

    
plt.figure() # the figure as presented in the paper (scaled back to colonies)
plt.plot(x*0.17,y*len(trueCFUs_all), linewidth=2)
xlim([0,25])
ylim([0,120])
title('as in the paper')









    Out[18]:





<matplotlib.text.Text at 0x7f99b58dc710>

Comparison to a Possion distribution

why would you compare to a Poisson?
for Possion, $E[X] = Var[X]$
or equivalent: Fano-factor $F = \frac{Var[X]}{E[X]} = 1$



In [19]:

    
mu = np.mean(trueCFUs_all)
sigma= np.std(trueCFUs_all)
print "Mean: %f Fano factor %f" % (mu,sigma**2/mu)









    



Mean: 26.988636 Fano factor 104.922522



In [20]:

    
# put the poisson distribution as a comparision which has the same mean as the data
x_ECDF,y_ECDF = ecdf(trueCFUs_all)

x = np.arange(0,100)
y =stats.poisson.cdf(x,mu)

plt.subplot(1,2,1)
plt.plot(x_ECDF, y_ECDF, linewidth=2)
plt.plot(x,y, linewidth=2)
plt.xlim([0,300])
plt.xlabel('#CFUs'), plt.ylabel('CDF'), plt.legend(['Data','Poisson fit'])

# DO THE PDF as well, much easier to see
#plt.figure()
plt.subplot(1,2,2)
sns.distplot(trueCFUs_all,norm_hist=True, kde=False)
plt.plot(x,stats.poisson.pmf(x,mu), linewidth=2)
plt.xlabel('#CFUs'), plt.ylabel('CDF'), plt.legend(['Poisson fit', 'Data'])









    Out[20]:





(<matplotlib.text.Text at 0x7f99b60d1a10>,
 <matplotlib.text.Text at 0x7f99b574d750>,
 <matplotlib.legend.Legend at 0x7f99b5e38f50>)

Comparison to a $\Gamma$-distribution

two parameters
- shape $k$, scale $\theta$
- $E[X] = k \cdot\theta$
- $Var[X] = k \cdot \theta^2$
conveniently, scipy.stats does a MLE fit of $\Gamma$-distributions out of the box



In [21]:

    
plt.subplot(1,2,1)
sns.distplot(trueCFUs_all, fit =stats.gamma, kde=False)
xlim([0, 300])
plt.xlabel('#CFUs'), plt.ylabel('PDF'), plt.legend(['Gamma fit', 'Data'])

# lets look at the CDF as well
shape,loc,scale = stats.gamma.fit(trueCFUs_all)
plt.title('Fitted Gamma(Shape=%.2f, scale=%.2f)' % (shape,scale))

x_Gamma = np.arange(0,200)
y_Gamma = stats.gamma(shape,scale=scale).cdf(x_Gamma)

plt.subplot(1,2,2)
plt.plot(x_ECDF, y_ECDF, linewidth=2)
plt.plot(x_Gamma, y_Gamma)
plt.xlabel('#CFUs'), plt.ylabel('CDF'), plt.legend(['Data', 'Gamma fit'])









    Out[21]:





(<matplotlib.text.Text at 0x7f99b54bbed0>,
 <matplotlib.text.Text at 0x7f99b5450610>,
 <matplotlib.legend.Legend at 0x7f99b543ca90>)

Uncertainty revisited

Until now

Sampling error was ignored entirely
MLE estimates of #CFU

Including sampling uncertainty

fitting a $\Gamma$-distribution to the observations, accounting for these errors
Bayesian inference
- graphical model (network of random variables)

describes the joint distribution $P(\alpha,\beta,f,nCFU,nColonies)$
inference via MCMC
- estimate $P(\alpha,\beta,f |nColonies)$
- Pymc3 package

The posterior $P(\alpha,\beta, f | Data)$

Samples from the posterior

Conclusion

Poisson model doesn't fit at all
$\Gamma$-distribution looks better, but kind of unmotivated

The stochastic model

How to explain the data otherwise?

two cell states:
- stem cells
- differentiated cells

probabilistic model:
- decision to divide or differentiated is stochastic
- unknown probability p0 to differentiate (p2= 1-p0)
solvable analytically only for exponential lifetimes
instead: Monte Carlo simulation

Fun fact: Hardware in 1960

IBM 7090
500.000 \$
Gemini/Mercury space flights
100 kFLOPS/s
- modern mobile phone: 100 MFLOPS/s
- this laptop: ~ GFLOPS/s



In [22]:

    
def simulateTree(p0, nGens):
    """  simulates a single tree from the Till/McCulloch model
    
    inputs:
        - p0: probability that a single cell undergoes terminal 
              differentiation (i.e. no more division)
        - nGens: number of generations to simulate
    
    returns: 
        - a list (one element per generation) of single cells 
          present at that generation.
        - a single element is just an array of cells present at that time 
          (zeros for stem cells, 1s for differentiated cells).
    """
    
    # cell state is either 0 (stem cell) or 1 (differentiated),
    # which is the only thing we keep track of here
    theGenerations = list()
    theGenerations.append(np.array(0))
    
    for g in range(nGens):
        lastGen = theGenerations[-1]
        
        # for each of the last generation, roll a dice whether it terminally diffs
        newState = roll_the_dice(lastGen, p0)
        
        #all the zeros divide, the 1's just stay
        n0 = sum(newState==0)
        n1 = sum(newState==1)
        nextGen = np.concatenate([np.repeat(0, 2*n0), np.repeat(1,n1)])

        theGenerations.append(nextGen)

    return theGenerations


def roll_the_dice(cellstate_array, p0):
    """
    decide if a cell goes from 0->1 (wit probability p0)
    does that for an entire vector of zeros and ones in paralell
    """
    # helper function so that we can index into it via generation
    # makes sure that as soon as cell_state==1 it wont change anymore
    tmpP = np.array([p0, 1])     
    p = tmpP[cellstate_array]
    r = np.random.rand(cellstate_array.size)
    newGeneration = r<p
    
    return newGeneration.astype(int)

Create some trees/colonies, calculate number of CFUs within



In [23]:

    
def pretty_print_tree(tree):
    for gen in tree:
        print gen



In [24]:

    
p0 = 0.4
nGens = 5
aTree = simulateTree(p0,nGens)
bTree = simulateTree(p0,nGens)
print 'First tree'
pretty_print_tree(aTree)

print 'Second tree'
pretty_print_tree(bTree)









    



First tree
0
[1]
[1]
[1]
[1]
[1]
Second tree
0
[0 0]
[0 0 1]
[0 0 1 1]
[0 0 1 1 1]
[0 0 0 0 1 1 1]



In [25]:

    
#CFUs per generation
print '\n#CFUs, first tree'
print map(lambda x: sum(x==0), aTree)

print '\n#CFUs, second tree'
print map(lambda x: sum(x==0), bTree)









    



#CFUs, first tree
[1, 0, 0, 0, 0, 0]

#CFUs, second tree
[1, 2, 2, 2, 2, 4]

Distribution of #CFUs over time



In [26]:

    
nTrees = 1000
nGens = 20
p0 = 0.4

# assemble a matrix of tree vs #CFUs(t)  (one row, one tree; one col one timepoint)
cfus_over_time = np.zeros((nTrees, nGens+1))
for i in range(nTrees):
    tree = simulateTree(p0, nGens)
    nCFU = map(lambda x: np.sum(x==0), tree)
    cfus_over_time[i,:] = nCFU   

print cfus_over_time[0:5,:].astype(int)









    



[[ 1  2  2  4  6  8 12 12 10 12 14 14 20 16 24 32 40 40 42 44 48]
 [ 1  2  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 1  2  4  4  6  8  8  8  6  4  4  6  4  6  8 12 16 18 24 24 26]
 [ 1  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 1  2  4  8  8 14 18 20 26 30 32 28 30 38 34 40 52 58 68 88 90]]



In [27]:

    
import seaborn as sns

plt.plot(log10(cfus_over_time.T))
plt.xlabel('Generation')
plt.ylabel("log10(#CFU)")









    Out[27]:





<matplotlib.text.Text at 0x7f99b552f0d0>



In [28]:

    
# violin plot
plt.figure()
sns.violinplot((log10(cfus_over_time+1)), inner='points')
plt.xlabel('Generation')
plt.ylabel("log10(#CFU+1)")









    Out[28]:





<matplotlib.text.Text at 0x7f99b3eee990>

Last generation



In [29]:

    
lastGen_nCFUs = cfus_over_time[:,-1]

plt.subplot(1,2,1)
plt.hist(lastGen_nCFUs, bins=50)
xlabel('#CFUs')
ylabel('Frequency in last generation')

# CDF
plt.subplot(1,2,2)
x_ECDF_lastGen,y_ECDF_lastGen = ecdf(lastGen_nCFUs)
plt.plot(x_ECDF_lastGen, y_ECDF_lastGen, linewidth=2)
xlabel('#CFUs')
ylabel('CDF')









    Out[29]:





<matplotlib.text.Text at 0x7f99b23040d0>

$\Gamma$-distribution fit



In [30]:

    
shape, loc, scale = stats.gamma.fit(lastGen_nCFUs)
fittedGamma = stats.gamma(a=shape, loc=loc, scale=scale)
print "Shape: %.3f,  Location: %.3f,  Scale: %.3f" %(shape, loc, scale)

# histogram of the simulated data
n,bins = np.histogram(lastGen_nCFUs, bins=arange(0, 300, step=5))

# pdf, cdf
fittedPDF, fittedCDF  =  fittedGamma.pdf(bins[:-1]), fittedGamma.cdf(bins[:-1])

plt.subplot(1,2,1)
plt.plot(x_ECDF, y_ECDF, linewidth=2)
plt.plot(bins[:-1], fittedCDF)
xlabel('#CFU'), ylabel('CDF'), legend(['synthetic Data','Gamma Fit'])

## the PDF derived from the cdf
plt.subplot(1,2,2)
plt.plot(bins[:-1],n/n.sum())
plt.plot(bins[:-1], diff(np.insert(fittedCDF, 0, 0)))









    



Shape: 0.260,  Location: -0.000,  Scale: 60.575






    Out[30]:





[<matplotlib.lines.Line2D at 0x7f99b213f250>]



In [31]:

    
#estimates, based on matching means an vars
gamma_mean, gamma_var = fittedGamma.stats(moments='mv')
sample_mean, sample_var = np.mean(lastGen_nCFUs), np.var(lastGen_nCFUs)


myShape = sample_mean**2/sample_var
myScale = sample_var/sample_mean
print myShape,myScale
#fittedGamma = stats.gamma(a=myShape, loc=0, scale=myScale)


print "Sample Mean: %.2f  Gamma Mean: %.2f" % (sample_mean,gamma_mean)
print "Sample Var: %.2f  Gamma Var: %.2f" % (sample_var,gamma_var)









    



0.26751498967 137.69695689
Sample Mean: 36.84  Gamma Mean: 15.73
Sample Var: 5072.21  Gamma Var: 953.08

Trying to reproduce Figure 8

Monte Carlo simulation vs $\Gamma$-distribution



In [32]:

    
nGens = 20
p0 = 0.4
nTrees_fig8 = 71
fig8_cfu_20 = np.zeros((nTrees_fig8))
for i in range(nTrees_fig8):
    tree = simulateTree(p0, nGens)
    nCFU = map(lambda x: np.sum(x==0), tree)
    fig8_cfu_20[i] = nCFU[-1]
    
xTmp, yTmp = ecdf(fig8_cfu_20)
plot(xTmp, yTmp*nTrees_fig8)

#fit by gamma
shape, loc, scale = stats.gamma.fit(fig8_cfu_20)
fittedGamma_synthData = stats.gamma(a=shape, loc=loc, scale=scale)
fittedCDF  =  fittedGamma_synthData.cdf(xTmp[:-1])
plot(xTmp[:-1], fittedCDF*nTrees_fig8)









    Out[32]:





[<matplotlib.lines.Line2D at 0x7f99b21e1dd0>]

Exercise: Fitting the Birth Death model to the data

no likelihood available (at least not so simple -> branching processes)
simulate data, compare to observations
- comparing the means/variance
- histogram distance



In [33]:

    
"Simulate a bunch of trees and get the #CFUs over time"
def sim_cfu_over_time(p0, nTrees, nGens):
    cfus_over_time = np.zeros((nTrees, nGens+1))
    for i in range(nTrees):
        tree = simulateTree(p0, nGens)
        nCFU = map(lambda x: np.sum(x==0), tree)
        cfus_over_time[i,:] = nCFU
    return cfus_over_time

"calculate a KS test between observed and simulated dists"
def distance_function(obs, sim):
    K,pval = stats.ks_2samp(obs,sim)
    return K, pval
    #return np.abs(np.mean(obs)- np.mean(sim))

p0_grid = np.linspace(0, 1, 50)
distances, pvals = np.zeros(p0_grid.shape), np.zeros(p0_grid.shape)
observedData = trueCFUs_all # experimental data after 10 days/20divisions

for i, p0 in enumerate(p0_grid):
    simData = sim_cfu_over_time(p0, nTrees = 100, nGens= 20)
    simData_lastGen = simData[:,-1]
    distances[i], pvals[i] = distance_function(obs= observedData, sim=simData_lastGen)



In [34]:

    
def plot_fitted_stoch_model(p0_array, K_array, pval_array, observedData):
    bestP0 = p0_array[argmin(K_array)]
    plt.subplot(2,2,1)
    plt.plot(p0_array, K_array)
    plt.vlines(bestP0,0,1)
    xlabel('p0'), ylabel('KS distance')

    plt.subplot(2,2,2)
    plt.plot(p0_array, log10(pvals))
    xlabel('p0'), ylabel('KS pval')
    bestpval = pval_array[argmin(K_array)]
    title(bestpval)

    plt.subplot(2,2,3)
    simData = sim_cfu_over_time(p0=0.4, nTrees = 100, nGens= 20)

    x,y =ecdf(observedData)
    plt.plot(x, y)

    x,y =ecdf(simData[:,-1])
    plt.plot(x, y)
    xlabel('#CFU'), ylabel('CDF')
    legend(['observed','simulated'])

    binning = np.arange(0,300,step=5)
    plt.subplot(2,2,4)
    plt.hist(observedData, bins=binning, histtype="stepfilled", alpha=.5)
    plt.hist(simData[:,-1], bins=binning, histtype="stepfilled", alpha=.5)
    xlabel('#CFU'), ylabel('PDF')



In [35]:

    
plot_fitted_stoch_model(p0_grid, distances, pvals, observedData)

Discussion

very heterogeneous data explained (kind of) by a simple stochastic model

How is that compatible with ordered hematopoiesis?

e.g. you could run out of stem cells due to bad luck
regulation handled on a statistical level
individual cells are random but a pattern on population level emerges

Current area of research

regulation via cell-cell communication
randomness due to gene expression noise Simons, B. D. & Clevers, H. Strategies for homeostatic stem cell self-renewal in adult tissues. Cell 145, 851–62 (2011).