Flux sampling

Basic usage

The easiest way to get started with flux sampling is using the sample function in the flux_analysis submodule. sample takes at least two arguments: a cobra model and the number of samples you want to generate.


In [1]:
from cobra.test import create_test_model
from cobra.flux_analysis import sample

model = create_test_model("textbook")
s = sample(model, 100)
s.head()


Out[1]:
ACALD ACALDt ACKr ACONTa ACONTb ACt2r ADK1 AKGDH AKGt2r ALCD2x ... RPI SUCCt2_2 SUCCt3 SUCDi SUCOAS TALA THD2 TKT1 TKT2 TPI
0 -3.706944 -0.163964 -0.295823 8.975852 8.975852 -0.295823 4.847986 6.406533 -0.081797 -3.542980 ... -1.649393 20.917568 20.977290 744.206008 -6.406533 1.639515 1.670533 1.639515 1.635542 6.256787
1 -1.340710 -0.175665 -0.429169 11.047827 11.047827 -0.429169 2.901598 7.992916 -0.230564 -1.165045 ... -0.066975 24.735567 24.850041 710.481004 -7.992916 0.056442 9.680476 0.056442 0.052207 7.184752
2 -1.964087 -0.160334 -0.618029 9.811474 9.811474 -0.618029 17.513791 8.635576 -0.284992 -1.803753 ... -4.075515 23.425719 23.470968 696.114154 -8.635576 4.063291 52.316496 4.063291 4.058376 5.122237
3 -0.838442 -0.123865 -0.376067 11.869552 11.869552 -0.376067 7.769872 9.765178 -0.325219 -0.714577 ... -0.838094 23.446704 23.913036 595.787313 -9.765178 0.822987 36.019720 0.822987 0.816912 8.364314
4 -0.232088 -0.034346 -1.067684 7.972039 7.972039 -1.067684 5.114975 5.438125 -0.787864 -0.197742 ... -3.109205 8.902309 9.888083 584.552692 -5.438125 3.088152 12.621811 3.088152 3.079686 6.185089

5 rows × 95 columns

By default sample uses the optgp method, based on the approach presented in Megchelenbrink et al. (2014), as it is well suited for larger models and can run in parallel. The sampler uses a single process by default; this can be changed with the processes argument.


In [2]:
print("One process:")
%time s = sample(model, 1000)
print("Two processes:")
%time s = sample(model, 1000, processes=2)


One process:
CPU times: user 5.31 s, sys: 433 ms, total: 5.74 s
Wall time: 5.27 s
Two processes:
CPU times: user 217 ms, sys: 488 ms, total: 705 ms
Wall time: 2.8 s

Alternatively you can also use Artificial Centering Hit-and-Run for sampling by setting the method to achr. achr does not support parallel execution but has good convergence and is almost Markovian (each new sample depends only on the current sample).


In [3]:
s = sample(model, 100, method="achr")

In general, setting up the sampler is expensive since initial search directions are generated by solving many linear programming problems. Thus, we recommend generating as many samples as possible in one go. However, this might require finer control over the sampling procedure, as described in the following section.

Advanced usage

Sampler objects

The sampling process can be controlled on a lower level by using the sampler classes directly.


In [4]:
from cobra.flux_analysis.sampling import OptGPSampler, ACHRSampler

Both sampler classes have standardized interfaces and take some additional arguments, for instance the thinning factor. "Thinning" means only recording samples every n iterations. A higher thinning factor means less correlated samples but also longer computation times. By default the samplers use a thinning factor of 100, which creates roughly uncorrelated samples. If you want fewer samples but better mixing, feel free to increase this parameter. If you want to study convergence for your own model, you might want to set it to 1 to obtain all iterates (a small sketch of this follows the next cell).


In [5]:
achr = ACHRSampler(model, thinning=10)
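
As a small sketch of such a convergence check with thinning=1 (the sampler name raw and the choice of the PFK reaction are ours, purely for illustration):

raw = ACHRSampler(model, thinning=1)
iterates = raw.sample(1000)
# with thinning=1 successive iterates are highly correlated,
# so the lag-1 autocorrelation of any flux should be close to 1
print(iterates.PFK.autocorr(lag=1))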

OptGPSampler has an additional processes argument specifying how many processes are used to create parallel sampling chains. This should be on the order of your number of CPU cores for maximum efficiency. As noted before, class initialization can take up to a few minutes due to the generation of initial search directions. Sampling, on the other hand, is quick.


In [6]:
optgp = OptGPSampler(model, processes=4)
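
If you do not know the core count of your machine in advance, one option is to query it via the standard library (a sketch; not part of the original example):

import multiprocessing

# match the number of parallel chains to the available CPU cores
optgp = OptGPSampler(model, processes=multiprocessing.cpu_count())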

Sampling and validation

Both samplers have a sample function that generates samples from the initialized object and acts like the sample function described above, except that it only accepts a single argument, the number of samples. For OptGPSampler the number of samples should be a multiple of the number of processes; otherwise it will be increased to the nearest multiple automatically.


In [7]:
s1 = achr.sample(100)

s2 = optgp.sample(100)
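
To illustrate the automatic rounding (a sketch, assuming the optgp sampler from above with processes=4):

# 101 is not a multiple of 4, so the sampler should round the
# number of samples up to the nearest multiple, i.e. 104
print(len(optgp.sample(101)))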

You can call sample repeatedly, and both samplers are optimized to generate large amounts of samples without falling into "numerical traps". All sampler objects have a validate function that checks whether a set of points is feasible and reports any feasibility violations as a short code composed of any of the following letters:

  • "v" - valid point
  • "l" - lower bound violation
  • "u" - upper bound violation
  • "e" - equality violation (meaning the point is not a steady state)

For instance, for a random flux distribution (which should not be feasible):


In [8]:
import numpy as np

bad = np.random.uniform(-1000, 1000, size=len(model.reactions))
achr.validate(np.atleast_2d(bad))


Out[8]:
array(['le'], 
      dtype='<U3')

And for our generated samples:


In [9]:
achr.validate(s1)


Out[9]:
array(['v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v',
       'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v',
       'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v',
       'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v',
       'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v',
       'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v',
       'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v',
       'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v', 'v'], 
      dtype='<U3')

Batch sampling

Sampler objects are made for generating billions of samples; however, using the sample function might quickly fill up your RAM when working with genome-scale models. Here, the batch method of the sampler objects comes in handy. batch takes two arguments: the number of samples in each batch and the number of batches. A small example will make this clearer.

Let's assume we want to quantify what proportion of our samples will grow. For that we might want to generate 10 batches of 100 samples each and measure what percentage of the samples in each batch show a growth rate larger than 0.1. Finally, we want to calculate the mean and standard deviation of those individual percentages.


In [10]:
counts = [np.mean(s.Biomass_Ecoli_core > 0.1) for s in optgp.batch(100, 10)]
print("Usually {:.2f}% +- {:.2f}% grow...".format(
    np.mean(counts) * 100.0, np.std(counts) * 100.0))


Usually 8.70% +- 2.72% grow...

Adding constraints

Flux sampling will respect additional constraints defined in the model. For instance, we can add a constraint enforcing growth in a similar manner as in the section before.


In [11]:
co = model.problem.Constraint(model.reactions.Biomass_Ecoli_core.flux_expression, lb=0.1)
model.add_cons_vars([co])

Note that this is only for demonstration purposes. Usually you could set the lower bound of the reaction directly instead of creating a new constraint.
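
For reference, setting the bound directly would look like this (a sketch with the same effect, shown only for comparison):

# equivalent effect without creating an extra constraint
model.reactions.Biomass_Ecoli_core.lower_bound = 0.1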


In [12]:
s = sample(model, 10)
print(s.Biomass_Ecoli_core)


0    0.175547
1    0.111499
2    0.123073
3    0.151874
4    0.122541
5    0.121878
6    0.147333
7    0.106499
8    0.174448
9    0.143273
Name: Biomass_Ecoli_core, dtype: float64

As we can see, our new constraint was respected.