This notebook illustrates the new cobra sampling interface and gives some basic performance metrics.
In [1]:
%matplotlib inline
from cobra.test import create_test_model
from cobra.flux_analysis.sampling import OptGPSampler, ACHRSampler
import pandas as pd
mod = create_test_model("textbook") # 95 reactions
All samplers implement thinning (keeping only every n-th iterate) to reduce autocorrelation. We will start with unthinned versions (thinning=1) to illustrate the baseline performance.
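Conceptually, thinning just keeps every n-th point of the raw chain. A minimal sketch (not cobrapy's actual implementation):

def thin(chain, n):
    # Keep only every n-th iterate of the raw chain.
    return chain[::n]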
In [2]:
achr = ACHRSampler(mod, thinning=1)
In [3]:
optgp = OptGPSampler(mod, processes=1, thinning=1)
In [4]:
achr_samples = achr.sample(1000)
achr_samples = pd.DataFrame(achr_samples, columns=[r.id for r in mod.reactions])
In [5]:
optgp_samples = optgp.sample(1000)
optgp_samples = pd.DataFrame(optgp_samples, columns=[r.id for r in mod.reactions])
We can also validate our samples using the validate method of the sampler classes ("v" means valid; see the docstring for more info).
In [6]:
from collections import Counter
print(Counter(achr.validate(achr_samples)))
print(Counter(optgp.validate(optgp_samples)))
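Should any sample be flagged with a code other than "v", we could drop it with a boolean mask. A minimal sketch, assuming the string codes returned by validate:

# Keep only the rows that validate marks as fully valid ("v").
valid = achr.validate(achr_samples) == "v"
achr_samples = achr_samples[valid]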
ACHR should perform a bit better here, since it does not use delayed updates the way OptGP does.
In [7]:
from pandas.plotting import autocorrelation_plot as acp
import matplotlib.pyplot as plt
def normalize(df):
    # Standardize each column to zero mean and unit variance.
    return (df - df.mean()) / df.std()
acp(normalize(achr_samples))
Out[7]:
In [8]:
acp(normalize(optgp_samples))
Out[8]:
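To put a number on the visual difference, we can compare the mean absolute lag-1 autocorrelation across all reactions. A quick sketch using pandas' Series.autocorr:

def mean_lag1(df):
    # Mean absolute lag-1 autocorrelation over all reaction columns.
    return df.apply(lambda s: s.autocorr(lag=1)).abs().mean()

print(mean_lag1(normalize(achr_samples)))
print(mean_lag1(normalize(optgp_samples)))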
We see that the autocorrelation diminishes after a lag of about 100 iterations, which is why the default thinning factor is 100. Thus, using the default constructors we get:
In [9]:
achr = ACHRSampler(mod)
achr_samples = achr.sample(1000)
achr_samples = pd.DataFrame(achr_samples, columns=[r.id for r in mod.reactions])
acp(normalize(achr_samples))
Out[9]:
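As a quick numeric check (a sketch assuming the reaction id "PGK" from the textbook model), the lag-1 autocorrelation of the thinned chain should now be close to zero:

print(achr_samples["PGK"].autocorr(lag=1))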
In [10]:
optgp = OptGPSampler(mod, 6)  # 6 processes, one chain each
optgp_samples = optgp.sample(1000)
optgp_samples = pd.DataFrame(optgp_samples, columns=[r.id for r in mod.reactions])
acp(normalize(optgp_samples))
Out[10]:
OptGP samples one chain per process. This can be seen nicely in the autocorrelation function, which shows one peak per process. These peaks do not affect overall convergence since they drop off quickly, which is confirmed by plotting the trace: it shows fairly uniform sampling. Nevertheless, the preferred use case for OptGP is sampling large chains, i.e. drawing many flux distributions in each batch.
In [11]:
normalize(optgp_samples).plot(legend=False)
Out[11]:
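For large chains it can be more memory-friendly to draw samples in chunks. A sketch, assuming the batch method provided by the cobrapy samplers:

# Draw 10 batches of 100 samples each and process them one at a time.
total = 0
for b in optgp.batch(100, 10):
    total += len(b)
print(total)  # 1000 samples overall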
Sampling itself obviously takes time; however, there is also an initialization overhead, since the initial warmup points are generated by solving 2 * n_r LP problems (a minimization and a maximization for each of the n_r reactions).
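We can inspect that overhead directly. A sketch, assuming the samplers keep their warmup points in a warmup attribute:

# Roughly 2 * n_r warmup points, one per LP solved.
print(len(achr.warmup))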
In [12]:
print(mod.id)
print(len(mod.metabolites), len(mod.reactions), sep=", ")
In [13]:
achr = %time ACHRSampler(mod)
t = %timeit -r1 -n1 -o achr.sample(1000)
print("Time per iteration = {:.2f} μs".format(t.worst / achr.thinning / 1000 * 1e6))
print("Time per sample = {:.2f} ms".format(t.worst / 1000 * 1e3))
In [14]:
optgp = %time OptGPSampler(mod, 4)
t = %timeit -r1 -n1 -o optgp.sample(1000)
print("Time per iteration = {:.2f} μs".format(t.worst / achr.thinning / 1000 * 1e6))
print("Time per sample = {:.2f} ms".format(t.worst / 1000 * 1e3))
This gives a speed-up on the order of the number of processes.
In [15]:
ecoli = create_test_model("ecoli")
print(ecoli.id)
print(len(ecoli.metabolites), len(ecoli.reactions), sep=", ")
In [16]:
achr = %time ACHRSampler(ecoli)
t = %timeit -r1 -n1 -o achr.sample(100)
print("Time per iteration = {:.2f} μs".format(t.worst / achr.thinning / 100 * 1e6))
print("Time per sample = {:.2f} ms".format(t.worst / 100 * 1e3))
In [17]:
optgp = %time OptGPSampler(ecoli, 4)
t = %timeit -r1 -n1 -o optgp.sample(100)
print("Time per iteration = {:.2f} μs".format(t.worst / optgp.thinning / 100 * 1e6))
print("Time per sample = {:.2f} ms".format(t.worst / 100 * 1e3))