About

This notebook shows the results of toymc, the toy Monte Carlo generation module.

We look at one- and two-variable distributions of the original data set and the toy data set (generated with the toymc module), and compare the means, standard deviations, Pearson correlations and so on.
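
A minimal sketch of this kind of comparison (compare_stats is a hypothetical helper written here for illustration, not part of the toymc API; original and toy stand for two pandas DataFrames with identical columns):

import pandas

def compare_stats(original, toy):
    # per-column means and standard deviations of both samples
    summary = pandas.DataFrame({'mean orig': original.mean(), 'mean toy': toy.mean(),
                                'std orig': original.std(), 'std toy': toy.std()})
    summary['mean difference'] = summary['mean toy'] - summary['mean orig']
    # relative difference in percent, as in the tables below
    summary['mean error, %'] = 100 * abs(summary['mean difference']) / abs(summary['mean orig'])
    # element-wise difference of the Pearson correlation matrices
    correlation_difference = toy.corr() - original.corr()
    return summary, correlation_difference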


In [1]:
import numpy, pandas
import pylab  # needed for the plots in the parameter-dependence study below
from IPython.display import display_html
from hep_ml import toymc

Tests with real Monte Carlo

We generate toy Monte Carlo for the decay $\tau \to \mu\mu\mu$. The data set contains 35000 events of both types and 38 columns (the latter is quite a lot for a simple bin-based toy MC generation procedure). The discrete columns passed as clustering_features keep their original values in the generated sample, which is reflected in the zero differences for isolationd, isolatione, isolationf and is_signal in the tables below.


In [2]:
data = pandas.read_csv("../hep_ml/datasets/tau_into_muons_toymc.csv", sep='\t')

toymc.compare_toymc(data, clustering_features=["isolationd", "isolatione", "isolationf", "is_signal"])


Generating ...
Copied 186 events in 147 groups from original file. Totally generated 35550 rows 

Means and std

mean orig mean toy difference error, % std orig std toy difference error, %
FlightDistance 14.488601 14.430099 -0.058502 0.403780 12.348747 11.039740 -1.309007 10.600319
FlightDistanceError 0.479349 0.478603 -0.000746 0.155661 0.365756 0.331105 -0.034652 9.474030
IP 0.078939 0.079075 0.000136 0.172197 0.075194 0.070220 -0.004974 6.615294
IPSig 4.452422 4.455008 0.002586 0.058084 3.655748 3.411279 -0.244469 6.687243
VertexChi2 4.533389 4.532106 -0.001283 0.028293 3.166947 2.903718 -0.263228 8.311740
pt 4693.372912 4672.205382 -21.167530 0.451009 2710.222137 2673.681336 -36.540801 1.348259
p0_IPSig 14.496634 14.465133 -0.031501 0.217299 10.715816 9.840496 -0.875321 8.168494
p1_IPSig 15.248047 15.181170 -0.066877 0.438593 10.702280 9.694147 -1.008133 9.419796
p2_IPSig 15.022873 15.083960 0.061087 0.406627 10.626690 9.803458 -0.823233 7.746840
p0_IP 0.515437 0.515646 0.000209 0.040584 0.386842 0.354846 -0.031997 8.271275
p1_IP 0.346874 0.346083 -0.000791 0.227924 0.281502 0.259783 -0.021720 7.715581
p2_IP 0.432911 0.434889 0.001979 0.457027 0.361596 0.333587 -0.028009 7.746005
p0_Laura_IsoBDT -0.283503 -0.282674 0.000829 0.292368 0.127240 0.117640 -0.009600 7.544447
p1_Laura_IsoBDT -0.270808 -0.270496 0.000313 0.115531 0.123722 0.114308 -0.009413 7.608474
p2_Laura_IsoBDT -0.276756 -0.276194 0.000562 0.202951 0.125636 0.115640 -0.009996 7.956534
DOCAone 0.042840 0.042778 -0.000062 0.144398 0.044602 0.042148 -0.002454 5.501546
DOCAtwo 0.035964 0.036116 0.000152 0.422506 0.034255 0.031614 -0.002641 7.708745
DOCAthree 0.047778 0.047925 0.000147 0.307541 0.047121 0.043955 -0.003167 6.720034
p0_pt 1010.577514 1010.223654 -0.353859 0.035016 701.873723 700.920535 -0.953188 0.135806
p1_pt 2281.852736 2275.191225 -6.661510 0.291934 1566.544595 1554.929092 -11.615503 0.741473
p2_pt 1666.604915 1653.641058 -12.963857 0.777860 1362.322844 1324.758705 -37.564140 2.757360
CDF3 0.530630 0.530901 0.000271 0.051032 0.206617 0.192200 -0.014416 6.977212
CDF1 0.753848 0.754251 0.000404 0.053527 0.188076 0.171502 -0.016575 8.812695
CDF2 0.665112 0.665048 -0.000064 0.009586 0.199719 0.183945 -0.015774 7.897862
Laura_SumBDT -0.831346 -0.829700 0.001646 0.197991 0.338263 0.313920 -0.024343 7.196416
FlightD_FlightDE 34.590387 34.510965 -0.079421 0.229606 23.509432 21.418827 -2.090605 8.892623
LifeTime_N 0.336826 0.336558 -0.000268 0.079532 0.155832 0.143561 -0.012271 7.874403
dira_N 0.008678 0.008663 -0.000015 0.173402 0.011048 0.010469 -0.000578 5.235559
min_p_IP 0.223615 0.223673 0.000058 0.026041 0.142426 0.131277 -0.011149 7.828229
min_DOCA 0.016291 0.016396 0.000105 0.646741 0.015588 0.014734 -0.000855 5.482821
min_Laura -0.337495 -0.336896 0.000599 0.177429 0.128004 0.119003 -0.009001 7.031762
min_track 0.850864 0.850736 -0.000128 0.015077 0.154329 0.139973 -0.014356 9.302418
min_p_pt 803.009182 801.506548 -1.502634 0.187125 498.459385 493.480423 -4.978962 0.998870
mass 1781.470069 1781.801204 0.331135 0.018588 79.257738 77.832673 -1.425065 1.798014
isolationd 0.272293 0.272293 0.000000 0.000000 0.637554 0.637554 0.000000 0.000000
isolatione 0.264895 0.264895 0.000000 0.000000 0.634332 0.634332 0.000000 0.000000
isolationf 0.260113 0.260113 0.000000 0.000000 0.619372 0.619372 0.000000 0.000000
is_signal 0.867454 0.867454 0.000000 0.000000 0.339083 0.339083 0.000000 0.000000

Covariance

original toy difference error, %
(FlightDistance, FlightDistanceError) 0.588674 0.594066 0.005392 0.915936
(FlightDistance, IP) 0.186495 0.160818 -0.025676 13.767802
(FlightDistance, IPSig) 0.142919 0.122216 -0.020703 14.485981
(FlightDistance, VertexChi2) -0.037627 -0.043229 -0.005602 14.887674
(FlightDistance, pt) 0.241570 0.261615 0.020045 8.297710
(FlightDistance, p0_IPSig) 0.346552 0.347280 0.000728 0.210027
(FlightDistance, p1_IPSig) 0.293768 0.285871 -0.007896 2.687964
(FlightDistance, p2_IPSig) 0.319710 0.321113 0.001403 0.438894
(FlightDistance, p0_IP) 0.473343 0.470497 -0.002846 0.601243
(FlightDistance, p1_IP) 0.346893 0.326001 -0.020893 6.022811
(FlightDistance, p2_IP) 0.401770 0.394213 -0.007557 1.880994
(FlightDistance, p0_Laura_IsoBDT) -0.052232 -0.072357 -0.020125 38.530355
(FlightDistance, p1_Laura_IsoBDT) -0.046472 -0.064620 -0.018148 39.051880
(FlightDistance, p2_Laura_IsoBDT) -0.052950 -0.072485 -0.019535 36.894129
(FlightDistance, DOCAone) -0.000571 -0.007016 -0.006445 1129.031849
(FlightDistance, DOCAtwo) 0.018494 0.001907 -0.016587 89.686614
(FlightDistance, DOCAthree) 0.023132 0.015070 -0.008062 34.851457
(FlightDistance, p0_pt) 0.124053 0.137837 0.013783 11.110917
(FlightDistance, p1_pt) 0.207116 0.230328 0.023212 11.207035
(FlightDistance, p2_pt) 0.156776 0.161059 0.004283 2.731693
(FlightDistance, CDF3) 0.103241 0.125782 0.022541 21.833487
(FlightDistance, CDF1) 0.165104 0.180463 0.015359 9.302531
(FlightDistance, CDF2) 0.135743 0.155636 0.019893 14.654768
(FlightDistance, Laura_SumBDT) -0.055887 -0.076852 -0.020964 37.511994
(FlightDistance, FlightD_FlightDE) 0.489238 0.488984 -0.000254 0.051851
(FlightDistance, LifeTime_N) -0.496252 -0.487791 0.008461 1.704972
(FlightDistance, dira_N) -0.288705 -0.306295 -0.017590 6.092673
(FlightDistance, min_p_IP) 0.380737 0.361326 -0.019411 5.098300
(FlightDistance, min_DOCA) 0.005879 -0.005940 -0.011819 201.035353
(FlightDistance, min_Laura) -0.071903 -0.093203 -0.021300 29.622918
... ... ... ... ...
(min_DOCA, isolationf) 0.081142 0.089354 0.008212 10.120139
(min_DOCA, is_signal) -0.163607 -0.181191 -0.017584 10.747515
(min_Laura, min_track) 0.064698 0.069093 0.004395 6.792587
(min_Laura, min_p_pt) -0.021440 -0.024339 -0.002899 13.520975
(min_Laura, mass) 0.004027 0.006080 0.002052 50.951976
(min_Laura, isolationd) 0.436307 0.466411 0.030104 6.899667
(min_Laura, isolatione) 0.417401 0.447183 0.029782 7.135121
(min_Laura, isolationf) 0.424682 0.453863 0.029181 6.871259
(min_Laura, is_signal) -0.298852 -0.314272 -0.015420 5.159702
(min_track, min_p_pt) -0.018098 -0.022801 -0.004703 25.986646
(min_track, mass) 0.001267 -0.003255 -0.004522 357.003945
(min_track, isolationd) 0.052219 0.047448 -0.004772 9.137789
(min_track, isolatione) 0.061490 0.057459 -0.004031 6.555827
(min_track, isolationf) 0.045693 0.045630 -0.000063 0.138481
(min_track, is_signal) -0.052675 -0.064929 -0.012254 23.262951
(min_p_pt, mass) -0.018352 -0.023451 -0.005099 27.781483
(min_p_pt, isolationd) -0.097590 -0.094903 0.002687 2.753700
(min_p_pt, isolatione) -0.102372 -0.099186 0.003186 3.111880
(min_p_pt, isolationf) -0.104702 -0.107886 -0.003184 3.041170
(min_p_pt, is_signal) 0.132310 0.129809 -0.002501 1.890194
(mass, isolationd) 0.008389 0.019857 0.011467 136.691553
(mass, isolatione) 0.021730 0.037024 0.015294 70.380204
(mass, isolationf) 0.016335 0.027322 0.010987 67.257794
(mass, is_signal) -0.122742 -0.135925 -0.013183 10.740319
(isolationd, isolatione) 0.542932 0.542932 0.000000 0.000000
(isolationd, isolationf) 0.512898 0.512898 0.000000 0.000000
(isolationd, is_signal) -0.294191 -0.294191 0.000000 -0.000000
(isolatione, isolationf) 0.544315 0.544315 0.000000 0.000000
(isolatione, is_signal) -0.316461 -0.316461 0.000000 -0.000000
(isolationf, is_signal) -0.291629 -0.291629 0.000000 -0.000000

703 rows × 4 columns
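
Note that the toy means track the originals to within a percent, while the standard deviations shrink systematically by roughly 1-10%, and the clustering features show zero differences. compare_toymc only prints this comparison; to work with the generated sample itself one can call generate_toymc directly. A sketch reusing the call signature from In [5] below, passing the underlying numpy array (the parameter values here are illustrative, not recommended defaults):

toy_data, _ = toymc.generate_toymc(data.values, size=len(data), knn=4,
                                   symmetrize=True, power=2.0,
                                   reweighting_iterations=6)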

Gaussian distribution tests


In [3]:
def generate_gauss(size, n_features):
    # random symmetric matrix with unit diagonal, used as the covariance matrix
    rand = numpy.random.rand(n_features, n_features) * 2 - 1
    covar = rand + rand.T
    for i in range(n_features):
        covar[i, i] = 1.0
    mean = numpy.random.randn(n_features) * 2
    original_data = numpy.random.multivariate_normal(mean, covar, size)
    return pandas.DataFrame(original_data, columns=['column' + str(i) for i in range(n_features)])
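
A caveat: the symmetrized random matrix above is not guaranteed to be positive semi-definite, so numpy.random.multivariate_normal may warn about an invalid covariance matrix. A sketch of a guaranteed-valid alternative (random_covariance is a hypothetical helper, not used in this notebook):

def random_covariance(n_features):
    # A A^T is positive semi-definite for any real matrix A
    a = numpy.random.rand(n_features, n_features) * 2 - 1
    return numpy.dot(a, a.T)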

In [4]:
data = generate_gauss(20000, 10)
toymc.compare_toymc(data)


Means and std

mean orig mean toy difference error, % std orig std toy difference error, %
column0 3.721887 3.707922 -0.013965 0.375207 1.350264 1.317775 -0.032489 2.406133
column1 2.595101 2.599125 0.004024 0.155062 1.348756 1.326363 -0.022392 1.660221
column2 0.398480 0.427005 0.028525 7.158476 1.496726 1.477562 -0.019164 1.280389
column3 4.311618 4.311198 -0.000420 0.009731 1.366808 1.338900 -0.027908 2.041868
column4 1.134374 1.139357 0.004983 0.439311 1.510989 1.488779 -0.022210 1.469917
column5 0.577185 0.557482 -0.019703 3.413703 1.376852 1.354749 -0.022103 1.605359
column6 0.481421 0.482967 0.001546 0.321221 1.306622 1.278299 -0.028322 2.167600
column7 0.696555 0.718441 0.021886 3.142025 1.259722 1.238034 -0.021688 1.721635
column8 -1.305061 -1.293310 0.011752 0.900461 1.395227 1.380613 -0.014614 1.047412
column9 2.730072 2.735623 0.005551 0.203328 1.709496 1.686321 -0.023175 1.355666

Covariance

original toy difference error, %
(column0, column1) -0.026548 -0.022621 0.003927 14.791123
(column0, column2) -0.111947 -0.111997 -0.000049 0.043986
(column0, column3) 0.265485 0.259013 -0.006472 2.437931
(column0, column4) 0.250631 0.258581 0.007950 3.171997
(column0, column5) 0.089024 0.087902 -0.001123 1.261052
(column0, column6) 0.093367 0.100941 0.007574 8.112293
(column0, column7) -0.427630 -0.436360 -0.008730 2.041586
(column0, column8) 0.057408 0.055272 -0.002136 3.721215
(column0, column9) 0.040959 0.050633 0.009674 23.619109
(column1, column2) 0.411367 0.418032 0.006665 1.620233
(column1, column3) -0.219887 -0.222385 -0.002498 1.135949
(column1, column4) 0.255001 0.254035 -0.000966 0.378950
(column1, column5) 0.294034 0.295860 0.001826 0.620934
(column1, column6) -0.172782 -0.184935 -0.012153 7.033661
(column1, column7) -0.244542 -0.248490 -0.003948 1.614594
(column1, column8) -0.017463 -0.018793 -0.001331 7.619909
(column1, column9) -0.038736 -0.028096 0.010640 27.469046
(column2, column3) -0.343538 -0.341605 0.001933 0.562701
(column2, column4) -0.073787 -0.075841 -0.002054 2.783176
(column2, column5) -0.344017 -0.345252 -0.001235 0.358892
(column2, column6) 0.043747 0.038370 -0.005377 12.290120
(column2, column7) 0.242342 0.237596 -0.004746 1.958540
(column2, column8) 0.208449 0.214867 0.006418 3.079103
(column2, column9) 0.082292 0.076680 -0.005612 6.819552
(column3, column4) 0.066250 0.071855 0.005605 8.460069
(column3, column5) -0.117375 -0.117564 -0.000189 0.161260
(column3, column6) -0.586752 -0.597226 -0.010474 1.785109
(column3, column7) -0.007660 -0.001907 0.005752 75.101735
(column3, column8) 0.264198 0.270808 0.006610 2.501908
(column3, column9) 0.033138 0.032712 -0.000427 1.287719
(column4, column5) -0.220074 -0.214030 0.006043 2.746028
(column4, column6) -0.014715 -0.007891 0.006824 46.371958
(column4, column7) 0.022625 0.029556 0.006931 30.635367
(column4, column8) 0.129984 0.127225 -0.002759 2.122357
(column4, column9) -0.049505 -0.031200 0.018305 36.975538
(column5, column6) -0.041264 -0.046057 -0.004793 11.614784
(column5, column7) -0.166705 -0.170011 -0.003306 1.982879
(column5, column8) 0.003759 -0.005191 -0.008950 238.109561
(column5, column9) 0.320572 0.328604 0.008032 2.505555
(column6, column7) 0.047011 0.045381 -0.001631 3.468653
(column6, column8) -0.278791 -0.288105 -0.009314 3.341019
(column6, column9) 0.193594 0.189588 -0.004006 2.069326
(column7, column8) 0.401420 0.419594 0.018175 4.527561
(column7, column9) 0.125114 0.122112 -0.003002 2.399252
(column8, column9) -0.170308 -0.178393 -0.008086 4.747656

Detailed statistics of toy Monte Carlo on a Gaussian distribution with an identity covariance matrix.

In this part we demonstrate how the shrinkage of the covariance (which happens when we generate toy MC) depends on different parameters.

The smaller the diag value, the better!
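
Concretely, the diag value reported below is the mean absolute difference between the diagonals of the original and toy covariance matrices:

$$\mathrm{diag} = \frac{1}{d} \sum_{i=1}^{d} \left| C^{\mathrm{orig}}_{ii} - C^{\mathrm{toy}}_{ii} \right|,$$

where $d$ is the number of variables. Since the original data here is standard normal ($C^{\mathrm{orig}} \approx I$), diag directly measures the average distortion (mostly shrinkage) of the variances.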


In [5]:
def test_toymc_iteration(size, variables, knn, symmetrize, power, reweighting_iterations):
    # standard normal data, so the true covariance matrix is the identity
    data = numpy.random.randn(size, variables)
    toyData, _ = toymc.generate_toymc(data, size=size, knn=knn, symmetrize=symmetrize, power=power,
                                      reweighting_iterations=reweighting_iterations)
    covar = numpy.cov(data.T)
    covar2 = numpy.cov(toyData.T)

    # mean absolute deviation between the diagonals of the two covariance matrices
    covardiff = abs(covar - covar2)
    diagDeviation = numpy.trace(covardiff) * 1.0 / variables
    print "size={0:6} vars={1:2} knn={2:2} symmetrize={3:5} power={4:4} diag={5:5f}"\
        .format(size, variables, knn, str(symmetrize), power, diagDeviation)
    return diagDeviation


def print_toymc_parameter_dependence():
    # dependence on the size of the data set
    sizes = [int(100 * (1.7 ** i)) for i in range(10)]
    data = [test_toymc_iteration(size, 15, 4, True, 2.0, 6) for size in sizes]
    display_html("<h3>Size sensitivity</h3>", raw=True)
    pylab.plot(sizes, data)
    pylab.show()

    # dependence on the number of variables, for several knn values
    variables = [5 + 2 * x for x in range(14)]
    for knn in [5, 10, 20]:
        data = [test_toymc_iteration(4000, varn, knn, True, 2.0, 6) for varn in variables]
        display_html("<h3>Sensitivity to number of variables, knn=%i</h3>" % knn, raw=True)
        pylab.plot(variables, data)
        pylab.show()

    # dependence on the power parameter
    powers = [1.0, 1.3, 2.2, 2.7, 4, 6]
    data = [test_toymc_iteration(2500, 15, 4, True, power, 6) for power in powers]
    display_html("<h3>Power sensitivity</h3>", raw=True)
    pylab.plot(powers, data)
    pylab.show()

#     print "\nRaw data as a table"
#     for size in [800, 2000]:
#         for variables in [10, 15, 20]:
#             for knn in [5, 10, 20]:
#                 for symmetrize in [False, True]:
#                     test_toymc_iteration(size, variables, knn, symmetrize, 2.0, 6)


print_toymc_parameter_dependence()


size=   100 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.144924
size=   170 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.106645
size=   288 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.075534
size=   491 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.076792
size=   835 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.069808
size=  1419 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.067595
size=  2413 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.061043
size=  4103 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.062045
size=  6975 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.051242
size= 11858 vars=15 knn= 4 symmetrize=True  power= 2.0 diag=0.047425

Size sensitivity

size=  4000 vars= 5 knn= 5 symmetrize=True  power= 2.0 diag=0.017347
size=  4000 vars= 7 knn= 5 symmetrize=True  power= 2.0 diag=0.028906
size=  4000 vars= 9 knn= 5 symmetrize=True  power= 2.0 diag=0.030389
size=  4000 vars=11 knn= 5 symmetrize=True  power= 2.0 diag=0.038106
size=  4000 vars=13 knn= 5 symmetrize=True  power= 2.0 diag=0.052025
size=  4000 vars=15 knn= 5 symmetrize=True  power= 2.0 diag=0.065777
size=  4000 vars=17 knn= 5 symmetrize=True  power= 2.0 diag=0.071164
size=  4000 vars=19 knn= 5 symmetrize=True  power= 2.0 diag=0.071328
size=  4000 vars=21 knn= 5 symmetrize=True  power= 2.0 diag=0.067564
size=  4000 vars=23 knn= 5 symmetrize=True  power= 2.0 diag=0.074470
size=  4000 vars=25 knn= 5 symmetrize=True  power= 2.0 diag=0.085850
size=  4000 vars=27 knn= 5 symmetrize=True  power= 2.0 diag=0.083612
size=  4000 vars=29 knn= 5 symmetrize=True  power= 2.0 diag=0.081972
size=  4000 vars=31 knn= 5 symmetrize=True  power= 2.0 diag=0.088030

Sensitivity to number of variables, knn=5

size=  4000 vars= 5 knn=10 symmetrize=True  power= 2.0 diag=0.036195
size=  4000 vars= 7 knn=10 symmetrize=True  power= 2.0 diag=0.027791
size=  4000 vars= 9 knn=10 symmetrize=True  power= 2.0 diag=0.046222
size=  4000 vars=11 knn=10 symmetrize=True  power= 2.0 diag=0.052166
size=  4000 vars=13 knn=10 symmetrize=True  power= 2.0 diag=0.069693
size=  4000 vars=15 knn=10 symmetrize=True  power= 2.0 diag=0.084689
size=  4000 vars=17 knn=10 symmetrize=True  power= 2.0 diag=0.081493
size=  4000 vars=19 knn=10 symmetrize=True  power= 2.0 diag=0.082043
size=  4000 vars=21 knn=10 symmetrize=True  power= 2.0 diag=0.095138
size=  4000 vars=23 knn=10 symmetrize=True  power= 2.0 diag=0.090205
size=  4000 vars=25 knn=10 symmetrize=True  power= 2.0 diag=0.102006
size=  4000 vars=27 knn=10 symmetrize=True  power= 2.0 diag=0.100463
size=  4000 vars=29 knn=10 symmetrize=True  power= 2.0 diag=0.103545
size=  4000 vars=31 knn=10 symmetrize=True  power= 2.0 diag=0.112866

Sensitivity to number of variables, knn=10

size=  4000 vars= 5 knn=20 symmetrize=True  power= 2.0 diag=0.034331
size=  4000 vars= 7 knn=20 symmetrize=True  power= 2.0 diag=0.040743
size=  4000 vars= 9 knn=20 symmetrize=True  power= 2.0 diag=0.060922
size=  4000 vars=11 knn=20 symmetrize=True  power= 2.0 diag=0.075118
size=  4000 vars=13 knn=20 symmetrize=True  power= 2.0 diag=0.080022
size=  4000 vars=15 knn=20 symmetrize=True  power= 2.0 diag=0.087210
size=  4000 vars=17 knn=20 symmetrize=True  power= 2.0 diag=0.090066
size=  4000 vars=19 knn=20 symmetrize=True  power= 2.0 diag=0.100165
size=  4000 vars=21 knn=20 symmetrize=True  power= 2.0 diag=0.109141
size=  4000 vars=23 knn=20 symmetrize=True  power= 2.0 diag=0.113134
size=  4000 vars=25 knn=20 symmetrize=True  power= 2.0 diag=0.113180
size=  4000 vars=27 knn=20 symmetrize=True  power= 2.0 diag=0.115811
size=  4000 vars=29 knn=20 symmetrize=True  power= 2.0 diag=0.124802
size=  4000 vars=31 knn=20 symmetrize=True  power= 2.0 diag=0.116951

Sensitivity to number of variables, knn=20

size=  2500 vars=15 knn= 4 symmetrize=True  power= 1.0 diag=0.081709
size=  2500 vars=15 knn= 4 symmetrize=True  power= 1.3 diag=0.072864
size=  2500 vars=15 knn= 4 symmetrize=True  power= 2.2 diag=0.057633
size=  2500 vars=15 knn= 4 symmetrize=True  power= 2.7 diag=0.040848
size=  2500 vars=15 knn= 4 symmetrize=True  power=   4 diag=0.049937
size=  2500 vars=15 knn= 4 symmetrize=True  power=   6 diag=0.028052

Power sensitivity