David E. Culler
This notebook illustrates the use of datascience Tables to perform regression on multiple variables. In doing so, it shows some of the elegant ways that computational and statistical concepts naturally fit together.
In [3]:
# HIDDEN
from datascience import *
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In [4]:
# Source https://onlinecourses.science.psu.edu/stat501/node/380
births=Table.read_table("data/birthsmokers.csv")
births
Out[4]:
In [5]:
# First, let's just look at what is here. This needs to be a scatter, rather
# than a plot, because there is no simple ordering - just relationships between
# birthweight and gestation time, along with whether the mother smokes.
births.scatter('Gest')
In [6]:
births.num_rows
Out[6]:
In [7]:
# How many of the samples are in the smoking category?
births.where('Smoke').num_rows
Out[7]:
In [8]:
# As there is a trend between birthweight and gestation, we can show a fit line
# to try to capture it
births.drop('Smoke').scatter('Gest', fit_line=True)
In [9]:
nosmoke=births.where('Smoke',0).drop('Smoke')
smoke=births.where('Smoke',1).drop('Smoke')
In [10]:
# we can attempt to find the trend for each
nosmoke.scatter('Gest',fit_line=True)
In [11]:
smoke.scatter('Gest',fit_line=True, marker='x')
Selecting a column of a Table yields a numpy array, allowing the numpy numerical tools to be used for fitting curves to the data. For documentation on the polyfit function, try help(np.polyfit). It returns the polynomial coefficients, highest power first; for a line, that means the slope followed by the intercept.
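A quick sanity check of the coefficient order, using synthetic points on the line y = 2x + 1 rather than the births data:
In [ ]:
# Made-up points lying exactly on y = 2x + 1; polyfit should recover
# approximately [2, 1]: the slope first, then the intercept
np.polyfit(np.array([0, 1, 2, 3]), np.array([1, 3, 5, 7]), 1)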
A linear model is only meaningful in the range around the normal gestation period; indeed, the intercept (the predicted weight at zero weeks of gestation) is not physically meaningful.
In [12]:
# What are the coefficients of the line fitted to these data?
np.polyfit(nosmoke['Gest'],nosmoke['Wgt'],1)
Out[12]:
In [13]:
np.polyfit(smoke['Gest'],smoke['Wgt'],1)
Out[13]:
We see that both the constant term and the weight increase per week are lower for the smokers.
At this point, we could do mx+b all over the place, or we could utilize higher-order functions to capture the concept of a model. Here is an example, building such a model directly from the data. It takes the set of x and y values and returns a function that evaluates the model built from that data at a particular x.
In [14]:
# Build a linear model from the data and return it as a function
def make_lm(x, y):
    m, b = np.polyfit(x, y, 1)
    def lm(a):
        return m*a + b
    return lm
In [15]:
# Create a model of non-smokers that returns estimated weight as a function of weeks of gestation
nosmoker_weight = make_lm(nosmoke['Gest'], nosmoke['Wgt'])
In [16]:
# Evaluate it at normal gestations
nosmoker_weight(40)
Out[16]:
In [17]:
smoker_weight = make_lm(smoke['Gest'], smoke['Wgt'])
In [18]:
smoker_weight(40)
Out[18]:
In [19]:
# Based on this data set, fitting the data to models of weight as a function of gestation,
# we might conclude that at 40 weeks the effect of smoking on birthweight in grams is
smoke_diff = nosmoker_weight(40) - smoker_weight(40)
smoke_diff
Out[19]:
In [20]:
# Or in relative terms
"{0:.1f}%".format(100*(nosmoker_weight(40)-smoker_weight(40))/nosmoker_weight(40))
Out[20]:
In [21]:
# Create a table with a column containing the independent variable
estimated_birthweight = Table().with_column('week', np.arange(32,44))
# Add columns of dependent variables
estimated_birthweight['nosmoke'] = estimated_birthweight.apply(nosmoker_weight,'week')
estimated_birthweight['smoke'] = estimated_birthweight.apply(smoker_weight,'week')
estimated_birthweight
Out[21]:
In [22]:
# plot it to visualize
estimated_birthweight.plot('week',overlay=True)
At this point, we might ask how accurately these linear models fit the data. The residuals of the fit would give us some idea of this. But the error in summarizing a collection of empirical data with an analytical model is only part of the question.
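As a concrete sketch of that idea (an aside, not part of the original analysis), we can compute the residuals for the non-smoker model:
In [ ]:
# Residuals: observed weights minus the model's predictions at each gestation time
residuals = nosmoke['Wgt'] - nosmoke.apply(nosmoker_weight, 'Gest')
# The root-mean-square residual summarizes the typical size of the error, in grams
np.sqrt(np.mean(residuals**2))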
The deeper point is that this data is not "truth"; it is merely a sample of a population, and a tiny one at that. We should be asking: is an inference drawn from this data valid for the population that the data is intended to represent? Of course, the population is not directly observable, only the sample of it. How can we use the sample we have to get some idea of how representative it is of the larger population? That is what the bootstrap seeks to accomplish.
Tables provide a method, sample, for just this purpose. Here we return to looking at the coefficients, rather than building a function for the model.
In [23]:
# Construct a new model by forming a new sample from our existing one and fitting a line to that.
# Here we create quite a general function, which takes a table and the column names over which
# the model is to be formed.
def rboot(table, x, y):
    sample = table.sample(table.num_rows, with_replacement=True)
    return np.polyfit(sample[x], sample[y], 1)
In [24]:
# Try it out for non-smokers. Note that every time this cell is evaluated (ctrl-enter)
# the result is a little different, since a new sample is drawn.
rboot(nosmoke, 'Gest', 'Wgt')
Out[24]:
In [25]:
# And for smokers
rboot(smoke,'Gest','Wgt')
Out[25]:
Using this model builder as a function, we draw many samples and form a model for each. Then we can look at the distribution of the model parameters over lots of models. This illustrates the construction of tables by rows: the Table constructor accepts the column names, and with_rows fills them in row by row. hist forms and shows a histogram of the result.
In [26]:
# Bootstrap a distribution of models by drawing many random samples, with replacement, from our samples
num_samples = 1000
nosamples = Table(['slope','intercept']).with_rows([rboot(nosmoke,'Gest','Wgt') for i in range(num_samples)])
nosamples.hist(bins=50,normed=True, overlay=False)
And we repeat this for the other category.
In [27]:
smokesamples = Table(['slope','intercept']).with_rows([rboot(smoke,'Gest','Wgt') for i in range(num_samples)])
smokesamples.hist(bins=50,normed=True, overlay=False)
At this point we could compute a statistic over the sampling distributions of these parameters, such as the total variation distance or the mean.
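Here is a sketch of the first of these; the common binning (50 equal-width bins spanning both samples) is an assumption made for illustration:
In [ ]:
# Approximate total variation distance between the two bootstrapped slope
# distributions, computed from histograms over a shared set of bins
lo = min(nosamples['slope'].min(), smokesamples['slope'].min())
hi = max(nosamples['slope'].max(), smokesamples['slope'].max())
bins = np.linspace(lo, hi, 50)
width = bins[1] - bins[0]
p, _ = np.histogram(nosamples['slope'], bins=bins, density=True)
q, _ = np.histogram(smokesamples['slope'], bins=bins, density=True)
0.5 * np.sum(np.abs(p - q)) * width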
In [28]:
nosamples.stats([np.min,np.mean,np.max])
Out[28]:
In [29]:
smokesamples.stats([np.min,np.mean,np.max])
Out[29]:
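As another aside (a sketch, not in the original), np.percentile applied to the bootstrap samples yields approximate 95% confidence intervals for the slopes:
In [ ]:
# Middle 95% of the bootstrapped slopes for each category
np.percentile(nosamples['slope'], [2.5, 97.5]), np.percentile(smokesamples['slope'], [2.5, 97.5])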
In [30]:
# Evaluate every bootstrapped smoker model at 40 weeks, elementwise over the coefficient arrays
smokesamples['slope']*40+smokesamples['intercept']
Out[30]:
In [31]:
# So now we have an estimate of the distribution of birthweights at week 40 for
# something closer to the populations that these small samples represent.
weights_40 = Table().with_columns([
('nosmoke', nosamples['slope']*40+nosamples['intercept']),
('smoke', smokesamples['slope']*40+smokesamples['intercept'])])
weights_40
Out[31]:
In [32]:
weights_40['Smoke Wgt Loss'] = weights_40['nosmoke'] - weights_40['smoke']
In [33]:
# what do we expect the distribution of birthweight reduction due to smoking to look like
# for the population represented by the original sample?
weights_40.select('Smoke Wgt Loss').hist(bins=30,normed=True)
In [34]:
smoke_diff
Out[34]:
In [35]:
def firstQtile(x): return np.percentile(x, 25)
def thirdQtile(x): return np.percentile(x, 75)
summary_ops = (min, firstQtile, np.median, np.mean, thirdQtile, max)
In [36]:
summary = weights_40.stats(summary_ops)
summary
Out[36]:
In [37]:
summary['diff']=summary['nosmoke']-summary['smoke']
In [38]:
# the bottom line
summary
Out[38]:
In [39]:
weights_40.select(['smoke','nosmoke']).hist(overlay=True,bins=30,normed=True)
In [40]:
# As an example of what chance alone can produce, split the original data into two random halves
A, B = births.split(births.num_rows//2)
In [41]:
A
Out[41]:
In [42]:
B
Out[42]:
In [43]:
make_lm(A['Gest'], A['Wgt'])(40)
Out[43]:
In [44]:
make_lm(B['Gest'], B['Wgt'])(40)
Out[44]:
In [45]:
# Difference at w weeks between models fit to two random halves of the data -
# an estimate of the difference we would see under the null hypothesis
def null_diff_at(tbl, x, y, w):
    A, B = tbl.split(tbl.num_rows//2)
    return make_lm(A[x], A[y])(w) - make_lm(B[x], B[y])(w)
In [46]:
null_diff_at(births, 'Gest', 'Wgt', 40)
Out[46]:
In [47]:
null = Table().with_column('Diff', [null_diff_at(births, 'Gest', 'Wgt', 40) for i in range(1000)])
null.hist()
What is the probability that we got a 260g difference in birthweight at 40 weeks as an artifact of the sample?
Essentially zero: no split-half difference of that magnitude appears in the null distribution.
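We can check that claim directly by computing the empirical p-value, the fraction of null differences at least as large as the observed one:
In [ ]:
# Fraction of the 1000 split-half differences at least as large as smoke_diff (~260g)
np.count_nonzero(null['Diff'] >= smoke_diff) / null.num_rows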
In [48]:
null.stats()
Out[48]: