David E. Culler
This notebook illustrates the use of datascience Tables to perform regression on multiple variables. In doing so, it shows some of the elegant ways that computational and statistical concepts naturally fit together.
In [3]:
# HIDDEN
from datascience import *
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In [4]:
# Source https://onlinecourses.science.psu.edu/stat501/node/380
births=Table.read_table("data/birthsmokers.csv")
births
Out[4]:
In [5]:
# First, let's just look at what is here. This needs to be a scatter, rather
# than a plot, because there is no simple ordering - just relationships between
# birthweight and gestation time, along with whether the mother smokes.
births.scatter('Gest')
In [6]:
births.num_rows
Out[6]:
In [7]:
# How many of the samples are in the smoking category?
births.where('Smoke').num_rows
Out[7]:
In [8]:
# As there is a trend between birthweight and gestation, we can show a fit line
# to try to capture it
births.drop('Smoke').scatter('Gest', fit_line=True)
In [9]:
nosmoke=births.where('Smoke',0).drop('Smoke')
smoke=births.where('Smoke',1).drop('Smoke')
In [10]:
# we can attempt to find the trend for each
nosmoke.scatter('Gest',fit_line=True)
In [11]:
smoke.scatter('Gest',fit_line=True, marker='x')
Selecting a column of a Table yields a numpy array, allowing the numpy numerical tools to be used for fitting curves to the data. For documentation on the polyfit function, try help(np.polyfit). It returns the polynomial coefficients, highest power first; for a line, that means the slope followed by the intercept.
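A quick sanity check of the coefficient order, using synthetic points on the line y = 2x + 1 rather than the births data:
In [ ]:
# Made-up points lying exactly on y = 2x + 1; polyfit should recover
# approximately [2, 1]: the slope first, then the intercept
np.polyfit(np.array([0, 1, 2, 3]), np.array([1, 3, 5, 7]), 1)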
A linear model is only meaningful in the range around the normal gestation period; indeed, the intercept (the predicted weight at zero weeks of gestation) is not physically meaningful.
In [12]:
# What are the coefficients of the line fitted to these data?
np.polyfit(nosmoke['Gest'],nosmoke['Wgt'],1)
Out[12]:
In [13]:
np.polyfit(smoke['Gest'],smoke['Wgt'],1)
Out[13]:
We see that both the constant term and the weight increase per week are lower for the smokers.
At this point, we could do mx+b all over the place, or we could utilize higher-order functions to capture the concept of a model. Here is an example, building such a model directly from the data. It takes the set of x and y values and returns a function that evaluates the model built from that data at a particular x.
In [14]:
# Build a linear model from the data and return it as a function
def make_lm(x, y):
    m, b = np.polyfit(x, y, 1)
    def lm(a):
        return m*a + b
    return lm
In [15]:
# Create a model of non-smokers that returns estimated weight as a function of weeks of gestation
nosmoker_weight = make_lm(nosmoke['Gest'], nosmoke['Wgt'])
In [16]:
# Evaluate it at normal gestations
nosmoker_weight(40)
Out[16]:
In [17]:
smoker_weight = make_lm(smoke['Gest'], smoke['Wgt'])
In [18]:
smoker_weight(40)
Out[18]:
In [19]:
# Based on this data set, fitting the data to models of weight as a function of gestation,
# we might conclude that at 40 weeks the effect of smoking on birthweight in grams is
smoke_diff = nosmoker_weight(40) - smoker_weight(40)
smoke_diff
Out[19]:
In [20]:
# Or in relative terms
"{0:.1f}%".format(100*(nosmoker_weight(40)-smoker_weight(40))/nosmoker_weight(40))
Out[20]:
In [21]:
# Create a table with a column containing the independent variable
estimated_birthweight = Table().with_column('week', np.arange(32,44))
# Add columns of dependent variables
estimated_birthweight['nosmoke'] = estimated_birthweight.apply(nosmoker_weight,'week')
estimated_birthweight['smoke'] = estimated_birthweight.apply(smoker_weight,'week')
estimated_birthweight
Out[21]:
In [22]:
# plot it to visualize
estimated_birthweight.plot('week',overlay=True)
At this point, we might ask how accurately these linear models fit the data. The residuals of the fit would give us some idea of this. But the error in summarizing a collection of empirical data with an analytical model is only part of the question.
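As a concrete sketch of that idea (an aside, not part of the original analysis), we can compute the residuals for the non-smoker model:
In [ ]:
# Residuals: observed weights minus the model's predictions at each gestation time
residuals = nosmoke['Wgt'] - nosmoke.apply(nosmoker_weight, 'Gest')
# The root-mean-square residual summarizes the typical size of the error, in grams
np.sqrt(np.mean(residuals**2))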
The deeper point is that this data is not "truth"; it is merely a sample of a population, and a tiny one at that. We should be asking: is an inference drawn from this data valid for the population that the data is intended to represent? Of course, the population is not directly observable, only the sample of it. How can we use the sample we have to get some idea of how representative it is of the larger population? That is what the bootstrap seeks to accomplish.
Tables provide a method, sample, for just this purpose. Here we return to looking at the coefficients, rather than building a function for the model.
In [23]:
# Construct a new model by forming a new sample from our existing one and fitting a line to that.
# Here we create quite a general function, which takes a table and the column names over which
# the model is to be formed.
def rboot(table, x, y):
    sample = table.sample(table.num_rows, with_replacement=True)
    return np.polyfit(sample[x], sample[y], 1)
In [24]:
# Try it out for non-smokers. Note that every time this cell is evaluated (ctrl-enter)
# the result is a little different, since a new sample is drawn.
rboot(nosmoke, 'Gest', 'Wgt')
Out[24]:
In [25]:
# And for smokers
rboot(smoke,'Gest','Wgt')
Out[25]:
Using this model builder as a function, we draw many samples and form a model for each. Then we can look at the distribution of the model parameters over lots of models. This illustrates the construction of tables by rows: the Table constructor accepts the column names, and with_rows fills them in row by row. hist forms and shows a histogram of the result.
In [26]:
# Bootstrap a distribution of models by drawing many random samples, with replacement, from our samples
num_samples = 1000
nosamples = Table(['slope','intercept']).with_rows([rboot(nosmoke,'Gest','Wgt') for i in range(num_samples)])
nosamples.hist(bins=50,normed=True, overlay=False)
And we repeat this for the other category.
In [27]:
smokesamples = Table(['slope','intercept']).with_rows([rboot(smoke,'Gest','Wgt') for i in range(num_samples)])
smokesamples.hist(bins=50,normed=True, overlay=False)
At this point we could compute a statistic over the sampling distributions of these parameters, such as the total variation distance or the mean.
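Here is a sketch of the first of these; the common binning (50 equal-width bins spanning both samples) is an assumption made for illustration:
In [ ]:
# Approximate total variation distance between the two bootstrapped slope
# distributions, computed from histograms over a shared set of bins
lo = min(nosamples['slope'].min(), smokesamples['slope'].min())
hi = max(nosamples['slope'].max(), smokesamples['slope'].max())
bins = np.linspace(lo, hi, 50)
width = bins[1] - bins[0]
p, _ = np.histogram(nosamples['slope'], bins=bins, density=True)
q, _ = np.histogram(smokesamples['slope'], bins=bins, density=True)
0.5 * np.sum(np.abs(p - q)) * width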
In [28]:
nosamples.stats([np.min,np.mean,np.max])
Out[28]:
In [29]:
smokesamples.stats([np.min,np.mean,np.max])
Out[29]:
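As another aside (a sketch, not in the original), np.percentile applied to the bootstrap samples yields approximate 95% confidence intervals for the slopes:
In [ ]:
# Middle 95% of the bootstrapped slopes for each category
np.percentile(nosamples['slope'], [2.5, 97.5]), np.percentile(smokesamples['slope'], [2.5, 97.5])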
In [30]:
# Evaluate every bootstrapped smoker model at 40 weeks, elementwise over the coefficient arrays
smokesamples['slope']*40+smokesamples['intercept']
Out[30]:
In [31]:
# So now we have an estimate of the distribution of birthweights at week 40 for
# something closer to the populations that these small samples represent.
weights_40 = Table().with_columns([
('nosmoke', nosamples['slope']*40+nosamples['intercept']),
('smoke', smokesamples['slope']*40+smokesamples['intercept'])])
weights_40
Out[31]:
In [32]:
weights_40['Smoke Wgt Loss'] = weights_40['nosmoke'] - weights_40['smoke']
In [33]:
# what do we expect the distribution of birthweight reduction due to smoking to look like
# for the population represented by the original sample?
weights_40.select('Smoke Wgt Loss').hist(bins=30,normed=True)
In [34]:
smoke_diff
Out[34]:
In [35]:
def firstQtile(x): return np.percentile(x, 25)
def thirdQtile(x): return np.percentile(x, 75)
summary_ops = (min, firstQtile, np.median, np.mean, thirdQtile, max)
In [36]:
summary = weights_40.stats(summary_ops)
summary
Out[36]:
In [37]:
summary['diff']=summary['nosmoke']-summary['smoke']
In [38]:
# the bottom line
summary
Out[38]:
In [39]:
weights_40.select(['smoke','nosmoke']).hist(overlay=True,bins=30,normed=True)
In [40]:
# As an example of what chance alone can produce, split the original data into two random halves
A, B = births.split(births.num_rows//2)
In [41]:
A
Out[41]:
In [42]:
B
Out[42]:
In [43]:
make_lm(A['Gest'], A['Wgt'])(40)
Out[43]:
In [44]:
make_lm(B['Gest'], B['Wgt'])(40)
Out[44]:
In [45]:
# Difference at w weeks between models fit to two random halves of the data -
# an estimate of the difference we would see under the null hypothesis
def null_diff_at(tbl, x, y, w):
    A, B = tbl.split(tbl.num_rows//2)
    return make_lm(A[x], A[y])(w) - make_lm(B[x], B[y])(w)
In [46]:
null_diff_at(births, 'Gest', 'Wgt', 40)
Out[46]:
In [47]:
null = Table().with_column('Diff', [null_diff_at(births, 'Gest', 'Wgt', 40) for i in range(1000)])
null.hist()
What is the probability that we got a 260g difference in birthweight at 40 weeks as an artifact of the sample?
Essentially zero: no split-half difference of that magnitude appears in the null distribution.
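We can check that claim directly by computing the empirical p-value, the fraction of null differences at least as large as the observed one:
In [ ]:
# Fraction of the 1000 split-half differences at least as large as smoke_diff (~260g)
np.count_nonzero(null['Diff'] >= smoke_diff) / null.num_rows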
In [48]:
null.stats()
Out[48]: