In [1]:
from __future__ import print_function, division
import numpy as np
import thinkstats2
import nsfg
import thinkplot
%matplotlib inline
Let's load up the NSFG pregnancy data.
In [2]:
preg = nsfg.ReadFemPreg()
preg.shape
Out[2]:
And select the rows corresponding to live births.
In [3]:
live = preg[preg.outcome == 1]
live.shape
Out[3]:
We can use describe
to generate summary statistics.
In [4]:
live.prglngth.describe()
Out[4]:
But there is no substitute for looking at the whole distribution, not just a summary.
One way to represent a distribution is a Probability Mass Function (PMF).
thinkstats2
provides a class named Pmf
that represents a PMF.
A Pmf object contains a Python dictionary that maps from each possible value to its probability (that is, how often it appears in the dataset).
Items
returns a sorted list of values and their probabilities:
In [5]:
pmf = thinkstats2.Pmf(live.prglngth)
for val, prob in pmf.Items():
print(val, prob)
There are some values here that are certainly errors, and some that are suspect. For now we'll take them at face value.
There are several ways to visualize Pmfs.
thinkplot
provides functions to plot Pmfs and other types from thinkstats2
.
thinkplot.Pmf
renders a Pmf as histogram (bar chart).
In [6]:
thinkplot.PrePlot(1)
thinkplot.Hist(pmf)
thinkplot.Config(xlabel='Pregnancy length (weeks)',
ylabel='PMF',
xlim=[0, 50],
legend=False)
Pmf
renders the outline of the histogram.
In [7]:
thinkplot.PrePlot(1)
thinkplot.Pmf(pmf)
thinkplot.Config(xlabel='Pregnancy length (weeks)',
ylabel='PMF',
xlim=[0, 50])
Pdf
tries to render the Pmf with a smooth curve.
In [8]:
thinkplot.PrePlot(1)
thinkplot.Pdf(pmf)
thinkplot.Config(xlabel='Pregnancy length (weeks)',
ylabel='PMF',
xlim=[0, 50])
I started with PMFs and histograms because they are familiar, but I think they are bad for exploration.
For one thing, they don't hold up well when the number of values increases.
In [9]:
pmf_weight = thinkstats2.Pmf(live.totalwgt_lb)
thinkplot.PrePlot(1)
thinkplot.Hist(pmf_weight)
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='PMF')
In [10]:
pmf_weight = thinkstats2.Pmf(live.totalwgt_lb)
thinkplot.PrePlot(1)
thinkplot.Pmf(pmf_weight)
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='PMF')
In [11]:
pmf_weight = thinkstats2.Pmf(live.totalwgt_lb)
thinkplot.PrePlot(1)
thinkplot.Pdf(pmf_weight)
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='PMF')
Sometimes you can make the visualization better by binning the data:
In [12]:
def bin_and_pmf(weights, num_bins):
bins = np.linspace(0, 15.5, num_bins)
indices = np.digitize(weights, bins)
values = bins[indices]
pmf_weight = thinkstats2.Pmf(values)
thinkplot.PrePlot(1)
thinkplot.Pdf(pmf_weight)
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='PMF')
bin_and_pmf(live.totalwgt_lb.dropna(), 50)
Binning is simple enough, but it is still a nuisance.
And it is fragile. If you have too many bins, the result is noisy. Too few, you obliterate features that might be important.
And if the bin boundaries don't align well with data boundaries, you can create artifacts.
In [13]:
bin_and_pmf(live.totalwgt_lb.dropna(), 51)
There must be a better way!
Indeed there is. In my opinion, cumulative distribution functions (CDFs) are a better choice for data exploration.
You don't have to bin the data or make any other transformation.
thinkstats2
provides a function that makes CDFs, and thinkplot
provides a function for plotting them.
In [14]:
data = [1, 2, 2, 5]
pmf = thinkstats2.Pmf(data)
pmf
Out[14]:
In [15]:
cdf = thinkstats2.Cdf(data)
cdf
Out[15]:
In [16]:
thinkplot.PrePlot(1)
thinkplot.Cdf(cdf)
thinkplot.Config(ylabel='CDF',
xlim=[0.5, 5.5])
Let's see what that looks like for real data.
In [17]:
cdf_weight = thinkstats2.Cdf(live.totalwgt_lb)
thinkplot.PrePlot(1)
thinkplot.Cdf(cdf_weight)
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='CDF')
A CDF is a map from each value to its cumulative probability.
You can use it to compute percentiles:
In [18]:
cdf_weight.Percentile(50)
Out[18]:
Or if you are given a value, you can compute its percentile rank.
In [19]:
cdf_weight.PercentileRank(8.3)
Out[19]:
Looking at the CDF, it is easy to see the range of values, the central tendency and spread, as well as the overall shape of the distribution.
If there are particular values that appear often, they are visible as vertical lines. If there are ranges where no values appear, they are visible as horizontal lines.
And one of the best things about CDFs is that you can plot several of them on the same axes for comparison. For example, let's see if first babies are lighter than others.
In [20]:
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]
len(firsts), len(others)
Out[20]:
In [21]:
cdf_firsts = thinkstats2.Cdf(firsts.totalwgt_lb, label='firsts')
cdf_others = thinkstats2.Cdf(others.totalwgt_lb, label='others')
In [26]:
thinkplot.PrePlot(2)
thinkplot.Cdfs([cdf_firsts, cdf_others])
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='CDF',
legend=True)
Plotting the two distributions on the same axes, we can see that the distribution for others is shifted to the right; that is, toward higher values. And we can see that the shift is close to the same over the whole distribution.
Let's see how well we can make this comparison with PMFs:
In [23]:
pmf_firsts = thinkstats2.Pmf(firsts.totalwgt_lb, label='firsts')
pmf_others = thinkstats2.Pmf(others.totalwgt_lb, label='others')
thinkplot.PrePlot(2)
thinkplot.Pdfs([pmf_firsts, pmf_others])
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='PMF')
With PMFs it is hard to compare distributions. And if you plot more than two PMFs on the same axes, it is likely to be a mess.
Reading CDFs takes some getting used to, but it is worth it! For data exploration and visualization, CDFs are better than PMFs in almost every way.
But if you really have to generate a PMF, a good option is to estimate a smoothed PDF using Kernel Density Estimation (KDE).
In [24]:
pdf_firsts = thinkstats2.EstimatedPdf(firsts.totalwgt_lb.dropna(), label='firsts')
pdf_others = thinkstats2.EstimatedPdf(others.totalwgt_lb.dropna(), label='others')
thinkplot.PrePlot(2)
thinkplot.Pdfs([pdf_firsts, pdf_others])
thinkplot.Config(xlabel='Birth weight (lbs)',
ylabel='PDF')
Like binning, KDE involves smoothing the data, so you lose some information.
And you might have to tune the "bandwidth" parameter to get the right amount of smoothing.
In [ ]: