This file contains example code to demonstrate various Pandas features.
In [2]:
from __future__ import print_function, division
%matplotlib inline
For the first example, I'll work with data from the BRFSS (Behavioral Risk Factor Surveillance System).
In [3]:
import brfss
df = brfss.ReadBrfss(nrows=5000)
df['height'] = df.htm3
Of the first 5000 respondents, 42 have invalid heights. Note that the most obvious checks for null, such as comparing with == against NaN, don't work, because NaN is not equal to itself. Use isnull instead.
In [4]:
sum(df.height.isnull())
Out[4]:
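Here's a small sketch of why the obvious check fails. The values are made up for illustration and are not BRFSS data.

```python
import numpy as np
import pandas as pd

# Toy heights in cm (hypothetical values, not from the BRFSS)
heights = pd.Series([170.0, np.nan, 165.0, np.nan, 180.0])

# Comparing with == never matches, because NaN != NaN
n_equal = int((heights == np.nan).sum())

# isnull() flags the missing entries correctly
n_null = int(heights.isnull().sum())

print(n_equal, n_null)
```

The equality test finds no missing values, while isnull finds both of them.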
Use dropna to select valid heights.
In [5]:
valid_heights = df.height.dropna()
len(valid_heights)
Out[5]:
EstimatedPdf is a wrapper around scipy.stats.gaussian_kde, which it stores in its kde attribute.
In [6]:
import thinkstats2
pdf = thinkstats2.EstimatedPdf(valid_heights)
The kde object provides resample:
In [7]:
fillable = pdf.kde.resample(len(df)).flatten()
fillable.shape
Out[7]:
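To see why the call to flatten is needed, here's a sketch that uses scipy.stats.gaussian_kde directly. The sample is synthetic (a normal distribution with made-up parameters), so the numbers only stand in for real heights.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic "heights" in cm; the mean and spread are assumptions
rng = np.random.default_rng(17)
sample = rng.normal(168, 10, size=500)

kde = gaussian_kde(sample)

# resample returns a 2-D array with shape (dimensions, size),
# so 1-D data comes back with shape (1, size)
draws = kde.resample(1000)
flat = draws.flatten()

print(draws.shape, flat.shape)
```

Flattening turns the (1, n) result into a plain 1-D array, which is the shape you want for filling a column.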
Or you can use thinkstats objects instead. First convert from EstimatedPdf to Pmf:
In [8]:
import thinkplot
pmf = pdf.MakePmf()
thinkplot.Pdf(pmf)
You can use the Pmf to generate a random sample, but it is faster to convert to Cdf:
In [9]:
cdf = pmf.MakeCdf()
fillable = cdf.Sample(len(df))
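Sampling from a Cdf is an instance of inverse transform sampling: draw uniform random numbers and look up the value at each cumulative probability. Here's a minimal NumPy sketch of that general technique. It is not necessarily how thinkstats2 implements Sample, and the values and probabilities are made up.

```python
import numpy as np

rng = np.random.default_rng(17)

# A toy distribution (hypothetical values and probabilities)
values = np.array([150.0, 160.0, 170.0, 180.0])
probs = np.array([0.1, 0.4, 0.4, 0.1])

# Cumulative probabilities, one per value
cum = np.cumsum(probs)

def sample_from_cdf(n):
    """Draws n values by inverting the cumulative distribution."""
    u = rng.random(n)                        # uniform in [0, 1)
    idx = np.searchsorted(cum, u, side='left')
    # Guard against round-off leaving cum[-1] slightly below 1
    idx = np.minimum(idx, len(values) - 1)
    return values[idx]

draws = sample_from_cdf(10000)
print(draws.mean())
```

Because the cumulative probabilities are sorted, each lookup is a binary search, and the whole sample can be drawn in one vectorized call.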
Then we can use fillna to replace NaNs:
In [10]:
import pandas
series = pandas.Series(fillable)
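# Assign the result back rather than filling in place through attribute
# access; fillna matches `series` to df.height by index label
df['height'] = df.height.fillna(series)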
sum(df.height.isnull())
Out[10]:
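When you pass a Series to fillna, pandas matches the fill values to the missing entries by index label, and leaves valid entries alone. A toy sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two valid heights and two missing ones
height = pd.Series([170.0, np.nan, 165.0, np.nan])

# Candidate fill values, sharing the default index 0..3
fill = pd.Series([1.0, 2.0, 3.0, 4.0])

# Only positions 1 and 3 are filled; 170 and 165 are untouched
filled = height.fillna(fill)

print(filled.tolist())
```

That's why it is fine to generate len(df) fill values: the ones that line up with valid heights are simply ignored.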
In [11]:
cdf = thinkstats2.Cdf(df.height)
thinkplot.Cdf(cdf)
Out[11]:
In [14]:
import brfss
resp = brfss.ReadBrfss(nrows=5000).dropna(subset=['sex', 'htm3'])
grouped = resp.groupby('sex')
In [30]:
for i, group in grouped:
    print(i, group.shape)
In [26]:
grouped.get_group(1).mean()
Out[26]:
In [23]:
grouped.mean()
Out[23]:
In [21]:
grouped['htm3'].mean()
Out[21]:
In [27]:
grouped.htm3.std()
Out[27]:
In [18]:
import numpy
grouped.aggregate(numpy.mean)
Out[18]:
In [19]:
grouped.aggregate(numpy.std)
Out[19]:
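Here's a self-contained sketch of the same groupby operations on a toy DataFrame. The column names match the ones above, but the values are made up.

```python
import pandas as pd

# Toy respondents (hypothetical values): sex code and height in cm
toy = pd.DataFrame({
    'sex': [1, 1, 2, 2],
    'htm3': [170.0, 180.0, 160.0, 164.0],
})

g = toy.groupby('sex')

# Per-group statistics for one column
means = g['htm3'].mean()
stds = g['htm3'].std()      # pandas uses the sample std (ddof=1)

# Several statistics at once, one row per group
summary = g['htm3'].agg(['mean', 'std'])

print(summary)
```

Note that the pandas std method divides by n-1 by default, while numpy.std divides by n unless you pass ddof=1, so the two can disagree on small groups.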
In [13]:
d = {}
for name, group in grouped:
    d[name] = group.htm3.values
In [13]:
d
Out[13]:
In [1]:
import brfss
resp = brfss.ReadBrfss().dropna(subset=['sex', 'wtkg2'])
In [4]:
groups = resp.groupby('sex')
d = {}
for name, group in groups:
    d[name] = group.wtkg2
In [5]:
d
Out[5]:
In [9]:
import numpy
for sex, weights in d.items():
    print(sex, numpy.log(weights).mean(), numpy.log(weights).std())
In [10]:
import scipy.stats
In [28]:
shape, loc, scale = scipy.stats.lognorm.fit(d[1], floc=0)
shape, loc, scale
Out[28]:
In [27]:
shape, loc, scale = scipy.stats.lognorm.fit(d[2], floc=0)
shape, loc, scale
Out[27]:
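With loc fixed at 0, the fitted lognormal parameters should line up with the statistics of the log-transformed data: shape is close to the standard deviation of log(x), and log(scale) is close to the mean of log(x). Here's a sketch that checks this on synthetic data. The lognormal parameters used to generate the sample are assumptions, not estimates from the BRFSS.

```python
import numpy as np
import scipy.stats

# Synthetic "weights" in kg, drawn from a lognormal with made-up parameters
rng = np.random.default_rng(17)
weights = rng.lognormal(mean=4.3, sigma=0.2, size=5000)

# Fit with the location parameter held at 0
shape, loc, scale = scipy.stats.lognorm.fit(weights, floc=0)

# Mean and standard deviation of the log-transformed data
mu = np.log(weights).mean()
sigma = np.log(weights).std()

print(shape, sigma)
print(np.log(scale), mu)
```

So the mean and standard deviation of the logs give a quick sanity check on the result of fit.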
In [12]:
import thinkstats2
cdf = thinkstats2.Cdf(d[2])
In [14]:
import thinkplot
%matplotlib inline
thinkplot.Cdf(cdf)
Out[14]:
In [25]:
rv = scipy.stats.lognorm(0.23, 0, 70.8)
In [26]:
import matplotlib.pyplot as pyplot
xs = numpy.linspace(20, 200, 100)
ys = rv.cdf(xs)
thinkplot.Cdf(cdf)
pyplot.plot(xs, ys)
Out[26]:
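Besides comparing the two curves by eye, you can measure the largest vertical gap between the empirical CDF and the model CDF. This is a sketch on synthetic data with made-up parameters. It computes the empirical CDF with NumPy rather than thinkstats2, and it evaluates the gap only at the upper edge of each step.

```python
import numpy as np
import scipy.stats

# Synthetic "weights" in kg (the generating parameters are assumptions)
rng = np.random.default_rng(17)
weights = rng.lognormal(mean=4.3, sigma=0.2, size=5000)

# Fit a lognormal model with loc fixed at 0
shape, loc, scale = scipy.stats.lognorm.fit(weights, floc=0)
rv = scipy.stats.lognorm(shape, loc, scale)

# Empirical CDF: the fraction of values <= each sorted value
xs = np.sort(weights)
ecdf = np.arange(1, len(xs) + 1) / len(xs)

# Largest vertical distance between the data and the model
max_gap = np.max(np.abs(ecdf - rv.cdf(xs)))

print(max_gap)
```

A small gap means the model tracks the data closely. With real data, the places where the gap is largest show where the lognormal model fits worst.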