This file contains example code to demonstrate various Pandas features.
In [36]:
from __future__ import print_function, division
%matplotlib inline
For the first example, I'll work with data from the Behavioral Risk Factor Surveillance System (BRFSS).
In [12]:
import brfss
df = brfss.ReadBrfss(nrows=5000)
# Use item assignment to create the column; attribute assignment does not add a new column.
df['height'] = df.htm3
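Of the first 5000 respondents, 42 have invalid heights, which are stored as NaN. Note that the most obvious ways of checking for null values, such as comparing with ==, don't work, because NaN is not equal to anything, including itself. Use isnull instead.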
In [13]:
sum(df.height.isnull())
Out[13]:
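To illustrate why an equality check fails, here is a minimal sketch on toy data. The three heights are made up for the example and are not taken from the BRFSS.

```python
import numpy
import pandas

# Toy data (hypothetical): one missing value among three heights.
heights = pandas.Series([170.0, numpy.nan, 165.0])

# NaN is not equal to anything, including itself, so this comparison finds nothing.
equality_hits = (heights == numpy.nan).sum()

# isnull tests for missing values explicitly.
null_hits = heights.isnull().sum()

print(equality_hits, null_hits)
```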
Use dropna to select valid heights.
In [14]:
valid_heights = df.height.dropna()
len(valid_heights)
Out[14]:
EstimatedPdf is an interface to scipy.stats.gaussian_kde, which computes a kernel density estimate from a sample.
In [15]:
import thinkstats2
pdf = thinkstats2.EstimatedPdf(valid_heights)
The kde object provides resample, which draws random values from the estimated density:
In [29]:
fillable = pdf.kde.resample(len(df)).flatten()
fillable.shape
Out[29]:
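The same steps can be sketched with scipy.stats.gaussian_kde directly, without thinkstats2. The sample below is synthetic (normally distributed values chosen only for illustration), and it shows why the call above ends with flatten: resample returns a 2-D array.

```python
import numpy
import scipy.stats

# Synthetic heights in cm (hypothetical values, not BRFSS data).
numpy.random.seed(17)
sample = numpy.random.normal(168, 10, size=500)

# Fit a Gaussian kernel density estimate to the sample.
kde = scipy.stats.gaussian_kde(sample)

# resample returns an array of shape (dimensions, size),
# so flatten it to get a 1-D array of draws.
draws = kde.resample(1000)
flat = draws.flatten()

print(draws.shape, flat.shape)
```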
Or you can use thinkstats2 objects instead. First convert the EstimatedPdf to a Pmf:
In [32]:
import thinkplot
pmf = pdf.MakePmf()
thinkplot.Pdf(pmf)
You can use the Pmf to generate a random sample, but it is faster to convert to Cdf:
In [39]:
cdf = pmf.MakeCdf()
fillable = cdf.Sample(len(df))
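One common way to sample from a CDF is inverse transform sampling: draw uniform random probabilities and look up the value at each one with a binary search. The sketch below is an illustration of that idea in plain NumPy. The helper sample_from_cdf is hypothetical, and I have not confirmed that Cdf.Sample is implemented this way.

```python
import numpy

def sample_from_cdf(values, n):
    """Draw n values by inverse transform sampling from the empirical CDF.

    Hypothetical helper for illustration; not part of thinkstats2.
    """
    xs = numpy.sort(numpy.asarray(values, dtype=float))
    # Cumulative probability at each sorted value: 1/N, 2/N, ..., 1.
    ps = numpy.arange(1, len(xs) + 1) / float(len(xs))
    # Uniform random probabilities in [0, 1).
    us = numpy.random.random(n)
    # Binary search for the first value whose cumulative probability is >= u.
    indices = numpy.searchsorted(ps, us, side='left')
    return xs[indices]

result = sample_from_cdf([1, 2, 3], 50)
print(len(result))
```

Because the lookup is a binary search over a sorted array, each draw costs on the order of log N comparisons.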
Then we can use fillna to replace the NaNs with values from the random sample.
In [40]:
import pandas
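# Give the fill values the same index as df so fillna aligns them row by row.
series = pandas.Series(fillable, index=df.index)
# Assign the result back instead of calling fillna in place on a column.
df['height'] = df.height.fillna(series)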
sum(df.height.isnull())
Out[40]:
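When fillna is given a Series, it matches fill values to missing values by index label, and it leaves existing values untouched. A small sketch on made-up data:

```python
import numpy
import pandas

# Toy column with missing values at labels 1 and 3 (hypothetical data).
height = pandas.Series([170.0, numpy.nan, 165.0, numpy.nan])

# Candidate replacements, one per row, sharing the same index.
fill = pandas.Series([150.0, 151.0, 152.0, 153.0], index=height.index)

# Each NaN takes the fill value with the same label; valid entries are kept.
filled = height.fillna(fill)

print(filled.tolist())
```

Only the fill values at labels 1 and 3 are used. The rest of the fill Series is ignored.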
In [35]:
cdf = thinkstats2.Cdf(df.height)
thinkplot.Cdf(cdf)
Out[35]:
In [2]:
import brfss
resp = brfss.ReadBrfss(nrows=5000).dropna(subset=['sex', 'htm3'])
groups = resp.groupby('sex')
In [12]:
# Map each value of sex to a NumPy array of that group's heights.
d = {}
for name, group in groups:
    d[name] = group.htm3.values
In [13]:
d
Out[13]:
In [1]:
import brfss
resp = brfss.ReadBrfss().dropna(subset=['sex', 'wtkg2'])
In [4]:
groups = resp.groupby('sex')
# Map each value of sex to a Series of that group's weights.
d = {}
for name, group in groups:
    d[name] = group.wtkg2
In [5]:
d
Out[5]:
In [9]:
import numpy
# items() works in both Python 2 and Python 3; iteritems() is Python 2 only.
for sex, weights in d.items():
    print(sex, numpy.log(weights).mean(), numpy.log(weights).std())
In [10]:
import scipy.stats
In [28]:
shape, loc, scale = scipy.stats.lognorm.fit(d[1], floc=0)
shape, loc, scale
Out[28]:
In [27]:
shape, loc, scale = scipy.stats.lognorm.fit(d[2], floc=0)
shape, loc, scale
Out[27]:
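With loc fixed at 0, the fitted parameters relate directly to the log-transformed data: shape estimates the standard deviation of log(x), and log(scale) estimates its mean. The sketch below checks this on synthetic lognormal data. The values of mu and sigma are arbitrary choices for the example, not estimates from the BRFSS.

```python
import numpy
import scipy.stats

# Synthetic weights (hypothetical parameters, not BRFSS estimates).
numpy.random.seed(17)
mu, sigma = 4.3, 0.2
weights = numpy.random.lognormal(mu, sigma, size=5000)

# Fit a lognormal distribution with the location fixed at zero.
shape, loc, scale = scipy.stats.lognorm.fit(weights, floc=0)

# Mean and standard deviation of the log-transformed data.
log_mean = numpy.log(weights).mean()
log_std = numpy.log(weights).std()

print(shape, log_std)
print(numpy.log(scale), log_mean)
```

Note that a NumPy array's std uses ddof=0 by default, while a pandas Series uses ddof=1, so the match with a Series is slightly less exact.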
In [12]:
import thinkstats2
cdf = thinkstats2.Cdf(d[2])
In [14]:
import thinkplot
%matplotlib inline
thinkplot.Cdf(cdf)
Out[14]:
In [25]:
rv = scipy.stats.lognorm(0.23, loc=0, scale=70.8)
In [26]:
import matplotlib.pyplot as pyplot
xs = numpy.linspace(20, 200, 100)
ys = rv.cdf(xs)
thinkplot.Cdf(cdf)
pyplot.plot(xs, ys)
Out[26]: