This file contains example code to demonstrate various Pandas features.



In [36]:

    
from __future__ import print_function, division

%matplotlib inline

For the first example, I'll work with data from the BRFSS (Behavioral Risk Factor Surveillance System).



In [12]:

    
import brfss
df = brfss.ReadBrfss(nrows=5000)
# Use bracket assignment: attribute assignment does not create a new column.
df['height'] = df.htm3

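Of the first 5000 respondents, 42 have invalid heights, which are stored as NaN. Note that the most obvious ways of checking for null don't work: NaN is not equal to anything, including itself, so a comparison like df.height == numpy.nan is False for every row. Use isnull instead.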



In [13]:

    
df.height.isnull().sum()









    Out[13]:





42
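
To see why the obvious check fails, here is a minimal sketch using a small made-up Series (the values are illustrative, not BRFSS data). The equality comparison finds nothing, while isnull finds both missing entries.

```python
import numpy as np
import pandas as pd

# Made-up heights in cm; two entries are missing (illustrative only).
heights = pd.Series([170.0, np.nan, 183.0, np.nan, 157.0])

# NaN never compares equal to anything, including itself,
# so an equality test finds no missing values.
n_equal = int((heights == np.nan).sum())

# isnull tests for NaN explicitly.
n_null = int(heights.isnull().sum())

print(n_equal, n_null)
```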

Use dropna to select valid heights.



In [14]:

    
valid_heights = df.height.dropna()
len(valid_heights)









    Out[14]:





4958

EstimatedPdf is a thin wrapper around scipy.stats.gaussian_kde, which estimates a probability density from a sample.



In [15]:

    
import thinkstats2
pdf = thinkstats2.EstimatedPdf(valid_heights)
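
For readers without thinkstats2 at hand, the following sketch calls scipy.stats.gaussian_kde directly. It assumes a synthetic, roughly normal sample in place of the BRFSS heights; the sample parameters are made up for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic "heights" in cm (illustrative, not BRFSS data).
rng = np.random.default_rng(17)
sample = rng.normal(loc=168, scale=10, size=1000)

# Fit a Gaussian kernel density estimate to the sample.
kde = gaussian_kde(sample)

# Evaluate the estimated density at a few heights.
density = kde.evaluate([150, 168, 190])
print(density)
```

The density should be highest near the center of the sample and lower in the tails.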

The underlying kde object provides resample, which draws random values from the estimated density:



In [29]:

    
fillable = pdf.kde.resample(len(df)).flatten()
fillable.shape









    Out[29]:





(5000,)
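
The call to flatten is needed because, for one-dimensional data, resample returns a two-dimensional array with a single row. A small sketch on synthetic data (the sample is made up for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic one-dimensional sample (illustrative, not BRFSS data).
rng = np.random.default_rng(17)
sample = rng.normal(loc=168, scale=10, size=1000)
kde = gaussian_kde(sample)

# resample returns shape (dimensions, size); here that is (1, 500).
draws = kde.resample(500)
print(draws.shape)

# flatten turns the single row into a plain 1-D array.
flat = draws.flatten()
print(flat.shape)
```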

Or you can use thinkstats2 objects instead. First convert from EstimatedPdf to Pmf:



In [32]:

    
import thinkplot
pmf = pdf.MakePmf()
thinkplot.Pdf(pmf)

You can use the Pmf to generate a random sample, but it is faster to convert to Cdf:



In [39]:

    
cdf = pmf.MakeCdf()
fillable = cdf.Sample(len(df))
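
One reason sampling from a CDF is fast is that each draw reduces to a binary search in the cumulative probabilities. The sketch below illustrates that idea with plain NumPy; it is not the thinkstats2 implementation, and the values and probabilities are made up.

```python
import numpy as np

# A small discrete distribution (illustrative values and probabilities).
xs = np.array([150.0, 160.0, 170.0, 180.0])
ps = np.array([0.1, 0.4, 0.3, 0.2])

# Cumulative probabilities; force the last one to exactly 1
# to guard against floating-point rounding.
cum = np.cumsum(ps)
cum[-1] = 1.0

# Inverse transform sampling: draw uniform numbers in [0, 1) and
# look up the first cumulative probability that is >= each draw.
rng = np.random.default_rng(0)
u = rng.random(10000)
idx = np.searchsorted(cum, u, side='left')
draws = xs[idx]

# The observed frequency of 160 should be close to 0.4.
print(np.mean(draws == 160.0))
```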

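Then we can use fillna to replace the NaNs. Because fillna matches values by index label, each missing height is replaced by the entry of the replacement Series at the same index.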



In [40]:

    
import pandas
series = pandas.Series(fillable)
# Assign the result back instead of using inplace=True on a selected column.
df.height = df.height.fillna(series)
df.height.isnull().sum()









    Out[40]:





0
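
A minimal sketch of how fillna behaves when given a Series, using made-up values: only the NaN positions are filled, and each takes the replacement value with the same index label.

```python
import numpy as np
import pandas as pd

# Heights with two missing entries (illustrative values).
heights = pd.Series([170.0, np.nan, 183.0, np.nan])

# Replacement values, one per index label 0..3.
fill = pd.Series([161.0, 162.0, 163.0, 164.0])

# Valid entries are kept; NaNs at labels 1 and 3 take fill[1] and fill[3].
filled = heights.fillna(fill)
print(filled.tolist())
```

If the two Series had different index labels, unmatched NaNs would stay NaN, so it is worth checking the count of nulls afterward, as the cell above does.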



In [35]:

    
# Plot the CDF of heights after the missing values have been filled in.
cdf = thinkstats2.Cdf(df.height)
thinkplot.Cdf(cdf)









    Out[35]:





{'xscale': 'linear', 'yscale': 'linear'}



In [2]:

    
import brfss
# Keep only respondents with valid sex and height, then group by sex.
resp = brfss.ReadBrfss(nrows=5000).dropna(subset=['sex', 'htm3'])
groups = resp.groupby('sex')



In [12]:

    
# Map each sex code to a NumPy array of heights.
d = {}
for name, group in groups:
    d[name] = group.htm3.values



In [13]:

    
d









    Out[13]:





{1: array([ 170.,  185.,  183., ...,  178.,  175.,  170.]),
 2: array([ 157.,  163.,  165., ...,  168.,  157.,  173.])}
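
The same pattern can be tried on a toy DataFrame. This sketch assumes made-up rows with the same column names; iterating over a groupby yields (key, sub-DataFrame) pairs, with keys in sorted order by default.

```python
import pandas as pd

# Toy data with the same column names (illustrative values).
toy = pd.DataFrame({
    'sex': [1, 2, 1, 2],
    'htm3': [170.0, 157.0, 185.0, 163.0],
})

# Build a dict that maps each sex code to an array of heights.
by_sex = {name: group.htm3.values for name, group in toy.groupby('sex')}

for name, heights in by_sex.items():
    print(name, heights)
```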



In [1]:

    
import brfss
# Read the full dataset and keep respondents with valid sex and weight.
resp = brfss.ReadBrfss().dropna(subset=['sex', 'wtkg2'])



In [4]:

    
# Map each sex code to a Series of weights in kg.
groups = resp.groupby('sex')
d = {}
for name, group in groups:
    d[name] = group.wtkg2



In [5]:

    
d









    Out[5]:





{1: 3      73.64
 4      88.64
 5     109.09
 8      90.00
 9      77.27
 10     63.64
 13    127.27
 20     76.36
 23     78.18
 26     77.27
 35     81.82
 39     90.00
 42     90.91
 45     93.18
 50     81.82
 ...
 414468     63.64
 414470     81.82
 414474     87.73
 414475    113.64
 414477    100.91
 414480     89.09
 414481     76.36
 414488    102.27
 414490     71.82
 414498     75.00
 414501     86.36
 414503     78.18
 414504     88.64
 414506     90.91
 414508     75.00
 Name: wtkg2, Length: 153900, dtype: float64, 2: 0      70.91
 1      72.73
 6      50.00
 7     122.73
 11     78.18
 12     62.73
 14     95.45
 15     88.64
 16     90.91
 17     50.00
 18    100.00
 19     72.73
 21     63.64
 22     55.45
 24     90.91
 ...
 414483     68.18
 414484     87.27
 414485     77.27
 414486     61.36
 414489     86.36
 414492     70.45
 414493     56.82
 414494     68.18
 414495     72.73
 414496     56.82
 414497     65.91
 414499    129.55
 414500     75.00
 414505     72.73
 414507     89.09
 Name: wtkg2, Length: 244584, dtype: float64}



In [9]:

    
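import numpy

# Mean and standard deviation of log weight for each sex.
# items() works in both Python 2 and Python 3; iteritems() is Python 2 only.
for sex, weights in d.items():
    log_weights = numpy.log(weights)
    print(sex, log_weights.mean(), log_weights.std())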









    



(1, 4.4693001977146656, 0.19557721757853172)
(2, 4.2596856357921178, 0.22599757494674719)



In [10]:

    
import scipy.stats



In [28]:

    
# Fit a lognormal distribution to the weights for sex code 1, with location fixed at 0.
shape, loc, scale = scipy.stats.lognorm.fit(d[1], floc=0)
shape, loc, scale









    Out[28]:





(0.19557677265342968, 0, 87.295636995626353)



In [27]:

    
# Fit a lognormal distribution to the weights for sex code 2, with location fixed at 0.
shape, loc, scale = scipy.stats.lognorm.fit(d[2], floc=0)
shape, loc, scale









    Out[27]:





(0.22599714638037718, 0, 70.787767991419116)
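
The fitted values line up with the log-weight statistics printed earlier: with the location fixed at 0, the shape is close to the standard deviation of the log weights, and the scale is close to the exponential of their mean. The sketch below checks that relationship on synthetic lognormal data; the parameters 4.4 and 0.2 are made up for illustration.

```python
import numpy as np
import scipy.stats

# Synthetic lognormal "weights" (illustrative parameters, not BRFSS data).
rng = np.random.default_rng(1)
weights = rng.lognormal(mean=4.4, sigma=0.2, size=20000)

# Fit with the location fixed at 0, as in the cells above.
shape, loc, scale = scipy.stats.lognorm.fit(weights, floc=0)

# Compare with the mean and standard deviation of the log values.
logs = np.log(weights)
print(shape, logs.std())
print(np.log(scale), logs.mean())
```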



In [12]:

    
import thinkstats2
cdf = thinkstats2.Cdf(d[2])



In [14]:

    
import thinkplot
%matplotlib inline

thinkplot.Cdf(cdf)









    Out[14]:





{'xscale': 'linear', 'yscale': 'linear'}



In [25]:

    
# Lognormal model using rounded versions of the parameters fitted for sex code 2.
rv = scipy.stats.lognorm(0.23, loc=0, scale=70.8)



In [26]:

    
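import matplotlib.pyplot as pyplot
# Evaluate the model CDF on a grid of weights from 20 to 200 kg.
xs = numpy.linspace(20, 200, 100)
ys = rv.cdf(xs)
# Plot the empirical CDF and overlay the lognormal model.
thinkplot.Cdf(cdf)
pyplot.plot(xs, ys)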









    Out[26]:





[<matplotlib.lines.Line2D at 0x7f6a0b53e150>]
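
Besides comparing the two curves by eye, one can compute the largest vertical gap between the empirical CDF and the model CDF. The sketch below does this for a synthetic sample drawn from the same lognormal model, so the gap is expected to be small; with real data the gap indicates how well the model fits. The sample and its size are assumptions made for illustration.

```python
import numpy as np
import scipy.stats

# Lognormal model with the rounded parameters used above.
rv = scipy.stats.lognorm(0.23, loc=0, scale=70.8)

# Synthetic sample from the same distribution (stand-in for real weights).
rng = np.random.default_rng(2)
sample = np.sort(rng.lognormal(mean=np.log(70.8), sigma=0.23, size=5000))

# Empirical CDF: fraction of the sample at or below each sorted value.
ecdf = np.arange(1, len(sample) + 1) / float(len(sample))

# Largest vertical distance between the empirical and model CDFs.
max_gap = np.max(np.abs(ecdf - rv.cdf(sample)))
print(max_gap)
```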


