Study notes for ThinkStats2, Chapter 3

Example solutions to the exercises are in chap3ex.ipynb.



In [26]:

    
#!/usr/bin/python
#-*- encoding: utf-8 -*-
"""
Sample Codes for ThinkStats2 - Chapter3

Copyright 2015 @myuuuuun
URL: https://github.com/myuuuuun/ThinkStats2-Notebook
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

%matplotlib inline
from __future__ import division, print_function
import sys
sys.path.append('./code')  # make the book's helper modules importable
import pandas as pd
import nsfg
import custom_functions as cf
import math
import numpy as np
import thinkstats2
import thinkplot

3.1 Pmfs (probability mass functions)

Divide each frequency in a Hist object by the total count so that the values become probabilities.



In [5]:

    
hist = thinkstats2.Hist([1, 2, 2, 3, 5])
n = hist.Total()
d = {}
for x, freq in hist.Items():
    d[x] = freq / n

print(d)

{1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}

The same thing can be done with a Pmf object.



In [6]:

    
pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])

print(pmf)

Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})

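For example, the probabilities that the value is 2, 5, and 4 are obtained as follows. Prob returns 0 for a value that does not appear in the Pmf, such as 4.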



In [8]:

    
print(pmf.Prob(2))
print(pmf.Prob(5))
print(pmf.Prob(4))
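
The lookups above can be mirrored without thinkstats2. The following is a minimal sketch, assuming a plain dict built with collections.Counter stands in for the Pmf object; make_pmf and prob are hypothetical helper names and are not part of thinkstats2.

```python
from collections import Counter


def make_pmf(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = float(sum(counts.values()))
    return {x: freq / total for x, freq in counts.items()}


def prob(pmf, x, default=0):
    """Look up the probability of x, falling back to default for unseen values."""
    return pmf.get(x, default)


pmf = make_pmf([1, 2, 2, 3, 5])
print(prob(pmf, 2))  # probability of the value 2
print(prob(pmf, 5))  # probability of the value 5
print(prob(pmf, 4))  # 4 never occurs, so the default is returned
```

Returning a default of 0 for unseen values keeps a lookup from raising a KeyError, which is convenient when comparing two distributions that do not share exactly the same set of values.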

Next, change the probability assigned to one of the keys. For example, increasing the probability of 2 by 0.2 gives:



In [9]:

    
pmf.Incr(2, 0.2)
print(pmf.Prob(2))

0.6

The probability of 2 is now 0.6, but the probabilities no longer sum to 1. Indeed:



In [12]:

    
print(pmf)
print(pmf.Total())

Pmf({1: 0.2, 2: 0.6000000000000001, 3: 0.2, 5: 0.2})
1.2

The total probability has become 1.2. Renormalize so that it is 1 again:



In [14]:

    
pmf.Normalize()
print(pmf)
print(pmf.Total())

Pmf({1: 0.16666666666666666, 2: 0.5, 3: 0.16666666666666666, 5: 0.16666666666666666})
1.0

Normalize rescales the probabilities so that they sum to 1.
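
What normalization does can be spelled out with a plain dict. This is only a sketch under the assumption that the PMF is stored as {value: probability}; normalize is a hypothetical helper, not the thinkstats2 implementation.

```python
def normalize(pmf):
    """Return a copy of pmf rescaled so that its probabilities sum to 1."""
    total = float(sum(pmf.values()))
    if total == 0:
        raise ValueError("cannot normalize: total probability is zero")
    return {x: p / total for x, p in pmf.items()}


pmf = {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}
pmf[2] += 0.2                # same effect as Incr(2, 0.2)
print(sum(pmf.values()))     # no longer 1
pmf = normalize(pmf)
print(sum(pmf.values()))     # back to 1, up to floating-point rounding
```

Each probability is divided by the same total, so the ratios between the values are preserved while the sum returns to 1.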

Another way to change a probability is Mult, which multiplies the probability of a given value by a factor:



In [16]:

    
pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])
print(pmf)
pmf.Mult(2, 0.5)
print(pmf)
pmf.Normalize()
print(pmf)

Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})
Pmf({1: 0.2, 2: 0.2, 3: 0.2, 5: 0.2})
Pmf({1: 0.25, 2: 0.25, 3: 0.25, 5: 0.25})

A probability can also be overwritten directly by assignment:



In [18]:

    
pmf[2] = 0.2
print(pmf)
pmf.Normalize()
print(pmf)

Pmf({1: 0.25, 2: 0.2, 3: 0.25, 5: 0.25})
Pmf({1: 0.2631578947368421, 2: 0.21052631578947367, 3: 0.2631578947368421, 5: 0.2631578947368421})

3.2 Plotting PMFs

Plot the two Pmfs, first as bars placed side by side and then overlaid on the same axes.



In [39]:

    
df = nsfg.ReadFemPreg()
live = df[df.outcome == 1]
# Build the masks from `live` itself so the boolean index matches the frame being filtered
firsts = live[live.birthord == 1]
others = live[live.birthord > 1]
weeks_f = firsts['prglngth']
weeks_o = others['prglngth']
first_pmf = thinkstats2.Pmf(weeks_f)
other_pmf = thinkstats2.Pmf(weeks_o)

width = 0.45

# Side by side
thinkplot.PrePlot(2, cols=2)  # lay out two plots side by side
thinkplot.Hist(first_pmf, align='right', width=width)
thinkplot.Hist(other_pmf, align='left', width=width)
thinkplot.Config(xlabel='weeks', ylabel='probability', axis=[27, 46, 0, 0.6])

# Overlaid
thinkplot.PrePlot(2)
thinkplot.SubPlot(2)
thinkplot.Pmfs([first_pmf, other_pmf])

thinkplot.Show(xlabel='weeks', axis=[27, 46, 0, 0.6])

[Figure: PMFs of pregnancy length in weeks for first babies and others; one panel shows the bars side by side, the other shows the two PMFs overlaid.]

3.3 Other visualizations

For each week, plot the difference, in percentage points, between the probability of that pregnancy length for first babies and for other babies.



In [43]:

    
weeks = range(35, 46)
diffs = []
for week in weeks:
    p1 = first_pmf.Prob(week)
    p2 = other_pmf.Prob(week)
    diff = 100 * (p1 - p2)
    diffs.append(diff)

thinkplot.Bar(weeks, diffs)
thinkplot.Show(xlabel='weeks', ylabel='difference in percentage (firsts - others)')

[Figure: bar chart of the difference in percentage points (firsts - others) by week, for weeks 35 to 45.]

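3.4 The class size paradox

Suppose that every student at a university takes exactly one class, and that an observer wants to know the average class size.
One way to find out is to survey students, asking each one how many students are in their class.

However, a large class contributes more respondents than a small one, so a class of size x is over-represented in the survey by a factor of x (biased_pmf).
The true distribution can therefore be estimated by dividing the probability of each reported size by that size and renormalizing (unbiased_pmf).

The actual mean class size is: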



In [6]:

    
d = {7: 8, 12: 8, 17: 14, 22: 4,
     27: 6, 32: 12, 37: 8, 42: 3, 47: 2}

pmf = thinkstats2.Pmf(d, label='actual')
print('mean', pmf.Mean())

mean 23.6923076923

Compute the distribution of class sizes that would be observed if students were sampled at random.



In [7]:

    
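def BiasPmf(pmf, label):
    """Return the Pmf a per-student survey would observe."""
    new_pmf = pmf.Copy(label=label)

    # A class of size x is reported by x students, so weight it by x
    for x, p in pmf.Items():
        new_pmf.Mult(x, x)
        
    new_pmf.Normalize()
    return new_pmf

Plot the actual distribution and the distribution observed in the survey together.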



In [19]:

    
biased_pmf = BiasPmf(pmf, label='observed')
thinkplot.PrePlot(2)
thinkplot.Pmfs([pmf, biased_pmf])
thinkplot.Show(xlabel='class size', ylabel='PMF')

[Figure: PMFs of class size for the actual and observed distributions; x-axis: class size, y-axis: PMF.]
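
The size of the bias can also be checked numerically. The following is a sketch with plain dicts, assuming the class-size counts d used above; normalize, bias and mean are hypothetical helpers that stand in for Normalize, BiasPmf and Pmf.Mean.

```python
# Class-size counts from the example above: {class size: number of classes}
d = {7: 8, 12: 8, 17: 14, 22: 4,
     27: 6, 32: 12, 37: 8, 42: 3, 47: 2}


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = float(sum(weights.values()))
    return {x: w / total for x, w in weights.items()}


def bias(pmf):
    """Weight each class size x by x, as a per-student survey would."""
    return normalize({x: p * x for x, p in pmf.items()})


def mean(pmf):
    """Expected value of a PMF stored as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


actual = normalize(d)
observed = bias(actual)
print('actual mean  ', mean(actual))    # about 23.7
print('observed mean', mean(observed))  # about 29.1
```

The observed mean is larger than the actual mean because students in large classes are sampled more often than students in small ones.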

From the observed survey data, compute the bias-corrected estimate of the class size distribution.



In [22]:

    
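def UnbiasPmf(pmf, label):
    """Undo the survey bias by weighting each class size x by 1/x."""
    new_pmf = pmf.Copy(label=label)

    # Each class of size x was counted x times, so divide its weight by x
    for x, p in pmf.Items():
        new_pmf.Mult(x, 1.0/x)
        
    new_pmf.Normalize()
    return new_pmf

Plot the actual distribution and the bias-corrected estimate together.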



In [23]:

    
unbiased_pmf = UnbiasPmf(biased_pmf, label='inverted_observed')
thinkplot.PrePlot(2)
thinkplot.Pmfs([pmf, unbiased_pmf])
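thinkplot.Show(xlabel='class size', ylabel='PMF')

[Figure: PMFs of class size for the actual distribution and the bias-corrected (inverted_observed) estimate; x-axis: class size, y-axis: PMF.]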
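
Biasing and then unbiasing should give back the original distribution. A small round-trip check, again with plain dicts rather than thinkstats2 objects; normalize, bias and unbias are hypothetical helpers, and the sketch assumes every class size is non-zero.

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = float(sum(weights.values()))
    return {x: w / total for x, w in weights.items()}


def bias(pmf):
    """Weight each class size x by x."""
    return normalize({x: p * x for x, p in pmf.items()})


def unbias(pmf):
    """Weight each class size x by 1/x (x must be non-zero)."""
    return normalize({x: p / float(x) for x, p in pmf.items()})


actual = normalize({7: 8, 12: 8, 17: 14, 22: 4,
                    27: 6, 32: 12, 37: 8, 42: 3, 47: 2})
recovered = unbias(bias(actual))

# Largest absolute difference between the original and recovered probabilities
max_error = max(abs(recovered[x] - actual[x]) for x in actual)
print(max_error)  # only floating-point rounding error remains
```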

3.5 DataFrame indexing

Various indexing operations in pandas.

First, create a DataFrame to work with.



In [28]:

    
array = np.random.randn(4, 2)
df = pd.DataFrame(array)

print(df)

          0         1
0 -0.043833  1.409555
1  0.668582 -1.233178
2 -0.011795  2.834356
3  0.203322 -1.003067

Give the columns names.



In [30]:

    
columns = ['A', 'B']
df = pd.DataFrame(array, columns=columns)

print(df)

          A         B
0 -0.043833  1.409555
1  0.668582 -1.233178
2 -0.011795  2.834356
3  0.203322 -1.003067

Give the rows names. The set of row names is called the index, and the individual row names are called labels.



In [33]:

    
index = ['a', 'b', 'c', 'd']
df = pd.DataFrame(array, columns=columns, index=index)

print(df)

          A         B
a -0.043833  1.409555
b  0.668582 -1.233178
c -0.011795  2.834356
d  0.203322 -1.003067

A column can be selected as follows:



In [36]:

    
print(df['A'])

print("\n or \n")

print(df.A)

a   -0.043833
b    0.668582
c   -0.011795
d    0.203322
Name: A, dtype: float64

 or 

a   -0.043833
b    0.668582
c   -0.011795
d    0.203322
Name: A, dtype: float64

A row can be selected as follows:



In [37]:

    
print(df.loc['a'])

print("\n or \n")

print(df.iloc[0])  # rows can still be selected by integer position after they have been given labels

A   -0.043833
B    1.409555
Name: a, dtype: float64

 or 

A   -0.043833
B    1.409555
Name: a, dtype: float64

A list of row labels can also be passed to select several rows at once.



In [40]:

    
indices = ['a', 'c']
print(df.loc[indices])

print(df.loc[['a', 'c']])  # same as above

          A         B
a -0.043833  1.409555
c -0.011795  2.834356
          A         B
a -0.043833  1.409555
c -0.011795  2.834356

Slices work too. With labels, the end of the slice is included.



In [41]:

    
print(df['a':'c'])

          A         B
a -0.043833  1.409555
b  0.668582 -1.233178
c -0.011795  2.834356

Integer positions can also be used in a slice. In that case the end of the slice is excluded.



In [42]:

    
print(df[0:2])

          A         B
a -0.043833  1.409555
b  0.668582 -1.233178
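
The two slice forms can be written more explicitly with loc and iloc, which makes it clear whether labels or integer positions are meant. A minimal sketch; the values in the DataFrame are made up for illustration and replace the random numbers used above.

```python
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
df = pd.DataFrame({'A': [10, 11, 12, 13], 'B': [20, 21, 22, 23]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']    # label slice: the end label 'c' is included
by_position = df.iloc[0:2]    # position slice: the end position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```

Using loc and iloc avoids relying on how plain df[...] interprets a slice, which matters when the row labels are themselves integers.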


