This notebook seeks to convey a core concept of inferential thinking - given some set of observations about a small sample of a population, attempt to draw robust conclusions about the (unobservable) population.

Here we create a hypothetical population through simulation. It is based on the historical discussion in the data8 lecture about estimating the size of foreign bomber fleets from the observations of tail markings.


In [2]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')
# datascience version number of last run of this notebook
version.__version__


Out[2]:
'0.3.dev19'

In [3]:
# The magic number - size of the population that (in the real world) 
# we don't know and want to estimate

def createPopulation():
    def serNo(x):
        return "{:05d}".format(x)
    p = Table([np.arange(1,37*55)],["Ser No"])
    p.set_format("Ser No", serNo)
    return p

In [4]:
# Create a simulation of the population as a table - ordered collection of named columns
population = createPopulation()
population


Out[4]:
Ser No
00001
00002
00003
00004
00005
00006
00007
00008
00009
00010

... (2024 rows omitted)


In [5]:
# computational thinking - simulate observing a sample of the population
sample_size = 10

In [18]:
population.sample(sample_size,with_replacement=True)


Out[18]:
Ser No
00625
01907
01892
01070
01361
01492
01762
01444
00365
00656

In [19]:
# Simulate observing multiple samples
nsamples = 30

In [20]:
# use iteration to create a table of samples 
samples = Table()
for i in range(nsamples):
    name = "sample-"+str(i)
    a_sample = population.sample(sample_size,with_replacement=True)
    samples[name] = a_sample["Ser No"]
samples


Out[20]:
sample-0 sample-1 sample-2 sample-3 sample-4 sample-5 sample-6 sample-7 sample-8 sample-9 sample-10 sample-11 sample-12 sample-13 sample-14 sample-15 sample-16 sample-17 sample-18 sample-19 sample-20 sample-21 sample-22 sample-23 sample-24 sample-25 sample-26 sample-27 sample-28 sample-29
1551 1741 1321 314 921 674 1405 1715 2020 235 154 880 1006 434 1610 1251 1729 892 2033 173 1862 1671 604 1197 1668 231 1833 785 1850 1168
778 1124 741 568 1179 710 1351 146 14 1792 531 120 1216 43 1699 1277 1897 1205 1316 175 436 1786 203 1136 1289 911 1940 859 905 992
1226 1470 344 1538 230 954 195 402 1134 726 1465 939 73 1938 1769 576 22 1209 826 907 280 743 1552 10 1655 1449 1652 1075 1529 119
1193 204 1201 940 739 443 487 1839 1686 494 1923 1735 591 936 1285 1261 413 1403 186 1366 611 1008 1117 531 728 760 531 1515 670 1765
1317 1669 1061 709 534 559 140 189 1013 1930 428 1209 466 1220 1296 476 1057 283 1630 892 1699 1867 1993 1201 956 1210 1582 479 826 1903
1376 1756 1198 186 1159 1704 121 23 647 1506 1299 1879 760 1019 1208 1941 596 1263 505 211 1107 2026 704 701 150 1769 1512 1618 1939 1358
86 735 580 1698 2016 807 414 38 1079 1698 1191 292 1444 1356 645 877 1421 283 477 1744 316 1020 720 16 224 646 1987 1804 256 1990
622 34 605 1174 1432 880 1341 2031 576 2004 1289 945 377 895 567 716 1181 2014 1028 628 577 98 642 1610 440 961 1962 1066 1670 1383
316 1691 994 423 1190 1914 412 207 1722 386 1815 419 1118 894 1038 534 332 91 770 54 2002 6 739 1143 948 350 985 542 1050 557
709 2017 694 168 88 1727 1171 1472 1459 504 1510 463 1763 1611 2018 862 352 14 1490 253 517 412 1299 1271 151 1097 569 1988 1294 956

In [21]:
# gracefully transition between tables and arrays
samples['sample-0']


Out[21]:
array([1551,  778, 1226, 1193, 1317, 1376,   86,  622,  316,  709])

In [23]:
# define a function to capture formally a idea about how to do the estimation
def estimateA(smpl) :
    return np.max(smpl)

In [24]:
estimateA(samples['sample-2'])


Out[24]:
1321

In [26]:
# you might come up with lots of other estimators
def estimateB(smpl) :
    return 2*np.mean(smpl)

In [27]:
#verify it works
estimateA(samples["sample-0"])


Out[27]:
1551

In [29]:
# illustrate list comprehension to explore data
[estimateB(samples[s]) for s in samples]


Out[29]:
[1834.8,
 2488.1999999999998,
 1747.8,
 1543.5999999999999,
 1897.5999999999999,
 2074.4000000000001,
 1407.4000000000001,
 1612.4000000000001,
 2270.0,
 2255.0,
 2321.0,
 1776.2,
 1762.8,
 2069.1999999999998,
 2627.0,
 1954.2,
 1800.0,
 1731.4000000000001,
 2052.1999999999998,
 1280.5999999999999,
 1881.4000000000001,
 2127.4000000000001,
 1914.5999999999999,
 1763.2,
 1641.8,
 1876.8,
 2910.5999999999999,
 2346.1999999999998,
 2397.8000000000002,
 2438.1999999999998]

In [30]:
# Build a tables of estimates
estA = Table([[estimateA(samples[s]) for s in samples]],['ests'])
estA


Out[30]:
ests
1551
2017
1321
1698
2016
1914
1405
2031
2020
2004

... (20 rows omitted)


In [31]:
# Look at the behavior of this estimator as a histogram
estA.hist(range=(1,np.max(estA['ests'])),bins=20)



In [32]:
# Computational thinking: estimator as a higher order function 
# passed in to a function that creates a table of estimate
def estimate(estimator):
    return Table([[estimator(samples[s]) for s in samples]],['ests'])

In [33]:
estB = estimate(estimateB)

In [34]:
estB.hist(range=(1,np.max(estB['ests'])),bins=20)



In [35]:
comp = Table([estA['ests'],estB['ests']],['estA','estB'])

In [36]:
comp


Out[36]:
estA estB
1551 1834.8
2017 2488.2
1321 1747.8
1698 1543.6
2016 1897.6
1914 2074.4
1405 1407.4
2031 1612.4
2020 2270
2004 2255

... (20 rows omitted)


In [37]:
comp.hist(overlay=True, bins=np.arange(1000,2500,50))



In [38]:
# How does these estimates compare with the true size of the population?
population.num_rows


Out[38]:
2034

In [39]:
# Produce a table containing the data associated with a histogram
ebins = comp.bin(bins=np.arange(1000,2500,50))
ebins.show()


bin estA count estB count
1000 0 0
1050 0 0
1100 0 0
1150 0 0
1200 0 0
1250 0 1
1300 1 0
1350 0 0
1400 1 1
1450 0 0
1500 0 1
1550 1 0
1600 1 2
1650 2 0
1700 1 2
1750 2 3
1800 0 2
1850 2 3
1900 5 1
1950 4 1
2000 10 0
2050 0 3
2100 0 1
2150 0 0
2200 0 0
2250 0 2
2300 0 2
2350 0 1
2400 0 1
2450 0 0

In [ ]: