This notebook illustrates the use of tables in conveying the combination of inferential and computational thinking through studying concepts of probability theory. It the Birthday Surprise as a running example, based on a lecture by Ani Adhikari in http://data8.org/


In [1]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np
# datascience version number of last run of this notebook
version.__version__


Out[1]:
'0.3.dev19'

Inferential thinking. I sit down next to a random person in class, what is the chance that I have the same birthday as that person?


In [2]:
# what is the chance that we have different birthdays?
364/365


Out[2]:
0.9972602739726028

In [3]:
# what is the chance that we have the same birdthday?
1 - 364/365


Out[3]:
0.002739726027397249

In [4]:
# what is the chance that my birthday is different from both my neighbors?
(364/365) * (363/365)


Out[4]:
0.9917958341152187

In [5]:
# the same as one of them?
1 - (364/365) * (363/365)


Out[5]:
0.008204165884781345

In [6]:
# the same as one of my four neighbors?
1 - (364/365) * (363/365) * (362/365) * (361/365)


Out[6]:
0.02713557369979347

Wow, nearly 3%?

Computational thinking? For a group of N people, what is the probability that at least two of them have the same birthday? An algorithmic question. Given K people with different birthdays, what is the probability that the next random person will have a different birthday than any of them? If we can answer this single step, we have a way to answer the general question? Yes, it is like induction. But it is constructive too. That's an algorithm.

Let's create a little table, kind of like what might think of as a spreadsheet, to collect all of our computational and inferential thoughts about solving this problem.


In [7]:
bday = Table()

In [8]:
# lets start with numbering the days in the year (we'll start with zero).  We don't care if they are
# in calendar order, backwars, sideways or the ways that we encounter them in meeting random people
bday["day"] = range(365)

In [9]:
bday


Out[9]:
day
0
1
2
3
4
5
6
7
8
9

... (355 rows omitted)

Cool, we have a column of the days, but its pretty long and we can see them all on the page, but a graph is a great way to summarize things, so let's look at what we have.


In [10]:
bday.plot()


Yep, as easy as pi. The days in a year. But we really need those diminishing fractions of the days left in the year. OK, that's easy, lets build a new column of the days that are left after we've seen some. And let's look at what we've got


In [11]:
bday['left']= 364-bday['day']
bday


Out[11]:
day left
0 364
1 363
2 362
3 361
4 360
5 359
6 358
7 357
8 356
9 355

... (355 rows omitted)


In [12]:
#of course, we could plot these
bday.plot()



In [13]:
# or on top of each other
bday.plot(overlay=True)


After we have k people with different birthdays, we have 365-k possible days left. But what we really want is the fraction of days in the year left. Obviously, divide by the number of days in the year. Let's do that and see where we are.


In [14]:
bday['frac']=bday['left']/365
bday


Out[14]:
day left frac
0 364 0.99726
1 363 0.994521
2 362 0.991781
3 361 0.989041
4 360 0.986301
5 359 0.983562
6 358 0.980822
7 357 0.978082
8 356 0.975342
9 355 0.972603

... (355 rows omitted)

OK, that looks like its going from 1 to 0 just as we'd expect. And we can see how things are working together, just as we might in a spread sheet. We can focus on the data that comes out of the computation. In a spreadsheet this would be all spread around in the cells. Here the computation is clearly laid out and we can see how it progresses from one step to the next by building up the table.

We might want to select just 'fraction left' to look at.


In [15]:
bday.select('frac').plot()


Ah, but remember the inferential part. Given k people, what is the probability they all have different birthdays. That's the product of these diminishing fractions.


In [16]:
# each column in our table is really a sequence of values
bday["frac"]


Out[16]:
array([ 0.99726027,  0.99452055,  0.99178082,  0.9890411 ,  0.98630137,
        0.98356164,  0.98082192,  0.97808219,  0.97534247,  0.97260274,
        0.96986301,  0.96712329,  0.96438356,  0.96164384,  0.95890411,
        0.95616438,  0.95342466,  0.95068493,  0.94794521,  0.94520548,
        0.94246575,  0.93972603,  0.9369863 ,  0.93424658,  0.93150685,
        0.92876712,  0.9260274 ,  0.92328767,  0.92054795,  0.91780822,
        0.91506849,  0.91232877,  0.90958904,  0.90684932,  0.90410959,
        0.90136986,  0.89863014,  0.89589041,  0.89315068,  0.89041096,
        0.88767123,  0.88493151,  0.88219178,  0.87945205,  0.87671233,
        0.8739726 ,  0.87123288,  0.86849315,  0.86575342,  0.8630137 ,
        0.86027397,  0.85753425,  0.85479452,  0.85205479,  0.84931507,
        0.84657534,  0.84383562,  0.84109589,  0.83835616,  0.83561644,
        0.83287671,  0.83013699,  0.82739726,  0.82465753,  0.82191781,
        0.81917808,  0.81643836,  0.81369863,  0.8109589 ,  0.80821918,
        0.80547945,  0.80273973,  0.8       ,  0.79726027,  0.79452055,
        0.79178082,  0.7890411 ,  0.78630137,  0.78356164,  0.78082192,
        0.77808219,  0.77534247,  0.77260274,  0.76986301,  0.76712329,
        0.76438356,  0.76164384,  0.75890411,  0.75616438,  0.75342466,
        0.75068493,  0.74794521,  0.74520548,  0.74246575,  0.73972603,
        0.7369863 ,  0.73424658,  0.73150685,  0.72876712,  0.7260274 ,
        0.72328767,  0.72054795,  0.71780822,  0.71506849,  0.71232877,
        0.70958904,  0.70684932,  0.70410959,  0.70136986,  0.69863014,
        0.69589041,  0.69315068,  0.69041096,  0.68767123,  0.68493151,
        0.68219178,  0.67945205,  0.67671233,  0.6739726 ,  0.67123288,
        0.66849315,  0.66575342,  0.6630137 ,  0.66027397,  0.65753425,
        0.65479452,  0.65205479,  0.64931507,  0.64657534,  0.64383562,
        0.64109589,  0.63835616,  0.63561644,  0.63287671,  0.63013699,
        0.62739726,  0.62465753,  0.62191781,  0.61917808,  0.61643836,
        0.61369863,  0.6109589 ,  0.60821918,  0.60547945,  0.60273973,
        0.6       ,  0.59726027,  0.59452055,  0.59178082,  0.5890411 ,
        0.58630137,  0.58356164,  0.58082192,  0.57808219,  0.57534247,
        0.57260274,  0.56986301,  0.56712329,  0.56438356,  0.56164384,
        0.55890411,  0.55616438,  0.55342466,  0.55068493,  0.54794521,
        0.54520548,  0.54246575,  0.53972603,  0.5369863 ,  0.53424658,
        0.53150685,  0.52876712,  0.5260274 ,  0.52328767,  0.52054795,
        0.51780822,  0.51506849,  0.51232877,  0.50958904,  0.50684932,
        0.50410959,  0.50136986,  0.49863014,  0.49589041,  0.49315068,
        0.49041096,  0.48767123,  0.48493151,  0.48219178,  0.47945205,
        0.47671233,  0.4739726 ,  0.47123288,  0.46849315,  0.46575342,
        0.4630137 ,  0.46027397,  0.45753425,  0.45479452,  0.45205479,
        0.44931507,  0.44657534,  0.44383562,  0.44109589,  0.43835616,
        0.43561644,  0.43287671,  0.43013699,  0.42739726,  0.42465753,
        0.42191781,  0.41917808,  0.41643836,  0.41369863,  0.4109589 ,
        0.40821918,  0.40547945,  0.40273973,  0.4       ,  0.39726027,
        0.39452055,  0.39178082,  0.3890411 ,  0.38630137,  0.38356164,
        0.38082192,  0.37808219,  0.37534247,  0.37260274,  0.36986301,
        0.36712329,  0.36438356,  0.36164384,  0.35890411,  0.35616438,
        0.35342466,  0.35068493,  0.34794521,  0.34520548,  0.34246575,
        0.33972603,  0.3369863 ,  0.33424658,  0.33150685,  0.32876712,
        0.3260274 ,  0.32328767,  0.32054795,  0.31780822,  0.31506849,
        0.31232877,  0.30958904,  0.30684932,  0.30410959,  0.30136986,
        0.29863014,  0.29589041,  0.29315068,  0.29041096,  0.28767123,
        0.28493151,  0.28219178,  0.27945205,  0.27671233,  0.2739726 ,
        0.27123288,  0.26849315,  0.26575342,  0.2630137 ,  0.26027397,
        0.25753425,  0.25479452,  0.25205479,  0.24931507,  0.24657534,
        0.24383562,  0.24109589,  0.23835616,  0.23561644,  0.23287671,
        0.23013699,  0.22739726,  0.22465753,  0.22191781,  0.21917808,
        0.21643836,  0.21369863,  0.2109589 ,  0.20821918,  0.20547945,
        0.20273973,  0.2       ,  0.19726027,  0.19452055,  0.19178082,
        0.1890411 ,  0.18630137,  0.18356164,  0.18082192,  0.17808219,
        0.17534247,  0.17260274,  0.16986301,  0.16712329,  0.16438356,
        0.16164384,  0.15890411,  0.15616438,  0.15342466,  0.15068493,
        0.14794521,  0.14520548,  0.14246575,  0.13972603,  0.1369863 ,
        0.13424658,  0.13150685,  0.12876712,  0.1260274 ,  0.12328767,
        0.12054795,  0.11780822,  0.11506849,  0.11232877,  0.10958904,
        0.10684932,  0.10410959,  0.10136986,  0.09863014,  0.09589041,
        0.09315068,  0.09041096,  0.08767123,  0.08493151,  0.08219178,
        0.07945205,  0.07671233,  0.0739726 ,  0.07123288,  0.06849315,
        0.06575342,  0.0630137 ,  0.06027397,  0.05753425,  0.05479452,
        0.05205479,  0.04931507,  0.04657534,  0.04383562,  0.04109589,
        0.03835616,  0.03561644,  0.03287671,  0.03013699,  0.02739726,
        0.02465753,  0.02191781,  0.01917808,  0.01643836,  0.01369863,
        0.0109589 ,  0.00821918,  0.00547945,  0.00273973,  0.        ])

Phew that's a lot of numbers... tables gives us a little peek, but we can always look into it for more. Sure beats scrolling through 365 rows in excel!

So we need something to take the running product of a bunch of numbers. These things you'll learn to just build. But lots of folks built useful ones already. That's a beautiful thing about computing - you can naturally build on the work of others. Here we'll use the 'cumulative product' tool from the 'numpy' library. Don't worry you'll see that later. The important thing it that it does what we did for 2, 3 or 4 neighbors - but for all of them.


In [17]:
bday["different"] = np.cumprod(bday["frac"])
bday


Out[17]:
day left frac different
0 364 0.99726 0.99726
1 363 0.994521 0.991796
2 362 0.991781 0.983644
3 361 0.989041 0.972864
4 360 0.986301 0.959538
5 359 0.983562 0.943764
6 358 0.980822 0.925665
7 357 0.978082 0.905376
8 356 0.975342 0.883052
9 355 0.972603 0.858859

... (355 rows omitted)


In [18]:
# finally the probably that at least two people have the same birthday
bday['some same bday'] = 1-bday['different']
bday


Out[18]:
day left frac different some same bday
0 364 0.99726 0.99726 0.00273973
1 363 0.994521 0.991796 0.00820417
2 362 0.991781 0.983644 0.0163559
3 361 0.989041 0.972864 0.0271356
4 360 0.986301 0.959538 0.0404625
5 359 0.983562 0.943764 0.0562357
6 358 0.980822 0.925665 0.0743353
7 357 0.978082 0.905376 0.0946238
8 356 0.975342 0.883052 0.116948
9 355 0.972603 0.858859 0.141141

... (355 rows omitted)


In [19]:
# so 14% with just ten people.  How about the whole story
# Table.select produces a table containing only the selected columns
bday.select('some same bday').plot()



In [20]:
# wow, let's look at the start of that
# Table.take produces a table containing the rows taken from a Table.
bday.take(range(50))


Out[20]:
day left frac different some same bday
0 364 0.99726 0.99726 0.00273973
1 363 0.994521 0.991796 0.00820417
2 362 0.991781 0.983644 0.0163559
3 361 0.989041 0.972864 0.0271356
4 360 0.986301 0.959538 0.0404625
5 359 0.983562 0.943764 0.0562357
6 358 0.980822 0.925665 0.0743353
7 357 0.978082 0.905376 0.0946238
8 356 0.975342 0.883052 0.116948
9 355 0.972603 0.858859 0.141141

... (40 rows omitted)


In [21]:
# Table methods generally produce new tables so they compose naturally.
# Here to convey the essence of the Birthday Surprise
bday.take(range(50)).select('some same bday').plot()



In [22]:
# Since indexing a column by its name gives an array, it can be indexed.
bday['some same bday'][20]


Out[22]:
0.4756953076625503

Now that we understanding this by building it up step by step, could we put it all into one place that we might call a program for answering this question? Sure.


In [23]:
# altogether now like a real program
bd = Table()
bd["day"] = range(365)
bd['left']= 364-bd['day']
bd['frac']=bd['left']/365
bd["different"] = np.cumprod(bd["frac"])
bd['some same bday'] = 1-bd['different']
bd.select('some same bday').take(range(50)).plot()



In [24]:
bd


Out[24]:
day left frac different some same bday
0 364 0.99726 0.99726 0.00273973
1 363 0.994521 0.991796 0.00820417
2 362 0.991781 0.983644 0.0163559
3 361 0.989041 0.972864 0.0271356
4 360 0.986301 0.959538 0.0404625
5 359 0.983562 0.943764 0.0562357
6 358 0.980822 0.925665 0.0743353
7 357 0.978082 0.905376 0.0946238
8 356 0.975342 0.883052 0.116948
9 355 0.972603 0.858859 0.141141

... (355 rows omitted)