Tables: an illustration of working with computational models of probability

David Culler

This notebook seeks to illustrate simple datascience.Table operations as part of a basic lesson on probability.

Documentation on the datascience module is at http://data8.org/datascience/index.html and on Tables at http://data8.org/datascience/tables.html.


In [100]:
# HIDDEN - generic nonsense for setting up environment
from datascience import *
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
from ipywidgets import interact
# datascience version number of last run of this notebook
version.__version__


Out[100]:
'0.5.19'

Create a table as a model of a stochastic phenomenon

Here we create a single-column table as a computational model of a die, with each element of the table containing the number of dots on one side. This illustrates the simplest way of constructing a table, Table.with_column.

Then we define a function that models rolling a die. This illustrates the use of Table.sample to take a random sample of the rows of a table.


In [62]:
die = Table().with_column('side', [1,2,3,4,5,6])
die


Out[62]:
side
1
2
3
4
5
6

In [52]:
# Simulate the roll of a die by sampling from the die table
def roll_die():
    """Return the result of rolling a fair die once."""
    return die.sample(1)['side'][0]

In [63]:
# roll it.  Try this over and over and see what you get
roll_die()


Out[63]:
6

Composition

Build a computational model of rolling a die many times, using our roll_die function as a building block. It happens to use a table internally, but we have abstracted away from that; here it is a black box that yields one random roll of a die. Again, we create a table to record the results.


In [64]:
# Simulate rolling it many times, creating a table that records the rolls
num_rolls = 600
rolls = Table().with_column('roll', [roll_die() for i in range(num_rolls)])
rolls


Out[64]:
roll
3
3
1
3
1
3
1
4
3
6

... (590 rows omitted)

Visualization

Above we see just the tip of the table. And, of course, it would be tedious to look at all those rolls. Instead, we want to look at some descriptive statistics of the process. We can do that with Table.hist, which can be used to produce a histogram of counts or a discrete distribution (the default, i.e., normed=True).

The histogram of the rolls shows what we mean by 'uniform at random': all sides are equally likely to come up on each roll, so the number of times each comes up in a large number of rolls is nearly constant. But not quite. The rolls table itself won't change on its own, but every time you run the cell above, you will get a slightly different picture.


In [66]:
bins = np.arange(1,8)
rolls.hist(bins=bins, normed=False)



In [67]:
# Normalizing gives a distribution: the probability of each side appearing, roughly 1/6.
rolls.hist(normed=True,bins=bins)


Computing on distributions

While visualization is useful for humans in the data exploration process, everything you see you should also be able to compute upon. The analog of Table.hist that yields a table, rather than a chart, is Table.bin. It returns a new table with a row for each bin.

Here we also illustrate doing some computing on the distribution table:

  • A column of a table is accessed using the standard python get syntax: <object> [ <key> ]. This actually yields an object that is a numpy array, but part of the beauty of tables is that you don't have to worry about what that is. The beauty of numpy arrays is that you can work with them pretty much like values, i.e., you can scale them by a constant, add them together, and the like.
  • A column is inserted in the table using the standard python set syntax for objects: <object> [ <key> ] = <value>. Note that this modifies the table, adding the column if it does not exist or updating it if it does. The transformations on tables are functional: they produce new tables. Set, by contrast, treats a table like an object and modifies it in place.

In [70]:
roll_dist = rolls.bin(normed=True,bins=bins).take(range(6))
roll_dist


Out[70]:
bin roll density
1 0.175
2 0.153333
3 0.185
4 0.131667
5 0.148333
6 0.206667

In [71]:
roll_dist['roll density']


Out[71]:
array([ 0.175     ,  0.15333333,  0.185     ,  0.13166667,  0.14833333,
        0.20666667])

In [72]:
roll_dist['Variation'] = (roll_dist['roll density'] - 1/6)/(1/6)
roll_dist


Out[72]:
bin roll density Variation
1 0.175 0.05
2 0.153333 -0.08
3 0.185 0.11
4 0.131667 -0.21
5 0.148333 -0.11
6 0.206667 0.24

In [74]:
# What is the average value of a roll?
sum(roll_dist['bin']*roll_dist['roll density'])


Out[74]:
3.5449999999999999

In [75]:
np.mean(rolls['roll'])


Out[75]:
3.5449999999999999
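
Both computations give the same value, close to 3.5, which is the expected value of a fair die: (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. As a quick check, we can compute that expected value directly from the die table:


In [ ]:
# The exact expected value of a fair die: the average of its sides
np.mean(die['side'])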

Statistical thinking

They say "life is about rolling dice". The statistical perspective on the rolls table above would be captured by sampling many times from the die table. We can capture that naturally in a computational abstraction that rolls a die n times.


In [94]:
# Life is about rolling lots of dice.
# Simulate rolling n dice.
def roll(n):
    """Roll n die.  Return a table of the rolls"""
    return die.sample(n, with_replacement=True)

In [96]:
# Try it out, many times
roll(10)


Out[96]:
side
1
1
1
3
5
6
5
3
2
2

Interactive visualization

Abstraction is the central concept of computational thinking. Here it is illustrated again by wrapping up the process of rolling many dice and visualizing the resulting distribution into a function.

Once we have it as a function, we can illustrate a central concept of inferential thinking, the law of large numbers, through interactive visualization. When a die is rolled only a few times, the resulting distribution may be very uneven. But when it is rolled many, many times, it is extremely rare for the resulting distribution to be far from uniform.


In [97]:
def show_die_dist(n):
    """Roll a die n times and show the distribution of sides that appear."""
    roll(n).hist(bins=np.arange(1,8))

In [101]:
# We can now use the ipywidgets interact imported at the beginning.
interact(show_die_dist, n=(10, 1000, 10))


Out[101]:
<function __main__.show_die_dist>
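
The interact widget only renders in a live notebook. As a static sketch of the same idea, we can call show_die_dist directly for a few values of n and watch the distribution flatten toward uniform as n grows:


In [ ]:
# A static version of the interaction: the distribution is uneven for
# small n and nearly uniform for large n.
for n in [10, 100, 10000]:
    show_die_dist(n)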

Likelihood

If we really roll the dice several times in life, what might we expect the overall outcome to be like?

We can extend our computational approach further by simulating the rolling of several dice many, many times.


In [102]:
num_die = 10

In [103]:
num_rolls = 100

In [104]:
# Remember - referencing a column gives an array
roll(num_die)['side']


Out[104]:
array([6, 2, 2, 5, 2, 4, 5, 2, 5, 2])

In [105]:
# Simulate rolling num_die dice num_rolls times and build a table of the result
rolls = Table(["die_"+str(i) for i in range(num_die)]).with_rows(
    [roll(num_die)['side'] for i in range(num_rolls)])
rolls


Out[105]:
die_0 die_1 die_2 die_3 die_4 die_5 die_6 die_7 die_8 die_9
1 5 3 1 5 6 4 3 5 1
2 1 2 1 2 4 2 1 5 3
6 2 4 3 4 5 1 6 3 6
4 5 1 4 3 6 1 2 2 2
2 2 3 2 3 5 6 4 2 2
6 4 5 5 2 3 3 6 1 4
5 1 1 5 6 3 1 3 3 6
1 3 6 6 4 1 6 3 2 2
1 5 6 3 6 2 5 3 4 4
3 1 5 1 6 2 1 4 2 2

... (90 rows omitted)


In [106]:
# If we think of each row as a life experience, what is that life like?
label = "{}_dice".format(num_die)
sum_rolls = Table().with_column(label, [np.sum(roll(num_die)['side']) for i in range(num_rolls)])
sum_rolls.hist(range=[10,6*num_die], normed=False)
sum_rolls.stats()


Out[106]:
statistic 10_dice
min 24
max 48
median 36
sum 3576

In [111]:
# Or as a distribution
sum_rolls.hist(range=[10,6*num_die],normed=True)
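
The stats table above reports min, max, median, and sum, but not the mean or spread. As a quick sketch, we can compute them from the sums directly; the mean should be near 10 × 3.5 = 35:


In [ ]:
# Mean and standard deviation of the sums.  The mean should be near
# num_die * 3.5 = 35.
np.mean(sum_rolls[label]), np.std(sum_rolls[label])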



In [118]:
# Or normalize by the number of dice ...
#
Table().with_column(label, [np.sum(roll(num_die)['side'])/num_die for i in range(num_rolls)]).hist(normed=False)


In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
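
We can see the CLT at work with our dice. Here is a sketch, using the roll helper and the num_die and num_rolls values from above, that histograms the mean of num_die dice over num_rolls trials; the result is roughly bell-shaped around 3.5:


In [ ]:
# The distribution of the mean of num_die dice over num_rolls trials
# is approximately normal, centered near 3.5.
mean_rolls = Table().with_column(
    'mean_of_{}_dice'.format(num_die),
    [np.mean(roll(num_die)['side']) for i in range(num_rolls)])
mean_rolls.hist(normed=True)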

