This notebook seeks to illustrate simple datascience.Table operations as part of a basic lesson on probability.
Documentation on the datascience module is at http://data8.org/datascience/index.html and on Tables at http://data8.org/datascience/tables.html.
In [100]:
# HIDDEN - generic nonsense for setting up environment
from datascience import *
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
from ipywidgets import interact
# datascience version number of last run of this notebook
version.__version__
Out[100]:
Here we create a single-column table as a computational model of a die, with each element of the table containing the number of dots on a side. This illustrates the simplest way of constructing a table, Table.with_column.
Then we define a function that models rolling a die. This illustrates the use of Table.sample to take a random sample of a table.
In [62]:
die = Table().with_column('side', [1,2,3,4,5,6])
die
Out[62]:
In [52]:
# Simulate the roll of a die by sampling from the die table
def roll_die():
    return die.sample(1)['side'][0]
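Under the hood, roll_die samples one row from the die table. A numpy-only sketch of the same idea (the name roll_die_np is hypothetical, and this is not the datascience API, just an equivalent in spirit):

```python
import numpy as np

def roll_die_np(rng=None):
    """Simulate one roll of a fair six-sided die (numpy-only sketch,
    analogous to sampling one row from the `die` table)."""
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(np.arange(1, 7)))
```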
In [63]:
# roll it. Try this over and over and see what you get
roll_die()
Out[63]:
Build a computational model of rolling a die many times using our roll_die
function as a building block. It happens to utilize tables internally, but we have abstracted away from that. Here it is a black box that yields a random roll of a die. Again, we create a table to model the result.
In [64]:
# Simulate rolling it many times, creating a table that records the rolls
num_rolls = 600
rolls = Table().with_column('roll', [roll_die() for i in range(num_rolls)])
rolls
Out[64]:
Above we see just the tip of the table. And, of course, it would be tedious to look at all those rolls. Instead, we want to look at some descriptive statistics of the process. We can do that with Table.hist, which can be used to produce a histogram of counts or a discrete distribution (the default, i.e., normed=True).
The histogram of the rolls shows what we mean by 'uniform at random': all sides are equally likely to come up on each roll. Thus the number of times each side comes up in a large number of rolls is nearly constant. But not quite. The rolls table itself won't change on its own, but every time you run the cell above, you will get a slightly different picture.
In [66]:
bins = np.arange(1,8)
rolls.hist(bins=bins, normed=False)
In [67]:
# Normalizing gives a distribution: the probability of each side appearing, roughly 1/6.
rolls.hist(normed=True,bins=bins)
While visualization is useful for humans in the data exploration process, everything you see you should be able to compute upon. The analog of Table.hist that yields a table, rather than a chart, is Table.bin. It returns a new table with a row for each bin.
Here we also illustrate some computing on the distribution table:
<object> [ <key> ] yields the column as a numpy array. Part of the beauty of tables is that you don't have to worry about what that is. The beauty of numpy arrays is that you can work with them pretty much like values, i.e., you can scale them by a constant, add them together, and things like that.
<object> [ <key> ] = <value> modifies the table, adding a column if it does not exist or updating it if it does. The transformations on tables are functional - they produce new tables - whereas assignment treats a table like an object and modifies it.
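As a sketch of what Table.bin computes, the same counts and normalized densities can be obtained directly with np.histogram, and the resulting arrays support the elementwise arithmetic described above. This is a numpy-only illustration, not the datascience API, using freshly simulated rolls rather than the notebook's rolls table:

```python
import numpy as np

# 600 simulated rolls of a fair die (stand-in for the `rolls` table)
rolls_arr = np.random.default_rng(42).integers(1, 7, size=600)
bins = np.arange(1, 8)

counts, edges = np.histogram(rolls_arr, bins=bins)  # raw counts per side
density = counts / counts.sum()                     # normalized: sums to 1

# Arrays behave like values: subtract and divide elementwise.
variation = (density - 1/6) / (1/6)
```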
In [70]:
roll_dist = rolls.bin(normed=True,bins=bins).take(range(6))
roll_dist
Out[70]:
In [71]:
roll_dist['roll density']
Out[71]:
In [72]:
roll_dist['Variation'] = (roll_dist['roll density'] - 1/6)/(1/6)
roll_dist
Out[72]:
In [74]:
# What is the average value of a roll?
sum(roll_dist['bin']*roll_dist['roll density'])
Out[74]:
In [75]:
np.mean(rolls['roll'])
Out[75]:
In [94]:
# Life is about rolling lots of dice.
# Simulate rolling n dice.
def roll(n):
    """Roll n dice. Return a table of the rolls."""
    return die.sample(n, with_replacement=True)
In [96]:
# Try it out, many times.
roll(10)
Out[96]:
The central concept of computational thinking - abstraction - is illustrated here again by wrapping up the process of rolling many dice and visualizing the resulting distribution into a function.
Once we have it as a function, we can illustrate a central concept of inferential thinking - the law of large numbers - through interactive visualization. When a die is rolled only a few times, the resulting distribution may be very uneven. But when it is rolled many, many times, it is extremely rare for the result to be uneven.
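The law of large numbers can also be checked numerically rather than visually. A numpy-only sketch (the helper max_deviation is hypothetical): the largest gap between any side's empirical frequency and the true probability 1/6 shrinks as the number of rolls grows.

```python
import numpy as np

def max_deviation(n, seed=0):
    """Roll a fair die n times; return the largest absolute deviation
    of any side's empirical frequency from the true probability 1/6."""
    rng = np.random.default_rng(seed)
    rolls = rng.integers(1, 7, size=n)
    freqs = np.bincount(rolls, minlength=7)[1:] / n
    return np.max(np.abs(freqs - 1/6))

small_n_dev = max_deviation(60)      # few rolls: frequencies are uneven
large_n_dev = max_deviation(60_000)  # many rolls: frequencies hug 1/6
```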
In [97]:
def show_die_dist(n):
    """Roll a die n times and show the distribution of sides that appear."""
    roll(n).hist(bins=np.arange(1,8))
In [101]:
# We can now use the ipywidget we had included at the beginning.
interact(show_die_dist, n=(10, 1000, 10))
Out[101]:
In [102]:
num_die = 10
In [103]:
num_rolls = 100
In [104]:
# Remember - referencing a column gives an array
roll(num_die)['side']
Out[104]:
In [105]:
# Simulate rolling num_die dice num_rolls times and build a table of the result
rolls = Table(["die_"+str(i) for i in range(num_die)]).with_rows([roll(num_die)['side'] for i in range(num_rolls)])
rolls
Out[105]:
In [106]:
# If we think of each row as a life experience, what is that life like?
label = "{}_dice".format(num_die)
sum_rolls = Table().with_column(label, [np.sum(roll(num_die)['side']) for i in range(num_rolls)])
sum_rolls.hist(range=[10,6*num_die], normed=False)
sum_rolls.stats()
Out[106]:
In [111]:
# Or as a distribution
sum_rolls.hist(range=[10,6*num_die],normed=True)
In [118]:
# Or normalize by the number of dice ...
Table().with_column(label, [np.sum(roll(num_die)['side'])/num_die for i in range(num_rolls)]).hist(normed=False)
In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
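A quick numpy check of the CLT's scaling for dice (a hedged sketch, not part of the notebook's code): the standard deviation of the mean of n rolls should be approximately sigma / sqrt(n), where sigma ≈ 1.708 is the standard deviation of a single fair die.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dice, n_trials = 100, 10_000

# Mean of n_dice rolls, repeated n_trials times.
means = rng.integers(1, 7, size=(n_trials, n_dice)).mean(axis=1)

single_die_sd = np.sqrt(np.mean((np.arange(1, 7) - 3.5) ** 2))  # ~1.708
predicted_sd = single_die_sd / np.sqrt(n_dice)                  # CLT scaling
observed_sd = means.std()
```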
In [ ]: