Welcome to General Assembly's Data Science course.
| Instructor: | Alessandro Gagliardi |
| | [ADFGagliardi+GA@Gmail.com](mailto:adfgagliardi+ga@gmail.com) |
| TA: | Kevin Perko |
| | [KevinJPerko+GA@Gmail.com](mailto:kevinjperko+ga@gmail.com) |
| Classes: | 6:00pm-9:00pm, Mondays and Wednesdays |
| | January 20 – April 7, 2014 (no class February 17) |
| Office Hours: | 9:00pm-10:00pm Wednesdays after class |
| | or by appointment |
The class will meet every Monday and Wednesday until April 7, except for February 17, which is Presidents' Day.
Kevin, your TA, will hold office hours immediately following class on Wednesdays, and either of us will be available by appointment.
But first...
Since today's my birthday, I thought I might have us warm up our brains with...
(credit to Balthazar Rouberol for preparing what follows)
Given a sample of n people, we would like to calculate the probability p that at least two people in the group share a birthday.
First: how many know this paradox? Keep the answer to yourselves.
The rest of you: how big do you think this class would have to be in order for there to be >50% chance that two people have the same birthday?
Alternatively, what are the chances that two people in this class (including the TA and me) have the same birthday?
P(A) is the probability of at least two people sharing the same birthday. P(A') is the probability that all birthdays are different.
\begin{equation} P(A') = 1 - P(A) \end{equation}
From Wikipedia:
P(A') could be calculated as P(1) × P(2) × P(3) × ... × P(20).
The 20 independent events correspond to the 20 people, and can be defined in order. Each event can be defined as the corresponding person not sharing his/her birthday with any of the previously analyzed people. For Event 1, there are no previously analyzed people. Therefore, the probability, P(1), that person number 1 does not share his/her birthday with previously analyzed people is 1, or 100%. Ignoring leap years for this analysis, the probability of 1 can also be written as 365/365, for reasons that will become clear below.
For Event 2, the only previously analyzed person is Person 1. Assuming that birthdays are equally likely to happen on each of the 365 days of the year, the probability, P(2), that Person 2 has a different birthday than Person 1 is 364/365. This is because, if Person 2 was born on any of the other 364 days of the year, Persons 1 and 2 will not share the same birthday.
Similarly, if Person 3 is born on any of the 363 days of the year other than the birthdays of Persons 1 and 2, Person 3 will not share their birthday. This makes the probability P(3) = 363/365.
This analysis continues until Person 20 is reached, whose probability of not sharing his/her birthday with people analyzed before, P(20), is 346/365.
P(A') is equal to the product of these individual probabilities:
(1) P(A') = 365/365 × 364/365 × 363/365 × 362/365 × ... × 346/365
The terms of equation (1) can be collected to arrive at:
(2) P(A') = (1/365)^20 × (365 × 364 × 363 × ... × 346)
= 0.589
(3) P(A) = 1 - P(A') = 0.411 = 41.1%
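The same argument carries over to any group size n up to 365. As a sketch of the generalization (n is the group size from the problem statement; the index k is introduced here only for illustration):

\begin{equation} P(A') = \prod_{k=0}^{n-1} \frac{365 - k}{365} = \frac{365!}{365^{n} \, (365 - n)!} \end{equation}

Setting n = 20 recovers equation (2).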
In [1]:
from __future__ import division
import math

def pn_dash(n):
    """Return the probability that no two people in a group of n share a birthday."""
    return math.factorial(365) / (365**n * math.factorial(365 - n))

def pn(n):
    """Return the probability that at least two people in a group of n share a birthday."""
    return 1 - pn_dash(n)
In [2]:
# Let's calculate it for 20 people
print('{:0.2f}%'.format(pn(20) * 100))
In [3]:
nb_people = range(0, 85, 5)
p_birthday = [pn(n) for n in nb_people]
for n, p in zip(nb_people, p_birthday):
    print('n = {:2} -> p = {:.2f}%'.format(n, p * 100))
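As a cross-check on the factorial formula, here is a minimal sketch that multiplies the per-person terms (365 - k)/365 directly, as in equation (1). The names `p_shared` and `smallest_group` are illustrative and not part of the original notebook; the sketch assumes 365 equally likely birthdays.

```python
def p_shared(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_different = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken
        p_all_different *= (days - k) / float(days)
    return 1.0 - p_all_different


def smallest_group(threshold=0.5, days=365):
    """Smallest group size whose collision probability exceeds threshold."""
    n = 0
    while p_shared(n, days) <= threshold:
        n += 1
    return n


print(round(p_shared(20), 3))  # 0.411
print(smallest_group())        # 23
```

Multiplying the terms one at a time avoids computing 365! and makes the link to the derivation explicit.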
In [4]:
%pylab inline
In [5]:
# Main plot layout
f, ax = plt.subplots()
ax.set_yticks(np.arange(0, 1.1, 0.1))
plt.ylabel('probability')
f.text(x=0.5, y=0.975, s='Probability distribution of birthday collision for a sample of n people', horizontalalignment='center', verticalalignment='top')
# Probability of a collision, p(n), and of its complement
plt.plot(nb_people, p_birthday, label='$p(n)$', color='r')
plt.plot(nb_people, [pn_dash(n) for n in nb_people], label=r'$p(\overline{n})$', color='b')
# First group size at which the collision probability rounds to 50% or 51%
n_p50, p50 = [(n, pn(n)) for n in range(0, 100) if round(pn(n), 2) in [0.5, 0.51]][0]
# Dashed guide lines marking that point
plt.axhline(y=p50, xmax=n_p50/80., linestyle='--', color='grey')
plt.axvline(x=n_p50, ymax=p50, linestyle='--', color="grey")
plt.legend(loc='center right')
plt.text(n_p50 - 1, -0.055, str(n_p50))
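Out[5]:
[Figure: p(n) and its complement plotted against the number of people n, with dashed grey lines marking the 50% crossing at n = 23]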
List examples:
From a Taxonomy of Data Science (by Dataists)
A. Obtain
B. Scrub
C. Explore
D. Model
E. Interpret
A. Collect data around user retention, user actions within the product, potentially find data outside of company
B. Extract aggregated values from raw data
C. Examine data to find common distributions and correlations
D. Extract new meaning to predict if user would purchase again
E. Share results (and probably also go back to the drawing board)
At the completion of this course, you will be able to:
The best option is one which includes code highlighting
The most basic data structure is the None type. This is the equivalent of NULL in other languages.
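A minimal sketch of how `None` typically shows up in practice; the function name `log_message` is hypothetical and used only for illustration.

```python
nothing = None


def log_message(msg):
    """A function with no return statement implicitly returns None."""
    pass


result = log_message('hello')

# Identity comparison ("is") is the idiomatic way to test for None
print(result is None)          # True
print(type(nothing).__name__)  # NoneType
```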
There are four numeric types: int, float, bool, complex.
In [6]:
type(1)
Out[6]:
int
In [7]:
type(2.5)
Out[7]:
float
In [8]:
type(True)
Out[8]:
bool
In [9]:
type(2+3j)
Out[9]:
complex
The next basic data type is the Python list.
A list is an ordered collection of elements, and these elements can be of arbitrary type. Lists are mutable, meaning they can be changed in-place.
In [10]:
k = [1, 'b', True]
In [11]:
k[2]
Out[11]:
True
In [12]:
k[1] = 'a'
In [13]:
k
Out[13]:
[1, 'a', True]
Tuples are likewise ordered collections of arbitrary elements, but they are immutable: assigning to an element raises a TypeError.
In [14]:
x = (1, 'a', 2.5)
In [15]:
x
Out[15]:
(1, 'a', 2.5)
In [16]:
x[0]
Out[16]:
1
In [17]:
x[0] = 'b'
TypeError: 'tuple' object does not support item assignment
The string type in Python represents an immutable ordered array of characters (note there is no char type).
Strings support slicing and indexing operations like arrays, and have many other string-specific functions as well.
String processing is one area where Python excels.
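A short sketch of indexing, slicing, and a few string methods. The variable names are illustrative and not from the original notebook.

```python
course = 'Data Science'

first_char = course[0]    # indexing returns a one-character string
first_word = course[:4]   # slicing returns a new string
last_word = course[-7:]   # negative indices count from the end

# A few string-specific methods
words = course.lower().split()
slug = '-'.join(words)

# Strings are immutable, so item assignment raises TypeError
try:
    course[0] = 'd'
    mutated = True
except TypeError:
    mutated = False

print(first_char)   # D
print(first_word)   # Data
print(last_word)    # Science
print(slug)         # data-science
print(mutated)      # False
```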
Associative arrays (or hash tables) are implemented in Python as the dictionary type.
In [18]:
this_class = {'subject': 'Data Science', 'instructor': 'Alessandro', 'time': 1800, 'is_cool': True}
In [19]:
this_class['subject']
Out[19]:
'Data Science'
In [20]:
this_class['is_cool']
Out[20]:
True
Dictionaries are unordered collections of key-value pairs, and dictionary keys must be hashable, which in practice means immutable types such as strings, numbers, and tuples.
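To illustrate the restriction on keys, here is a small sketch using made-up data (the `office_hours` dictionary and its contents are hypothetical):

```python
# Hypothetical data, for illustration only
office_hours = {}

# A tuple is immutable, so it can serve as a key
office_hours[('Wednesday', 2100)] = 'TA'

# A list is mutable (unhashable), so using one as a key raises TypeError
try:
    office_hours[['Wednesday', 2100]] = 'TA'
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

has_key = ('Wednesday', 2100) in office_hours
# .get returns None for a missing key instead of raising KeyError
missing = office_hours.get(('Monday', 2100))

print(has_key)           # True
print(list_key_allowed)  # False
print(missing)           # None
```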
Another basic Python data type is the set. Sets are unordered mutable collections of distinct elements.
In [21]:
y = set([1, 1, 2, 3, 5, 8])
In [22]:
y
Out[22]:
Sets are particularly useful for checking membership of an element and for ensuring element uniqueness.
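Both uses can be sketched in a few lines. The `names` list is hypothetical sample data.

```python
y = set([1, 1, 2, 3, 5, 8])

# Membership tests
print(5 in y)   # True
print(4 in y)   # False

# Duplicates are dropped on construction
print(len(y))   # 5

# De-duplicating a list (hypothetical sample data); sorted() gives a stable order
names = ['ann', 'bob', 'ann', 'cy', 'bob']
unique_names = sorted(set(names))
print(unique_names)   # ['ann', 'bob', 'cy']
```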