Data Science

Introduction and Overview

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Welcome!

Welcome to General Assembly's Data Science course.

Instructor: Alessandro Gagliardi
[ADFGagliardi+GA@Gmail.com](mailto:adfgagliardi+ga@gmail.com)
TA: Kevin Perko
[KevinJPerko+GA@Gmail.com](mailto:kevinjperko+ga@gmail.com)
Classes: 6:00pm-9:00pm, Mondays and Wednesdays
January 20 – April 7, 2014 (no class February 17)
Office Hours: 9:00pm-10:00pm Wednesdays after class
or by appointment

The class will meet every Monday and Wednesday until April 7, except for February 17, which is Presidents' Day.

Kevin, your TA, will hold office hours immediately following class on Wednesdays, and either of us will be available by appointment.

Who am I?

  • 1997 - 2001 - Studied Computer Science at UC Santa Cruz
  • 2001 - 2002 - Developed web-based educational CRM at TMP.Worldwide in New York
  • 2002 - 2003 - Took some time off
  • 2003 - 2005 - Worked as an independent consultant for startups in New York
  • 2005 - 2010 - Studied Integrative Neuroscience at Rutgers
  • 2010 - 2011 - Taught Psychology and Neuroscience at USF, NDNU, CIIS
  • 2011 - 2014 - Returned to industry as a Data Scientist at Socialize (R.I.P.), Path, and Glassdoor

Who are you?

  • Your name
  • Where you work and what you do there
  • What you hope to get out of this course (in 1 sentence, please)
Agenda

  1. Lecture
    A. What is Data Science?
    B. Goals of the Course
  2. Lab
    A. Git setup
    B. IPython Notebook setup
    C. Working in Python

But first...

Since today's my birthday, I thought I might have us warm up our brains with...

The Birthday Paradox!

(credit to Balthazar Rouberol for preparing what follows)

Given a sample of n people, we would like to calculate the probability p that at least one person has the same birthday as any other person in the group.

First: how many know this paradox? Keep the answer to yourselves.

The rest of you: how big do you think this class would have to be in order for there to be >50% chance that two people have the same birthday?

Alternatively, what are the chances that two people in this class (including the TA and me) have the same birthday?

Assumptions:

  • the probability distribution of birthdays is uniform
  • all events are independent of each other

P(A) is the probability of at least two people sharing the same birthday. P(A') is the probability that all birthdays are different.

\begin{equation} P(A') = 1 - P(A) \end{equation}

Calculating the probability

From Wikipedia

P(A') can be calculated as P(1) × P(2) × P(3) × ... × P(20).

The 20 independent events correspond to the 20 people, and can be defined in order. Each event can be defined as the corresponding person not sharing his/her birthday with any of the previously analyzed people. For Event 1, there are no previously analyzed people. Therefore, the probability, P(1), that Person 1 does not share his/her birthday with previously analyzed people is 1, or 100%. Ignoring leap years for this analysis, this probability can also be written as 365/365, for reasons that will become clear below.

For Event 2, the only previously analyzed people is Person 1. Assuming that birthdays are equally likely to happen on each of the 365 days of the year, the probability, P(2), that Person 2 has a different birthday than Person 1 is 364/365. This is because, if Person 2 was born on any of the other 364 days of the year, Persons 1 and 2 will not share the same birthday.

Similarly, if Person 3 is born on any of the 363 days of the year other than the birthdays of Persons 1 and 2, Person 3 will not share a birthday with either of them. This makes the probability P(3) = 363/365.

This analysis continues until Person 20 is reached, whose probability of not sharing his/her birthday with people analyzed before, P(20), is 346/365.

P(A') is equal to the product of these individual probabilities:

    (1) P(A') = 365/365 × 364/365 × 363/365 × 362/365 × ... × 346/365

The terms of equation (1) can be collected to arrive at:

    (2) P(A') = (1/365)^20 × (365 × 364 × 363 × ... × 346)
              = 0.589

    (3) P(A) = 1 - P(A') = 1 - 0.589 = 0.411 = 41.1%

Generalization

\begin{eqnarray} P(n') &=& 1 \times \left(1 - \dfrac{1}{365}\right) \times \left(1 - \dfrac{2}{365}\right) \times \dots \times \left(1 - \dfrac{n-1}{365}\right) \\ &=& \dfrac{365 \times 364 \times \dots \times (365 - n + 1)}{365^{n}} \\ &=& \dfrac{365!}{365^{n} \, (365 - n)!} \end{eqnarray}
\begin{equation} P(n) = 1 - P(n') = 1 - \dfrac{365!}{365^{n} \, (365 - n)!} \end{equation}

In [1]:
from __future__ import division

import math

def pn_dash(n):
    """Return the probability that no two people share a birthday in a group of n people."""
    return math.factorial(365) / (365**n * math.factorial(365 - n))

def pn(n):
    """Return the probability that at least two people share a birthday in a group of n people."""
    return 1 - pn_dash(n)

In [2]:
# Let's calculate it for 20 people
print '{:0.2f}%'.format(pn(20) * 100)


41.14%

In [3]:
nb_people = range(0, 85, 5)
p_birthday = [pn(n) for n in nb_people]

for n, p in zip(nb_people, p_birthday):
    print 'n = {:2} -> p = {:.2f}%'.format(n, p * 100)


n =  0 -> p = 0.00%
n =  5 -> p = 2.71%
n = 10 -> p = 11.69%
n = 15 -> p = 25.29%
n = 20 -> p = 41.14%
n = 25 -> p = 56.87%
n = 30 -> p = 70.63%
n = 35 -> p = 81.44%
n = 40 -> p = 89.12%
n = 45 -> p = 94.10%
n = 50 -> p = 97.04%
n = 55 -> p = 98.63%
n = 60 -> p = 99.41%
n = 65 -> p = 99.77%
n = 70 -> p = 99.92%
n = 75 -> p = 99.97%
n = 80 -> p = 99.99%

In [4]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [5]:
# Main plot layout
f, ax = plt.subplots()
ax.set_yticks(np.arange(0, 1.1, 0.1))
plt.ylabel('probability')
f.text(x=0.5, y=0.975, s='Probability distribution of birthday collision for a sample of n people',
       horizontalalignment='center', verticalalignment='top')

# P(n) in red and its complement P(n') in blue
plt.plot(nb_people, p_birthday, label='$p(n)$', color='r')
plt.plot(nb_people, [pn_dash(n) for n in nb_people], label='$p(\overline{n})$', color='b')

# Find the smallest n where P(n) first rounds to ~50% and mark it with dashed guides
n_p50, p50 = [(n, pn(n)) for n in xrange(0, 100) if round(pn(n), 2) in [0.5, 0.51]][0]
plt.axhline(y=p50, xmax=n_p50/80., linestyle='--', color='grey')
plt.axvline(x=n_p50, ymax=p50, linestyle='--', color='grey')

plt.legend(loc='center right')
plt.text(n_p50 - 1, -0.055, '23')  # label the crossover point on the x-axis


Out[5]:
<matplotlib.text.Text at 0x105bd0ad0>

Conclusion

In a group of 23 people, there is a probability of approximately 50% that at least two people share a birthday, contrary to what your intuition might tell you!
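As a quick sanity check (not part of the original derivation), we can also estimate this probability empirically. The sketch below simulates random birthdays; simulate_pn and the trial count are illustrative choices, not canonical names.

import random

def simulate_pn(n, trials=10000):
    """Estimate the probability of a shared birthday among n people by random trials."""
    hits = 0
    for _ in xrange(trials):
        birthdays = [random.randint(1, 365) for _ in xrange(n)]
        if len(set(birthdays)) < n:  # a duplicate birthday collapses the set
            hits += 1
    return float(hits) / trials

# Should land near the analytic ~50.7% for n = 23, within sampling noise
print 'analytic:  {:.2f}%'.format(pn(23) * 100)
print 'simulated: {:.2f}%'.format(simulate_pn(23) * 100)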

A. What is Data Science?

(And what does the birthday paradox have to do with it?)

List examples:

  • Recommendation Engines
    • Collaborative Filtering (Amazon, Netflix)
    • PYMK, i.e. "People You May Know" (LinkedIn, etc.)
    • Other (Pandora, etc.)
  • Data Viz (NYT, etc.)
  • Fraud detection
  • Business Intelligence (Obama 2012, etc.)

Data Science = Data + Science

Data can be:

  1. Unstructured: most data does not conform to any predetermined schema. Must ETL it first. Will cover next class.
  2. Semi-structured: machine-readable but unpredictable. Good for communication, bad for analysis. Mongo, XML, JSON, etc. Will cover Monday.
  3. Structured: fields always there, always same type. Data must be in this form prior to analysis. SQL, R, Pandas. Next week and beyond.
    • Excel is semi-structured but has some structured capabilities. Special case.

Unstructured (e.g. Email, Photos, Books, etc.)

Semi-Structured (e.g. XML, JSON, NoSQL, APIs, etc.)

Structured (e.g. SQL, Data Frames, etc.)

  1. Most data in the world is unstructured. That is, it does not conform to any predetermined computer-readable form. Before we can work with this sort of data, we need to extract it, transform it into something usable, and load it into our system, whatever that system may be. Extract, transform, load (ETL) is one of the less glamorous and more time-consuming parts of the job, so it is important to know how to do it efficiently. We will cover this in the next class.
  2. Semi-structured data is a relatively new thing and has a lot to do with the web. Semi-structured data is machine-readable but does not conform to a rigid structure. This is both a blessing and a curse. The flexibility it provides makes it a lot easier for different systems to talk to one another, but that same flexibility makes working with the data in aggregate more difficult. Many "NoSQL" databases use a semi-structured schema. Those of you who are over 30 probably remember XML, which was a popular standard for semi-structured data in the late '90s and early 2000s. Fortunately, XML has largely been replaced by JSON, which is easier to read for both humans (because it doesn't have so many angle brackets) and computers (because it is less flexible and therefore less ambiguous). We will cover how to work with this sort of data next Monday (see the short JSON example after this list).
  3. Ultimately, we data scientists need our data to be structured before we can do anything with it. Relational databases are structured. And once we get into Pandas and R, we will learn about data frames, which are also structured. Structured data are consistent throughout: a given field will have the same data type no matter what. This is extremely important because it makes it possible to work across all of the data in aggregate, which opens up the possibility of doing everything from calculating sums to measuring relationships and predicting outcomes. We will cover the basics of how to work with this kind of data next week, and for the most part, the rest of this course will deal with data in this form.
    1. As an aside, you might be wondering where Excel spreadsheets fit in this list. I would place them somewhere between structured and semi-structured; we might call them "mostly structured". Unlike fully structured data, an Excel spreadsheet can accept different data types in a column. But unlike semi-structured data, it will complain when you do this (at least as soon as you try to do an operation on that column). It probably belongs in the semi-structured category, but because of the ability to do structured operations in a spreadsheet, I'm reluctant to put it there.
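To make the semi-structured case concrete, here is a minimal sketch using Python's built-in json module. The record itself is made up for illustration:

import json

# A made-up semi-structured record: fields and nesting can vary from record to record
raw = '{"name": "Alice", "visits": 3, "tags": ["new", "mobile"]}'
record = json.loads(raw)  # parse JSON text into nested dicts and lists

print record['name']     # access fields by key
print record['tags'][0]  # nesting is arbitrary, unlike a table column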

Science can be:

  1. Once we've got data in a form we can use, first we look at it. Sometimes that is enough to derive valuable insights. Time permitting, we will begin this in week 3.
  2. Then we might try to model the data, which can be useful for making inferences and predictions.

Explorations / Explanations

  • Data Visualization (e.g. ggplot2, Tableau, d3.js, etc.)
  • Unsupervised Machine Learning (e.g. clustering, etc.)
  • etc....

Inferences / Predictions

  • Regression Models (e.g. Linear Models, Logistic Regression)
  • Supervised Machine Learning (e.g. Neural Nets, Genetic Algorithms)
  • etc....

Data Science Workflow

From a Taxonomy of Data Science (by Dataists)

A. Obtain

B. Scrub

C. Explore

D. Model

E. Interpret

Workflow Example:

Problem: what are the leading indicators that a user will make a new purchase?

A. Collect data around user retention, user actions within the product, potentially find data outside of company

B. Extract aggregated values from raw data

  1. How many times did a user share through Facebook within a week? A month?
  2. How often did they open up our emails?

C. Examine data to find common distributions and correlations

D. Model the data to predict whether a user will purchase again

E. Share results (and probably also go back to the drawing board)

B. Goals of the Course

At the completion of this course, you will be able to:

  • Employ the Map/Reduce paradigm to transform big unstructured data
  • Access data from web-based application programming interfaces (APIs)
  • Use Structured Query Language (SQL) operations like JOIN and GROUP BY
  • Explore and present data through visualizations
  • Apply generalized linear models (GLMs)
  • Detect clusters in multivariate data
  • Predict categories using supervised machine learning techniques

Tentative Course outline:

  1. Intro and Overview
  2. Big Data I: Hadoop
  3. Big Data II: IPython.parallel
  4. APIs and semi-structured data - First Project Proposals Due 1/27
  5. SQL and Data Frames
  6. Data Exploration & Visualization (feedback on proposals returned)
  7. Linear Regression
  8. Logistic Regression - Formal Project Proposals Due 2/3 (including data and methods chosen)
  9. Dimensionality Reduction
  10. Unsupervised Machine Learning: K-Means Clustering (feedback on proposals returned)
  11. Network Analysis
  12. Supervised Machine Learning: K-Nearest Neighbors - Project live on Github 2/19 (no class 2/17)
  13. Supervised Machine Learning: Naive Bayes
  14. Machine Learning in Python: Scikit-Learn
  15. Supervised Machine Learning: Decision Trees & Random Forests - Peer Feedback Due 3/3
  16. Ensemble Techniques
  17. Recommendation Systems
  18. Final Project Working Session
  19. Final Project Working Session
  20. Where to Go Next
  21. Final Project Presentations (10 min. each)
  22. Final Project Presentations (10 min. each)

C. Lab

Checklist

  • Install Anaconda
  • Setup Git

The following was put together by [Jake Vanderplas](http://www.vanderplas.com)

Getting Started: Four Ways to Use Python

1. The Python command-line interpreter

2. Editing Python (.py) files

3. The IPython command-line interpreter

4. The IPython notebook

1. The Python command-line Interpreter

If you have never used the command-line, you're in for a treat

  • Mac OSX: in Finder/Applications, search for "Terminal"

  • Linux/Unix: Ctrl-Alt-t

  • Windows: run "cmd"

1. The Python command-line Interpreter

Type python at the command-line to start the interpreter

1. The Python Command-line Interpreter

Execute a command: Type print "hello world"
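Assuming Python 2 is installed, the whole exchange looks something like this:

$ python
>>> print "hello world"
hello world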

1. The Python Command-line Interpreter

Closing the terminal:

  • Either type exit() or type Ctrl-d

2. Editing Python (.py) files

This requires a text editor.

The best option is one which includes code highlighting

  • Linux: gedit, emacs, nano, vim...
  • Mac OSX: textmate, emacs, nano, vim...
  • Windows: NotePad...

GUI-based editors with bells & whistles

  • Linux: KWrite, Scribes, eggy
  • Mac OSX: TextWrangler, SublimeText
  • Windows: NotePad++, SublimeText

2. Editing Python (.py) files

Use your editor to open hello_world.py (here we use TextMate's mate command on OS X, though any text editor will do)

Edit the file to say print "hello world"

In the terminal, run python hello_world.py
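Put together, the file and the run look something like this (assuming hello_world.py is in your current directory):

$ cat hello_world.py
print "hello world"

$ python hello_world.py
hello world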

3. The IPython command-line Interpreter

IPython provides an enhanced command-line interface

It can be started by typing ipython:

Useful features include tab completion, help (?), etc.

3. The IPython command-line Interpreter

Basic use is just like the standard interpreter:
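For example, a short session might look like this; the trailing ? is IPython's built-in help, and Tab completes names:

$ ipython
In [1]: print "hello world"
hello world

In [2]: import math

In [3]: math.sqrt?  # shows the documentation for math.sqrt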

4. The IPython Notebook

The IPython notebook can be started by typing ipython notebook:

4. The IPython Notebook

Your web browser should open to an interactive notebook page

Notice that this slideshow is written as an IPython notebook!

The most basic data type is None (of type NoneType). This is the equivalent of NULL in other languages.
There are four numeric types: int, float, bool, complex.


In [6]:
type(1)


Out[6]:
int

In [7]:
type(2.5)


Out[7]:
float

In [8]:
type(True)


Out[8]:
bool

In [9]:
type(2+3j)


Out[9]:
complex

The next basic data type is the Python list.
A list is an ordered collection of elements, and these elements can be of arbitrary type. Lists are mutable, meaning they can be changed in-place.


In [10]:
k = [1, 'b', True]

In [11]:
k[2]


Out[11]:
True

In [12]:
k[1] = 'a'

In [13]:
k


Out[13]:
[1, 'a', True]

Tuples, by contrast, are immutable ordered collections of arbitrary elements.


In [14]:
x = (1, 'a', 2.5)

In [15]:
x


Out[15]:
(1, 'a', 2.5)

In [16]:
x[0]


Out[16]:
1

In [17]:
x[0] = 'b'


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-1d938b100406> in <module>()
----> 1 x[0] = 'b'

TypeError: 'tuple' object does not support item assignment

The string type in Python represents an immutable ordered array of characters (note there is no char type).

Strings support slicing and indexing operations like arrays, and have many other string-specific functions as well.

String processing is one area where Python excels.
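For example, a few representative operations on an arbitrary string:

s = 'data science'

print s[0]          # indexing: 'd'
print s[-1]         # negative indices count from the end: 'e'
print s[0:4]        # slicing: 'data'
print s.upper()     # 'DATA SCIENCE'
print s.split(' ')  # ['data', 'science']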

Associative arrays (or hash tables) are implemented in Python as the dictionary type.


In [18]:
this_class = {'subject': 'Data Science', 'instructor': 'Alessandro', 'time': 1800, 'is_cool': True}

In [19]:
this_class['subject']


Out[19]:
'Data Science'

In [20]:
this_class['is_cool']


Out[20]:
True

Dictionaries are unordered collections of key-value pairs, and dictionary keys must be hashable (in practice, immutable types such as strings, numbers, and tuples).
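A quick sketch of that constraint: a tuple works as a key, while a mutable list raises an error.

d = {}
d[(1, 2)] = 'ok'    # tuples are immutable, so they can be dictionary keys
d[[1, 2]] = 'boom'  # raises TypeError: unhashable type: 'list'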

Another basic Python data type is the set. Sets are unordered mutable collections of distinct elements.


In [21]:
y = set([1, 1, 2, 3, 5, 8])

In [22]:
y


Out[22]:
{1, 2, 3, 5, 8}

These are particularly useful for checking membership of an element and for ensuring element uniqueness.
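For example, with the set defined above:

print 3 in y                # True: fast membership test
print 10 in y               # False
print set([1, 2, 2, 3, 3])  # duplicates collapse: only distinct elements remain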

Discussion