Data Science

R & Machine Learning

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time:

  1. Data: Recap
  2. R & ggplot2
  3. Lab

Questions?

Agenda

  1. Project Discussion
  2. Readiness Assessment
  3. R
    1. History of R
    2. Comparing Python and R
  4. Machine Learning
    1. What is Machine learning?
    2. Machine Learning Problems
    3. Linear Regression
  5. Lab: Multiple Regression & Feature Extraction

Projects!

By February 19:

  1. Go to tinyurl.com/DS-repos and pick two projects to comment on.
  2. Browse the repos
    • View the images the creators generated to illustrate their data
    • View the code the creators used to generate the images
      • If the repo contains .ipynb files:
  3. Create two new "issues" (one for each project)
    • Tag it as a "bug", "question", "enhancement" or create your own tag.
    • Be constructive!
      • If you see a potential problem, identify it (but be respectful, we're all learning).
      • If you don't understand a piece of code, ask them to explain it.
      • If you have a recommendation, be specific. How would you improve their project?

Try to pick projects that don't have any issues yet. (That way everyone gets to benefit from feedback.)

Readiness Assessment

Laptops Closed!


In [1]:
%load_ext sql

In [2]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5


Done.
Out[2]:
eid name department longdepartment title gender seniority
1 John Arnold Trading ENA Gas Financial VP Trading Male Senior
2 Harry Arora Trading ENA East Power VP Trading Male Senior
3 Robert Badeer Trading ENA West Power Mgr Trading Male Junior
4 Susan Bailey Legal ENA Legal Specialist Legal Female Junior
5 Eric Bass Trading ENA Gas Texas Trader Male Junior

The key is `eid`. There are 156 distinct titles: 82 of them are `'Senior'`, 74 of them are `'Junior'`, and no title is both `'Senior'` and `'Junior'`.

  1. Which normal form does this violate (1, 2, or 3)?
  2. How would you normalize it?

History of R

Before R, there was S

S (later S-Plus) developed at Bell Labs by John Chambers, 1975

Previously: Statisticians used FORTRAN subroutines for their work

Goal: develop a more interactive statistical language

R vs S

S-PLUS is proprietary (owned and sold by Tibco)

R developed by Ross Ihaka and Robert Gentleman

Freely available under the GNU General Public License!

Primarily written in C, FORTRAN, and R

Comparing Python and R (data structures)

Python


In [3]:
print(type(1))
print(type(2.5))


<type 'int'>
<type 'float'>

R: Numeric structures


In [4]:
%load_ext rmagic

In [5]:
%%R 
print(class(1))
print(class(2.5))


[1] "numeric"
[1] "numeric"

In [6]:
%R print(class(as.integer(1)))


[1] "integer"

Generally, R handles numeric types gracefully, converting between them as needed.


In [7]:
%R print(class(2.5))


[1] "numeric"

Python: Arrays (lists)

Lists in Python preserve each element's original data type


In [8]:
k = [1, 'b', True]  # list of mixed types
k


Out[8]:
[1, 'b', True]

R: Arrays (Vectors)

R vectors can hold only one data type


In [9]:
%%R
str(c(1, 'b', TRUE))   # array of character 
str(c(1, 2, 3, 1233))  # array of numeric
# array of numeric (TRUE converted to 1): 
str(c(1, 2, 3, TRUE))


 chr [1:3] "1" "b" "TRUE"
 num [1:4] 1 2 3 1233
 num [1:4] 1 2 3 1

Python: Dicts (key/value objects)


In [10]:
email = {'title': 'Good Morning!', 'from':'Bob Loblaw', 'date': 'Tue Mar 3 2013'}
email['title']


Out[10]:
'Good Morning!'

In [11]:
print(email)


{'date': 'Tue Mar 3 2013', 'from': 'Bob Loblaw', 'title': 'Good Morning!'}

R: Lists (dotted pairs)

The nearest equivalent to a Python dict in R is a list object.


In [12]:
%%R
email <- list(title = 'Good Morning!', from = 'Bob Loblaw', date = 'Tue Mar 3 2013')
email$title


[1] "Good Morning!"

In [13]:
%R print(email)


$title
[1] "Good Morning!"

$from
[1] "Bob Loblaw"

$date
[1] "Tue Mar 3 2013"

Python: Lists of Lists

Lists in Python maintain each element's original data type, including other lists


In [14]:
k = [1, 'b', True, ['new', 'list', 'here']]
k[3]


Out[14]:
['new', 'list', 'here']

R: Lists of Vectors

In R, combining vectors with `c()` produces one longer, flattened vector.


In [15]:
%%R
c(1, 'b', TRUE, c('new', 'list', 'here'))


[1] "1"    "b"    "TRUE" "new"  "list" "here"

Use lists to keep the component vectors separate


In [16]:
%%R 
list(1, 'b', TRUE, c('new', 'list', 'here'))


[[1]]
[1] 1

[[2]]
[1] "b"

[[3]]
[1] TRUE

[[4]]
[1] "new"  "list" "here"

Python is a General Purpose Language

Pearson's product-moment correlation in Python:


In [17]:
from scipy import stats
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
stats.pearsonr(x, y)


Out[17]:
(0.81642051634484003, 0.0021696288730787888)

R is a Statistical Machine Learning Language

Pearson's product-moment correlation in R:


In [18]:
%%R
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
cor.test(x, y)


	Pearson's product-moment correlation

data:  x and y
t = 4.2415, df = 9, p-value = 0.00217
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4243912 0.9506933
sample estimates:
      cor 
0.8164205 

MACHINE LEARNING

What is Machine Learning?

from Wikipedia:

"Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data."

"The core of machine learning deals with representation and generalization..."

  • representation – extracting structure from data
  • generalization – making predictions from data

Representation helps you figure out what you're looking at; generalization helps you figure out what is likely to happen in the future. Q: can you think of examples? Keep these terms in mind: later we will use them to think about ML problems.

Machine Learning Problems

Important point: there is a lot of machine learning material we won't talk much about:

  • mathematical models of machine learning
  • rigorous analysis of ML algorithms
  • computational complexity

Types of Learning Problems:

  • Supervised: making predictions (generalization)
  • Unsupervised: extracting structure (representation)

Q: can you think of examples? Q: how could an algorithm “learn” from data in either of these cases?

Supervised Learning

Process used for making predictions

Sample data is already classified

Process uses pre-classified information to predict unknown space

Credit: Andrew Ng, "Introduction to Machine Learning," Stanford
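As a minimal illustration (not from the lecture), the idea of supervised learning can be sketched in a few lines of Python: a 1-nearest-neighbor classifier uses pre-classified sample data to predict the label of a new, unknown point. The data and labels here are made up.

```python
def nearest_neighbor(train, labels, query):
    """Predict the label of `query` using the closest pre-classified point."""
    distances = [abs(x - query) for x in train]
    return labels[distances.index(min(distances))]

# Pre-classified (labeled) sample data -- hypothetical values:
train = [1.0, 1.2, 3.9, 4.1]
labels = ['low', 'low', 'high', 'high']

print(nearest_neighbor(train, labels, 1.1))   # -> 'low'
print(nearest_neighbor(train, labels, 4.0))   # -> 'high'
```

The "learning" here is trivial (memorize the training data), but the shape of the problem is the same as in real classifiers: known inputs with known labels are used to predict labels for unseen inputs.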

Unsupervised Learning

Process used for providing structure

No data was pre-"structured"; the process attempts to make sense of the independent variables on their own

(you're making up, or the algorithm is making up, your dependent variable)

Credit: Thomson Nguyen, "Introduction to Machine Learning," Lookout
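For contrast, here is a toy sketch of unsupervised learning in Python: a bare-bones one-dimensional k-means that groups unlabeled points into clusters. The data are made up and the initialization is deliberately naive.

```python
def kmeans_1d(points, k=2, iters=10):
    """Cluster 1-D points into k groups by alternating assignment and averaging."""
    centers = points[:k]  # naive init: use the first k points as centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

print(kmeans_1d([1, 2, 3, 10, 11, 12]))  # -> [2.0, 11.0]
```

No labels were provided; the structure (two groups of points) is discovered by the algorithm itself.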

Types of Data:

  • Continuous (quantitative)
  • Categorical (qualitative)

The space where data live is called the feature space.

Each point in this space is called a record.

Note: these characterize the dependent (target) variables!

Fitting it all Together

What's the goal?

What data do we have?

How do we determine the right approach?

Q: do you know of any particular models/algorithms that fit into these categories? Q: are these terms familiar?
Examples: classification for targeting ads (likely purchasers), regression, clustering (recommender systems), dimensionality reduction (matrix decomposition). Combination: non-negative matrix factorization (the Netflix Prize).

We will implement solutions using models and algorithms.
Each will fall into one of these four buckets.

Example: Linear Regression

<img src="assets/machine_learning6.png" width="800" >

What is a regression model?

A functional relationship between input & response variables

A simple linear regression model captures a linear relationship between an input x and response variable y

$y = \alpha + \beta x + \epsilon$

What do the terms in this model mean?

$y = \alpha + \beta x + \epsilon$

$y =$ response variable (the one we want to predict)

$x =$ input variable (the one we use to train the model)

$\alpha =$ intercept (where the line crosses the y-axis)

$\beta =$ regression coefficient (the model “parameter”)

$\epsilon =$ residual (the prediction error)

$y$ Dependent var, target var, output var
$x$ Indep var, covariate
$\epsilon$ Error term, disturbance
White noise (usually assumed to follow Gaussian distribution)
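Before turning to R, the OLS estimates for $\alpha$ and $\beta$ can be computed by hand in Python, using the same Anscombe `x` and `y` values as the correlation example earlier:

```python
# Closed-form OLS for simple linear regression:
#   beta  = Sxy / Sxx
#   alpha = y_bar - beta * x_bar
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta = s_xy / s_xx
alpha = y_bar - beta * x_bar
print(round(alpha, 4), round(beta, 4))  # -> 3.0001 0.5001
```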

In R:


In [19]:
%%R
x <- anscombe$x1
y <- anscombe$y1

In [20]:
%%R
lm(y ~ x)


Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     3.0001       0.5001  

$y =$ `y`, $x =$ `x`, $\alpha \approx 3.0$, $\beta \approx 0.5$, $\epsilon = ?$

Common linear regression model data problems

I know the prices for all of these other apartments in my area. What could I get for mine?

What's the relationship between total number of friends, posting activity, and the number of likes a new post would get on Facebook?

Careful! Time series data (believe it or not) is often not handled well by simple regression

Multiple Regression

We can extend this model to several input variables, giving us the multiple linear regression model:

$y = \alpha + \beta_1 x_1 + \ldots + \beta_n x_n + \epsilon$
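A quick way to see the multiple regression fit in Python is via NumPy's least-squares solver (assuming NumPy is available; the data here are synthetic, constructed so the true coefficients are known):

```python
import numpy as np

# Synthetic data with known coefficients: y = 2 + 3*x1 - 1*x2 (no noise)
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 2 + 3 * x1 - 1 * x2

# Design matrix with a column of ones for the intercept (alpha)
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # recovers approximately [2, 3, -1]
```

The column of ones plays the role of $\alpha$; the remaining columns carry the $\beta_i$ coefficients, exactly as in the formula above.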


In [21]:
%%R 
x <- read.table('http://www.ats.ucla.edu/stat/examples/chp/p054.txt', sep='\t', h=T)
Y = x$Y
X1 = x$X1
X2 = x$X2
X3 = x$X3
X4 = x$X4
X5 = x$X5

In R:


In [22]:
%%R
lm(Y ~ X1 + X2 + X3 + X4 + X5)


Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5)

Coefficients:
(Intercept)           X1           X2           X3           X4           X5  
   11.01113      0.69205     -0.10356      0.24906     -0.03346      0.01549  

$\alpha = 11.01$ $\beta_1 = 0.692, \beta_2 = -0.104, \beta_3 = 0.249, \beta_4 = -0.033, \beta_5 = 0.015$

How do we fit a regression model to a dataset?

In theory, OLS (ordinary least squares): minimize the sum of squared distances (residuals) between the observed responses and those predicted by the model

In practice, any respectable piece of software will do this for you (even Excel!)

Warning: Linear regression involves several technical assumptions and can lead to mistaken conclusions if those assumptions are not understood and accounted for. If your work depends on linear regression, you should learn more about it first.

LAB

Next Time:

2/12: Guest Lecture by Dr. Gheorghe Muresan: Information Retrieval (Search)

Next Week:

2/17: NO CLASS (Presidents' Day)

2/19: CLASSIFICATION Part I: K-Nearest Neighbors

Homework:

Create a new issue on (at least) two (2) other students' projects