Data Science

R & Machine Learning

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time:

  1. Data: Recap
  2. R & ggplot2
  3. Lab

Questions?

Agenda

  1. Project Discussion
  2. Readiness Assessment
  3. R
    1. History of R
    2. Comparing Python and R
  4. Machine Learning
    1. What is Machine learning?
    2. Machine Learning Problems
    3. Linear Regression
  5. Lab: Multiple Regression & Feature Extraction

Projects!

By February 19:

  1. Go to tinyurl.com/DS-repos and pick two projects to comment on.
  2. Browse the repos
    • View the images the creators generated to illustrate their data
    • View the code the creators used to generate the images
      • If the repo contains .ipynb files:
  3. Create two new "issues" (one for each project)
    • Tag it as a "bug", "question", "enhancement" or create your own tag.
    • Be constructive!
      • If you see a potential problem, identify it (but be respectful, we're all learning).
      • If you don't understand a piece of code, ask them to explain it.
      • If you have a recommendation, be specific. How would you improve their project?

Try to pick projects that don't have any issues yet. (That way everyone gets to benefit from feedback.)

Readiness Assessment

Laptops Closed!


In [1]:
%load_ext sql

In [2]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5


Done.
Out[2]:
eid name department longdepartment title gender seniority
1 John Arnold Trading ENA Gas Financial VP Trading Male Senior
2 Harry Arora Trading ENA East Power VP Trading Male Senior
3 Robert Badeer Trading ENA West Power Mgr Trading Male Junior
4 Susan Bailey Legal ENA Legal Specialist Legal Female Junior
5 Eric Bass Trading ENA Gas Texas Trader Male Junior

The key is `eid`. There are 156 distinct titles: 82 of them are `'Senior'`, 74 of them are `'Junior'`, and no title is both `'Senior'` and `'Junior'`.

  1. Which normal form does this violate (1, 2, or 3)?
  2. How would you normalize it?

History of R

Before R, there was S

S (later S-Plus) developed at Bell Labs by John Chambers, 1975

Previously: Statisticians used FORTRAN subroutines for their work

Goal: develop a more interactive statistical language

R vs S

S-PLUS is proprietary (owned and sold by Tibco)

R developed by Ross Ihaka and Robert Gentleman

Freely available under the GNU General Public License!

Primarily written in C, FORTRAN, and R

Comparing Python and R (data structures)

Python


In [3]:
print(type(1))
print(type(2.5))


<type 'int'>
<type 'float'>

R: Numeric structures


In [4]:
%load_ext rmagic

In [5]:
%%R 
print(class(1))
print(class(2.5))


[1] "numeric"
[1] "numeric"

In [6]:
%R print(class(as.integer(1)))


[1] "integer"

Generally, R handles numeric types gracefully, converting between them as needed.


In [7]:
%R print(class(2.5))


[1] "numeric"

Python: Arrays (lists)

Lists in Python preserve each element's original data type


In [8]:
k = [1, 'b', True]  # list of mixed types
k


Out[8]:
[1, 'b', True]

R: Arrays (Vectors)

R vectors can hold only one data type


In [9]:
%%R
str(c(1, 'b', TRUE))   # array of character 
str(c(1, 2, 3, 1233))  # array of numeric
# array of numeric (TRUE converted to 1): 
str(c(1, 2, 3, TRUE))


 chr [1:3] "1" "b" "TRUE"
 num [1:4] 1 2 3 1233
 num [1:4] 1 2 3 1

Python: Dicts (key/value objects)


In [10]:
email = {'title': 'Good Morning!', 'from':'Bob Loblaw', 'date': 'Tue Mar 3 2013'}
email['title']


Out[10]:
'Good Morning!'

In [11]:
print(email)


{'date': 'Tue Mar 3 2013', 'from': 'Bob Loblaw', 'title': 'Good Morning!'}

R: Lists (dotted pairs)

The nearest equivalent to a Python dict in R is a list object.


In [12]:
%%R
email <- list(title = 'Good Morning!', from = 'Bob Loblaw', date = 'Tue Mar 3 2013')
email$title


[1] "Good Morning!"

In [13]:
%R print(email)


$title
[1] "Good Morning!"

$from
[1] "Bob Loblaw"

$date
[1] "Tue Mar 3 2013"

Python: Lists of Lists

Lists in Python maintain each element's original data type, including other lists


In [14]:
k = [1, 'b', True, ['new', 'list', 'here']]
k[3]


Out[14]:
['new', 'list', 'here']

R: Lists of Vectors

In R, combining vectors with `c()` produces one longer, flattened vector.


In [15]:
%%R
c(1, 'b', TRUE, c('new', 'list', 'here'))


[1] "1"    "b"    "TRUE" "new"  "list" "here"

Use lists to keep the component vectors separate


In [16]:
%%R 
list(1, 'b', TRUE, c('new', 'list', 'here'))


[[1]]
[1] 1

[[2]]
[1] "b"

[[3]]
[1] TRUE

[[4]]
[1] "new"  "list" "here"

Python is a General Purpose Language

Pearson's product-moment correlation in Python:


In [17]:
from scipy import stats
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
stats.pearsonr(x, y)


Out[17]:
(0.81642051634484003, 0.0021696288730787888)

R is a Statistical Machine Learning Language

Pearson's product-moment correlation in R:


In [18]:
%%R
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
cor.test(x, y)


	Pearson's product-moment correlation

data:  x and y
t = 4.2415, df = 9, p-value = 0.00217
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4243912 0.9506933
sample estimates:
      cor 
0.8164205 

MACHINE LEARNING

What is Machine Learning?

from Wikipedia:

"Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data."

"The core of machine learning deals with representation and generalization..."

  • representation – extracting structure from data
  • generalization – making predictions from data

Representation helps you figure out what you're looking at; generalization helps you figure out what is likely to happen in the future. Q: can you think of examples? Keep these terms in mind: later we will use them to think about ML problems.

Machine Learning Problems

Important point: there is a lot of machine learning material we won't talk much about:

  • mathematical models of machine learning
  • rigorous analysis of ML algorithms
  • computational complexity

Types of Learning Problems:

  • Supervised: making predictions (generalization)
  • Unsupervised: extracting structure (representation)

Q: can you think of examples? Q: how could an algorithm “learn” from data in either of these cases?

Supervised Learning

Process used for making predictions

Sample data is already classified

Process uses pre-classified information to predict unknown space

Credit: Andrew Ng, "Introduction to Machine Learning," Stanford
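As a minimal illustration (not from the lecture), the idea of supervised learning can be sketched in a few lines of Python: a 1-nearest-neighbor classifier uses pre-classified sample data to predict the label of a new, unknown point. The data and labels here are made up.

```python
def nearest_neighbor(train, labels, query):
    """Predict the label of `query` using the closest pre-classified point."""
    distances = [abs(x - query) for x in train]
    return labels[distances.index(min(distances))]

# Pre-classified (labeled) sample data -- hypothetical values:
train = [1.0, 1.2, 3.9, 4.1]
labels = ['low', 'low', 'high', 'high']

print(nearest_neighbor(train, labels, 1.1))   # -> 'low'
print(nearest_neighbor(train, labels, 4.0))   # -> 'high'
```

The "learning" here is trivial (memorize the training data), but the shape of the problem is the same as in real classifiers: known inputs with known labels are used to predict labels for unseen inputs.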

Unsupervised Learning

Process used for providing structure

No data was pre-"structured"; the process attempts to make sense of the independent variables on their own

(you're making up, or the algorithm is making up, your dependent variable)

Credit: Thomson Nguyen, "Introduction to Machine Learning," Lookout
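For contrast, here is a toy sketch of unsupervised learning in Python: a bare-bones one-dimensional k-means that groups unlabeled points into clusters. The data are made up and the initialization is deliberately naive.

```python
def kmeans_1d(points, k=2, iters=10):
    """Cluster 1-D points into k groups by alternating assignment and averaging."""
    centers = points[:k]  # naive init: use the first k points as centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

print(kmeans_1d([1, 2, 3, 10, 11, 12]))  # -> [2.0, 11.0]
```

No labels were provided; the structure (two groups of points) is discovered by the algorithm itself.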

Types of Data:

  • Continuous (quantitative)
  • Categorical (qualitative)

The space where data live is called the feature space.

Each point in this space is called a record.

Note: these characterize the dependent (target) variables!

Fitting it all Together

What's the goal?

What data do we have?

How do we determine the right approach?

Q: do you know of any particular models/algorithms that fit into these categories? Q: are these terms familiar?
Examples: classification for targeting ads (likely purchasers), regression, clustering (recommender systems), dimensionality reduction (matrix decomposition). Combination: non-negative matrix factorization (the Netflix Prize).

We will implement solutions using models and algorithms.
Each will fall into one of these four buckets.

Example: Linear Regression

<img src="assets/machine_learning6.png" width="800" >

What is a regression model?

A functional relationship between input & response variables

A simple linear regression model captures a linear relationship between an input x and response variable y

$y = \alpha + \beta x + \epsilon$

What do the terms in this model mean?

$y = \alpha + \beta x + \epsilon$

$y =$ response variable (the one we want to predict)

$x =$ input variable (the one we use to train the model)

$\alpha =$ intercept (where the line crosses the y-axis)

$\beta =$ regression coefficient (the model “parameter”)

$\epsilon =$ residual (the prediction error)

$y$ Dependent var, target var, output var
$x$ Indep var, covariate
$\epsilon$ Error term, disturbance
White noise (usually assumed to follow Gaussian distribution)
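Before turning to R, the OLS estimates for $\alpha$ and $\beta$ can be computed by hand in Python, using the same Anscombe `x` and `y` values as the correlation example earlier:

```python
# Closed-form OLS for simple linear regression:
#   beta  = Sxy / Sxx
#   alpha = y_bar - beta * x_bar
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta = s_xy / s_xx
alpha = y_bar - beta * x_bar
print(round(alpha, 4), round(beta, 4))  # -> 3.0001 0.5001
```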

In R:


In [19]:
%%R
x <- anscombe$x1
y <- anscombe$y1

In [20]:
%%R
lm(y ~ x)


Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     3.0001       0.5001  

$y =$ `y`, $x =$ `x`, $\alpha \approx 3.0$, $\beta \approx 0.5$, $\epsilon = ?$

Common linear regression model data problems

I know the prices for all of these other apartments in my area. What could I get for mine?

What's the relationship between total number of friends, posting activity, and the number of likes a new post would get on Facebook?

Careful! Time series data (believe it or not) is often not handled well by simple regression

Multiple Regression

We can extend this model to several input variables, giving us the multiple linear regression model:

$y = \alpha + \beta_1 x_1 + \ldots + \beta_n x_n + \epsilon$
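A quick way to see the multiple regression fit in Python is via NumPy's least-squares solver (assuming NumPy is available; the data here are synthetic, constructed so the true coefficients are known):

```python
import numpy as np

# Synthetic data with known coefficients: y = 2 + 3*x1 - 1*x2 (no noise)
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 2 + 3 * x1 - 1 * x2

# Design matrix with a column of ones for the intercept (alpha)
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # recovers approximately [2, 3, -1]
```

The column of ones plays the role of $\alpha$; the remaining columns carry the $\beta_i$ coefficients, exactly as in the formula above.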


In [21]:
%%R 
x <- read.table('http://www.ats.ucla.edu/stat/examples/chp/p054.txt', sep='\t', h=T)
Y = x$Y
X1 = x$X1
X2 = x$X2
X3 = x$X3
X4 = x$X4
X5 = x$X5

In R:


In [22]:
%%R
lm(Y ~ X1 + X2 + X3 + X4 + X5)


Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5)

Coefficients:
(Intercept)           X1           X2           X3           X4           X5  
   11.01113      0.69205     -0.10356      0.24906     -0.03346      0.01549  

$\alpha = 11.01$ $\beta_1 = 0.692, \beta_2 = -0.104, \beta_3 = 0.249, \beta_4 = -0.033, \beta_5 = 0.015$

How do we fit a regression model to a dataset?

In theory, OLS (ordinary least squares): minimize the sum of squared distances (residuals) between the observed responses and those predicted by the model

In practice, any respectable piece of software will do this for you (even Excel!)

Warning: Linear regression involves several technical assumptions and can lead to mistaken conclusions if those assumptions are not understood and accounted for. If your work depends on linear regression, you should learn more about it first.

LAB

Next Time:

2/12: Guest Lecture by Dr. Gheorghe Muresan: Information Retrieval (Search)

Next Week:

2/17: NO CLASS (Presidents' Day)

2/19: CLASSIFICATION Part I: K-Nearest Neighbors

Homework:

Create a new issue on (at least) two (2) other students' projects