By February 19:
IPython notebook
Try to pick projects that don't have any issues yet. (That way everyone gets to benefit from feedback.)
In [1]:
%load_ext sql
In [2]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5
Out[2]:
The key is `eid`. There are 156 distinct titles:
82 of them are `'Senior'`
74 of them are `'Junior'`
No titles are both `'Senior'` and `'Junior'`
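The counts above come from the course's `enron.db`, which isn't bundled here. As a hedged sketch of the kind of query that produces them, here is a toy in-memory stand-in (made-up rows; the real table's column names may differ):

```python
import sqlite3

# Toy stand-in for enron.db's EmployeeBase (rows invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EmployeeBase (eid INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO EmployeeBase VALUES (?, ?)",
    [(1, "Senior"), (2, "Junior"), (3, "Senior"), (4, "Junior"), (5, "Senior")],
)

# Count employees per title, as one would against the real table.
rows = conn.execute(
    "SELECT title, COUNT(*) FROM EmployeeBase GROUP BY title ORDER BY title"
).fetchall()
print(rows)  # [('Junior', 2), ('Senior', 3)]
```

Against the real database the same `GROUP BY title` query yields the Senior/Junior split quoted above.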
S (later S-Plus) developed at Bell Labs by John Chambers, 1975
Previously: Statisticians used FORTRAN subroutines for their work
Goal: develop a more interactive statistical language
S-PLUS is proprietary (owned and sold by Tibco)
R developed by Ross Ihaka and Robert Gentleman
Freely available under the GNU General Public License!
Primarily written in C, FORTRAN, and R
In [3]:
print(type(1))
print(type(2.5))
In [4]:
%load_ext rmagic
In [5]:
%%R
print(class(1))
print(class(2.5))
In [6]:
%R print(class(as.integer(1)))
Generally, R converts between numeric types as needed and handles mixed-type arithmetic well.
In [7]:
%R print(class(2.5))
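For comparison, Python also promotes between numeric types automatically; a quick sketch:

```python
# Mixing int and float promotes the result to float, much as R
# converts between numeric representations as needed.
a = 1           # int
b = 2.5         # float
c = a + b       # promoted to float
print(type(c))  # <class 'float'>

# Explicit conversion when an integer result is wanted:
d = int(2.9)    # truncates toward zero -> 2
print(d)
```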
In [8]:
k = [1, 'b', True] # list of mixed types
k
Out[8]:
In [9]:
%%R
str(c(1, 'b', TRUE)) # character vector (all elements coerced to strings)
str(c(1, 2, 3, 1233)) # numeric vector
# numeric vector (TRUE converted to 1):
str(c(1, 2, 3, TRUE))
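R's `c()` coercion has a close analogue in NumPy (assuming `numpy` is available), where arrays are homogeneous and mixed inputs are coerced to a common type:

```python
import numpy as np

# Mixed types coerce to strings, like c(1, 'b', TRUE) in R.
mixed = np.array([1, "b", True])
print(mixed.dtype)  # a Unicode string dtype

# Booleans coerce to integers when mixed with numbers,
# like c(1, 2, 3, TRUE) becoming numeric in R.
nums = np.array([1, 2, 3, True])
print(nums)  # [1 2 3 1]
```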
In [10]:
email = {'title': 'Good Morning!', 'from':'Bob Loblaw', 'date': 'Tue Mar 3 2013'}
email['title']
Out[10]:
In [11]:
print(email)
In [12]:
%%R
email <- list(title = 'Good Morning!', from = 'Bob Loblaw', date = 'Tue Mar 3 2013')
email$title
In [13]:
%R print(email)
In [14]:
k = [1, 'b', True, ['new', 'list', 'here']]
k[3]
Out[14]:
In [15]:
%%R
c(1, 'b', TRUE, c('new', 'list', 'here'))
Note: `c()` flattens nested vectors and coerces all elements to a common type; use lists to maintain vector relationships.
In [16]:
%%R
list(1, 'b', TRUE, c('new', 'list', 'here'))
In [17]:
from scipy import stats
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
stats.pearsonr(x, y)
Out[17]:
In [18]:
%%R
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
cor.test(x, y)
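Under the hood, both `stats.pearsonr` and `cor.test` compute the same sample correlation coefficient; a from-scratch sketch on the same data:

```python
import math

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson r = covariance(x, y) / (sd(x) * sd(y))
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
r = cov / (sx * sy)
print(round(r, 4))  # ~0.8164, matching pearsonr and cor.test above
```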
from Wikipedia:
"Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data."
"The core of machine learning deals with representation and generalization..."
Representation: helps you figure out what you're looking at. Q: can you think of examples?
Generalization: helps you figure out what is likely to happen in the future.
Keep these terms in mind; later we will use them to think about ML problems. (write these on the white board)
Important point: there is lots of ML material we won't talk much about:
mathematical models of machine learning
rigorous analysis of ML algorithms
computational complexity
| Learning type | Goal |
|---|---|
| Supervised | Making predictions |
| Unsupervised | Extracting structure |

| Learning type | Goal | Key term |
|---|---|---|
| supervised | making predictions | generalization |
| unsupervised | extracting structure | representation |
Q: can you think of examples? Q: how could an algorithm “learn” from data in either of these cases?
Sample data is already classified
Process uses pre-classified information to predict unknown space
Credit: Andrew Ng, "Introduction to Machine Learning," Stanford
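As a minimal illustration (toy data, not from the lecture), a one-nearest-neighbor classifier uses the pre-classified sample to label a new, unknown point:

```python
# Toy labeled data: (feature, label) pairs -- the "pre-classified" sample.
train = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]

def predict(x):
    """Label x with the class of its nearest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(predict(2.0))  # small
print(predict(7.5))  # large
```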
The data are not pre-"structured"; the algorithm attempts to make sense of the independent variables on its own.
(You, or the algorithm, are effectively making up the dependent variable.)
Credit: Thomson Nguyen, "Introduction to Machine Learning," Lookout
The space where data live is called the feature space.
Each point in this space is called a record.
Note: these characterize the dependent (target) variables!
Q: do you know of any particular models/algorithms that fit into these categories?
Q: are these terms familiar?
Q: can you think of particular algorithms that fit into these categories?
Classification for targeting ads (likely purchasers), regression, clustering (recommender systems), dimensionality reduction (matrix decomposition)
Combination: non-negative matrix factorization (the Netflix Prize)
We will implement solutions using models and algorithms.
Each will fall into one of these four buckets.
<img src="assets/machine_learning6.png" width="800" >
A functional relationship between input & response variables
A simple linear regression model captures a linear relationship between an input x and response variable y
$y = \alpha + \beta x + \epsilon$
$y =$ response variable (the one we want to predict)
$x =$ input variable (the one we use to train the model)
$\alpha =$ intercept (where the line crosses the y-axis)
$\beta =$ regression coefficient (the model “parameter”)
$\epsilon =$ residual (the prediction error)
$y$ Dependent var, target var, output var
$x$ Indep var, covariate
$\epsilon$ Error term, disturbance
White noise (usually assumed to follow Gaussian distribution)
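For simple linear regression, $\alpha$ and $\beta$ have a closed-form least-squares solution: $\beta = \mathrm{cov}(x, y)/\mathrm{var}(x)$ and $\alpha = \bar{y} - \beta\bar{x}$. A sketch on the Anscombe I data used in the next cells:

```python
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# beta = cov(x, y) / var(x); alpha = mean(y) - beta * mean(x)
beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        / sum((xi - mx) ** 2 for xi in x))
alpha = my - beta * mx
print(round(alpha, 4), round(beta, 4))  # ~3.0001, ~0.5001
```

These match the intercept and coefficient that `lm(y ~ x)` reports.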
In [19]:
%%R
x <- anscombe$x1
y <- anscombe$y1
In [20]:
%%R
lm(y ~ x)
Here $y$ = `y`, $x$ = `x`, $\alpha = 3$, $\beta = 0.5$, $\epsilon = ?$
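The leftover $\epsilon$ can be computed directly as the residuals $y - (\alpha + \beta x)$; a quick check with the rounded coefficients from the fit:

```python
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

alpha, beta = 3.0, 0.5  # rounded coefficients from the lm() fit
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
print([round(e, 2) for e in residuals])

# For a least-squares fit the residuals sum to (approximately) zero.
print(round(sum(residuals), 2))
```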
I know the prices for all of these other apartments in my area. What could I get for mine?
What's the relationship between total number of friends, posting activity, and the number of likes a new post would get on Facebook?
Careful! Time series data (believe it or not) is not always handled well by simple regression
In [21]:
%%R
x <- read.table('http://www.ats.ucla.edu/stat/examples/chp/p054.txt', sep='\t', h=T)
Y = x$Y
X1 = x$X1
X2 = x$X2
X3 = x$X3
X4 = x$X4
X5 = x$X5
In [22]:
%%R
lm(Y ~ X1 + X2 + X3 + X4 + X5)
$\alpha = 11.01$; $\beta_1 = 0.692,\ \beta_2 = -0.104,\ \beta_3 = 0.249,\ \beta_4 = -0.033,\ \beta_5 = 0.015$
In theory, OLS (ordinary least squares): minimize the sum of squared distances (residuals) between the observed responses and those predicted by the model
In practice, any respectable piece of software will do this for you (even Excel!)
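Under the hood, that software is solving the least-squares problem directly. A NumPy sketch on synthetic data (made up here, since the p054 dataset isn't bundled) where the true coefficients are known in advance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

# Construct Y with known coefficients and no noise: Y = 2 + 3*X1 - 1*X2
Y = 2 + 3 * X1 - X2

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones(n), X1, X2])

# Minimize ||A @ coef - Y||^2, the sum of squared residuals.
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(coef, 3))  # recovers approximately [2, 3, -1]
```

With noise-free data the fit recovers the true intercept and coefficients exactly; real data like the p054 table adds a residual term on top.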
Warning: Linear regression involves several technical assumptions and can lead to mistaken conclusions if those assumptions are not understood and accounted for. If your work depends on linear regression, you should learn more about it first.