Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com
| Resource | GET | PUT | POST | DELETE |
|---|---|---|---|---|
| Collection URI, such as http://example.com/resources | List the URIs and perhaps other details of the collection's members. | Replace the entire collection with another collection. | Create a new entry in the collection. The new entry's URI is assigned automatically and is usually returned by the operation. | Delete the entire collection. |
| Element URI, such as http://example.com/resources/item17 | Retrieve a representation of the addressed member of the collection, expressed in an appropriate Internet media type. | Replace the addressed member of the collection, or if it doesn't exist, create it. | Not generally used. Treat the addressed member as a collection in its own right and create a new entry in it. | Delete the addressed member of the collection. |
but first...any questions?
In [1]:
%load_ext rmagic
In [2]:
%%R
anscombe1 <- anscombe[,c('x1', 'y1')]
Consider the following dataset:
In [3]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange",
cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
abline(lm(y1 ~ x1, data = anscombe), col = "blue")
t(anscombe1)
In [4]:
%%R
apply(anscombe1, 2, mean)
In [5]:
%%R
apply(anscombe1, 2, var)
In [6]:
%%R
cor(anscombe$x1, anscombe$y1)
In [7]:
%%R
lm(y1 ~ x1, data = anscombe)
Q: Does everyone know what the mean is?
Q: Does anyone know what the variance is? More importantly, do you know what the variance tells you?
Q: Does everyone know what correlation is? More importantly, do you know what correlation tells you?
Q: Does everyone know what a line of best fit is? More importantly, do you know what the line of best fit tells you? (Note: we will look at this next time.)
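For anyone unsure about these questions, here is a minimal sketch in plain Python of what the mean, variance, and correlation compute. The x1 and y1 values are transcribed from R's built-in anscombe dataset. The helper names (mean, sample_variance, pearson_r) are illustrative and not taken from any library. The variance uses the n - 1 denominator, which is what R's var() uses.

```python
from math import sqrt

# First x/y pair of Anscombe's quartet, transcribed from R's built-in `anscombe`.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def mean(values):
    """Arithmetic mean: the sum divided by the count."""
    return sum(values) / len(values)


def sample_variance(values):
    """Average squared distance from the mean, with an n - 1 denominator."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / sqrt(sample_variance(xs) * sample_variance(ys))


print(mean(x1), mean(y1))
print(sample_variance(x1), sample_variance(y1))
print(pearson_r(x1, y1))
```

The variance says how spread out a variable is around its mean. The correlation says how closely two variables move together along a straight line, on a scale from -1 to 1.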
In [8]:
%%R
## Fit the same linear model to each of the four x/y pairs.
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ## Swap the i-th column names into the formula: y1 ~ x1, y2 ~ x2, ...
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  print(anova(lmi))
}
Now, suppose I give you three more datasets with exactly the same characteristics…
Q: how similar are these datasets?
In [9]:
%%R
anscombe
In [10]:
%%R
mydata=with(anscombe, data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4),
group=gl(4,nrow(anscombe))))
aggregate(.~group, data=mydata, mean)
In [11]:
%%R
aggregate(.~group, data=mydata, var)
In [12]:
%%R -w 960 -h 480 -u px
library(ggplot2)
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) +
geom_smooth(method='lm', alpha=0, fullrange=TRUE) + facet_wrap(~group)
Q: how similar are these datasets?
A: not very!
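The point is that summary numbers alone cannot tell these datasets apart. As a rough cross-check of the aggregate() calls, the sketch below recomputes the per-group mean and variance in plain Python with the standard statistics module. The values are transcribed from R's built-in anscombe dataset. The names quartet and summary are illustrative.

```python
from statistics import mean, variance

# Anscombe's quartet, transcribed from R's built-in `anscombe` dataset.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x1, x2 and x3 are identical
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One row of summary statistics per group, like aggregate(. ~ group, ...).
summary = {}
for group, (xs, ys) in quartet.items():
    summary[group] = {
        "x_mean": mean(xs),
        "x_var": variance(xs),  # sample variance (n - 1), like R's var()
        "y_mean": mean(ys),
        "y_var": variance(ys),
    }
    print(group, {name: round(value, 2) for name, value in summary[group].items()})
```

The four rows come out nearly identical, yet the faceted plot shows four very different shapes. That gap is the reason to plot data before trusting its summary.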
visualizing distributions
Like Pandas, R uses Data Frames.
In [13]:
%%R -o anscombe
is(anscombe)
Data Frames (or 'data.frame's) are derived from lists, which are derived from oldClasses, which are derived from vectors. Therefore a data.frame is a list, which is an oldClass, which is a vector.
is() is similar to type() in Python: it tells you what class an object has. Unlike type(), it also lists every class that class extends.
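For comparison, here is a small Python sketch of the same idea. It uses collections.OrderedDict only as a convenient example of a class with a visible parent. The variable names are illustrative.

```python
from collections import OrderedDict

d = OrderedDict(a=1)

# type() reports only the most specific class of the object...
print(type(d).__name__)

# ...while the method resolution order lists that class and everything it
# inherits from, which is closer to what R's is() prints.
lineage = [cls.__name__ for cls in type(d).__mro__]
print(lineage)

# isinstance() answers "is this object a kind of X?" for any class in the lineage.
print(isinstance(d, dict))
```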
Another important concept in R is the formula.
y1 ~ x1 describes a model in which y1 (the response) is explained by x1 (the predictor).
In [14]:
%%R
lm(y1 ~ x1, data = anscombe)
One of the most useful functions in both R and Pandas is plot().
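One of the biggest differences between R and Python is that in R, functions are generally not attached to objects as methods. So instead of calling DataFrame.plot() as we would in Python, we call plot(data.frame), and plot() decides what to draw based on the class of its argument.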
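Python has a rough analogue of this style in functools.singledispatch, where one function picks an implementation from the type of its first argument. The sketch below is only an illustration of that idea, not how R implements plot(). The function name describe is hypothetical.

```python
from functools import singledispatch


@singledispatch
def describe(obj):
    """Fallback used when no more specific implementation is registered."""
    return "something else"


@describe.register(list)
def _describe_list(obj):
    return f"a list of {len(obj)} items"


@describe.register(dict)
def _describe_dict(obj):
    return f"a dict with keys {sorted(obj)}"


# The caller always writes describe(thing), never thing.describe().
print(describe([1, 2, 3]))
print(describe({"x1": 10, "y1": 8.04}))
print(describe(3.14))
```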
In [15]:
%%R
plot(y1 ~ x1, data = anscombe)
plot() accepts many arguments that can be used to make it prettier and easier to understand.
In [16]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange",
cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
R is all about models. One of the simplest is the linear model, which tries to find a linear relationship between the independent (or explanatory) and dependent (or response) variables.
In [17]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange",
cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
abline(lm(y1 ~ x1, data = anscombe), col = "blue")
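To see what the blue line represents, here is a minimal sketch of a one-variable least-squares fit in plain Python. It uses the textbook closed form (slope = S_xy / S_xx, intercept = mean(y) - slope * mean(x)) rather than whatever lm() does internally. The data are transcribed from R's built-in anscombe dataset, and the variable names are illustrative.

```python
# First x/y pair of Anscombe's quartet, transcribed from R's built-in `anscombe`.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x1)
x_bar = sum(x1) / n
y_bar = sum(y1) / n

# Sum of cross-products and sum of squares around the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x1, y1))
s_xx = sum((x - x_bar) ** 2 for x in x1)

# The least-squares line passes through (x_bar, y_bar) with this slope.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

print(f"y1 = {intercept:.4f} + {slope:.4f} * x1")
```

The intercept comes out close to 3 and the slope close to 0.5, which should agree with the coefficients that lm(y1 ~ x1, data = anscombe) reports.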
In [18]:
%%R
cor.test(anscombe$x1, anscombe$y1)
In [19]:
%%R
anova(lm(y1 ~ x1, data = anscombe))
R likes tables long and narrow:
In [20]:
%%R
anscombe
In [21]:
%%R
with(anscombe, data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4),
group=gl(4,nrow(anscombe))))
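The same wide-to-long reshaping can be sketched in plain Python. To keep it short, the example uses only the first two rows of R's built-in anscombe dataset. The names wide and long are illustrative, and the output columns mirror the xVal, yVal, and group columns built by the R cell.

```python
# First two rows of R's built-in `anscombe` dataset, in its wide layout.
wide = [
    {"x1": 10, "x2": 10, "x3": 10, "x4": 8,
     "y1": 8.04, "y2": 9.14, "y3": 7.46, "y4": 6.58},
    {"x1": 8, "x2": 8, "x3": 8, "x4": 8,
     "y1": 6.95, "y2": 8.14, "y3": 6.77, "y4": 5.76},
]

# Stack the four x/y pairs on top of each other, group by group,
# tagging each row with the group it came from.
long = [
    {"xVal": row[f"x{g}"], "yVal": row[f"y{g}"], "group": g}
    for g in range(1, 5)
    for row in wide
]

for record in long:
    print(record)
```

Each wide row with four x/y pairs becomes four long rows, so grouping, faceting, and aggregating can all key off the single group column.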
In [22]:
%%R
ggplot(subset(mydata, group==1), aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4)
In [23]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) +
facet_wrap(~group)
In [24]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) +
geom_smooth(method='lm', alpha=0, fullrange=TRUE) + facet_wrap(~group)
In [25]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) +
geom_smooth(method='lm') + facet_wrap(~group)