Data Science

Visualizing Data with R and ggplot2

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Agenda

  1. Data: Recap
  2. Science: Exploration & Visualization
  3. Lab

Data Recap

  1. Python
  2. Big Data / Unstructured Data
    1. Hadoop
    2. IPython.parallel & StarCluster
  3. APIs & NoSQL / Semi-Structured Data
  4. SQL & Normalization / Structured Data

Day 1

Birthday Paradox

Moral: Test Assumptions, Be Rigorous

Data Science Workflow:

  1. Obtain
  2. Scrub
  3. Explore
  4. Model
  5. Interpret

Python

  • Working with strings!
  • You wrote an app!

Day 2

Big Data

  • Volume
  • Velocity
  • Variety

Map/Reduce

  • Moving Code to Data
  • Word Count Example
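
To recall the pattern, here is a toy sketch of word count in plain R (not Hadoop); the lines vector is made-up data:

    # Map: emit a (word, 1) pair for every word on every line.
    lines <- c("the quick brown fox", "the lazy dog", "the fox")
    pairs <- unlist(lapply(strsplit(lines, " "),
                           function(ws) setNames(rep(1, length(ws)), ws)))

    # Reduce: group the pairs by key (the word) and sum the counts.
    counts <- tapply(pairs, names(pairs), sum)
    counts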

Elastic MapReduce

  • You scripted and ran a Hadoop job!

Day 3

Amazon Web Services

  • IAM: Identity and Access Management
  • S3: Simple Storage Service
  • EC2: Elastic Compute Cloud
  • EMR: Elastic MapReduce

IPython.parallel and StarCluster

  • Bringing data to code
  • You processed 500 GB of Wikipedia data!

Day 4

RESTful web API HTTP methods

Collection URI, such as http://example.com/resources:

  • GET: List the URIs and perhaps other details of the collection's members.
  • PUT: Replace the entire collection with another collection.
  • POST: Create a new entry in the collection. The new entry's URI is assigned automatically and is usually returned by the operation.
  • DELETE: Delete the entire collection.

Element URI, such as http://example.com/resources/item17:

  • GET: Retrieve a representation of the addressed member of the collection, expressed in an appropriate Internet media type.
  • PUT: Replace the addressed member of the collection, or if it doesn't exist, create it.
  • POST: Not generally used. Treat the addressed member as a collection in its own right and create a new entry in it.
  • DELETE: Delete the addressed member of the collection.
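
In R, these verbs map onto functions in the httr package. A minimal sketch (the example.com URIs above are placeholders, so these calls won't return anything useful):

    library(httr)

    # GET: retrieve a representation of one member of the collection
    resp <- GET("http://example.com/resources/item17")
    status_code(resp)

    # PUT: replace the addressed member, or create it if it doesn't exist
    PUT("http://example.com/resources/item17",
        body = list(name = "example"), encode = "json")

    # DELETE: remove the addressed member
    DELETE("http://example.com/resources/item17")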

(non-relational) Database Management Systems

  • ACID
    • Atomicity
    • Consistency
    • Isolation
    • Durability
  • CAP
    • Consistency
    • Availability
    • Partition Tolerance
  • CAP Theorem
    • Pick two!

Twitter and Mongo

  • You pulled data from Twitter's API and stored it in Mongo!
  • You then queried Mongo to discover facts about Twitter data!

Day 5

Normalization

  • "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."

Munging Data with Pandas

  • Using Pandas you transformed and plotted Enron's email data

This brings us to...

Science!!!

but first...any questions?

Projects

  • Code
    • must be in Github along with at least one visualization
  • Data
    • If small (< 1 MB), put it in Github.
    • If medium (i.e., in [1, 10) MB), put it in S3 with instructions in Github on how to retrieve it.
    • If large, put a small or medium sample in Github or S3 (your choice).

Key Objectives

  • Become familiar with the R environment
  • Explore data in R
  • Visualize data using ggplot2

Visualization as a Medium


In [1]:
%load_ext rmagic

In [2]:
%%R
anscombe1 <- anscombe[,c('x1', 'y1')]

Consider the following dataset:


In [3]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange", 
     cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
abline(lm(y1 ~ x1, data = anscombe), col = "blue")
t(anscombe1)


    [,1] [,2]  [,3] [,4]  [,5]  [,6] [,7] [,8]  [,9] [,10] [,11]
x1 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00  7.00  5.00
y1  8.04 6.95  7.58 8.81  8.33  9.96 7.24 4.26 10.84  4.82  5.68

In [4]:
%%R
apply(anscombe1, 2, mean)


      x1       y1 
9.000000 7.500909 

In [5]:
%%R
apply(anscombe1, 2, var)


       x1        y1 
11.000000  4.127269 

In [6]:
%%R
cor(anscombe$x1, anscombe$y1)


[1] 0.8164205

In [7]:
%%R
lm(y1 ~ x1, data = anscombe)


Call:
lm(formula = y1 ~ x1, data = anscombe)

Coefficients:
(Intercept)           x1  
     3.0001       0.5001  

Q: Does everyone know what the mean is?
Q: Does anyone know what the variance is? More importantly, do you know what the variance tells you?
Q: Does everyone know what correlation is? More importantly, do you know what correlation tells you?
Q: Does everyone know what a line of best fit is? More importantly, do you know what the line of best fit tells you? (Note: we will look at this next time.)
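
For reference, each of these statistics can be computed by hand in base R; a sketch using the first Anscombe pair:

    x <- anscombe$x1; y <- anscombe$y1; n <- length(x)
    mean_x <- sum(x) / n                                   # mean: the balance point of the data
    var_x  <- sum((x - mean_x)^2) / (n - 1)                # variance: average squared distance from the mean
    cov_xy <- sum((x - mean_x) * (y - mean(y))) / (n - 1)  # covariance: how x and y vary together
    cor_xy <- cov_xy / (sd(x) * sd(y))                     # correlation: covariance rescaled to [-1, 1]
    c(mean = mean_x, var = var_x, cor = cor_xy)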


In [8]:
%%R
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  print(anova(lmi))
}


Analysis of Variance Table

Response: y1
          Df Sum Sq Mean Sq F value  Pr(>F)   
x1         1 27.510 27.5100   17.99 0.00217 **
Residuals  9 13.763  1.5292                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y2
          Df Sum Sq Mean Sq F value   Pr(>F)   
x2         1 27.500 27.5000  17.966 0.002179 **
Residuals  9 13.776  1.5307                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y3
          Df Sum Sq Mean Sq F value   Pr(>F)   
x3         1 27.470 27.4700  17.972 0.002176 **
Residuals  9 13.756  1.5285                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y4
          Df Sum Sq Mean Sq F value   Pr(>F)   
x4         1 27.490 27.4900  18.003 0.002165 **
Residuals  9 13.742  1.5269                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now, suppose I give you three more datasets with exactly the same characteristics…

Q: How similar are these datasets?


In [9]:
%%R
anscombe


   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

In [10]:
%%R
mydata <- with(anscombe, data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4),
                                    group=gl(4,nrow(anscombe))))  # gl() generates the grouping factor 1-4
aggregate(.~group, data=mydata, mean)


  group xVal     yVal
1     1    9 7.500909
2     2    9 7.500909
3     3    9 7.500000
4     4    9 7.500909

In [11]:
%%R
aggregate(.~group, data=mydata, var)


  group xVal     yVal
1     1   11 4.127269
2     2   11 4.127629
3     3   11 4.122620
4     4   11 4.123249

In [12]:
%%R -w 960 -h 480 -u px
library(ggplot2)
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
    geom_smooth(method='lm', alpha=0, fullrange=TRUE) + facet_wrap(~group)


Q: How similar are these datasets?

A: Not very!

Visualizing Distributions
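
As a minimal sketch of what this looks like with ggplot2, a histogram of the long-format mydata built above shows each group's distribution (the binwidth of 1 is an arbitrary choice):

    library(ggplot2)
    ggplot(mydata, aes(x = yVal)) +
        geom_histogram(binwidth = 1) +   # binned counts reveal the shape of each distribution
        facet_wrap(~group)               # one panel per Anscombe dataset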

R

Like Pandas, R uses Data Frames.


In [13]:
%%R -o anscombe
is(anscombe)


[1] "data.frame" "list"       "oldClass"   "vector"    

Data frames (class 'data.frame') are derived from lists, which are derived from oldClass, which is derived from vectors. In other words, a data.frame is a list, which is an oldClass, which is a vector.
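
Because a data.frame is a list of columns, list operations apply to it directly; for example:

    length(anscombe)        # number of columns, i.e. list elements: 8
    anscombe[["x1"]]        # extract a column the way you'd extract a list element
    sapply(anscombe, mean)  # iterate over the columns, list-style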

is() is similar to type() in Python and can be used to tell you what type an object is.

Linear Models:

  • Independent (Explanatory) Variables
  • Dependent (Response) Variables

Another important concept in R is the formula.
y1 ~ x1 models a relationship between y1 and x1 such that y1 is determined by x1.


In [14]:
%%R
lm(y1 ~ x1, data = anscombe)


Call:
lm(formula = y1 ~ x1, data = anscombe)

Coefficients:
(Intercept)           x1  
     3.0001       0.5001  

One of the most useful functions in both R and Pandas is plot().

One of the biggest differences between R and Python is that in R, methods belong to generic functions rather than to objects, and R dispatches on the class of the argument. So instead of calling DataFrame.plot() as we would in Python, we call plot(data.frame).


In [15]:
%%R
plot(y1 ~ x1, data = anscombe)


plot() accepts many arguments that can be used to make it prettier and easier to understand.


In [16]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange", 
     cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))


R is all about models. One of the simplest is the linear model, which tries to find a linear relationship between the independent (or explanatory) and dependent (or response) variables.

Plotting a linear regression:


In [17]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange", 
     cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
abline(lm(y1 ~ x1, data = anscombe), col = "blue")


Correlation


In [18]:
%%R
cor.test(anscombe$x1, anscombe$y1)


	Pearson's product-moment correlation

data:  anscombe$x1 and anscombe$y1
t = 4.2415, df = 9, p-value = 0.00217
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4243912 0.9506933
sample estimates:
      cor 
0.8164205 

Analysis of Variance (ANOVA)


In [19]:
%%R
anova(lm(y1 ~ x1, data = anscombe))


Analysis of Variance Table

Response: y1
          Df Sum Sq Mean Sq F value  Pr(>F)   
x1         1 27.510 27.5100   17.99 0.00217 **
Residuals  9 13.763  1.5292                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R likes tables long and narrow:


In [20]:
%%R
anscombe


   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

In [21]:
%%R
with(anscombe, data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), 
                                 group=gl(4,nrow(anscombe))))


   xVal  yVal group
1    10  8.04     1
2     8  6.95     1
3    13  7.58     1
4     9  8.81     1
5    11  8.33     1
6    14  9.96     1
7     6  7.24     1
8     4  4.26     1
9    12 10.84     1
10    7  4.82     1
11    5  5.68     1
12   10  9.14     2
13    8  8.14     2
14   13  8.74     2
15    9  8.77     2
16   11  9.26     2
17   14  8.10     2
18    6  6.13     2
19    4  3.10     2
20   12  9.13     2
21    7  7.26     2
22    5  4.74     2
23   10  7.46     3
24    8  6.77     3
25   13 12.74     3
26    9  7.11     3
27   11  7.81     3
28   14  8.84     3
29    6  6.08     3
30    4  5.39     3
31   12  8.15     3
32    7  6.42     3
33    5  5.73     3
34    8  6.58     4
35    8  5.76     4
36    8  7.71     4
37    8  8.84     4
38    8  8.47     4
39    8  7.04     4
40    8  5.25     4
41   19 12.50     4
42    8  5.56     4
43    8  7.91     4
44    8  6.89     4

ggplot2

ggplot2 layers geometric elements


In [22]:
%%R
ggplot(subset(mydata, group==1), aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4)


facet_wrap allows you to split your plot by group:


In [23]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
                                     facet_wrap(~group)


geom_smooth can display a linear model fit:


In [24]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
    geom_smooth(method='lm', alpha=0, fullrange=TRUE) + facet_wrap(~group)



In [25]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
    geom_smooth(method='lm') + facet_wrap(~group)


Lab

Discussion

ggplot2 vs. other visualization libraries

  • vs. the native R plot library
  • vs. matplotlib
  • vs. D3.js
  • vs. Google Viz
  • vs. Tableau
  • vs. ???

Ease of Use vs. Expressive Power (vs. Interactivity (vs. ?))

Next Time:

PROJECTS LIVE!

REGRESSION AND REGULARIZATION