Data Science

Visualizing Data with R and ggplot2

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Agenda

  1. Data: Recap
  2. Science: Exploration & Visualization
  3. Lab

Data Recap

  1. Python
  2. Big Data / Unstructured Data
    1. Hadoop
    2. IPython.parallel & StarCluster
  3. APIs & NoSQL / Semi-Structured Data
  4. SQL & Normalization / Structured Data

Day 1

Birthday Paradox

Moral: Test Assumptions, Be Rigorous

Data Science Workflow:

  1. Obtain
  2. Scrub
  3. Explore
  4. Model
  5. Interpret

Python

  • Working with strings!
  • You wrote an app!

Day 2

Big Data

  • Volume
  • Velocity
  • Variety

Map/Reduce

  • Moving Code to Data
  • Word Count Example
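
To recall the pattern, here is a toy sketch of word count in plain R (not Hadoop); the lines vector is made-up data:

    # Map: emit a (word, 1) pair for every word on every line.
    lines <- c("the quick brown fox", "the lazy dog", "the fox")
    pairs <- unlist(lapply(strsplit(lines, " "),
                           function(ws) setNames(rep(1, length(ws)), ws)))

    # Reduce: group the pairs by key (the word) and sum the counts.
    counts <- tapply(pairs, names(pairs), sum)
    counts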

Elastic MapReduce

  • You scripted and ran a Hadoop job!

Day 3

Amazon Web Services

  • IAM: Identity and Access Management
  • S3: Simple Storage Service
  • EC2: Elastic Compute Cloud
  • EMR: Elastic MapReduce

IPython.parallel and StarCluster

  • Bringing data to code
  • You processed 500 GB of Wikipedia data!

Day 4

RESTful web API HTTP methods

Collection URI, such as http://example.com/resources:

  • GET: List the URIs and perhaps other details of the collection's members.
  • PUT: Replace the entire collection with another collection.
  • POST: Create a new entry in the collection. The new entry's URI is assigned automatically and is usually returned by the operation.
  • DELETE: Delete the entire collection.

Element URI, such as http://example.com/resources/item17:

  • GET: Retrieve a representation of the addressed member of the collection, expressed in an appropriate Internet media type.
  • PUT: Replace the addressed member of the collection, or if it doesn't exist, create it.
  • POST: Not generally used. Treat the addressed member as a collection in its own right and create a new entry in it.
  • DELETE: Delete the addressed member of the collection.
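
In R, these verbs map onto functions in the httr package. A minimal sketch (the example.com URIs above are placeholders, so these calls won't return anything useful):

    library(httr)

    # GET: retrieve a representation of one member of the collection
    resp <- GET("http://example.com/resources/item17")
    status_code(resp)

    # PUT: replace the addressed member, or create it if it doesn't exist
    PUT("http://example.com/resources/item17",
        body = list(name = "example"), encode = "json")

    # DELETE: remove the addressed member
    DELETE("http://example.com/resources/item17")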

(non-relational) Database Management Systems

  • ACID
    • Atomicity
    • Consistency
    • Isolation
    • Durability
  • CAP
    • Consistency
    • Availability
    • Partition Tolerance
  • CAP Theorem
    • Pick two!

Twitter and Mongo

  • You pulled data from Twitter's API and stored it in Mongo!
  • You then queried Mongo to discover facts about Twitter data!

Day 5

Normalization

  • "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."

Munging Data with Pandas

  • Using Pandas you transformed and plotted Enron's email data

This brings us to...

Science!!!

but first...any questions?

Projects

  • Code
    • must be in Github along with at least one visualization
  • Data
    • If small (< 1 MB), put it in Github.
    • If medium (i.e., in [1, 10) MB), put it in S3 with instructions in Github on how to retrieve it.
    • If large, put a small or medium sample in Github or S3 (your choice).

Key Objectives

  • Become familiar with the R environment
  • Explore data in R
  • Visualize data using ggplot2

Visualization as a Medium


In [1]:
%load_ext rmagic

In [2]:
%%R
anscombe1 <- anscombe[,c('x1', 'y1')]

Consider the following dataset:


In [3]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange", 
     cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
abline(lm(y1 ~ x1, data = anscombe), col = "blue")
t(anscombe1)


    [,1] [,2]  [,3] [,4]  [,5]  [,6] [,7] [,8]  [,9] [,10] [,11]
x1 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00  7.00  5.00
y1  8.04 6.95  7.58 8.81  8.33  9.96 7.24 4.26 10.84  4.82  5.68

In [4]:
%%R
apply(anscombe1, 2, mean)


      x1       y1 
9.000000 7.500909 

In [5]:
%%R
apply(anscombe1, 2, var)


       x1        y1 
11.000000  4.127269 

In [6]:
%%R
cor(anscombe$x1, anscombe$y1)


[1] 0.8164205

In [7]:
%%R
lm(y1 ~ x1, data = anscombe)


Call:
lm(formula = y1 ~ x1, data = anscombe)

Coefficients:
(Intercept)           x1  
     3.0001       0.5001  

Q: Does everyone know what the mean is?
Q: Does anyone know what the variance is? More importantly, do you know what the variance tells you?
Q: Does everyone know what correlation is? More importantly, do you know what correlation tells you?
Q: Does everyone know what a line of best fit is? More importantly, do you know what the line of best fit tells you? (Note: we will look at this next time.)
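
For reference, each of these statistics can be computed by hand in base R; a sketch using the first Anscombe pair:

    x <- anscombe$x1; y <- anscombe$y1; n <- length(x)
    mean_x <- sum(x) / n                                   # mean: the balance point of the data
    var_x  <- sum((x - mean_x)^2) / (n - 1)                # variance: average squared distance from the mean
    cov_xy <- sum((x - mean_x) * (y - mean(y))) / (n - 1)  # covariance: how x and y vary together
    cor_xy <- cov_xy / (sd(x) * sd(y))                     # correlation: covariance rescaled to [-1, 1]
    c(mean = mean_x, var = var_x, cor = cor_xy)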


In [8]:
%%R
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  print(anova(lmi))
}


Analysis of Variance Table

Response: y1
          Df Sum Sq Mean Sq F value  Pr(>F)   
x1         1 27.510 27.5100   17.99 0.00217 **
Residuals  9 13.763  1.5292                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y2
          Df Sum Sq Mean Sq F value   Pr(>F)   
x2         1 27.500 27.5000  17.966 0.002179 **
Residuals  9 13.776  1.5307                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y3
          Df Sum Sq Mean Sq F value   Pr(>F)   
x3         1 27.470 27.4700  17.972 0.002176 **
Residuals  9 13.756  1.5285                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y4
          Df Sum Sq Mean Sq F value   Pr(>F)   
x4         1 27.490 27.4900  18.003 0.002165 **
Residuals  9 13.742  1.5269                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now, suppose I give you three more datasets with exactly the same characteristics…

Q: How similar are these datasets?


In [9]:
%%R
anscombe


   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

In [10]:
%%R
mydata <- with(anscombe, data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4),
                                    group=gl(4,nrow(anscombe))))  # gl() generates the grouping factor 1-4
aggregate(.~group, data=mydata, mean)


  group xVal     yVal
1     1    9 7.500909
2     2    9 7.500909
3     3    9 7.500000
4     4    9 7.500909

In [11]:
%%R
aggregate(.~group, data=mydata, var)


  group xVal     yVal
1     1   11 4.127269
2     2   11 4.127629
3     3   11 4.122620
4     4   11 4.123249

In [12]:
%%R -w 960 -h 480 -u px
library(ggplot2)
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
    geom_smooth(method='lm', alpha=0, fullrange=TRUE) + facet_wrap(~group)


Q: How similar are these datasets?

A: Not very!

Visualizing Distributions
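
As a minimal sketch of what this looks like with ggplot2, a histogram of the long-format mydata built above shows each group's distribution (the binwidth of 1 is an arbitrary choice):

    library(ggplot2)
    ggplot(mydata, aes(x = yVal)) +
        geom_histogram(binwidth = 1) +   # binned counts reveal the shape of each distribution
        facet_wrap(~group)               # one panel per Anscombe dataset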

R

Like Pandas, R uses Data Frames.


In [13]:
%%R -o anscombe
is(anscombe)


[1] "data.frame" "list"       "oldClass"   "vector"    

Data frames (class 'data.frame') are derived from lists, which are derived from oldClass, which is derived from vectors. In other words, a data.frame is a list, which is an oldClass, which is a vector.
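
Because a data.frame is a list of columns, list operations apply to it directly; for example:

    length(anscombe)        # number of columns, i.e. list elements: 8
    anscombe[["x1"]]        # extract a column the way you'd extract a list element
    sapply(anscombe, mean)  # iterate over the columns, list-style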

is() is similar to type() in Python and can be used to tell you what type an object is.

Linear Models:

  • Independent (Explanatory) Variables
  • Dependent (Response) Variables

Another important concept in R is the formula.
y1 ~ x1 models a relationship between y1 and x1 such that y1 is determined by x1.


In [14]:
%%R
lm(y1 ~ x1, data = anscombe)


Call:
lm(formula = y1 ~ x1, data = anscombe)

Coefficients:
(Intercept)           x1  
     3.0001       0.5001  

One of the most useful functions in both R and Pandas is plot().

One of the biggest differences between R and Python is that in R, methods belong to generic functions rather than to objects, and R dispatches on the class of the argument. So instead of calling DataFrame.plot() as we would in Python, we call plot(data.frame).


In [15]:
%%R
plot(y1 ~ x1, data = anscombe)


plot() accepts many arguments that can be used to make it prettier and easier to understand.


In [16]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange", 
     cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))


R is all about models. One of the simplest is the linear model, which tries to find a linear relationship between the independent (or explanatory) and dependent (or response) variables.

Plotting a linear regression:


In [17]:
%%R
plot(y1 ~ x1, data = anscombe, col = "red", pch = 21, bg = "orange", 
     cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
abline(lm(y1 ~ x1, data = anscombe), col = "blue")


Correlation


In [18]:
%%R
cor.test(anscombe$x1, anscombe$y1)


	Pearson's product-moment correlation

data:  anscombe$x1 and anscombe$y1
t = 4.2415, df = 9, p-value = 0.00217
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4243912 0.9506933
sample estimates:
      cor 
0.8164205 

Analysis of Variance (ANOVA)


In [19]:
%%R
anova(lm(y1 ~ x1, data = anscombe))


Analysis of Variance Table

Response: y1
          Df Sum Sq Mean Sq F value  Pr(>F)   
x1         1 27.510 27.5100   17.99 0.00217 **
Residuals  9 13.763  1.5292                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R likes tables long and narrow:


In [20]:
%%R
anscombe


   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

In [21]:
%%R
with(anscombe, data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), 
                                 group=gl(4,nrow(anscombe))))


   xVal  yVal group
1    10  8.04     1
2     8  6.95     1
3    13  7.58     1
4     9  8.81     1
5    11  8.33     1
6    14  9.96     1
7     6  7.24     1
8     4  4.26     1
9    12 10.84     1
10    7  4.82     1
11    5  5.68     1
12   10  9.14     2
13    8  8.14     2
14   13  8.74     2
15    9  8.77     2
16   11  9.26     2
17   14  8.10     2
18    6  6.13     2
19    4  3.10     2
20   12  9.13     2
21    7  7.26     2
22    5  4.74     2
23   10  7.46     3
24    8  6.77     3
25   13 12.74     3
26    9  7.11     3
27   11  7.81     3
28   14  8.84     3
29    6  6.08     3
30    4  5.39     3
31   12  8.15     3
32    7  6.42     3
33    5  5.73     3
34    8  6.58     4
35    8  5.76     4
36    8  7.71     4
37    8  8.84     4
38    8  8.47     4
39    8  7.04     4
40    8  5.25     4
41   19 12.50     4
42    8  5.56     4
43    8  7.91     4
44    8  6.89     4

ggplot2

ggplot2 layers geometric elements


In [22]:
%%R
ggplot(subset(mydata, group==1), aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4)


facet_wrap allows you to split your plot by group:


In [23]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
                                     facet_wrap(~group)


geom_smooth can display a linear model fit:


In [24]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
    geom_smooth(method='lm', alpha=0, fullrange=TRUE) + facet_wrap(~group)



In [25]:
%%R
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point(col = "red", size=4) + 
    geom_smooth(method='lm') + facet_wrap(~group)


Lab

Discussion

ggplot2 vs. other visualization libraries

  • vs. the native R plot library
  • vs. matplotlib
  • vs. D3.js
  • vs. Google Viz
  • vs. Tableau
  • vs. ???

Ease of Use vs. Expressive Power (vs. Interactivity (vs. ?))

Next Time:

PROJECTS LIVE!

REGRESSION AND REGULARIZATION