1. Introduction

Motivation:

  1. Finding and exctracting data from various sources
  2. Pronciples of tidy data, and how to make data tidy
  3. Practical implementation through a range of R packages

Raw Data -> Processing script -> tidy data -> data analysis -> data communication

  • All processing steps should be recorded

Downloading and Reading Data (basic)

  • File and directory handling:

setwd(), getwd()

file.exists("directoryName")

dir.create("directoryName")


In [1]:
if (!file.exists("data/")) {
    dir.create("data/")
}

download.file() to download a file from internet


In [2]:
fileUrl <- "http://www.brianknaus.com/software/srtoolbox/s_4_1_sequence80.txt"
download.file(fileUrl, destfile="./data/s_4_1_sequence80.txt", method="curl")
  • Read in data:

read.table()

read.csv()

read.fwf() to read files that have fixed-width-format

If the file contains quotation marks, set qoute=""

Reading excel files:

read.xlsx() from library xlsx


In [3]:
if(!file.exists("data/")) {dir.create("data")}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl, destfile="./data/cameras.xlsx", method="curl")
dateDownloaded <- date

require(xlsx)
cameraData <- read.xlsx("data/cameras.xlsx", sheetIndex=1, header=T)
head(cameraData)


packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: xlsx

packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: rJava

packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: xlsxjars

Out[3]:
                         address direction      street  crossStreet
1       S CATON AVE & BENSON AVE       N/B   Caton Ave   Benson Ave
2       S CATON AVE & BENSON AVE       S/B   Caton Ave   Benson Ave
3 WILKENS AVE & PINE HEIGHTS AVE       E/B Wilkens Ave Pine Heights
4        THE ALAMEDA & E 33RD ST       S/B The Alameda      33rd St
5        E 33RD ST & THE ALAMEDA       E/B      E 33rd  The Alameda
6        ERDMAN AVE & N MACON ST       E/B      Erdman     Macon St
                intersection                      Location.1
1     Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
2     Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
3 Wilkens Ave & Pine Heights  (39.2720252302, -76.676960806)
4     The Alameda  & 33rd St (39.3285013141, -76.5953545714)
5      E 33rd  & The Alameda (39.3283410623, -76.5953594625)
6         Erdman  & Macon St (39.3068045671, -76.5593167803)
  • Use colIndex= and rowIndex= to read specefic columns/rows

In [4]:
read.xlsx("data/cameras.xlsx", sheetIndex=1, header=T, colIndex=2:3, rowIndex=1:5)


Out[4]:
  direction      street
1       N/B   Caton Ave
2       S/B   Caton Ave
3       E/B Wilkens Ave
4       S/B The Alameda
  • Other packages for reading excel files XLConnect

Reading JSON files

  • JSON: Javascript Object Notation
  • Lightweight data storage
  • Common format for data ftom APIs

In JSON, Data stored as

  • numbers (double)
  • string ("")
  • boolean (true or false)
  • array (ordered, [, , ])
  • object (unordered, {key1:value, key2:value})

In [5]:
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/mirjalil/repos")
head(names(jsonData))
names(jsonData$owner)
jsonData$owner$login


Out[5]:
[1] "id"        "name"      "full_name" "owner"     "private"   "html_url" 
Out[5]:
 [1] "login"               "id"                  "avatar_url"         
 [4] "gravatar_id"         "url"                 "html_url"           
 [7] "followers_url"       "following_url"       "gists_url"          
[10] "starred_url"         "subscriptions_url"   "organizations_url"  
[13] "repos_url"           "events_url"          "received_events_url"
[16] "type"                "site_admin"         
Out[5]:
 [1] "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil"
 [7] "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil"
[13] "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil"
[19] "mirjalil" "mirjalil" "mirjalil"

Writing data.frames to JSON format

toJSON()


In [6]:
myjson <- toJSON(iris[1:2,], pretty=T)
cat(myjson)


[
	{
		"Sepal.Length" : 5.1,
		"Sepal.Width" : 3.5,
		"Petal.Length" : 1.4,
		"Petal.Width" : 0.2,
		"Species" : "setosa"
	},
	{
		"Sepal.Length" : 4.9,
		"Sepal.Width" : 3,
		"Petal.Length" : 1.4,
		"Petal.Width" : 0.2,
		"Species" : "setosa"
	}
]

Reading XML files


In [7]:
require(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"

xmlDoc <- xmlTreeParse(fileUrl, useInternal=T)

rootNode <- xmlRoot(xmlDoc)
names(rootNode)

xmlName(rootNode)


packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: XML

Out[7]:
  food   food   food   food   food 
"food" "food" "food" "food" "food" 
Out[7]:
[1] "breakfast_menu"

Accessing elements of XML


In [8]:
rootNode[[1]]

rootNode[[1]][[1]]


Out[8]:
<food>
  <name>Belgian Waffles</name>
  <price>$5.95</price>
  <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
</food> 
Out[8]:
<name>Belgian Waffles</name> 

In [9]:
xmlSApply(rootNode, xmlValue)


Out[9]:
                                                                                                                    food 
                              "Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650" 
                                                                                                                    food 
                   "Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream900" 
                                                                                                                    food 
"Berry-Berry Belgian Waffles$8.95Light Belgian waffles covered with an assortment of fresh berries and whipped cream900" 
                                                                                                                    food 
                                               "French Toast$4.50Thick slices made from our homemade sourdough bread600" 
                                                                                                                    food 
                        "Homestyle Breakfast$6.95Two eggs, bacon or sausage, toast, and our ever-popular hash browns950" 

Using xpath


In [10]:
xpathSApply(rootNode, "//name", xmlValue)


Out[10]:
[1] "Belgian Waffles"             "Strawberry Belgian Waffles" 
[3] "Berry-Berry Belgian Waffles" "French Toast"               
[5] "Homestyle Breakfast"        

In [11]:
xpathSApply(rootNode, "//price", xmlValue)


Out[11]:
[1] "$5.95" "$7.95" "$8.95" "$4.50" "$6.95"

Using data.table package

  • Often faster and more moemory efficient than data.frame
  • All the functions that work on data.frames, also work on data.tables
  • Much faster at subseeting, group, and updating

In [10]:
library(data.table)
df <- data.frame(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z=rnorm(9))

dt <- data.table(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z=rnorm(9))

## to see all the data.tables in memory
tables()


     NAME NROW MB COLS  KEY
[1,] dt      9 1  x,y,z    
Total: 1MB

In [11]:
## subsetting
dt[1:3, list(x,y)]

dt[, table(y)]


Error in `[.data.frame`(x, i, j): object 'x' not found
Error in table(y): object 'y' not found

In [12]:
## Passing a list of functions to apply:
dt[, list(mean(x), sum(z))]


Error in mean(x): object 'x' not found

adding a new column:


In [13]:
is.data.table(dt)
dt[,w:=z^2]

# Note: some commands of data.table are still not supported in IPython-Rkernel


Out[13]:
[1] TRUE
Error in `:=`(w, z^2): Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").

In [14]:
dt[,a:=x>0]
dt[,b:=mean(x+w), by=a]


Error in `:=`(a, x > 0): Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
Error in `:=`(b, mean(x + w)): Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").

Note: some commands of data.table are still not supported in IPython-Rkernel

Multiple operations:


In [15]:
dt[, m:={tmp <- x+z; log2(tmp+5)}]


Error in `:=`(m, {: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").

Keys

  • Using keys to subset and sort much faster

In [16]:
DT <- data.table(x=rep(c("a", "b", "c"), each=100), y=rnorm(300))
setkey(DT, x)
DT['a']


Error in `[.data.frame`(x, i): undefined columns selected
  • Using keys to facilitate joing two data tables:

In [17]:
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x)
setkey(DT2, x)
merged(DT1, DT2)


Error in eval(expr, envir, enclos): could not find function "merged"

Reading data from files faster

Using fread() is much faster than read.table()


In [18]:
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
tfile=tempfile()

write.table(big_df, file=tfile, row.names=T, sep="\t", quote=F)

## timeing by fread:
system.time(fread(tfile))

## timing by read.table
system.time(read.table(tfile, header=T, sep="\t"))


Out[18]:
   user  system elapsed 
  0.776   0.008   0.783 
Out[18]:
   user  system elapsed 
 24.840   0.144  25.013 

3. Reading From MySQL

  • On a server, there might be multiple databases, and on each databases might be multiple tables
  • Make a connection to a database-serve, and send a query:

In [19]:
require(RMySQL)
ucscDb <- dbConnect(MySQL(), user="genome", 
                    host="genome-mysql.cse.ucsc.edu")
result_allDBs <- dbGetQuery(ucscDb, "show databases;")
dbDisconnect(ucscDb)
head(result_allDBs)


packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: RMySQL

packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: DBI

Out[19]:
[1] TRUE
Out[19]:
            Database
1 information_schema
2            ailMel1
3            allMis1
4            anoCar1
5            anoCar2
6            anoGam1

In [20]:
#Specify a database:
hg19 <- dbConnect(MySQL(), user="genome", db="hg19", 
                  host="genome-mysql.cse.ucsc.edu")

allTables <- dbListTables(hg19)
length(allTables)
allTables[1:5]


Out[20]:
[1] 11003
Out[20]:
[1] "HInv"         "HInvGeneMrna" "acembly"      "acemblyClass" "acemblyPep"  

In [21]:
# list all the fields (column name) of a table:
dbListFields(hg19, "affyU133Plus2")


Out[21]:
 [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"     
 [6] "qNumInsert"  "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"     
[11] "qName"       "qSize"       "qStart"      "qEnd"        "tName"      
[16] "tSize"       "tStart"      "tEnd"        "blockCount"  "blockSizes" 
[21] "qStarts"     "tStarts"    

In [22]:
# find out the number of rows of a specific table:
dbGetQuery(hg19, "SELECT count(*) from affyU133Plus2")


Out[22]:
  count(*)
1    58463
  • Get the table as a data.frame dbReadTable():

In [23]:
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData[,1:8], n=3)


Out[23]:
  bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
1 585     530          4          0     23          3          41          3
2 585    3355         17          0    109          9          67          9
3 585    4156         14          0     83         16          18          2
  • Select only a subset of data with dbSendQuery()

In [24]:
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 30")
affyMis <- fetch(query)
quantile(affyMis$misMatches)


Out[24]:
  0%  25%  50%  75% 100% 
   1    1    3    6   30 
  • Fetch a small number of rows:

In [25]:
affyMisSmall <- fetch(query, n=10)
dbClearResult(query)
dim(affyMisSmall)


Out[25]:
[1] TRUE
Out[25]:
[1] 10 22
  • Close the connection when you are done

In [26]:
dbDisconnect(hg19)


Out[26]:
[1] TRUE

4. Data in HDF5 Format

www.hdfgroup.org

  • HDF5 used for storing large data sets
  • HDF == Hierarchical Data Format

  • groups containing datasets + metadata

  • datasets are multdimensional array of data

  • To install, use

    source("http://bioconductor.org/biocLite.R")

    biocLite("rhdf5")


In [27]:
library(rhdf5)
# create a new hdf5 file
created <- h5createFile("examples/example.h5")
# the output is just a boolean
created


simpleMessage in message("file '", file, "' already exists."): file 'examples/example.h5' already exists.

Out[27]:
[1] FALSE

In [28]:
created <- h5createGroup("examples/example.h5", "foo")
created <- h5createGroup("examples/example.h5", "baa")
created <- h5createGroup("examples/example.h5", "foo/subgroup")
h5ls("examples/example.h5")


simpleMessage in message("Can not create group. Object with name '", group, "' already exists."): Can not create group. Object with name 'foo' already exists.

simpleMessage in message("Can not create group. Object with name '", group, "' already exists."): Can not create group. Object with name 'baa' already exists.

simpleMessage in message("Can not create group. Object with name '", group, "' already exists."): Can not create group. Object with name 'foo/subgroup' already exists.

Out[28]:
          group     name       otype   dclass       dim
0             /      baa   H5I_GROUP                   
1             /       df H5I_DATASET COMPOUND         5
2             /      foo   H5I_GROUP                   
3          /foo        A H5I_DATASET  INTEGER     5 x 2
4          /foo subgroup   H5I_GROUP                   
5 /foo/subgroup        B H5I_DATASET    FLOAT 5 x 2 x 2

Writing/adding data to a specific group


In [29]:
A <- matrix(1:10, nr=5, nc=2)
h5write(A, "examples/example.h5", "foo/A")

B <- array(seq(0.1, 2.0, by=0.1), dim=c(5,2,2))
attr(B, "scale") <- "liter"
h5write(B, "examples/example.h5", "foo/subgroup/B")
h5ls("examples/example.h5")


Out[29]:
          group     name       otype   dclass       dim
0             /      baa   H5I_GROUP                   
1             /       df H5I_DATASET COMPOUND         5
2             /      foo   H5I_GROUP                   
3          /foo        A H5I_DATASET  INTEGER     5 x 2
4          /foo subgroup   H5I_GROUP                   
5 /foo/subgroup        B H5I_DATASET    FLOAT 5 x 2 x 2

In [30]:
## Writea data.frame
df <- data.frame(1L:5L, seq(0,1,length.out=5), 
                 c("ab", "cde", "fhgi", "a", "s"), stringAsFactors=FALSE)
h5write(df, "examples/example.h5", "df")
h5ls("examples/example.h5")


Error in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...): Cannot write data.frame. Object already exists. Subsetting for compound datatype not supported.
Out[30]:
          group     name       otype   dclass       dim
0             /      baa   H5I_GROUP                   
1             /       df H5I_DATASET COMPOUND         5
2             /      foo   H5I_GROUP                   
3          /foo        A H5I_DATASET  INTEGER     5 x 2
4          /foo subgroup   H5I_GROUP                   
5 /foo/subgroup        B H5I_DATASET    FLOAT 5 x 2 x 2

Reading data from HDF5


In [32]:
readA <- h5read("examples/example.h5", "foo/A")
readB <- h5read("examples/example.h5", "foo/subgroup/B")
read.df <- h5read("examples/example.h5", "df")
readA


Out[32]:
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Modifying an Existing Datasets


In [33]:
h5write(c(12,13,14), "examples/example.h5", "foo/A", index=list(1:3, 1))
h5read("examples/example.h5", "foo/A")


Out[33]:
     [,1] [,2]
[1,]   12    6
[2,]   13    7
[3,]   14    8
[4,]    4    9
[5,]    5   10

5. Reading Data from the Web (webscraping)

Webscraping: programmatically extracting data from HTML code


In [34]:
# create a connection
con <- url("http://scholar.google.com/citations?user=8YTgJ_cAAAAJ&hl=en")
# readlines from the connection
htmlCode <- readLines(con)
# close the connection
close(con)


simpleWarning in readLines(con): incomplete final line found on 'http://scholar.google.com/citations?user=8YTgJ_cAAAAJ&hl=en'

Process web data by XML package


In [19]:
library(XML)

url.address <- "http://scholar.google.com/citations?user=8YTgJ_cAAAAJ&hl=en"

html <- htmlTreeParse(url.address, useInternalNodes=T)

xpathSApply(html, "//title", xmlValue)

xpathSApply(html, "//td[@id='col-citedby']", xmlValue)


Out[19]:
[1] "Vahid Mirjalili - Google Scholar Citations"
Out[19]:
 [1] "Cited by" "39"       "15"       "7"        "4"        "1"       
 [7] ""         ""         ""         ""        

Using GET Command (httr package)


In [24]:
library(httr)
html2 <- GET(url.address)

content2 <- content(html2, as="text")

parsedHtml <- htmlParse(content2, asText=TRUE)

xpathSApply(parsedHtml, "//title", xmlValue)


Out[24]:
[1] "Vahid Mirjalili - Google Scholar Citations"

Example: parsing worldcup-2014 HTML


In [15]:
require(XML)

wcup <- htmlTreeParse("http://www.fifa.com/worldcup/matches/", useInternal=TRUE)

matches <- xpathSApply(wcup, "//span[starts-with(@class, 't-nText ')]", xmlValue)
#head(matches)

m_days <- xpathSApply(wcup, "//div[@class='mu-i-datetime']", xmlValue)
#head(m_days)

wcup.matches <- data.frame(
    team_A = matches[seq(1, 130, by=2)],
    team_B = matches[seq(2, 130, by=2)],
    datetime = m_days[1:65]
)

#head(wcup.matches)

## Print the German games:
wcup.matches[wcup.matches$team_A == "Germany" | wcup.matches$team_B == "Germany", ]


Out[15]:
    team_A   team_B                       datetime
12 Germany Portugal 16 Jun 2014 - 13:00 Local time
29 Germany Portugal 16 Jun 2014 - 13:00 Local time
45 Germany    Ghana 21 Jun 2014 - 16:00 Local time
63     USA  Germany 26 Jun 2014 - 13:00 Local time

6. Getting Data from APIs

Using httr package to get data from APIs (twitter, facebook, github, ...)

  • Create a developer account at https://dev.twitter.com/apps
  • Create an application, with authentication keys: OAuth
  • Your keys are private to you!!

In [5]:
library(httr)

## read the api keys stored in a file
auth_keys = read.table("/home/vahid/Desktop/tw.api.keys")

myapp = oauth_app("twitter", 
                  key=auth_keys[1,1], secret = auth_keys[2,1])

sig = sign_oauth1.0(myapp, 
                    token=auth_keys[3,1], token_secret=auth_keys[4,1])

## Get the home timeline
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)

## Extract its contents
json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(json1))

json2[1:2, 1:4]


Out[5]:
                      created_at           id             id_str
1 Sat Jun 14 13:32:02 +0000 2014 4.778057e+17 477805666992922624
2 Sat Jun 14 12:51:01 +0000 2014 4.777953e+17 477795344810397696
                                                 text
1                      Ireland http://t.co/sPW4UvQIgI
2 RGolf: NGSL Scrabble http://t.co/fNoahvHa2o #rstats

useful parameters:

  • exclude_replies
  • count
  • since_id

    • In general, httr package allows GET, POST, PUT, DELETE requests if authenticated
    • httr works well with Twitter, Facebook, Google, Github, ..

Github API:

Information on GitHub useful for hiring programmers or scientists with particular skill.

To read data from github API:


In [ ]:
## reginster an application at https://github.com/settings/applications
### use  http://localhost:1410  fot callback URL

myapp <- oauth_app("github", "74c8e68e2ca64160346d", secret="...")

myapp
> <ouath_app> github
>  key:    74c8e68e2ca64160346d
>  secret: <hidden>

github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
> Waiting for authentication in browser...
> Press Esc/Ctrl + C to abort
> Authentication complete.

req <- GET("https://api.github.com/users/mirjalil/repos", config(token = github_token))
stop_for_status(req)
content(req)

7. Reading from Other Sources

Use connections to interact directly with other files/sources:

  • file: open a connection to a text file
  • url: open a connection to a url
  • gzfile: open a connection to a .gz file
  • bzfile: open a connection to a .bz2 file

The Foreign Package

  • loads data from other statistical softwares such as Minitab, S, SAS, SPSS, Stata, Systat
    • read.arff (Weka)
    • read.dta (Stata)
    • read.mtp (Minitab)
    • read.octave (Octave)
    • read.spss (SPSS)
    • read.xport (SAS)

In [11]:
# example reading arff data:

library(foreign)
d.weka <- read.arff("examples/anneal.arff")
str(d.weka[,1:5])


'data.frame':	898 obs. of  5 variables:
 $ family      : Factor w/ 2 levels "TN","ZS": NA NA NA NA NA NA NA NA NA NA ...
 $ product-type: Factor w/ 1 level "C": 1 1 1 1 1 1 1 1 1 1 ...
 $ steel       : Factor w/ 7 levels "A","K","M","R",..: 1 4 4 1 1 1 4 1 4 1 ...
 $ carbon      : num  8 0 0 0 0 0 0 0 0 0 ...
 $ hardness    : num  0 0 0 60 60 45 0 0 0 0 ...

Other Database Connections

R Packages for Reading Images

  • jpeg::readJPEG()
  • readbitmap::read.bmp()
  • png::readPNG()
  • EBImage (Bioconductor)

In [8]:
# example: reading image from 101_ObjectCategories

library(jpeg)

d.img <- readJPEG("~/datasets/101_ObjectCategories/pyramid/image_0001.jpg")

str(d.img)


 num [1:189, 1:300, 1:3] 0.965 0.969 0.973 0.973 0.973 ...

Reading Geographic Information System (GIS) Data

  • rdgal::
  • rgeos::
  • raster::

Reading Musical Data

  • tuneR::
  • seewave::