1. Introduction

Motivation:

Finding and exctracting data from various sources
Pronciples of tidy data, and how to make data tidy
Practical implementation through a range of R packages

Raw Data -> Processing script -> tidy data -> data analysis -> data communication

All processing steps should be recorded

Downloading and Reading Data (basic)

File and directory handling:

setwd(), getwd()

file.exists("directoryName")

dir.create("directoryName")



In [1]:

    
if (!file.exists("data/")) {
    dir.create("data/")
}

download.file() to download a file from internet



In [2]:

    
fileUrl <- "http://www.brianknaus.com/software/srtoolbox/s_4_1_sequence80.txt"
download.file(fileUrl, destfile="./data/s_4_1_sequence80.txt", method="curl")

Read in data:

read.table()

read.csv()

read.fwf() to read files that have fixed-width-format

If the file contains quotation marks, set qoute=""

Reading excel files:

read.xlsx() from library xlsx



In [3]:

    
if(!file.exists("data/")) {dir.create("data")}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl, destfile="./data/cameras.xlsx", method="curl")
dateDownloaded <- date

require(xlsx)
cameraData <- read.xlsx("data/cameras.xlsx", sheetIndex=1, header=T)
head(cameraData)









    



packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: xlsx

packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: rJava

packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: xlsxjars







    Out[3]:





                         address direction      street  crossStreet
1       S CATON AVE & BENSON AVE       N/B   Caton Ave   Benson Ave
2       S CATON AVE & BENSON AVE       S/B   Caton Ave   Benson Ave
3 WILKENS AVE & PINE HEIGHTS AVE       E/B Wilkens Ave Pine Heights
4        THE ALAMEDA & E 33RD ST       S/B The Alameda      33rd St
5        E 33RD ST & THE ALAMEDA       E/B      E 33rd  The Alameda
6        ERDMAN AVE & N MACON ST       E/B      Erdman     Macon St
                intersection                      Location.1
1     Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
2     Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
3 Wilkens Ave & Pine Heights  (39.2720252302, -76.676960806)
4     The Alameda  & 33rd St (39.3285013141, -76.5953545714)
5      E 33rd  & The Alameda (39.3283410623, -76.5953594625)
6         Erdman  & Macon St (39.3068045671, -76.5593167803)

Use colIndex= and rowIndex= to read specefic columns/rows



In [4]:

    
read.xlsx("data/cameras.xlsx", sheetIndex=1, header=T, colIndex=2:3, rowIndex=1:5)









    Out[4]:





  direction      street
1       N/B   Caton Ave
2       S/B   Caton Ave
3       E/B Wilkens Ave
4       S/B The Alameda

Other packages for reading excel files XLConnect

Reading JSON files

JSON: Javascript Object Notation
Lightweight data storage
Common format for data ftom APIs

In JSON, Data stored as

numbers (double)
string ("")
boolean (true or false)
array (ordered, [, , ])
object (unordered, {key1:value, key2:value})



In [5]:

    
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/mirjalil/repos")
head(names(jsonData))
names(jsonData$owner)
jsonData$owner$login









    Out[5]:





[1] "id"        "name"      "full_name" "owner"     "private"   "html_url" 






    Out[5]:





 [1] "login"               "id"                  "avatar_url"         
 [4] "gravatar_id"         "url"                 "html_url"           
 [7] "followers_url"       "following_url"       "gists_url"          
[10] "starred_url"         "subscriptions_url"   "organizations_url"  
[13] "repos_url"           "events_url"          "received_events_url"
[16] "type"                "site_admin"         






    Out[5]:





 [1] "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil"
 [7] "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil"
[13] "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil" "mirjalil"
[19] "mirjalil" "mirjalil" "mirjalil"

Writing data.frames to JSON format

toJSON()



In [6]:

    
myjson <- toJSON(iris[1:2,], pretty=T)
cat(myjson)









    



[
	{
		"Sepal.Length" : 5.1,
		"Sepal.Width" : 3.5,
		"Petal.Length" : 1.4,
		"Petal.Width" : 0.2,
		"Species" : "setosa"
	},
	{
		"Sepal.Length" : 4.9,
		"Sepal.Width" : 3,
		"Petal.Length" : 1.4,
		"Petal.Width" : 0.2,
		"Species" : "setosa"
	}
]

Reading XML files



In [7]:

    
require(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"

xmlDoc <- xmlTreeParse(fileUrl, useInternal=T)

rootNode <- xmlRoot(xmlDoc)
names(rootNode)

xmlName(rootNode)









    



packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: XML







    Out[7]:





  food   food   food   food   food 
"food" "food" "food" "food" "food" 






    Out[7]:





[1] "breakfast_menu"

Accessing elements of XML



In [8]:

    
rootNode[[1]]

rootNode[[1]][[1]]









    Out[8]:





<food>
  <name>Belgian Waffles</name>
  <price>$5.95</price>
  <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
</food> 






    Out[8]:





<name>Belgian Waffles</name>



In [9]:

    
xmlSApply(rootNode, xmlValue)









    Out[9]:





                                                                                                                    food 
                              "Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650" 
                                                                                                                    food 
                   "Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream900" 
                                                                                                                    food 
"Berry-Berry Belgian Waffles$8.95Light Belgian waffles covered with an assortment of fresh berries and whipped cream900" 
                                                                                                                    food 
                                               "French Toast$4.50Thick slices made from our homemade sourdough bread600" 
                                                                                                                    food 
                        "Homestyle Breakfast$6.95Two eggs, bacon or sausage, toast, and our ever-popular hash browns950"

Using xpath



In [10]:

    
xpathSApply(rootNode, "//name", xmlValue)









    Out[10]:





[1] "Belgian Waffles"             "Strawberry Belgian Waffles" 
[3] "Berry-Berry Belgian Waffles" "French Toast"               
[5] "Homestyle Breakfast"



In [11]:

    
xpathSApply(rootNode, "//price", xmlValue)









    Out[11]:





[1] "$5.95" "$7.95" "$8.95" "$4.50" "$6.95"

Using data.table package

Often faster and more moemory efficient than data.frame
All the functions that work on data.frames, also work on data.tables
Much faster at subseeting, group, and updating



In [10]:

    
library(data.table)
df <- data.frame(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z=rnorm(9))

dt <- data.table(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z=rnorm(9))

## to see all the data.tables in memory
tables()









    



     NAME NROW MB COLS  KEY
[1,] dt      9 1  x,y,z    
Total: 1MB



In [11]:

    
## subsetting
dt[1:3, list(x,y)]

dt[, table(y)]









    



Error in `[.data.frame`(x, i, j): object 'x' not found






    



Error in table(y): object 'y' not found



In [12]:

    
## Passing a list of functions to apply:
dt[, list(mean(x), sum(z))]









    



Error in mean(x): object 'x' not found

adding a new column:



In [13]:

    
is.data.table(dt)
dt[,w:=z^2]

# Note: some commands of data.table are still not supported in IPython-Rkernel









    Out[13]:





[1] TRUE






    



Error in `:=`(w, z^2): Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").



In [14]:

    
dt[,a:=x>0]
dt[,b:=mean(x+w), by=a]









    



Error in `:=`(a, x > 0): Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").






    



Error in `:=`(b, mean(x + w)): Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").

Note: some commands of data.table are still not supported in IPython-Rkernel

Multiple operations:



In [15]:

    
dt[, m:={tmp <- x+z; log2(tmp+5)}]









    



Error in `:=`(m, {: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").

Keys

Using keys to subset and sort much faster



In [16]:

    
DT <- data.table(x=rep(c("a", "b", "c"), each=100), y=rnorm(300))
setkey(DT, x)
DT['a']









    



Error in `[.data.frame`(x, i): undefined columns selected

Using keys to facilitate joing two data tables:



In [17]:

    
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x)
setkey(DT2, x)
merged(DT1, DT2)









    



Error in eval(expr, envir, enclos): could not find function "merged"

Reading data from files faster

Using fread() is much faster than read.table()



In [18]:

    
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
tfile=tempfile()

write.table(big_df, file=tfile, row.names=T, sep="\t", quote=F)

## timeing by fread:
system.time(fread(tfile))

## timing by read.table
system.time(read.table(tfile, header=T, sep="\t"))









    Out[18]:





   user  system elapsed 
  0.776   0.008   0.783 






    Out[18]:





   user  system elapsed 
 24.840   0.144  25.013

3. Reading From MySQL

On a server, there might be multiple databases, and on each databases might be multiple tables
Make a connection to a database-serve, and send a query:



In [19]:

    
require(RMySQL)
ucscDb <- dbConnect(MySQL(), user="genome", 
                    host="genome-mysql.cse.ucsc.edu")
result_allDBs <- dbGetQuery(ucscDb, "show databases;")
dbDisconnect(ucscDb)
head(result_allDBs)









    



packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: RMySQL

packageStartupMessage in packageStartupMessage(gettextf("Loading required package: %s", : Loading required package: DBI







    Out[19]:





[1] TRUE






    Out[19]:





            Database
1 information_schema
2            ailMel1
3            allMis1
4            anoCar1
5            anoCar2
6            anoGam1



In [20]:

    
#Specify a database:
hg19 <- dbConnect(MySQL(), user="genome", db="hg19", 
                  host="genome-mysql.cse.ucsc.edu")

allTables <- dbListTables(hg19)
length(allTables)
allTables[1:5]









    Out[20]:





[1] 11003






    Out[20]:





[1] "HInv"         "HInvGeneMrna" "acembly"      "acemblyClass" "acemblyPep"



In [21]:

    
# list all the fields (column name) of a table:
dbListFields(hg19, "affyU133Plus2")









    Out[21]:





 [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"     
 [6] "qNumInsert"  "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"     
[11] "qName"       "qSize"       "qStart"      "qEnd"        "tName"      
[16] "tSize"       "tStart"      "tEnd"        "blockCount"  "blockSizes" 
[21] "qStarts"     "tStarts"



In [22]:

    
# find out the number of rows of a specific table:
dbGetQuery(hg19, "SELECT count(*) from affyU133Plus2")









    Out[22]:





  count(*)
1    58463

Get the table as a data.frame dbReadTable():



In [23]:

    
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData[,1:8], n=3)









    Out[23]:





  bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
1 585     530          4          0     23          3          41          3
2 585    3355         17          0    109          9          67          9
3 585    4156         14          0     83         16          18          2

Select only a subset of data with dbSendQuery()



In [24]:

    
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 30")
affyMis <- fetch(query)
quantile(affyMis$misMatches)









    Out[24]:





  0%  25%  50%  75% 100% 
   1    1    3    6   30

Fetch a small number of rows:



In [25]:

    
affyMisSmall <- fetch(query, n=10)
dbClearResult(query)
dim(affyMisSmall)









    Out[25]:





[1] TRUE






    Out[25]:





[1] 10 22

Close the connection when you are done



In [26]:

    
dbDisconnect(hg19)









    Out[26]:





[1] TRUE

4. Data in HDF5 Format

www.hdfgroup.org

HDF5 used for storing large data sets
HDF == Hierarchical Data Format
groups containing datasets + metadata
datasets are multdimensional array of data
To install, use

source("http://bioconductor.org/biocLite.R")

biocLite("rhdf5")



In [27]:

    
library(rhdf5)
# create a new hdf5 file
created <- h5createFile("examples/example.h5")
# the output is just a boolean
created









    



simpleMessage in message("file '", file, "' already exists."): file 'examples/example.h5' already exists.







    Out[27]:





[1] FALSE



In [28]:

    
created <- h5createGroup("examples/example.h5", "foo")
created <- h5createGroup("examples/example.h5", "baa")
created <- h5createGroup("examples/example.h5", "foo/subgroup")
h5ls("examples/example.h5")









    



simpleMessage in message("Can not create group. Object with name '", group, "' already exists."): Can not create group. Object with name 'foo' already exists.

simpleMessage in message("Can not create group. Object with name '", group, "' already exists."): Can not create group. Object with name 'baa' already exists.

simpleMessage in message("Can not create group. Object with name '", group, "' already exists."): Can not create group. Object with name 'foo/subgroup' already exists.







    Out[28]:





          group     name       otype   dclass       dim
0             /      baa   H5I_GROUP                   
1             /       df H5I_DATASET COMPOUND         5
2             /      foo   H5I_GROUP                   
3          /foo        A H5I_DATASET  INTEGER     5 x 2
4          /foo subgroup   H5I_GROUP                   
5 /foo/subgroup        B H5I_DATASET    FLOAT 5 x 2 x 2

Writing/adding data to a specific group



In [29]:

    
A <- matrix(1:10, nr=5, nc=2)
h5write(A, "examples/example.h5", "foo/A")

B <- array(seq(0.1, 2.0, by=0.1), dim=c(5,2,2))
attr(B, "scale") <- "liter"
h5write(B, "examples/example.h5", "foo/subgroup/B")
h5ls("examples/example.h5")









    Out[29]:





          group     name       otype   dclass       dim
0             /      baa   H5I_GROUP                   
1             /       df H5I_DATASET COMPOUND         5
2             /      foo   H5I_GROUP                   
3          /foo        A H5I_DATASET  INTEGER     5 x 2
4          /foo subgroup   H5I_GROUP                   
5 /foo/subgroup        B H5I_DATASET    FLOAT 5 x 2 x 2



In [30]:

    
## Writea data.frame
df <- data.frame(1L:5L, seq(0,1,length.out=5), 
                 c("ab", "cde", "fhgi", "a", "s"), stringAsFactors=FALSE)
h5write(df, "examples/example.h5", "df")
h5ls("examples/example.h5")









    



Error in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...): Cannot write data.frame. Object already exists. Subsetting for compound datatype not supported.






    Out[30]:





          group     name       otype   dclass       dim
0             /      baa   H5I_GROUP                   
1             /       df H5I_DATASET COMPOUND         5
2             /      foo   H5I_GROUP                   
3          /foo        A H5I_DATASET  INTEGER     5 x 2
4          /foo subgroup   H5I_GROUP                   
5 /foo/subgroup        B H5I_DATASET    FLOAT 5 x 2 x 2

Reading data from HDF5



In [32]:

    
readA <- h5read("examples/example.h5", "foo/A")
readB <- h5read("examples/example.h5", "foo/subgroup/B")
read.df <- h5read("examples/example.h5", "df")
readA









    Out[32]:





     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Modifying an Existing Datasets



In [33]:

    
h5write(c(12,13,14), "examples/example.h5", "foo/A", index=list(1:3, 1))
h5read("examples/example.h5", "foo/A")









    Out[33]:





     [,1] [,2]
[1,]   12    6
[2,]   13    7
[3,]   14    8
[4,]    4    9
[5,]    5   10

5. Reading Data from the Web (webscraping)

Webscraping: programmatically extracting data from HTML code



In [34]:

    
# create a connection
con <- url("http://scholar.google.com/citations?user=8YTgJ_cAAAAJ&hl=en")
# readlines from the connection
htmlCode <- readLines(con)
# close the connection
close(con)









    



simpleWarning in readLines(con): incomplete final line found on 'http://scholar.google.com/citations?user=8YTgJ_cAAAAJ&hl=en'

Process web data by XML package



In [19]:

    
library(XML)

url.address <- "http://scholar.google.com/citations?user=8YTgJ_cAAAAJ&hl=en"

html <- htmlTreeParse(url.address, useInternalNodes=T)

xpathSApply(html, "//title", xmlValue)

xpathSApply(html, "//td[@id='col-citedby']", xmlValue)









    Out[19]:





[1] "Vahid Mirjalili - Google Scholar Citations"






    Out[19]:





 [1] "Cited by" "39"       "15"       "7"        "4"        "1"       
 [7] ""         ""         ""         ""

Using GET Command (httr package)



In [24]:

    
library(httr)
html2 <- GET(url.address)

content2 <- content(html2, as="text")

parsedHtml <- htmlParse(content2, asText=TRUE)

xpathSApply(parsedHtml, "//title", xmlValue)









    Out[24]:





[1] "Vahid Mirjalili - Google Scholar Citations"

Example: parsing worldcup-2014 HTML



In [15]:

    
require(XML)

wcup <- htmlTreeParse("http://www.fifa.com/worldcup/matches/", useInternal=TRUE)

matches <- xpathSApply(wcup, "//span[starts-with(@class, 't-nText ')]", xmlValue)
#head(matches)

m_days <- xpathSApply(wcup, "//div[@class='mu-i-datetime']", xmlValue)
#head(m_days)

wcup.matches <- data.frame(
    team_A = matches[seq(1, 130, by=2)],
    team_B = matches[seq(2, 130, by=2)],
    datetime = m_days[1:65]
)

#head(wcup.matches)

## Print the German games:
wcup.matches[wcup.matches$team_A == "Germany" | wcup.matches$team_B == "Germany", ]









    Out[15]:





    team_A   team_B                       datetime
12 Germany Portugal 16 Jun 2014 - 13:00 Local time
29 Germany Portugal 16 Jun 2014 - 13:00 Local time
45 Germany    Ghana 21 Jun 2014 - 16:00 Local time
63     USA  Germany 26 Jun 2014 - 13:00 Local time

6. Getting Data from APIs

Using httr package to get data from APIs (twitter, facebook, github, ...)

Create a developer account at https://dev.twitter.com/apps
Create an application, with authentication keys: OAuth
Your keys are private to you!!



In [5]:

    
library(httr)

## read the api keys stored in a file
auth_keys = read.table("/home/vahid/Desktop/tw.api.keys")

myapp = oauth_app("twitter", 
                  key=auth_keys[1,1], secret = auth_keys[2,1])

sig = sign_oauth1.0(myapp, 
                    token=auth_keys[3,1], token_secret=auth_keys[4,1])

## Get the home timeline
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)

## Extract its contents
json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(json1))

json2[1:2, 1:4]









    Out[5]:





                      created_at           id             id_str
1 Sat Jun 14 13:32:02 +0000 2014 4.778057e+17 477805666992922624
2 Sat Jun 14 12:51:01 +0000 2014 4.777953e+17 477795344810397696
                                                 text
1                      Ireland http://t.co/sPW4UvQIgI
2 RGolf: NGSL Scrabble http://t.co/fNoahvHa2o #rstats

To lear more about other parameters to get statuses/home_timeline:

https://dev.twitter.com/docs/api/1.1/get/statuses/home_timeline

useful parameters:

exclude_replies
count
since_id
- In general, httr package allows GET, POST, PUT, DELETE requests if authenticated
- httr works well with Twitter, Facebook, Google, Github, ..

Github API:

Information on GitHub useful for hiring programmers or scientists with particular skill.

To read data from github API:

Reginster an application at https://github.com/settings/applications
Get your key and secret pair for aithntication



In [ ]:

    
## reginster an application at https://github.com/settings/applications
### use  http://localhost:1410  fot callback URL

myapp <- oauth_app("github", "74c8e68e2ca64160346d", secret="...")

myapp
> <ouath_app> github
>  key:    74c8e68e2ca64160346d
>  secret: <hidden>

github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
> Waiting for authentication in browser...
> Press Esc/Ctrl + C to abort
> Authentication complete.

req <- GET("https://api.github.com/users/mirjalil/repos", config(token = github_token))
stop_for_status(req)
content(req)

7. Reading from Other Sources

Use connections to interact directly with other files/sources:

file: open a connection to a text file
url: open a connection to a url
gzfile: open a connection to a .gz file
bzfile: open a connection to a .bz2 file

The Foreign Package

loads data from other statistical softwares such as Minitab, S, SAS, SPSS, Stata, Systat
- read.arff (Weka)
- read.dta (Stata)
- read.mtp (Minitab)
- read.octave (Octave)
- read.spss (SPSS)
- read.xport (SAS)



In [11]:

    
# example reading arff data:

library(foreign)
d.weka <- read.arff("examples/anneal.arff")
str(d.weka[,1:5])









    



'data.frame':	898 obs. of  5 variables:
 $ family      : Factor w/ 2 levels "TN","ZS": NA NA NA NA NA NA NA NA NA NA ...
 $ product-type: Factor w/ 1 level "C": 1 1 1 1 1 1 1 1 1 1 ...
 $ steel       : Factor w/ 7 levels "A","K","M","R",..: 1 4 4 1 1 1 4 1 4 1 ...
 $ carbon      : num  8 0 0 0 0 0 0 0 0 0 ...
 $ hardness    : num  0 0 0 60 60 45 0 0 0 0 ...

Other Database Connections

RPostgresSQL: https://code.google.com/p/rpostgressql
RODBC: provides an interface to multiple databases such as RPostgresSQL, SQLite, MySQL, Microsoft Access
RMongo, and RMongoDb

R Packages for Reading Images

jpeg::readJPEG()
readbitmap::read.bmp()
png::readPNG()
EBImage (Bioconductor)



In [8]:

    
# example: reading image from 101_ObjectCategories

library(jpeg)

d.img <- readJPEG("~/datasets/101_ObjectCategories/pyramid/image_0001.jpg")

str(d.img)









    



 num [1:189, 1:300, 1:3] 0.965 0.969 0.973 0.973 0.973 ...

Reading Geographic Information System (GIS) Data

rdgal::
rgeos::
raster::

Reading Musical Data

tuneR::
seewave::

Getting and Cleaninig Data in R