After having 1) set up the data analysis problem and 2) forged a good idea of the ideal data set (and corresponding analysis) for the problem, the first stage of the data analysis process is to start gathering relevant data in order to build the best possible approximation of your ideal data set.

The internet (which is basically a giant interconnected file system hosting data files) is a great place to start. This data, however, may come from various websites and be in various formats. It will need to be unified and be put in a regular form--that is, into a collection of numerical arrays--on which to apply the data analysis algorithms described in the previous sections.

Here, we will learn how to use R to:

  • download files from the internet
  • store them into directories on your local file systems
  • load data files in various formats (.csv, JSON, and XML) into appropriate R objects

In particular, we will spend some time on the XML format, which stands for eXtensible Markup Language. The reason why this format is particularly interesting is that instead of structuring data in tabular form, it structures it in the form of a graph. A graph is a collection of nodes containing information. These nodes are related to each other through links, and one can access the information contained on one node just by traveling from node to node using the links.

Data structured in this form encompass HTML pages (HTML can be roughly understood as a dialect of XML), the information hosted by internet websites (websites are related to each other by links, and one can regard the Google search tool as a way to bypass the internet's graph structure to get to the information directly), a computer file system (where many users store their data as they go, forming an increasingly intricate graph of directories and subdirectories), or the data of user profiles in social media networks.

Graphical data is not regular, and statistical analysis is not readily applicable for data structured in this form. It needs to first be transformed into regular data--i.e., cubes of numbers--before statistical analysis can take place.

Libraries for data gathering

various data sources (XML, jason, HTML, etc.) $\longrightarrow$ raw data tables (csv, xls, etc.)

This involves

  • storing the raw data into the local file system
  • loading the raw data into R (or Python) using library (or modules) corresponding to the raw data format
  • creating data frames containing the raw data
  • saving the data in tabular format (as csv, xls, etc.)

R packages for data gathering

You'll find here a list of all availabe R packages.

To install R packages on the OskiBox, the best way is to type in

sudo apt-get r-cran-PackageName

The cran website maitains a comprehensive list of the R packages available this way.

To use the package after installation, you'll inkove the command:

library('package_name')

Python packages for data gathering

You'll find here a list of all availabe Python packages (or modules).

To install a Python package to osky box the best way is :

sudo pip install package_name

To use the package after installation, you'll inkove the command:

import package_name

Let us start by learning how to manipulate the local filesystem in your computer from R.

Manipulating the file systems


In [1]:
%load_ext rmagic

R offers a bunch of functions to

  • list the files and directories (see here):

      list.files(path = ".")    
      list.dirs(path = ".")
  • create and modify files (see here for details):

      file.create(path1, path2, etc.)
      file.remove(path1, path2, etc.)
  • create directories (see here for details)

      dir.create(path)
  • navigate the file system (see here):

    getwd()
    setwd(dir)

Let us create

  • a directory 'DATA' to help us organize our data
  • a subdirectory 'raw' to store the raw data collected from the web
  • a subdirectory 'cleaned' to stored our final data after having cleaned the raw data

and then let us set move to the raw directory.


In [2]:
%%R

data_dir    = './data'
raw_dir     = paste(data_dir, '/raw', sep='')
cleaned_dir = paste(data_dir, '/cleaned', sep='')

dir.create(data_dir)
dir.create(raw_dir)
dir.create(cleaned_dir)

In [3]:
!ls data/


cleaned  raw

Downloading files from the web

To download a file from the internet in R and store it on the local file system, one uses the function:

    download.file(fileUrl, destfile)

It is also a good practice to keep track of the date the data was downloaded, since data from the web may likely be updated. For that, one can

  • retrieve the current data with the function date()
  • include the date as part of the data file name

However, from within the notebook, it is simpler to use bash curl command whose syntax is the following:

`curl yourURL > localFile`

In [4]:
%%R
url = 'http://benoitdherin.github.io/data-analysis-with-R/data/houses.csv'
download.file(url, 'myHouses.csv')


trying URL 'http://benoitdherin.github.io/data-analysis-with-R/data/houses.csv'
Content type 'text/csv; charset=utf-8' length 3489 bytes
opened URL
==================================================
downloaded 3489 bytes


In [13]:
%%R 

df = read.csv('myHouses.csv', header=T)
print(head(df))


  House Taxes Bedrooms Baths Quadrant NW  price size   lot
1     1  1360        3   2.0       NW  1 145000 1240 18000
2     2  1050        1   1.0       NW  1  68000  370 25000
3     3  1010        3   1.5       NW  1 115000 1130 25000
4     4   830        3   2.0       SW  0  69000 1120 17000
5     5  2150        3   2.0       NW  1 163000 1710 14000
6     6  1230        3   2.0       NW  1  69900 1010  8000

Alternatively, one could download the file immediately into R without storing it first on the local file system:


In [16]:
%%R

dff = read.csv(url, header=T)
print(head(dff))


  House Taxes Bedrooms Baths Quadrant NW  price size   lot
1     1  1360        3   2.0       NW  1 145000 1240 18000
2     2  1050        1   1.0       NW  1  68000  370 25000
3     3  1010        3   1.5       NW  1 115000 1130 25000
4     4   830        3   2.0       SW  0  69000 1120 17000
5     5  2150        3   2.0       NW  1 163000 1710 14000
6     6  1230        3   2.0       NW  1  69900 1010  8000

Example: The city of Baltimore offers an API (Application Programming Interface) allowing us to gather data on speed infractions in different formats:

  • csv (Comma Separated Values)

  • xml (Extensible Markup Language)

  • jason (JavaScript Object Notation)

  • xls (Microsoft Excel file format)

In this context, a website API is a set of rules that describe how to issues requests to a given webserver, say benoitdherin.github.io by adding requests strings after the webserver name in its url, say this string is data-analysis-with-R. If type this string into a browser the website . For instance, This request string tells the webserver what to return to your browser as a result of this urlThe general format for an url is the following:

http://benoitdherin.github.io/data-analysis-with-R/data/houses.csv

http://web.server.name/re/que/st

The re/que/st is a string describing to the webserver what you'd like it returns to you. The simplest form of a request is the path (location) of a file (ressource) on the webserver. The ressource that's returned to you is then that file:

Copied and pasted from the website:


In [17]:
protocol    = 'https://'
website     =  'data.baltimorecity.gov'
location    =  '/api/views/dz54-2aru/'
csvRequest  = 'rows.csv?accessType=DOWNLOAD'
xmlRequest  = 'rows.xml?accessType=DOWNLOAD'
jsonRequest = 'rows.json?accessType=DOWNLOAD'

To access this different data format, one neeed to use different urls:


In [18]:
csv_url = protocol  + website + location + csvRequest
xml_url = protocol  + website + location + csvRequest
json_url = protocol + website + location + csvRequest
xls_url = protocol  + website + location + csvRequest

print csv_url; print xml_url; print json_url; print xls_url

!curl $csv_url > baltimore.csv
!curl $csv_url > baltimore.xml
!curl $csv_url > baltimore.json
!curl $csv_url > baltimore.xls


https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9294    0  9294    0     0  27566      0 --:--:-- --:--:-- --:--:-- 27578
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9294    0  9294    0     0  37915      0 --:--:-- --:--:-- --:--:-- 37934
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9294    0  9294    0     0  34646      0 --:--:-- --:--:-- --:--:-- 34550
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9294    0  9294    0     0  34228      0 --:--:-- --:--:-- --:--:-- 34295

Let us prepare prepare the names of the files in which we will store the data:

Loading data into R

Different data formats are loaded with different functions. The resulting R object storing the data also depends on the format.

Tabular format

Data files in tabular format such as csv are loaded directly into data frame with the the function:

read.table(file, sep, header)

In [19]:
%%R 

df = read.table('baltimore.csv', sep=',', header=TRUE)
print(head(df))


                         address direction      street  crossStreet
1       S CATON AVE & BENSON AVE       N/B   Caton Ave   Benson Ave
2       S CATON AVE & BENSON AVE       S/B   Caton Ave   Benson Ave
3 WILKENS AVE & PINE HEIGHTS AVE       E/B Wilkens Ave Pine Heights
4        THE ALAMEDA & E 33RD ST       S/B The Alameda      33rd St
5        E 33RD ST & THE ALAMEDA       E/B      E 33rd  The Alameda
6        ERDMAN AVE & N MACON ST       E/B      Erdman     Macon St
                intersection                      Location.1
1     Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
2     Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
3 Wilkens Ave & Pine Heights  (39.2720252302, -76.676960806)
4     The Alameda  & 33rd St (39.3285013141, -76.5953545714)
5      E 33rd  & The Alameda (39.3283410623, -76.5953594625)
6         Erdman  & Macon St (39.3068045671, -76.5593167803)

XML documents

The tree structure of an xml document

Let us create a simple XML document.

Consider the following data for two students in a given class given in tabular form:


In [20]:
%%file course.csv
name, midterm, final, homework, section, major
Bob Durant, 55, 88, 99, 1, STAT
Agnes Thomas, 99, 90, 99, 2, ECON


Overwriting course.csv

Let's read this data nicely into a data frame:


In [21]:
%%R 
df = read.csv('course.csv'); print(df)


          name midterm final homework section major
1   Bob Durant      55    88       99       1  STAT
2 Agnes Thomas      99    90       99       2  ECON

The same data can be stored in XML format in the following way:


In [22]:
%%file stat133.xml
<?xml version="1.0" encoding="ISO-8859-1"?>  


<stat133 name='Computing with Data' department='STAT'>
        
    <student sid='1232243'>
        <name>Bob Durant</name>
        <midterm>55</midterm>
        <final>88</final>
        <homework>99</homework>
        <section>1</section>
        <major>STAT</major>
    </student>
    
    <student sid='3213453'>
        <name>Agnes Thomas</name>
        <midterm>99</midterm>
        <final>90</final>
        <homework>99</homework>
        <section>2</section>
        <major>ECON</major>
    </student>
    
</stat133>


Overwriting stat133.xml

The first line

<?xml version="1.0" encoding="ISO-8859-1"?> 

provides informations about

  • which version of XML we are structuring our data with (version 1.0)
  • which type of character encoding we are using (ISO-8859-1)

The whole course data is surrounded by two XML tags

<stat133> ... </stat133>

  • The first tag is an opening tag indicating the beginning of the stat 133 data:
<tag>


  • The second tag is a closing tag incating the end of the stat 133 data:
</tag>

  • Two such tags together with all the data in between tag forms an xml node:
<tag>...data...</tag>

The data in between the stat133 node is itself a collection of xml nodes:

<student sid='1232243'> ... </student>
<student sid='3213453'> ... </student>

  • These nodes are the children nodes of the enclosing node (the stat133 node here)
  • The enclosing node is called the parent node

Pictorially, one may represent the children/parent relationship as an arrow:

parent node (stat133 node) $\longrightarrow$ children node (student node)

With this interpretation, an xml document structures data in the form of a tree, called the xml tree.

  • The root of the tree is the node whose tag encloses all other nodes
  • The leafs of the tree are the nodes enclosing no other node

In our class example:

  • the stat133 node is the root
  • the tags enclosing actual data about the students are the leafs:
    <name>Bob Durant</name>
    <midterm>55</midterm>
    <final>88</final>
    <homework>99</homework>
    <section>1</section>
    <major>STAT</major>

So each student node contains five leafs.

The values of an xml node is the collection of all the data contained by all the leafs stemming from this node.

Example: Consider the following xml code.

<A>
    <B>value1</B>
    <C>value2</C>
    <D>
        <E> 
            <G>value3</G> 
        </E>
        <F>value4</F>
    </D>
</A>

The value of

  • node A is the collection {value1, value2, value2, value4}

  • node B is {value1}

  • node D is {value3, value4}

  • etc.

Here is a tree representation of the xml code above:

An xml node can also have several attributes:

<tag attribute1=value attribue2=value etc.>...data...</tag>

The node attributes are specified in the opening tag.

In our class example:

<stat133 name='Computing with Data' department='STAT'>

    <student sid='1232243'> ...data... </student>

    <student sid='3213453'> ...data... </student>

</stat133>

  • the stat133 node has two attribues: name and department
  • the student nodes have only one attribute: sid

XML node summary

Each XML node in a XML document may have

  • several attributes specified in the opening tag
  • several children nodes in between the opening tag and the closing tag
  • one parent node enclosing it
  • a value given by all the data enclosed by the leafs steming out of it

The collection of all XML nodes with the parent/children relation ship form the XML tree:

  • the XML node enclosing all other nodes is the root of the XML document
  • the XML nodes enclosing no other nodes are the leafs of the XML document

Parsing an XML document with R

In R, the XML library contains functions and classes to represent XML documents.

Libraries in R corresponds to modules in Python.


In [23]:
%%R
library(XML)


Attaching package: ‘XML’

The following object is masked from ‘package:tools’:

    toHTML

To install this r package in the oskibox, open a termina and type in

sudo apt-get install r-cran-xml

Remark: The keyword library in R corresponds to the import keyword in Python.

The XML library provides two main classes:

A class

XMLDocument

representing whole XML documents including

  • The header of the XML document, that is, the data (also called meta data) contained in
<?xml version="1.0" encoding="ISO-8859-1"?>  

  • The body of the XML document, i.e., the sequence of XML tags representing the actual data

A class

XMLNode

representing the XML nodes in the XML tree where the information is stored.

The first steps to load XML data into R is to use the functions:

xml_doc = xmlParse(file_adress)
root_node = xmlRoot(xml_doc)

  • xmlParse takes as input the XML file address and returns the corresponding XMLDocument object
  • xmlRoot takes an XMLDocument and returns the root node as a XMLNode object.

In [24]:
%%R

xml_doc = xmlParse('stat133.xml')

stat133_node = xmlRoot(xml_doc)

To navigate the XML tree the XML library offers the following functions:

xmlAttrs(node)
xmlValue(node)
xmlChildren(node)
xmlParent(node)

that take an XMLNode object and return the corresponding information.


In [25]:
%%R
#returns the attributes as a character vector
stat133_attrs = xmlAttrs(stat133_node) 
print(stat133_attrs)


                 name            department 
"Computing with Data"                "STAT" 

In [26]:
%%R
#return the children nodes as a list of XMLNode objects

stat133_children = xmlChildren(stat133_node) # return chi

student1_node = stat133_children[[1]]
student2_node = stat133_children[[2]]

#printing an XMLNode yields back the corresponding XML code
print(student1_node)


<student sid="1232243">
  <name>Bob Durant</name>
  <midterm>55</midterm>
  <final>88</final>
  <homework>99</homework>
  <section>1</section>
  <major>STAT</major>
</student> 

In [27]:
%%R

parent_node = xmlParent(student1_node)

print(parent_node)


<stat133 name="Computing with Data" department="STAT">
  <student sid="1232243">
    <name>Bob Durant</name>
    <midterm>55</midterm>
    <final>88</final>
    <homework>99</homework>
    <section>1</section>
    <major>STAT</major>
  </student>
  <student sid="3213453">
    <name>Agnes Thomas</name>
    <midterm>99</midterm>
    <final>90</final>
    <homework>99</homework>
    <section>2</section>
    <major>ECON</major>
  </student>
</stat133> 

In [28]:
%%R

for (student in xmlChildren(stat133_node))
{
    for(node in xmlChildren(student)) print(xmlValue(node))
}


[1] "Bob Durant"
[1] "55"
[1] "88"
[1] "99"
[1] "1"
[1] "STAT"
[1] "Agnes Thomas"
[1] "99"
[1] "90"
[1] "99"
[1] "2"
[1] "ECON"

One can also retrieve a child node by its tag name, using the construct:

child_node = node[['name']]

that returns the child node corresponding to 'tag'.


In [29]:
%%R

name_node  = student1_node[['name']]
name_value = xmlValue(name_node)

print(name_value)


[1] "Bob Durant"

The tag names of a given node can be retrieved using the function

names(node)

that returns a character vector containing the tag names of the children nodes.

Remark: One can do that because an object of any class in R is just an enhanced list. Thus we can access some of the methods of the underlying list; here the name method that returns the element labels of a list.


In [30]:
%%R

tag_names = names(student1_node)
print(tag_names)


      name    midterm      final   homework    section      major 
    "name"  "midterm"    "final" "homework"  "section"    "major" 

In [31]:
%%R

for(student in xmlChildren(stat133_node))
{
    for(tag_name in names(student))
    {
       cat(tag_name, ':', xmlValue(student[[tag_name]]), '\n')
    }
    cat('\n-----------\n')
}


name : Bob Durant 
midterm : 55 
final : 88 
homework : 99 
section : 1 
major : STAT 

-----------
name : Agnes Thomas 
midterm : 99 
final : 90 
homework : 99 
section : 2 
major : ECON 

-----------

A digression on lists: the double bracket operator

As we saw, the single bracket operator

list[range]

allows us to retrive a sublist corresponding to the range.

In case, the range is a single index then

list[i]

is a sublist with one element, namely the element at position i.

CAUTION: A list with a single element is different that the element itself.


In [32]:
%%R

element1 = c('a', 'b', 'c')   #character vector
element2 = c(1,2)             #numeric vector

my_list  = list(element1, element2)

x = my_list[1]

In [33]:
%%R

print(class(element1))
print(class(x))


[1] "character"
[1] "list"

In [34]:
%%R

print(x[1])


[[1]]
[1] "a" "b" "c"


In [35]:
%%R

print(element1[1])


[1] "a"

To retrieve the actual list element at a given position i, one needs to use the double bracket operator:

list[[i]]

This also valid, when one uses element labels instead of position indices:

list[['label']] #retrieves the actual list element 
list['label']   #yields back a list containing the list element

In [36]:
%%R

x = list(names=c('Bob', 'Luc', 'Paul'), grades=c(45, 76,12))
print(x)


$names
[1] "Bob"  "Luc"  "Paul"

$grades
[1] 45 76 12


In [37]:
%%R
print(x['names'])
print(x[['names']])


$names
[1] "Bob"  "Luc"  "Paul"

[1] "Bob"  "Luc"  "Paul"

A digression on lists: lapply and sapply

Given a function f and a list

$$x = (x_1, x_2,\dots,x_n)$$

the list apply function:

lapply(x, f)

will return the list

$$ (f(x_1),\dots,f(x_n)). $$

In [38]:
%%R

f = function(x) return(c(mean=mean(x), sd=sd(x)))

var1 = c(1,2,3,4,5,6,8,9)
var2 = c(18,34,2)
    
x = list(V1=var1, V2=var2)

y = lapply(x, f)

print(y)


$V1
    mean       sd 
4.750000 2.815772 

$V2
mean   sd 
  18   16 

Now suppose the function $f$ returns a single number from a numeric vector:


In [39]:
%%R

f = function(x) return(mean=mean(x))

y = lapply(x, f)

print(y)


$V1
[1] 4.75

$V2
[1] 18

In this case, it makes sense that the output of lapply would rather be a numeric vector rather than a list of numeric vectors with one element each.

For that purpose, one has the simplify apply function

sapply(x, f, args)

that will try to simplify the output into a vector of the proper type.

Remark: The last argument args can be omitted, but it can also be used to pass additional parameters to the function $f$, if needed.


In [40]:
%%R

y = sapply(x, f)

print(y)


   V1    V2 
 4.75 18.00 

From XML document to data frames

Once data has been

  • downloaded into the local file system in some given format
  • uploaded onto R using the appropriate library

one needs to convert the data representation into data frames.

Namely, data analysis methods mostly applies to data put data cubes or data frames.

Data in loaded onto R as XML documents are not immediately ready for analysis.

To convert them into data frame, one may use complicated for loops and branching statements.

A much better way is to use R built-in vectorized capabilities along with the apply function family.

As an example, let's try to transform our XML object representation of the class data back into a data frame:


In [41]:
%%R

print(stat133_node)


<stat133 name="Computing with Data" department="STAT">
  <student sid="1232243">
    <name>Bob Durant</name>
    <midterm>55</midterm>
    <final>88</final>
    <homework>99</homework>
    <section>1</section>
    <major>STAT</major>
  </student>
  <student sid="3213453">
    <name>Agnes Thomas</name>
    <midterm>99</midterm>
    <final>90</final>
    <homework>99</homework>
    <section>2</section>
    <major>ECON</major>
  </student>
</stat133> 

Given

  • a list of the student nodes
  • a tag name such as 'final'

one may be able to construct a vector representing the the values of corresponding variable.


In [42]:
%%R

students = xmlChildren(stat133_node)

Let us now write a function that

  • takes a student node and a tag name as input

  • returns the tag value for the corresponding student

The idea is then to apply this function to the list of students, we have just constructed:


In [43]:
%%R

get_value = function(node, tag) return(xmlValue(node[[tag]]))

name = get_value(students[[1]], 'name')
print(name)


[1] "Bob Durant"

Using this function along with apply, we can now extract the first column or our data frame:


In [44]:
%%R

names = sapply(students, get_value, 'name')
print(names)


       student        student 
  "Bob Durant" "Agnes Thomas" 

We can now package this line of code into a function that takes a list of students and


In [45]:
%%R

g = function(tag, student_nodes) return(sapply(student_nodes, get_value, tag))

print(g('midterm', students))


student student 
   "55"    "99" 

In [46]:
%%R

tag_names = names(students[[1]])

data = lapply(tag_names, g, students)

In [47]:
%%R

df = data.frame(data)
print(df)


          name midterm final homework section major
1   Bob Durant      55    88       99       1  STAT
2 Agnes Thomas      99    90       99       2  ECON