After having 1) set up the data analysis problem and 2) forged a good idea of the ideal data set (and corresponding analysis) for the problem, the first stage of the data analysis process is to start gathering relevant data in order to build the best possible approximation of your ideal data set.
The internet (which is basically a giant interconnected file system hosting data files) is a great place to start. This data, however, may come from various websites and be in various formats. It will need to be unified and be put in a regular form--that is, into a collection of numerical arrays--on which to apply the data analysis algorithms described in the previous sections.
Here, we will learn how to use R to:
In particular, we will spend some time on the XML format, which stands for eXtensible Markup Language. The reason why this format is particularly interesting is that instead of structuring data in tabular form, it structures it in the form of a graph. A graph is a collection of nodes containing information. These nodes are related to each other through links, and one can access the information contained on one node just by traveling from node to node using the links.
Data structured in this form encompass HTML pages (HTML can be roughly understood as a dialect of XML), the information hosted by internet websites (websites are related to each other by links, and one can regard the Google search tool as a way to bypass the internet's graph structure to get to the information directly), a computer file system (where many users store their data as they go, forming an increasingly intricate graph of directories and subdirectories), or the data of user profiles in social media networks.
Graphical data is not regular, and statistical analysis is not readily applicable for data structured in this form. It needs to first be transformed into regular data--i.e., cubes of numbers--before statistical analysis can take place.
various data sources (XML, jason, HTML, etc.) $\longrightarrow$ raw data tables (csv, xls, etc.)
This involves
You'll find here a list of all availabe R packages.
To install R packages on the OskiBox, the best way is to type in
sudo apt-get r-cran-PackageName
The cran website maitains a comprehensive list of the R packages available this way.
To use the package after installation, you'll inkove the command:
library('package_name')
You'll find here a list of all availabe Python packages (or modules).
To install a Python package to osky box the best way is :
sudo pip install package_name
To use the package after installation, you'll inkove the command:
import package_name
Let us start by learning how to manipulate the local filesystem in your computer from R.
In [1]:
%load_ext rmagic
R offers a bunch of functions to
list the files and directories (see here):
list.files(path = ".")
list.dirs(path = ".")
create and modify files (see here for details):
file.create(path1, path2, etc.)
file.remove(path1, path2, etc.)
create directories (see here for details)
dir.create(path)
navigate the file system (see here):
getwd()
setwd(dir)
Let us create
'DATA'
to help us organize our data'raw'
to store the raw data collected from the web'cleaned'
to stored our final data after having cleaned the raw dataand then let us set move to the raw directory.
In [2]:
%%R
data_dir = './data'
raw_dir = paste(data_dir, '/raw', sep='')
cleaned_dir = paste(data_dir, '/cleaned', sep='')
dir.create(data_dir)
dir.create(raw_dir)
dir.create(cleaned_dir)
In [3]:
!ls data/
To download a file from the internet in R and store it on the local file system, one uses the function:
download.file(fileUrl, destfile)
It is also a good practice to keep track of the date the data was downloaded, since data from the web may likely be updated. For that, one can
date()
However, from within the notebook, it is simpler to use bash curl
command whose syntax is the following:
`curl yourURL > localFile`
In [4]:
%%R
url = 'http://benoitdherin.github.io/data-analysis-with-R/data/houses.csv'
download.file(url, 'myHouses.csv')
In [13]:
%%R
df = read.csv('myHouses.csv', header=T)
print(head(df))
Alternatively, one could download the file immediately into R without storing it first on the local file system:
In [16]:
%%R
dff = read.csv(url, header=T)
print(head(dff))
Example: The city of Baltimore offers an API (Application Programming Interface) allowing us to gather data on speed infractions in different formats:
csv (Comma Separated Values)
xml (Extensible Markup Language)
jason (JavaScript Object Notation)
xls (Microsoft Excel file format)
In this context, a website API is a set of rules that describe how to issues requests to a given webserver, say benoitdherin.github.io
by adding requests strings after the webserver name in its url
, say this string is data-analysis-with-R
. If type this string into a browser the website . For instance, This request string tells the webserver what to return to your browser as a result of this urlThe general format for an url is the following:
http://benoitdherin.github.io/data-analysis-with-R/data/houses.csv
http://web.server.name/re/que/st
The re/que/st
is a string describing to the webserver what you'd like it returns to you. The simplest form of a request is the path (location) of a file (ressource) on the webserver. The ressource that's returned to you is then that file:
Copied and pasted from the website:
In [17]:
protocol = 'https://'
website = 'data.baltimorecity.gov'
location = '/api/views/dz54-2aru/'
csvRequest = 'rows.csv?accessType=DOWNLOAD'
xmlRequest = 'rows.xml?accessType=DOWNLOAD'
jsonRequest = 'rows.json?accessType=DOWNLOAD'
To access this different data format, one neeed to use different urls:
In [18]:
csv_url = protocol + website + location + csvRequest
xml_url = protocol + website + location + csvRequest
json_url = protocol + website + location + csvRequest
xls_url = protocol + website + location + csvRequest
print csv_url; print xml_url; print json_url; print xls_url
!curl $csv_url > baltimore.csv
!curl $csv_url > baltimore.xml
!curl $csv_url > baltimore.json
!curl $csv_url > baltimore.xls
Let us prepare prepare the names of the files in which we will store the data:
Different data formats are loaded with different functions. The resulting R object storing the data also depends on the format.
Data files in tabular format such as csv are loaded directly into data frame with the the function:
read.table(file, sep, header)
In [19]:
%%R
df = read.table('baltimore.csv', sep=',', header=TRUE)
print(head(df))
Let us create a simple XML document.
Consider the following data for two students in a given class given in tabular form:
In [20]:
%%file course.csv
name, midterm, final, homework, section, major
Bob Durant, 55, 88, 99, 1, STAT
Agnes Thomas, 99, 90, 99, 2, ECON
Let's read this data nicely into a data frame:
In [21]:
%%R
df = read.csv('course.csv'); print(df)
The same data can be stored in XML format in the following way:
In [22]:
%%file stat133.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<stat133 name='Computing with Data' department='STAT'>
<student sid='1232243'>
<name>Bob Durant</name>
<midterm>55</midterm>
<final>88</final>
<homework>99</homework>
<section>1</section>
<major>STAT</major>
</student>
<student sid='3213453'>
<name>Agnes Thomas</name>
<midterm>99</midterm>
<final>90</final>
<homework>99</homework>
<section>2</section>
<major>ECON</major>
</student>
</stat133>
The first line
<?xml version="1.0" encoding="ISO-8859-1"?>
provides informations about
The whole course data is surrounded by two XML tags
<stat133> ... </stat133>
<tag>
</tag>
<tag>...data...</tag>
The data in between the stat133
node is itself a collection of xml nodes:
<student sid='1232243'> ... </student>
<student sid='3213453'> ... </student>
stat133
node here)Pictorially, one may represent the children/parent relationship as an arrow:
parent node (stat133 node) $\longrightarrow$ children node (student node)
With this interpretation, an xml document structures data in the form of a tree, called the xml tree.
In our class example:
<name>Bob Durant</name>
<midterm>55</midterm>
<final>88</final>
<homework>99</homework>
<section>1</section>
<major>STAT</major>
So each student node contains five leafs.
The values of an xml node is the collection of all the data contained by all the leafs stemming from this node.
Example: Consider the following xml code.
<A>
<B>value1</B>
<C>value2</C>
<D>
<E>
<G>value3</G>
</E>
<F>value4</F>
</D>
</A>
The value of
node A is the collection {value1, value2, value2, value4}
node B is {value1}
node D is {value3, value4}
etc.
Here is a tree representation of the xml code above:
An xml node can also have several attributes:
<tag attribute1=value attribue2=value etc.>...data...</tag>
The node attributes are specified in the opening tag.
In our class example:
<stat133 name='Computing with Data' department='STAT'>
<student sid='1232243'> ...data... </student>
<student sid='3213453'> ...data... </student>
</stat133>
name
and department
sid
Each XML node in a XML document may have
The collection of all XML nodes with the parent/children relation ship form the XML tree:
In R, the XML library contains functions and classes to represent XML documents.
Libraries in R corresponds to modules in Python.
In [23]:
%%R
library(XML)
To install this r package in the oskibox, open a termina and type in
sudo apt-get install r-cran-xml
Remark: The keyword library
in R corresponds to the import
keyword in Python.
The XML library provides two main classes:
A class
XMLDocument
representing whole XML documents including
<?xml version="1.0" encoding="ISO-8859-1"?>
A class
XMLNode
representing the XML nodes in the XML tree where the information is stored.
The first steps to load XML data into R is to use the functions:
xml_doc = xmlParse(file_adress)
root_node = xmlRoot(xml_doc)
xmlParse
takes as input the XML file address and returns the corresponding XMLDocument
objectxmlRoot
takes an XMLDocument
and returns the root node as a XMLNode
object.
In [24]:
%%R
xml_doc = xmlParse('stat133.xml')
stat133_node = xmlRoot(xml_doc)
To navigate the XML tree the XML library offers the following functions:
xmlAttrs(node)
xmlValue(node)
xmlChildren(node)
xmlParent(node)
that take an XMLNode
object and return the corresponding information.
In [25]:
%%R
#returns the attributes as a character vector
stat133_attrs = xmlAttrs(stat133_node)
print(stat133_attrs)
In [26]:
%%R
#return the children nodes as a list of XMLNode objects
stat133_children = xmlChildren(stat133_node) # return chi
student1_node = stat133_children[[1]]
student2_node = stat133_children[[2]]
#printing an XMLNode yields back the corresponding XML code
print(student1_node)
In [27]:
%%R
parent_node = xmlParent(student1_node)
print(parent_node)
In [28]:
%%R
for (student in xmlChildren(stat133_node))
{
for(node in xmlChildren(student)) print(xmlValue(node))
}
One can also retrieve a child node by its tag name, using the construct:
child_node = node[['name']]
that returns the child node corresponding to 'tag'
.
In [29]:
%%R
name_node = student1_node[['name']]
name_value = xmlValue(name_node)
print(name_value)
The tag names of a given node can be retrieved using the function
names(node)
that returns a character vector containing the tag names of the children nodes.
Remark: One can do that because an object of any class in R is just an enhanced list. Thus we can access some of the methods of the underlying list; here the name
method that returns the element labels of a list.
In [30]:
%%R
tag_names = names(student1_node)
print(tag_names)
In [31]:
%%R
for(student in xmlChildren(stat133_node))
{
for(tag_name in names(student))
{
cat(tag_name, ':', xmlValue(student[[tag_name]]), '\n')
}
cat('\n-----------\n')
}
As we saw, the single bracket operator
list[range]
allows us to retrive a sublist corresponding to the range.
In case, the range is a single index then
list[i]
is a sublist with one element, namely the element at position i.
CAUTION: A list with a single element is different that the element itself.
In [32]:
%%R
element1 = c('a', 'b', 'c') #character vector
element2 = c(1,2) #numeric vector
my_list = list(element1, element2)
x = my_list[1]
In [33]:
%%R
print(class(element1))
print(class(x))
In [34]:
%%R
print(x[1])
In [35]:
%%R
print(element1[1])
To retrieve the actual list element at a given position i, one needs to use the double bracket operator:
list[[i]]
This also valid, when one uses element labels instead of position indices:
list[['label']] #retrieves the actual list element
list['label'] #yields back a list containing the list element
In [36]:
%%R
x = list(names=c('Bob', 'Luc', 'Paul'), grades=c(45, 76,12))
print(x)
In [37]:
%%R
print(x['names'])
print(x[['names']])
Given a function f
and a list
the list apply function:
lapply(x, f)
will return the list
$$ (f(x_1),\dots,f(x_n)). $$
In [38]:
%%R
f = function(x) return(c(mean=mean(x), sd=sd(x)))
var1 = c(1,2,3,4,5,6,8,9)
var2 = c(18,34,2)
x = list(V1=var1, V2=var2)
y = lapply(x, f)
print(y)
Now suppose the function $f$ returns a single number from a numeric vector:
In [39]:
%%R
f = function(x) return(mean=mean(x))
y = lapply(x, f)
print(y)
In this case, it makes sense that the output of lapply would rather be a numeric vector rather than a list of numeric vectors with one element each.
For that purpose, one has the simplify apply function
sapply(x, f, args)
that will try to simplify the output into a vector of the proper type.
Remark: The last argument args
can be omitted, but it can also be used to pass additional parameters to the function $f$, if needed.
In [40]:
%%R
y = sapply(x, f)
print(y)
Once data has been
one needs to convert the data representation into data frames.
Namely, data analysis methods mostly applies to data put data cubes or data frames.
Data in loaded onto R as XML documents are not immediately ready for analysis.
To convert them into data frame, one may use complicated for
loops and branching statements.
A much better way is to use R built-in vectorized capabilities along with the apply function family.
As an example, let's try to transform our XML object representation of the class data back into a data frame:
In [41]:
%%R
print(stat133_node)
Given
one may be able to construct a vector representing the the values of corresponding variable.
In [42]:
%%R
students = xmlChildren(stat133_node)
Let us now write a function that
takes a student node and a tag name as input
returns the tag value for the corresponding student
The idea is then to apply this function to the list of students, we have just constructed:
In [43]:
%%R
get_value = function(node, tag) return(xmlValue(node[[tag]]))
name = get_value(students[[1]], 'name')
print(name)
Using this function along with apply, we can now extract the first column or our data frame:
In [44]:
%%R
names = sapply(students, get_value, 'name')
print(names)
We can now package this line of code into a function that takes a list of students and
In [45]:
%%R
g = function(tag, student_nodes) return(sapply(student_nodes, get_value, tag))
print(g('midterm', students))
In [46]:
%%R
tag_names = names(students[[1]])
data = lapply(tag_names, g, students)
In [47]:
%%R
df = data.frame(data)
print(df)