In this notebook, you'll
List the team members contributing to this notebook, along with their responsibilities:
I advise you to work at least in pairs on each project notebook, as you did for the homework assignments. Of course, all team members may participate in each notebook.
In markdown cells, you'll
Remark: If your data is not directly available on the web (for instance, you requested it from a company, or got it from Inna or Henry), place your data in your personal bdrive and provide a link to it; you'll use that link in this notebook to download the data into the data folder, as explained below.
In code cells, you'll
create the following directories:
./data
./data/raw # to store your raw data
./data/cleaned # to store the cleaned data
./data/simulated # to store simulated data
./visualizations # to store your plots
head method)
Write scripts into the ./script directory using the magic command
%%file ./script/file_name
Your scripts should contain functions and objects allowing you to
./data/raw
This way, the functions and objects contained in your scripts will be accessible in other notebooks through the import command in Python (see example below).
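For Python users, the write-then-import workflow can be sketched as follows. This is only an illustration under stated assumptions: the module name `plant_utils` and the function `clean_name` are hypothetical, a temporary directory stands in for ./script, and `Path.write_text` plays the role that a `%%file ./script/plant_utils.py` cell would play in the notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module content; in a notebook, a cell starting with
# `%%file ./script/plant_utils.py` would write the same text to disk.
MODULE_SOURCE = '''
def clean_name(name):
    """Strip surrounding whitespace and capitalize each word."""
    return name.strip().title()
'''


def write_and_import(script_dir, module_name, source):
    """Write `source` to script_dir/module_name.py, then import the module."""
    script_dir = Path(script_dir)
    script_dir.mkdir(parents=True, exist_ok=True)
    (script_dir / f"{module_name}.py").write_text(source)
    # Make the script directory importable, as another notebook would do.
    if str(script_dir) not in sys.path:
        sys.path.insert(0, str(script_dir))
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# A temporary directory stands in for ./script in this sketch.
plant_utils = write_and_import(tempfile.mkdtemp(), "plant_utils", MODULE_SOURCE)
cleaned = plant_utils.clean_name("  bloodroot ")
print(cleaned)
```

In another notebook, the equivalent would be to add the script directory to `sys.path` and then write `import plant_utils`.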
Remark 1: In the code cells, you may use Python, R, or Bash, whichever you find most convenient.
Remark 2: Try to make your notebook as readable and usable (by others) as possible.
Here is an example of how to package your code into scripts that can be reused in other notebooks.
You'll need to do something similar. You'll also need to write explanations in markdown cells, as outlined above.
In [2]:
%%bash
mkdir -p ./script ./data/raw ./data/cleaned ./data/simulated ./visualizations
ls -r
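The same directory layout can also be created from a Python cell. The sketch below is one possible way to do it with `pathlib`; the helper name `create_project_dirs` is made up for illustration, and a temporary directory is used as the project root so that the example runs anywhere (in the notebook, the root would be the current directory).

```python
import tempfile
from pathlib import Path

# Directory names taken from the list given earlier in this notebook.
PROJECT_DIRS = [
    "script",
    "data/raw",
    "data/cleaned",
    "data/simulated",
    "visualizations",
]


def create_project_dirs(root="."):
    """Create the project directories under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        path = root / name
        # parents=True also creates ./data; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Stand-in root for this sketch; use create_project_dirs(".") in the notebook.
project_root = Path(tempfile.mkdtemp())
created_dirs = create_project_dirs(project_root)
for path in sorted(created_dirs):
    print(path.relative_to(project_root))
```

Because of `exist_ok=True`, the cell can be rerun without raising an error when the directories already exist.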
In this example, I am using R. You will probably want to use Python instead.
In [4]:
# The R magics are provided by the rpy2 package.
%load_ext rpy2.ipython
TO DO: list sources, display data sample, explain data content, etc.
We now download an XML source containing plant data, and display a typical plant entry, represented as an XML node:
In [6]:
%%R
library(XML)
plant_url = 'http://www.stat.berkeley.edu/classes/s133/data/plant_catalog.xml'
plant_file = './data/raw/plant.xml'
download.file(plant_url, plant_file, method="curl")
xml_doc = xmlParse(plant_file)
root_node = xmlRoot(xml_doc)
plant_nodes = xmlChildren(root_node)
print(plant_nodes[[1]])
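A rough Python counterpart of this parsing step, using only the standard library, might look like the sketch below. To keep it runnable offline, it parses a small made-up sample instead of the downloaded file: only the COMMON tag is known from this notebook, while the ZONE tag and all values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up two-entry sample standing in for ./data/raw/plant.xml.
# Only the COMMON tag appears in this notebook; the rest is illustrative.
SAMPLE_XML = """<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>"""

# For a file on disk, ET.parse(plant_file).getroot() returns the root node;
# urllib.request.urlretrieve(plant_url, plant_file) would download it first.
root_node = ET.fromstring(SAMPLE_XML)

# The children of the root are the individual plant entries.
plant_nodes = list(root_node)

# Serialize the first entry back to text to inspect it.
first_plant = ET.tostring(plant_nodes[0], encoding="unicode")
print(first_plant)
```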
The data frame column labels are the tags of the child nodes contained in each plant XML node:
In [7]:
%%R
column_names = names(plant_nodes[[1]])
cat(column_names)
Given a column name, we create a data frame column containing the column values:
In [11]:
%%R
common_tag = column_names[1]
get_value = function(plant_node, tag) xmlValue(plant_node[[tag]])
common_column = sapply(plant_nodes, get_value, common_tag)
cat(common_column)
We package the code above into a function that creates the data frame column corresponding to a tag name and a list of plant nodes:
In [12]:
%%R
get_column = function(tag, plants) sapply(plants, get_value, tag)
cat(get_column('COMMON', plant_nodes))
We are now ready to retrieve all the columns of our data frame into a list of vectors, which we will use to construct our data frame:
In [13]:
%%R
data = lapply(column_names, get_column, plant_nodes)
plant_df = data.frame(data, stringsAsFactors = FALSE)
print(head(plant_df))
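The column-by-column construction above can be mirrored in Python. The following sketch reuses the names `get_value` and `get_column` from the R cells, but it is an independent illustration rather than a translation to rely on: the sample XML is made up (only the COMMON tag comes from this notebook), and a plain dictionary of lists stands in for the data frame.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the COMMON tag is taken from this notebook.
SAMPLE_XML = """<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>"""


def get_value(plant_node, tag):
    """Return the text of the child of `plant_node` named `tag`."""
    return plant_node.findtext(tag)


def get_column(tag, plants):
    """Return the value of `tag` for every plant node, in order."""
    return [get_value(plant, tag) for plant in plants]


plant_nodes = list(ET.fromstring(SAMPLE_XML))

# The tags of the first plant entry serve as column names.
column_names = [child.tag for child in plant_nodes[0]]

# One list per column, keyed by column name.
columns = {tag: get_column(tag, plant_nodes) for tag in column_names}
print(columns)
```

If pandas is available, passing this dictionary to `pandas.DataFrame(columns)` gives a data frame with one column per tag.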
Using the data frame we just created, we save the plant data to a CSV file in the raw data directory:
In [14]:
%%R
write.csv(plant_df, './data/raw/plants_to_be_cleaned.csv')
In [ ]:
!head ./data/raw/plants_to_be_cleaned.csv
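In Python, saving the table and previewing its first lines could be sketched with the standard `csv` module. The column data below is made up, the helper name `write_columns_to_csv` is hypothetical, and the file is written to a temporary directory rather than ./data/raw so that the example has no side effects. Unlike `write.csv` with its default settings, this sketch writes no row-name column.

```python
import csv
import tempfile
from pathlib import Path

# Made-up column data standing in for the plant table.
columns = {"COMMON": ["Bloodroot", "Columbine"], "ZONE": ["4", "3"]}


def write_columns_to_csv(columns, csv_path):
    """Write a dict of equal-length columns to `csv_path`, header first."""
    header = list(columns)
    # zip(*...) turns the per-column lists into per-row tuples.
    rows = zip(*columns.values())
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Temporary location for this sketch; the notebook would use ./data/raw.
csv_path = Path(tempfile.mkdtemp()) / "plants_to_be_cleaned.csv"
write_columns_to_csv(columns, csv_path)

# Rough equivalent of `!head`: show at most the first five lines.
preview = csv_path.read_text().splitlines()[:5]
print("\n".join(preview))
```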
In [15]:
%%file ./script/plant_df-R
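# Return the text content of the child node named `tag` in one plant node.
get_value = function(plant_node, tag) xmlValue(plant_node[[tag]])
# Build one data frame column: the value of `tag` for every plant node.
get_column = function(tag, plants) sapply(plants, get_value, tag)
# Parse a plant XML file and return its content as a data frame.
create_df_from_plant_xml = function(plant_file){
    require(XML)
    xml_doc = xmlParse(plant_file)
    root_node = xmlRoot(xml_doc)
    plant_nodes = xmlChildren(root_node)
    # The tags of the first plant node serve as column names.
    column_names = names(plant_nodes[[1]])
    data = lapply(column_names, get_column, plant_nodes)
    return(data.frame(data, stringsAsFactors = FALSE))
}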
Now, our function converting a raw plant XML file into an R data frame can be used in other notebooks, using the source command:
In [16]:
%%R
source('./script/plant_df-R')
print(head(create_df_from_plant_xml(plant_file)))