Data Gathering

In this notebook, you'll

  • create the directory structure for your project
  • download the raw data for your project
  • write functions and classes to transform raw data (e.g in XML, jason, etc.) into data frames
  • explain what your are doing

Team members responsible for this notebook:

List the team members contributing to this notebook, along with their responsabilities:

  • team member 1 name: team member 1 responsabilities
  • team member 2 name: team member 2 responsabilities
  • etc.

I advise you to work at least in pairs for each project notebook, as you did for the homework assignments. Of course, all team members may participate to each notebook.

Instructions:

In markdown cells, you'll

  • list and describe the data sources for your class project
    • provide the data urls
    • describe the file formats
    • explain the data content of each file

Remark: If your data is not directly available through the web (for instance, you requested it from a company, or got it from Inna or Henry), place you data into your personal bdrive, and provide a link to it, that you'll use in this notebook to download it into the data folder, as explained below).

In code cells, you'll

  • create the following directories:

      ./data
      ./data/raw        # to store your raw data
      ./data/cleaned    # to store the cleanned data
      ./data/simulated  # to store simulated data
      ./visualizations  # to store your plots
  • display samples of your data files (e.g. for a csv files, use data frames and the head method)
  • Write scripts into the ./script directory using the he magic command

    %%file ./script/file_name

Your scripts should contain functions and objects allowing you to

  • download the actual raw data in the directory ./data/raw
  • load the downloaded the raw data files into data frames

This way, the functions and objects contained in your scripts will be accessible in other notebooks through the import command in Python (see example below).

Remark 1: In the code cells, you may use Python, R, or Bash, as you find more convenient.

Remark 2: Try to make your notebook as readable and usable (by others) as possible.

Example

Here is an example of how to package your code into scripts reusable in other notebook.

You'll need to do something similar to that. You'll also need to write explanations as outlined above in markdown cells.

Creating the directory structure


In [1]:
%%bash

mkdir ./script ./data ./data/raw ./data/cleaned ./data/simulated ./visualizations
ls -r


visualizations
script
NB4_project_report.ipynb
NB3_data_analysis.ipynb
NB2_data_cleaning.ipynb
NB1_data_gathering.ipynb
file.zip
data

Downloading and displaying raw data

In this example, I am using R. You want to use Python instead.


In [2]:
%%bash

curl "http://www.ope.ed.gov/security/dataFiles/Crime2013EXCEL.zip" > file.zip


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.3M  100 39.3M    0     0   296k      0  0:02:15  0:02:15 --:--:--  302k

In [3]:
%%bash

cp file.zip ./data/raw/data.zip
cd ./data/raw
unzip -o data.zip


Archive:  data.zip
  inflating: Publicpropertydiscipline101112_Doc.doc  
  inflating: Publicpropertycrime101112_Doc.doc  
  inflating: Publicpropertyarrest101112_Doc.doc  
  inflating: Oncampusarrest101112_Doc.doc  
  inflating: Noncampushate101112_Doc.doc  
  inflating: Noncampusdiscipline101112_Doc.doc  
  inflating: Oncampuscrime101112_Doc.doc  
  inflating: Noncampuscrime101112_Doc.doc  
  inflating: Noncampusarrest101112_Doc.doc  
  inflating: Oncampusdiscipline101112_Doc.doc  
  inflating: publicpropertyhate101112.xlsx  
  inflating: noncampusarrest101112.xls  
  inflating: Publicpropertydiscipline101112.xls  
  inflating: Publicpropertycrime101112.xls  
  inflating: oncampushate101112.xlsx  
  inflating: Publicpropertyarrest101112.xls  
  inflating: residencehallhate101112.xlsx  
  inflating: oncampusdiscipline101112.xls  
  inflating: oncampusarrest101112.xls  
  inflating: Residencehallfire12.xls  
  inflating: oncampuscrime101112.xls  
  inflating: noncampushate101112.xlsx  
  inflating: noncampusdiscipline101112.xls  
  inflating: noncampuscrime101112.xls  
  inflating: Residencehallfire11.xls  
  inflating: Residencehallfire10.xls  
  inflating: Residencehalldiscipline101112.xls  
  inflating: Residencehallcrime101112.xls  
  inflating: Residencehallarrest101112.xls  
  inflating: reportedhate101112.xlsx  
  inflating: reportedcrime101112.xls  
  inflating: reportedarrest101112.xls  
  inflating: reporteddiscipline101112.xls  
  inflating: ReadMe.txt              
  inflating: Residencehallfire12_Doc.doc  
  inflating: Residencehallfire11_Doc.doc  
  inflating: Residencehallfire10_Doc.doc  
  inflating: Residencehallhate101112_Doc.doc  
  inflating: Residencehalldiscipline101112_Doc.doc  
  inflating: Residencehallcrime101112_Doc.doc  
  inflating: Residencehallarrest101112_Doc.doc  
  inflating: Oncampushate101112_Doc.doc  
  inflating: Reportedhate101112_Doc.doc  
  inflating: Reporteddiscipline101112_Doc.doc  
  inflating: Reportedcrime101112_Doc.doc  
  inflating: Reportedarrest101112_Doc.doc  
  inflating: Publicpropertyhate101112_Doc.doc  

In [4]:
%%bash

cd ./data/raw
#cp oncampuscrime101112.xls oncampuscrime101112.csv

In [4]:

TO DO: list sources, display data sample, explain data content, etc.

We now download an XML source containing plant data, and display an typical plant entry, represented as a XML node:

Loading raw data into data frames


In [5]:
%load_ext rmagic

Our form of raw data is .xls, so we will use read.xls to read our raw data into R as a DataFrame.


In [6]:
%%R

install.packages("gdata")
library(gdata)


Installing package into ‘/home/oski/R/i686-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
also installing the dependency ‘gtools’

trying URL 'http://mirror.its.dal.ca/cran/src/contrib/gtools_3.3.1.tar.gz'
Content type 'application/x-gzip' length 54766 bytes (53 Kb)
opened URL
==================================================
downloaded 53 Kb

trying URL 'http://mirror.its.dal.ca/cran/src/contrib/gdata_2.13.3.tar.gz'
Content type 'application/x-gzip' length 717134 bytes (700 Kb)
opened URL
==================================================
downloaded 700 Kb


The downloaded source packages are in
	‘/tmp/Rtmp7FFawi/downloaded_packages’
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: ‘gdata’

The following object is masked from ‘package:stats’:

    nobs

The following object is masked from ‘package:utils’:

    object.size


In [ ]:
#data <- read.csv("~/project2/stat133-project2/examples/data/raw/oncampuscrime101112.csv",header = TRUE)

Given a column name, we create a data frame column containing the column values:


In [17]:
%%R

print(dim(data))
print(names(data))
print(typeof(data))
print(head(data))


[1] 12498     1
[1] "��.�...�"
[1] "list"
            ��.�...�
1                   
2               �
3                   
4  less-than 2-year/
5                   
6       PO Box HCU\b

We package the code above into a function that creates the data frame column corresponding from a tag name and a list of plant nodes:


In [12]:
%%R

get_column = function(tag, plants) sapply(plants, get_value, tag)
    
cat((get_column('COMMON', plant_nodes)))


Bloodroot Columbine Marsh Marigold Cowslip Dutchman's-Breeches Ginger, Wild Hepatica Liverleaf Jack-In-The-Pulpit Mayapple Phlox, Woodland Phlox, Blue Spring-Beauty Trillium Wake Robin Violet, Dog-Tooth Trout Lily Adder's-Tongue Anemone Grecian Windflower Bee Balm Bergamot Black-Eyed Susan Buttercup Crowfoot Butterfly Weed Cinquefoil Primrose Gentian Blue Gentian Jacob's Ladder Greek Valerian California Poppy Shooting Star Snakeroot Cardinal Flower

We are now ready to retrieve all the columns of our data frame into a list of vectors, which we will use to construct our data frame:


In [13]:
%%R

data = lapply(column_names, get_column, plant_nodes)

plant_df = data.frame(data, stringsAsFactors = FALSE)

print(head(plant_df))


               COMMON              BOTANICAL ZONE        LIGHT PRICE
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady $2.44
2           Columbine   Aquilegia canadensis    3 Mostly Shady $9.37
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny $6.81
4             Cowslip       Caltha palustris    4 Mostly Shady $9.90
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady $6.44
6        Ginger, Wild       Asarum canadense    3 Mostly Shady $9.03
  AVAILABILITY
1       031599
2       030699
3       051799
4       030699
5       012099
6       041899

Using the data frame we just created, we save the plant data into a csv file into the raw data directory:


In [14]:
%%R

write.csv(plant_df, './data/raw/plants_to_be_cleaned.csv')

In [ ]:
!head ./data/raw/plants_to_be_cleaned.csv

Packaging the gathering code into reusable functions


In [15]:
%%file ./script/plant_df-R

get_value  = function(plant_node, tag) xmlValue(plant_node[[tag]])
    
get_column = function(tag, plants) sapply(plants, get_value, tag)
    
create_df_from_plant_xml = function(plant_file){
    require(XML)
    xml_doc      = xmlParse(plant_file)
    root_node    = xmlRoot(xml_doc)
    plant_nodes  = xmlChildren(root_node)
    column_names = names(plant_nodes[[1]])

    data     = lapply(column_names, get_column, plant_nodes)
    return(data.frame(data, stringsAsFactors = FALSE))
}


Writing ./script/plant_df-R

Now, our function converting a raw plant XML file into a R data frame can be used in other notebooks, using the source command:


In [16]:
%%R

source('./script/plant_df-R')

print(head(create_df_from_plant_xml(plant_file)))


               COMMON              BOTANICAL ZONE        LIGHT PRICE
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady $2.44
2           Columbine   Aquilegia canadensis    3 Mostly Shady $9.37
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny $6.81
4             Cowslip       Caltha palustris    4 Mostly Shady $9.90
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady $6.44
6        Ginger, Wild       Asarum canadense    3 Mostly Shady $9.03
  AVAILABILITY
1       031599
2       030699
3       051799
4       030699
5       012099
6       041899

In [ ]: