Data Gathering

In this notebook, we will download our data from US department of education and load the excel format data into python. Then we write out our data as .csv format.

Team members responsible for this notebook:

List the team members contributing to this notebook, along with their responsabilities:

  • team member 1 name: team member 1 responsabilities
  • team member 2 name: team member 2 responsabilities
  • etc.

I advise you to work at least in pairs for each project notebook, as you did for the homework assignments. Of course, all team members may participate to each notebook.

Creating the directory structure


In [1]:
%%bash

mkdir ./script ./data ./data/raw ./data/cleaned ./data/simulated ./visualizations
ls -r


visualizations
script
NB4_project_report.ipynb
NB3_data_analysis.ipynb
NB2_data_cleaning.ipynb
NB1_data_gathering.ipynb
file.zip
data

Downloading and displaying raw data

I am using bash to download the zip file.


In [2]:
%%bash

curl "http://www.ope.ed.gov/security/dataFiles/Crime2013EXCEL.zip" > file.zip


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.3M  100 39.3M    0     0   296k      0  0:02:15  0:02:15 --:--:--  302k

Copy the zip file to ./data/raw and upzip it.


In [3]:
%%bash

cp file.zip ./data/raw/data.zip
cd ./data/raw
unzip -o data.zip


Archive:  data.zip
  inflating: Publicpropertydiscipline101112_Doc.doc  
  inflating: Publicpropertycrime101112_Doc.doc  
  inflating: Publicpropertyarrest101112_Doc.doc  
  inflating: Oncampusarrest101112_Doc.doc  
  inflating: Noncampushate101112_Doc.doc  
  inflating: Noncampusdiscipline101112_Doc.doc  
  inflating: Oncampuscrime101112_Doc.doc  
  inflating: Noncampuscrime101112_Doc.doc  
  inflating: Noncampusarrest101112_Doc.doc  
  inflating: Oncampusdiscipline101112_Doc.doc  
  inflating: publicpropertyhate101112.xlsx  
  inflating: noncampusarrest101112.xls  
  inflating: Publicpropertydiscipline101112.xls  
  inflating: Publicpropertycrime101112.xls  
  inflating: oncampushate101112.xlsx  
  inflating: Publicpropertyarrest101112.xls  
  inflating: residencehallhate101112.xlsx  
  inflating: oncampusdiscipline101112.xls  
  inflating: oncampusarrest101112.xls  
  inflating: Residencehallfire12.xls  
  inflating: oncampuscrime101112.xls  
  inflating: noncampushate101112.xlsx  
  inflating: noncampusdiscipline101112.xls  
  inflating: noncampuscrime101112.xls  
  inflating: Residencehallfire11.xls  
  inflating: Residencehallfire10.xls  
  inflating: Residencehalldiscipline101112.xls  
  inflating: Residencehallcrime101112.xls  
  inflating: Residencehallarrest101112.xls  
  inflating: reportedhate101112.xlsx  
  inflating: reportedcrime101112.xls  
  inflating: reportedarrest101112.xls  
  inflating: reporteddiscipline101112.xls  
  inflating: ReadMe.txt              
  inflating: Residencehallfire12_Doc.doc  
  inflating: Residencehallfire11_Doc.doc  
  inflating: Residencehallfire10_Doc.doc  
  inflating: Residencehallhate101112_Doc.doc  
  inflating: Residencehalldiscipline101112_Doc.doc  
  inflating: Residencehallcrime101112_Doc.doc  
  inflating: Residencehallarrest101112_Doc.doc  
  inflating: Oncampushate101112_Doc.doc  
  inflating: Reportedhate101112_Doc.doc  
  inflating: Reporteddiscipline101112_Doc.doc  
  inflating: Reportedcrime101112_Doc.doc  
  inflating: Reportedarrest101112_Doc.doc  
  inflating: Publicpropertyhate101112_Doc.doc  

In [1]:
%%bash

pwd


/home/oski/project2/stat133-project2/examples

Introduction for Data Source

In this data source, we choose the "on campus crime data".This raw dataset contains over 10,000 postsecondary institutions with information about different types of crime and also information regarding the intuitions such as private/public, gender ratio and geographical location. For our interest, we choose the specific file about "on-campus crime" to focus on. The data provided by government website are comprehensive enough, but there are some minor typos that we need to fix when loading the data.

Loading raw data into data frames


In [1]:
%load_ext rmagic

Our form of raw data is .xls, so we will use pd.read_excel to load the data as a DataFrame.


In [2]:
import pandas as pd
import xlrd

In [3]:
ls ~/project2/stat133-project2/examples/data/raw/oncampuscrime101112.xls


/home/oski/project2/stat133-project2/examples/data/raw/oncampuscrime101112.xls

In [4]:
crime_file = '/home/oski/project2/stat133-project2/examples/data/raw/oncampuscrime101112.xls'
sheet_name = 'oncampuscrime101112'

data = pd.read_excel(crime_file, sheet_name, index_col=None, na_values=['NA'])

Write the DataFrame into R and save as a .csv file to be cleaned.


In [5]:
%%R -i data

print(head(data))


   UNITID_P                              INSTNM
0 100654001            Alabama A & M University
1 100663001 University of Alabama at Birmingham
2 100663002 University of Alabama at Birmingham
3 100690001                  Amridge University
4 100706001 University of Alabama in Huntsville
5 100724001            Alabama State University
                              BRANCH
0                        Main Campus
1                        Main Campus
2 Huntsville Regional Medical Campus
3                        Main Campus
4                        Main Campus
5                        Main Campus
                                             Address       City State       ZIP
0                                   4900 MERIDIAN ST     NORMAL    AL     35762
1 ADMINISTRATION BLDG SUITE 1070, 701 20th Street So BIRMINGHAM    AL 352940110
2                             301 Governors Drive SW Huntsville    AL     35801
3                                     1200 TAYLOR RD MONTGOMERY    AL 361173553
4                                    301 SPARKMAN DR HUNTSVILLE    AL     35899
5                               915 SOUTH JACKSON ST MONTGOMERY    AL 361040271
  sector_cd                        Sector_desc men_total women_total Total
0         1            Public, 4-year or above      2224        2629  4853
1         1            Public, 4-year or above      7163       10836 17999
2         1            Public, 4-year or above      7163       10836 17999
3         2 Private nonprofit, 4-year or above       321         382   703
4         1            Public, 4-year or above      4207        3429  7636
5         1            Public, 4-year or above      2276        3540  5816
  MURD10 NEG_M10 FORCIB10 NONFOR10 ROBBE10 AGG_A10 BURGLA10 VEHIC10 ARSON10
0      0       0        5        0       3       5       73       2       1
1      1       0        1        0       2       0       20       1       1
2      0       0        0        0       0       0        1       0       0
3      0       0        0        0       0       0        0       0       0
4      3       0        0        0       1       3       24       3       0
5      0       0        0        0       5       1       34       0       0
  MURD11 NEG_M11 FORCIB11 NONFOR11 ROBBE11 AGG_A11 BURGLA11 VEHIC11 ARSON11
0      0       0        4        0       1       8       58       2       2
1      0       0        2        0       2       0       11       5       4
2      0       0        0        0       0       0        0       0       0
3      0       0        0        0       0       0        0       0       0
4      0       0        4        0       1       1       38       2       0
5      0       0        1        0       0       5       23       2       0
  MURD12 NEG_M12 FORCIB12 NONFOR12 ROBBE12 AGG_A12 BURGLA12 VEHIC12 ARSON12
0      0       0        2        0       6       2        3       2       0
1      0       0        2        0       1       0        5       6       1
2      0       0        0        0       0       0        0       0       0
3      0       0        0        0       0       0        0       0       0
4      0       0        3        0       0       1       22       0       0
5      0       0        1        0       1       8       45       0       0
  FILTER10 FILTER11 FILTER12
0        1        1        1
1        1        1        1
2        1        1        1
3        1        1        1
4        1        1        1
5        1        1        1

It is the right one, and the data examples are shown above.


In [6]:
%%R

print(dim(data))
print(names(data))


[1] 11064    42
 [1] "UNITID_P"    "INSTNM"      "BRANCH"      "Address"     "City"       
 [6] "State"       "ZIP"         "sector_cd"   "Sector_desc" "men_total"  
[11] "women_total" "Total"       "MURD10"      "NEG_M10"     "FORCIB10"   
[16] "NONFOR10"    "ROBBE10"     "AGG_A10"     "BURGLA10"    "VEHIC10"    
[21] "ARSON10"     "MURD11"      "NEG_M11"     "FORCIB11"    "NONFOR11"   
[26] "ROBBE11"     "AGG_A11"     "BURGLA11"    "VEHIC11"     "ARSON11"    
[31] "MURD12"      "NEG_M12"     "FORCIB12"    "NONFOR12"    "ROBBE12"    
[36] "AGG_A12"     "BURGLA12"    "VEHIC12"     "ARSON12"     "FILTER10"   
[41] "FILTER11"    "FILTER12"   

Using the data frame we just created, we save the plant data into a csv file into the raw data directory:


In [11]:
%%R

write.csv(data, '/home/oski/project2/stat133-project2/examples/data/raw/oncampuscrime_to_be_cleaned.csv')

In [15]:
ls ~/project2/stat133-project2/examples/data/raw/oncampuscrime_to_be_cleaned.csv


/home/oski/project2/stat133-project2/examples/data/raw/oncampuscrime_to_be_cleaned.csv