In [1]:
%%bash
mkdir -p ./script ./data/raw ./data/cleaned ./data/simulated ./visualizations
ls -r
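The same folder layout can also be created from Python, which is handy if the notebook is re-run or used on a system without bash. This is a minimal sketch: the helper name `make_project_dirs` and the `PROJECT_DIRS` list are hypothetical, and only the directory names come from the mkdir command above.

```python
from pathlib import Path

# Directory names mirror the mkdir command above; ./data is created
# implicitly as the parent of its three subfolders.
PROJECT_DIRS = [
    "script",
    "data/raw",
    "data/cleaned",
    "data/simulated",
    "visualizations",
]


def make_project_dirs(root="."):
    """Create the project folders under root; safe to call more than once."""
    created = []
    for name in PROJECT_DIRS:
        path = Path(root) / name
        # parents=True builds intermediate folders, exist_ok=True avoids
        # an error when the folder is already there.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_dirs(".")` in a notebook cell would build the folders in the current working directory.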
I am using bash to download the zip file.
In [2]:
%%bash
curl -L -o file.zip "http://www.ope.ed.gov/security/dataFiles/Crime2013EXCEL.zip"
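For readers who prefer to stay in Python, the download step might look like the sketch below. The function name `download` and its `timeout` default are assumptions made for illustration; only the URL and the output name `file.zip` come from the cell above.

```python
import shutil
import urllib.request


def download(url, dest, timeout=60):
    """Stream the resource at url into the file dest and return dest."""
    # copyfileobj copies in chunks, so the archive is never held
    # in memory all at once.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest
```

Assuming the URL is still reachable, `download("http://www.ope.ed.gov/security/dataFiles/Crime2013EXCEL.zip", "file.zip")` would reproduce the curl call.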
Copy the zip file to ./data/raw and unzip it.
In [3]:
%%bash
cp file.zip ./data/raw/data.zip
cd ./data/raw
unzip -o data.zip
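The extraction step can be sketched with the standard-library `zipfile` module as an alternative to the unzip command. The helper name `extract_archive` is hypothetical; like `unzip -o`, `extractall` overwrites files that already exist in the target folder.

```python
import zipfile


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        # Existing files with the same names are overwritten.
        archive.extractall(dest_dir)
    return names
```

Here `extract_archive("./data/raw/data.zip", "./data/raw")` would correspond to the cell above, and the returned list shows which files the archive contained.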
In [1]:
%%bash
pwd
From this data source, we choose the on-campus crime data. The raw dataset covers over 10,000 postsecondary institutions, with information about different types of crime as well as about the institutions themselves, such as private/public status, gender ratio, and geographical location. We focus on the specific file about on-campus crime. The data provided by the government website is comprehensive, but it contains some minor typos that we need to fix when loading the data.
In [1]:
%load_ext rmagic
The raw data is in .xls format, so we use pd.read_excel to load it as a DataFrame.
In [2]:
import pandas as pd
import xlrd
In [3]:
ls ~/project2/stat133-project2/examples/data/raw/oncampuscrime101112.xls
In [4]:
crime_file = '/home/oski/project2/stat133-project2/examples/data/raw/oncampuscrime101112.xls'
sheet_name = 'oncampuscrime101112'
data = pd.read_excel(crime_file, sheet_name, index_col=None, na_values=['NA'])
Pass the DataFrame into R, where it will be saved as a .csv file to be cleaned.
In [5]:
%%R -i data
print(head(data))
This is the right dataset; its first few rows are shown above. Next, we check its dimensions and column names.
In [6]:
%%R
print(dim(data))
print(names(data))
Using the data frame we just created, we save the crime data as a csv file in the raw data directory:
In [11]:
%%R
write.csv(data, '/home/oski/project2/stat133-project2/examples/data/raw/oncampuscrime_to_be_cleaned.csv')
In [15]:
ls ~/project2/stat133-project2/examples/data/raw/oncampuscrime_to_be_cleaned.csv
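Beyond listing the file, a quick sanity check is to reopen the CSV and look at its header and row count. The sketch below uses only the standard library; the helper name `csv_summary` is hypothetical. Note that R's `write.csv` writes row names by default, so the first header field is expected to be an empty string.

```python
import csv
from pathlib import Path


def csv_summary(path):
    """Return (header, n_data_rows) for a CSV file with one header row."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(path)
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        # An empty file yields an empty header instead of raising.
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows
```

Comparing the returned row count and column names against the `dim(data)` and `names(data)` output from R gives a simple confirmation that the file was written completely.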