(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)
In this assignment, we will practice some basic skills on working with data. We will see how to read, access, clean, filter and plot a dataset.
The data that we will be using are freely available online, by the City of Boston. A brief description of the 311 service can be found here. You can download the whole dataset, which extends from 2011 until now (updated every day) from here. However, it contains over 555,000 records and, as a result, it's more than 245 MB in size. In the homework repository, you can find a small part of it that covers the months of January-February 2015, under the filename
During these two months, Boston experienced one of the highest levels of snowfalls. We will take this opportunity and see how this situation affected the 311 center. We will also be using data from NOAA National Climatic Data Center regarding the levels of snowfall for the dates that we are interested in. This dataset is also available in our repository for your convenience under the file
snow.csv. You can read its documentation here.
The first thing we want to do is load the data as Pandas DataFrames.
In [ ]:%matplotlib inline import pandas as pd import seaborn as sns from datetime import datetime from matplotlib import pyplot as plt
In [ ]:boston = pd.read_csv('cases_2015.csv')
We need to take a brief look at the data, so that we get an intuition about what we are dealing with. Your first task is to do just that: print some part of the dataset (5 pts)
In [ ]:# Print part of the dataset here
Another useful thing is to get a short summary of the columns and their data types. Make a call to the required function for this (5 pts)
In [ ]:# Print information about the columns of the dataset
As you can see, the
CLOSED_DT columns, all of which should be dates, have been misclassified as
object instead of datetime. Re-write the code that reads from a csv, in order to parse the above columns as datetimes. (5 pts)
In [ ]:# Read the date columns as datetime objects
Make sure that the column data types are exactly as expected:
In [ ]:# Print the data types of the columns again
Now that we read the data and took a quick look, we can start playing around with it. As you have already noticed, there is very little documentation on the dataset itself. In such occasions, we need to look at the dataset very extensively, to get a good feel of its structure. This part, together with what we will do next, is a significant part of the data cleaning process.
To practice our filtering skills, let's try to isolate the cases that were opened in February (5 pts)
In [ ]:# Write code that filters the dataframe and keeps only the cases that were opened in February
Can you do the same thing, but only for the cases that are still remaining open? (5 pts)
In [ ]:# Write a code that also filters for open cases
Great! Now, let's check all the different values of the column
CASE_TITLE together with their counts (5 pts)
In [ ]:# Write code that prints the different values of the column 'CASE_TITLE' # and the count for the occurencies of each value in the boston dataframe
During this process, you may encounter many strange things, that look like outliers. For example, you may notice that there are many values for
CASE_TITLE which only occur once. One such example is the category
"missing pb cover/printed for jim"
I really wonder what this means! Try retrieving and printing the whole row (5 pts)
In [ ]:# Print the row tht has CASE_TITLE "missing pb cover/printed for jim"
There's another category which seems interesting,
Heat - Excessive Insufficient. Write code that prints the first 10 rows of the dataset that have
CASE_TITLE which starts with "Heat" and see if you can understand what this stands for (15 pts)
In [ ]:
Now, let's plot (preferably with horizontal bars) the counts for all the
CASE_TITLEs that occur more than 250 times in the dataset. (5 pts)
In [ ]:# Plots the high-occuring categories for cases as a horizontal histogram.
As you can see, the most common categories are related to snow. Let us see which neighborhoods were mostly affected by snow problems. This information is saved under the
neighborhood column. Make a plot (similar to the one above, like a horizontal histogram) that for each neighborhood shows the number of snow-related calls made from there. (5 pts)
In [ ]:
By now, you should have noticed that, for each day, there are multiple records in our dataframe. Use the
groupby function to get, for each day, the counts for all the case categories. Plot the result for the top 5 categories as you see fit. (20 pts)
In [ ]:# Write your code here
Let's see how this trend compares to the snowfall. First, we need to read the data from
In [ ]:# Read the dataframe snow_data = pd.read_csv("snow.csv",parse_dates=)
According to the documentation, the column
SNOW has the values for the snowfall, in mm. However, the dataset we have has information from many different stations around Boston. Aggregate all their measurements and transform the result to centimeters. Finally, plot the result (10 pts)
In [ ]:# Write your code here
The last thing we will do is to join the snowfall dataframe with that of the top 5 categories of calls, per day. Don't forget to fill the NaN values with 0. You want to do an outer join (if there are calls for a day but no snowfall data, we want to create a 0 record). Save that to a dataframe called
calls_on_snow. The first 5 columns should be the counts for the categories, and the 6th will be the snowfall data. (10 pts)
In [ ]:
Now, you can show the connection of the two trends (the calls and the snow) in a plot. I provide some sample code, but you should also try your own versions.
In [ ]:fig, axes = plt.subplots(nrows=2,figsize=(12,5), sharex=True) calls_on_snow.ix[:,0:5].plot(ax=axes) calls_on_snow.ix[:,5].plot(ax=axes,label="snowfall (in cm)", color='gray') axes.set_title("The effect of the winter snowstorms on Boston's 311 Hotline for 2015") axes.set_ylabel("# Calls") axes.set_ylabel("Snowfall in cm") axes.set_xlabel("Date") axes.xaxis.grid(False) axes.xaxis.grid(False)
The result should look something like this:
In :# Code for setting the style of the notebook from IPython.core.display import HTML def css_styling(): styles = open("../../theme/custom.css", "r").read() return HTML(styles) css_styling()