(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

Analyze the data from the Mayor's 24-hour hotline in Boston using Pandas

Introduction

In this assignment, we will practice some basic skills on working with data. We will see how to read, access, clean, filter and plot a dataset.


Step 1. Downloading the data

The data that we will be using is made freely available online by the City of Boston. A brief description of the 311 service can be found here. You can download the whole dataset, which extends from 2011 until now (updated every day), from here. However, it contains over 555,000 records and, as a result, is more than 245 MB in size. In the homework repository, you can find a small part of it that covers the months of January-February 2015, under the filename cases_2015.csv.

During these two months, Boston experienced one of its heaviest snowfalls on record. We will take this opportunity to see how this situation affected the 311 center. We will also be using data from the NOAA National Climatic Data Center regarding the levels of snowfall for the dates that we are interested in. This dataset is also available in our repository for your convenience, under the file snow.csv. You can read its documentation here.


Step 2. Reading the data

The first thing we want to do is load the data as Pandas DataFrames.


In [ ]:
%matplotlib inline
import pandas as pd
import seaborn as sns
from datetime import datetime
from matplotlib import pyplot as plt

In [ ]:
boston = pd.read_csv('cases_2015.csv')

We need to take a brief look at the data, so that we get an intuition about what we are dealing with. Your first task is to do just that: print some part of the dataset (5 pts)


In [ ]:
# Print part of the dataset here
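As a hint (one possible approach, not the only acceptable answer), `DataFrame.head()` shows the first few rows. A minimal sketch on a toy frame that reuses the assignment's column names:

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame (column names follow the assignment)
boston = pd.DataFrame({
    "CASE_TITLE": ["Request for Snow Plowing", "Pothole Repair", "Request for Snow Plowing"],
    "OPEN_DT": ["2015-01-27 08:00:00", "2015-02-02 09:30:00", "2015-02-09 10:15:00"],
})

print(boston.head())     # first 5 rows (all 3 rows of this toy frame)
print(boston.sample(2))  # or look at a random sample instead
```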

Another useful thing is to get a short summary of the columns and their data types. Call the appropriate function for this (5 pts)


In [ ]:
# Print information about the columns of the dataset
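A sketch of what such a summary looks like, again on a toy frame; `info()` prints column names, non-null counts, and dtypes, while `dtypes` gives just the types:

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["Request for Snow Plowing"],
    "OPEN_DT": ["2015-01-27 08:00:00"],
})

boston.info()         # column names, non-null counts, and dtypes
print(boston.dtypes)  # just the dtypes; dates read as plain text show up as 'object'
```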

As you can see, the OPEN_DT, TARGET_DT, and CLOSED_DT columns, all of which should be dates, have been misclassified as object instead of datetime. Rewrite the code that reads the csv so that it parses the above columns as datetimes. (5 pts)


In [ ]:
# Read the date columns as datetime objects
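For reference, `pd.read_csv` accepts a `parse_dates` argument listing the columns to convert. A self-contained sketch, with an in-memory buffer standing in for the real cases_2015.csv:

```python
import io
import pandas as pd

# The StringIO buffer stands in for 'cases_2015.csv'; pass the file name instead
csv = io.StringIO(
    "CASE_TITLE,OPEN_DT,TARGET_DT,CLOSED_DT\n"
    "Request for Snow Plowing,2015-01-27 08:00,2015-01-29 08:00,2015-01-28 12:00\n"
)
boston = pd.read_csv(csv, parse_dates=["OPEN_DT", "TARGET_DT", "CLOSED_DT"])
print(boston.dtypes)  # the three date columns are now datetime64
```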

Make sure that the column data types are exactly as expected:


In [ ]:
# Print the data types of the columns again

Step 3. Filtering and plotting the data

Now that we have read the data and taken a quick look, we can start playing around with it. As you have already noticed, there is very little documentation on the dataset itself. On such occasions, we need to examine the dataset extensively to get a good feel for its structure. This part, together with what we will do next, is a significant part of the data cleaning process.

To practice our filtering skills, let's try to isolate the cases that were opened in February (5 pts)


In [ ]:
# Write code that filters the dataframe and keeps only the cases that were opened in February
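One way to build such a filter (a sketch on toy data): use the `.dt` accessor on a datetime column to get its month, and index the frame with the resulting boolean mask.

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["Request for Snow Plowing", "Pothole Repair", "Request for Snow Plowing"],
    "OPEN_DT": pd.to_datetime(["2015-01-27", "2015-02-02", "2015-02-09"]),
})

# Boolean mask on the month component of the datetime column
february = boston[boston["OPEN_DT"].dt.month == 2]
print(february)
```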

Can you do the same thing, but only for the cases that are still open? (5 pts)


In [ ]:
# Write code that also filters for open cases
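A possible sketch, under the assumption (which you should verify against the real data) that a still-open case is one whose CLOSED_DT is missing:

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "OPEN_DT": pd.to_datetime(["2015-01-27", "2015-02-02", "2015-02-09"]),
    "CLOSED_DT": pd.to_datetime(["2015-01-28", None, None]),
})

# Assumption: a case is still open when CLOSED_DT is missing (NaT)
still_open_feb = boston[(boston["OPEN_DT"].dt.month == 2) & boston["CLOSED_DT"].isna()]
print(still_open_feb)
```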

Great! Now, let's check all the different values of the column CASE_TITLE together with their counts (5 pts)


In [ ]:
# Write code that prints the different values of the column 'CASE_TITLE'
# and the count of occurrences of each value in the boston dataframe
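`Series.value_counts()` does exactly this in one call; a minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["Request for Snow Plowing", "Pothole Repair", "Request for Snow Plowing"],
})

# Distinct values of CASE_TITLE with their counts, sorted by count descending
counts = boston["CASE_TITLE"].value_counts()
print(counts)
```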

During this process, you may encounter many strange things that look like outliers. For example, you may notice that there are many values of CASE_TITLE which occur only once. One such example is the category

"missing pb cover/printed for jim"

I really wonder what this means! Try retrieving and printing the whole row (5 pts)


In [ ]:
# Print the row that has CASE_TITLE "missing pb cover/printed for jim"
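Retrieving rows by an exact column value is another boolean-mask filter; a sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["missing pb cover/printed for jim", "Pothole Repair"],
    "OPEN_DT": pd.to_datetime(["2015-01-27", "2015-02-02"]),
})

row = boston[boston["CASE_TITLE"] == "missing pb cover/printed for jim"]
print(row)
```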

There's another category which seems interesting, Heat - Excessive Insufficient. Write code that prints the first 10 rows of the dataset whose CASE_TITLE starts with "Heat", and see if you can understand what this stands for (15 pts)


In [ ]:
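Prefix matching on a string column is what the `.str` accessor is for; a sketch on toy data (the toy titles are made up, not taken from the real file):

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["Heat - Excessive Insufficient", "Pothole Repair",
                   "Heat - Excessive Insufficient"],
})

# Rows whose title starts with "Heat"; head(10) caps the output at 10 rows
heat = boston[boston["CASE_TITLE"].str.startswith("Heat")].head(10)
print(heat)
```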

Now, let's plot (preferably with horizontal bars) the counts for all the CASE_TITLEs that occur more than 250 times in the dataset. (5 pts)


In [ ]:
# Plot the frequently occurring case categories as a horizontal bar chart
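One way to combine the pieces (a sketch on toy data with a lower threshold; use `> 250` on the real dataset):

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["Snow Plowing"] * 4 + ["Pothole Repair"] * 3 + ["Rodent Sighting"],
})

counts = boston["CASE_TITLE"].value_counts()
frequent = counts[counts > 2]              # use counts > 250 on the real dataset
frequent.sort_values().plot(kind="barh")   # horizontal bars, largest on top
```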

As you can see, the most common categories are related to snow. Let us see which neighborhoods were most affected by snow problems. This information is saved under the neighborhood column. Make a plot (similar to the one above, as a horizontal bar chart) that shows, for each neighborhood, the number of snow-related calls made from there. (5 pts)


In [ ]:
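A possible sketch on toy data, under the assumption that "snow-related" means the case title mentions snow (you may prefer a different definition based on what you saw in the counts):

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["Request for Snow Plowing", "Snow Removal", "Pothole Repair"],
    "neighborhood": ["Dorchester", "Roxbury", "Dorchester"],
})

# Assumption: snow-related calls are those whose title contains "snow"
snow_calls = boston[boston["CASE_TITLE"].str.contains("snow", case=False)]
by_neighborhood = snow_calls["neighborhood"].value_counts()
by_neighborhood.sort_values().plot(kind="barh")
```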

Step 4. Advanced filtering & joins

By now, you should have noticed that, for each day, there are multiple records in our dataframe. Use the groupby function to get, for each day, the counts for all the case categories. Plot the result for the top 5 categories as you see fit. (20 pts)


In [ ]:
# Write your code here
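One possible shape for this computation (a sketch on toy data): group by the date part of OPEN_DT and by CASE_TITLE, count, and `unstack` the categories into columns.

```python
import pandas as pd

# Toy stand-in for the real boston DataFrame
boston = pd.DataFrame({
    "CASE_TITLE": ["Snow Plowing", "Snow Plowing", "Pothole Repair", "Snow Plowing"],
    "OPEN_DT": pd.to_datetime(["2015-02-02", "2015-02-02", "2015-02-02", "2015-02-03"]),
})

# One row per day, one column per category, each cell = number of cases
daily = (boston.groupby([boston["OPEN_DT"].dt.date, "CASE_TITLE"])
               .size()
               .unstack(fill_value=0))

# Keep the 5 most common categories overall and plot them over time
top5 = boston["CASE_TITLE"].value_counts().head(5).index
daily[top5].plot()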

Let's see how this trend compares to the snowfall. First, we need to read the data from snow.csv


In [ ]:
# Read the dataframe
snow_data = pd.read_csv("snow.csv", parse_dates=[1])

According to the documentation, the column SNOW holds the snowfall values in mm. However, our dataset has information from many different stations around Boston. Aggregate all their measurements and convert the result to centimeters. Finally, plot the result (10 pts)


In [ ]:
# Write your code here
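A sketch on toy data, taking the mean over stations as the per-day aggregate (one of several reasonable choices) and dividing by 10 for mm → cm:

```python
import pandas as pd

# Toy stand-in for snow_data: SNOW is snowfall in mm, one row per station per day
snow_data = pd.DataFrame({
    "STATION": ["A", "B", "A"],
    "DATE": pd.to_datetime(["2015-02-02", "2015-02-02", "2015-02-03"]),
    "SNOW": [100.0, 120.0, 50.0],
})

# Average the stations for each day, then convert mm to cm
daily_snow_cm = snow_data.groupby("DATE")["SNOW"].mean() / 10.0
daily_snow_cm.plot()
```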

The last thing we will do is to join the snowfall dataframe with that of the top 5 categories of calls, per day. Don't forget to fill the NaN values with 0. You want to do an outer join (if there are calls for a day but no snowfall data, we want to create a 0 record). Save that to a dataframe called calls_on_snow. The first 5 columns should be the counts for the categories, and the 6th will be the snowfall data. (10 pts)


In [ ]:
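The join itself can look like this (a sketch with toy inputs of the right shape; the real `calls_on_snow` is built from your own daily counts and snowfall series):

```python
import pandas as pd

dates = pd.to_datetime(["2015-02-02", "2015-02-03"])
# Toy daily counts for the top categories (index = day); the real frame has 5 columns
daily_counts = pd.DataFrame({"Snow Plowing": [2, 1], "Pothole Repair": [1, 0]},
                            index=dates)
# Toy daily snowfall in cm, with one day that has no call data
snowfall = pd.Series([11.0, 5.0, 3.0],
                     index=pd.to_datetime(["2015-02-02", "2015-02-03", "2015-02-04"]),
                     name="SNOW")

# Outer join keeps every day from either side; missing values become 0
calls_on_snow = daily_counts.join(snowfall, how="outer").fillna(0)
print(calls_on_snow)
```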

Now, you can show the connection of the two trends (the calls and the snow) in a plot. I provide some sample code, but you should also try your own versions.


In [ ]:
fig, axes = plt.subplots(nrows=2,figsize=(12,5), sharex=True)

calls_on_snow.iloc[:, 0:5].plot(ax=axes[0])
calls_on_snow.iloc[:, 5].plot(ax=axes[1], label="snowfall (in cm)", color='gray')

axes[0].set_title("The effect of the winter snowstorms on Boston's 311 Hotline for 2015")
axes[0].set_ylabel("# Calls")
axes[1].set_ylabel("Snowfall in cm")
axes[1].set_xlabel("Date")
axes[0].xaxis.grid(False)
axes[1].xaxis.grid(False)

The result should look something like this:


In [1]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[1]: