In [9]:
from __future__ import division

import itertools
from datetime import date, datetime, timedelta

import pandas as pd
import numpy as np
from dateutil.rrule import rrule, DAILY
from dateutil import tz
import geoplotlib as glp
from geoplotlib.utils import BoundingBox, DataAccessObject
from sklearn.ensemble import RandomForestClassifier
from sklearn import cluster
from sklearn.cross_validation import cross_val_score

pd.set_option('display.max_columns', None)
%matplotlib inline

NYC Traffic and Weather - Explainer Notebook

This notebook is an 'explainer' for the visualization site found here.

For more details, all the notebooks created for this project can be found here.

Motivation

What is your dataset?

Our dataset is NYPD Motor Vehicle Collisions. It contains records of every reported incident in the NYC area, available from July 2012 until today. Specifically, it records where the collision took place, the cause of the collision, injuries, fatalities and more.

The other dataset used was an extraction of weather conditions for the NYC area. This was pulled from Weather Underground using their csv service for hourly updates. For each hour, there are measurements of temperature, visibility and wind speed, as well as the overall weather conditions such as Rain, Snow, Clear etc.

Why did you choose this/these particular dataset(s)?

Choosing the collisions dataset was mainly out of an interest in finding out where, how and why collisions happen.

The weather dataset was an afterthought, something we mainly wanted to investigate after having looked at the collisions and the road conditions / traffic infrastructure.

What was your goal for the end user's experience?

The main goal is to provide the user/reader with knowledge about how the weather affects the roads and, in that sense, the collision rate. We would assume that more collisions happen in bad conditions, but can we quantify that? The user should also end up with an idea of where incidents happen the most, perhaps giving the government of NYC an idea of where improvements are needed.

Basic Stats. Let's understand the dataset better

Write about your choices in data cleaning and preprocessing

For the collisions dataset, the cleaning was mainly done for the columns that we knew we wanted to investigate or otherwise thought were important for the data exploration. Notably, we wanted to make sure that we had location data for all the rows we investigated.

The weather information was not directly available to us and required a lot of HTTP requests to Weather Underground's servers. Below is how we did it:


In [ ]:
# ============================================
#        Downloading Weather Data (KJFK)
# ============================================

# Getting weather data from wunderground
start_date = date(2012, 7, 1)
end_date = date(2016, 2, 29)

# csv container for all daily weather information
frames = []

# url template for http requests. %s/%s/%s represent year/month/day
url_template = ('https://www.wunderground.com/history/airport/KJFK/%s/%s/%s/DailyHistory.html'
                '?req_city=New+York&req_state=NY&req_statename=New+York&reqdb.zip=10001'
                '&reqdb.magic=4&reqdb.wmo=99999&format=1.csv')
month = ""

# Query wunderground for daily weather information (returned as csv)
for dt in rrule(DAILY, dtstart=start_date, until=end_date):
    if (month != dt.strftime("%m")):
        month = dt.strftime("%m")
        print 'Downloading to memory: ' + dt.strftime("%Y-%m")
    # download and append csv file to csv container
    frames.append(pd.read_csv(url_template % (dt.strftime("%Y"),dt.strftime("%m"), dt.strftime("%d"))))

# Combine all csv's to one big and save it
print "Saving data to csv..."
data = pd.concat(frames)
data.to_csv('weather_data_nyc_kjfk.csv', sep=',')

We now have two datasets: collisions and weather. However, to avoid having to look up weather information in a secondary dataset, we merged the two datasets together. For the most part we had weather data for each hour for all the rows we wanted to investigate, with only a combined gap of a couple of days. In addition, some hours had more than one row of weather information. We ignored these factors as we deemed them insignificant to the overall result.

In order to join the datasets, they both needed some columns in common, namely the date and time (hour). The weather dataset already had a datetime column in UTC, so we converted it to NYC local time and added the columns Year, Month, Day, and Hour:


In [ ]:
# ============================================
#         Cleaning the Weather Data
# ============================================

# Read downloaded dataset
weather = pd.read_csv('datasets/weather_data_nyc_kjfk.csv')

# Convert each row's UTC timestamp to NYC local time and extract Year/Month/Day/Hour
def UTCtoActual(utcDate):
    from_zone = tz.gettz('UTC')
    to_zone = tz.gettz('America/New_York')
    
    utc = datetime.strptime(utcDate.DateUTC, '%Y-%m-%d %H:%M:%S')\
                  .replace(tzinfo=from_zone)\
                  .astimezone(to_zone)
    # Name the index so the columns align when assigning back to the dataframe
    return pd.Series([utc.year, utc.month, utc.day, utc.hour],
                     index=['Year', 'Month', 'Day', 'Hour'])

# Apply the above function to every row in the weather dataset and save the file.
weather[['Year', 'Month', 'Day', 'Hour']] = weather.apply(UTCtoActual, axis=1)
weather.to_csv('datasets/weather_data_nyc_kjfk_clean.csv')

With both datasets now having a common ground for joining, the merge can be done:


In [ ]:
# ============================================
#              Merging Datasets
#        NYPD Motor Vehicle Collisions
#                    and 
#      Weather Underground Extract (KJFK)
# ============================================

# Read the datasets to be merged
incidents = pd.read_csv('datasets/NYPD_Motor_Vehicle_Collisions.csv')
weather = pd.read_csv('datasets/weather_data_nyc_kjfk_clean2.csv')

# Features from the dataset to merge in
features = ['Conditions', 'Precipitationmm', \
            'TemperatureC', 'VisibilityKm']

# Looks up weather data on the date and hour requested
def lookup_weather2(year, month, day, hour):
    w = weather[(weather.Year == year) & (weather.Month == month) & (weather.Day == day) & (weather.Hour == hour)]
    return w

# Looks up weather data, if nothing is found on the hour, it either looks an hour ahead or back.
# Otherwise returns empty.
def lookup_weather(date, time):
    month = int(date.split('/')[0])
    day = int(date.split('/')[1])
    year = int(date.split('/')[2])
    hour = int(time.split(':')[0])
    d = lookup_weather2(year, month, day, hour).head(1)
    if (d.empty):
        dt_back = datetime(year, month, day, hour) - timedelta(hours=1)
        dt_forward = datetime(year, month, day, hour) + timedelta(hours=1)
        
        d_back = lookup_weather2(dt_back.year, dt_back.month, dt_back.day, dt_back.hour)
        if (not d_back.empty): return d_back
        
        d_forward = lookup_weather2(dt_forward.year, dt_forward.month, dt_forward.day, dt_forward.hour)
        if (not d_forward.empty): return d_forward
    return d

# Merges the datasets
def merge_weather(incident):
    date = incident.DATE
    time = incident.TIME

    w = lookup_weather(date, time)

    try:
        # Default values
        con = "-"
        temp = "-"
        rainmm = "-"
        viskm = "-"

        # If weather data is different from null
        if (not pd.isnull(w['Conditions'].iloc[0])):
            con = w['Conditions'].iloc[0]
        if (not pd.isnull(w['TemperatureC'].iloc[0])):
            temp = w['TemperatureC'].iloc[0]
        if (not pd.isnull(w['Precipitationmm'].iloc[0])):
            rainmm = w['Precipitationmm'].iloc[0]
        if (not pd.isnull(w['VisibilityKm'].iloc[0])):
            viskm = w['VisibilityKm'].iloc[0]
            
        s = pd.Series([con, rainmm, temp, viskm], index=features)
        return s
    except:
        print date + "x" + time
        s = pd.Series([None, None, None, None], index=features)
        return s
    
print "Applying weather data to incidents..."
incidents[features] = incidents[incidents.DATE.str.split('/').str.get(2) != '2016'].apply(merge_weather, axis=1)
print "Saving weather in-riched incident data..."
incidents.to_csv('datasets/NYPD_Motor_Vehicle_Collisions_Weather_FINAL.csv', sep=',')

This is how we preprocessed the data and gathered all the information required for our analysis.

Write a short section that discusses the dataset stats (here you can recycle the work you did for Project Assignment A)

Our primary focus is on the dataset addressing NYC Motor Vehicle Collisions and the merged dataset, consisting of incidents along with their weather conditions.

NYPD Motor Vehicle Collisions

Source

The dataset is contained within a csv file of 149 MB, with 29 features and 769,054 records.

We started by finding answers to our burning questions and computing some basic statistics on our data. One question we wanted to address was how many collisions happen within each borough. As an example, the chart below shows the number of incidents recorded within each borough over the years.

Investigating the dataset further, we made use of the geolocation data to get a more precise geographical picture of where incidents happen than just grouping by borough. We presented the locations of incidents in a heatmap using geoplotlib.kde, which maps and shows the density of incidents at their respective geographical locations.
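
A minimal sketch of how such a heatmap can be produced with geoplotlib; the dataframe name and the NYC bounding box values below are our own assumptions for illustration:


In [ ]:
# Density heatmap of collision locations with geoplotlib's KDE layer.
# Assumes the collisions csv is loaded as 'incidents' with LATITUDE/LONGITUDE columns.
located = incidents[pd.notnull(incidents.LATITUDE) & pd.notnull(incidents.LONGITUDE)]
geo_data = DataAccessObject({'lat': located.LATITUDE.values,
                             'lon': located.LONGITUDE.values})

glp.kde(geo_data, bw=5, cut_below=1e-4)
# Approximate bounding box for the NYC area
glp.set_bbox(BoundingBox(north=40.92, west=-74.26, south=40.49, east=-73.69))
glp.inline()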

Weather Underground

Source; the processed dataset can be found here.

The weather dataset is contained within a csv file of 3.9 MB, with 22 features and 36,002 rows.

Now that we had done some modelling on the collision data, we wanted to know the causes behind these collisions. There are definitely many contributing factors recorded for each incident, but we thought there could be other factors lying beneath them, which is why we chose a common external factor: the weather.

We retrieved a set of weather data from the same time period as the motor vehicle collisions dataset and merged them by time. The data contains 30 different weather conditions.



Motor Vehicle Collisions and Weather data

After combining the two datasets, the csv file size is 175.1 MB, with 34 features and 769,054 rows. The merged dataset can be found here.

With the merged dataset, we now have the weather conditions for each recorded collision, and we can compute the frequency of collisions per hour for a specific weather condition. To normalize the collision frequency, we take the ratio of collisions for each weather condition relative to Mostly Cloudy, as this is the condition with the most recorded incidents. The collision frequencies generated in Python are shown below; a beautified visualisation can be seen on the website.
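
A rough sketch of the normalization, assuming the merged and cleaned files produced above and their column names:


In [ ]:
# Collisions per recorded weather hour for each condition, normalized
# against the 'Mostly Cloudy' baseline.
merged = pd.read_csv('datasets/NYPD_Motor_Vehicle_Collisions_Weather_FINAL.csv')
weather = pd.read_csv('datasets/weather_data_nyc_kjfk_clean.csv')

# How many hourly weather rows each condition has, and how many collisions fall under it
hours_per_condition = weather.groupby('Conditions').size()
collisions_per_condition = merged.groupby('Conditions').size()

# Collisions per hour, relative to the most common condition
freq = (collisions_per_condition / hours_per_condition).dropna()
normalized = freq / freq['Mostly Cloudy']
print(normalized.sort_values(ascending=False))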

The individual notebook for this visualization can be found here.

Theory. Which theoretical tools did you use?

Describe which machine learning tools you use and why the tools you've chosen are right for the problem you're solving.

Our motivation for investigating the collisions dataset was mainly to see if we would be able to find out where, how and why collisions happen in NYC. Could we predict where accidents are likely to happen? In class we discussed two classification methods that could be used to predict where the next crime is likely to happen: KNN and Decision Trees.

In the end we chose to continue with the decision tree approach for classification. This was mainly because the features in our dataset, specifically the weather features we ended up working with, could easily be categorized. In addition, for us at least, it made it easy to prevent overfitting and do cross-validation (see below).

In addition we used K-Means to cluster all incidents by location, and those clusters were used to power the decision tree classifier. In other words, we used K-Means to categorize the location data (latitude and longitude) into clusters. More on this later.

Apart from the K-Means clustering that we learned about in class, we also used a different clustering method called DBSCAN. This method creates clusters based on distance and density, and it was used to cluster nearby intersections that were highly sensitive to specific weather conditions.

Talk about your model selection. How did you split the data in to test/training. Did you use cross validation?

With around 650,000 rows to be split between test and training, we were not really concerned with too high variance. The split, as documented below, was 80/20.

From the get-go we knew that predicting the exact intersection where a collision is going to take place is infeasible, to say the least. Doing some initial exploration with a trained decision tree targeting a (longitude, latitude) tuple, we found a prediction score of 4-5%; a sketch of that experiment follows below.
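
A hypothetical sketch of that early experiment (the exact features and encoding we used differ across notebooks); every (longitude, latitude) pair is treated as one class label:


In [ ]:
# A single decision tree trying to hit an exact (longitude, latitude) pair.
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('datasets/NYPD_Motor_Vehicle_Collisions_weather4.csv')
df = df[pd.notnull(df['LOCATION'])]
df['HOUR'] = df.TIME.str.split(':').str.get(0).astype(int)
df['COND'] = df.Conditions.astype('category').cat.codes
# Every unique coordinate pair becomes its own class
df['LOC'] = df.LONGITUDE.astype(str) + ',' + df.LATITUDE.astype(str)

split = int(len(df) * 0.8)
X, y = df[['COND', 'HOUR']], df['LOC']
tree = DecisionTreeClassifier().fit(X[:split], y[:split])
print("LOCATION PREDICTION SCORE: %f" % tree.score(X[split:], y[split:]))  # ~0.04-0.05 for us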

If we cannot look at individual intersections with great accuracy, we could look at zip codes instead. We could ask the question: in which postal area should we put an ambulance, given the weather and hour of day? Still, the accuracy is low. One way to improve it would be to add additional non-correlated features. Using K-Means to create location-based clusters from longitude and latitude, how does that improve the model?

As mentioned, instead of using sklearn's DecisionTreeClassifier we ended up using the RandomForestClassifier. We start out by training a random forest of 50 trees, which should help prevent overfitting. In addition, we use cross-validation to test the performance of the model. See below.

In summary, we train a random forest of 50 trees, targeting/predicting/classifying ZIP CODE using the features KMEANS (location-based clusters), HOUR and Conditions.


In [21]:
# ============================================
#           Train Random Forest
#
# 1. Preprocess dataset for training
# 2. Add Kmean clustering to dataset
# 3. Split dataset into training and test
# 4. Train forest with dataset
# ============================================

# Encodes the string values in the dataframe to integers. (Classifier does not support string values)
def encode_column(df, target_column):
    df_mod = df.copy()
    targets = pd.Series(df_mod[target_column].unique())
    map_to_int = {name: n for n, name in enumerate(targets)}
    df_mod[target_column+"_encoded"] = df_mod[target_column].replace(map_to_int)
    return (df_mod, targets)

# Trains the forest (50 trees) on a dataset, with a target and features
def train_tree(target, features, dataset):
    clf = RandomForestClassifier(n_estimators=50, n_jobs=2)
    X = np.array(dataset[features])
    Y = np.array(list(itertools.chain(*dataset[[target]].values)))
    return clf.fit(X, Y)

# Calculates k-means cluster assignments (labels) and cluster centroids
def kmeans(k, dataset, columns):
    md = cluster.KMeans(n_clusters=k).fit(dataset[columns])
    return md.predict(dataset[columns]), md.cluster_centers_

# Load the merged dataset, filter rows without location and separate the hour from the time
data = pd.read_csv('datasets/NYPD_Motor_Vehicle_Collisions_weather4.csv')
data = data[pd.notnull(data['LOCATION'])]
data['HOUR'] = data.TIME.str.split(':').str.get(0)

# Label that is target for prediction, filter dataset for target label
target_label = 'ZIP CODE'
data = data[pd.notnull(data[target_label])]

# Encode target/label column
mdata, target = encode_column(data, target_label)

# Encode the feature columns
mdata, _ = encode_column(mdata, 'Conditions')

# Kmean location clusters (30 clusters calculated)
s, cen = kmeans(30, mdata, ['LONGITUDE', 'LATITUDE'])
mdata['KMEANS'] = s
print "KMEANS LOCATION CLUSTERS CALCULATED"

# Features for prediction
features = ['KMEANS','Conditions_encoded','HOUR']

# Split data set into training and test data
train_data = mdata.head(int(len(mdata.index) * 0.80))
test_data = mdata.tail(int(len(mdata.index) * 0.20))

target_label_encoded = target_label+'_encoded'

# Train the random forest
clf = train_tree(target_label_encoded, features, train_data)
print "TRAINED FOREST WITH %d SAMPLES" % len(train_data)


KMEANS LOCATION CLUSTERS CALCULATED
TRAINED FOREST WITH 467746 SAMPLES

For DBSCAN, we clustered with a distance of around 100 meters and with different amounts of noise filtering, based on the number of collisions for the given condition. The code can be viewed here; a rough sketch is shown below.
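
A minimal sketch of that kind of clustering with scikit-learn; the parameters here are illustrative (the linked notebook has the actual ones), and the merged dataset is assumed to be loaded as 'data' as in the training cell:


In [ ]:
# DBSCAN over the collision coordinates for a single weather condition,
# using the haversine metric so eps can be expressed in meters.
snow = data[data.Conditions == 'Snow']
coords = np.radians(snow[['LATITUDE', 'LONGITUDE']].values)

earth_radius_m = 6371000.0
db = cluster.DBSCAN(eps=100.0 / earth_radius_m,   # roughly 100 meters
                    min_samples=10,               # noise filtering threshold (illustrative)
                    metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(coords)                   # -1 marks points treated as noise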

Explain the model performance. How did you measure it? Are your results what you expected?

The model performance was measured by letting the classifier predict on the test data. Each row already contains the correct ZIP CODE, so by letting the forest predict the ZIP CODE and comparing it with the actual value we can check whether the prediction is correct. Doing that over the entire test set gives the accuracy of the model as a percentage.


In [20]:
# ============================================
#           Testing Random Forest
# ============================================

# Test the forest
def test_forest(clf, test_data, features):
    return clf.predict(test_data[features])

# Tests test_data on the forest classifier clf with features and target_label.
# encoded_map is a lookup table of the encoded values (numeric) to actual string values
def test_prediction(target_label, clf, test_data, features, encoded_map):
    corrects = 0
    predictions = test_forest(clf, test_data, features)
    for i in range(0, len(predictions)):
        if predictions[i] == test_data.iloc[i][target_label]:
            corrects += 1
    print "CLASSIFIED %d SAMPLES CORRECTLY" % corrects
    # Return model accuracy [%]
    return corrects / len(predictions)

# Test the decision tree
print "PREDICTION SCORE: %f" % test_prediction(target_label_encoded, clf, test_data, features, target)


CLASSIFIED 27735 SAMPLES CORRECTLY
PREDICTION SCORE: 0.237181

The score shows a prediction accuracy of ~23.7%. Given the input features, this is a result we could not have guessed beforehand. It means that it is actually possible, to a useful degree, to predict the zip code in which a collision will occur based on the hour of day and the weather condition, improved by the location-based clustering. To verify the result (even though we chose the random forest partly for its robustness against overfitting), we cross-validated the model.

The cross-validation scores are as follows:


In [19]:
# ============================================
#         Cross-validating the Model
# ============================================
scores = cross_val_score(clf, mdata[features].values, mdata[target_label_encoded].values, cv=5)
print("CROSS-VALIDATION SCORE: %0.6f (+/- %0.6f)" % (scores.mean(), scores.std() * 2))


CROSS-VALIDATION SCORE: 0.238919 (+/- 0.000835)

The cross-validation score seems to confirm the accuracy of the model. With 5 folds, it matches the accuracy of the random forest classifier we set up previously.

Lastly, the visualization of the decision tree can be found here. Note that this is for a single tree in the forest.

And the K-means location cluster visualization can be found here.

Visualizations

Explain the visualizations you've chosen.

Why are they right for the story you want to tell?

The visualizations can be found on the site here. They are ordered below as they appear on the site.

Interactive Intersection Map

We wanted to give the reader a feel for the scope of the data. Therefore we chose this as the first visualization, as it gives a good idea of where accidents happen, mainly at intersections, and it allows users to explore on their own and get a feel for their neighbourhood. For this, we did not think a d3 visualization was sufficient, so we instead used an actual map service built with a maps API and plotted every collision on it. That meant users could zoom, drag and click on each point to get additional information.

Top 10 Intersections by Collisions

This visualization was mainly a continuation of the map, which could be a little overwhelming, so we boiled it down to the 10 intersections where the most collisions occur. This again helped set up the story we wanted to tell.

Frequency of Collisions in Weather Conditions

Having completed the introductory part of the story/analysis, we switch to the main part: collisions in certain weather conditions. To start off, we wanted to find out whether certain conditions produce more collisions per hour than others. In some ways the result confirmed our hypotheses and in others it left us wondering; the why is explained on the site. Looking back, there are many ways we could have chosen to visualize this. We do not think we could have avoided a bar chart, but something like this could be a variation that better shows the relative frequency from the baseline.

Distribution of Weather Conditions for the Cause: Pavement Slippery

At this point in the story we look at weather-induced causes for collisions, in particular Pavement Slippery. Again, this was a hypothesis, so we wanted to show whether it was actually the case.

Top 10 Intersections with Cause: Pavement Slippery

Before looking into the predictive models, we wanted to again show which intersections are the most dangerous when only looking at Pavement Slippery induced collisions. Could we possibly find out why these accidents occur? To do so we plotted the top 10 unique locations where this type of collision occurs the most, and found that intersections with curved/downhill roads leading into them are the worst. This helped us draw a conclusion about why these types of collisions occur and which places a reader should pay extra attention to.

Weather-Based Cluster Dendrogram

After having done the top 10 intersection analysis, we realized that some intersections are located very close to each other. Therefore, we used DBSCAN to find clusters of intersections with a high concentration of collisions for a given weather condition. To visualize these clusters, we created a cluster dendrogram. A dendrogram was a great way for us to show the clusters for the 4 picked conditions, as we were able to show all clusters and intersections at the same time, as well as the relation between the clusters for each condition through the size and color of the leaves. Additionally, we gave the reader the option to switch between a circular and a linear layout. Ultimately this approach gave us a different way of finding weather-sensitive locations in NYC.

Location-Based Clusters

As we were not able to find a good way to visualize the random forest, we compromised and made a visualization of the K-Means clustering instead. It still helps visualize how we use location clusters and how they affect the random forest prediction score.

Discussion. Think critically about your creation

What went well?

Generally we think that the site development and the visualizations we chose went well, although we could have made a few variations in the bar charts shown, namely the Frequency of Collisions in Weather Conditions visualization as already mentioned. We believe that the final page presents our discoveries nicely, with beautiful animations and a variety of interactive visualizations.

What is still missing? What could be improved? Why?

A decision tree representation is missing. We think it would be cool to have a visual representation along the lines of what is seen on this page. However, we decided not to look too much into it as it seemed to be a very complex visualization.

We think that we could improve on the predictive model selection. The use of K-Means clusters to help the random forest training is, in hindsight, maybe not so great an idea. Doing some very late validation and actually getting it visualized, we found that a higher cluster count could give us better prediction accuracy, although with diminishing returns. This can likely be attributed to the fact that the decision tree 'learns' which cluster a zip code belongs to. A sketch of such a sweep is shown below.
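
A sketch of how such a sweep over the cluster count could look, reusing the helper functions from the training and testing cells above; the cluster counts here are illustrative:


In [ ]:
# Re-run the clustering/training/testing pipeline for increasing k
# to see how the prediction accuracy behaves.
for k in [10, 30, 60, 100]:
    labels, _ = kmeans(k, mdata, ['LONGITUDE', 'LATITUDE'])
    mdata['KMEANS'] = labels
    train = mdata.head(int(len(mdata.index) * 0.80))
    test = mdata.tail(int(len(mdata.index) * 0.20))
    clf_k = train_tree(target_label_encoded, features, train)
    score = test_prediction(target_label_encoded, clf_k, test, features, target)
    print("k=%d PREDICTION SCORE: %f" % (k, score))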

Lastly, much time was spent on the initial data analysis / exploration, because we did not have a clear direction or end goal for the project. However, it was possible to find a story about how weather affects the traffic in NYC. Having this direction from the beginning would have allowed us to do a lot more with the story.


In [ ]: