In [10]:
from IPython.display import display
from IPython.display import Image
IPython formatting guidance: http://nbviewer.ipython.org/github/NelisW/ComputationalRadiometry/blob/master/01-IPythonHintsAndTips.ipynb
For those viewing this notebook as a webpage, rendering is provided by nbviewer: http://nbviewer.ipython.org/
Markdown formatting cheat sheet: http://assemble.io/docs/Cheatsheet-Markdown.html
In [11]:
picUrl = 'http://software.quest-global.com/wp-content/uploads/2014/11/connectedcar-logo.png'
Embed = Image(picUrl)
display(Embed)
In this project, we tackled the Kaggle competition sponsored by AXA, a multinational insurance company, to classify and predict automobile driver behavior from second-by-second positional GPS data. We performed the steps of a data science pipeline to ingest and wrangle data, calculate features, create classification and prediction models, and then visualize the results. Python was selected for this project because of the strength of its scikit-learn library for machine learning. Through research and analysis, we learned that an effective model must rely on both path matching and telematics; in other words, it is important to identify the maneuvers made and to describe how those maneuvers were performed. Since our GPS data was unlabeled, meaning that we did not have trips categorized as safe/unsafe, the classification task is an unsupervised learning problem, so the k-means clustering technique was chosen and run on 18 calculated features. Turns proved to be a key maneuver for distinguishing types of drivers, while average and maximum velocity and acceleration are useful telematics features for describing how maneuvers are performed. With the classes from the clustering algorithm, a prediction model was then created by performing a logistic regression on key velocity and acceleration features to predict which class a driver is likely to belong to based on a single trip.
Quantitative risk management has been a cornerstone of the auto insurance industry for many years. Traditionally, companies in the industry calculate risk-based premiums using driver history, demographics, geographic factors, and vehicle characteristics. Driver history includes events such as accidents, moving violations, and license suspensions. Demographic factors include age, gender, marital status, and occupation. Geographic factors include traffic and weather patterns, as well as crime rates such as vandalism and theft. Finally, vehicle characteristics include make, model, year, and mileage to derive a current value. Insurance companies combine these factors into an ensemble of models to obtain an expected loss, i.e. to predict the probability and the size of a future claim. While insurance providers clearly rely on a wealth of information to make pricing decisions, the data still has its limitations. For example, the driver history features listed above are relatively infrequent, negative events, and insurance companies typically only have access to three to five years of driver history data under state laws. Progressive, an industry leader in data-driven policy making, found that key driving behaviors, like actual miles driven, braking, and time of day, carry more than twice the predictive power of traditional insurance rating variables[1]. Furthermore, a 2012 Deloitte white paper highlighted the mutually beneficial motivations behind this push for driving data:
"...the advantages accrue to virtually all sides. Insurance companies benefit from matching premiums more closely >with actual risk. Drivers benefit from the opportunity to lower their insurance rates by driving less or more >safely..."
Knowing that more data can often beat better models, insurance companies are now looking to incorporate daily driving behavior into their premium pricing models to gain a competitive advantage. You may be familiar with Progressive’s Snapshot commercials, in which customers choose to place a device in their vehicles that monitors the vehicle’s position and speed. The idea is to reward customers with good driving habits by offering them discounted premiums. With this information, Progressive can also identify customers with risky driving habits and then initiate a variety of mitigation options such as offering corrective guidance, increasing premiums, or even terminating coverage. With the prevalence of GPS devices in vehicles, either as standalone units or built into drivers’ cell phones, insurance companies now have access to real-time driver behavior, allowing more accurate predictions of claims and expected losses than ever before. Like Progressive, AXA Insurance is moving in the direction of capturing current driver data and has sponsored a Kaggle competition to obtain a top-performing classification and prediction model. The purpose of this paper is to outline the steps taken to create a model that classifies driver behavior based on telematic data provided by AXA. A prediction model is then described to help determine which class a driver most likely belongs to based on a single trip.
…auto insurance companies charged premiums based on accidents and moving violations.
…drivers were being penalized for these infrequent, negative events.
…a company decided to capture the daily driving habits of its customers in its premium pricing model.
…customers with good driving habits felt happier after being rewarded with lower premiums.
…the insurance company experienced less customer turnover, and more new customers applied for policies.
…the insurance company had captured greater market share.
AXA, the French multinational insurance and asset management firm, sponsored a Kaggle competition and provided a clean data file for each of 200 trips from more than 3,600 drivers. Each file contains a list of anonymized vehicle positions at every second of the trip (i.e. each row represents the vehicle’s 2-dimensional position after one second). To protect the identity and location of the drivers, longitude and latitude coordinates are not provided, so this data set cannot be matched against the coordinates of major roadways to determine speed limits, road signs, and traffic lights.
In [12]:
driver = raw_input("Enter a number between 1-3612:\n>")
In [13]:
trip = raw_input("Enter a number between 1-200:\n>")
In [14]:
import pandas as pd
import os
%matplotlib inline
import matplotlib.pyplot as plt

# Build the path to the selected raw trip file (input/test/<driver>/<trip>.csv)
path = os.path.abspath(os.getcwd())
pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"input","test",str(driver),str(trip)+".csv"))

# Load the trip, preview the first 15 (x, y) positions, then plot the full path
df = pd.read_csv(pathtocsv)
print df[:15][['x','y']]
df.plot(kind = 'scatter', x = 'x', y = 'y', figsize=(18,10))
plt.show()
In [15]:
import math

# Convert a (dy, dx) displacement into a heading in degrees on [0, 360)
def getDirection(y, x):
    direction_rad = math.atan2(y, x)
    direction = math.degrees(direction_rad)
    if direction < 0:
        direction += 360
    return direction

# Bucket a heading in degrees into one of the eight cardinal/intercardinal directions
def getCardinalDirection(direction):
    carddir = ''
    if direction >= 0 and direction <= 22.5:
        carddir = 'East'
    elif direction > 337.5 and direction <= 360:
        carddir = 'East'
    elif direction > 22.5 and direction <= 67.5:
        carddir = 'Northeast'
    elif direction > 67.5 and direction <= 112.5:
        carddir = 'North'
    elif direction > 112.5 and direction <= 157.5:
        carddir = 'Northwest'
    elif direction > 157.5 and direction <= 202.5:
        carddir = 'West'
    elif direction > 202.5 and direction <= 247.5:
        carddir = 'Southwest'
    elif direction > 247.5 and direction <= 292.5:
        carddir = 'South'
    elif direction > 292.5 and direction <= 337.5:
        carddir = 'Southeast'
    return carddir

print "The cardinal direction is %r and the direction in degrees is %r" % (getCardinalDirection(getDirection(11.2,-4.2)), getDirection(11.2,-4.2))

# 3-second rolling sum of the positional data (pd.rolling_sum is the pre-0.18 pandas API)
val = pd.rolling_sum(df, window = 3)
print val[2:10]
Data Ingestion. The original AXA data set was downloaded and warehoused as a zipped set of CSV files from the competition website. The data was stored in as raw a form as possible: each driver (3,612 in total) had a directory containing two hundred (200) trips, and each trip file had an x and y column value for each reported second of the trip. Outside of standard browser download tools and local operating system directories, no special tools were used for the ingestion phase of the data science pipeline, although there was some experimentation with wget. Team SkidMarks initially planned to use Postgres as the Write Once Read Many (WORM) repository, but scheduling, learning curves, and feature calculations (discussed later) consumed the bulk of the project time, so local folders on teammate computers served as the ingestion landing zone.
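To make the landing-zone layout concrete, the short sketch below simply walks the input/test/<driver>/<trip>.csv structure referenced in the cells above and counts trip files per driver; the root-path construction mirrors those cells and is otherwise an assumption, not part of the team's ingestion scripts.

# Walk the ingestion landing zone (input/test/<driver>/<trip>.csv) and count trips per driver.
import os

root = os.path.normpath(os.path.join(os.path.dirname(os.path.abspath(os.getcwd())), "input", "test"))
for driver_id in sorted(os.listdir(root)):
    driver_dir = os.path.join(root, driver_id)
    if os.path.isdir(driver_dir):
        trips = [f for f in os.listdir(driver_dir) if f.endswith(".csv")]
        print("%s has %d trip files" % (driver_id, len(trips)))  # expect 200 per driver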
Data Munging and Wrangling. Using team-created Python scripts, the data was normalized and replicated (a concatenated primary key was added by creating trip_id and driver_id columns) to intuitively capture the relationships within the data. The os Python module was used for operating system interoperability, and the csv Python module was used to read and write the normalized database structure to another directory for computation and analysis.
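The team's normalization scripts are not reproduced in this notebook; the sketch below is a simplified, hypothetical version of that step using the csv module, prepending the concatenated key (driver_id, trip_id) to each row before writing the trip to the output directory. The function name and exact column order are assumptions.

import csv

def normalize_trip(in_path, out_path, driver_id, trip_id):
    # Read a raw trip file (x, y per second) and rewrite it with the
    # concatenated primary key (driver_id, trip_id) added to every row.
    with open(in_path) as raw, open(out_path, 'w') as normed:
        reader = csv.DictReader(raw)
        writer = csv.writer(normed)
        writer.writerow(['driver_id', 'trip_id', 'x', 'y'])
        for row in reader:
            writer.writerow([driver_id, trip_id, row['x'], row['y']])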
In [16]:
path = os.path.abspath(os.getcwd())
pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"output","test",str(driver), str(driver) + "_" +str(trip)+".csv"))
normed_df = pd.read_csv(pathtocsv)
print pathtocsv
print normed_df[:15][[0,1,2,3]]
Computation and Analyses. Naturally, this step consumed the bulk of the CAPSTONE project’s time and resources; feature creation was a coding-, memory-, and research-intensive process. The Python pandas library was used because it provides high-performance, easy-to-use data structures and data analysis tools. From simple x and y positional data, Team SkidMarks used pandas to compute over 18 telematic and trip-matching features:
Velocity: average, standard deviation, and maximum
Acceleration: average, standard deviation, and maximum
Distance: increment traveled per second, total distance traveled (sum of all increments), and total displacement
Heading: heading in degrees, heading as a cardinal direction, heading standard deviation, and maximum 1-second change in heading for the trip
Turns: count of maneuvers with a greater than 45-degree change in direction over a 3-second rolling window, and turns made at speeds over 30 mph (aggressive turns)
Braking: deceleration events lasting 3 seconds (braking), large deceleration events, maximum deceleration event, and consecutive zero acceleration/velocity/increment events (stops)
Time: total reported trip time
Base features were computed on a per-second basis for each trip (a sketch follows below). Next, aggregate trip values, such as the average velocity over the duration of the trip, were computed for all driver files in the data store.
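The sketch that follows illustrates how the per-second base features can be derived from the raw x and y columns with pandas. It assumes the coordinates are in meters sampled once per second (hence the 2.23694 m/s-to-mph factor); that unit assumption matches the feature names used later but is not confirmed by the competition data, and the team's production scripts may differ.

# A minimal sketch of the per-second base features, assuming x/y are meters sampled at 1 Hz.
import numpy as np
import pandas as pd

def per_second_features(trip_df):
    dx = trip_df['x'].diff()
    dy = trip_df['y'].diff()
    feats = pd.DataFrame(index=trip_df.index)
    feats['increment'] = np.sqrt(dx ** 2 + dy ** 2)               # meters traveled in this second
    feats['velocity_mph'] = feats['increment'] * 2.23694          # m/s -> mph (assumed units)
    feats['accel_mph_s'] = feats['velocity_mph'].diff()           # change in speed per second
    feats['heading_deg'] = np.degrees(np.arctan2(dy, dx)) % 360   # heading on [0, 360)
    return feats

# Trip-level aggregates then follow directly, e.g.:
# feats['velocity_mph'].mean(), feats['velocity_mph'].max(), feats['heading_deg'].std()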
Despite several value- and rule-based filters, Team SkidMarks did not completely remove outlier data. “Hyperspace jumps” remain in the data set: periods where GPS signal loss makes it appear that a vehicle covered a large distance in a very short time. However, the existing filters removed the bulk of the erroneous data.
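As one example of the value/rule-based filtering, the hypothetical rule below drops any second whose implied speed exceeds a hard ceiling; the 150 mph threshold and the feats frame from the previous sketch are assumptions, not the team's actual cutoffs.

# Hypothetical value-based filter for "hyperspace jumps": discard physically implausible seconds.
MAX_PLAUSIBLE_MPH = 150.0  # assumed ceiling, not the team's actual threshold

def drop_hyperspace_jumps(feats):
    return feats[feats['velocity_mph'].fillna(0) <= MAX_PLAUSIBLE_MPH]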
In [17]:
df2 = pd.read_csv(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv')))
print df2[:10]
In [18]:
df2.describe()
Out[18]:
In [19]:
path = os.path.abspath(os.getcwd())
pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"input","test","1","200" +".csv"))
df = pd.read_csv(pathtocsv)
df.plot(kind = 'scatter', x = 'x', y = 'y', figsize=(18,10))
plt.ylim(-2000,650)
plt.xlim(-4500,100)
plt.show()
Modeling and Application. Determining safe versus unsafe drivers is an unsupervised learning task. Clustering with K-Means provides the best machine learning option to explore the data set and to discover hidden structure or patterns in unlabeled training data. The goal of the cluster analysis is to group observed trips so that drivers with similar telematic and/or trip-matching features fall into the same cluster, i.e. members of a cluster are more similar to each other under a given metric than they are to members of other clusters. Team SkidMarks used the KMeans clustering module within Python’s scikit-learn machine learning library; the pandas, NumPy, and SciPy libraries were used alongside scikit-learn to ingest, preprocess, and load the data.
The silhouette coefficient was used as a performance measure for the clusters; it measures both the compactness of each cluster and the separation between clusters. The best value is 1 and the worst value is -1.
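Concretely, for a single sample the coefficient is s = (b − a) / max(a, b), where a is the mean distance from that sample to the other samples in its own cluster and b is the mean distance to the samples in the nearest neighboring cluster; the score reported in the cells below is the average of s over all samples.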
K-Means was fit on the training data, producing a model that, given the data, returns an array of integer labels corresponding to the different clusters. Team SkidMarks then experimented with using these learned labels to build a supervised classifier, making this CAPSTONE a semi-supervised learning problem in the end; a sketch of that follow-on step appears below.
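The logistic-regression step described in the abstract is not reproduced as a cell in this notebook, so the following is a minimal sketch of that semi-supervised hand-off under stated assumptions: it reuses the imputed feature array (patched) and the K-Means labels (classified_data) produced in the modeling cells below, and the 60/40 split and default solver settings are illustrative rather than the team's exact configuration.

# A minimal sketch, not the team's exact model: predict a trip's cluster label
# from its telematic features with a logistic regression.
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(patched, classified_data,
                                                    test_size=0.4, random_state=0)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("Hold-out accuracy against the K-Means labels: %r" % clf.score(X_test, y_test))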
Reporting and Visualization. Results were presented as visualizations using the matplotlib Python library. A scatter matrix was first used to explore potential relationships between the calculated features. Next, cluster visualizations were created to show the decision space and the cluster groupings produced by the K-Means algorithm. Finally, a simple fitted-line plot was used to show the fitted classifier data.
Research has shown that K-Means is a popular and useful method for exploring unlabeled data. Team SkidMarks used K-Means clustering to assign drivers’ trips to classes based on the calculated features. The optimal number of clusters (k) was determined using the “elbow method”. This method aims to:
“graph the percentage of variance explained by the clusters against the number of clusters, so that the first clusters add much information (explain a lot of variance), but at some point the marginal gain from additional clusters will drop, giving an angle in the graph. The number of clusters are chosen at this point, hence the elbow criterion”
The graphic below suggests a value of four, five, or six as an optimal k for this study.
In [20]:
import os
import numpy as np
import pandas as pd
from sklearn import metrics
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.preprocessing import Imputer
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
###############################################################################
# File Path Info
###############################################################################
path = os.path.abspath(os.getcwd())
#load data from a CSV to a dataframe
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
###############################################################################
# Load Data and Prep for scikit-learn
###############################################################################
# Use the first 60% of the rows; only two features feed the elbow plot
train = int(len(skid_data) * .60)
test = train + 1  # start of the held-out remainder (not used further in this cell)
as_array = np.asfarray(skid_data[:train][['Max Acceleration (mph per s)','Turns']])
###############################################################################
# scikit-learn preprocessing; uncomment to see differences
###############################################################################
#Correct missing data
imputer = Imputer(missing_values="NaN", strategy="mean")
patched = imputer.fit_transform(as_array)
# Preprocessing tricks
#patched = StandardScaler().fit_transform(patched)
#patched = scale(patched, axis=0, with_mean=True)
#patched = preprocessing.normalize(patched, norm='l2')
#min_max_scaler = preprocessing.MinMaxScaler()
#patched = min_max_scaler.fit_transform(patched)
###############################################################################
# Main Functions
###############################################################################
# Run K-Means for k = 1..19 and record the average distortion for the elbow plot
K = range(1,20)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(patched)
    meandistortions.append(sum(np.min(cdist(patched,kmeans.cluster_centers_,'euclidean'),axis=1))/patched.shape[0])
plt.plot(K,meandistortions,'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
plt.show()
In [92]:
from time import time
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import Imputer
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
###
#Initial data load from the aggregate file csv
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
print (list(enumerate(skid_data.columns)))
#Loading into the numpy array
as_array = np.asfarray(skid_data[['Average Velocity (mph)','Max Velocity', 'Velocity Stdev','Average Acceleration (mph per s)', 'Max Acceleration (mph per s)', ' Acceleration Stdev','Displacement','Total Distance Traveled','Max Direction Change per sec', ' Direction Stdev','Time (s)', 'Turns', 'Aggressive Turns', 'Stops', 'Large Deceleration Events', 'Deceleration Events', 'Max Deceleration Event']])
print (skid_data.shape)
#number of groups
n_clusters=3
# preprocessing tricks
imputer = Imputer(missing_values="NaN", strategy="mean")
patched = imputer.fit_transform(as_array)
#patched = StandardScaler().fit_transform(patched)
#patched = scale(patched, axis=0, with_mean=True)
#cluster data
cluster = KMeans(n_clusters=n_clusters, n_init = 100)
cluster_labels = cluster.fit_predict(patched)
#assigned grouped labels to the skid data
labels = cluster.labels_
skid_data["labels"]=labels
# Make Predictions
predictions = cluster.predict(patched)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = cluster.labels_
#copy dataframe (may be memory intensive but just for illustration)
skid_data = skid_data.copy()
skid_data['Cluster Class'] = pd.Series(classified_data, index=skid_data.index)
print (cluster.labels_)
skid_data.plot( x = 'Average Velocity (mph)', y = 'Cluster Class', kind = 'scatter',facecolor='#FFFF99')
plt.show()
# Plotting the clusters
plt.scatter(patched[:, 0], patched[:, 11], color=colors[predictions].tolist(), s=5)
# Labeling the clusters
centers = cluster.cluster_centers_
center_colors = colors[:len(centers)]
plt.scatter(centers[:, 0], centers[:, 11], s=200, c=center_colors,marker='x', linewidths=3,
color ='k', alpha=1,zorder=10)
#for i, c in enumerate(centers):
#    plt.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)
# The per-cluster silhouette band plot for this feature set is built in the
# dedicated silhouette-analysis cell later in this notebook.
plt.title('K-means clustering on Average Velocity and the number of\n'
          'calculated turns in a trip; centroids are marked with a cross')
plt.xticks(())
plt.yticks(())
plt.ylabel('Turns')
plt.xlabel('Average Velocity (mph)')
#plt.figure(num=1, figsize=(18, 18), dpi=80, facecolor='w', edgecolor='k')
plt.rcParams['axes.facecolor'] = 'w'
plt.show()
classified_data = cluster.labels_
skid_data = skid_data.copy()
#print pd.Series(classified_data).mode()
skid_data['Cluster Class'] = pd.Series(classified_data, index=skid_data.index)
print ("This is what all of my work was for, this one little array.\n", classified_data)
silhouette_avg = silhouette_score(patched, classified_data)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
###############################################################################
# Ordinary Least Squares Report
###############################################################################
model = sm.OLS(classified_data, patched)
results = model.fit()
print (results.summary())
In [23]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def DisPlt(driver, trip):
    # Load the per-second feature file for one trip and plot the distribution of a
    # user-selected variable, overlaid with a fitted normal curve.
    path = os.path.abspath(os.getcwd())
    pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"output","trip",str(driver)+"_"+str(trip)+".csv"))
    df = pd.read_csv(pathtocsv)
    investigation = str(raw_input("Enter a variable \n>"))
    h = np.sort(df[investigation].values)              # sorted sample values
    fit = stats.norm.pdf(h, np.mean(h), np.std(h))     # fitted normal density
    plt.plot(h, fit, '-o')
    plt.hist(h, normed=True)                           # histogram of the data
    plt.show()

###############################################################################
# 'Main' Function
###############################################################################
DisPlt(1,1)
The scatter matrix is a visualization that helps to explore the "character" of the data; some of the feature comparisons may suggest potential correlations or relationships.
In [24]:
import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
scatter_matrix(df2,alpha=0.5, figsize=(15,15),diagonal='kde')
plt.show()
The first goal is to partition the set of drivers into groups (clusters), where drivers within a cluster are more similar to each other than to those in other clusters. K-Means was used because the algorithm scales well to large data sets and is one of the easier algorithms to implement and to obtain class labels from. The initial clustering follows:
In [44]:
from __future__ import print_function
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from sklearn.preprocessing import Imputer
from sklearn import preprocessing
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
X = np.asfarray(skid_data[['Average Velocity (mph)','Turns','Max Velocity', 'Velocity Stdev','Average Acceleration (mph per s)', 'Max Acceleration (mph per s)', ' Acceleration Stdev','Displacement','Total Distance Traveled','Max Direction Change per sec', ' Direction Stdev','Time (s)', 'Turns', 'Aggressive Turns', 'Stops', 'Large Deceleration Events', 'Deceleration Events', 'Max Deceleration Event']])
# Preprocessing tricks
imputer = Imputer(missing_values="NaN", strategy="mean")
X = imputer.fit_transform(X)
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for the telematic data
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Clustered Drivers")

    # The vertical line for the average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors)

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1],
                marker='o', c="white", alpha=1, s=200)
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    #plt.rcParams['axes.facecolor'] = '#FFCCCC'
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
In [84]:
from time import time
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn import metrics
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
from sklearn.metrics.cluster import v_measure_score
path = os.path.abspath(os.getcwd())
# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
###
#load data from a CSV to a dataframe
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
n_samples, n_features = skid_data.shape
print (skid_data.shape)
#skid_data=skid_data.fillna(value=-999)
#load all numeric feature columns into an array
as_array = np.asfarray(skid_data[['Average Velocity (mph)','Max Velocity', 'Velocity Stdev','Average Acceleration (mph per s)', 'Max Acceleration (mph per s)', ' Acceleration Stdev','Displacement','Total Distance Traveled','Max Direction Change per sec', ' Direction Stdev','Time (s)', 'Turns', 'Aggressive Turns', 'Stops', 'Large Deceleration Events', 'Deceleration Events', 'Max Deceleration Event']])
#number of groups
n_clusters=3
#Correct missing data
imputer = Imputer(missing_values="NaN", strategy="mean")
patched = imputer.fit_transform(as_array)
# Preprocessing tricks
patched = StandardScaler().fit_transform(patched)
#patched = scale(patched, axis=0, with_mean=True)
#patched = preprocessing.normalize(patched, norm='l2')
#min_max_scaler = preprocessing.MinMaxScaler()
#patched = min_max_scaler.fit_transform(patched)
#cluster data
cluster = KMeans(n_clusters=n_clusters)
cluster.fit(patched)
reduced_data = PCA(n_components=2).fit_transform(patched)
kmeans = KMeans(init='k-means++', n_clusters=n_clusters)
fit = kmeans.fit(reduced_data)
predict = kmeans.predict(reduced_data)
# Make Predictions
cluster.predict(patched)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = kmeans.labels_
#copy dataframe (may be memory intensive but just for illustration)
skid_data = skid_data.copy()
skid_data['Cluster Class'] = pd.Series(classified_data, index=skid_data.index)
print (kmeans.labels_)
skid_data.plot( x = 'Average Velocity (mph)', y = 'Cluster Class', kind = 'scatter')
plt.show()
# Scoring to evaluate cluster performance
# Silhouette Coefficient
print ("We want scores close to 1 \n")
SilhouetteCoefficient = metrics.silhouette_score(patched, classified_data, metric='euclidean')
# The supervised cluster metrics below (adjusted Rand index, mutual information,
# homogeneity, completeness, V-measure) require ground-truth labels, which this
# unlabeled data set does not have, so they are left commented out.
#AdjustRandIndex = metrics.adjusted_rand_score(classified_data, prediction_data)
#MutualInfoScore = metrics.adjusted_mutual_info_score(classified_data, prediction_data)
#HomogeneityScore = metrics.homogeneity_score(classified_data, prediction_data)
#CompletenessScore = metrics.completeness_score(classified_data, prediction_data)
#V_measure = metrics.v_measure_score(classified_data, prediction_data)
print ("The Silhouette Coefficient score is...", SilhouetteCoefficient)
#############
#scikit-learn visualization example
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max] x [y_min, y_max].
# Plot the decision boundary by assigning a color to each point in the mesh.
x_min, x_max = reduced_data[:, 0].min() + 1, reduced_data[:, 0].max() - 1
y_min, y_max = reduced_data[:, 1].min() + 1, reduced_data[:, 1].max() - 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired,
aspect='auto', origin='lower')
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=169, linewidths=3,
color='w', zorder=10)
plt.title('K-means clustering on a snippet of the Team SkidMarks dataset\n'
          '(PCA-reduced data); centroids are marked with a white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
As the figure above demonstrates, the clusters are nearly indistinguishable when we reduce the dimensionality of the features with PCA.
“Hyperspace jumps” remain where the GPS signal was likely lost for a short period, making the data appear as if the vehicle traveled 1,000 mph for part of the trip. We later discovered that the vehicle’s heading, i.e. the direction it is traveling, can also change dramatically because of GPS errors. To properly adjust for these GPS signal-loss issues, smoothing techniques are needed. In addition, AXA inserted fake trips to discourage trip matching, so a method is needed to programmatically identify and eliminate those trips.
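One possible smoothing approach (not implemented in this project) is a centered rolling median over the per-second speed, so that isolated GPS jumps do not dominate the derived features; the 5-second window and the feats frame from the earlier feature sketch are assumptions, and pd.rolling_median is the pre-0.18 pandas API used elsewhere in this notebook.

# Hypothetical smoothing step: centered rolling median of the per-second speed.
smoothed_speed = pd.rolling_median(feats['velocity_mph'], window=5, center=True)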
We have not broken out the key features, but based on the Kaggle competition winner's forum posts, they exist in our feature set.