In [10]:
from IPython.display import display
from IPython.display import Image
IPython formatting guidance: http://nbviewer.ipython.org/github/NelisW/ComputationalRadiometry/blob/master/01-IPythonHintsAndTips.ipynb
For those viewing this notebook as a webpage, rendering is provided by nbviewer: http://nbviewer.ipython.org/
Markdown formatting cheat sheet: http://assemble.io/docs/Cheatsheet-Markdown.html
In [11]:
picUrl = 'http://software.quest-global.com/wp-content/uploads/2014/11/connectedcar-logo.png'
Embed = Image(picUrl)
display(Embed)
In this project, we tackled the Kaggle competition sponsored by AXA, a multinational insurance company, to classify and predict automobile driver behavior from second-by-second positional GPS data. We performed the steps of a data science pipeline to ingest and wrangle data, calculate features, create classification and prediction models, and then visualize the results. Python was selected for this project because of the strength of its scikit-learn library for machine learning. Through research and analysis, we learned that an effective model must rely on both path matching and telematics; in other words, it is important to identify the maneuvers made and to describe how those maneuvers were performed. Since our GPS data was unlabeled, meaning that we did not have trips categorized as safe/unsafe, the classification task is an unsupervised learning problem, so the k-means clustering technique was chosen and run on 18 calculated features. Turns proved to be a key maneuver for distinguishing types of drivers, while average and maximum velocity and acceleration are useful telematics features for describing how maneuvers are performed. With the classes from the clustering algorithm, a prediction model was then created by performing a logistic regression on key velocity and acceleration features to predict which class a driver is likely to belong to based on a single trip.
Quantitative risk management has been a cornerstone of the auto insurance industry for many years. Traditionally, companies in the industry calculate risk-based premiums using driver history, demographics, geographic factors, and vehicle characteristics. Driver history includes events such as accidents, moving violations, and license suspensions. Demographic factors include age, gender, marital status, and occupation. Geographic factors include traffic and weather patterns, as well as crime rates such as vandalism and theft. Finally, vehicle characteristics include make, model, year, and mileage to derive a current value. Insurance companies combine these factors into an ensemble of models to obtain an expected loss, i.e. to predict the probability and the size of a future claim. While insurance providers clearly rely on a wealth of information to make pricing decisions, the data still has its limitations. For example, the driver history features listed above are relatively infrequent, negative events, and insurance companies typically only have access to three to five years of driver history data under state laws. Progressive, an industry leader in data-driven policy making, found that key driving behaviors, like actual miles driven, braking, and time of day, carry more than twice the predictive power of traditional insurance rating variables[1]. Furthermore, a 2012 Deloitte white paper highlighted the mutually beneficial motivations behind this push for driving data:
"...the advantages accrue to virtually all sides. Insurance companies benefit from matching premiums more closely >with actual risk. Drivers benefit from the opportunity to lower their insurance rates by driving less or more >safely..."
Knowing that more data can often beat better models, insurance companies are now looking to incorporate daily driving behavior into their premium pricing models to gain a competitive advantage. You may be familiar with Progressive’s Snapshot commercials, in which customers choose to place a device in their vehicles that monitors the vehicle’s position and speed. The idea is to reward customers with good driving habits by offering them discounted premiums. With this information, Progressive can also identify customers with risky driving habits and then initiate a variety of mitigation options such as offering corrective guidance, increasing premiums, or even terminating coverage. With the prevalence of GPS devices in vehicles, either as standalone units or built into drivers’ cell phones, insurance companies now have access to real-time driver behavior, allowing more accurate predictions of claims and expected losses than ever before. Like Progressive, AXA Insurance is moving in the direction of capturing current driver data and has sponsored a Kaggle competition to obtain a top-performing classification and prediction model. The purpose of this paper is to outline the steps taken to create a model that classifies driver behavior based on telematic data provided by AXA. A prediction model is then described to help determine which class a driver most likely belongs to based on a single trip.
…auto insurance companies charged premiums based on accidents and moving violations.
…drivers were being penalized for these infrequent, negative events.
…a company decided to capture the daily driving habits of its customers in its premium pricing model.
…customers with good driving habits felt happier after being rewarded with lower premiums.
…the insurance company experienced less customer turnover, and more new customers applied for policies.
…the insurance company had captured greater market share.
AXA, the French multinational insurance and asset management firm, sponsored a Kaggle competition and provided a clean data file for each of 200 trips from more than 3,600 drivers. Each file contains a list of anonymized vehicle positions at every second of the trip (i.e. each row represents the vehicle’s 2-dimensional position after one second). To protect the identity and location of the drivers, longitude and latitude coordinates are not provided, so this data set cannot be matched against the coordinates of major roadways to determine speed limits, road signs, and traffic lights.
In [12]:
driver = raw_input("Enter a number between 1-3612:\n>")
In [13]:
trip = raw_input("Enter a number between 1-200:\n>")
In [14]:
import pandas as pd
import os
%matplotlib inline
import matplotlib.pyplot as plt

# Build the path to the selected raw trip file (input/test/<driver>/<trip>.csv)
path = os.path.abspath(os.getcwd())
pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"input","test",str(driver),str(trip)+".csv"))

# Load the trip, preview the first 15 (x, y) positions, then plot the full path
df = pd.read_csv(pathtocsv)
print df[:15][['x','y']]
df.plot(kind = 'scatter', x = 'x', y = 'y', figsize=(18,10))
plt.show()
In [15]:
import math

# Convert a (dy, dx) displacement into a heading in degrees on [0, 360)
def getDirection(y, x):
    direction_rad = math.atan2(y, x)
    direction = math.degrees(direction_rad)
    if direction < 0:
        direction += 360
    return direction

# Bucket a heading in degrees into one of the eight cardinal/intercardinal directions
def getCardinalDirection(direction):
    carddir = ''
    if direction >= 0 and direction <= 22.5:
        carddir = 'East'
    elif direction > 337.5 and direction <= 360:
        carddir = 'East'
    elif direction > 22.5 and direction <= 67.5:
        carddir = 'Northeast'
    elif direction > 67.5 and direction <= 112.5:
        carddir = 'North'
    elif direction > 112.5 and direction <= 157.5:
        carddir = 'Northwest'
    elif direction > 157.5 and direction <= 202.5:
        carddir = 'West'
    elif direction > 202.5 and direction <= 247.5:
        carddir = 'Southwest'
    elif direction > 247.5 and direction <= 292.5:
        carddir = 'South'
    elif direction > 292.5 and direction <= 337.5:
        carddir = 'Southeast'
    return carddir

print "The cardinal direction is %r and the direction in degrees is %r" % (getCardinalDirection(getDirection(11.2,-4.2)), getDirection(11.2,-4.2))

# 3-second rolling sum of the positional data (pd.rolling_sum is the pre-0.18 pandas API)
val = pd.rolling_sum(df, window = 3)
print val[2:10]
Data Ingestion. The original AXA data set was downloaded and warehoused as a zipped set of CSV files from the competition website. The data was stored in as raw a form as possible: each driver (3,612 in total) had a directory containing two hundred (200) trips, and each trip file had an x and y column value for each reported second of the trip. Outside of standard browser download tools and local operating system directories, no special tools were used for the ingestion phase of the data science pipeline, although there was some experimentation with wget. Team SkidMarks initially planned to use Postgres as the Write Once Read Many (WORM) repository, but scheduling, learning curves, and feature calculations (discussed later) consumed the bulk of the project time, so local folders on teammate computers served as the ingestion landing zone.
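To make the landing-zone layout concrete, the short sketch below simply walks the input/test/<driver>/<trip>.csv structure referenced in the cells above and counts trip files per driver; the root-path construction mirrors those cells and is otherwise an assumption, not part of the team's ingestion scripts.

# Walk the ingestion landing zone (input/test/<driver>/<trip>.csv) and count trips per driver.
import os

root = os.path.normpath(os.path.join(os.path.dirname(os.path.abspath(os.getcwd())), "input", "test"))
for driver_id in sorted(os.listdir(root)):
    driver_dir = os.path.join(root, driver_id)
    if os.path.isdir(driver_dir):
        trips = [f for f in os.listdir(driver_dir) if f.endswith(".csv")]
        print("%s has %d trip files" % (driver_id, len(trips)))  # expect 200 per driver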
Data Munging and Wrangling. Using team-created Python scripts, the data was normalized and replicated (a concatenated primary key was added by creating trip_id and driver_id columns) to intuitively capture the relationships within the data. The os Python module was used for operating system interoperability, and the csv Python module was used to read and write the normalized database structure to another directory for computation and analysis.
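The team's normalization scripts are not reproduced in this notebook; the sketch below is a simplified, hypothetical version of that step using the csv module, prepending the concatenated key (driver_id, trip_id) to each row before writing the trip to the output directory. The function name and exact column order are assumptions.

import csv

def normalize_trip(in_path, out_path, driver_id, trip_id):
    # Read a raw trip file (x, y per second) and rewrite it with the
    # concatenated primary key (driver_id, trip_id) added to every row.
    with open(in_path) as raw, open(out_path, 'w') as normed:
        reader = csv.DictReader(raw)
        writer = csv.writer(normed)
        writer.writerow(['driver_id', 'trip_id', 'x', 'y'])
        for row in reader:
            writer.writerow([driver_id, trip_id, row['x'], row['y']])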
In [16]:
path = os.path.abspath(os.getcwd())
pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"output","test",str(driver), str(driver) + "_" +str(trip)+".csv"))
normed_df = pd.read_csv(pathtocsv)
print pathtocsv
print normed_df[:15][[0,1,2,3]]
Computation and Analyses. Naturally, this step consumed the bulk of the CAPSTONE project’s time and resources; feature creation was a coding-, memory-, and research-intensive process. The Python pandas library was used because it provides high-performance, easy-to-use data structures and data analysis tools. From simple x and y positional data, Team SkidMarks used pandas to compute over 18 telematic and trip-matching features:
Velocity: average, standard deviation, and maximum
Acceleration: average, standard deviation, and maximum
Distance: increment traveled per second, total distance traveled (sum of all increments), and total displacement
Heading: heading in degrees, heading as a cardinal direction, heading standard deviation, and maximum 1-second change in heading for the trip
Turns: count of maneuvers with a greater than 45-degree change in direction over a 3-second rolling window, and turns made at speeds over 30 mph (aggressive turns)
Braking: deceleration events lasting 3 seconds (braking), large deceleration events, maximum deceleration event, and consecutive zero acceleration/velocity/increment events (stops)
Time: total reported trip time
Base features were computed on a per-second basis for each trip (a sketch follows below). Next, aggregate trip values, such as the average velocity over the duration of the trip, were computed for all driver files in the data store.
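The sketch that follows illustrates how the per-second base features can be derived from the raw x and y columns with pandas. It assumes the coordinates are in meters sampled once per second (hence the 2.23694 m/s-to-mph factor); that unit assumption matches the feature names used later but is not confirmed by the competition data, and the team's production scripts may differ.

# A minimal sketch of the per-second base features, assuming x/y are meters sampled at 1 Hz.
import numpy as np
import pandas as pd

def per_second_features(trip_df):
    dx = trip_df['x'].diff()
    dy = trip_df['y'].diff()
    feats = pd.DataFrame(index=trip_df.index)
    feats['increment'] = np.sqrt(dx ** 2 + dy ** 2)               # meters traveled in this second
    feats['velocity_mph'] = feats['increment'] * 2.23694          # m/s -> mph (assumed units)
    feats['accel_mph_s'] = feats['velocity_mph'].diff()           # change in speed per second
    feats['heading_deg'] = np.degrees(np.arctan2(dy, dx)) % 360   # heading on [0, 360)
    return feats

# Trip-level aggregates then follow directly, e.g.:
# feats['velocity_mph'].mean(), feats['velocity_mph'].max(), feats['heading_deg'].std()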
Despite several value- and rule-based filters, Team SkidMarks did not completely remove outlier data. “Hyperspace jumps” remain in the data set: periods where GPS signal loss makes it appear that a vehicle covered a large distance in a very short time. However, the existing filters removed the bulk of the erroneous data.
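As one example of the value/rule-based filtering, the hypothetical rule below drops any second whose implied speed exceeds a hard ceiling; the 150 mph threshold and the feats frame from the previous sketch are assumptions, not the team's actual cutoffs.

# Hypothetical value-based filter for "hyperspace jumps": discard physically implausible seconds.
MAX_PLAUSIBLE_MPH = 150.0  # assumed ceiling, not the team's actual threshold

def drop_hyperspace_jumps(feats):
    return feats[feats['velocity_mph'].fillna(0) <= MAX_PLAUSIBLE_MPH]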
In [17]:
df2 = pd.read_csv(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv')))
print df2[:10]
In [18]:
df2.describe()
Out[18]:
In [19]:
path = os.path.abspath(os.getcwd())
pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"input","test","1","200" +".csv"))
df = pd.read_csv(pathtocsv)
df.plot(kind = 'scatter', x = 'x', y = 'y', figsize=(18,10))
plt.ylim(-2000,650)
plt.xlim(-4500,100)
plt.show()
Modeling and Application. Determining safe versus unsafe drivers is an unsupervised learning task. Clustering with K-Means provides the best machine learning option to explore the data set and to discover hidden structure or patterns in unlabeled training data. The goal of the cluster analysis is to group observed trips so that drivers with similar telematic and/or trip-matching features fall into the same cluster, i.e. members of a cluster are more similar to each other under a given metric than they are to members of other clusters. Team SkidMarks used the KMeans clustering module within Python’s scikit-learn machine learning library; the pandas, NumPy, and SciPy libraries were used alongside scikit-learn to ingest, preprocess, and load the data.
The silhouette coefficient was used as a performance measure for the clusters; it measures both the compactness of each cluster and the separation between clusters. The best value is 1 and the worst value is -1.
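Concretely, for a single sample the coefficient is s = (b − a) / max(a, b), where a is the mean distance from that sample to the other samples in its own cluster and b is the mean distance to the samples in the nearest neighboring cluster; the score reported in the cells below is the average of s over all samples.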
K-Means was fit on the training data, producing a model that, given the data, returns an array of integer labels corresponding to the different clusters. Team SkidMarks then experimented with using these learned labels to build a supervised classifier, making this CAPSTONE a semi-supervised learning problem in the end; a sketch of that follow-on step appears below.
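The logistic-regression step described in the abstract is not reproduced as a cell in this notebook, so the following is a minimal sketch of that semi-supervised hand-off under stated assumptions: it reuses the imputed feature array (patched) and the K-Means labels (classified_data) produced in the modeling cells below, and the 60/40 split and default solver settings are illustrative rather than the team's exact configuration.

# A minimal sketch, not the team's exact model: predict a trip's cluster label
# from its telematic features with a logistic regression.
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(patched, classified_data,
                                                    test_size=0.4, random_state=0)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("Hold-out accuracy against the K-Means labels: %r" % clf.score(X_test, y_test))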
Reporting and Visualization. Results were presented as visualizations using the matplotlib Python library. A scatter matrix was first used to explore potential relationships between the calculated features. Next, cluster visualizations were created to show the decision space and the cluster groupings produced by the K-Means algorithm. Finally, a simple fitted-line plot was used to show the fitted classifier data.
Research has shown that K-Means is a popular and useful method for exploring unlabeled data. Team SkidMarks used K-Means clustering to assign drivers’ trips to classes based on the calculated features. The optimal number of clusters (k) was determined using the “elbow method”. This method aims to:
“graph the percentage of variance explained by the clusters against the number of clusters, so that the first clusters add much information (explain a lot of variance), but at some point the marginal gain from additional clusters will drop, giving an angle in the graph. The number of clusters are chosen at this point, hence the elbow criterion”
The graphic below suggests a value of four, five, or six as an optimal k for this study.
In [20]:
import os
import numpy as np
import pandas as pd
from sklearn import metrics
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.preprocessing import Imputer
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
###############################################################################
# File Path Info
###############################################################################
path = os.path.abspath(os.getcwd())
#load data from a CSV to a dataframe
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
###############################################################################
# Load Data and Prep for scikit-learn
###############################################################################
# Use the first 60% of the rows; only two features feed the elbow plot
train = int(len(skid_data) * .60)
test = train + 1  # start of the held-out remainder (not used further in this cell)
as_array = np.asfarray(skid_data[:train][['Max Acceleration (mph per s)','Turns']])
###############################################################################
# scikit-learn preprocessing; uncomment to see differences
###############################################################################
#Correct missing data
imputer = Imputer(missing_values="NaN", strategy="mean")
patched = imputer.fit_transform(as_array)
# Preprocessing tricks
#patched = StandardScaler().fit_transform(patched)
#patched = scale(patched, axis=0, with_mean=True)
#patched = preprocessing.normalize(patched, norm='l2')
#min_max_scaler = preprocessing.MinMaxScaler()
#patched = min_max_scaler.fit_transform(patched)
###############################################################################
# Main Functions
###############################################################################
# Run K-Means for k = 1..19 and record the average distortion for the elbow plot
K = range(1,20)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(patched)
    meandistortions.append(sum(np.min(cdist(patched,kmeans.cluster_centers_,'euclidean'),axis=1))/patched.shape[0])
plt.plot(K,meandistortions,'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
plt.show()
In [92]:
from time import time
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import Imputer
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
###
#Initial data load from the aggregate file csv
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
print (list(enumerate(skid_data.columns)))
#Loading into the numpy array
as_array = np.asfarray(skid_data[['Average Velocity (mph)','Max Velocity', 'Velocity Stdev','Average Acceleration (mph per s)', 'Max Acceleration (mph per s)', ' Acceleration Stdev','Displacement','Total Distance Traveled','Max Direction Change per sec', ' Direction Stdev','Time (s)', 'Turns', 'Aggressive Turns', 'Stops', 'Large Deceleration Events', 'Deceleration Events', 'Max Deceleration Event']])
print (skid_data.shape)
#number of groups
n_clusters=3
# preprocessing tricks
imputer = Imputer(missing_values="NaN", strategy="mean")
patched = imputer.fit_transform(as_array)
#patched = StandardScaler().fit_transform(patched)
#patched = scale(patched, axis=0, with_mean=True)
#cluster data
cluster = KMeans(n_clusters=n_clusters, n_init = 100)
cluster_labels = cluster.fit_predict(patched)
#assigned grouped labels to the skid data
labels = cluster.labels_
skid_data["labels"]=labels
# Make Predictions
predictions = cluster.predict(patched)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = cluster.labels_
#copy dataframe (may be memory intensive but just for illustration)
skid_data = skid_data.copy()
skid_data['Cluster Class'] = pd.Series(classified_data, index=skid_data.index)
print (cluster.labels_)
skid_data.plot( x = 'Average Velocity (mph)', y = 'Cluster Class', kind = 'scatter',facecolor='#FFFF99')
plt.show()
# Plotting the clusters
plt.scatter(patched[:, 0], patched[:, 11], color=colors[predictions].tolist(), s=5)
# Labeling the clusters
centers = cluster.cluster_centers_
center_colors = colors[:len(centers)]
plt.scatter(centers[:, 0], centers[:, 11], s=200, c=center_colors,marker='x', linewidths=3,
color ='k', alpha=1,zorder=10)
#for i, c in enumerate(centers):
#    plt.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)
# The per-cluster silhouette band plot for this feature set is built in the
# dedicated silhouette-analysis cell later in this notebook.
plt.title('K-means clustering on Average Velocity and the number of\n'
          'calculated turns in a trip; centroids are marked with a cross')
plt.xticks(())
plt.yticks(())
plt.ylabel('Turns')
plt.xlabel('Average Velocity (mph)')
#plt.figure(num=1, figsize=(18, 18), dpi=80, facecolor='w', edgecolor='k')
plt.rcParams['axes.facecolor'] = 'w'
plt.show()
classified_data = cluster.labels_
skid_data = skid_data.copy()
#print pd.Series(classified_data).mode()
skid_data['Cluster Class'] = pd.Series(classified_data, index=skid_data.index)
print ("This is what all of my work was for, this one little array.\n", classified_data)
silhouette_avg = silhouette_score(patched, classified_data)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
###############################################################################
# Ordinary Least Squares Report
###############################################################################
model = sm.OLS(classified_data, patched)
results = model.fit()
print (results.summary())
In [23]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def DisPlt(driver, trip):
    # Load the per-second feature file for one trip and plot the distribution of a
    # user-selected variable, overlaid with a fitted normal curve.
    path = os.path.abspath(os.getcwd())
    pathtocsv = os.path.normpath(os.path.join(os.path.dirname(path),"output","trip",str(driver)+"_"+str(trip)+".csv"))
    df = pd.read_csv(pathtocsv)
    investigation = str(raw_input("Enter a variable \n>"))
    h = np.sort(df[investigation].values)              # sorted sample values
    fit = stats.norm.pdf(h, np.mean(h), np.std(h))     # fitted normal density
    plt.plot(h, fit, '-o')
    plt.hist(h, normed=True)                           # histogram of the data
    plt.show()

###############################################################################
# 'Main' Function
###############################################################################
DisPlt(1,1)
The scatter matrix is a visualization that helps to explore the "character" of the data; some of the feature comparisons may suggest potential correlations or relationships.
In [24]:
import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
scatter_matrix(df2,alpha=0.5, figsize=(15,15),diagonal='kde')
plt.show()
The first goal is to partition the set of drivers into groups (clusters), where drivers within a cluster are more similar to each other than to those in other clusters. K-Means was used because the algorithm scales well to large data sets and is one of the easier algorithms to implement and to obtain class labels from. The initial clustering follows:
In [44]:
from __future__ import print_function
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from sklearn.preprocessing import Imputer
from sklearn import preprocessing
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
X = np.asfarray(skid_data[['Average Velocity (mph)','Turns','Max Velocity', 'Velocity Stdev','Average Acceleration (mph per s)', 'Max Acceleration (mph per s)', ' Acceleration Stdev','Displacement','Total Distance Traveled','Max Direction Change per sec', ' Direction Stdev','Time (s)', 'Turns', 'Aggressive Turns', 'Stops', 'Large Deceleration Events', 'Deceleration Events', 'Max Deceleration Event']])
# Preprocessing tricks
imputer = Imputer(missing_values="NaN", strategy="mean")
X = imputer.fit_transform(X)
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for the telematic data
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Clustered Drivers")

    # The vertical line for the average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors)

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1],
                marker='o', c="white", alpha=1, s=200)
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    #plt.rcParams['axes.facecolor'] = '#FFCCCC'
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
In [84]:
from time import time
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn import metrics
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
from sklearn.metrics.cluster import v_measure_score
path = os.path.abspath(os.getcwd())
# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
###
#load data from a CSV to a dataframe
with open(os.path.normpath(os.path.join(os.path.dirname(path),'lin.csv'))) as in_data:
    skid_data = pd.DataFrame.from_csv(in_data, sep=',')
n_samples, n_features = skid_data.shape
print (skid_data.shape)
#skid_data=skid_data.fillna(value=-999)
#load all numeric feature columns into an array
as_array = np.asfarray(skid_data[['Average Velocity (mph)','Max Velocity', 'Velocity Stdev','Average Acceleration (mph per s)', 'Max Acceleration (mph per s)', ' Acceleration Stdev','Displacement','Total Distance Traveled','Max Direction Change per sec', ' Direction Stdev','Time (s)', 'Turns', 'Aggressive Turns', 'Stops', 'Large Deceleration Events', 'Deceleration Events', 'Max Deceleration Event']])
#number of groups
n_clusters=3
#Correct missing data
imputer = Imputer(missing_values="NaN", strategy="mean")
patched = imputer.fit_transform(as_array)
# Preprocessing tricks
patched = StandardScaler().fit_transform(patched)
#patched = scale(patched, axis=0, with_mean=True)
#patched = preprocessing.normalize(patched, norm='l2')
#min_max_scaler = preprocessing.MinMaxScaler()
#patched = min_max_scaler.fit_transform(patched)
#cluster data
cluster = KMeans(n_clusters=n_clusters)
cluster.fit(patched)
reduced_data = PCA(n_components=2).fit_transform(patched)
kmeans = KMeans(init='k-means++', n_clusters=n_clusters)
fit = kmeans.fit(reduced_data)
predict = kmeans.predict(reduced_data)
# Make Predictions
cluster.predict(patched)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = kmeans.labels_
#copy dataframe (may be memory intensive but just for illustration)
skid_data = skid_data.copy()
skid_data['Cluster Class'] = pd.Series(classified_data, index=skid_data.index)
print (kmeans.labels_)
skid_data.plot( x = 'Average Velocity (mph)', y = 'Cluster Class', kind = 'scatter')
plt.show()
# Scoring to evaluate cluster performance
# Silhouette Coefficient
print ("We want scores close to 1 \n")
SilhouetteCoefficient = metrics.silhouette_score(patched, classified_data, metric='euclidean')
# The supervised cluster metrics below (adjusted Rand index, mutual information,
# homogeneity, completeness, V-measure) require ground-truth labels, which this
# unlabeled data set does not have, so they are left commented out.
#AdjustRandIndex = metrics.adjusted_rand_score(classified_data, prediction_data)
#MutualInfoScore = metrics.adjusted_mutual_info_score(classified_data, prediction_data)
#HomogeneityScore = metrics.homogeneity_score(classified_data, prediction_data)
#CompletenessScore = metrics.completeness_score(classified_data, prediction_data)
#V_measure = metrics.v_measure_score(classified_data, prediction_data)
print ("The Silhouette Coefficient score is...", SilhouetteCoefficient)
#############
#scikit-learn visualization example
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max] x [y_min, y_max].
# Plot the decision boundary by assigning a color to each point in the mesh.
x_min, x_max = reduced_data[:, 0].min() + 1, reduced_data[:, 0].max() - 1
y_min, y_max = reduced_data[:, 1].min() + 1, reduced_data[:, 1].max() - 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired,
aspect='auto', origin='lower')
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=169, linewidths=3,
color='w', zorder=10)
plt.title('K-means clustering on a snippet of the Team SkidMarks dataset\n'
          '(PCA-reduced data); centroids are marked with a white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
As the figure above demonstrates, the clusters are nearly indistinguishable when we reduce the dimensionality of the features with PCA.
“Hyperspace jumps” remain where the GPS signal was likely lost for a short period, making the data appear as if the vehicle traveled 1,000 mph for part of the trip. We later discovered that the vehicle’s heading, i.e. the direction it is traveling, can also change dramatically because of GPS errors. To properly adjust for these GPS signal-loss issues, smoothing techniques are needed. In addition, AXA inserted fake trips to discourage trip matching, so a method is needed to programmatically identify and eliminate those trips.
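One possible smoothing approach (not implemented in this project) is a centered rolling median over the per-second speed, so that isolated GPS jumps do not dominate the derived features; the 5-second window and the feats frame from the earlier feature sketch are assumptions, and pd.rolling_median is the pre-0.18 pandas API used elsewhere in this notebook.

# Hypothetical smoothing step: centered rolling median of the per-second speed.
smoothed_speed = pd.rolling_median(feats['velocity_mph'], window=5, center=True)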
We have not broken out the key features, but based on the Kaggle competition winner's forum posts, they exist in our feature set.