Silvestro di Luise [silvestro.diluise@gmail.com]

  • This notebook contains some non-static elements (indexes, maps); for proper rendering it is suggested to view it via the Jupyter nbviewer: paste the GitHub link of this notebook at nbviewer.jupyter.org




Introduction




  • The first part of the data processing/preparation is performed on a Spark Data Frame. This is in principle not necessary on a small dataset like the one under study. Nevertheless the Spark framework has been used, since that would be the way to proceed in the case of a very large dataset distributed on an (HDFS) cluster.
    The tool used is PySpark.

  • The second part of the data processing/preparation (characterization and modelling) is performed on a Pandas Data Frame. After data cleaning it is advisable to perform data exploration and modelling on a small portion of the data stored in a locally resident object. Tools used: Pandas, NumPy, scikit-learn, ... The full pipeline can then be run on the distributed dataset using the Spark MLlib package to implement the ML algorithm; a minimal sketch of this pattern is given below.
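
A minimal, hypothetical sketch of this two-stage pattern (not one of the executed cells of this notebook; it assumes a cleaned Spark DataFrame df and uses placeholder feature names such as 'gps_distance' and 'duration_seconds'):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # local exploration/modelling on a fraction of the data (fraction chosen arbitrarily)
    pdf_sample = df.sample(False, 0.1, seed=0).toPandas()

    # the same modelling step expressed with Spark MLlib, runnable on the full distributed dataset
    assembler = VectorAssembler(inputCols=['gps_distance', 'duration_seconds'], outputCol='features')
    kmeans = KMeans(k=3, seed=0, featuresCol='features')
    model = Pipeline(stages=[assembler, kmeans]).fit(df)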


General Settings



In [154]:
# display all outputs in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
# enable retina display quality for figures
%config InlineBackend.figure_format = 'retina'

In [3]:
# do not show warnings (turn on for final presentation)
import warnings
warnings.filterwarnings('ignore')

In [4]:
# platform setting
import os, sys, platform
platform.python_version()


Out[4]:
'2.7.12'

In [5]:
# numerical utilities
import numpy as np

In [6]:
# Pandas module import and configuration

import pandas as pd

# avoid truncation when displaying long field values
pd.set_option('display.max_colwidth', -1)

# avoid column truncation when displaying high dimension data frames 
pd.set_option('display.max_columns', None)

# show up to 999 rows when displaying data frames
pd.options.display.max_rows = 999

# avoid long decimal printouts
pd.options.display.float_format = '{:,.5f}'.format

Spark environment setup


In [7]:
from pyspark import SparkContext, SQLContext, SparkConf

conf = SparkConf().setMaster("local").set("spark.driver.memory", "1g").set("spark.executor.memory", "1g") 
        
sc = SparkContext(conf = conf)
  • Start a Spark session

In [8]:
# spark object: entry point since Spark 2.0, a handle to the Spark session
from pyspark.sql import SparkSession
spark = SparkSession(sc)
spark.version


Out[8]:
u'2.2.0'
  • Set the number of shuffle partitions. This is not necessary on a small dataset and while running Spark locally,
    but it is in general advisable to optimize this setting

In [9]:
spark.conf.get("spark.sql.shuffle.partitions")
spark.conf.set("spark.sql.shuffle.partitions",2)
spark.conf.get("spark.sql.shuffle.partitions")


Out[9]:
u'200'
Out[9]:
u'2'

Load utility functions and data types


In [10]:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType, LongType, FloatType, DoubleType, TimestampType




Data Ingestion




to index

  • Data is a set of partitions saved in parquet format
  • Since the data set is a few MBs, distributed over about 40 partitions, it is possible to list the single files and check their state individually

In [11]:
# set path to input data
data_path = 'Data/zubie_trips_anonymous/'

In [12]:
# list local data directory 
!ls -ltrh Data/zubie_trips_anonymous | grep .parquet | awk '{print $5, $9}'


5.8K part-00037-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00036-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00035-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00034-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00033-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00032-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00031-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00030-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00029-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00028-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00027-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00026-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00025-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00024-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00023-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00022-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00021-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00020-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00019-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00018-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00017-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00016-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00015-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00014-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
5.8K part-00013-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
26K part-00012-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
57K part-00011-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
111K part-00010-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
129K part-00009-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
154K part-00008-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
144K part-00007-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
298K part-00006-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
271K part-00005-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
236K part-00004-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
250K part-00003-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
313K part-00002-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
473K part-00001-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet
533K part-00000-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet

In [13]:
# loop over files using the HDFS filesystem interface, as if the data were on a cluster
# read data files and print row counts 

hadoop = spark._jvm.org.apache.hadoop
fs   = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration() 
path = hadoop.fs.Path(data_path)

for f in fs.get(conf).listStatus(path):
    name =  str(f.getPath())
    if name.endswith('.parquet'):
        counts = spark.read.parquet(name).count()  
        print ("{}, rows: {} ".format( name.split('/')[-1], counts) )


part-00000-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 8530 
part-00001-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 4922 
part-00002-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 3115 
part-00003-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 2202 
part-00004-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 1701 
part-00005-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 2508 
part-00006-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 1803 
part-00007-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 1357 
part-00008-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 1102 
part-00009-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 856 
part-00010-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 534 
part-00011-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 281 
part-00012-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 41 
part-00013-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00014-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00015-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00016-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00017-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00018-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00019-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00020-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00021-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00022-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00023-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00024-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00025-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00026-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00027-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00028-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00029-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00030-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00031-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00032-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00033-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00034-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00035-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00036-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
part-00037-3927ef29-1d8d-4f92-bb39-ac1f52d70583.snappy.parquet, rows: 0 
  • Load the whole data set into a Spark DataFrame
    Get the total number of rows

In [14]:
df = spark.read.parquet(data_path)

numOfFieldsRaw = len(df.columns)

numOfRows = df.count()
numOfRowsRaw = numOfRows

print ( "total num of fields: {}".format( numOfFieldsRaw ) )
print ( "total num of rows: {}".format( numOfRowsRaw ) )


total num of fields: 56
total num of rows: 28952
  • Number of Partitions

In [15]:
print ( "Number of partitions: {}".format(df.rdd.getNumPartitions()) )


Number of partitions: 2
  • Print the Data Frame schema: field name and type

In [16]:
print ( " Data Frame Schema: " )
df.printSchema()


 Data Frame Schema: 
root
 |-- device_key: string (nullable = true)
 |-- distance_um: string (nullable = true)
 |-- duration_seconds: double (nullable = true)
 |-- end_point_address_city: string (nullable = true)
 |-- end_point_address_country: string (nullable = true)
 |-- end_point_address_state: string (nullable = true)
 |-- end_point_address_zipcode: string (nullable = true)
 |-- end_point_daylight_saving_time_flag: boolean (nullable = true)
 |-- end_point_latitude: double (nullable = true)
 |-- end_point_longitude: double (nullable = true)
 |-- end_point_place: string (nullable = true)
 |-- end_point_timestamp: string (nullable = true)
 |-- end_point_timestamp_tz: string (nullable = true)
 |-- end_point_timestamp_utc: string (nullable = true)
 |-- fuel_consumed: string (nullable = true)
 |-- fuel_consumed_um: string (nullable = true)
 |-- fuel_cost: string (nullable = true)
 |-- fuel_cost_currency_code: string (nullable = true)
 |-- fuel_cost_currency_symbol: string (nullable = true)
 |-- fuel_ppg: string (nullable = true)
 |-- fuel_type: string (nullable = true)
 |-- gps_distance: double (nullable = true)
 |-- gps_miles: double (nullable = true)
 |-- hard_accel_count: long (nullable = true)
 |-- hard_brake_count: long (nullable = true)
 |-- idle_seconds: long (nullable = true)
 |-- key: string (nullable = true)
 |-- mpg_combined: string (nullable = true)
 |-- obd_distance: double (nullable = true)
 |-- obd_miles: double (nullable = true)
 |-- points_city_count: long (nullable = true)
 |-- points_hwy_count: long (nullable = true)
 |-- speed_um: string (nullable = true)
 |-- speeding_city_major_count: long (nullable = true)
 |-- speeding_city_minor_count: long (nullable = true)
 |-- speeding_hwy_major_count: long (nullable = true)
 |-- speeding_hwy_minor_count: long (nullable = true)
 |-- start_point_address_city: string (nullable = true)
 |-- start_point_address_country: string (nullable = true)
 |-- start_point_address_state: string (nullable = true)
 |-- start_point_address_zipcode: string (nullable = true)
 |-- start_point_daylight_saving_time_flag: boolean (nullable = true)
 |-- start_point_latitude: double (nullable = true)
 |-- start_point_longitude: double (nullable = true)
 |-- start_point_place: string (nullable = true)
 |-- start_point_timestamp: string (nullable = true)
 |-- start_point_timestamp_tz: string (nullable = true)
 |-- start_point_timestamp_utc: string (nullable = true)
 |-- static_map_url: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- top_speed: double (nullable = true)
 |-- top_speed_mph: double (nullable = true)
 |-- trip_segments: long (nullable = true)
 |-- user: string (nullable = true)
 |-- vehicle_key: string (nullable = true)
 |-- vehicle_nickname: string (nullable = true)




Data Preparation: Inspection, Cleaning, Fixing, Validation




to index

  • Analysis performed on Spark Data Frame

First Look at the Data

  • To visualize table data, a few rows of the Spark DF are retrieved and transformed to a Pandas DF for better formatting

In [17]:
df.limit(10).toPandas()


Out[17]:
device_key distance_um duration_seconds end_point_address_city end_point_address_country end_point_address_state end_point_address_zipcode end_point_daylight_saving_time_flag end_point_latitude end_point_longitude end_point_place end_point_timestamp end_point_timestamp_tz end_point_timestamp_utc fuel_consumed fuel_consumed_um fuel_cost fuel_cost_currency_code fuel_cost_currency_symbol fuel_ppg fuel_type gps_distance gps_miles hard_accel_count hard_brake_count idle_seconds key mpg_combined obd_distance obd_miles points_city_count points_hwy_count speed_um speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count start_point_address_city start_point_address_country start_point_address_state start_point_address_zipcode start_point_daylight_saving_time_flag start_point_latitude start_point_longitude start_point_place start_point_timestamp start_point_timestamp_tz start_point_timestamp_utc static_map_url tags top_speed top_speed_mph trip_segments user vehicle_key vehicle_nickname
0 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 425.00000 Cleckheaton GB England BD19 3TY False 53.73208 -1.71761 None 2016-11-24 12:33:01 GMT 2016-11-24 12:33:01 None gal None EUR 2.39 regular 0.74782 0.74782 0 0 238 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUmVUaEpUNlFaRmJhQ1IyREtTTmFSNww None 0.78106 0.78106 12 0 mph 0 0 0 0 Cleckheaton GB England BD19 3BE False 53.72496 -1.71555 None 2016-11-24 12:25:56 GMT 2016-11-24 12:25:56 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUmVUaEpUNlFaRmJhQ1IyREtTTmFSNww/staticmap?t=QTd3VNpBsUxZ7dW2GPUkOycYN48 [] 32.30000 32.30000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
1 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 235.00000 Cleckheaton GB England BD19 3BJ False 53.72505 -1.71570 None 2016-11-24 12:12:05 GMT 2016-11-24 12:12:05 None gal None EUR 2.39 regular 0.87467 0.87467 0 0 25 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWazNmMmdEb05oRHJWY3NQSEh5aGdlVgw None 0.88483 0.88483 12 0 mph 0 0 0 0 Cleckheaton GB England BD19 3TY False 53.73204 -1.71759 None 2016-11-24 12:08:10 GMT 2016-11-24 12:08:10 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWazNmMmdEb05oRHJWY3NQSEh5aGdlVgw/staticmap?t=PIamI1nMri8X7alsbJIJYBfyExs [] 27.30000 27.30000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 2,439.00000 Cleckheaton GB England BD19 3TY False 53.73204 -1.71758 None 2016-11-24 08:28:24 GMT 2016-11-24 08:28:24 None gal None EUR 2.39 regular 11.28354 11.28354 3 0 541 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWNFB6V1N5YVhKZFhuOGdlcDNtVVoyVww None 10.95850 10.95850 83 0 mph 2 1 0 0 Luddenden Foot GB England HX2 False 53.72362 -1.94588 None 2016-11-24 07:47:45 GMT 2016-11-24 07:47:45 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWNFB6V1N5YVhKZFhuOGdlcDNtVVoyVww/staticmap?t=FEn4Y1GG7dmU6Mad56rY5IU9Nho [] 52.80000 52.80000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
3 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 738.00000 Luddenden Foot GB England HX2 False 53.72361 -1.94621 None 2016-11-23 17:55:30 GMT 2016-11-23 17:55:30 None gal None EUR 2.39 regular 3.56389 3.56389 0 0 117 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWOEViRzVwb0FDc21vSzdabkpQTTZraAw None 3.46539 3.46539 26 0 mph 0 0 0 0 Halifax GB England HX1 3BA False 53.71449 -1.87807 None 2016-11-23 17:43:12 GMT 2016-11-23 17:43:12 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWOEViRzVwb0FDc21vSzdabkpQTTZraAw/staticmap?t=MWYD3chkvl1AqUxUwsrviEnxn0U [] 37.90000 37.90000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
4 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 1,808.00000 Halifax GB England HX1 3BA False 53.71443 -1.87809 None 2016-11-23 17:32:16 GMT 2016-11-23 17:32:16 None gal None EUR 2.39 regular 8.28467 8.28467 0 0 478 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZlNyRFpVR1ZBbXBmVXNlR3pqcVl3Sgw None 8.05608 8.05608 58 6 mph 0 1 0 0 Cleckheaton GB England BD19 3TY False 53.73196 -1.71741 None 2016-11-23 17:02:08 GMT 2016-11-23 17:02:08 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZlNyRFpVR1ZBbXBmVXNlR3pqcVl3Sgw/staticmap?t=aesK_aPLG_aBH_B-3RXHFUFiYnE [] 46.00000 46.00000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
5 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 241.00000 Cleckheaton GB England BD19 3TY False 53.73208 -1.71757 None 2016-11-23 13:50:00 GMT 2016-11-23 13:50:00 None gal None EUR 2.39 regular 0.91447 0.91447 0 0 62 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWRE14bjNIYUtIUlVrRXlTVHRjeW9mZww None 0.91714 0.91714 11 0 mph 0 0 0 0 Cleckheaton GB England BD19 3BE False 53.72501 -1.71539 None 2016-11-23 13:45:59 GMT 2016-11-23 13:45:59 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWRE14bjNIYUtIUlVrRXlTVHRjeW9mZww/staticmap?t=y0KEAxjFAOE5zvZr7dlueq_gCFo [] 36.00000 36.00000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
6 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 252.00000 Cleckheaton GB England BD19 3HU False 53.72507 -1.71529 None 2016-11-23 13:35:05 GMT 2016-11-23 13:35:05 None gal None EUR 2.39 regular 0.91864 0.91864 0 0 85 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWNGd1UmpBYmtEcGVVeFU1Q1pFUll4VQw None 0.90720 0.90720 10 0 mph 0 0 0 0 Cleckheaton GB England BD19 False 53.73222 -1.71735 None 2016-11-23 13:30:53 GMT 2016-11-23 13:30:53 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWNGd1UmpBYmtEcGVVeFU1Q1pFUll4VQw/staticmap?t=d7G_lFT1s4HDZDck3HKofRbsU3U [] 35.40000 35.40000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
7 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 2,366.00000 Cleckheaton GB England BD19 3TY False 53.73207 -1.71757 None 2016-11-23 08:27:53 GMT 2016-11-23 08:27:53 None gal None EUR 2.39 regular 11.33165 11.33165 0 1 381 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWR3BTemRLdXc3bjJKR1Z1ZHhYQ3pRQww None 10.95291 10.95291 80 0 mph 1 1 0 0 Luddenden Foot GB England HX2 False 53.72361 -1.94609 None 2016-11-23 07:48:27 GMT 2016-11-23 07:48:27 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWR3BTemRLdXc3bjJKR1Z1ZHhYQ3pRQww/staticmap?t=26FKe4-nnlFmtgSLe7BwwqbByWY [] 50.30000 50.30000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
8 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 526.00000 Luddenden Foot GB England HX2 6AH False 53.72363 -1.94622 None 2016-11-22 21:39:30 GMT 2016-11-22 21:39:30 None gal None EUR 2.39 regular 3.57647 3.57647 0 0 61 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWdGNtVmh4ejhSbmg5NGNScXY3WXR4eQw None 3.43805 3.43805 22 0 mph 2 1 0 0 Halifax GB England HX1 3BA False 53.71457 -1.87802 None 2016-11-22 21:30:44 GMT 2016-11-22 21:30:44 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWdGNtVmh4ejhSbmg5NGNScXY3WXR4eQw/staticmap?t=Dk0uV27dMBIVyQDR05r_bayFUFg [] 44.10000 44.10000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
9 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 532.00000 Halifax GB England HX1 3BA False 53.71457 -1.87801 None 2016-11-22 21:25:20 GMT 2016-11-22 21:25:20 None gal None EUR 2.39 regular 1.53763 1.53763 0 0 48 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWaVM3QkFEV1FYNGJINDI4d05lV1ptaQw None 1.55964 1.55964 20 0 mph 1 0 0 0 Halifax GB England HX1 4TA False 53.72466 -1.88786 None 2016-11-22 21:16:28 GMT 2016-11-22 21:16:28 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWaVM3QkFEV1FYNGJINDI4d05lV1ptaQw/staticmap?t=4GOnC6A4FNpOm4pd_UHYX3FMQDU [] 32.30000 32.30000 1 None agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN

Remove all-NA Fields

  • From the display it looks like several fields contain only 'None' values
  • Non-NA occurrences are computed individually for each selected field, confirming that they contain 0 non-null entries (a sketch of a programmatic check over all fields is given after this list)
  • The columns are removed
  • It is important to remove these fields, otherwise all the rows would be removed
    later on, when dropping rows with at least one NA field.
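
As a hedged alternative to picking candidate columns by eye, the non-null count of every field could be computed in a single Spark aggregation; a minimal sketch, assuming the same df (an array column such as 'tags' would still need F.explode, as done in the next cell):

    # count non-null entries for every column in one aggregation pass
    nonnull_counts = df.agg(*[F.count(F.col(c)).alias(c) for c in df.columns]).collect()[0].asDict()

    # columns whose non-null count is zero are entirely NA and are candidates for dropping
    all_na_cols = [c for c, n in nonnull_counts.items() if n == 0]
    print (all_na_cols)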

In [18]:
# Selected columns to inspect and possibly drop
cols = ['user','start_point_place','end_point_place','mpg_combined','fuel_consumed','fuel_cost']

# count number of NA 
for col in cols:
  print ( "field: \'{}\', Number of non NULL entries: {} "
         .format( col, df.select(col).dropna().count() ) 
        )

# the 'tags' field contains an array, let's expand it before counting missing values
print ( "field: \'tags\',  Number of non NULL entries: {}"
       .format( df.select(F.explode('tags')).dropna().count() ) 
      )

# append to the list of columns 
cols.append('tags')

# apply drop function iteratively on selected columns
from functools import reduce
from pyspark.sql import DataFrame

df = reduce(DataFrame.drop, cols, df)

print ("\n Columns have been dropped ")


field: 'user', Number of non NULL entries: 0 
field: 'start_point_place', Number of non NULL entries: 0 
field: 'end_point_place', Number of non NULL entries: 0 
field: 'mpg_combined', Number of non NULL entries: 0 
field: 'fuel_consumed', Number of non NULL entries: 0 
field: 'fuel_cost', Number of non NULL entries: 0 
field: 'tags',  Number of non NULL entries: 0

 Columns have been dropped 

Remove Duplicated Rows

  • More than 80% of the rows in the dataset are duplicates
  • This reduces the effective DF entries from ~30k to ~5k
  • A quick check is done to show that this is indeed the case (see also the Spark-side sketch below)
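
For a dataset too large to collect on the driver, an equivalent check could be done entirely on the Spark side by counting rows that occur more than once; a minimal sketch, assuming the same df:

    # count rows belonging to groups of identical rows (every copy is counted),
    # without collecting the data locally
    dup_groups = df.groupBy(df.columns).count().filter(F.col('count') > 1)
    print ( dup_groups.agg(F.sum('count')).collect()[0][0] )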

In [19]:
# transform the Spark DF into a Pandas DF
# (for a large distributed dataset it would be necessary to sample only a fraction
# of it beforehand)
# Keep only duplicated rows
# Select a single vehicle_nickname, order by date
# Show table
pdf = df.toPandas()

pdf[ pdf.duplicated() & (pdf['vehicle_nickname']==pdf['vehicle_nickname'].values[0]) ]\
   .set_index('start_point_timestamp')\
   .sort_index()\
   .head(10)


Out[19]:
device_key distance_um duration_seconds end_point_address_city end_point_address_country end_point_address_state end_point_address_zipcode end_point_daylight_saving_time_flag end_point_latitude end_point_longitude end_point_timestamp end_point_timestamp_tz end_point_timestamp_utc fuel_consumed_um fuel_cost_currency_code fuel_cost_currency_symbol fuel_ppg fuel_type gps_distance gps_miles hard_accel_count hard_brake_count idle_seconds key obd_distance obd_miles points_city_count points_hwy_count speed_um speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count start_point_address_city start_point_address_country start_point_address_state start_point_address_zipcode start_point_daylight_saving_time_flag start_point_latitude start_point_longitude start_point_timestamp_tz start_point_timestamp_utc static_map_url top_speed top_speed_mph trip_segments vehicle_key vehicle_nickname
start_point_timestamp
2016-09-15 17:42:06 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 774.00000 Halifax GB England HX1 True 53.72095 -1.85880 2016-09-15 17:55:00 BST 2016-09-15 16:55:00 gal USD $ 3.54 regular 2.60910 2.60910 0 0 101.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw 3.72326 3.72326 31.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 None None None None False 0.00000 0.00000 None 2016-09-15 16:42:06 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw/staticmap?t=0g1-hUz0lbYGuN5je_w-C6Ksmwo 41.60000 41.60000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-15 17:42:06 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 774.00000 Halifax GB England HX1 True 53.72095 -1.85880 2016-09-15 17:55:00 BST 2016-09-15 16:55:00 gal USD $ 3.54 regular 2.60910 2.60910 0 0 101.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw 3.72326 3.72326 31.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 None None None None False 0.00000 0.00000 None 2016-09-15 16:42:06 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw/staticmap?t=0g1-hUz0lbYGuN5je_w-C6Ksmwo 41.60000 41.60000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-15 17:42:06 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 774.00000 Halifax GB England HX1 True 53.72095 -1.85880 2016-09-15 17:55:00 BST 2016-09-15 16:55:00 gal USD $ 3.54 regular 2.60910 2.60910 0 0 101.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw 3.72326 3.72326 31.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 None None None None False 0.00000 0.00000 None 2016-09-15 16:42:06 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw/staticmap?t=0g1-hUz0lbYGuN5je_w-C6Ksmwo 41.60000 41.60000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-15 17:42:06 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 774.00000 Halifax GB England HX1 True 53.72095 -1.85880 2016-09-15 17:55:00 BST 2016-09-15 16:55:00 gal USD $ 3.54 regular 2.60910 2.60910 0 0 101.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw 3.72326 3.72326 31.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 None None None None False 0.00000 0.00000 None 2016-09-15 16:42:06 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWZjV0R3l1TmNwbkMzcnZqQXBUYXQ2Vgw/staticmap?t=0g1-hUz0lbYGuN5je_w-C6Ksmwo 41.60000 41.60000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-15 18:53:20 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 919.00000 Luddenden Foot GB England HX2 True 53.72364 -1.94611 2016-09-15 19:08:39 BST 2016-09-15 18:08:39 gal EUR 2.39 regular 4.54532 4.54532 0 0 73.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww 4.44156 4.44156 37.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 Halifax GB England HX1 True 53.72102 -1.85883 BST 2016-09-15 17:53:20 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww/staticmap?t=LJwmxTea_AVqafy52J-KTAlcJcg 41.00000 41.00000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-15 18:53:20 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 919.00000 Luddenden Foot GB England HX2 True 53.72364 -1.94611 2016-09-15 19:08:39 BST 2016-09-15 18:08:39 gal EUR 2.39 regular 4.54532 4.54532 0 0 73.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww 4.44156 4.44156 37.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 Halifax GB England HX1 True 53.72102 -1.85883 BST 2016-09-15 17:53:20 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww/staticmap?t=LJwmxTea_AVqafy52J-KTAlcJcg 41.00000 41.00000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-15 18:53:20 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 919.00000 Luddenden Foot GB England HX2 True 53.72364 -1.94611 2016-09-15 19:08:39 BST 2016-09-15 18:08:39 gal EUR 2.39 regular 4.54532 4.54532 0 0 73.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww 4.44156 4.44156 37.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 Halifax GB England HX1 True 53.72102 -1.85883 BST 2016-09-15 17:53:20 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww/staticmap?t=LJwmxTea_AVqafy52J-KTAlcJcg 41.00000 41.00000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-15 18:53:20 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 919.00000 Luddenden Foot GB England HX2 True 53.72364 -1.94611 2016-09-15 19:08:39 BST 2016-09-15 18:08:39 gal EUR 2.39 regular 4.54532 4.54532 0 0 73.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww 4.44156 4.44156 37.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 Halifax GB England HX1 True 53.72102 -1.85883 BST 2016-09-15 17:53:20 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUEVaYVR5cktjQnVxSnpYZkdmS2RQdww/staticmap?t=LJwmxTea_AVqafy52J-KTAlcJcg 41.00000 41.00000 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-16 10:53:46 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 144.00000 Luddenden Foot GB England HX2 6NS True 53.72584 -1.94266 2016-09-16 10:56:10 BST 2016-09-16 09:56:10 gal EUR 2.39 regular 0.42214 0.42214 0 0 36.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWWWtINUhvU2poZHR1WFdWREdCTHhMYww 0.46603 0.46603 7.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 Luddenden Foot GB England HX2 True 53.72353 -1.94616 BST 2016-09-16 09:53:46 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWWWtINUhvU2poZHR1WFdWREdCTHhMYww/staticmap?t=9i1qsNObX05Bd4LQp3jwR-DpTow 30.44719 30.44719 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2016-09-16 10:53:46 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 144.00000 Luddenden Foot GB England HX2 6NS True 53.72584 -1.94266 2016-09-16 10:56:10 BST 2016-09-16 09:56:10 gal EUR 2.39 regular 0.42214 0.42214 0 0 36.00000 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWWWtINUhvU2poZHR1WFdWREdCTHhMYww 0.46603 0.46603 7.00000 0.00000 mph 0.00000 0.00000 0.00000 0.00000 Luddenden Foot GB England HX2 True 53.72353 -1.94616 BST 2016-09-16 09:53:46 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWWWtINUhvU2poZHR1WFdWREdCTHhMYww/staticmap?t=9i1qsNObX05Bd4LQp3jwR-DpTow 30.44719 30.44719 1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
  • Get a new DF with unique rows

In [20]:
df = df.dropDuplicates()

print( "Number of unique rows/total  {}/{}".format( df.count(), numOfRows ) )

numOfRows = df.count()


Number of unique rows/total  5295/28952

Drop rows with NA fields

  • Dropping all the rows with at least one NA field preserves about 75% of the data
  • This is an acceptable loss for now; more rows could be retained by excluding from the selection the fields we are not interested in.
    To prepare for this, NA counts are computed for each field beforehand

In [21]:
# get single field NA counts
for c in df.columns:
  n = df.where( F.col(c).isNull() ).count()
  print ("\'{}\' -- null counts/total: {}/{}".format(c, n, numOfRows))


'device_key' -- null counts/total: 0/5295
'distance_um' -- null counts/total: 0/5295
'duration_seconds' -- null counts/total: 57/5295
'end_point_address_city' -- null counts/total: 483/5295
'end_point_address_country' -- null counts/total: 108/5295
'end_point_address_state' -- null counts/total: 145/5295
'end_point_address_zipcode' -- null counts/total: 131/5295
'end_point_daylight_saving_time_flag' -- null counts/total: 57/5295
'end_point_latitude' -- null counts/total: 57/5295
'end_point_longitude' -- null counts/total: 57/5295
'end_point_timestamp' -- null counts/total: 57/5295
'end_point_timestamp_tz' -- null counts/total: 153/5295
'end_point_timestamp_utc' -- null counts/total: 57/5295
'fuel_consumed_um' -- null counts/total: 0/5295
'fuel_cost_currency_code' -- null counts/total: 0/5295
'fuel_cost_currency_symbol' -- null counts/total: 0/5295
'fuel_ppg' -- null counts/total: 57/5295
'fuel_type' -- null counts/total: 57/5295
'gps_distance' -- null counts/total: 511/5295
'gps_miles' -- null counts/total: 511/5295
'hard_accel_count' -- null counts/total: 0/5295
'hard_brake_count' -- null counts/total: 0/5295
'idle_seconds' -- null counts/total: 85/5295
'key' -- null counts/total: 0/5295
'obd_distance' -- null counts/total: 59/5295
'obd_miles' -- null counts/total: 59/5295
'points_city_count' -- null counts/total: 330/5295
'points_hwy_count' -- null counts/total: 330/5295
'speed_um' -- null counts/total: 0/5295
'speeding_city_major_count' -- null counts/total: 330/5295
'speeding_city_minor_count' -- null counts/total: 330/5295
'speeding_hwy_major_count' -- null counts/total: 330/5295
'speeding_hwy_minor_count' -- null counts/total: 330/5295
'start_point_address_city' -- null counts/total: 422/5295
'start_point_address_country' -- null counts/total: 60/5295
'start_point_address_state' -- null counts/total: 73/5295
'start_point_address_zipcode' -- null counts/total: 82/5295
'start_point_daylight_saving_time_flag' -- null counts/total: 0/5295
'start_point_latitude' -- null counts/total: 0/5295
'start_point_longitude' -- null counts/total: 0/5295
'start_point_timestamp' -- null counts/total: 0/5295
'start_point_timestamp_tz' -- null counts/total: 94/5295
'start_point_timestamp_utc' -- null counts/total: 0/5295
'static_map_url' -- null counts/total: 0/5295
'top_speed' -- null counts/total: 3/5295
'top_speed_mph' -- null counts/total: 3/5295
'trip_segments' -- null counts/total: 0/5295
'vehicle_key' -- null counts/total: 0/5295
'vehicle_nickname' -- null counts/total: 0/5295

In [22]:
# drop all rows with one or more NA field 
df = df.dropna(how='any')

print( "Number of non-NA rows/total:  {}/{}".format( df.count(), numOfRows ) )

numOfRows = df.count()


Number of non-NA rows/total:  3903/5295

Check URL Field

  • This variable may contain useful information; however, direct inspection of a few links reveals that they are not available
  • Second check: a random 10% sample is extracted to perform a URL request.
  • An error code (4xx) is returned.
  • The associated column is dropped

In [23]:
# extract a few rows to show URL
df.select('static_map_url').limit(5).toPandas() #show(5,truncate=False)

import requests

# sample 10% of the rows and get the request status code (retain only 5 for printing)
for url in [ c['static_map_url'] for c in df.sample(False,0.1,seed=0).limit(5).rdd.collect() ]:
  print (url)
  code = requests.get(url).status_code 
  print (' status code {}\n'.format( code ))

# drop column
df = df.drop('static_map_url')


Out[23]:
static_map_url
0 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUmVUaEpUNlFaRmJhQ1IyREtTTmFSNww/staticmap?t=QTd3VNpBsUxZ7dW2GPUkOycYN48
1 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWOEViRzVwb0FDc21vSzdabkpQTTZraAw/staticmap?t=MWYD3chkvl1AqUxUwsrviEnxn0U
2 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWRE14bjNIYUtIUlVrRXlTVHRjeW9mZww/staticmap?t=y0KEAxjFAOE5zvZr7dlueq_gCFo
3 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWdGNtVmh4ejhSbmg5NGNScXY3WXR4eQw/staticmap?t=Dk0uV27dMBIVyQDR05r_bayFUFg
4 https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWaVM3QkFEV1FYNGJINDI4d05lV1ptaQw/staticmap?t=4GOnC6A4FNpOm4pd_UHYX3FMQDU
https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFmhHUXpCR1pNNGlMTmd1c1hSaDQ3OVYMCxIEVHJpcCIWUGdrU1c0VTk4elhIa3loNm9nYnJNQQw/staticmap?t=zbcV_wL30OM09PX2kaVtAX-YgJE
 status code 401

https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFmhHUXpCR1pNNGlMTmd1c1hSaDQ3OVYMCxIEVHJpcCIWeXZaa2k4YlR3WFN1SFhWaEx6NlV0Sgw/staticmap?t=Ecj7eDIdrNi6-JxCivDWz1YpYE8
 status code 401

https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFmhHUXpCR1pNNGlMTmd1c1hSaDQ3OVYMCxIEVHJpcCIWVFpUb05GQnNOMjJoNVRGa0FBZW82cAw/staticmap?t=ULs57abB8w_vpHmAOx1-UpHz6Zw
 status code 401

https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFlRROHlDUzlQNzhNNGlENVhwMmgzU0IMCxIEVHJpcCIWUGVyV0tzUXZ6bjNzNDhGV0czYzZCTAw/staticmap?t=eMr5-d6LrEnh8un0eagf7Xz5mGY
 status code 401

https://zubie-eu.appspot.com/api/v1/trips/agplfnp1YmllLWV1ckILEgZEZXZpY2UiFlRROHlDUzlQNzhNNGlENVhwMmgzU0IMCxIEVHJpcCIWZTJIdm90aEgyVld2eVhuUHV0cWVxRQw/staticmap?t=S-Xbx0NfpKXPuWhxUAse_Hdmdm4
 status code 401

Check Unit of Measurement

  • Check that the same unit of measurement is used for all entries of a given field
    If different units were used, values would be converted to a common one (a hedged sketch of such a conversion is given after this list)
    In the present sample the same units are used for distance, speed and fuel consumption
  • Currency can be EUR or USD and the fuel price per gallon is constant
    We already know that the fuel_consumed field is empty, so this information can be dropped
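
Had mixed units been found, they could be normalised with a conditional column expression; a purely hypothetical sketch, assuming distance_um contained both 'mi' and 'km' (which is not the case here):

    # hypothetical normalisation: convert kilometre rows to miles and set a single unit
    KM_TO_MI = 0.621371
    df_norm = df.withColumn('gps_distance',
                            F.when(F.col('distance_um') == 'km', F.col('gps_distance') * KM_TO_MI)
                             .otherwise(F.col('gps_distance'))) \
                .withColumn('distance_um', F.lit('mi'))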

In [24]:
# list how many distinct units exist for distance, speed and fuel_consumed
df.select('distance_um').distinct().show()

df.select('speed_um').distinct().show()

df.select('fuel_consumed_um').distinct().show()

# list how many different combinations exist for currency, symbol and price per gallon of fuel
df.select('fuel_cost_currency_code','fuel_cost_currency_symbol','fuel_ppg').distinct().show()

# drop fuel-related fields since no observations are available (e.g. fuel_consumed is empty)
df = df.drop('fuel_cost_currency_code','fuel_cost_currency_symbol','fuel_ppg','fuel_consumed_um')


+-----------+
|distance_um|
+-----------+
|         mi|
+-----------+

+--------+
|speed_um|
+--------+
|     mph|
+--------+

+----------------+
|fuel_consumed_um|
+----------------+
|             gal|
+----------------+

+-----------------------+-------------------------+--------+
|fuel_cost_currency_code|fuel_cost_currency_symbol|fuel_ppg|
+-----------------------+-------------------------+--------+
|                    EUR|                        €|    2.39|
|                    USD|                        $|    3.54|
+-----------------------+-------------------------+--------+

Redundant Fields

  • From the data display it looks like some field columns are duplicated
  • To check that all the values are indeed the same, it is in principle possible to compare column values row by row, but that would not be efficient on a large distributed dataset (an exact-check sketch is given after this list). For the purpose at hand it is enough to look at summary statistics to decide whether two columns are identical
  • Redundant fields are removed
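
For reference, a minimal sketch of the exact row-by-row check on the Spark side, assuming the same df (NA rows were already dropped above, so a plain inequality is enough):

    # number of rows where the two candidate columns differ (0 means the columns are identical)
    print ( df.filter(F.col('top_speed') != F.col('top_speed_mph')).count() )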

In [25]:
# get the statistical description of the candidate duplicate columns
df.select('top_speed','top_speed_mph').describe().show()

df.select('obd_distance','obd_miles').describe().show()

df.select('gps_distance','gps_miles').describe().show()

# remove fields whose names specify the unit, since the unit is stored in a separate field
# and unit uniformity has already been ensured
df = df.drop('top_speed_mph','obd_miles','gps_miles')


+-------+------------------+------------------+
|summary|         top_speed|     top_speed_mph|
+-------+------------------+------------------+
|  count|              3903|              3903|
|   mean| 45.53637113548452| 45.53637113548452|
| stddev|20.040925556040133|20.040925556040133|
|    min|               0.0|               0.0|
|    max|127.38109440865345|127.38109440865345|
+-------+------------------+------------------+

+-------+------------------+------------------+
|summary|      obd_distance|         obd_miles|
+-------+------------------+------------------+
|  count|              3903|              3903|
|   mean| 8.095386916834068| 8.095386916834068|
| stddev| 16.38601543277141| 16.38601543277141|
|    min|               0.0|               0.0|
|    max|232.86755348763222|232.86755348763222|
+-------+------------------+------------------+

+-------+--------------------+--------------------+
|summary|        gps_distance|           gps_miles|
+-------+--------------------+--------------------+
|  count|                3903|                3903|
|   mean|   7.257100007399001|   7.257100007399001|
| stddev|  15.855331245522395|  15.855331245522395|
|    min|8.006704572768198E-5|8.006704572768198E-5|
|    max|  237.73788151299107|  237.73788151299107|
+-------+--------------------+--------------------+

Constant Fields

  • Check for constant value fields and remove them if necessary

In [26]:
# list distinct values to confirm the fields are constant
df.select('trip_segments').distinct().show() 
df.select('fuel_type').distinct().show() 

# drop the constant fields
df = df.drop('trip_segments','fuel_type')


+-------------+
|trip_segments|
+-------------+
|            1|
+-------------+

+---------+
|fuel_type|
+---------+
|  regular|
+---------+

Inspection of Country and State

  • Get the list of different countries present in the whole dataset
     - Drop constant-value fields so as to reduce the data to be
       subsequently transformed into a local Data Frame (and improve data display readability),
       although this is not critical with the present data size

In [27]:
# list of different countries
df.select('start_point_address_country').distinct().show()
df.select('end_point_address_country').distinct().show()

# drop since they are constant fields
df = df.drop('start_point_address_country','end_point_address_country')


+---------------------------+
|start_point_address_country|
+---------------------------+
|                         GB|
+---------------------------+

+-------------------------+
|end_point_address_country|
+-------------------------+
|                       GB|
+-------------------------+

  • Get the list of different states
      - There is a "suspicious" name: 'Cymru'
      - Cymru is the Welsh name of Wales 

In [28]:
# list of different states 
df.select('start_point_address_state').distinct().show() 
df.select('end_point_address_state').distinct().show()

# list of start-end point state combinations
df.select('start_point_address_state','end_point_address_state').distinct().show()


+-------------------------+
|start_point_address_state|
+-------------------------+
|                  England|
|                 Scotland|
|                    Wales|
|                    Cymru|
|         Northern Ireland|
+-------------------------+

+-----------------------+
|end_point_address_state|
+-----------------------+
|                England|
|               Scotland|
|                  Wales|
|       Northern Ireland|
+-----------------------+

+-------------------------+-----------------------+
|start_point_address_state|end_point_address_state|
+-------------------------+-----------------------+
|                  England|                England|
|                 Scotland|               Scotland|
|                  England|                  Wales|
|         Northern Ireland|       Northern Ireland|
|                    Wales|                England|
|                    Wales|                  Wales|
|                    Cymru|                  Wales|
+-------------------------+-----------------------+

- Inspect the rows with 'Cymru' (only one, actually) and then change the name to Wales
  to get consistent naming

In [29]:
print (" Num of records with with state adress = Cymru: {} "
       .format( df.filter(df['start_point_address_state']=='Cymru').count() )
      )

# look at the data
df.filter(df['start_point_address_state']=='Cymru')\
    .toPandas().head()

# convert name
df = df.replace('Cymru','Wales')


 Num of records with state address = Cymru: 1 
Out[29]:
device_key distance_um duration_seconds end_point_address_city end_point_address_state end_point_address_zipcode end_point_daylight_saving_time_flag end_point_latitude end_point_longitude end_point_timestamp end_point_timestamp_tz end_point_timestamp_utc gps_distance hard_accel_count hard_brake_count idle_seconds key obd_distance points_city_count points_hwy_count speed_um speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count start_point_address_city start_point_address_state start_point_address_zipcode start_point_daylight_saving_time_flag start_point_latitude start_point_longitude start_point_timestamp start_point_timestamp_tz start_point_timestamp_utc top_speed vehicle_key vehicle_nickname
0 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk5MTjJrdFZrQUNpZEJLZ3FHbWFFaGUM mi 1,661.00000 Bridgend Wales CF32 9SU False 51.53157 -3.57945 2016-11-21 12:45:18 GMT 2016-11-21 12:45:18 21.13067 0 0 51 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk5MTjJrdFZrQUNpZEJLZ3FHbWFFaGUMCxIEVHJpcCIWQ2dxNWZ2WjNLR1hTYWoyU282Y2tDegw 21.09431 54 1 mph 16 5 0 0 Caerphilly Cymru CF83 False 51.56780 -3.23142 2016-11-21 12:17:37 GMT 2016-11-21 12:17:37 72.70000 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZKblAyYmE0cTM5OVAzSGNhbTRiUTdCDAsSA0NhciILNW9xNHV5UHJIUHgM KS12VRN

Time Zone Tags

  • Check if some trips start and end in different time zones
    If not, keep only one of the two and rename the field to a shorter name
  • Check whether the 'daylight saving time flag' is assigned consistently. Remove the field since it is not needed

In [30]:
# get occurrences of start/end point time zone combinations
# only BST-BST and GMT-GMT occur. Keep only one of the two
df.groupBy('start_point_timestamp_tz').pivot('end_point_timestamp_tz').count().show()

# get occurrences of timezone/daylight_saving_time_flag combinations
# since the pairing is consistent, no fix is needed and the field can be removed
df.groupBy('start_point_timestamp_tz').pivot('start_point_daylight_saving_time_flag').count().show()
df.groupBy('end_point_timestamp_tz').pivot('end_point_daylight_saving_time_flag').count().show()


df =  df.drop('start_point_daylight_saving_time_flag','end_point_daylight_saving_time_flag')\
        .drop('end_point_timestamp_tz')

# rename to a field name whose length is more similar to the (3-character) field values
df = df.withColumnRenamed('start_point_timestamp_tz','trip_tz')


+------------------------+----+----+
|start_point_timestamp_tz| BST| GMT|
+------------------------+----+----+
|                     BST| 872|null|
|                     GMT|null|3031|
+------------------------+----+----+

+------------------------+-----+----+
|start_point_timestamp_tz|false|true|
+------------------------+-----+----+
|                     BST| null| 872|
|                     GMT| 3031|null|
+------------------------+-----+----+

+----------------------+-----+----+
|end_point_timestamp_tz|false|true|
+----------------------+-----+----+
|                   BST| null| 872|
|                   GMT| 3031|null|
+----------------------+-----+----+

Time stamps

  • The data schema shows that the timestamp field values are stored as strings
  • Proceed to convert them to TimestampType, specifying the format (this also ensures that the time format is uniform across the dataset)

In [31]:
# define a function for casting a list of columns to pyspark.sql.types.TimestampType
def colsToTimestamp(df,colnames,format):
  for name in colnames:
    df = df.withColumn(name, F.unix_timestamp(df[name],format).cast(TimestampType()))
  return df

timestamp_format = 'yyyy-MM-dd HH:mm:ss'

#fields to be converted
timestamp_cols   = ['start_point_timestamp','start_point_timestamp_utc','end_point_timestamp','end_point_timestamp_utc']

df = colsToTimestamp(df,timestamp_cols,timestamp_format) 

# print checks
print (" check types schema: ")
print ( [(c.name,c.dataType) for c in df.schema if c.name in timestamp_cols] )

# display some rows
df.limit(5).toPandas().head()


 check types schema: 
[('end_point_timestamp', TimestampType), ('end_point_timestamp_utc', TimestampType), ('start_point_timestamp', TimestampType), ('start_point_timestamp_utc', TimestampType)]
Out[31]:
device_key distance_um duration_seconds end_point_address_city end_point_address_state end_point_address_zipcode end_point_latitude end_point_longitude end_point_timestamp end_point_timestamp_utc gps_distance hard_accel_count hard_brake_count idle_seconds key obd_distance points_city_count points_hwy_count speed_um speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count start_point_address_city start_point_address_state start_point_address_zipcode start_point_latitude start_point_longitude start_point_timestamp trip_tz start_point_timestamp_utc top_speed vehicle_key vehicle_nickname
0 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 425.00000 Cleckheaton England BD19 3TY 53.73208 -1.71761 2016-11-24 12:33:01 2016-11-24 12:33:01 0.74782 0 0 238 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWUmVUaEpUNlFaRmJhQ1IyREtTTmFSNww 0.78106 12 0 mph 0 0 0 0 Cleckheaton England BD19 3BE 53.72496 -1.71555 2016-11-24 12:25:56 GMT 2016-11-24 12:25:56 32.30000 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
1 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 738.00000 Luddenden Foot England HX2 53.72361 -1.94621 2016-11-23 17:55:30 2016-11-23 17:55:30 3.56389 0 0 117 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWOEViRzVwb0FDc21vSzdabkpQTTZraAw 3.46539 26 0 mph 0 0 0 0 Halifax England HX1 3BA 53.71449 -1.87807 2016-11-23 17:43:12 GMT 2016-11-23 17:43:12 37.90000 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
2 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 241.00000 Cleckheaton England BD19 3TY 53.73208 -1.71757 2016-11-23 13:50:00 2016-11-23 13:50:00 0.91447 0 0 62 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWRE14bjNIYUtIUlVrRXlTVHRjeW9mZww 0.91714 11 0 mph 0 0 0 0 Cleckheaton England BD19 3BE 53.72501 -1.71539 2016-11-23 13:45:59 GMT 2016-11-23 13:45:59 36.00000 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
3 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 526.00000 Luddenden Foot England HX2 6AH 53.72363 -1.94622 2016-11-22 21:39:30 2016-11-22 21:39:30 3.57647 0 0 61 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWdGNtVmh4ejhSbmg5NGNScXY3WXR4eQw 3.43805 22 0 mph 2 1 0 0 Halifax England HX1 3BA 53.71457 -1.87802 2016-11-22 21:30:44 GMT 2016-11-22 21:30:44 44.10000 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
4 agplfnp1YmllLWV1ciILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MM mi 532.00000 Halifax England HX1 3BA 53.71457 -1.87801 2016-11-22 21:25:20 2016-11-22 21:25:20 1.53763 0 0 48 agplfnp1YmllLWV1ckILEgZEZXZpY2UiFk1aVEdCQ3NMZ1BjRzJOU0JMN1FWV1MMCxIEVHJpcCIWaVM3QkFEV1FYNGJINDI4d05lV1ptaQw 1.55964 20 0 mph 1 0 0 0 Halifax England HX1 4TA 53.72466 -1.88786 2016-11-22 21:16:28 GMT 2016-11-22 21:16:28 32.30000 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN
  • Verify that the change of time occurs on the right day (Oct 30th in 2016 for the UK)

In [32]:
# keep only October
# group to get one entry per day 
# order

col = 'start_point_timestamp'

df.select(col,'trip_tz' )\
        .filter(F.month(col)==10)\
        .groupBy(F.year(col),F.month(col),F.dayofmonth(col),'trip_tz')\
        .count()\
        .orderBy(F.year(col),F.month(col),F.dayofmonth(col))\
        .toPandas().head(31)


Out[32]:
year(start_point_timestamp) month(start_point_timestamp) dayofmonth(start_point_timestamp) trip_tz count
0 2016 10 1 BST 12
1 2016 10 2 BST 3
2 2016 10 3 BST 5
3 2016 10 4 BST 2
4 2016 10 5 BST 2
5 2016 10 6 BST 10
6 2016 10 8 BST 1
7 2016 10 11 BST 5
8 2016 10 13 BST 4
9 2016 10 15 BST 11
10 2016 10 16 BST 13
11 2016 10 17 BST 14
12 2016 10 18 BST 13
13 2016 10 19 BST 18
14 2016 10 20 BST 5
15 2016 10 21 BST 16
16 2016 10 22 BST 46
17 2016 10 23 BST 32
18 2016 10 24 BST 21
19 2016 10 25 BST 15
20 2016 10 26 BST 28
21 2016 10 27 BST 48
22 2016 10 28 BST 60
23 2016 10 29 BST 86
24 2016 10 30 GMT 63
25 2016 10 31 GMT 81

Cross check calculation of trip duration

  • Verify the correctness of the trip duration calculation as the difference between
    start point and end point timestamps (for both UTC and local time zones)
  • Drop the UTC timestamps since they are not needed at the moment

In [33]:
# select timestamp-type columns
timing_cols = [n for n in df.columns if 'timestamp' in n]

# add necessary info for the calculations
timing_cols.append('trip_tz')
timing_cols.append('duration_seconds')

# re-compute trip duration as timestamps difference for a few entries
# and show results
df.select( [c for c in timing_cols] ) \
  .withColumn('mydt',F.unix_timestamp(df['end_point_timestamp_utc'])-F.unix_timestamp(df['start_point_timestamp_utc'])) \
  .limit(5).toPandas().head() # or .show(5,truncate=False)
  

# re-compute for the whole dataset to check aggregated stats
# as reasonable proof that the two calculations always return the same result
df.select( [c for c in timing_cols] ) \
  .withColumn('mydt',F.unix_timestamp(df['end_point_timestamp_utc'])-F.unix_timestamp(df['start_point_timestamp_utc'])) \
  .describe('duration_seconds','mydt') \
  .show() 

# drop fields that are not necessary for now
df = df.drop('start_point_timestamp_utc','end_point_timestamp_utc')


Out[33]:
end_point_timestamp end_point_timestamp_utc start_point_timestamp start_point_timestamp_utc trip_tz duration_seconds mydt
0 2016-11-24 12:33:01 2016-11-24 12:33:01 2016-11-24 12:25:56 2016-11-24 12:25:56 GMT 425.00000 425
1 2016-11-23 17:55:30 2016-11-23 17:55:30 2016-11-23 17:43:12 2016-11-23 17:43:12 GMT 738.00000 738
2 2016-11-23 13:50:00 2016-11-23 13:50:00 2016-11-23 13:45:59 2016-11-23 13:45:59 GMT 241.00000 241
3 2016-11-22 21:39:30 2016-11-22 21:39:30 2016-11-22 21:30:44 2016-11-22 21:30:44 GMT 526.00000 526
4 2016-11-22 21:25:20 2016-11-22 21:25:20 2016-11-22 21:16:28 2016-11-22 21:16:28 GMT 532.00000 532
+-------+------------------+------------------+
|summary|  duration_seconds|              mydt|
+-------+------------------+------------------+
|  count|              3903|              3903|
|   mean|1164.3219336407892| 1164.322316167051|
| stddev|1374.7726984147882|1374.7725450494697|
|    min|              18.0|                18|
|    max|           17805.0|             17805|
+-------+------------------+------------------+

Create useful list of columns

  • do some name shortening
  • group columns by subject

In [34]:
#
df = df.withColumnRenamed('start_point_latitude','start_point_lat') \
  .withColumnRenamed('start_point_longitude','start_point_lon') \
  .withColumnRenamed('end_point_latitude','end_point_lat') \
  .withColumnRenamed('end_point_longitude','end_point_lon')

In [35]:
#
latlong_cols = ['start_point_lat','start_point_lon','end_point_lat','end_point_lon']

timestamp_cols = [n for n in df.columns if 'timestamp' in n]

distance_cols = ['gps_distance','obd_distance']

duration_cols = ['duration_seconds','idle_seconds']

Field consistency: duration time and idle time

  • Ensure that the idle time is always less than the trip duration and that there are no negative values for the idle time
    (it was already checked above that the duration is calculated correctly).
    Remove inconsistent rows (actually a small fraction of the data)

In [36]:
# useful columns for the study
cols = timestamp_cols + duration_cols + latlong_cols + ['gps_distance']

# 
print (" number of rows where idle time  >= trip duration: {}"\
 .format( df.where(df['idle_seconds'] >= df['duration_seconds']).count() )
      )
# 
print (" number of rows where idle time  < 0: {}" \
 .format( df.where(df['idle_seconds'] < 0).count() )
      )

# display data for direct inspection
print(' \n Entries with idle > duration')
df.select(cols)\
    .where( df['idle_seconds'] >= df['duration_seconds'] )\
    .toPandas() 
    
print(' Entries with idle < 0 ')
df.select(cols)\
    .where( df['idle_seconds'] < 0 )\
    .toPandas() #.show(truncate=False)
  
# remove inconsistent entries
df = df.where( (df['duration_seconds'] > df['idle_seconds']) & (df['idle_seconds'] >= 0) )


 number of rows where idle time  >= trip duration: 23
 number of rows where idle time  < 0: 1
 
 Entries with idle > duration
Out[36]:
end_point_timestamp start_point_timestamp duration_seconds idle_seconds start_point_lat start_point_lon end_point_lat end_point_lon gps_distance
0 2016-11-21 17:19:44 2016-11-21 17:19:09 35.00000 35 51.63796 -0.46845 51.63795 -0.46844 0.00052
1 2016-11-23 10:00:57 2016-11-23 10:00:28 29.00000 29 53.30265 -3.54296 53.30273 -3.54290 0.00072
2 2016-10-31 14:22:57 2016-10-31 14:14:34 503.00000 503 51.27742 -0.20994 51.27740 -0.20987 0.00277
3 2016-10-29 11:26:59 2016-10-29 11:26:34 25.00000 25 51.13667 -0.03229 51.13667 -0.03227 0.00045
4 2016-09-18 18:53:03 2016-09-18 18:51:32 91.00000 91 53.03751 -2.16971 53.03758 -2.16996 0.01067
5 2016-11-10 10:42:34 2016-11-10 10:42:12 22.00000 22 50.81133 -0.77265 50.81134 -0.77276 0.00469
6 2016-09-16 12:38:17 2016-09-16 12:34:30 227.00000 227 55.95875 -3.12300 55.95874 -3.12300 0.00057
7 2016-11-15 19:40:34 2016-11-15 19:39:51 43.00000 43 51.53670 -1.02413 51.53679 -1.02414 0.00574
8 2016-11-27 21:02:36 2016-11-27 21:01:05 91.00000 91 51.18283 -0.02588 51.18279 -0.02585 0.00190
9 2016-09-29 17:43:42 2016-09-29 17:41:21 141.00000 141 51.14703 -0.97698 51.14702 -0.97695 0.00143
10 2016-10-29 12:14:50 2016-10-29 12:00:22 868.00000 868 55.95857 -3.12247 55.95853 -3.12248 0.00313
11 2016-11-24 17:42:54 2016-11-24 17:42:11 43.00000 43 51.43344 -1.07573 51.43344 -1.07573 0.00041
12 2016-10-29 15:44:01 2016-10-29 15:43:36 25.00000 25 52.77901 -0.17527 52.77896 -0.17518 0.00194
13 2016-09-18 09:17:55 2016-09-18 09:16:39 76.00000 76 53.03760 -2.16988 53.03753 -2.16974 0.00827
14 2016-11-13 10:16:33 2016-11-13 10:15:56 37.00000 37 53.85990 -1.51111 53.85982 -1.51104 0.00751
15 2016-11-10 20:54:07 2016-11-10 20:53:30 37.00000 37 53.85981 -1.51114 53.85979 -1.51106 0.00353
16 2016-11-03 11:40:35 2016-11-03 11:36:04 271.00000 531488141 53.78643 -1.54367 53.78645 -1.54371 0.84060
17 2016-11-11 12:45:16 2016-11-11 12:44:39 37.00000 37 54.56519 -5.97924 54.56517 -5.97921 0.00008
18 2016-10-23 11:32:24 2016-10-23 11:31:49 35.00000 35 53.03758 -2.16986 53.03753 -2.16969 0.00942
19 2016-11-30 19:08:07 2016-11-30 19:07:33 34.00000 34 51.18279 -0.02590 51.18283 -0.02584 0.00342
20 2016-10-19 13:02:40 2016-10-19 13:01:14 86.00000 86 51.10771 -1.04498 51.10757 -1.04516 0.01481
21 2016-09-24 06:31:02 2016-09-24 06:29:01 121.00000 121 51.10760 -1.04526 51.10761 -1.04518 0.00477
22 2016-10-27 22:27:13 2016-10-27 22:26:55 18.00000 18 53.69080 -2.32692 53.69072 -2.32716 0.01227
 Entries with idle < 0 
Out[36]:
end_point_timestamp start_point_timestamp duration_seconds idle_seconds start_point_lat start_point_lon end_point_lat end_point_lon gps_distance
0 2016-10-31 10:24:47 2016-10-31 09:57:11 1,656.00000 -3326 51.07413 -1.77164 51.07443 -1.77063 0.10528
  • Count the remaining entries and update the row counter

In [37]:
print ("rows after selection/total: {}/{}".format( df.count(), numOfRows) )

numOfRows = df.count()


rows after selection/total: 3879/3903

Latitude/Longitude fields: basic checks

  • Simple stats computations to verify that values are within the expected range for the UK. This basic check is useful on the distributed data before plotting the data distribution (a stricter bounding-box check is sketched after the stats table below).

In [38]:
df.select(latlong_cols)\
  .describe()\
  .show()


+-------+------------------+-------------------+------------------+-------------------+
|summary|   start_point_lat|    start_point_lon|     end_point_lat|      end_point_lon|
+-------+------------------+-------------------+------------------+-------------------+
|  count|              3879|               3879|              3879|               3879|
|   mean|  52.4211081309616|-1.2384743000773375|52.422046621551914|-1.2375207383346223|
| stddev|1.4824738917067224|  1.178389915446085|1.4817053846773547| 1.1743991693920588|
|    min|         50.052223|           -6.29264|          50.05226|          -6.254267|
|    max|         56.494857|           1.437023|         56.494875|           1.436827|
+-------+------------------+-------------------+------------------+-------------------+
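
  • For a stricter check one could count the trips falling outside an approximate UK bounding box. A minimal sketch follows; the lat/lon limits are rough illustrative assumptions, not values taken from the dataset:

# approximate UK bounding box (illustrative values only)
lat_min, lat_max = 49.8, 61.0
lon_min, lon_max = -8.7, 2.0

out_of_box = df.where(
      (df['start_point_lat'] < lat_min) | (df['start_point_lat'] > lat_max)
    | (df['start_point_lon'] < lon_min) | (df['start_point_lon'] > lon_max)
).count()

print('trips starting outside the UK bounding box: {}'.format(out_of_box))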

Speed and driving-score fields: basic checks

- Verify that the speed data and the scores do not show unexpected stats values, in particular
  for the min and max values.
  This basic check is useful on the distributed data before plotting the data distribution

In [39]:
df.select('top_speed')\
  .describe()\
  .show()

df.select( [c for c in df.columns if c.startswith('speeding')] )\
  .describe()\
  .show()

df.select( [c for c in df.columns if 'brake' in c ] )\
  .describe()\
  .show()
  
df.select( [c for c in df.columns if 'accel' in c] )\
  .describe()\
  .show()
  
df.select( [c for c in df.columns if c.startswith('points')] )\
  .describe()\
  .show()


+-------+------------------+
|summary|         top_speed|
+-------+------------------+
|  count|              3879|
|   mean|45.782874724678294|
| stddev|19.822138603306705|
|    min|               0.0|
|    max|127.38109440865345|
+-------+------------------+

+-------+-------------------------+-------------------------+------------------------+------------------------+
|summary|speeding_city_major_count|speeding_city_minor_count|speeding_hwy_major_count|speeding_hwy_minor_count|
+-------+-------------------------+-------------------------+------------------------+------------------------+
|  count|                     3879|                     3879|                    3879|                    3879|
|   mean|       1.1874194379994845|       0.9151843258571797|      0.2031451405001289|      0.1500386697602475|
| stddev|        4.963942862667783|        2.015354650938146|      3.4551544422836504|      1.5285386825067353|
|    min|                        0|                        0|                       0|                       0|
|    max|                      128|                       26|                     167|                      45|
+-------+-------------------------+-------------------------+------------------------+------------------------+

+-------+------------------+
|summary|  hard_brake_count|
+-------+------------------+
|  count|              3879|
|   mean|0.5225573601443672|
| stddev|1.2394761981345064|
|    min|                 0|
|    max|                19|
+-------+------------------+

+-------+-------------------+
|summary|   hard_accel_count|
+-------+-------------------+
|  count|               3879|
|   mean|0.20108275328692962|
| stddev| 0.7056749724151983|
|    min|                  0|
|    max|                 11|
+-------+-------------------+

+-------+------------------+------------------+
|summary| points_city_count|  points_hwy_count|
+-------+------------------+------------------+
|  count|              3879|              3879|
|   mean| 27.88837329208559| 4.840422789378706|
| stddev|30.296287733772218|20.752739069399357|
|    min|                 0|                 0|
|    max|               291|               356|
+-------+------------------+------------------+

Some fields re-cast

  • Doubles -> Floats, Longs -> Integers
  • duration_seconds: Float -> Int

In [40]:
schema_cols = [(c.name,c.dataType) for c in df.schema]

for c in schema_cols:
  name,dtype = c
  if  dtype == DoubleType():
    df = df.withColumn(name,df[name].cast(FloatType()))
  elif dtype == LongType():
    df = df.withColumn(name,df[name].cast(IntegerType()))
    
#
df = df.withColumn('duration_seconds',df['duration_seconds'].cast(IntegerType()))

df.printSchema()


root
 |-- device_key: string (nullable = true)
 |-- distance_um: string (nullable = true)
 |-- duration_seconds: integer (nullable = true)
 |-- end_point_address_city: string (nullable = true)
 |-- end_point_address_state: string (nullable = true)
 |-- end_point_address_zipcode: string (nullable = true)
 |-- end_point_lat: float (nullable = true)
 |-- end_point_lon: float (nullable = true)
 |-- end_point_timestamp: timestamp (nullable = true)
 |-- gps_distance: float (nullable = true)
 |-- hard_accel_count: integer (nullable = true)
 |-- hard_brake_count: integer (nullable = true)
 |-- idle_seconds: integer (nullable = true)
 |-- key: string (nullable = true)
 |-- obd_distance: float (nullable = true)
 |-- points_city_count: integer (nullable = true)
 |-- points_hwy_count: integer (nullable = true)
 |-- speed_um: string (nullable = true)
 |-- speeding_city_major_count: integer (nullable = true)
 |-- speeding_city_minor_count: integer (nullable = true)
 |-- speeding_hwy_major_count: integer (nullable = true)
 |-- speeding_hwy_minor_count: integer (nullable = true)
 |-- start_point_address_city: string (nullable = true)
 |-- start_point_address_state: string (nullable = true)
 |-- start_point_address_zipcode: string (nullable = true)
 |-- start_point_lat: float (nullable = true)
 |-- start_point_lon: float (nullable = true)
 |-- start_point_timestamp: timestamp (nullable = true)
 |-- trip_tz: string (nullable = true)
 |-- top_speed: float (nullable = true)
 |-- vehicle_key: string (nullable = true)
 |-- vehicle_nickname: string (nullable = true)

'Key' field

  • There are as many keys as rows, so the field carries no useful information.
    It can be dropped

In [41]:
print (" key entries/total: {}/{}"\
                      .format( df.select('key').distinct().count(), numOfRows )
      )

df = df.drop('key')


 key entries/total: 3879/3879

Device key and vehicle key/nickname fields

  • Get the number of different devices
  • Get the number of different nicknames
  • Get the combinations of device key and vehicle nickname to check whether a device is used
    on multiple vehicles. It turns out that there is always one device per vehicle

In [42]:
print (" Number of device_key: {}"\
                               .format( df.select('device_key').distinct().count() ) 
      )


 Number of device_key: 55

In [43]:
print ("Number of unique vehicle_key, vehicle_nickname: ",\
       df.select('vehicle_key','vehicle_nickname').groupBy('vehicle_key','vehicle_nickname').count().count()
      )

df.select('vehicle_key','vehicle_nickname')\
    .groupBy('vehicle_key','vehicle_nickname')\
    .count().orderBy('vehicle_nickname')\
    .toPandas()
    
#.show(n=100,truncate=False)


('Number of unique vehicle_key, vehicle_nickname: ', 55)
Out[43]:
vehicle_key vehicle_nickname count
0 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZrSEp3RHhIZzhVakdTR0xiQkhqY3dFDAsSA0NhciILRnczZlVRVjk0WnEM BG61YNF 12
1 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZBS2JGM0hUVkVUQzM3ZHh4YjJaZUtCDAsSA0NhciILTGt2OHZVVEFFd0MM BN03NAE 130
2 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZMQjRZRXVuNGtKcTVZWGJGWXpYYkdKDAsSA0NhciILWDlIaVR5dGI4ZXQM Bill's Honda 64
3 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZLNFJHZllXMnViaEdqekxGbzRFWjlDDAsSA0NhciILWjdIMkFwdWpYWVYM California 18
4 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZQUzhMZHBMUDh3TDU2NVdNR21WVUJTDAsSA0NhciILWUdRVFRZY3VHWDcM D4EHH 129
5 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZxOE1icnVodXYyVHFaM25ONTV6VkxlDAsSA0NhciILaENaWUE1OHBLVUIM EO14GZR 123
6 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZpUEpCajJxVjNFWGdZTktYOEhyS0x3DAsSA0NhciILYlRROWZWeUhleTcM F9ELL 45
7 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ3TjduVmFGM0c1ZmtFekhTUUpyRlc0DAsSA0NhciILRW5aSHRrV280TXAM FX16VYG 39
8 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZlejJLMlB6NHdDTW9rb1R1VVIydW1aDAsSA0NhciILc0pyZWFMbnJnWXkM FY64AZT 5
9 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ5a3NlYVU1UXd0RDg1TWJXRDd2NTZIDAsSA0NhciILYVFIQlUzbjQ1ZlMM GF07CGG 25
10 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZuQ1BQdlZqVXZOeFZmQ21ndlJxVmd5DAsSA0NhciILOHN2eVNTdXhRRVoM GF10LZH 5
11 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ2RFZFeVp2Q245cWZ3QUNYOWlnRG5aDAsSA0NhciILYWRUVnNERWFQcGsM GN64UTO 78
12 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhYzVk45ZWh5elp6VGpjdXBZcVlvRnlhDAsSA0NhciILY2pnUmFqZU1kNTcM GU07LZD 34
13 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZxRFM1V29ETDhwNGI1d1c3VmR4U2NhDAsSA0NhciILbzdBamhnWnNXcjIM GX05RHU 116
14 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZzaDhya1M0Uzg2OENvVkdzZDU3dnFGDAsSA0NhciILeW91YWZyZGdGdzIM Jamie 6
15 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZhNEJudVUyM0VFWXozaWQ1VldpWWpUDAsSA0NhciILOWh0c3hjY2FEYXgM KM59WDP 180
16 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZxaVh2b0RDRlVFUVhKVUt3VEVDZ3pWDAsSA0NhciILTEVpcDhab2JxWUsM KN15UAC 153
17 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ5eGlFRlo1U3RIUkxqV0NrTE5xcGVTDAsSA0NhciILUWFETGpiV21abVkM KN15UAG 40
18 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ6WEVrSzl4WW84RUY1V3diUkNtRUJ4DAsSA0NhciILV1p4R2VvUzY0SGkM KO16WBJ 129
19 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZKblAyYmE0cTM5OVAzSGNhbTRiUTdCDAsSA0NhciILNW9xNHV5UHJIUHgM KS12VRN 66
20 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZEMm1DQU5mOHhQRXJ0QVNnTkRLdUZHDAsSA0NhciILWXBBNVBqZUpXN0wM KU62TNN 32
21 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhY4cmFyREROVDJDWFgydUdma0tVcGllDAsSA0NhciILTERXd3RLN2JyQ2QM L111TYM 34
22 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZFVU5WazY1ZFFDeTN3VEFVSDNSWXJSDAsSA0NhciILYjJUb3FpYUhDUVoM L3TGG 39
23 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZRYUg2dU5RVnRZTVFBUWZlRFE4QUF0DAsSA0NhciILeXVRQW9qYTlySjUM LG12YYW 22
24 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZFWUNXUVFlQ0FKZVpYVXptZVY4VlhkDAsSA0NhciILbXloYVJIeXNReWcM LN53SWX 53
25 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZiRFI0b0xWeHBnalhnSEJBWFNrbW5CDAsSA0NhciILRktNQVBhVHVzb1UM LT16NVG 56
26 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ6VnZMS0F3bkpGbXZTYjVqVFc1azlSDAsSA0NhciILQjU2WVU3b1BhYWIM LX14NYS 99
27 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZwVEN3UDRyU3VydWJ5cFY0RmM2OWlZDAsSA0NhciILWVl4b3FpWkVIdVAM LX64KWB 4
28 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZONVJjdVBKR2c0Z0RCTnlHdXJ6cHppDAsSA0NhciILRVprcmI4V2JBdjkM M100NEB 104
29 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZmOVdvd0xORFJYWnhTNndocWZZaUptDAsSA0NhciILUVZGQ3BLbzNVd3AM MJ16JPO 76
30 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDSHh3clVuZzlyakNjRldaUUZiMm9TDAsSA0NhciILNENOaVpkM1U5SHEM MV13ZFN 214
31 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZYOUs2QXYyWTdaU2RSS21HQVZmRFZZDAsSA0NhciILYzN0NGdXTk43a2IM NJZ8764 39
32 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ4TktCV2lrbTRRcDV4dEMyd28zbk5MDAsSA0NhciILSjRmdWdtNlBtY0oM NL62TNX 69
33 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZOVUpTbUZyRWtoY2JmbWQ0TmE5eEFXDAsSA0NhciILMnVXVjhLQ2RWSm4M OE07CKK 112
34 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZZem9LZm5vQkxad0c0dk1ReWg1UVVKDAsSA0NhciILdUhIckw4NVVqVDIM OY09WKK 141
35 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ4eWdBVkhnUG9BSkRVc0FyRWNjUnBGDAsSA0NhciILQ0g1Qlo3c3BHQ3MM PL08ZVH 33
36 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZoZ3d0TU5lWFdYWkprZVBhRHBtVHE4DAsSA0NhciILWEVmbVg5MmM0MlkM PX10YSL 77
37 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZDNDhDeXV5S1oyeEtwR2J5OGs3aW5TDAsSA0NhciILb3VnSzJhNGdlOEsM RK58SRO 10
38 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZCc1lDbkFwaWJpVlFqTTJ3VTM1eWdlDAsSA0NhciILN0pOd3ZVcFdLR1IM SB14FWG 196
39 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhY4ckpZMmlTNzZiclgzS1J5NXBGeEJyDAsSA0NhciILelQ5VFBHUlhueWYM SC14WGV 34
40 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZxbXlnR1Y3RVFhU3VlUmZMajJBR1NZDAsSA0NhciILZWtZazRYcjZjUlAM SG63XMW 19
41 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhY3Q1ZvRU04enJjc29ZNFliRWpjOExuDAsSA0NhciILTVR3RWNHc3ozemoM ST61MWJ 111
42 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZuNmsySkpNRVVaYk5wWWlORTNUY1ZwDAsSA0NhciILTVRGRXNvdkFuWmcM Sk12ZGS 220
43 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZpYXN4MlQzRFdMaFZxaFRxaUxFa2V2DAsSA0NhciILNEhWeVhGVUdKbWEM T6FTO 29
44 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZNdUhtZTZOVkFOUnBDa0dQRUJQRGpSDAsSA0NhciILdVZ1ZWY2V2RoOWMM VK61VZS 76
45 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZwQlIzWnBaQ0pBVUR1SGVBWUhYSlZjDAsSA0NhciILcFgyaVJXQTNrTEIM W31PMW 83
46 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZEVUI2R2tmc0FrUEVSUkFSV0dYVVdKDAsSA0NhciILc0I5ZUpGU2lnUEwM YC15CXR 62
47 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZzU0RHdTRkSHA4Z3lvNUJyS3IzbTZiDAsSA0NhciILekFhUmNLcTZqOEMM YC62VAO 43
48 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZabnBqRjhmcmdKeWJZTkJnQUVjbXZZDAsSA0NhciILNE1QcUVIaGVEUDMM YE14DGF 31
49 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZyOTIyd2NmajJYbWN1cWdyY0JleUJGDAsSA0NhciILb1loRkxYQnZMMkMM YK65WLE 33
50 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhY3a3NoM2lNdnNTTGVnWFg5aVRtQnJFDAsSA0NhciILTVdialZVWjVncEoM YP63LSE 24
51 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZQakZxWXBXb0JiQ0JENmM5Y3hWaEhoDAsSA0NhciILNmtDZFNMUjNDb0UM bens 72
52 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZ1TWdoQVhxYjZHWVRCY0NHRGNHQW5vDAsSA0NhciILVTJ1RWJmb0ZKa3kM harrow55 40
53 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZiQ3NGdGFVOHhIcmREaFd5RjZ0UlBjDAsSA0NhciILcHRnRmhGZWNGeVQM ll58hxj 130
54 agplfnp1YmllLWV1cjcLEgdBY2NvdW50IhZQRGR1WnJZeGl2NmJtZHJmNTlVdkozDAsSA0NhciILV0w3cU41S01QR3AM ov54phj 65
  • Convert device keys to more manageable integer IDs: add a 'device_id' field.
    This is also useful for ML algorithms, which normally require integer or floating-point values
    (an alternative using Spark ML's StringIndexer is sketched after the cells below)
  • Remove the vehicle and device keys

In [44]:
# 
df = df.drop('vehicle_key')

In [45]:
# get list of unique device_key's
device_keys = df.select('device_key').distinct().orderBy('device_key').rdd.flatMap(lambda x:x).collect()

# register UDF function to map key to an integer given by its position in the array
getID_udf = F.udf(lambda k: device_keys.index(k), IntegerType())

# create new column
df = df.withColumn('device_id',getID_udf(df['device_key']))

df = df.drop('device_key')
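
  • As an alternative sketch (not applied here), Spark ML's StringIndexer could have produced a similar integer encoding directly on the distributed frame, before dropping device_key:

# alternative sketch (not run): frequency-ordered integer encoding with Spark ML
# note: StringIndexer assigns indices by descending key frequency and outputs a double-typed column
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='device_key', outputCol='device_id')
# df = indexer.fit(df).transform(df).drop('device_key')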
  • check data frame schema

In [46]:
df.printSchema()


root
 |-- distance_um: string (nullable = true)
 |-- duration_seconds: integer (nullable = true)
 |-- end_point_address_city: string (nullable = true)
 |-- end_point_address_state: string (nullable = true)
 |-- end_point_address_zipcode: string (nullable = true)
 |-- end_point_lat: float (nullable = true)
 |-- end_point_lon: float (nullable = true)
 |-- end_point_timestamp: timestamp (nullable = true)
 |-- gps_distance: float (nullable = true)
 |-- hard_accel_count: integer (nullable = true)
 |-- hard_brake_count: integer (nullable = true)
 |-- idle_seconds: integer (nullable = true)
 |-- obd_distance: float (nullable = true)
 |-- points_city_count: integer (nullable = true)
 |-- points_hwy_count: integer (nullable = true)
 |-- speed_um: string (nullable = true)
 |-- speeding_city_major_count: integer (nullable = true)
 |-- speeding_city_minor_count: integer (nullable = true)
 |-- speeding_hwy_major_count: integer (nullable = true)
 |-- speeding_hwy_minor_count: integer (nullable = true)
 |-- start_point_address_city: string (nullable = true)
 |-- start_point_address_state: string (nullable = true)
 |-- start_point_address_zipcode: string (nullable = true)
 |-- start_point_lat: float (nullable = true)
 |-- start_point_lon: float (nullable = true)
 |-- start_point_timestamp: timestamp (nullable = true)
 |-- trip_tz: string (nullable = true)
 |-- top_speed: float (nullable = true)
 |-- vehicle_nickname: string (nullable = true)
 |-- device_id: integer (nullable = true)

Some column rearrangement

  • Shorten some long field names
  • Re-shuffle columns, grouping them by subject
  • Order rows by time and device id

In [47]:
# remove the 'address' tag
for c in [c for c in df.columns if 'address' in c]:
  df = df.withColumnRenamed(c,c.replace('address_',''))

In [48]:
# ordered list
cols = ['device_id','vehicle_nickname'] \
       + [c for c in df.columns if c.startswith('start')] \
       + [c for c in df.columns if c.startswith('end')] \
       + ['trip_tz'] \
       + ['duration_seconds','idle_seconds'] \
       + ['gps_distance','obd_distance'] \
       + ['top_speed'] \
       + ['hard_accel_count','hard_brake_count'] \
       + ['points_city_count','points_hwy_count'] \
       + [c for c in df.columns if c.startswith('speeding')] \
       + ['speed_um','distance_um']
        
# check that no cols are left out
print( 'new/old column list length {}/{}'.format( len(cols), len(df.columns) ) )


new/old column list length 30/30

In [49]:
# apply re-shuffling
df = df.select(cols)

#
df = df.orderBy('start_point_timestamp','device_id')

Summary

  • Prepared Data Frame:

In [50]:
print('\n list of columns: ')
df.columns


 list of columns: 
Out[50]:
['device_id',
 'vehicle_nickname',
 'start_point_city',
 'start_point_state',
 'start_point_zipcode',
 'start_point_lat',
 'start_point_lon',
 'start_point_timestamp',
 'end_point_city',
 'end_point_state',
 'end_point_zipcode',
 'end_point_lat',
 'end_point_lon',
 'end_point_timestamp',
 'trip_tz',
 'duration_seconds',
 'idle_seconds',
 'gps_distance',
 'obd_distance',
 'top_speed',
 'hard_accel_count',
 'hard_brake_count',
 'points_city_count',
 'points_hwy_count',
 'speeding_city_major_count',
 'speeding_city_minor_count',
 'speeding_hwy_major_count',
 'speeding_hwy_minor_count',
 'speed_um',
 'distance_um']
  • Display some rows

In [51]:
df.limit(10).toPandas()


Out[51]:
device_id vehicle_nickname start_point_city start_point_state start_point_zipcode start_point_lat start_point_lon start_point_timestamp end_point_city end_point_state end_point_zipcode end_point_lat end_point_lon end_point_timestamp trip_tz duration_seconds idle_seconds gps_distance obd_distance top_speed hard_accel_count hard_brake_count points_city_count points_hwy_count speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count speed_um distance_um
0 37 bens London England SE22 0NN 51.45408 -0.05773 2016-08-10 17:33:52 London England SE20 8SH 51.40580 -0.06562 2016-08-10 17:55:50 BST 1318 360 4.56465 4.49189 41.60000 0 3 47 0 0 1 0 0 mph mi
1 37 bens London England SE20 8SJ 51.40570 -0.06566 2016-08-13 09:00:25 London England SE6 4RS 51.43126 -0.03489 2016-08-13 09:13:16 BST 771 189 2.26938 2.81295 32.30000 0 0 5 0 0 0 0 0 mph mi
2 37 bens London England SE6 4RS 51.43126 -0.03489 2016-08-13 10:41:20 London England SE20 8SH 51.40581 -0.06558 2016-08-13 10:54:48 BST 808 214 2.21226 2.72223 29.80000 0 0 6 0 0 0 0 0 mph mi
3 37 bens London England SE20 8SH 51.40616 -0.06565 2016-08-14 00:53:25 London England SE20 8SW 51.40634 -0.06818 2016-08-14 01:01:18 BST 473 104 0.11582 1.40492 28.00000 0 0 3 0 0 0 0 0 mph mi
4 37 bens London England SE20 8SH 51.40576 -0.06559 2016-08-17 19:27:08 Orpington England BR6 0AT 51.37328 0.09562 2016-08-17 20:00:46 BST 2018 605 9.23671 9.05462 39.10000 0 1 61 0 1 1 0 0 mph mi
5 37 bens Orpington England BR6 0AT 51.37328 0.09562 2016-08-17 20:41:04 London England SE20 8SH 51.40581 -0.06560 2016-08-17 21:05:24 BST 1460 138 7.95704 8.92848 42.30000 0 2 45 0 1 2 0 0 mph mi
6 37 bens London England SE20 8SJ 51.40579 -0.06579 2016-08-18 19:53:53 London England SE23 3XP 51.44048 -0.05667 2016-08-18 20:04:44 BST 651 75 2.53011 2.94716 32.90000 0 1 5 0 0 1 0 0 mph mi
7 37 bens London England SE23 3XP 51.44078 -0.05680 2016-08-18 21:16:34 London England SE20 8SJ 51.40582 -0.06583 2016-08-18 21:38:01 BST 1287 449 3.26761 3.59960 34.20000 0 0 27 0 0 0 0 0 mph mi
8 37 bens London England SE20 8BL 51.41071 -0.06535 2016-08-19 17:07:36 Fleet England GU51 51.29453 -0.85891 2016-08-19 19:45:12 BST 9456 3629 74.70174 72.48668 90.10000 1 15 97 89 6 6 5 10 mph mi
9 37 bens Tatsfield England TN16 2ER 51.27354 0.04222 2016-08-21 12:49:35 London England SE19 2BA 51.41868 -0.07489 2016-08-21 13:51:55 BST 3740 761 26.97838 25.79001 84.50000 0 2 93 22 7 0 1 2 mph mi

In [52]:
print(' Number of selected fields/total: {}/{} '.format( len(df.columns), numOfFieldsRaw ) )
print(' Number of selected rows/total: {}/{} '.format( numOfRows, numOfRowsRaw) )


 Number of selected fields/total: 30/56 
 Number of selected rows/total: 3879/28952 

In [53]:
# Prepared Spark data frame can be written to file system for later use
#! rm -rf ./output
#df.repartition(1).write.parquet('./output')

In [54]:
#spark.stop()




Data Analytics




to index

  • Analysis is performed on a Pandas Data Frame
  • Feature exploration, validation, engineering, selection
  • Data exploration, visualization and selection
  • Predictive modelling using the scikit-learn library

In [55]:
# control cell printout 
InteractiveShell.ast_node_interactivity = 'last_expr'

In [56]:
# import graphics library

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns

In [57]:
# graphics settings

SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 20

plt.rc('font', size=BIGGER_SIZE)         # controls default text sizes
plt.rc('axes', titlesize=BIGGER_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=BIGGER_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=BIGGER_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=BIGGER_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

#
axis_font = {'fontname':'Arial', 'size':'24'}

Convert Spark Data Frame into a Pandas Data Frame

  • Locally resident object

In [151]:
pdf = df.toPandas()

#write to file for later use
#pdf.to_pickle('zubie.pkl')

Feature analysis: Visualization of geo-spatial data


  • Exploit lat/long coordinates to create dynamic maps and visualize the geographical distribution of data features
  • As a first data exploration, a base map with two layers displaying the density of trips (their starting points) is created:
    - One layer holding a heatmap
    - One layer holding a clustered representation of trip starting points. Point clusters are
      recomputed according to the distance scale

In [155]:
# Load necessary modules
# 
from IPython.core.display import HTML, display
from IPython.display import IFrame

# Module to handle Pandas Dataframes and geo-data
import geopandas as gpd
from geopandas.tools import sjoin

# Interactive Map with layer support
import folium
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap
from folium.element import IFrame

# utilities
import shapely
from shapely.geometry import Point
  • Create a GeoSeries object: a series of Shapely geometry objects (Points), one for each start-point location
  • Build a GeoDataFrame: the Pandas data frame pdf with a geometry-specialized field

In [61]:
# Build a GeoSeries of shapely Points
# Select coordinate system EPSG:4326, which is the standard WGS84 coordinate system
geo_series = gpd.GeoSeries( pdf.apply(lambda z: Point(z['start_point_lat'], z['start_point_lon']), 1)
                           ,crs={'init': 'epsg:4326'} )

In [62]:
geo_series.head()


Out[62]:
0    POINT (51.45407867431641 -0.05773499980568886)
1    POINT (51.40570068359375 -0.06565999984741211)
2    POINT (51.43125915527344 -0.03488999977707863)
3    POINT (51.40615844726562 -0.06565000116825104)
4    POINT (51.40576171875 -0.06558500230312347)   
dtype: object

In [63]:
# join geo_series to pdf and drop the original lat/lon columns
# In general it is advisable to subset pdf to retain fields relevant for maps
# (not critical on small data sets)
geo_df = gpd.GeoDataFrame( pdf.drop(['start_point_lat', 'start_point_lon'], 1)
                          ,geometry=geo_series )

In [64]:
# display the data frame with the new field 'geometry'
# The field is added as the last one: let's move it to the front
# before displaying
geo_df = geo_df[ ['geometry']+geo_df.columns[:-1].tolist() ]
geo_df.head()


Out[64]:
geometry device_id vehicle_nickname start_point_city start_point_state start_point_zipcode start_point_timestamp end_point_city end_point_state end_point_zipcode end_point_lat end_point_lon end_point_timestamp trip_tz duration_seconds idle_seconds gps_distance obd_distance top_speed hard_accel_count hard_brake_count points_city_count points_hwy_count speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count speed_um distance_um
0 POINT (51.45407867431641 -0.05773499980568886) 37 bens London England SE22 0NN 2016-08-10 17:33:52 London England SE20 8SH 51.40580 -0.06562 2016-08-10 17:55:50 BST 1318 360 4.56465 4.49189 41.60000 0 3 47 0 0 1 0 0 mph mi
1 POINT (51.40570068359375 -0.06565999984741211) 37 bens London England SE20 8SJ 2016-08-13 09:00:25 London England SE6 4RS 51.43126 -0.03489 2016-08-13 09:13:16 BST 771 189 2.26938 2.81295 32.30000 0 0 5 0 0 0 0 0 mph mi
2 POINT (51.43125915527344 -0.03488999977707863) 37 bens London England SE6 4RS 2016-08-13 10:41:20 London England SE20 8SH 51.40581 -0.06558 2016-08-13 10:54:48 BST 808 214 2.21226 2.72223 29.80000 0 0 6 0 0 0 0 0 mph mi
3 POINT (51.40615844726562 -0.06565000116825104) 37 bens London England SE20 8SH 2016-08-14 00:53:25 London England SE20 8SW 51.40634 -0.06818 2016-08-14 01:01:18 BST 473 104 0.11582 1.40492 28.00000 0 0 3 0 0 0 0 0 mph mi
4 POINT (51.40576171875 -0.06558500230312347) 37 bens London England SE20 8SH 2016-08-17 19:27:08 Orpington England BR6 0AT 51.37328 0.09562 2016-08-17 20:00:46 BST 2018 605 9.23671 9.05462 39.10000 0 1 61 0 1 1 0 0 mph mi
  • Create a base map object

In [65]:
# center the map at the mean of the lat/lon values from the data (starting points only)
lat,lon = pdf[['start_point_lat']].mean(),  pdf[['start_point_lon']].mean() 

geo_map = folium.Map( 
        location=[np.float(lat), np.float(lon)]
        ,control_scale = True # show distance scales
        ,zoom_start = 7 # adjusted to show most of the country
        )
  • Write a function to build a heatmap layer from the GeoDataFrame

In [66]:
def add_heatmap(base_map, gdf, mag_field=None):
    """
        Add a heatmap layer to the Folium base_map using points from the
        POINT-type field 'geometry' in the GeoDataFrame gdf.
        Values in the mag_field column can be used as weights
    """
    # Create empty lists to hold coordinates and magnitude values if provided
    lat, lon, mag = [], [], []
    # loop through rows of the GeoDataFrame
    # (the geometry Points were built as (lat, lon), so .x holds latitude and .y longitude)
    for i, row in gdf.iterrows():
        lat.append( np.float(row.geometry.x) )
        lon.append( np.float(row.geometry.y) )
        m = row[mag_field] if mag_field else 1
        mag.append( m )
        
    # create a Folium Feature Group for the layer
    layer = folium.FeatureGroup(name = 'heatmap')
    
    layer.add_children(HeatMap(zip(lat,lon,mag)))
    # add this point layer to the map object
    base_map.add_children(layer)
    return base_map
  • Write a function to build a clusters layer from the GeoDataFrame

In [67]:
def add_clustered_points(base_map, gdf, popup_fields=None ):
    """
        Add a clustered point layer to the Folium base_map using points from the
        POINT-type field 'geometry' in the GeoDataFrame gdf.
        Values in the popup_fields columns can be used as text to be displayed in
        the marker popup window
    """
    # Create empty lists to hold coordinates and popup info text if provided
    coords, pops = [], [] 
    # loop through rows of the GeoDataFrame
    for i, row in gdf.iterrows():
        coords.append([np.float(row.geometry.x), np.float(row.geometry.y)])
        if popup_fields:
            # create a string of HTML code joining values from  
            # the fields listed in 'popup_fields' with a linebreak between them
            label = '<br>'.join([row[field] for field in popup_fields])
            # append an IFrame that uses the HTML string to the "popups" list 
            pops.append( IFrame(label, width = 300, height = 100) )
        
    # create a Folium feature group for the layer
    layer = folium.FeatureGroup(name = 'clusters')
    
    # add the clustered trip start points and popups to this layer
    if not popup_fields: pops = None
    layer.add_children( MarkerCluster(locations = coords, popups=pops) )
    
    # add this point layer to the map object
    base_map.add_children(layer)
    return base_map
  • Create the interactive map as a base map with two layers:
    a Heatmap and a map of Point Clusters. Each layer can be switched on and off via the layer control
    - no magnitude field is provided to the heatmap: it will represent the density of points

In [68]:
#
geo_map = add_heatmap(geo_map, geo_df)

#
geo_map = add_clustered_points(geo_map, geo_df)
# Optionally you can pass a list of fields to be displayed
# in the popup infobox (not advisable on large datasets)
# e.g.:
# geo_map = add_clustered_points(geo_map, geo_df, ['vehicle_nickname','top_speed'])

# add layer control to be shown on the map
folium.LayerControl().add_to( geo_map )


Out[68]:
<folium.map.LayerControl at 0x115434a90>
  • Display the interactive map
      - zooming in and out will trigger a different re-clustering of the points. The polygon defining a
        cluster boundary can be visualized as well by hovering over the cluster.

In [69]:
# geo_map.save('geo.html')
geo_map


Out[69]:

Static map images

Just as a reference, a few static images are shown here in case proper rendering is not available.

  • Cluster layers

In [162]:
from IPython.display import Image
Image(filename='cluster_map_0.png',width=500, height=100)


Out[162]:

In [165]:
from IPython.display import Image
Image(filename='cluster_map_1.png',width=500, height=100)


Out[165]:

In [163]:
from IPython.display import Image
Image(filename='heat_map_0.png',width=500, height=100)


Out[163]:
  • Heat map layer

In [164]:
from IPython.display import Image
Image(filename='heat_map_1.png',width=500, height=100)


Out[164]:


Feature analysis



In [70]:
#pdf = pd.read_pickle('zubie.pkl')

Feature creation

  • combine features into more descriptive quantities and define non-linear combinations of features which may help exploit higher-order correlations:
    • first, for convenience, express time-range variables in minutes
    • calculate the effective driving time and the trip average speed
      $t_{drive}=t_{duration}-t_{idle}$
      $v_{ave}=\frac{d}{t_{drive}}$
      where $d$ is the OBD-measured distance
    • compute the total number of hard driving maneuvers per trip as the sum of hard accelerations and brakes
      $N_{hm} = N_{acc}+N_{brake}$
    • for each trip, compute the number of hard maneuvers per mile: $N_{hm}/d$

In [71]:
# convert seconds to minutes
pdf.rename(columns={'duration_seconds':'duration','idle_seconds':'idle'}, inplace = True)
pdf['duration'] = pdf.eval('duration/60.')
pdf['idle'] = pdf.eval('idle/60.')

# effective driven time
pdf['drive_time'] = pdf.eval('duration - idle')

# average speed in mph (drive_time is in minutes, distance in miles)
pdf['ave_speed'] = pdf.eval('60.*(obd_distance/drive_time)')

# hard maneuvers
pdf['hard_manv_count'] = pdf.eval('hard_accel_count + hard_brake_count')

#  hard maneuvers per mile (trips with zero obd_distance give NaN; they are removed later by the obd > geo distance selection)
pdf['hard_manv_pm'] = pdf.eval('hard_manv_count/obd_distance')
  • Add one more distance-like variable: geo_distance
      - the geodesic distance (i.e. the shortest one) between the start point and the end point of a trip may capture additional
        trip characteristics, on top of the effectively driven distance (obd_distance)

In [72]:
# calculate geo-distance from lat/lon coordinates
import geopy
from geopy.distance import vincenty

def getGeoDist(row):
    start = (row['start_point_lat'],row['start_point_lon'])
    end   = (row['end_point_lat'],row['end_point_lon'])
    return vincenty(start,end).miles

pdf['geo_distance'] = pdf.apply(getGeoDist, axis=1)
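
  • Note: geopy 2.x removes vincenty in favour of geodesic. Under the assumption that this notebook is later re-run on an updated geopy (not the environment used here), the equivalent call would be:

# sketch for geopy >= 2.0 (assumption: updated environment, not used in this run)
# from geopy.distance import geodesic
# return geodesic(start, end).miles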
  • reshuffle columns to group them by subject

In [73]:
columns = ['device_id', 'vehicle_nickname', 'start_point_city' \
       ,'start_point_state', 'start_point_zipcode', 'start_point_lat' \
       ,'start_point_lon', 'start_point_timestamp', 'end_point_city' \
       ,'end_point_state', 'end_point_zipcode', 'end_point_lat', 'end_point_lon' \
       ,'end_point_timestamp', 'duration','idle','drive_time' \
       ,'gps_distance','obd_distance','geo_distance' \
       ,'top_speed','ave_speed' \
       ,'hard_accel_count', 'hard_brake_count','hard_manv_count', 'hard_manv_pm'\
       ,'points_city_count', 'points_hwy_count', 'speeding_city_major_count' \
       ,'speeding_city_minor_count', 'speeding_hwy_major_count' \
       ,'speeding_hwy_minor_count']

In [74]:
pdf = pdf[columns]
pdf.head()


Out[74]:
device_id vehicle_nickname start_point_city start_point_state start_point_zipcode start_point_lat start_point_lon start_point_timestamp end_point_city end_point_state end_point_zipcode end_point_lat end_point_lon end_point_timestamp duration idle drive_time gps_distance obd_distance geo_distance top_speed ave_speed hard_accel_count hard_brake_count hard_manv_count hard_manv_pm points_city_count points_hwy_count speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count
0 37 bens London England SE22 0NN 51.45408 -0.05773 2016-08-10 17:33:52 London England SE20 8SH 51.40580 -0.06562 2016-08-10 17:55:50 21.96667 6.00000 15.96667 4.56465 4.49189 3.35521 41.60000 16.87976 0 3 3 0.66787 47 0 0 1 0 0
1 37 bens London England SE20 8SJ 51.40570 -0.06566 2016-08-13 09:00:25 London England SE6 4RS 51.43126 -0.03489 2016-08-13 09:13:16 12.85000 3.15000 9.70000 2.26938 2.81295 2.21154 32.30000 17.39967 0 0 0 0.00000 5 0 0 0 0 0
2 37 bens London England SE6 4RS 51.43126 -0.03489 2016-08-13 10:41:20 London England SE20 8SH 51.40581 -0.06558 2016-08-13 10:54:48 13.46667 3.56667 9.90000 2.21226 2.72223 2.20356 29.80000 16.49835 0 0 0 0.00000 6 0 0 0 0 0
3 37 bens London England SE20 8SH 51.40616 -0.06565 2016-08-14 00:53:25 London England SE20 8SW 51.40634 -0.06818 2016-08-14 01:01:18 7.88333 1.73333 6.15000 0.11582 1.40492 0.10999 28.00000 13.70654 0 0 0 0.00000 3 0 0 0 0 0
4 37 bens London England SE20 8SH 51.40576 -0.06559 2016-08-17 19:27:08 Orpington England BR6 0AT 51.37328 0.09562 2016-08-17 20:00:46 33.63333 10.08333 23.55000 9.23671 9.05462 7.32516 39.10000 23.06910 0 1 1 0.11044 61 0 1 1 0 0
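
  • As a quick sanity check on the first row above: $t_{drive} = 21.96667 - 6.00000 = 15.96667$ min and $v_{ave} = 60 \cdot 4.49189 / 15.96667 \approx 16.88$ mph, in agreement (up to display rounding) with the ave_speed value shown in the table.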
  • Tot. number of rows at this point

In [75]:
pdf.shape[0]


Out[75]:
3879

Feature validation

  • Make sense of feature distributions and relations
  • Perform data selection: remove outliers, inconsistent entries, ...
  • Hard accel and hard brake counts:
    for about 65$\%$ of the trips no hard accelerations or brakes are
    registered. The distribution falls quite rapidly for higher counts

In [76]:
print(' counts of hard accelerations+brakes, as a fraction of the total: ')
(pdf['hard_manv_count'].value_counts()/pdf.shape[0]).head()


 counts of hard accelerations+brakes, as a fraction of the total: 
Out[76]:
0   0.66228
1   0.19309
2   0.06419
3   0.03145
4   0.01830
Name: hard_manv_count, dtype: float64


In [77]:
# 
fig, ax = plt.subplots(1,2,figsize=(20,5.5));

# superimpose three distributions, normalized to show the
# relative frequency of each count
args = dict(bins=5, range=(0,5), normed=True, alpha=0.3, align='left'); #opts

ax[0].hist(pdf.hard_accel_count, label='accels', **args);
ax[0].hist(pdf.hard_brake_count, label='brakes', **args);
ax[0].hist(pdf.hard_manv_count,  label='accels+brakes',**args);
ax[0].set_xlabel('counts')
ax[0].set_ylabel('frequency [fractional]')
ax[0].legend();

# second panel: hard maneuvers per mile, for trips with at least one hard maneuver
ax[1].hist(pdf[pdf.hard_manv_count>0].hard_manv_pm,bins=20, range=(0,1), alpha=0.5);
ax[1].set_xlabel('accels+brakes per mile')
ax[1].text(0.4,150,' only trips with at least one \n hard accel or one hard brake')


Out[77]:
<matplotlib.text.Text at 0x11a1f9350>
  • distances
    Inspect correlation plots between the three available distances.
    For a given trip the covered distance is expected to always be larger than the geo-distance between the
    starting point and the end point.
    It can be observed that:
    For a small number of trips the OBD distance is shorter than the geo-distance.
    For a quite large number of trips the GPS distance is shorter than the geo-distance.
    In both cases this is seen mostly for short trips, an indication that the cause could be
    GPS miscalculation and/or GPS precision.
    Moreover, the difference between the OBD and GPS measurements can be quite large.
    These results justify the usage of the OBD distance as the default measured trip length.

In [78]:
# keep only the needed features
vars = ['geo_distance','obd_distance','gps_distance']

data = pdf[vars]


# some data selection for better visualization
dmax= 30

data = data[ (data.gps_distance<dmax) & (data.obd_distance<dmax) & (data.geo_distance<dmax) ]

# flag trips based on the condition on obd and geo distance 
hue = ['','obd > geo','obd < geo'] #[legend title,items...]
data[hue[0]] = (data.obd_distance > data.geo_distance).apply(lambda x: hue[1] if x else hue[2])


# prepare for visualization 
data.columns = ['geo [mi]', 'obd [mi]','gps [mi]',hue[0]]

sns.set_style('whitegrid', {'axes.edgecolor': '0.5'})


g = sns.PairGrid(data, hue=hue[0], size=3, palette={hue[1]:'g',hue[2]:'r'})

# histo on diagonal
g = g.map_diag(plt.hist, bins=30, histtype="step", linewidth=3)


# scatter on the upper side
g = g.map_upper(plt.scatter, alpha=0.7, edgecolor="w")

# heatmaps (2d-histograms) on the lower side (useful to visualize data clustering)
# Adding plt.hist2d to the sns PairGrid API is not straightforward:
# the plt.hist2d API is not fully compatible with PairGrid
# because it doesn't know how to handle a color= kwarg.
# It is necessary to use a thin wrapper function.
def pairgrid_heatmap(x, y, **kws):
    cmap = sns.light_palette(kws.pop("color"), as_cmap=True)
    plt.hist2d(x, y, cmap=cmap, cmin=1, **kws)
    

g.map_lower(pairgrid_heatmap, bins=(np.linspace(0,dmax,dmax+1),np.linspace(0,dmax,dmax+1)))

#
g = g.add_legend()

# adjust number of ticks and avoid overlaps
for ax in g.axes.flat:
    ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(4, prune="both"))
    ax.yaxis.set_major_locator(mpl.ticker.MaxNLocator(4, prune="both"))
    
# adjust subplots spacing
g.fig.subplots_adjust(hspace=.05, wspace=.05);



In [79]:
print(' num. of trips where gps distance <=  geo distance: {}/{} '.format(
          pdf[pdf['gps_distance'] <= pdf['geo_distance']].shape[0], pdf.shape[0] )
     )

#show data
pdf[pdf['gps_distance'] < pdf['geo_distance']].head()


 num. of trips where gps distance <=  geo distance: 514/3879 
Out[79]:
device_id vehicle_nickname start_point_city start_point_state start_point_zipcode start_point_lat start_point_lon start_point_timestamp end_point_city end_point_state end_point_zipcode end_point_lat end_point_lon end_point_timestamp duration idle drive_time gps_distance obd_distance geo_distance top_speed ave_speed hard_accel_count hard_brake_count hard_manv_count hard_manv_pm points_city_count points_hwy_count speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count
12 37 bens Beckenham England BR3 4EH 51.40113 -0.05232 2016-08-23 07:27:33 Beckenham England BR3 4EA 51.40072 -0.05150 2016-08-23 07:33:51 6.30000 2.11667 4.18333 0.04361 1.17626 0.04541 29.20000 16.87060 0 1 1 0.85016 2 0 0 0 0 0
18 37 bens Beckenham England BR3 4AA 51.39754 -0.04958 2016-08-28 12:01:19 London England SE20 8SH 51.40579 -0.06561 2016-08-28 12:09:24 8.08333 1.91667 6.16667 0.01975 1.61246 0.89738 28.60000 15.68878 0 0 0 0.00000 3 0 0 0 0 0
37 6 Bill's Honda Sutton England SM1 3RD 51.36700 -0.17840 2016-09-05 12:34:10 Sutton England SM1 3RD 51.36703 -0.17853 2016-09-05 12:34:58 0.80000 0.48333 0.31667 0.00411 0.00497 0.00592 3.10686 0.94187 0 0 0 0.00000 1 0 0 0 0 0
65 37 bens Croydon England SE25 51.39001 -0.08990 2016-09-10 18:03:17 London England SE20 8SJ 51.40584 -0.06578 2016-09-10 18:12:12 8.91667 2.58333 6.33333 0.03999 1.63607 1.51158 37.30000 15.49961 0 0 0 0.00000 3 0 0 0 0 0
67 37 bens London England SE20 8SH 51.40589 -0.06566 2016-09-11 11:13:31 Croydon England CR0 2DJ 51.38809 -0.08958 2016-09-11 11:22:21 8.83333 2.06667 6.76667 0.01172 1.67584 1.60733 27.30000 14.85965 0 0 0 0.00000 3 0 0 0 0 0

In [80]:
print(' num. of trips where obd distance <=  geo distance: {}/{} '.format(
          pdf[pdf['obd_distance'] <= pdf['geo_distance']].shape[0],pdf.shape[0] )
     )

#show data
pdf[pdf['obd_distance'] < pdf['geo_distance']].head()


 num. of trips where obd distance <=  geo distance: 111/3879 
Out[80]:
device_id vehicle_nickname start_point_city start_point_state start_point_zipcode start_point_lat start_point_lon start_point_timestamp end_point_city end_point_state end_point_zipcode end_point_lat end_point_lon end_point_timestamp duration idle drive_time gps_distance obd_distance geo_distance top_speed ave_speed hard_accel_count hard_brake_count hard_manv_count hard_manv_pm points_city_count points_hwy_count speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count
37 6 Bill's Honda Sutton England SM1 3RD 51.36700 -0.17840 2016-09-05 12:34:10 Sutton England SM1 3RD 51.36703 -0.17853 2016-09-05 12:34:58 0.80000 0.48333 0.31667 0.00411 0.00497 0.00592 3.10686 0.94187 0 0 0 0.00000 1 0 0 0 0 0
155 40 ov54phj Stoke England ST6 53.06413 -2.21262 2016-09-17 10:25:14 Stoke England ST6 53.06408 -2.21259 2016-09-17 10:27:17 2.05000 1.63333 0.41667 0.00706 0.00186 0.00380 2.50000 0.26843 0 0 0 0.00000 2 0 0 0 0 0
233 20 L3TGG Four Marks England GU34 5DG 51.10756 -1.04513 2016-09-18 14:17:53 Four Marks England GU34 5DG 51.10756 -1.04523 2016-09-18 14:19:06 1.21667 0.73333 0.48333 0.00408 0.00373 0.00414 3.10686 0.46281 0 0 0 0.00000 3 0 0 0 0 0
238 5 W31PMW Perth Scotland PH1 3UN 56.41909 -3.47579 2016-09-18 14:56:41 Perth Scotland PH1 3ED 56.41898 -3.47600 2016-09-18 14:57:19 0.63333 0.00000 0.63333 0.00881 0.00186 0.01133 3.72823 0.17660 0 0 0 0.00000 2 0 0 0 0 0
267 40 ov54phj Stoke England ST1 6BG 53.03754 -2.16995 2016-09-19 09:48:50 Stoke England ST1 6BG 53.03752 -2.16990 2016-09-19 09:55:28 6.63333 6.61667 0.01667 0.00331 0.00000 0.00269 0.00000 0.00000 0 0 0 nan 1 0 0 0 0 0
  • Apply selection: assuming the obd distance is the most robust one, it is safe to remove trips for which the obd distance is not larger than the geo distance. This helps removing trips whose start and/or end point GPS is not well measured, together with some very short trips, which will in any case be treated in the following.

In [81]:
pdf = pdf[pdf['obd_distance'] > pdf['geo_distance']]

print( 'Tot. num of rows: {}'.format(pdf.shape[0]))


Tot. num of rows: 3768
  • speed
    For a small number of trips the top speed is lower than (or equal to) the average speed, which is inconsistent.
    The corresponding rows are removed

In [82]:
print(' num. of trips where top speed < ave speed: {}/{} '.format(
          pdf[pdf['top_speed'] <= pdf['ave_speed']].shape[0], pdf.shape[0] )
     )

# filter out those trips
pdf = pdf[pdf['top_speed'] >  pdf['ave_speed']]


 num. of trips where top speed < ave speed: 33/3768 

In [83]:
print( 'Tot. num of rows: {}'.format(pdf.shape[0]))


Tot. num of rows: 3735
  • Exploratory study of the relation between average speed and top speed for different values of:
    drive time, total number of accelerations+brakes (hard maneuvers), hwy points and city points.
    Even without additional information about the actual meaning of the last two variables it is possible to conclude that:
          - trips with a high number of hard maneuvers have higher average and top speeds
          - trips with non-null highway counts show a top speed peak around the highway (non-urban) speed limit
          - trips with null highway counts show a top speed peak around the city (urban areas) speed limit
          - shorter trips are usually urban-area trips

In [84]:
# features to plot
vars = ['ave_speed','top_speed','obd_distance']


# prepare data selections

data_1_1 = pdf[ pdf.hard_manv_count>3 ][vars]
data_1_2 = pdf[ pdf.hard_manv_count==0 ][vars]

data_2 = pdf[ ((pdf.drive_time>15) | (pdf.points_hwy_count>0)) ][vars]


data_3 = pdf[ ((pdf.drive_time<15) | (pdf.points_hwy_count==0)) ][vars]


# setup graphics
fig, ax = plt.subplots(1,3,figsize=(20,5),sharey=True);

# first panel
#
sns.kdeplot(data_1_1.top_speed, data_1_1.ave_speed, shade=True, shade_lowest=False, cmap='Blues',alpha=0.7, ax=ax[0])
sns.kdeplot(data_1_2.top_speed, data_1_2.ave_speed, shade=True, shade_lowest=False, cmap='Reds',alpha=0.7, ax=ax[0])
ax[0].set_title('')
ax[0].set_xlabel('top speed [mph]')
ax[0].set_xlim(0,100)
ax[0].set_ylabel('ave speed [mph]')
ax[0].set_ylim(0,75)
ax[0].text(30,60,'hard accel+brake\'s >3 ',color='b', fontsize=20)
ax[0].text(10,5,'hard accel+brake\'s =0 ',color='r', fontsize=20)
ax[0].axvline(x=30, linewidth=1, linestyle='--')
ax[0].axvline(x=70, linewidth=1, linestyle='--')
ax[0].text(60,80,'UK highway \n speed limit',multialignment='center',color='b',fontsize=15)
ax[0].text(20,80,'UK city \n speed limit',multialignment='center',color='b',fontsize=15)
ax[0].grid(False)


# second panel
#
sns.kdeplot(data_2.top_speed, data_2.ave_speed, shade=True, shade_lowest=False, cmap='Greens',alpha=1, ax=ax[1])

ax[1].set_title('')
ax[1].set_xlabel('top speed [mph]')
ax[1].set_xlim(0,100)
ax[1].set_ylabel('ave speed [mph]')
ax[1].set_ylim(0,75)
ax[1].text(10,50,'points hwy count > 0 \n OR \n drive time > 15 min',multialignment='center',color='g',fontsize=20)
ax[1].axvline(x=70, linewidth=1, linestyle='--')
ax[1].text(60,80,'UK highway \n speed limit',multialignment='center',color='b',fontsize=15)
ax[1].grid(False)


# third panel
#
sns.kdeplot(data_3.top_speed, data_3.ave_speed, shade=True, shade_lowest=False, cmap='Purples', alpha=1, ax=ax[2])

ax[2].set_title('')
ax[2].set_xlabel('top speed [mph]')
ax[2].set_xlim(0,100)
ax[2].set_ylabel('ave speed [mph]')
ax[2].set_ylim(0,75)
ax[2].axvline(x=30, linewidth=1, linestyle='--')
ax[2].text(10,50,'points hwy count = 0 \n OR \n drive time < 15 min', multialignment='center',color='Purple',fontsize=20)
ax[2].text(20,80,'UK city \n speed limit',multialignment='center',color='b',fontsize=15)
ax[2].grid(False)


Comments on the plots above:

  • First panel: the densities are not to scale; indeed the fraction of trips with zero hard accels and brakes is much larger (see the previous plot)
  • Second panel: the cluster at around 50 mph is related to the increase of the mean top speed for longer trips, which are a mixture of both city and non-city rides.

Feature Engineering

  • Prepare features for the ML algorithms, in particular the day of the week and the hour of the day

Categorical Features breakdown:

In [85]:
# define func to show categorical fields and the number of levels
def dump_categories(df):
    for col_name in df.columns:
        if df[col_name].dtypes == 'object':
            unique_cat = len(df[col_name].unique())
            print("Feature '{col_name}' has {unique_cat} unique categories".format(col_name=col_name, unique_cat=unique_cat))

In [86]:
dump_categories(pdf)


Feature 'vehicle_nickname' has 55 unique categories
Feature 'start_point_city' has 352 unique categories
Feature 'start_point_state' has 4 unique categories
Feature 'start_point_zipcode' has 1305 unique categories
Feature 'end_point_city' has 363 unique categories
Feature 'end_point_state' has 4 unique categories
Feature 'end_point_zipcode' has 1361 unique categories

zipcode:

  • There are ~1.3k different zip codes out of ~4k trips.
  • A query to the UK postcodes web API shows that these are very fine-granularity post codes
    and that the first part is the 'Area Code'
  • For this dataset it is necessary to extract only the Area Code in order to have a usable feature.
    There are 89 different area codes.
  • New features created: start_area_zip and end_area_zip.

In [87]:
pdf[ ['start_point_zipcode','end_point_zipcode']].head(3)


Out[87]:
start_point_zipcode end_point_zipcode
0 SE22 0NN SE20 8SH
1 SE20 8SJ SE6 4RS
2 SE6 4RS SE20 8SH

In [88]:
print( 'total number of trips: {}'.format(pdf.shape[0]))
print( 'num. of unique starting point zip codes: {}'.format(pdf['start_point_zipcode'].unique().shape[0]))


total number of trips: 3735
num. of unique starting point zip codes: 1305

In [89]:
# query UK postcode web API
# use the first zipcode in the dataset
import urllib
code = pdf['start_point_zipcode'].values[0]
url = 'http://api.postcodes.io/postcodes/'+str(code)
res = urllib.urlopen(url).read()

#print results
code
url
res


Out[89]:
'{"status":200,"result":{"postcode":"SE22 0NN","quality":1,"eastings":535064,"northings":174592,"country":"England","nhs_ha":"London","longitude":-0.0575022174614355,"latitude":51.4542065140995,"parliamentary_constituency":"Camberwell and Peckham","european_electoral_region":"London","primary_care_trust":"Southwark","region":"London","lsoa":"Southwark 029C","msoa":"Southwark 029","incode":"0NN","outcode":"SE22","admin_district":"Southwark","parish":"Southwark, unparished area","admin_county":null,"admin_ward":"Peckham Rye","ccg":"NHS Southwark","nuts":"Lewisham and Southwark","codes":{"admin_district":"E09000028","admin_county":"E99999999","admin_ward":"E05000547","parish":"E43000218","ccg":"E38000171","nuts":"UKI44"}}}'

In [90]:
# keep only the literal characters of the first part of the code
from string import digits

pdf['start_area_zip']=pdf['start_point_zipcode'].apply(lambda x: str(x).split(' ')[0].translate(None,digits))
pdf['end_area_zip']=pdf['end_point_zipcode'].apply(lambda x: str(x).split(' ')[0].translate(None,digits))

# show some data
pdf[['start_point_zipcode','start_area_zip']].head(3)


Out[90]:
start_point_zipcode start_area_zip
0 SE22 0NN SE
1 SE20 8SJ SE
2 SE6 4RS SE
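
  • Portability note (an assumption, relevant only if the notebook is later run under Python 3): str.translate(None, digits) is Python 2 syntax; the Python 3 equivalent of the digit-stripping step above would be

# Python 3 sketch of the same digit-stripping step (not used in this Python 2 run)
# pdf['start_area_zip'] = pdf['start_point_zipcode'].apply(
#     lambda x: str(x).split(' ')[0].translate(str.maketrans('', '', digits)))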

In [91]:
# find list of unique Area Codes
area_code_list = pdf['start_area_zip'].unique().tolist()
second_list    = pdf['end_area_zip'].unique().tolist()
area_code_list.extend(x for x in second_list if x not in area_code_list)

print( 'num. of unique area zip codes: {}'.format(len(area_code_list)) )


num. of unique area zip codes: 89
  • The number of Area Codes is still too high.
    Considering that 17$\%$ of the trips start in the 'RG' area and that the second most frequent code appears only ~6$\%$ of the time, it is advisable to group the codes into just two categories (to begin with): 'RG' and 'Other'

In [92]:
# define a handy function
def get_fractional_series(s):
    """get a pd series and return a series 
    of unique values sorted by relative frequency"""
    if s.shape[0]==0:
        return s
    f = s.value_counts()
    if f.sum():
        f /= f.sum()
    return f

In [93]:
# compute relative fractions of starting area codes
fracs = get_fractional_series(pdf['start_area_zip'])
fracs.head()


Out[93]:
RG   0.17323
EH   0.05944
GU   0.04123
DN   0.04070
HX   0.03829
Name: start_area_zip, dtype: float64

In [94]:
# plot first n entries 
ax = fracs[:40].plot(kind='bar',figsize=(20,5))
ax.set_xlabel('Area Code',fontsize=24)
ax.set_ylabel('Frequency',fontsize=24)


Out[94]:
<matplotlib.text.Text at 0x1164d7ed0>

In [95]:
# select only the first item and group the others into the 'Other' category
# (keep code_list as a list so that the membership test below does not do substring matching)
code_list = [ fracs.index[0] ]

pdf['start_area_zip'] = pdf['start_area_zip'].apply(lambda x: x if x in code_list else 'Other')
pdf['end_area_zip']   = pdf['end_area_zip'].apply(lambda x: x if x in code_list else 'Other')

# cross check
print(' area code categories (levels={}): '.format(pdf['start_area_zip'].unique().shape[0]))
pdf['start_area_zip'].unique()


 area code categories (levels=2): 
Out[95]:
array(['Other', 'RG'], dtype=object)

city:

  • There are more than 300 different start/end point cities.
  • The distribution of cities is quite even. In order to reduce the number of categories, cities contributing less than 3$\%$ of the trips are binned into the category 'Other'

In [96]:
print( 'total number of trips: {}'.format(pdf.shape[0]))
print( 'num. of unique starting point cities: {}'.format(pdf['start_point_city'].unique().shape[0]))


total number of trips: 3735
num. of unique starting point cities: 352

In [97]:
fracs = get_fractional_series(pdf['start_point_city'])
fracs.head()


Out[97]:
Scunthorpe   0.04043
London       0.03748
Bracknell    0.03347
Reading      0.03266
Sutton       0.02865
Name: start_point_city, dtype: float64

In [98]:
# plot first n entries 
ax = fracs[:40].plot(kind='bar',figsize=(20,5))
ax.set_xlabel('Starting Point City',fontsize=24)
ax.set_ylabel('Frequency',fontsize=24)


Out[98]:
<matplotlib.text.Text at 0x11add49d0>

In [99]:
#
def cut_fraction(series, f, opt=None):
    """
     take a series of relative frequencies and return the list of UNIQUE items
     whose occurrence frequency is higher than f.
     If opt='cumulative' is specified, return the items that together
     contribute up to a fraction f of the total
    """
    if series.shape[0]==0:
        return series
    s = series
    s.sort_values(ascending=False, inplace=True)

    if str(opt).lower()=='cumulative':
        s = s.cumsum()
        return s[s<f].index.tolist()
    else:
        return s[s>f].index.tolist()
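A short usage sketch of the cumulative option (with a hypothetical threshold of 0.5): the call below would return the most frequent starting cities that together account for up to half of the trips.

# hypothetical example: city names covering (cumulatively) up to 50% of all trips
top_half_cities = cut_fraction(get_fractional_series(pdf['start_point_city']), 0.5, opt='cumulative')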

In [100]:
city_list = cut_fraction(fracs, 0.03)

pdf['start_point_city'] = pdf['start_point_city'].apply(lambda x: x if x in city_list else 'Other')
pdf['end_point_city'] = pdf['end_point_city'].apply(lambda x: x if x in city_list else 'Other')

# cross check
print(' city categories (counts={}): '.format(pdf['start_point_city'].unique().shape[0]))
pdf['start_point_city'].unique()


 city categories (counts=5): 
Out[100]:
array([u'London', 'Other', u'Reading', u'Scunthorpe', u'Bracknell'], dtype=object)

(Timeseries) Date Time Features:

  • Inspect distribution of relevant quantities over time
  • Index rows by the timestamp in order to use the timeseries-related utilities

In [101]:
pdf.reset_index(inplace=True)
pdf.set_index('start_point_timestamp', inplace=True)
  • Data time interval is from 2016-08-10 to 2016-12-13
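A minimal sketch to verify the interval directly from the new datetime index:

# quick check of the time coverage (the index is start_point_timestamp)
print(' data from {} to {}'.format(pdf.index.min(), pdf.index.max()))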

In [102]:
# first two trips
pdf.head(2)


Out[102]:
index device_id vehicle_nickname start_point_city start_point_state start_point_zipcode start_point_lat start_point_lon end_point_city end_point_state end_point_zipcode end_point_lat end_point_lon end_point_timestamp duration idle drive_time gps_distance obd_distance geo_distance top_speed ave_speed hard_accel_count hard_brake_count hard_manv_count hard_manv_pm points_city_count points_hwy_count speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count start_area_zip end_area_zip
start_point_timestamp
2016-08-10 17:33:52 0 37 bens London England SE22 0NN 51.45408 -0.05773 London England SE20 8SH 51.40580 -0.06562 2016-08-10 17:55:50 21.96667 6.00000 15.96667 4.56465 4.49189 3.35521 41.60000 16.87976 0 3 3 0.66787 47 0 0 1 0 0 Other Other
2016-08-13 09:00:25 1 37 bens London England SE20 8SJ 51.40570 -0.06566 London England SE6 4RS 51.43126 -0.03489 2016-08-13 09:13:16 12.85000 3.15000 9.70000 2.26938 2.81295 2.21154 32.30000 17.39967 0 0 0 0.00000 5 0 0 0 0 0 Other Other

In [103]:
# last two trips
pdf.tail(2)


Out[103]:
index device_id vehicle_nickname start_point_city start_point_state start_point_zipcode start_point_lat start_point_lon end_point_city end_point_state end_point_zipcode end_point_lat end_point_lon end_point_timestamp duration idle drive_time gps_distance obd_distance geo_distance top_speed ave_speed hard_accel_count hard_brake_count hard_manv_count hard_manv_pm points_city_count points_hwy_count speeding_city_major_count speeding_city_minor_count speeding_hwy_major_count speeding_hwy_minor_count start_area_zip end_area_zip
start_point_timestamp
2016-12-13 18:16:32 3877 37 bens London England SE6 4PU 51.44081 -0.02927 London England SE20 8SJ 51.40582 -0.06575 2016-12-13 18:34:25 17.88333 6.78333 11.10000 1.99922 3.42438 2.88743 37.30000 18.51014 0 0 0 0.00000 2 0 0 0 0 0 Other Other
2016-12-13 19:11:44 3878 48 D4EHH Other England GU16 6PX 51.30424 -0.72358 Other England GU14 6EN 51.27596 -0.74820 2016-12-13 19:21:25 9.68333 1.05000 8.63333 3.29158 3.45482 2.22731 38.50000 24.01036 0 1 1 0.28945 28 0 0 1 0 0 Other Other
  • Plot the number of trips resampled at 1 week and at 1 day.
    • Most of the data refers to November 2016, apart from a spike in mid September
    • The daily rate shows some pattern, but days with higher activity are not consistently associated with weekdays or weekends
    • The sample is likely too limited to perform timeseries forecasting

In [104]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns

In [105]:
#
fig,ax = plt.subplots(1,2,figsize=(18,5))


# first panel
#

# resampling at 1 week, count number of trips
pdf['device_id'].resample('w', how='count')\
                 .plot(kind='line',style='-o',use_index=True, ax=ax[0])
    
ax[0].set_ylabel('num. of trips', fontsize=24)
ax[0].set_xlabel('')


# second panel
#

# resampling at 1 day, count number of trips
pdf['device_id'].resample('1D', how='count')\
                 .plot(kind='line',style='-o',use_index=True, ax=ax[1], label='all')

# select only Saturdays and Sundays
d = pdf[pdf.index.dayofweek>=5]['device_id'].resample('1D', how='count')
d[d!=0].plot(kind='line',style='o',use_index=True, ax=ax[1], label='weekends (Sat/Sun)')


ax[1].set_ylabel('num. of trips', fontsize=24)
ax[1].set_xlabel('')
ax[1].legend()


Out[105]:
<matplotlib.legend.Legend at 0x11752e610>
  • More interesting is to use the information on the day of the week and on the hour of the day
    • Produce the distribution of the number of trips by day of the week and by hour of the day
      • In order to reduce the number of categories, the hours are grouped into three bins (shown on the plot)
      • A field is added for the day of the week

In [106]:
fig,ax = plt.subplots(1,2,figsize=(18,5))

# a slightly convoluted way to show the bar labels in the right order
d = pdf.groupby( [pdf.index.to_datetime().strftime('%a'), pdf.index.weekday] )\
            .count()\
            .reset_index()\
            .sort_values('level_1')\
            .set_index('level_0')['device_id']
d.plot(kind='bar',ax=ax[0])
ax[0].set_xlabel('')
ax[0].set_ylabel('number of trips',color='g',fontsize=28)


# bar plot grouping by hour
d = pdf.groupby(pdf.index.hour )\
            .count()\
            ['device_id']

d.plot(kind='bar',ax=ax[1])
ax[1].set_xlabel('hour', color='g', fontsize=24)
ax[1].set_ylabel('number of trips',color='g',fontsize=28)
ax[1].grid(False)


# based on the hour bar plot, let's divide the 24h into three categories
# draw the separation lines
day_zones = [-0.1,6,13,19,23]

for h in day_zones[1:-1]:
    ax[1].axvline(x=h+0.5, linewidth=3, color='r', linestyle='--')


fig.subplots_adjust(hspace=.05, wspace=.2);


  • cut the datetime.hour info by the hour bins and add a categorical feature day_zone
    (for both the start and end point timestamps)

In [107]:
# cut data
pdf['start_day_zone'] = pd.cut(pdf.index.hour, day_zones,labels=['Early','Morning','Afternoon','Night'])
# workaround, as two categories with the same name are not allowed in pd.cut
pdf['start_day_zone'] = pdf['start_day_zone'].apply(lambda x: x if 'Early' not in x else 'Night')

In [108]:
# set datetime index to end_point
pdf.reset_index(inplace=True)
pdf.set_index('end_point_timestamp', inplace=True)

# apply categorization
pdf['end_day_zone'] = pd.cut(pdf.index.hour, day_zones,labels=['Early','Morning','Afternoon','Night'])
pdf['end_day_zone'] = pdf['end_day_zone'].apply(lambda x: x if 'Early' not in x else 'Night')
pdf[['end_day_zone']].head()

# reset index to start_point
pdf.reset_index(inplace=True)
pdf.set_index('start_point_timestamp', inplace=True)

In [109]:
# check results
pdf[['start_day_zone','end_point_timestamp','end_day_zone']].head()


Out[109]:
start_day_zone end_point_timestamp end_day_zone
start_point_timestamp
2016-08-10 17:33:52 Afternoon 2016-08-10 17:55:50 Afternoon
2016-08-13 09:00:25 Morning 2016-08-13 09:13:16 Morning
2016-08-13 10:41:20 Morning 2016-08-13 10:54:48 Morning
2016-08-14 00:53:25 Night 2016-08-14 01:01:18 Night
2016-08-17 19:27:08 Afternoon 2016-08-17 20:00:46 Night

In [110]:
#reset index to ordinal
pdf.reset_index(inplace=True)

#add weekday feature
pdf['start_weekday'] = pdf['start_point_timestamp'].apply(lambda x : x.to_datetime().strftime('%a') )
pdf['end_weekday'] = pdf['end_point_timestamp'].apply(lambda x : x.to_datetime().strftime('%a') )

(Longitudinal Data) vehicle-based data:

  • There are 55 unique vehicles
  • Produce plots, for each vehicle (over the entire time interval), of:
    • accumulated drive time
    • total number of trips
  • To help remove noise/outliers, only vehicles with:
    • a sufficient number of trips (of the order of 10) and at least 2 h of accumulated drive time are kept.
      This selection removes 74 records and 8 vehicles in total

In [111]:
# 
ax = pdf.groupby(['vehicle_nickname'],as_index=True)['drive_time']\
                .sum().sort_values(ascending=False)\
                .apply(lambda x: x/60.)\
                .plot(kind='bar',figsize=(20,7))

ax.set_xlabel('vehicle name', fontsize=28)
ax.set_ylabel('accumulated drive time [h]', fontsize=28)


Out[111]:
<matplotlib.text.Text at 0x11b972b50>

In [112]:
# to apply the selection, add to each row the accumulated drive time
# over the entire period for that row's vehicle
pdf['drive_time_sum'] = pdf.groupby(['vehicle_nickname'],as_index=False)['drive_time'].transform('sum')

In [113]:
print(' number of vehicles: {}'.format(pdf['vehicle_nickname'].unique().shape[0]))


 number of vehicles: 55

In [114]:
print( ' num of rows after removing vehicles with less than 2h accumulated drive time/total rows: {}/{} '.
    format(pdf[pdf['drive_time_sum']>120].shape[0], pdf.shape[0])
     )


 num of rows after removing vehicles with less than 2h accumulated drive time/total rows: 3709/3735 

In [115]:
pdf = pdf[ pdf['drive_time_sum']>120 ]

pdf.drop('drive_time_sum',axis=1,inplace=True)

In [116]:
# 
size = pdf.groupby(['vehicle_nickname'],as_index=False).size().sort_values(ascending=False)

ax = size.plot(kind='bar',figsize=(20,6))
ax.set_xlabel('vehicle name', fontsize=28)
ax.set_ylabel('number of trips', fontsize=28)


Out[116]:
<matplotlib.text.Text at 0x118b6fdd0>

In [117]:
# 
print(' bin values: ')
np.sort(size)


 bin values: 
Out[117]:
array([  5,  10,  16,  17,  22,  24,  24,  25,  29,  30,  32,  32,  32,
        33,  33,  34,  37,  39,  39,  40,  42,  43,  49,  52,  59,  61,
        63,  63,  69,  70,  72,  75,  76,  78,  82,  95, 100, 106, 107,
       108, 108, 119, 121, 126, 127, 136, 151, 178, 193, 212, 215])

In [118]:
# select the names of the four vehicles with the fewest trips
names = size.index[-4:].tolist()
print(names)


[u'California', u'SG63XMW', u'RK58SRO', u'FY64AZT']

In [119]:
# select DF rows
n = pdf.shape[0]
pdf = pdf[ ~pdf.vehicle_nickname.isin(names) ]

print( ' num of rows after removing the 4 vehicles with the fewest trips/total rows: {}/{} '.
    format(pdf.shape[0],n)
     )


 num of rows after removing the 4 vehicles with the fewest trips/total rows: 3661/3709 

In [120]:
print(' number of vehicles: {}'.format(pdf['vehicle_nickname'].unique().shape[0]))


 number of vehicles: 47

Final Feature Selection. Data Structure


In [121]:
pdf.dtypes


Out[121]:
start_point_timestamp        datetime64[ns]
end_point_timestamp          datetime64[ns]
index                        int64         
device_id                    int64         
vehicle_nickname             object        
start_point_city             object        
start_point_state            object        
start_point_zipcode          object        
start_point_lat              float64       
start_point_lon              float64       
end_point_city               object        
end_point_state              object        
end_point_zipcode            object        
end_point_lat                float64       
end_point_lon                float64       
duration                     float64       
idle                         float64       
drive_time                   float64       
gps_distance                 float64       
obd_distance                 float64       
geo_distance                 float64       
top_speed                    float64       
ave_speed                    float64       
hard_accel_count             int64         
hard_brake_count             int64         
hard_manv_count              int64         
hard_manv_pm                 float64       
points_city_count            int64         
points_hwy_count             int64         
speeding_city_major_count    int64         
speeding_city_minor_count    int64         
speeding_hwy_major_count     int64         
speeding_hwy_minor_count     int64         
start_area_zip               object        
end_area_zip                 object        
start_day_zone               object        
end_day_zone                 object        
start_weekday                object        
end_weekday                  object        
dtype: object

In [122]:
# list of columns, with ordering
columns = [ 'vehicle_nickname'\
            #
            ,'start_point_state','start_point_city','start_area_zip'\
            ,'start_point_lat','start_point_lon'\
            ,'start_day_zone','start_weekday'\
            #
            ,'end_point_state','end_point_city','end_area_zip'\
            ,'end_point_lat','end_point_lon'\
            ,'end_day_zone','end_weekday'\
            #
            ,'top_speed','ave_speed'\
            ,'obd_distance','geo_distance'\
            ,'drive_time','idle'\
            ,'hard_manv_count'
          ]

#            #
#            ,'points_city_count','points_hwy_count'\
#            ,'speeding_city_major_count','speeding_city_minor_count'\
#            ,'speeding_hwy_major_count','speeding_hwy_minor_count'
#]

In [123]:
# extract columns
pdf = pdf[ columns ]

In [124]:
# final Data Frame shape
print(' Number of rows, features:')
pdf.shape


 Number of rows, features:
Out[124]:
(3661, 22)
  • Numeric and Categorical Features

In [125]:
numeric_features = list(pdf.dtypes[(pdf.dtypes == 'int64') | (pdf.dtypes == 'float64') ].index)
categorical_features = list(pdf.dtypes[(pdf.dtypes == 'object') ].index)

In [126]:
numeric_features


Out[126]:
['start_point_lat',
 'start_point_lon',
 'end_point_lat',
 'end_point_lon',
 'top_speed',
 'ave_speed',
 'obd_distance',
 'geo_distance',
 'drive_time',
 'idle',
 'hard_manv_count']

In [127]:
categorical_features


Out[127]:
['vehicle_nickname',
 'start_point_state',
 'start_point_city',
 'start_area_zip',
 'start_day_zone',
 'start_weekday',
 'end_point_state',
 'end_point_city',
 'end_area_zip',
 'end_day_zone',
 'end_weekday']

Random Forest Regression


  • An interesting quantity to predict is the number of hard accelerations and brakes a driver performs per trip. Such a quantity gives an indication of the driving habits of the user and the usage of their vehicle.
  • This quantity should be predicted using information that can be retrieved from any vehicle equipped with a GPS system: essentially speed and distance, i.e. information that can be recorded even without a dedicated device installed on the car.
  • The present dataset has (among others) the additional information about specific maneuvers (hard accelerations and brakes), which provides the necessary set of labels for the prediction task.
  • A Random Forest Regression model will be used for the prediction. It:

    • Automatically models non-linear relations and interactions between variables, with no need to add 'expensive' interaction terms (and perfect collinearity is not harmful)
    • Is easy to tune
    • Is useful as a step in exploratory data analysis
    • Can handle high dimensional data
    • In general performs among the top models at most prediction tasks

Target and Feature Set

  • The target array will be the column hard_manv_count (hard maneuvers count), derived as the sum of hard accelerations and hard brakes.
    • The sample size is not large enough to allow a prediction for each value of the variable, therefore the target will be a binary label where 1 indicates trips with hard_manv_count=0 (i.e. no hard accelerations or brakes during the whole trip) and 0 the rest. This also ensures a more balanced target label distribution.
  • The feature definition can be revised here
  • Create the feature set X and the target set y from the pdf data frame

In [128]:
X = pdf
y = X.pop('hard_manv_count')
  • target encoding:
    hard_manv_count = 0 is mapped to 1
    hard_manv_count > 0 is mapped to 0

In [129]:
y = (y==0).astype(int)
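A quick sketch to verify that the binarized target is reasonably balanced (the actual fractions depend on the data):

# relative frequency of the two classes (1 = no hard maneuvers, 0 = at least one)
print(y.value_counts(normalize=True))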

Feature Engineering: Encoding

  • Summary of categorical features and levels

In [130]:
dump_categories(X)


Feature 'vehicle_nickname' has 47 unique categories
Feature 'start_point_state' has 4 unique categories
Feature 'start_point_city' has 5 unique categories
Feature 'start_area_zip' has 2 unique categories
Feature 'start_day_zone' has 3 unique categories
Feature 'start_weekday' has 7 unique categories
Feature 'end_point_state' has 4 unique categories
Feature 'end_point_city' has 5 unique categories
Feature 'end_area_zip' has 2 unique categories
Feature 'end_day_zone' has 3 unique categories
Feature 'end_weekday' has 7 unique categories
  • For the regression models to work, features with categorical values need to be converted into features with integer values. There are typically two ways to accomplish that:

    • OneHotEncoder: transforms a feature with n categories into n features with 0/1 possible states
    • LabelEncoder: each category of the feature is mapped to an integer

    • OneHotEncoding is preferred when a numerical ordering among the categories has no real meaning. Such encoding implies the creation of a very large number of features, which should then be reduced via dimensionality reduction techniques. It also makes the model more prone to overfitting and can be computationally intensive.

      Random Forest regression can handle LabelEncoding, therefore this is the chosen encoding. A test was performed also with OneHotEncoding and no difference in performance was observed.


In [131]:
def onehot_encoder(df, col_list=None):
    """one-hot encode the given columns (all object-typed columns if col_list is None)"""
    if col_list:
        feat_list = col_list
    else:
        feat_list = list(df.dtypes[(df.dtypes == 'object') ].index)

    for x in feat_list:
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        df = df.drop(x, 1)
        df = pd.concat([df, dummies], axis=1)
    return df

In [132]:
def label_encoder(df, col_list=None):
    if col_list:
        feat_list = col_list
    else:
        feat_list = list(df.dtypes[(df.dtypes == 'object') ].index)
        
    for feat in feat_list:
        unique_cat = df[feat].unique().tolist()
        df[feat] = df[feat].apply(lambda x: unique_cat.index(x))
    return df
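For reference, scikit-learn provides an equivalent LabelEncoder; a minimal sketch, shown only for comparison and not used in the pipeline below:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# e.g. le.fit_transform(X['start_weekday']) would map the seven weekday labels to the integers 0-6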

In [133]:
# example of OneHotEncoding of the feature 'start_weekday' 
#dm = pd.get_dummies(X['start_weekday'],prefix='start_weekday', dummy_na=False)
#dm.head()
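A hedged sketch of how the onehot_encoder helper defined above could be applied to all categorical features at once (not used here, since LabelEncoding is the chosen approach):

# hypothetical alternative: expand every categorical feature into 0/1 dummy columns
X_onehot = onehot_encoder(X, categorical_features)
print(' one-hot encoded shape: {}'.format(X_onehot.shape))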
  • transform the categorical features into integer codes

In [134]:
X = label_encoder(X)

In [135]:
X.head()


Out[135]:
vehicle_nickname start_point_state start_point_city start_area_zip start_point_lat start_point_lon start_day_zone start_weekday end_point_state end_point_city end_area_zip end_point_lat end_point_lon end_day_zone end_weekday top_speed ave_speed obd_distance geo_distance drive_time idle
0 0 0 0 0 51.45408 -0.05773 0 0 0 0 0 51.40580 -0.06562 0 0 41.60000 16.87976 4.49189 3.35521 15.96667 6.00000
1 0 0 0 0 51.40570 -0.06566 1 1 0 0 0 51.43126 -0.03489 1 1 32.30000 17.39967 2.81295 2.21154 9.70000 3.15000
2 0 0 0 0 51.43126 -0.03489 1 1 0 0 0 51.40581 -0.06558 1 1 29.80000 16.49835 2.72223 2.20356 9.90000 3.56667
3 0 0 0 0 51.40616 -0.06565 2 2 0 0 0 51.40634 -0.06818 2 2 28.00000 13.70654 1.40492 0.10999 6.15000 1.73333
4 0 0 0 0 51.40576 -0.06559 0 0 0 1 0 51.37328 0.09562 2 0 39.10000 23.06910 9.05462 7.32516 23.55000 10.08333

Model Execution and performances

  • A random forest combines many decision trees into one predictive model.
    For each tree in the forest:
    1. Take a bootstrap sample of the data.
    2. Randomly select a subset of the variables.
    3. For each variable selected, find the split point which minimizes the MSE (or Gini impurity / information gain for classification).
    4. Split the data using the variable with the lowest MSE.
    Repeat steps 2 through 4 (randomly selecting new sets of variables at each split) until some stopping condition is satisfied or all the data is exhausted.
    Repeat this process to build several trees.
    To make a prediction, run an observation down all the trees and average the predicted values (for regression).
  • The Out-of-Bag (OOB) error estimate is activated. This removes the need for a set-aside test set.

In [136]:
# import module
from sklearn.ensemble import RandomForestRegressor

# The error metric. In this case, we will use c-stat (aka ROC/AUC)
from sklearn.metrics import roc_auc_score

In [137]:
#
model = RandomForestRegressor(n_estimators=100, max_depth=None, oob_score=True, random_state=42)

#
model.fit(X, y)


Out[137]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=42,
           verbose=0, warm_start=False)

In [138]:
y_oob = model.oob_prediction_
print(' ROC AUC score: {0:2.5f}'.format(roc_auc_score(y, y_oob )) )


 ROC AUC score: 0.78369

Features importance, visualization


In [139]:
# just as a cross-check
model.feature_importances_
print(' sum of fractional feature importances: {0:0.2f}'.format(sum(model.feature_importances_)))


 sum of fractional feature importances: 1.00

In [140]:
# First prepare a plotting function able to handle also OneHotEncoded features, if any
# If features are exploded via OneHotEncoding they have to be summed up before visualization

def graph_feature_importances(model, feature_names, autoscale=True, headroom=0.05, width=10, summarized_columns=None):
    """
    Graphs the feature importances of on using a horizontal bar chart.
    
    feature_names = A list of the names of those featurs, displayed on the Y axis.
    autoscale = True (Automatically adjust the X axis size to the largest feature +.headroom) / False = scale from 0 to 1
    headroom = used with autoscale, .05 default
    width=figure width in inches
    summarized_columns = a list of column prefixes to summarize on, for dummy variables (e.g. ["day_"] would summarize all day_ vars
    """
    
    if autoscale:
        x_scale = model.feature_importances_.max()+ headroom
    else:
        x_scale = 1
    
    feature_dict=dict(zip(feature_names, model.feature_importances_))
    
    if summarized_columns: 
        #some dummy columns need to be summarized
        for col_name in summarized_columns: 
            #sum all the features that contain col_name, store in temp sum_value
            sum_value = sum(x for i, x in feature_dict.items() if col_name in i )  
            
            #now remove all keys that are part of col_name
            keys_to_remove = [i for i in feature_dict.keys() if col_name in i ]
            for i in keys_to_remove:
                feature_dict.pop(i)
            #lastly, re-add the summarized field
            feature_dict[col_name] = sum_value
        
    results = pd.Series(feature_dict)
    results.sort_values(inplace=True)
    results.plot(kind="barh", figsize=(width,len(results)/2), xlim=(0,x_scale))

In [141]:
graph_feature_importances(model, X.columns, summarized_columns=categorical_features)


Hyperparameters Optimization

  • Parameters to test in order to improve the model:
    • n_estimators: the number of trees in the forest. Choose as high a number as your computer can handle.
    • max_features: the number of features to consider when looking for the best split.
      Try ["auto", None, "sqrt", "log2", 0.9, 0.2]
    • min_samples_leaf: the minimum number of samples in newly created leaves. Try values from 1 to 10.
  • Other parameters
    • n_jobs: determines whether multiple processors should be used to train and test the model. Setting it to -1 uses all cores and should be much faster than setting it to 1, especially when many trees are trained; the two can be compared with %timeit (see the timing sketch after this list).
    • random_state: seed for the random number generator; using the same value makes the results reproducible.
    • oob_score: Random Forest's built-in validation method based on out-of-bag predictions.
  • Single hyperparameter optimization
    • In the following, the parameters are scanned one by one. This gives a direct idea of the impact of each parameter as its value varies.
    • A better approach is a grid search, which also exploits the correlations between hyperparameters. This test has been performed as well (see below) but without any noticeable improvement.
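A hedged timing sketch for the n_jobs comparison mentioned above (line magics, each to be run in its own cell; the actual speed-up depends on the machine):

# hypothetical timing comparison: single core vs. all available cores
%timeit RandomForestRegressor(n_estimators=200, n_jobs=1, random_state=42).fit(X, y)
%timeit RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42).fit(X, y)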

In [142]:
# turn this off for this section
InteractiveShell.ast_node_interactivity = 'none'

In [143]:
# init values to default values
n_trees_best = 100
features_type_best = 'auto'
samples_leaf_best = 1
max_depth_best = None

In [144]:
roc_scores = []
n_trees = [30, 50, 100, 200, 500, 1000, 2000]

for trees in n_trees:
    model = RandomForestRegressor( n_estimators = trees, oob_score=True, n_jobs=-1, random_state=42 );
    model.fit(X, y);
    score = roc_auc_score(y, model.oob_prediction_)
    roc_scores.append(score)
    print(' num of trees: {0:3d} -> roc score: {1:0.5f}'.format(trees,score))
    
#
n_trees_best = n_trees[ np.argmax(roc_scores) ]
print('\n num of trees best option: {}'.format(n_trees_best))

#
ax = pd.Series(roc_scores, n_trees).plot(fontsize=15, ylim=(0.7,0.9));
ax.set_ylabel('ROC AUC score',**axis_font)
ax.set_xlabel('num of trees',**axis_font)


 num of trees:  30 -> roc score: 0.76968
 num of trees:  50 -> roc score: 0.77386
 num of trees: 100 -> roc score: 0.78369
 num of trees: 200 -> roc score: 0.78744
 num of trees: 500 -> roc score: 0.79091
 num of trees: 1000 -> roc score: 0.79040
 num of trees: 2000 -> roc score: 0.79105

 num of trees best option: 2000

In [145]:
roc_scores = []
tree_deps = [4, 6, 10, 15]

for dep in tree_deps:
    model = RandomForestRegressor(n_estimators=n_trees_best, max_depth=dep, oob_score=True, n_jobs=-1, random_state=42 );
    model.fit(X, y);
    score = roc_auc_score(y, model.oob_prediction_)
    roc_scores.append(score)
    print(' trees max depth: {0:s} -> roc score: {1:0.5f}'.format(str(dep),score))
  
#
tree_depth_best = tree_deps[ np.argmax(roc_scores) ]
print('\n tree max depth best option: {}'.format(tree_depth_best))

#ax = pd.Series(roc_scores, tree_deps).plot(kind='barh',fontsize=15, ylim=(0.7,0.9));
#ax.set_ylabel('ROC AUC score',**axis_font)
#ax.set_xlabel('num of trees',**axis_font)


 trees max depth: 4 -> roc score: 0.79059
 trees max depth: 6 -> roc score: 0.79945
 trees max depth: 10 -> roc score: 0.79858
 trees max depth: 15 -> roc score: 0.79301

 tree max depth best option: 6

In [146]:
roc_scores = []
max_features_type = ['auto', None, 'sqrt', 'log2', 0.9, 0.2]

for max_features in max_features_type:
    model = RandomForestRegressor(n_estimators = n_trees_best, oob_score=True, n_jobs=-1, max_depth=tree_depth_best,random_state=42,max_features=max_features )
    model.fit(X, y)
    score = roc_auc_score(y, model.oob_prediction_)
    roc_scores.append(score)
    print(' max features option: {0:s} -> roc score: {1:0.5f}'.format(str(max_features),score))
    
#
features_type_best = max_features_type[ np.argmax(roc_scores) ]
print('\n max_feature_type best option: {}'.format(features_type_best))

#ax = pd.Series(roc_scores, max_features_type).plot(kind="barh", fontsize=20, xlim=(.6,.9));
#ax.set_xlabel('ROC AUC score',**axis_font)
#ax.set_ylabel('max feature type',**axis_font)


 max features option: auto -> roc score: 0.79945
 max features option: None -> roc score: 0.79945
 max features option: sqrt -> roc score: 0.80136
 max features option: log2 -> roc score: 0.80136
 max features option: 0.9 -> roc score: 0.80052
 max features option: 0.2 -> roc score: 0.80136

 max_feature_type best option: sqrt

In [147]:
roc_scores = []
min_samples_leaf_num = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

#
for min_samples in min_samples_leaf_num:
    model = RandomForestRegressor(n_estimators=n_trees_best, 
                                  oob_score=True, 
                                  n_jobs=-1, 
                                  random_state=42, 
                                  max_depth=tree_depth_best,
                                  max_features = features_type_best, 
                                  min_samples_leaf=min_samples)
    model.fit(X, y)
    score = roc_auc_score(y, model.oob_prediction_)
    roc_scores.append(score)
    print(' num of min_samples_leaf: {0:2d} -> roc score: {1:0.5f}'.format(min_samples,score))


    #
samples_leaf_best = min_samples_leaf_num[ np.argmax(roc_scores) ]
print('\n min_samples_leaf_num best option: {}'.format(samples_leaf_best))


#ax = pd.Series(roc_scores, min_samples_leaf_num).plot(fontsize=20, ylim=(0.7,0.9));
#ax.set_ylabel('ROC AUC score',**axis_font)
#ax.set_xlabel('min samples leaf',**axis_font)


 num of min_samples_leaf:  1 -> roc score: 0.80136
 num of min_samples_leaf:  2 -> roc score: 0.80123
 num of min_samples_leaf:  3 -> roc score: 0.80139
 num of min_samples_leaf:  4 -> roc score: 0.80165
 num of min_samples_leaf:  5 -> roc score: 0.80141
 num of min_samples_leaf:  6 -> roc score: 0.80097
 num of min_samples_leaf:  7 -> roc score: 0.80104
 num of min_samples_leaf:  8 -> roc score: 0.80101
 num of min_samples_leaf:  9 -> roc score: 0.80065
 num of min_samples_leaf: 10 -> roc score: 0.80058

 min_samples_leaf_num best option: 4

Example of grid search:

from sklearn.grid_search import GridSearchCV

model = RandomForestRegressor(random_state=42, n_estimators=2000)
param_grid = { "max_features": [3, 5], "max_depth": [4, 10, 20], "min_samples_split": [2, 4, 10] }

grid_search = GridSearchCV(model, param_grid, n_jobs=-1, cv=2)
grid_search.fit(X, y)
print grid_search.best_params_


In [ ]:
# restore full cell output
InteractiveShell.ast_node_interactivity = 'all'



Conclusions



to index

  • Major effort has been put into data cleaning/preparation and characterization, with the goal of creating a reliable dataset ready to be used for extensive exploration and modelling. The final sample consists of about 4k events out of ~30k rows of raw data (many duplicated entries). Almost all fields have been examined and checked for integrity and consistency.
  • Data exploration has been carried out with the aim of making sense of the main features and identifying potential room for ML prediction algorithms. In this phase additional features have been introduced.
  • As a first prediction study, a Random Forest Regression model is implemented in order to predict whether a driver performs hard maneuvers during a single trip. This step is preceded by a phase of feature reduction and engineering. A ROC AUC score of about 0.80 is achieved at this stage.
  • Additional studies that can be performed with the present sample:
    • Improve the understanding of some features (e.g. those related to points and speeding) and try to assess their power as predictors or targets.
    • Implement pipelines with additional models in order to improve accuracy.
    • Look at longitudinal data: predict individual user behaviour (habits, churn, etc.).
  • Further studies should involve a larger dataset. This will open the possibility of running more sophisticated models, especially for longitudinal data modelling.
    It will also make it possible to run unsupervised ML algorithms, which could uncover hidden data correlations to be used in prediction models.

In [ ]: