Explore and create ML datasets

In this notebook, we will explore data corresponding to taxi rides in New York City to build a Machine Learning model in support of a fare-estimation tool. The idea is to suggest a likely fare to taxi riders so that they are not surprised, and so that they can protest if the charge is much higher than expected.

Let's start off with the Python imports that we need.



In [1]:

    
from google.cloud import bigquery
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import shutil

Extract sample data from BigQuery

The dataset that we will use is a BigQuery public dataset. Click on the link, and look at the column names. Switch to the Details tab to verify that the number of records is one billion, and then switch to the Preview tab to look at a few rows.

Let's write a SQL query to pick up interesting fields from the dataset.



In [3]:

    
sql = """
  SELECT
    pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude,
    dropoff_latitude, passenger_count, trip_distance, tolls_amount, 
    fare_amount, total_amount 
  FROM `nyc-tlc.yellow.trips`
  LIMIT 10
"""



In [4]:

    
client = bigquery.Client()
trips = client.query(sql).to_dataframe()
trips









    Out[4]:

             pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  trip_distance  tolls_amount  fare_amount  total_amount
0  2010-03-15 17:18:34+00:00          0.000000         0.000000           0.000000          0.000000                1            0.0           0.0          0.0           0.0
1  2015-03-18 01:07:02+00:00          0.000000         0.000000           0.000000          0.000000                5            0.0           0.0          0.0           0.0
2  2015-04-29 18:45:03+00:00          0.000000         0.000000           0.000000          0.000000                1            1.0           0.0          0.0           0.0
3  2013-08-24 01:58:23+00:00        -73.972171        40.759439           0.000000          0.000000                4            0.0           0.0          0.0           0.0
4  2015-04-26 02:56:37+00:00        -73.987656        40.771656         -73.987556         40.771751                1            0.0           0.0          0.0           0.0
5  2015-03-09 18:24:03+00:00        -73.937248        40.758202         -73.937263         40.758190                1            0.0           0.0          0.0           0.0
6  2010-03-04 00:35:16+00:00        -74.035201        40.721548         -74.035201         40.721548                1            0.0           0.0          0.0           0.0
7  2013-08-07 00:42:45+00:00        -74.025817        40.763044         -74.046752         40.783240                1            4.8           0.0          0.0           0.0
8  2010-03-11 21:24:48+00:00        -74.571511        40.910800         -74.628928         40.964321                1           68.4           0.0          0.0           0.0
9  2010-03-06 06:33:41+00:00        -73.785514        40.645400         -73.784564         40.648681                2            4.1           0.0          0.0           0.0
Let's increase the number of records so that we can draw some useful graphs. BigQuery makes no guarantee about the order in which records are returned, so simply increasing the LIMIT gives no guarantee about which records we get. To sample the dataset repeatably, let's hash the pickup time with FARM_FINGERPRINT and keep the records whose hash falls into 1 of 100,000 buckets. Because there are about 1 billion records in the data, we should get back approximately 10,000 records.
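To see why hashing gives a repeatable sample, here is a minimal sketch in plain Python. It is only an illustration: SHA-256 stands in for BigQuery's FARM_FINGERPRINT (the two produce different values), and the function name keep_row and the timestamps are made up for this example.

```python
import hashlib

def keep_row(pickup_datetime: str, every_n: int, bucket: int = 1) -> bool:
    """Return True if the hash of the timestamp falls into the chosen bucket."""
    # Assumption: SHA-256 is used here only as a stand-in for FARM_FINGERPRINT.
    digest = hashlib.sha256(pickup_datetime.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # unsigned, so no ABS is needed
    return value % every_n == bucket

# Hypothetical timestamps, one per second for an hour
timestamps = ["2010-03-15 17:%02d:%02d" % (m, s) for m in range(60) for s in range(60)]

# Keep roughly 1 row in 100; rerunning this always selects the same rows
sample = [t for t in timestamps if keep_row(t, every_n=100)]
print(len(timestamps), len(sample))
```

Because the decision depends only on the value being hashed, the same rows are selected on every run, and all rides that share a pickup time land in the same bucket.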



In [5]:

    
sql = """
  SELECT
    pickup_datetime,
    pickup_longitude, pickup_latitude, 
    dropoff_longitude, dropoff_latitude,
    passenger_count,
    trip_distance,
    tolls_amount,
    fare_amount,
    total_amount
  FROM
    `nyc-tlc.yellow.trips`
  WHERE
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
"""



In [6]:

    
trips = client.query(sql).to_dataframe()
trips[:10]









    Out[6]:

             pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  trip_distance  tolls_amount  fare_amount  total_amount
0  2009-07-04 08:36:00+00:00        -73.992533        40.756207         -73.992555         40.756205                1           0.00           0.0          2.5           2.5
1  2009-08-20 23:04:58+00:00        -73.980657        40.765322         -73.962737         40.769690                1           2.50           0.0          2.5           2.5
2  2009-09-04 21:49:30+00:00        -73.991085        40.755503         -73.991185         40.755543                1           0.00           0.0          2.5           2.5
3  2009-08-31 13:27:07+00:00        -73.979360        40.735598         -73.971661         40.758827                1           1.80           0.0          2.5           2.5
4  2009-09-28 17:47:22+00:00        -73.984128        40.780583         -73.984141         40.780562                1           0.00           0.0          2.5           2.5
5  2009-05-27 20:37:00+00:00        -73.967982        40.762537         -73.967553         40.761778                5           0.07           0.0          2.5           3.0
6  2011-06-19 12:39:56+00:00        -73.994080        40.751073         -73.994097         40.751091                1           0.00           0.0          2.5           3.0
7  2013-12-06 14:55:00+00:00        -73.988727        40.773987         -73.988755         40.774037                5           0.00           0.0          2.5           3.0
8  2009-09-30 22:58:14+00:00        -73.988954        40.758612         -73.952118         40.776227                2           4.70           0.0          2.5           3.0
9  2014-05-17 15:15:00+00:00        -73.990825        40.750897         -73.990795         40.750872                6           0.00           0.0          2.5           3.0

Exploring data

Let's explore this dataset and clean it up as necessary. We'll use the Python Seaborn package to visualize graphs and Pandas to do the slicing and filtering.



In [7]:

    
ax = sns.regplot(x="trip_distance", y="fare_amount", fit_reg=False, ci=None, truncate=True, data=trips)
ax.figure.set_size_inches(10, 8)









    




Hmm ... do you see something wrong with the data that needs addressing?

It appears that we have a lot of invalid data that is being coded as zero distance and some fare amounts that are definitely illegitimate. Let's remove them from our analysis. We can do this by modifying the BigQuery query to keep only trips longer than zero miles and fare amounts that are at least the minimum cab fare ($2.50).

Note the extra conditions in the WHERE clause.
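The same two conditions can also be written as a pandas filter, which is handy if the data is already in a dataframe. This is a small sketch on made-up rows; the function name drop_invalid_rides and the toy values are assumptions for illustration only.

```python
import pandas as pd

MIN_FARE = 2.5  # minimum cab fare in dollars, as stated in the text

def drop_invalid_rides(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rides longer than zero miles whose fare is at least the minimum."""
    mask = (df["trip_distance"] > 0) & (df["fare_amount"] >= MIN_FARE)
    return df[mask]

# Hypothetical rows: the first has zero distance, the third has a zero fare
toy = pd.DataFrame({
    "trip_distance": [0.0, 1.8, 4.7, 2.0],
    "fare_amount": [2.5, 6.0, 0.0, 12.5],
})
clean = drop_invalid_rides(toy)
print(clean)
```

Filtering in the query is usually preferable for a table of this size, because the discarded rows never leave BigQuery.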



In [8]:

    
sql = """
  SELECT
    pickup_datetime,
    pickup_longitude, pickup_latitude, 
    dropoff_longitude, dropoff_latitude,
    passenger_count,
    trip_distance,
    tolls_amount,
    fare_amount,
    total_amount
  FROM
    `nyc-tlc.yellow.trips`
  WHERE
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
    AND trip_distance > 0 AND fare_amount >= 2.5
"""



In [9]:

    
trips = client.query(sql).to_dataframe()
ax = sns.regplot(x="trip_distance", y="fare_amount", fit_reg=False, ci=None, truncate=True, data=trips)
ax.figure.set_size_inches(10, 8)









    




What's up with the streaks at $45 and $50? Those are fixed-amount rides from JFK and LaGuardia airports into anywhere in Manhattan, i.e. to be expected. Let's list the data to make sure the values look reasonable.

Let's examine whether the toll amount is captured in the total amount.



In [10]:

    
tollrides = trips[trips['tolls_amount'] > 0]
tollrides[tollrides['pickup_datetime'] == '2010-04-29 12:28:00']









    Out[10]:

                pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  trip_distance  tolls_amount  fare_amount  total_amount
5302  2010-04-29 12:28:00+00:00        -73.865723        40.770543         -73.984790         40.758760                1          12.32          5.50         32.5         45.00
5713  2010-04-29 12:28:00+00:00        -73.969748        40.759790         -73.872892         40.774297                1          10.15          4.57         24.5         29.57
5776  2010-04-29 12:28:00+00:00        -73.870773        40.773753         -73.984963         40.757590                1          10.97          4.57         28.5         33.57
5887  2010-04-29 12:28:00+00:00        -73.789942        40.646943         -73.974362         40.756418                2          16.84          4.57         45.0         50.07
7153  2010-04-29 12:28:00+00:00        -73.862715        40.768987         -74.007195         40.707480                1          13.17          4.57         32.9         44.55
7249  2010-04-29 12:28:00+00:00        -73.870928        40.773747         -73.983638         40.752948                1           8.63          4.57         25.7         30.77
7269  2010-04-29 12:28:00+00:00        -74.006398        40.738450         -73.872652         40.774357                2          10.67          4.57         29.7         39.77
7275  2010-04-29 12:28:00+00:00        -74.008322        40.735337         -74.177383         40.695083                1          15.86         10.00         49.9         69.88
8215  2010-04-29 12:28:00+00:00        -73.991303        40.749965         -73.714585         40.745767                2          18.68          4.57         47.3         61.83
9692  2010-04-29 12:28:00+00:00        -73.950105        40.827105         -73.861490         40.768172                1           9.06          4.57         23.3         28.37

Looking at a few samples above, it should be clear that the total amount reflects fare amount, toll and tip somewhat arbitrarily. This is because the tip is not known when customers pay cash. So, we'll use the sum fare_amount + tolls_amount as the value to predict. Tips are discretionary and do not have to be included in our fare-estimation tool.
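To make the comparison concrete, the sketch below recomputes fare plus tolls for the first five toll rides listed above and subtracts it from total_amount. The values are copied from that listing; the column names fare_plus_tolls and remainder are made up for this example.

```python
import pandas as pd

# First five rows of the toll-ride listing above (index = original row number)
rides = pd.DataFrame({
    "tolls_amount": [5.50, 4.57, 4.57, 4.57, 4.57],
    "fare_amount": [32.5, 24.5, 28.5, 45.0, 32.9],
    "total_amount": [45.00, 29.57, 33.57, 50.07, 44.55],
}, index=[5302, 5713, 5776, 5887, 7153])

# What we intend to predict
rides["fare_plus_tolls"] = rides["fare_amount"] + rides["tolls_amount"]

# Whatever else ended up in total_amount (rounded to cents)
rides["remainder"] = (rides["total_amount"] - rides["fare_plus_tolls"]).round(2)
print(rides[["fare_plus_tolls", "total_amount", "remainder"]])
```

The remainder is 0.50 for three of these rides and about 7 for the other two, so total_amount is not a consistent function of fare and tolls. What the remainder consists of cannot be recovered from these columns alone.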

Let's also look at the distribution of values within the columns.



In [11]:

    
trips.describe()









    Out[11]:

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  trip_distance  tolls_amount   fare_amount  total_amount
count      10716.000000     10716.000000       10716.000000      10716.000000     10716.000000   10716.000000  10716.000000  10716.000000  10716.000000
mean         -72.602192        40.002372         -72.594838         40.002052         1.650056       2.856395      0.226428     11.109446     13.217078
std            9.982373         5.474670          10.004324          5.474648         1.283577       3.322024      1.135934      9.137710     10.953156
min          -74.258183         0.000000         -74.260472          0.000000         0.000000       0.010000      0.000000      2.500000      2.500000
25%          -73.992153        40.735936         -73.991566         40.734310         1.000000       1.040000      0.000000      6.000000      7.300000
50%          -73.981851        40.753264         -73.980373         40.752956         1.000000       1.770000      0.000000      8.500000     10.000000
75%          -73.967400        40.767340         -73.964142         40.767510         2.000000       3.160000      0.000000     12.500000     14.600000
max            0.000000        41.366138           0.000000         41.366138         6.000000      42.800000     16.000000    179.000000    179.000000

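Hmm ... the min and max of the coordinates look strange: the maximum longitude and the minimum latitude are both 0, which is nowhere near New York City. The minimum passenger count is also 0.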

Finally, let's actually look at the start and end of a few of the trips.



In [12]:

    
def showrides(df, numlines):
  lats = []
  lons = []
  for iter, row in df[:numlines].iterrows():
    lons.append(row['pickup_longitude'])
    lons.append(row['dropoff_longitude'])
    lons.append(None)
    lats.append(row['pickup_latitude'])
    lats.append(row['dropoff_latitude'])
    lats.append(None)

  sns.set_style("darkgrid")
  plt.figure(figsize=(10,8))
  plt.plot(lons, lats)

showrides(trips, 10)









    






In [13]:

    
showrides(tollrides, 10)









    




As you'd expect, rides that involve a toll are longer than the typical ride.

Quality control and other preprocessing

We need to do some clean-up of the data:

New York City longitudes are around -74 and latitudes are around 41, so discard rides that start or end far outside that area.
We shouldn't have zero passengers.
Fold tolls_amount into fare_amount so that fare_amount holds the amount to predict, and then remove the tolls_amount and total_amount columns.
Before the ride starts, we'll know the pickup and dropoff locations, but not the trip distance (that depends on the route taken), so remove trip_distance from the ML dataset.
Discard the timestamp.

We could do preprocessing in BigQuery, similar to how we removed the zero-distance rides, but just to show you another option, let's do this in Python. In production, we'll have to carry out the same preprocessing on the real-time input data.

This sort of preprocessing of input data is quite common in ML, especially if the quality-control is dynamic.
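For the real-time case, the same checks would have to run on one request at a time rather than on a dataframe. A minimal sketch of what that might look like, using the bounds from the preprocess function in this notebook; the function name is_valid_request and its argument names are hypothetical.

```python
# Bounds taken from the quality-control step in this notebook
LON_RANGE = (-78.0, -70.0)
LAT_RANGE = (37.0, 45.0)

def is_valid_request(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat, passengers):
    """Apply the notebook's quality-control rules to a single ride request."""
    lons_ok = all(LON_RANGE[0] < lon < LON_RANGE[1] for lon in (pickup_lon, dropoff_lon))
    lats_ok = all(LAT_RANGE[0] < lat < LAT_RANGE[1] for lat in (pickup_lat, dropoff_lat))
    return lons_ok and lats_ok and passengers > 0

print(is_valid_request(-73.98, 40.76, -73.96, 40.77, 1))  # a plausible Manhattan ride
print(is_valid_request(0.0, 0.0, -73.96, 40.77, 1))       # pickup coordinates coded as zero
```

Keeping the bounds in one place makes it easier to apply identical rules at training time and at prediction time.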



In [14]:

    
def preprocess(trips_in):
  trips = trips_in.copy(deep=True)
  # the value to predict is fare plus tolls
  trips.fare_amount = trips.fare_amount + trips.tolls_amount
  del trips['tolls_amount']
  del trips['total_amount']
  del trips['trip_distance']
  del trips['pickup_datetime']
  # keep only rides inside a generous bounding box around New York City with at least one passenger
  qc = np.all([
             trips['pickup_longitude'] > -78,
             trips['pickup_longitude'] < -70,
             trips['dropoff_longitude'] > -78,
             trips['dropoff_longitude'] < -70,
             trips['pickup_latitude'] > 37,
             trips['pickup_latitude'] < 45,
             trips['dropoff_latitude'] > 37,
             trips['dropoff_latitude'] < 45,
             trips['passenger_count'] > 0,
            ], axis=0)
  return trips[qc]

tripsqc = preprocess(trips)
tripsqc.describe()









    Out[14]:

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count   fare_amount
count      10476.000000     10476.000000       10476.000000      10476.000000     10476.000000  10476.000000
mean         -73.975206        40.751526         -73.974373         40.751199         1.653303     11.349003
std            0.038547         0.029187           0.039086          0.033147         1.278827      9.878630
min          -74.258183        40.452290         -74.260472         40.417750         1.000000      2.500000
25%          -73.992336        40.737600         -73.991739         40.735904         1.000000      6.000000
50%          -73.982090        40.754020         -73.980780         40.753597         1.000000      8.500000
75%          -73.968517        40.767774         -73.965851         40.767921         2.000000     12.500000
max          -73.137393        41.366138         -73.137393         41.366138         6.000000    179.000000

The quality control has removed 240 rows (10716 - 10476), or about 2% of the data. This seems reasonable.

Let's move on to creating the ML datasets.

Create ML datasets

Let's split the QCed data randomly into training, validation and test sets.



In [15]:

    
shuffled = tripsqc.sample(frac=1)
trainsize = int(len(shuffled['fare_amount']) * 0.70)
validsize = int(len(shuffled['fare_amount']) * 0.15)

df_train = shuffled.iloc[:trainsize, :]
df_valid = shuffled.iloc[trainsize:(trainsize+validsize), :]
df_test = shuffled.iloc[(trainsize+validsize):, :]
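As a quick check of the split arithmetic, the sketch below applies the same 70/15/15 slicing to a stand-in dataframe with the 10476 rows reported after quality control. The fixed random_state is an addition for repeatability and is not part of the cell above; the dummy fare values are placeholders.

```python
import numpy as np
import pandas as pd

n_rows = 10476  # row count after quality control, from the describe() output
frame = pd.DataFrame({"fare_amount": np.arange(n_rows, dtype=float)})  # placeholder values

# Assumption: a fixed seed, so that the shuffle is the same on every run
shuffled = frame.sample(frac=1, random_state=42)
train_size = int(len(shuffled) * 0.70)
valid_size = int(len(shuffled) * 0.15)

train = shuffled.iloc[:train_size]
valid = shuffled.iloc[train_size:train_size + valid_size]
test = shuffled.iloc[train_size + valid_size:]  # whatever is left over
print(len(train), len(valid), len(test))
```

The three sizes come out as 7333, 1571 and 1572. The test set gets one extra row because it takes the remainder after the two int() truncations.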



In [16]:

    
df_train.describe()









    Out[16]:

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count   fare_amount
count       7333.000000      7333.000000        7333.000000       7333.000000      7333.000000   7333.000000
mean         -73.975107        40.751321         -73.974045         40.750991         1.644211     11.403187
std            0.039297         0.029180           0.041209          0.034105         1.267472      9.992344
min          -74.258183        40.608573         -74.260472         40.569997         1.000000      2.500000
25%          -73.992417        40.737124         -73.991743         40.735540         1.000000      6.000000
50%          -73.982063        40.753595         -73.980860         40.753443         1.000000      8.500000
75%          -73.968425        40.767697         -73.965537         40.767534         2.000000     12.500000
max          -73.137393        41.366138         -73.137393         41.366138         6.000000    179.000000



In [17]:

    
df_valid.describe()









    Out[17]:

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count   fare_amount
count       1571.000000      1571.000000        1571.000000       1571.000000      1571.000000   1571.000000
mean         -73.976283        40.752817         -73.974124         40.751598         1.650541     11.340872
std            0.031142         0.027018           0.035942          0.032263         1.280214      9.726946
min          -74.031669        40.452290         -74.182035         40.417750         1.000000      2.500000
25%          -73.991907        40.738636         -73.991120         40.737029         1.000000      6.000000
50%          -73.981914        40.755783         -73.979670         40.754025         1.000000      8.500000
75%          -73.968597        40.768646         -73.964645         40.769946         2.000000     12.500000
max          -73.694077        40.865671         -73.679133         40.879257         6.000000     93.750000



In [18]:

    
df_test.describe()









    Out[18]:

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count   fare_amount
count       1572.000000      1572.000000        1572.000000       1572.000000      1572.000000   1572.000000
mean         -73.974595        40.751187         -73.976153         40.751767         1.698473     11.104377
std            0.041589         0.031223           0.031079          0.029270         1.329075      9.490205
min          -74.116582        40.633522         -74.155750         40.610602         1.000000      2.500000
25%          -73.992554        40.737898         -73.992351         40.735920         1.000000      6.000000
50%          -73.982347        40.753726         -73.981396         40.754341         1.000000      8.500000
75%          -73.969030        40.766664         -73.967797         40.767822         2.000000     12.100000
max          -73.137393        41.366138         -73.744892         41.001380         6.000000    120.000000

Let's write out the three dataframes to appropriately named CSV files. We can use these CSV files for local training (recall that these files represent only 1/100,000 of the full dataset) until we get to the point of using Dataflow and Cloud ML.



In [19]:

    
def to_csv(df, filename):
  outdf = df.copy()  # deep copy, so adding the key column never touches the caller's dataframe
  outdf['key'] = np.arange(0, len(outdf)) # rownumber as key
  # reorder columns so that target is first column
  cols = outdf.columns.tolist()
  cols.remove('fare_amount')
  cols.insert(0, 'fare_amount')
  print (cols)  # new order of columns
  outdf = outdf[cols]
  outdf.to_csv(filename, header=False, index_label=False, index=False)

to_csv(df_train, 'taxi-train.csv')
to_csv(df_valid, 'taxi-valid.csv')
to_csv(df_test, 'taxi-test.csv')









    



['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']
['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']
['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']



In [20]:

    
!head -10 taxi-valid.csv









    



21.0,-73.975305,40.790067,-73.996612,40.733275,1,0
12.0,-73.993325,40.736502,-73.969148,40.752752,5,1
9.0,-73.982121,40.778384,-73.972623,40.796093,1,2
5.3,-73.997942,40.735735,-73.98547,40.738608,2,3
10.5,-73.986543,40.730283,-74.006965,40.705447,1,4
25.7,-73.956644,40.771152,-74.005279,40.74028,1,5
13.7,-73.962352,40.758807,-73.941687,40.811947,1,6
8.5,-73.97510528564453,40.7363166809082,-73.98577117919922,40.755611419677734,3,7
5.7,-73.96476,40.773025,-73.964673,40.77295,1,8
6.6,-73.992046,40.751358,-74.003362,40.737756,1,9

Verify that datasets exist



In [21]:

    
!ls -l *.csv









    



-rw-r--r-- 1 root root  85534 Oct 19 21:23 taxi-test.csv
-rw-r--r-- 1 root root 402804 Oct 19 21:23 taxi-train.csv
-rw-r--r-- 1 root root  85997 Oct 19 21:23 taxi-valid.csv

We have 3 .csv files corresponding to train, valid, test. The ratio of the file sizes corresponds to our 70/15/15 split of the data.



In [22]:

    
%%bash
head taxi-train.csv









    



9.0,-73.93219757080078,40.79558181762695,-73.93547058105469,40.80010986328125,1,0
4.5,-73.967703,40.756252,-73.972677,40.747745,1,1
30.5,-73.86369323730469,40.76985168457031,-73.8174819946289,40.664794921875,1,2
4.5,-73.969182,40.766816,-73.962413,40.778255,1,3
5.7,-73.975688,40.751843,-73.97884,40.744205,1,4
20.5,-73.993289,40.752283,-73.940769,40.788656,1,5
4.1,-73.944658,40.779262,-73.954415,40.781145,1,6
11.5,-73.834687,40.717252,-73.83961,40.752702,1,7
6.9,-73.987127,40.738842,-73.969777,40.759165,1,8
4.9,-74.008033,40.722897,-74.000918,40.728945,5,9

Looks good! We now have our ML datasets and are ready to train ML models, validate them and evaluate them.

Benchmark

Before we start building complex ML models, it is a good idea to come up with a very simple model and use that as a benchmark.

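My model simply divides the mean fare_amount by the mean estimated trip distance to come up with a rate per kilometer, and uses that rate to predict the fare. Because trip_distance is not available before the ride starts, the distance is estimated "as the crow flies" from the pickup and dropoff coordinates. Let's compute the RMSE of such a model.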
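Before running this on the real data, it may help to see the rate-and-RMSE calculation on three made-up rides. The distances and fares below are invented so that the arithmetic can be followed by hand.

```python
import numpy as np

def compute_rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Hypothetical rides: estimated distance in km and fare in dollars
distance_km = np.array([1.0, 2.0, 3.0])
fare = np.array([4.0, 5.0, 9.0])

rate = fare.mean() / distance_km.mean()  # 6.0 / 2.0 = 3.0 dollars per km
predicted = rate * distance_km           # [3.0, 6.0, 9.0]
rmse = compute_rmse(fare, predicted)     # errors are [1, -1, 0]
print(rate, rmse)
```

The squared errors are 1, 1 and 0, so the RMSE is the square root of 2/3, roughly 0.82 dollars.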



In [23]:

    
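def distance_between(lat1, lon1, lat2, lon2):
  # great-circle distance "as the crow flies" via the spherical law of cosines (not the haversine formula).  Taxis can't fly of course.
  # central angle in degrees * 60 = nautical miles; the remaining factors are meant to convert to miles and then to km
  # note: 1 nautical mile is about 1.1508 miles, so the 1.515 factor scales every distance (and the printed $/km rate) by a constant; rate * distance is unaffected as long as the rate is fitted with this same function
  dist = np.degrees(np.arccos(np.minimum(1,np.sin(np.radians(lat1)) * np.sin(np.radians(lat2)) + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.cos(np.radians(lon2 - lon1))))) * 60 * 1.515 * 1.609344
  return dist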

def estimate_distance(df):
  return distance_between(df['pickuplat'], df['pickuplon'], df['dropofflat'], df['dropofflon'])

def compute_rmse(actual, predicted):
  return np.sqrt(np.mean((actual-predicted)**2))

def print_rmse(df, rate, name):
  print ("{1} RMSE = {0}".format(compute_rmse(df['fare_amount'], rate*estimate_distance(df)), name))

FEATURES = ['pickuplon','pickuplat','dropofflon','dropofflat','passengers']
TARGET = 'fare_amount'
columns = [TARGET]
columns.extend(FEATURES) # in the CSV, the target is the first column, followed by the features
columns.append('key') # the key is the last column
df_train = pd.read_csv('taxi-train.csv', header=None, names=columns)
df_valid = pd.read_csv('taxi-valid.csv', header=None, names=columns)
df_test = pd.read_csv('taxi-test.csv', header=None, names=columns)
rate = df_train['fare_amount'].mean() / estimate_distance(df_train).mean()
print ("Rate = ${0}/km".format(rate))
print_rmse(df_train, rate, 'Train')
print_rmse(df_valid, rate, 'Valid') 
print_rmse(df_test, rate, 'Test')









    



Rate = $2.6002738988685428/km
Train RMSE = 7.593609093225721
Valid RMSE = 5.440351676399091
Test RMSE = 9.328946890495182
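The distance function in the cell above is based on the spherical law of cosines. For comparison, here is a sketch of the haversine formula, which computes the same great-circle distance and is numerically better behaved for very short trips. The Earth radius of 6371 km and the function name haversine_km are assumptions of this sketch, not part of the notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of latitude along the -74 meridian
print(haversine_km(40.0, -74.0, 41.0, -74.0))
```

One degree of latitude comes out at about 111.2 km. Because the benchmark fits its rate as mean fare divided by mean distance using the same function, a constant scale factor in the distance cancels out of rate * distance.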

Benchmark on same dataset

The RMSE depends on the dataset, and for comparison, we have to evaluate on the same dataset each time. We'll use this query in later labs:



In [24]:

    
def create_query(phase, EVERY_N):
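  """
  phase: 1=train 2=valid
  EVERY_N: keep roughly 1 in EVERY_N rows; None means no subsampling (about half of the rows for training, a quarter for validation)
  """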
  base_query = """
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  CONCAT(CAST(pickup_datetime AS STRING), CAST(pickup_longitude AS STRING), CAST(pickup_latitude AS STRING), CAST(dropoff_latitude AS STRING), CAST(dropoff_longitude AS STRING)) AS key,
  EXTRACT(DAYOFWEEK FROM pickup_datetime)*1.0 AS dayofweek,
  EXTRACT(HOUR FROM pickup_datetime)*1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers
FROM
  `nyc-tlc.yellow.trips`
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """

  if EVERY_N is None:
    if phase < 2:
      # training
      query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 4)) < 2".format(base_query)
    else:
      # validation
      query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 4)) = {1}".format(base_query, phase)
  else:
    # subsample: keep the rows whose hash bucket (out of EVERY_N) equals phase
    query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), {1})) = {2}".format(base_query, EVERY_N, phase)

  return query

query = create_query(2, 100000)
df_valid = client.query(query).to_dataframe()
print_rmse(df_valid, 2.56, 'Final Validation Set')









    



Final Validation Set RMSE = 7.4158766166380445

The simple distance-based rule gives us an RMSE of $7.42. We have to beat this, of course, but you will find that simple rules of thumb like this can be surprisingly difficult to beat.

Let's be ambitious, though, and make our goal to build ML models that have an RMSE of less than $6 on the test set.
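As a side note on create_query above, the sketch below mirrors its branching to show which hash buckets each phase selects and what fraction of the rows that corresponds to. The helper name selected_buckets is made up; it restates the WHERE conditions and does not query BigQuery.

```python
def selected_buckets(phase, every_n=None):
    """Return (modulus, buckets kept) for the sampling condition in create_query."""
    if every_n is None:
        if phase < 2:
            return 4, {0, 1}   # training: hash mod 4 < 2
        return 4, {phase}      # validation: hash mod 4 = phase
    return every_n, {phase}    # subsampled: hash mod EVERY_N = phase

for phase, every_n in [(1, None), (2, None), (1, 100000), (2, 100000)]:
    modulus, buckets = selected_buckets(phase, every_n)
    print(phase, every_n, sorted(buckets), len(buckets) / modulus)
```

Without subsampling, training takes about half of the rows and validation about a quarter. With EVERY_N set to 100000, each phase reads a different single bucket, so the training and validation samples do not overlap.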

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
