Explore and create ML datasets

In this notebook, we will explore data corresponding to taxi rides in New York City to build a Machine Learning model in support of a fare-estimation tool. The idea is to suggest a likely fare to taxi riders so that they are not surprised, and so that they can protest if the charge is much higher than expected.

Let's start off with the Python imports that we need.


In [1]:
from google.cloud import bigquery
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import shutil

Extract sample data from BigQuery

The dataset that we will use is the BigQuery public dataset nyc-tlc.yellow.trips. Open it in the BigQuery console and look at the column names. Switch to the Details tab to verify that the number of records is about one billion, and then switch to the Preview tab to look at a few rows.

Let's write a SQL query to pick up interesting fields from the dataset.


In [3]:
sql = """
  SELECT
    pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude,
    dropoff_latitude, passenger_count, trip_distance, tolls_amount, 
    fare_amount, total_amount 
  FROM `nyc-tlc.yellow.trips`
  LIMIT 10
"""

In [4]:
client = bigquery.Client()
trips = client.query(sql).to_dataframe()
trips


Out[4]:
pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count trip_distance tolls_amount fare_amount total_amount
0 2010-03-15 17:18:34+00:00 0.000000 0.000000 0.000000 0.000000 1 0.0 0.0 0.0 0.0
1 2015-03-18 01:07:02+00:00 0.000000 0.000000 0.000000 0.000000 5 0.0 0.0 0.0 0.0
2 2015-04-29 18:45:03+00:00 0.000000 0.000000 0.000000 0.000000 1 1.0 0.0 0.0 0.0
3 2013-08-24 01:58:23+00:00 -73.972171 40.759439 0.000000 0.000000 4 0.0 0.0 0.0 0.0
4 2015-04-26 02:56:37+00:00 -73.987656 40.771656 -73.987556 40.771751 1 0.0 0.0 0.0 0.0
5 2015-03-09 18:24:03+00:00 -73.937248 40.758202 -73.937263 40.758190 1 0.0 0.0 0.0 0.0
6 2010-03-04 00:35:16+00:00 -74.035201 40.721548 -74.035201 40.721548 1 0.0 0.0 0.0 0.0
7 2013-08-07 00:42:45+00:00 -74.025817 40.763044 -74.046752 40.783240 1 4.8 0.0 0.0 0.0
8 2010-03-11 21:24:48+00:00 -74.571511 40.910800 -74.628928 40.964321 1 68.4 0.0 0.0 0.0
9 2010-03-06 06:33:41+00:00 -73.785514 40.645400 -73.784564 40.648681 2 4.1 0.0 0.0 0.0

Let's increase the number of records so that we can do some neat graphs. There is no guarantee about the order in which records are returned, and so no guarantee about which records get returned if we simply increase the LIMIT. To sample the dataset repeatably, let's hash the pickup time with FARM_FINGERPRINT and keep 1 in 100,000 records. Because there are about 1 billion records in the data, we should get back approximately 10,000 records.


In [5]:
sql = """
  SELECT
    pickup_datetime,
    pickup_longitude, pickup_latitude, 
    dropoff_longitude, dropoff_latitude,
    passenger_count,
    trip_distance,
    tolls_amount,
    fare_amount,
    total_amount
  FROM
    `nyc-tlc.yellow.trips`
  WHERE
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
"""

In [6]:
trips = client.query(sql).to_dataframe()
trips[:10]


Out[6]:
pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count trip_distance tolls_amount fare_amount total_amount
0 2009-07-04 08:36:00+00:00 -73.992533 40.756207 -73.992555 40.756205 1 0.00 0.0 2.5 2.5
1 2009-08-20 23:04:58+00:00 -73.980657 40.765322 -73.962737 40.769690 1 2.50 0.0 2.5 2.5
2 2009-09-04 21:49:30+00:00 -73.991085 40.755503 -73.991185 40.755543 1 0.00 0.0 2.5 2.5
3 2009-08-31 13:27:07+00:00 -73.979360 40.735598 -73.971661 40.758827 1 1.80 0.0 2.5 2.5
4 2009-09-28 17:47:22+00:00 -73.984128 40.780583 -73.984141 40.780562 1 0.00 0.0 2.5 2.5
5 2009-05-27 20:37:00+00:00 -73.967982 40.762537 -73.967553 40.761778 5 0.07 0.0 2.5 3.0
6 2011-06-19 12:39:56+00:00 -73.994080 40.751073 -73.994097 40.751091 1 0.00 0.0 2.5 3.0
7 2013-12-06 14:55:00+00:00 -73.988727 40.773987 -73.988755 40.774037 5 0.00 0.0 2.5 3.0
8 2009-09-30 22:58:14+00:00 -73.988954 40.758612 -73.952118 40.776227 2 4.70 0.0 2.5 3.0
9 2014-05-17 15:15:00+00:00 -73.990825 40.750897 -73.990795 40.750872 6 0.00 0.0 2.5 3.0

Exploring data

Let's explore this dataset and clean it up as necessary. We'll use the Python Seaborn package for visualization and Pandas for slicing and filtering.


In [7]:
ax = sns.regplot(x="trip_distance", y="fare_amount", fit_reg=False, ci=None, truncate=True, data=trips)
ax.figure.set_size_inches(10, 8)


[Scatter plot of fare_amount vs. trip_distance for the sampled trips]

Hmm ... do you see something wrong with the data that needs addressing?

It appears that we have a lot of invalid data that is being coded as zero distance and some fare amounts that are definitely illegitimate. Let's remove them from our analysis. We can do this by modifying the BigQuery query to keep only trips longer than zero miles and fare amounts that are at least the minimum cab fare ($2.50).

Note the extra WHERE clauses.


In [8]:
sql = """
  SELECT
    pickup_datetime,
    pickup_longitude, pickup_latitude, 
    dropoff_longitude, dropoff_latitude,
    passenger_count,
    trip_distance,
    tolls_amount,
    fare_amount,
    total_amount
  FROM
    `nyc-tlc.yellow.trips`
  WHERE
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
    AND trip_distance > 0 AND fare_amount >= 2.5
"""

In [9]:
trips = client.query(sql).to_dataframe()
ax = sns.regplot(x="trip_distance", y="fare_amount", fit_reg=False, ci=None, truncate=True, data=trips)
ax.figure.set_size_inches(10, 8)


[Scatter plot of fare_amount vs. trip_distance after filtering; note the horizontal streaks around $45 and $50]

What's up with the streaks at $45 and $50? Those are fixed-amount rides from JFK and La Guardia airports into Manhattan, so they are to be expected. Let's list the data to make sure the values look reasonable.

Let's examine whether the toll amount is captured in the total amount.


In [10]:
tollrides = trips[trips['tolls_amount'] > 0]
tollrides[tollrides['pickup_datetime'] == '2010-04-29 12:28:00']


Out[10]:
pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count trip_distance tolls_amount fare_amount total_amount
5302 2010-04-29 12:28:00+00:00 -73.865723 40.770543 -73.984790 40.758760 1 12.32 5.50 32.5 45.00
5713 2010-04-29 12:28:00+00:00 -73.969748 40.759790 -73.872892 40.774297 1 10.15 4.57 24.5 29.57
5776 2010-04-29 12:28:00+00:00 -73.870773 40.773753 -73.984963 40.757590 1 10.97 4.57 28.5 33.57
5887 2010-04-29 12:28:00+00:00 -73.789942 40.646943 -73.974362 40.756418 2 16.84 4.57 45.0 50.07
7153 2010-04-29 12:28:00+00:00 -73.862715 40.768987 -74.007195 40.707480 1 13.17 4.57 32.9 44.55
7249 2010-04-29 12:28:00+00:00 -73.870928 40.773747 -73.983638 40.752948 1 8.63 4.57 25.7 30.77
7269 2010-04-29 12:28:00+00:00 -74.006398 40.738450 -73.872652 40.774357 2 10.67 4.57 29.7 39.77
7275 2010-04-29 12:28:00+00:00 -74.008322 40.735337 -74.177383 40.695083 1 15.86 10.00 49.9 69.88
8215 2010-04-29 12:28:00+00:00 -73.991303 40.749965 -73.714585 40.745767 2 18.68 4.57 47.3 61.83
9692 2010-04-29 12:28:00+00:00 -73.950105 40.827105 -73.861490 40.768172 1 9.06 4.57 23.3 28.37

Looking at a few samples above, it should be clear that the total amount reflects the fare amount, tolls, and tip somewhat arbitrarily: when customers pay cash, the tip is not known. So we'll use the sum of fare_amount + tolls_amount as the value to predict. Tips are discretionary and do not have to be included in our fare-estimation tool.
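
For example, we can look at what remains of the total after subtracting the fare and tolls; here is a quick sketch using the tollrides dataframe from above (note that the leftover mixes the tip with taxes and surcharges):


In [ ]:
# Whatever remains of the total after fare and tolls: tip plus any taxes or
# surcharges; for cash trips, the tip portion typically goes unrecorded
leftover = tollrides['total_amount'] - tollrides['fare_amount'] - tollrides['tolls_amount']
leftover.describe()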

Let's also look at the distribution of values within the columns.


In [11]:
trips.describe()


Out[11]:
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count trip_distance tolls_amount fare_amount total_amount
count 10716.000000 10716.000000 10716.000000 10716.000000 10716.000000 10716.000000 10716.000000 10716.000000 10716.000000
mean -72.602192 40.002372 -72.594838 40.002052 1.650056 2.856395 0.226428 11.109446 13.217078
std 9.982373 5.474670 10.004324 5.474648 1.283577 3.322024 1.135934 9.137710 10.953156
min -74.258183 0.000000 -74.260472 0.000000 0.000000 0.010000 0.000000 2.500000 2.500000
25% -73.992153 40.735936 -73.991566 40.734310 1.000000 1.040000 0.000000 6.000000 7.300000
50% -73.981851 40.753264 -73.980373 40.752956 1.000000 1.770000 0.000000 8.500000 10.000000
75% -73.967400 40.767340 -73.964142 40.767510 2.000000 3.160000 0.000000 12.500000 14.600000
max 0.000000 41.366138 0.000000 41.366138 6.000000 42.800000 16.000000 179.000000 179.000000

Hmm ... the min and max of the longitude and latitude columns look strange.
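
Much of that strangeness comes from rides whose location was recorded as zero. We can count them with a quick filter like this:


In [ ]:
# Count rides whose pickup or dropoff longitude was recorded as zero
bad_coords = (trips['pickup_longitude'] == 0) | (trips['dropoff_longitude'] == 0)
print('{} of {} rides have a zero coordinate'.format(bad_coords.sum(), len(trips)))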

Finally, let's actually look at the start and end of a few of the trips.


In [12]:
def showrides(df, numlines):
  lats = []
  lons = []
  for _, row in df[:numlines].iterrows():
    lons.append(row['pickup_longitude'])
    lons.append(row['dropoff_longitude'])
    lons.append(None)  # None breaks the line between separate rides
    lats.append(row['pickup_latitude'])
    lats.append(row['dropoff_latitude'])
    lats.append(None)

  sns.set_style("darkgrid")
  plt.figure(figsize=(10, 8))
  plt.plot(lons, lats)

showrides(trips, 10)


[Plot of straight lines connecting pickup and dropoff locations for 10 rides]

In [13]:
showrides(tollrides, 10)


[Plot of straight lines connecting pickup and dropoff locations for 10 toll rides]

As you'd expect, rides that involve a toll tend to be longer than the typical ride.
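
We can confirm this numerically by comparing medians (trip_distance is still present in the trips dataframe at this point):


In [ ]:
# Toll rides should be longer than the typical ride
print('Median distance, all rides: {:.2f} miles'.format(trips['trip_distance'].median()))
print('Median distance, toll rides: {:.2f} miles'.format(tollrides['trip_distance'].median()))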

Quality control and other preprocessing

We need to do some cleanup of the data:

  1. New York City longitudes are around -74 and latitudes are around 41, so discard rides whose coordinates fall far outside those ranges.
  2. We shouldn't have zero passengers, so discard those rows too.
  3. Fold tolls_amount into fare_amount (the sum is what we want to predict), and then remove the tolls_amount and total_amount columns.
  4. Before the ride starts, we'll know the pickup and dropoff locations, but not the trip distance (that depends on the route taken), so remove trip_distance from the ML dataset.
  5. Discard the timestamp.

We could do preprocessing in BigQuery, similar to how we removed the zero-distance rides, but just to show you another option, let's do this in Python. In production, we'll have to carry out the same preprocessing on the real-time input data.

This sort of preprocessing of input data is quite common in ML, especially if the quality control is dynamic.


In [14]:
def preprocess(trips_in):
  trips = trips_in.copy(deep=True)
  trips.fare_amount = trips.fare_amount + trips.tolls_amount  # predict fare + tolls
  del trips['tolls_amount']
  del trips['total_amount']
  del trips['trip_distance']
  del trips['pickup_datetime']
  # Boolean mask that is True only for rows passing all quality-control checks
  qc = np.all([
      trips['pickup_longitude'] > -78,
      trips['pickup_longitude'] < -70,
      trips['dropoff_longitude'] > -78,
      trips['dropoff_longitude'] < -70,
      trips['pickup_latitude'] > 37,
      trips['pickup_latitude'] < 45,
      trips['dropoff_latitude'] > 37,
      trips['dropoff_latitude'] < 45,
      trips['passenger_count'] > 0,
  ], axis=0)
  return trips[qc]

tripsqc = preprocess(trips)
tripsqc.describe()


Out[14]:
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count fare_amount
count 10476.000000 10476.000000 10476.000000 10476.000000 10476.000000 10476.000000
mean -73.975206 40.751526 -73.974373 40.751199 1.653303 11.349003
std 0.038547 0.029187 0.039086 0.033147 1.278827 9.878630
min -74.258183 40.452290 -74.260472 40.417750 1.000000 2.500000
25% -73.992336 40.737600 -73.991739 40.735904 1.000000 6.000000
50% -73.982090 40.754020 -73.980780 40.753597 1.000000 8.500000
75% -73.968517 40.767774 -73.965851 40.767921 2.000000 12.500000
max -73.137393 41.366138 -73.137393 41.366138 6.000000 179.000000

The quality control has removed about 240 rows (10716 - 10476), or a little over 2% of the data. This seems reasonable.
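
The exact count is easy to verify:


In [ ]:
# How many rows did the quality-control filters remove?
removed = len(trips) - len(tripsqc)
print('Removed {} rows ({:.1%} of the data)'.format(removed, removed / len(trips)))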

Let's move on to creating the ML datasets.

Create ML datasets

Let's split the QCed data randomly into training, validation and test sets.


In [15]:
shuffled = tripsqc.sample(frac=1)
trainsize = int(len(shuffled['fare_amount']) * 0.70)
validsize = int(len(shuffled['fare_amount']) * 0.15)

df_train = shuffled.iloc[:trainsize, :]
df_valid = shuffled.iloc[trainsize:(trainsize+validsize), :]
df_test = shuffled.iloc[(trainsize+validsize):, :]

In [16]:
df_train.describe()


Out[16]:
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count fare_amount
count 7333.000000 7333.000000 7333.000000 7333.000000 7333.000000 7333.000000
mean -73.975107 40.751321 -73.974045 40.750991 1.644211 11.403187
std 0.039297 0.029180 0.041209 0.034105 1.267472 9.992344
min -74.258183 40.608573 -74.260472 40.569997 1.000000 2.500000
25% -73.992417 40.737124 -73.991743 40.735540 1.000000 6.000000
50% -73.982063 40.753595 -73.980860 40.753443 1.000000 8.500000
75% -73.968425 40.767697 -73.965537 40.767534 2.000000 12.500000
max -73.137393 41.366138 -73.137393 41.366138 6.000000 179.000000

In [17]:
df_valid.describe()


Out[17]:
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count fare_amount
count 1571.000000 1571.000000 1571.000000 1571.000000 1571.000000 1571.000000
mean -73.976283 40.752817 -73.974124 40.751598 1.650541 11.340872
std 0.031142 0.027018 0.035942 0.032263 1.280214 9.726946
min -74.031669 40.452290 -74.182035 40.417750 1.000000 2.500000
25% -73.991907 40.738636 -73.991120 40.737029 1.000000 6.000000
50% -73.981914 40.755783 -73.979670 40.754025 1.000000 8.500000
75% -73.968597 40.768646 -73.964645 40.769946 2.000000 12.500000
max -73.694077 40.865671 -73.679133 40.879257 6.000000 93.750000

In [18]:
df_test.describe()


Out[18]:
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count fare_amount
count 1572.000000 1572.000000 1572.000000 1572.000000 1572.000000 1572.000000
mean -73.974595 40.751187 -73.976153 40.751767 1.698473 11.104377
std 0.041589 0.031223 0.031079 0.029270 1.329075 9.490205
min -74.116582 40.633522 -74.155750 40.610602 1.000000 2.500000
25% -73.992554 40.737898 -73.992351 40.735920 1.000000 6.000000
50% -73.982347 40.753726 -73.981396 40.754341 1.000000 8.500000
75% -73.969030 40.766664 -73.967797 40.767822 2.000000 12.100000
max -73.137393 41.366138 -73.744892 41.001380 6.000000 120.000000

Let's write out the three dataframes to appropriately named CSV files. We can use these CSV files for local training (recall that these files represent only 1/100,000 of the full dataset) until we get to the point of using Dataflow and Cloud ML.


In [19]:
def to_csv(df, filename):
  outdf = df.copy(deep=False)
  outdf.loc[:, 'key'] = np.arange(0, len(outdf))  # row number as key
  # Reorder columns so that the target is the first column
  cols = outdf.columns.tolist()
  cols.remove('fare_amount')
  cols.insert(0, 'fare_amount')
  print(cols)  # new order of columns
  outdf = outdf[cols]
  outdf.to_csv(filename, header=False, index_label=False, index=False)

to_csv(df_train, 'taxi-train.csv')
to_csv(df_valid, 'taxi-valid.csv')
to_csv(df_test, 'taxi-test.csv')


['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']
['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']
['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']

In [20]:
!head -10 taxi-valid.csv


21.0,-73.975305,40.790067,-73.996612,40.733275,1,0
12.0,-73.993325,40.736502,-73.969148,40.752752,5,1
9.0,-73.982121,40.778384,-73.972623,40.796093,1,2
5.3,-73.997942,40.735735,-73.98547,40.738608,2,3
10.5,-73.986543,40.730283,-74.006965,40.705447,1,4
25.7,-73.956644,40.771152,-74.005279,40.74028,1,5
13.7,-73.962352,40.758807,-73.941687,40.811947,1,6
8.5,-73.97510528564453,40.7363166809082,-73.98577117919922,40.755611419677734,3,7
5.7,-73.96476,40.773025,-73.964673,40.77295,1,8
6.6,-73.992046,40.751358,-74.003362,40.737756,1,9

Verify that datasets exist


In [21]:
!ls -l *.csv


-rw-r--r-- 1 root root  85534 Oct 19 21:23 taxi-test.csv
-rw-r--r-- 1 root root 402804 Oct 19 21:23 taxi-train.csv
-rw-r--r-- 1 root root  85997 Oct 19 21:23 taxi-valid.csv

We have three CSV files corresponding to train, validation, and test. The ratio of the file sizes corresponds to our 70/15/15 split of the data.
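
We can also check the row counts directly against the intended 70/15/15 split:


In [ ]:
# Verify the split proportions
for name, df in [('train', df_train), ('valid', df_valid), ('test', df_test)]:
  print('{}: {} rows ({:.0%})'.format(name, len(df), len(df) / len(tripsqc)))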


In [22]:
%%bash
head taxi-train.csv


9.0,-73.93219757080078,40.79558181762695,-73.93547058105469,40.80010986328125,1,0
4.5,-73.967703,40.756252,-73.972677,40.747745,1,1
30.5,-73.86369323730469,40.76985168457031,-73.8174819946289,40.664794921875,1,2
4.5,-73.969182,40.766816,-73.962413,40.778255,1,3
5.7,-73.975688,40.751843,-73.97884,40.744205,1,4
20.5,-73.993289,40.752283,-73.940769,40.788656,1,5
4.1,-73.944658,40.779262,-73.954415,40.781145,1,6
11.5,-73.834687,40.717252,-73.83961,40.752702,1,7
6.9,-73.987127,40.738842,-73.969777,40.759165,1,8
4.9,-74.008033,40.722897,-74.000918,40.728945,5,9

Looks good! We now have our ML datasets and are ready to train ML models, validate them and evaluate them.

Benchmark

Before we start building complex ML models, it is a good idea to come up with a very simple model and use that as a benchmark.

Our benchmark model will simply divide the mean fare_amount by the mean estimated distance between pickup and dropoff (the actual trip_distance is no longer in our CSV files) to come up with a rate, and use that rate to predict. Let's compute the RMSE of such a model.


In [23]:
def distance_between(lat1, lon1, lat2, lon2):
  # Spherical law of cosines gives the distance "as the crow flies".
  # Taxis can't fly, of course, so this underestimates the route length.
  cos_angle = np.minimum(1, np.sin(np.radians(lat1)) * np.sin(np.radians(lat2))
                         + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2))
                         * np.cos(np.radians(lon2 - lon1)))
  # convert the arc length from degrees to kilometers
  dist = np.degrees(np.arccos(cos_angle)) * 60 * 1.515 * 1.609344
  return dist

def estimate_distance(df):
  return distance_between(df['pickuplat'], df['pickuplon'], df['dropofflat'], df['dropofflon'])

def compute_rmse(actual, predicted):
  return np.sqrt(np.mean((actual-predicted)**2))

def print_rmse(df, rate, name):
  print("{1} RMSE = {0}".format(compute_rmse(df['fare_amount'], rate * estimate_distance(df)), name))

FEATURES = ['pickuplon','pickuplat','dropofflon','dropofflat','passengers']
TARGET = 'fare_amount'
columns = list([TARGET])
columns.extend(FEATURES) # in the CSV, the target is the first column, followed by the features
columns.append('key')
df_train = pd.read_csv('taxi-train.csv', header=None, names=columns)
df_valid = pd.read_csv('taxi-valid.csv', header=None, names=columns)
df_test = pd.read_csv('taxi-test.csv', header=None, names=columns)
rate = df_train['fare_amount'].mean() / estimate_distance(df_train).mean()
print ("Rate = ${0}/km".format(rate))
print_rmse(df_train, rate, 'Train')
print_rmse(df_valid, rate, 'Valid') 
print_rmse(df_test, rate, 'Test')


Rate = $2.6002738988685428/km
Train RMSE = 7.593609093225721
Valid RMSE = 5.440351676399091
Test RMSE = 9.328946890495182

Benchmark on same dataset

The RMSE depends on the dataset, and for comparison, we have to evaluate on the same dataset each time. We'll use this query in later labs:


In [24]:
def create_query(phase, EVERY_N):
  """Returns a BigQuery SQL string for the given phase (1=train, 2=valid).

  If EVERY_N is None, the data is split by hash: half for training and a
  quarter for validation. Otherwise, 1 in EVERY_N records is sampled.
  """
  base_query = """
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  CONCAT(CAST(pickup_datetime AS STRING), CAST(pickup_longitude AS STRING), CAST(pickup_latitude AS STRING), CAST(dropoff_latitude AS STRING), CAST(dropoff_longitude AS STRING)) AS key,
  EXTRACT(DAYOFWEEK FROM pickup_datetime)*1.0 AS dayofweek,
  EXTRACT(HOUR FROM pickup_datetime)*1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers
FROM
  `nyc-tlc.yellow.trips`
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """

  if EVERY_N is None:
    if phase < 2:
      # training: half of the data (hash buckets 0 and 1 of 4)
      query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 4)) < 2".format(base_query)
    else:
      # validation: hash bucket 2 of 4
      query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 4)) = {1}".format(base_query, phase)
  else:
    # sample 1 in EVERY_N records
    query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), {1})) = {2}".format(base_query, EVERY_N, phase)

  return query

query = create_query(2, 100000)
df_valid = client.query(query).to_dataframe()
print_rmse(df_valid, 2.56, 'Final Validation Set')


Final Validation Set RMSE = 7.4158766166380445

The simple distance-based rule gives us an RMSE of $7.42. We have to beat this, of course, but you will find that simple rules of thumb like this can be surprisingly difficult to beat.

Let's be ambitious, though, and make our goal to build ML models that achieve an RMSE of less than $6 on the test set.

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.