Demand forecasting with BigQuery and TensorFlow

In this notebook, we will develop a machine learning model to predict the demand for taxi cabs in New York.

To develop the model, we will need to get historical data of taxicab usage. This data exists in BigQuery. Let's start by looking at the schema.

Set up


In [2]:
!pip install --user pandas-gbq
!pip install --user tensorflow==1.15.3

Restart the kernel after installation.


In [2]:
PROJECT = 'cloud-training-demos'   # CHANGE this to your GCP project
BUCKET  = PROJECT + '-ml'
REGION  = 'us-central1'            # CHANGE this to the region you want to use

In [3]:
import os
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION

In [4]:
%%bash
gcloud config set project ${PROJECT}
gcloud config set compute/region ${REGION}


Updated property [core/project].
Updated property [compute/region].

In [6]:
def query_to_dataframe(query):
  import pandas as pd
  return pd.read_gbq(query, dialect='standard', project_id=PROJECT)

Explore table


In [7]:
import pandas as pd
import numpy as np
import shutil

In [8]:
query_to_dataframe("""
SELECT * FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` LIMIT 10
""")


Requesting query... ok.
Job ID: deb771f8-1c3f-4bc0-9520-d4a754400e55
Query running...
Query done.
Processed: 18.1 GB Billed: 18.1 GB
Standard price: $0.09 USD

Retrieving results...
Got 10 rows.

Total time taken 1.99 s.
Finished at 2018-09-13 17:14:20.
Out[8]:
vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount imp_surcharge total_amount
0 1 2015-11-15 21:30:58 2015-11-15 21:55:47 1 15.10 -73.874443 40.773983 1 N -74.017311 40.705109 1 42.0 0.5 0.5 9.76 5.54 0.3 58.60
1 1 2015-03-15 12:50:01 2015-03-15 13:23:35 1 10.80 -73.870926 40.773727 None N -73.988228 40.765694 1 34.5 0.0 0.5 8.10 5.33 0.3 48.73
2 1 2015-06-20 19:36:17 2015-06-20 20:10:49 1 15.50 -73.777054 40.644947 None Y -73.946800 40.725021 1 44.5 0.0 0.5 9.06 0.00 0.3 54.36
3 2 2015-07-18 11:25:58 2015-07-18 11:43:47 1 7.21 -73.862762 40.769028 1 N -73.949203 40.722584 1 22.5 0.0 0.5 4.66 0.00 0.3 27.96
4 2 2015-05-18 08:00:04 2015-05-18 08:30:31 1 5.70 -73.958054 40.784805 None N -74.008820 40.720043 1 22.5 0.0 0.5 4.66 0.00 0.3 27.96
5 2 2015-04-30 12:25:44 2015-04-30 13:03:51 1 4.28 -73.978180 40.762341 None N -74.008911 40.710789 1 24.5 0.0 0.5 2.50 0.00 0.3 27.80
6 2 2015-01-08 19:28:42 2015-01-08 19:46:40 1 8.46 -73.862740 40.769081 None N -73.952225 40.786377 1 25.5 1.0 0.5 6.37 5.33 0.3 39.00
7 2 2015-04-05 13:50:31 2015-04-05 14:13:43 1 7.72 -73.946259 40.777489 None N -73.857201 40.836712 1 25.5 0.0 0.5 6.58 0.00 0.3 32.88
8 2 2015-05-28 08:47:56 2015-05-28 09:26:08 1 18.47 -73.776711 40.645302 None N -73.843422 40.852852 1 51.0 0.0 0.5 7.00 5.54 0.3 64.34
9 2 2015-06-29 06:58:17 2015-06-29 07:18:44 2 8.87 -73.951965 40.773315 None N -73.864807 40.770283 1 27.0 0.0 0.5 6.67 5.54 0.3 40.01

Analyzing taxicab demand

Let's start by extracting the day of the year for each trip pickup in the 2015 dataset using Standard SQL; in the next section, we will count the trips per day.


In [9]:
query_to_dataframe("""
SELECT 
  EXTRACT (DAYOFYEAR from pickup_datetime) AS daynumber
FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` 
LIMIT 5
""")


Requesting query... ok.
Job ID: 14762f07-8df7-41e7-beae-6d5cb76a1d78
Query running...
Query done.
Processed: 1.1 GB Billed: 1.1 GB
Standard price: $0.01 USD

Retrieving results...
Got 5 rows.

Total time taken 1.47 s.
Finished at 2018-09-13 17:14:25.
Out[9]:
daynumber
0 199
1 74
2 120
3 148
4 171

Modular queries and Pandas dataframe

Let's use the total number of trips as our proxy for taxicab demand (other reasonable alternatives are total trip_distance or total fare_amount). It is possible to predict multiple variables using TensorFlow, but for simplicity, we will stick to just predicting the number of trips.

We will wrap our query in a function named 'taxiquery' that takes the year as a parameter. We can then invoke taxiquery() for any year we want. The query_to_dataframe() helper we defined earlier converts the BigQuery result into a Pandas dataframe.


In [10]:
def taxiquery(year):
  return """
WITH trips AS (
  SELECT EXTRACT (DAYOFYEAR from pickup_datetime) AS daynumber 
  FROM `bigquery-public-data.new_york.tlc_yellow_trips_*`
  where _TABLE_SUFFIX = '{}'
)
SELECT daynumber, COUNT(1) AS numtrips FROM trips
GROUP BY daynumber ORDER BY daynumber
""".format(year)

In [11]:
trips = query_to_dataframe(taxiquery(2015))
trips[:5]


Requesting query... ok.
Job ID: 5cefa5cc-fc2d-4ab2-a7b0-ab12f9c861a9
Query running...
Query done.
Processed: 1.1 GB Billed: 1.1 GB
Standard price: $0.01 USD

Retrieving results...
Got 365 rows.

Total time taken 3.72 s.
Finished at 2018-09-13 17:14:33.
Out[11]:
daynumber numtrips
0 1 382014
1 2 345296
2 3 406769
3 4 328848
4 5 363454

Benchmark

Often, a reasonable estimate of something is its historical average. We can therefore benchmark our machine learning model against the historical average.


In [12]:
avg = np.mean(trips['numtrips'])
print('Just using average={0} has RMSE of {1}'.format(avg, np.sqrt(np.mean((trips['numtrips'] - avg)**2))))


Just using average=400309.5589041096 has RMSE of 51613.65169049127

The mean here is about 400,000 and the root-mean-square error (RMSE) in this case is about 52,000. In other words, if we were to estimate that there are 400,000 taxi trips on any given day, that estimate will be off on average by about 52,000 in either direction.

Let's see if we can do better than this -- our goal is to make predictions of taxicab demand whose RMSE is lower than 52,000.
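For reference, the RMSE formula used above can be wrapped in a small helper. This function is our own convenience for the rest of the discussion, not part of the original notebook's code:

```python
import numpy as np

def rmse(predictions, actuals):
  """Root-mean-square error between two arrays of the same length."""
  predictions = np.asarray(predictions, dtype=float)
  actuals = np.asarray(actuals, dtype=float)
  return np.sqrt(np.mean((predictions - actuals) ** 2))

# Predicting the historical mean for every day reproduces the benchmark idea
# (made-up trip counts here, just to show the shape of the computation):
actuals = np.array([380000.0, 420000.0, 400000.0])
constant_prediction = np.full_like(actuals, actuals.mean())
print(rmse(constant_prediction, actuals))
```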

What kinds of things affect people's use of taxicabs?

Weather data

We suspect that weather influences how often people use a taxi. Perhaps someone who'd normally walk to work would take a taxi if it is very cold or rainy.

One of the advantages of using a global data warehouse like BigQuery is that you get to mash up unrelated datasets quite easily.


In [13]:
query_to_dataframe("""
SELECT * FROM `bigquery-public-data.noaa_gsod.stations`
WHERE state = 'NY' AND wban != '99999' AND name LIKE '%LA GUARDIA%'
""")


Requesting query... ok.
Job ID: 4d6e67c6-bfc2-49eb-880e-3c44aedeb81b
Query running...
Query done.
Processed: 2.3 MB Billed: 10.0 MB
Standard price: $0.00 USD

Retrieving results...
Got 1 rows.

Total time taken 1.54 s.
Finished at 2018-09-13 17:14:38.
Out[13]:
usaf wban name country state call lat lon elev begin end
0 725030 14732 LA GUARDIA AIRPORT US NY KLGA 40.779 -73.88 +0003.4 19730101 20180909

Variables

Let's pull out the minimum and maximum daily temperature (in Fahrenheit) as well as the amount of rain (in inches) for La Guardia airport.


In [14]:
def wxquery(year):
  return """
SELECT EXTRACT (DAYOFYEAR FROM CAST(CONCAT('{0}','-',mo,'-',da) AS TIMESTAMP)) AS daynumber,
       MIN(EXTRACT (DAYOFWEEK FROM CAST(CONCAT('{0}','-',mo,'-',da) AS TIMESTAMP))) dayofweek,
       MIN(min) mintemp, MAX(max) maxtemp, MAX(IF(prcp=99.99,0,prcp)) rain
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE stn='725030' AND _TABLE_SUFFIX = '{0}'
GROUP BY 1 ORDER BY daynumber DESC
""".format(year)

In [15]:
weather = query_to_dataframe(wxquery(2015))
weather[:5]


Requesting query... ok.
Job ID: 62a3c04d-3663-4c33-b1e6-b02ef171134d
Query running...
Query done.
Processed: 160.3 MB Billed: 161.0 MB
Standard price: $0.00 USD

Retrieving results...
Got 365 rows.

Total time taken 5.33 s.
Finished at 2018-09-13 17:14:54.
Out[15]:
daynumber dayofweek mintemp maxtemp rain
0 365 5 46.0 48.2 0.17
1 364 4 34.0 48.0 0.13
2 363 3 33.8 46.9 0.37
3 362 2 39.0 62.1 0.02
4 361 1 46.0 62.6 0.14

Merge datasets

Let's use Pandas to merge (combine) the taxi cab and weather datasets day-by-day.


In [16]:
data = pd.merge(weather, trips, on='daynumber')
data[:5]


Out[16]:
daynumber dayofweek mintemp maxtemp rain numtrips
0 365 5 46.0 48.2 0.17 339939
1 364 4 34.0 48.0 0.13 319649
2 363 3 33.8 46.9 0.37 311730
3 362 2 39.0 62.1 0.02 301398
4 361 1 46.0 62.6 0.14 268841

Exploratory analysis

Is there a relationship between maximum temperature and the number of trips?


In [17]:
j = data.plot(kind='scatter', x='maxtemp', y='numtrips')


/usr/local/envs/py3env/lib/python3.5/site-packages/matplotlib/font_manager.py:1320: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

The scatterplot above doesn't look very promising. There appears to be a weak downward trend, but it's also quite noisy.

Is there a relationship between the day of the week and the number of trips?


In [18]:
j = data.plot(kind='scatter', x='dayofweek', y='numtrips')



Hurrah, we seem to have found a predictor. It appears that people use taxis more later in the week. Perhaps New Yorkers make weekly resolutions to walk more and then lose their determination later in the week, or maybe it reflects tourism dynamics in New York City.

Perhaps if we took out the confounding effect of the day of the week, maximum temperature will start to have an effect. Let's see if that's the case:


In [19]:
j = data[data['dayofweek'] == 7].plot(kind='scatter', x='maxtemp', y='numtrips')



Removing the confounding factor does seem to reflect an underlying trend around temperature. But ... the data are a little sparse, don't you think? This is something that you have to keep in mind -- the more predictors you start to consider (here we are using two: day of week and maximum temperature), the more rows you will need so as to avoid overfitting the model.

Adding 2014 and 2016 data

Let's add in 2014 and 2016 data to the Pandas dataframe. Note how useful it was for us to modularize our queries around the YEAR.


In [20]:
data2 = data # 2015 data
for year in [2014, 2016]:
    weather = query_to_dataframe(wxquery(year))
    trips = query_to_dataframe(taxiquery(year))
    data_for_year = pd.merge(weather, trips, on='daynumber')
    data2 = pd.concat([data2, data_for_year])
data2.describe()


Requesting query... ok.
Job ID: 2b88936a-b1dd-4274-ae76-7a40e296589b
Query running...
Query done.
Processed: 157.0 MB Billed: 157.0 MB
Standard price: $0.00 USD

Retrieving results...
Got 365 rows.

Total time taken 2.31 s.
Finished at 2018-09-13 17:16:02.
Requesting query... ok.
Job ID: 137e41ae-ba45-43d8-abd9-03febff14b03
Query running...
Query done.
Processed: 1.2 GB Billed: 1.2 GB
Standard price: $0.01 USD

Retrieving results...
Got 365 rows.

Total time taken 3.35 s.
Finished at 2018-09-13 17:16:06.
Requesting query... ok.
Job ID: 86891ae0-ab3e-42b0-a825-e5c34b7f841a
Query running...
Query done.
Processed: 163.8 MB Billed: 164.0 MB
Standard price: $0.00 USD

Retrieving results...
Got 366 rows.

Total time taken 3.1 s.
Finished at 2018-09-13 17:16:10.
Requesting query... ok.
Job ID: 197ae5d6-4f79-4707-b5cd-f416b4443cdd
Query running...
Query done.
Processed: 1000.7 MB Billed: 1001.0 MB
Standard price: $0.00 USD

Retrieving results...
Got 366 rows.

Total time taken 3.59 s.
Finished at 2018-09-13 17:16:15.
Out[20]:
daynumber dayofweek mintemp maxtemp rain numtrips
count 1096.000000 1096.000000 1096.000000 1096.000000 1096.000000 1096.000000
mean 183.166971 4.005474 48.195073 66.151825 0.117272 403642.694343
std 105.510927 2.000449 18.031228 18.484065 0.320836 63767.524397
min 1.000000 1.000000 1.000000 21.000000 0.000000 78133.000000
25% 92.000000 2.000000 35.100000 51.950000 0.000000 363809.000000
50% 183.000000 4.000000 48.900000 68.000000 0.000000 402184.500000
75% 274.250000 6.000000 64.400000 82.900000 0.050000 447099.000000
max 366.000000 7.000000 82.000000 99.000000 4.880000 574530.000000

In [21]:
j = data2[data2['dayofweek'] == 7].plot(kind='scatter', x='maxtemp', y='numtrips')



The data do seem a bit more robust. If we had even more data, it would be better of course. But in this case, we only have 2014-2016 data for taxi trips, so that's what we will go with.

Machine Learning with Tensorflow

We'll use 80% of our dataset for training and 20% of the data for testing the model we have trained. Let's shuffle the rows of the Pandas dataframe so that this division is random. The predictor (or input) columns will be every column in the dataframe other than the day number and the number of trips (numtrips is our target, or what we want to predict).

The machine learning models that we will use -- linear regression and neural networks -- both require that the input variables are numeric in nature.

The day of the week, however, is a categorical variable (i.e. Tuesday is not really greater than Monday). So, we should create separate columns for whether it is a Monday (with values 0 or 1), Tuesday, etc.

Against that, we do have limited data (remember: the more columns you use as input features, the more rows you need to have in your training dataset), and it appears that there is a clear linear trend by day of the week. So, we will opt for simplicity here and use the data as-is. Try uncommenting the code that creates separate columns for the days of the week and re-run the notebook if you are curious about the impact of this simplification.
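If you do want to try the categorical encoding, `pd.get_dummies` is a standard pandas shortcut for the day-by-day loop in the commented-out code. Here is a minimal sketch on made-up rows; note that only the day values actually present in the data get a column:

```python
import pandas as pd

sample = pd.DataFrame({'dayofweek': [1, 3, 7],
                       'maxtemp': [43.0, 57.0, 91.0]})
# One 0.0/1.0 column per day value present; the 'day' prefix mirrors the
# 'day_1' ... 'day_7' naming used in the commented-out loop.
onehot = pd.get_dummies(sample['dayofweek'], prefix='day').astype(float)
encoded = pd.concat([sample[['maxtemp']], onehot], axis=1)
print(encoded.columns.tolist())
```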


In [22]:
import tensorflow as tf
shuffled = data2.sample(frac=1, random_state=13)
# It would be a good idea, if we had more data, to treat the days as categorical variables;
# with the small amount of data we have, though, the model tends to overfit
#predictors = shuffled.iloc[:,2:5]
#for day in range(1,8):
#  matching = shuffled['dayofweek'] == day
#  key = 'day_' + str(day)
#  predictors[key] = pd.Series(matching, index=predictors.index, dtype=float)
predictors = shuffled.iloc[:,1:5]
predictors[:5]


/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Out[22]:
dayofweek mintemp maxtemp rain
9 2 32.0 43.0 0.00
279 6 37.9 57.0 0.52
163 5 71.1 91.0 0.00
225 6 55.9 78.1 0.00
218 6 55.0 91.9 0.00

In [23]:
shuffled[:5]


Out[23]:
daynumber dayofweek mintemp maxtemp rain numtrips
9 356 2 32.0 43.0 0.00 382112
279 86 6 37.9 57.0 0.52 465493
163 203 5 71.1 91.0 0.00 363728
225 141 6 55.9 78.1 0.00 414711
218 148 6 55.0 91.9 0.00 364951

In [24]:
targets = shuffled.iloc[:,5]
targets[:5]


Out[24]:
9      382112
279    465493
163    363728
225    414711
218    364951
Name: numtrips, dtype: int64

Let's update our benchmark based on the 80-20 split and the larger dataset.


In [25]:
trainsize = int(len(shuffled['numtrips']) * 0.8)
avg = np.mean(shuffled['numtrips'][:trainsize])
rmse = np.sqrt(np.mean((targets[trainsize:] - avg)**2))
print('Just using average={0} has RMSE of {1}'.format(avg, rmse))


Just using average=402667.6826484018 has RMSE of 62394.11232075195

Linear regression with tf.contrib.learn

We scale the number of taxicab rides by 600,000 (the SCALE_NUM_TRIPS constant below) so that the model can keep its predicted values in the [0-1] range. The optimization goes a lot faster when the weights are small numbers. We save the weights into ./trained_model_linear and display the root-mean-square error on the test dataset.
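The scaling is simply a divide on the way into the model and a multiply on the way out; a minimal numpy sketch of the round trip, using trip counts from the shuffled data shown earlier:

```python
import numpy as np

SCALE_NUM_TRIPS = 600000.0
targets = np.array([382112.0, 465493.0, 363728.0])  # daily trip counts

scaled = targets / SCALE_NUM_TRIPS   # what the model trains on, roughly in [0, 1]
restored = scaled * SCALE_NUM_TRIPS  # what we do to the model's predictions
assert np.allclose(restored, targets)
print(scaled)
```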


In [26]:
SCALE_NUM_TRIPS = 600000.0
trainsize = int(len(shuffled['numtrips']) * 0.8)
testsize = len(shuffled['numtrips']) - trainsize
npredictors = len(predictors.columns)
noutputs = 1
tf.logging.set_verbosity(tf.logging.WARN) # change to INFO to get output every 100 steps ...
shutil.rmtree('./trained_model_linear', ignore_errors=True) # so that we don't load weights from previous runs
estimator = tf.contrib.learn.LinearRegressor(model_dir='./trained_model_linear',
                                             feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values))

print("starting to train ... this will take a while ... use verbosity=INFO to get more verbose output")
def input_fn(features, targets):
  return tf.constant(features.values), tf.constant(targets.values.reshape(len(targets), noutputs)/SCALE_NUM_TRIPS)
estimator.fit(input_fn=lambda: input_fn(predictors[:trainsize], targets[:trainsize]), steps=10000)

pred = np.multiply(list(estimator.predict(predictors[trainsize:].values)), SCALE_NUM_TRIPS )
rmse = np.sqrt(np.mean(np.power((targets[trainsize:].values - pred), 2)))
print('LinearRegression has RMSE of {0}'.format(rmse))


WARNING:tensorflow:From <ipython-input-26-3865241c1b45>:9: infer_real_valued_columns_from_input (from tensorflow.contrib.learn.python.learn.estimators.estimator) is deprecated and will be removed in a future version.
Instructions for updating:
Please specify feature columns explicitly.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:142: setup_train_data_feeder (from tensorflow.contrib.learn.python.learn.learn_io.data_feeder) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:96: extract_dask_data (from tensorflow.contrib.learn.python.learn.learn_io.dask_io) is deprecated and will be removed in a future version.
Instructions for updating:
Please feed input to tf.data to support dask.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:100: extract_pandas_data (from tensorflow.contrib.learn.python.learn.learn_io.pandas_io) is deprecated and will be removed in a future version.
Instructions for updating:
Please access pandas data directly.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:159: DataFeeder.__init__ (from tensorflow.contrib.learn.python.learn.learn_io.data_feeder) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:340: check_array (from tensorflow.contrib.learn.python.learn.learn_io.data_feeder) is deprecated and will be removed in a future version.
Instructions for updating:
Please convert numpy dtypes explicitly.
WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:182: infer_real_valued_columns_from_input_fn (from tensorflow.contrib.learn.python.learn.estimators.estimator) is deprecated and will be removed in a future version.
Instructions for updating:
Please specify feature columns explicitly.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/linear.py:738: regression_head (from tensorflow.contrib.learn.python.learn.estimators.head) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:1179: BaseEstimator.__init__ (from tensorflow.contrib.learn.python.learn.estimators.estimator) is deprecated and will be removed in a future version.
Instructions for updating:
Please replace uses of any Estimator from tf.contrib.learn with an Estimator from tf.estimator.*
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:427: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
starting to train ... this will take a while ... use verbosity=INFO to get more verbose output
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:678: ModelFnOps.__new__ (from tensorflow.contrib.learn.python.learn.estimators.model_fn) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.EstimatorSpec. You can use the `estimator_spec` method to create an equivalent one.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py:497: calling LinearRegressor.predict (from tensorflow.contrib.learn.python.learn.estimators.linear) with outputs=None is deprecated and will be removed after 2017-03-01.
Instructions for updating:
Please switch to predict_scores, or set `outputs` argument.
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/linear.py:843: calling BaseEstimator.predict (from tensorflow.contrib.learn.python.learn.estimators.estimator) with x is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
LinearRegression has RMSE of 56643.95363906046

That the RMSE here (57K) is lower than the benchmark (62K) indicates that we are doing about 10% better with the machine learning model than we would be if we were to just use the historical average (our benchmark).
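The "about 10%" figure is just the relative reduction in RMSE (more precisely, about 9%):

```python
benchmark_rmse = 62394.11  # constant-average benchmark from above
model_rmse = 56643.95      # linear regression from above
improvement = 1.0 - model_rmse / benchmark_rmse
print('{:.1%}'.format(improvement))
```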

Neural network with tf.contrib.learn

Let's make a more complex model with a few hidden nodes.


In [27]:
SCALE_NUM_TRIPS = 600000.0
trainsize = int(len(shuffled['numtrips']) * 0.8)
testsize = len(shuffled['numtrips']) - trainsize
npredictors = len(predictors.columns)
noutputs = 1
tf.logging.set_verbosity(tf.logging.WARN) # change to INFO to get output every 100 steps ...
shutil.rmtree('./trained_model', ignore_errors=True) # so that we don't load weights from previous runs
estimator = tf.contrib.learn.DNNRegressor(model_dir='./trained_model',
                                          hidden_units=[5, 5],                             
                                          feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values))

print("starting to train ... this will take a while ... use verbosity=INFO to get more verbose output")
def input_fn(features, targets):
  return tf.constant(features.values), tf.constant(targets.values.reshape(len(targets), noutputs)/SCALE_NUM_TRIPS)
estimator.fit(input_fn=lambda: input_fn(predictors[:trainsize], targets[:trainsize]), steps=10000)

pred = np.multiply(list(estimator.predict(predictors[trainsize:].values)), SCALE_NUM_TRIPS )
rmse = np.sqrt(np.mean((targets[trainsize:].values - pred)**2))
print('Neural Network Regression has RMSE of {0}'.format(rmse))


WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
starting to train ... this will take a while ... use verbosity=INFO to get more verbose output
WARNING:tensorflow:From /usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py:497: calling DNNRegressor.predict (from tensorflow.contrib.learn.python.learn.estimators.dnn) with outputs=None is deprecated and will be removed after 2017-03-01.
Instructions for updating:
Please switch to predict_scores, or set `outputs` argument.
WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
Neural Network Regression has RMSE of 62370.46887832732

In this run, the neural network's RMSE is close to the benchmark and higher than the linear model's -- it might be that there isn't enough data for the NN to do much better. (NN training is a non-convex optimization, and you will get different results each time you run the above code.)

Running a trained model

So, we have trained a model, and saved it to a file. Let's use this model to predict taxicab demand given the expected weather for three days.

Here we make a Dataframe out of those inputs, load up the saved model (note that we have to reconstruct the model architecture in code -- only the trained weights are saved in the model directory) and use it to predict the taxicab demand.


In [28]:
inputs = pd.DataFrame.from_dict(data = 
                                {'dayofweek' : [4, 5, 6],
                                 'mintemp' : [60, 40, 50],
                                 'maxtemp' : [70, 90, 60],
                                 'rain' : [0, 0.5, 0]})
# read the trained linear model from ./trained_model_linear
estimator = tf.contrib.learn.LinearRegressor(model_dir='./trained_model_linear',
                                          feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(inputs.values))

pred = np.multiply(list(estimator.predict(inputs.values)), SCALE_NUM_TRIPS )
print(pred)


WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
[354762.78 306764.7  387226.62]

Looks like we should tell some of our taxi drivers to take the day off on Thursday (day=5). No wonder -- the forecast calls for extreme weather fluctuations on Thursday.

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License