In [1]:
import h2o
import time

In [2]:
# Explore a typical Data Science workflow with H2O and Python
#
# Goal: assist the manager of CitiBike of NYC to load-balance the bicycles
# across the CitiBike network of stations, by predicting the number of bike
# trips taken from each station each day.  Use 10 million rows of historical
# data, and eventually add weather data.


# Connect to a cluster
h2o.init()


H2O cluster uptime: 1 hours 54 minutes 16 seconds 936 milliseconds
H2O cluster version: 3.1.0.99999
H2O cluster name: spencer
H2O cluster total nodes: 1
H2O cluster total memory: 14.22 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [3]:
# Set this to True if you want to fetch the data directly from S3.
# This is useful if your cluster is running in EC2.
data_source_is_s3 = False

def mylocate(s):
    if data_source_is_s3:
        return "s3n://h2o-public-test-data/" + s
    else:
        return h2o.locate(s)

In [4]:
# Pick either the big or the small demo.
# The big dataset totals about 10 million rows across 14 monthly files.
small_test = [mylocate("bigdata/laptop/citibike-nyc/2013-10.csv")]
big_test =   [mylocate("bigdata/laptop/citibike-nyc/2013-07.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-08.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-09.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-10.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-11.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-12.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-01.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-02.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-03.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-04.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-05.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-06.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-07.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-08.csv")]

# ----------

# 1- Load data - one row per bicycle trip, with columns for the start and end
# station, trip duration, and trip start time and day.  The larger dataset
# totals about 10 million rows.
print "Import and Parse bike data"
data = h2o.import_frame(path=small_test)


Import and Parse bike data

Parse Progress: [##################################################] 100%

Parsed 1,037,712 rows and 15 cols:

File1 /Users/spencer/0xdata/h2o-dev/bigdata/laptop/citibike-nyc/2013-10.csv

In [5]:
# ----------

# 2- light data munging: group the bike starts per-day, converting the 10M rows
# of trips to about 140,000 station&day combos - predicting the number of trip
# starts per-station-per-day.

# Convert start time to days since the epoch.  The timestamps are in
# milliseconds, so divide by milliseconds-per-day.
starttime = data["starttime"]
msecPerDay = 1000*60*60*24
data["Days"] = (starttime/msecPerDay).floor()
data.describe()


Rows: 1,037,712 Cols: 16

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 17 2.2135415 1.3 KB 0.0022871785
C1 1-Byte Integers 48 6.25 1016.6 KB 1.7506603
C1N 1-Byte Integers (w/o NAs) 48 6.25 1016.6 KB 1.7506603
C1S 1-Byte Fractions 79 10.286459 1.6 MB 2.8878725
C2 2-Byte Integers 243 31.640625 10.0 MB 17.696283
C2S 2-Byte Fractions 49 6.3802085 2.0 MB 3.5701983
C4 4-Byte Integers 32 4.166667 2.6 MB 4.6726856
C4S 4-Byte Fractions 39 5.078125 3.2 MB 5.6373096
C8 64-bit Integers 60 7.8125 9.9 MB 17.432673
C8D 64-bit Reals 153 19.921875 25.3 MB 44.59937
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 56.7 MB 1037712.0 48.0 768.0
mean 56.7 MB 1037712.0 48.0 768.0
min 56.7 MB 1037712.0 48.0 768.0
max 56.7 MB 1037712.0 48.0 768.0
stddev 0 B 0.0 0.0 0.0
total 56.7 MB 1037712.0 48.0 768.0
Column-by-Column Summary:

tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender Days
type int time time int enum real real int enum real real int enum int int int
mins 60.0 1.380610868e+12 1.380611083e+12 72.0 0.0 40.680342423 -74.01713445 72.0 0.0 40.680342423 -74.01713445 14529.0 0.0 1899.0 0.0 15979.0
maxs 1259480.0 1.383289197e+12 1.38341851e+12 3002.0 329.0 40.770513 -73.9500479759 3002.0 329.0 40.770513 -73.9500479759 20757.0 1.0 1997.0 2.0 16010.0
sigma 2000.3732323 778871729.131 778847387.504 354.434325075 100.299091797 0.0195734073053 0.0123161234106 357.398217058 100.415965049 0.0195578458116 0.0123855811965 1717.68112134 0.291696182123 11.1314906238 0.544380593291 9.02215033588
zero_count 0 0 0 0 5239 0 0 0 5449 0 0 0 97446 0 97498 0
missing_count 0 0 0 0 0 0 0 0 0 0 0 0 0 97445 0 0
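
A quick plain-Python sanity check of the epoch-day arithmetic (a sketch, not
part of the demo, using only the standard library):

    from datetime import datetime
    # 2013-10-01 is the first day of the small dataset; its epoch-day number
    # should match the "Days" minimum of 15979 in the summary above.
    epoch_days = (datetime(2013, 10, 1) - datetime(1970, 1, 1)).days
    print epoch_days   # -> 15979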

In [6]:
# Now do a monster Group-By.  Count bike starts per-station per-day.  With the
# full dataset this yields about 340 stations times 400 days (roughly 140,000
# rows).  This is what we want to predict.
group_by_cols = ["Days","start station name"]
# Aggregate spec: {output column name: [op, column index to aggregate, NA handling]}
aggregates = {"bikes": ["count", 0, "all"]}
bpd = data.group_by(cols=group_by_cols, aggregates=aggregates) # Compute bikes-per-day
bpd.show()
bpd.describe()
bpd.dim()


First 10 rows and first 3 columns: 
Days start station name bikes
15981 Fulton St & William St 89
15980 9 Ave & W 18 St 137
16007 W 27 St & 7 Ave 203
15979 Bedford Ave & S 9th St 8
15995 Washington Ave & Park Ave 29
16005 W 13 St & 6 Ave 138
15979 11 Ave & W 27 St 139
15986 Central Park S & 6 Ave 123
16004 John St & William St 60
15989 Allen St & Hester St 110
Rows: 10,450 Cols: 3

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C1S 1-Byte Fractions 32 33.333336 12.8 KB 22.15888
C2 2-Byte Integers 64 66.66667 45.1 KB 77.84112
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 57.9 KB 10450.0 32.0 96.0
mean 57.9 KB 10450.0 32.0 96.0
min 57.9 KB 10450.0 32.0 96.0
max 57.9 KB 10450.0 32.0 96.0
stddev 0 B 0.0 0.0 0.0
total 57.9 KB 10450.0 32.0 96.0
Column-by-Column Summary:

Days start station name bikes
type int enum int
mins 15979.0 0.0 1.0
maxs 16010.0 329.0 553.0
sigma 9.23370172444 95.4454360013 72.9721964301
zero_count 0 32 0
missing_count 0 0 0
Out[6]:
[10450, 3]
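
For readers who think in pandas, a rough local equivalent of this group-by is
the sketch below (hypothetical, not part of the demo; it assumes the same
monthly CSV is available on local disk):

    import pandas as pd
    df = pd.read_csv("2013-10.csv", parse_dates=["starttime"])
    df["Days"] = (df["starttime"] - pd.Timestamp("1970-01-01")).dt.days
    # One row per (day, station) pair, counting trips started there that day.
    bpd_pd = (df.groupby(["Days", "start station name"])
                .size().reset_index(name="bikes"))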

In [7]:
# Quantiles: the data is fairly unbalanced; some station/day combos are wildly
# more popular than others.
print "Quantiles of bikes-per-day"
bpd["bikes"].quantile().show()


Quantiles of bikes-per-day
First 9 rows and first 2 columns: 
Probs bikesQuantiles
0.01 4.490000000000002
0.1 19.0
0.25 43.0
0.333 57.0
0.5 87.0
0.667 118.0
0.75 137.0
0.9 192.0
0.99 334.51
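
The same probabilities can be checked locally with NumPy (a sketch, assuming
a hypothetical local list bikes_local of the per-station-per-day counts):

    import numpy as np
    probs = [0.01, 0.1, 0.25, 0.333, 0.5, 0.667, 0.75, 0.9, 0.99]
    print np.percentile(bikes_local, [100 * p for p in probs])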

In [8]:
# A little feature engineering
# Add in month-of-year (seasonality; fewer bike rides in winter than summer)
msec = bpd["Days"]*msecPerDay
bpd["Month"]     = msec.month().asfactor()
# Add in day-of-week (work-week; fewer bike rides on Sunday than on Monday)
bpd["DayOfWeek"] = msec.dayOfWeek()
print "Bikes-Per-Day"
bpd.describe()


Bikes-Per-Day
Rows: 10,450 Cols: 5

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
CBS Bits 32 20.0 3.5 KB 4.7066307
C1N 1-Byte Integers (w/o NAs) 32 20.0 12.3 KB 16.73027
C1S 1-Byte Fractions 32 20.0 12.8 KB 17.408703
C2 2-Byte Integers 64 40.0 45.1 KB 61.154396
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 73.7 KB 10450.0 32.0 160.0
mean 73.7 KB 10450.0 32.0 160.0
min 73.7 KB 10450.0 32.0 160.0
max 73.7 KB 10450.0 32.0 160.0
stddev 0 B 0.0 0.0 0.0
total 73.7 KB 10450.0 32.0 160.0
Column-by-Column Summary:

Days start station name bikes Month DayOfWeek
type int enum int enum enum
mins 15979.0 0.0 1.0 0.0 0.0
maxs 16010.0 329.0 553.0 1.0 6.0
sigma 9.23370172444 95.4454360013 72.9721964301 0.174371128617 1.9765802868
zero_count 0 32 0 328 1635
missing_count 0 0 0 0 0
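
The same derivation for a single epoch day in plain Python (a sketch; this
version computes in UTC, whereas H2O applies the cluster's time zone, so
values can shift by a day near day boundaries):

    from datetime import datetime
    # Epoch day -> seconds -> calendar fields.
    d = datetime.utcfromtimestamp(15980 * 24 * 60 * 60)
    print d.month, d.strftime("%a")   # -> 10 Wed (in UTC)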

In [9]:
# ----------
# 3- Fit a model on train; using test as validation

# Function for doing a classic test/train/holdout split
def split_fit_predict(data):
  global gbm0,drf0,glm0,dl0
  # Classic Test/Train split
  r = data['Days'].runif()   # Random UNIForm numbers, one per row
  train = data[  r  < 0.6]
  test  = data[(0.6 <= r) & (r < 0.9)]
  hold  = data[ 0.9 <= r ]
  print "Training data has",train.ncol(),"columns and",train.nrow(),"rows, test has",test.nrow(),"rows, holdout has",hold.nrow()
  
  # Run GBM
  s = time.time()
  gbm0 = h2o.gbm(x           =train.drop("bikes"),
                 y           =train     ["bikes"],
                 validation_x=test .drop("bikes"),
                 validation_y=test      ["bikes"],
                 ntrees=500, # 500 works well
                 max_depth=6,
                 learn_rate=0.1)
  gbm_elapsed = time.time() - s

  # Run DRF
  s = time.time()
  drf0 = h2o.random_forest(x =train.drop("bikes"),
                y           =train     ["bikes"],
                validation_x=test .drop("bikes"),
                validation_y=test      ["bikes"],
                ntrees=250,
                max_depth=30)
  drf_elapsed = time.time() - s 
    
    
  # Run GLM
  s = time.time()
  glm0 = h2o.glm(x           =train.drop("bikes"),
                 y           =train     ["bikes"],
                 validation_x=test .drop("bikes"),
                 validation_y=test      ["bikes"],
                 Lambda=[1e-5],
                 family="poisson")
  glm_elapsed = time.time() - s
  
  # Run DL
  s = time.time()
  dl0 = h2o.deeplearning(x  =train.drop("bikes"),
                y           =train     ["bikes"],
                validation_x=test .drop("bikes"),
                validation_y=test      ["bikes"],
                hidden=[50,50,50,50],
                epochs=50)
  dl_elapsed = time.time() - s
  
  # ----------
  # 4- Score on holdout set & report
  train_r2_gbm = gbm0.model_performance(train).r2()
  test_r2_gbm  = gbm0.model_performance(test ).r2()
  hold_r2_gbm  = gbm0.model_performance(hold ).r2()
#   print "GBM R2 TRAIN=",train_r2_gbm,", R2 TEST=",test_r2_gbm,", R2 HOLDOUT=",hold_r2_gbm
  
  train_r2_drf = drf0.model_performance(train).r2()
  test_r2_drf  = drf0.model_performance(test ).r2()
  hold_r2_drf  = drf0.model_performance(hold ).r2()
#   print "DRF R2 TRAIN=",train_r2_drf,", R2 TEST=",test_r2_drf,", R2 HOLDOUT=",hold_r2_drf
  
  train_r2_glm = glm0.model_performance(train).r2()
  test_r2_glm  = glm0.model_performance(test ).r2()
  hold_r2_glm  = glm0.model_performance(hold ).r2()
#   print "GLM R2 TRAIN=",train_r2_glm,", R2 TEST=",test_r2_glm,", R2 HOLDOUT=",hold_r2_glm
    
  train_r2_dl = dl0.model_performance(train).r2()
  test_r2_dl  = dl0.model_performance(test ).r2()
  hold_r2_dl  = dl0.model_performance(hold ).r2()
#   print " DL R2 TRAIN=",train_r2_dl,", R2 TEST=",test_r2_dl,", R2 HOLDOUT=",hold_r2_dl
    
  # make a pretty HTML table printout of the results

  header = ["Model", "R2 TRAIN", "R2 TEST", "R2 HOLDOUT", "Model Training Time (s)"]
  table  = [
            ["GBM", train_r2_gbm, test_r2_gbm, hold_r2_gbm, round(gbm_elapsed,3)],
            ["DRF", train_r2_drf, test_r2_drf, hold_r2_drf, round(drf_elapsed,3)],
            ["GLM", train_r2_glm, test_r2_glm, hold_r2_glm, round(glm_elapsed,3)],
            ["DL ", train_r2_dl,  test_r2_dl,  hold_r2_dl , round(dl_elapsed,3) ],
           ]
  h2o.H2ODisplay(table,header)
  # --------------
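
For reference, the r2() calls above report the usual coefficient of
determination; a minimal NumPy sketch of the same statistic (hypothetical
arrays y and yhat, not part of the demo):

    import numpy as np
    def r2(y, yhat):
        y, yhat = np.asarray(y, float), np.asarray(yhat, float)
        ss_res = np.sum((y - yhat) ** 2)        # residual sum of squares
        ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
        return 1.0 - ss_res / ss_tot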

In [10]:
# Split the data (train/test/holdout), fit some models and predict on the holdout data
split_fit_predict(bpd)
# Here we see an R^2 of about 0.92 for GBM and 0.84 for GLM.  This means that
# given just the station, the month, and the day-of-week we can predict over
# 90% of the variance of the bike-trip-starts.


Training data has 5 columns and 6292 rows, test has 3135 rows, holdout has 1023

gbm Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%
Model R2 TRAIN R2 TEST R2 HOLDOUT Model Training Time (s)
GBM 0.997349130174 0.924028146861 0.925064871886 6.851
DRF 0.840283339977 0.779143049269 0.783155364777 5.716
GLM 0.861354633367 0.840104405497 0.842517809915 0.228
DL 0.961708220985 0.915753071791 0.921853286131 6.858

In [11]:
# ----------
# 5- Now let's add some weather
# Load weather data
wthr1 = h2o.import_frame(path=[mylocate("bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2013.csv"),
                               mylocate("bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2014.csv")])
# Peek at the data
wthr1.describe()


Parse Progress: [##################################################] 100%

Parsed 17,520 rows and 50 cols:

File1 /Users/spencer/0xdata/h2o-dev/bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2013.csv
File2 /Users/spencer/0xdata/h2o-dev/bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2014.csv
Rows: 17,520 Cols: 50

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 107 6.294118 8.4 KB 0.7889721
C0D Constant Reals 436 25.647058 34.1 KB 3.2148771
CXI Sparse Integers 17 1.0 1.5 KB 0.13991351
C1 1-Byte Integers 346 20.352942 197.4 KB 18.634672
C1N 1-Byte Integers (w/o NAs) 214 12.588236 122.3 KB 11.544063
C1S 1-Byte Fractions 214 12.588236 125.3 KB 11.822968
C2S 2-Byte Fractions 196 11.529412 214.5 KB 20.242111
C4S 4-Byte Fractions 170 10.0 356.1 KB 33.612423
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 1.0 MB 17520.0 34.0 1700.0
mean 1.0 MB 17520.0 34.0 1700.0
min 1.0 MB 17520.0 34.0 1700.0
max 1.0 MB 17520.0 34.0 1700.0
stddev 0 B 0.0 0.0 0.0
total 1.0 MB 17520.0 34.0 1700.0
Column-by-Column Summary:

Year Local Month Local Day Local Hour Local Year UTC Month UTC Day UTC Hour UTC Cavok Reported Cloud Ceiling (m) Cloud Cover Fraction Cloud Cover Fraction 1 Cloud Cover Fraction 2 Cloud Cover Fraction 3 Cloud Cover Fraction 4 Cloud Cover Fraction 5 Cloud Cover Fraction 6 Cloud Height (m) 1 Cloud Height (m) 2 Cloud Height (m) 3 Cloud Height (m) 4 Cloud Height (m) 5 Cloud Height (m) 6 Dew Point (C) Humidity Fraction Precipitation One Hour (mm) Pressure Altimeter (mbar) Pressure Sea Level (mbar) Pressure Station (mbar) Snow Depth (cm) Temperature (C) Visibility (km) Weather Code 1 Weather Code 1/ Description Weather Code 2 Weather Code 2/ Description Weather Code 3 Weather Code 3/ Description Weather Code 4 Weather Code 4/ Description Weather Code 5 Weather Code 5/ Description Weather Code 6 Weather Code 6/ Description Weather Code Most Severe / Icon Code Weather Code Most Severe Weather Code Most Severe / Description Wind Direction (degrees) Wind Gust (m/s) Wind Speed (m/s)
type int int int int int int int int int real real real real real int int int real real real int int int real real real real int int int real real int enum int enum int enum int enum int enum int enum int int enum int real real
mins 2013.0 1.0 1.0 0.0 2013.0 1.0 1.0 0.0 0.0 61.0 0.0 0.0 0.25 0.5 NaN NaN NaN 60.96 213.36 365.76 NaN NaN NaN -26.7 0.1251 0.0 983.2949 NaN NaN NaN -15.6 0.001 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 3.0 0.0 0.0 1.0 0.0 10.0 7.2 0.0
maxs 2014.0 12.0 31.0 23.0 2015.0 12.0 31.0 23.0 0.0 3657.6 1.0 1.0 1.0 1.0 NaN NaN NaN 3657.5999 3657.5999 3657.5999 NaN NaN NaN 24.4 1.0 26.924 1042.2113 NaN NaN NaN 36.1 16.0934 60.0 11.0 60.0 10.0 36.0 7.0 27.0 4.0 27.0 2.0 3.0 0.0 16.0 60.0 11.0 360.0 20.58 10.8
sigma 0.500014270017 3.44794972385 8.79649804852 6.92238411188 0.500584411716 3.44782405458 8.79561488868 6.92230165203 0.0 995.339856966 0.462720830993 0.42770569708 0.197155690367 0.0861015598104 NaN NaN NaN 962.743095854 916.73861349 887.215847511 NaN NaN NaN 10.9731282097 0.185792011866 2.56215129179 7.46451697179 NaN NaN NaN 10.0396739531 3.69893623033 5.70486576983 2.47814708663 6.13386253912 1.83976235335 5.80553286364 1.28967553698 3.12340844261 1.282164369 6.15223536611 0.60207972894 0.0 0.0 4.07386062702 5.70486576983 2.47814708663 106.350000031 1.81511871115 1.61469790524
zero_count 0 0 0 730 0 0 0 730 17455 0 8758 8758 0 0 0 0 0 0 0 0 0 0 0 268 0 501 0 0 0 0 269 0 0 17 0 30 0 13 0 20 0 12 0 2 14980 0 17 0 0 2768
missing_count 0 0 0 0 0 0 0 0 65 10780 375 375 14682 16535 17520 17520 17520 9103 14683 16535 17520 17520 17520 67 67 15660 360 17520 17520 17520 67 412 14980 14980 16477 16477 17181 17181 17433 17433 17504 17504 17518 17518 0 14980 14980 9382 14381 1283

In [12]:
# Lots of columns in there!  Let's plan on converting to time-since-epoch to
# do a 'join' with the bike data, plus gather weather info that might affect
# cyclists: rain, snow, temperature.  Alas, we drop the snow column since it's
# all NAs.  Also add in dew point and humidity just in case.  Slice out just
# the columns of interest and drop the rest.
wthr2 = wthr1[["Year Local","Month Local","Day Local","Hour Local","Dew Point (C)","Humidity Fraction","Precipitation One Hour (mm)","Temperature (C)","Weather Code 1/ Description"]]

wthr2.setName(wthr2.index("Precipitation One Hour (mm)"), "Rain (mm)")
wthr2.setName(wthr2.index("Weather Code 1/ Description"), "WC1")
wthr2.describe()
# Much better!


Rows: 17,520 Cols: 9

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 46 15.0326805 3.6 KB 1.780005
C1 1-Byte Integers 34 11.111112 19.4 KB 9.592678
C1N 1-Byte Integers (w/o NAs) 90 29.411766 51.5 KB 25.494701
C1S 1-Byte Fractions 42 13.725491 24.0 KB 11.894592
C2S 2-Byte Fractions 94 30.718956 103.4 KB 51.238026
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 201.9 KB 17520.0 34.0 306.0
mean 201.9 KB 17520.0 34.0 306.0
min 201.9 KB 17520.0 34.0 306.0
max 201.9 KB 17520.0 34.0 306.0
stddev 0 B 0.0 0.0 0.0
total 201.9 KB 17520.0 34.0 306.0
Column-by-Column Summary:

Year Local Month Local Day Local Hour Local Dew Point (C) Humidity Fraction Rain (mm) Temperature (C) WC1
type int int int int real real real real enum
mins 2013.0 1.0 1.0 0.0 -26.7 0.1251 0.0 -15.6 0.0
maxs 2014.0 12.0 31.0 23.0 24.4 1.0 26.924 36.1 11.0
sigma 0.500014270017 3.44794972385 8.79649804852 6.92238411188 10.9731282097 0.185792011866 2.56215129179 10.0396739531 2.47814708663
zero_count 0 0 0 730 268 0 501 269 17
missing_count 0 0 0 0 67 67 15660 67 14980

In [13]:
# Filter down to the weather at Noon
wthr3 = wthr2[ wthr2["Hour Local"]==12 ]

In [14]:
# Let's now get days since the epoch: we'll convert year/month/day into epoch
# time, and then back to epoch days.  mktime needs zero-based months and days,
# but the data is 1-based.
wthr3["msec"] = h2o.H2OFrame.mktime(year=wthr3["Year Local"], month=wthr3["Month Local"]-1, day=wthr3["Day Local"]-1, hour=wthr3["Hour Local"])
msecPerDay = 1000*60*60*24
wthr3["Days"] = (wthr3["msec"]/msecPerDay).floor()
wthr3.describe()
# msec looks sane (numbers like 1.3e12 are in the correct range for msec since
# 1970).  The epoch-day numbers match closely with those from the CitiBike
# dataset.


Rows: 730 Cols: 11

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 80 21.390373 6.3 KB 11.955688
C0D Constant Reals 13 3.4759357 1.0 KB 1.9427994
C1 1-Byte Integers 30 8.021391 2.6 KB 5.0176535
C1N 1-Byte Integers (w/o NAs) 56 14.973262 4.9 KB 9.375875
C1S 1-Byte Fractions 34 9.090909 3.5 KB 6.698922
C2S 2-Byte Fractions 34 9.090909 4.2 KB 8.062618
C8D 64-bit Reals 127 33.95722 29.8 KB 56.946445
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 52.3 KB 730.0 34.0 374.0
mean 52.3 KB 730.0 34.0 374.0
min 52.3 KB 730.0 34.0 374.0
max 52.3 KB 730.0 34.0 374.0
stddev 0 B 0.0 0.0 0.0
total 52.3 KB 730.0 34.0 374.0
Column-by-Column Summary:

Year Local Month Local Day Local Hour Local Dew Point (C) Humidity Fraction Rain (mm) Temperature (C) WC1 msec Days
type int int int int real real real real enum int int
mins 2013.0 1.0 1.0 12.0 -26.7 0.1723 0.0 -13.9 0.0 1.3570704e+12 15706.0
maxs 2014.0 12.0 31.0 12.0 23.3 1.0 12.446 34.4 10.0 1.420056e+12 16435.0
sigma 0.500342818004 3.45021529307 8.80227802701 0.0 11.1062964725 0.179945027923 2.36064248615 10.3989855149 2.74674726123 18219740080.4 210.877136425
zero_count 0 0 0 0 14 0 15 7 1 0 0
missing_count 0 0 0 0 3 3 660 3 620 0 0
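
A plain-Python cross-check of the conversion (a sketch; calendar.timegm works
in UTC, while the msec minimum above is 20:00 UTC, i.e. noon US Pacific, so
H2O's mktime apparently used the cluster's local time zone.  The floor-to-days
step absorbs that difference for a noon timestamp):

    import calendar
    # Noon on 2013-01-01, the first weather row, as msec since the epoch.
    msec = calendar.timegm((2013, 1, 1, 12, 0, 0)) * 1000
    print msec // (1000 * 60 * 60 * 24)   # -> 15706, the "Days" minimum above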

In [15]:
# Let's drop the extra time columns to make an easy-to-handle dataset.
wthr4 = wthr3.drop("Year Local").drop("Month Local").drop("Day Local").drop("Hour Local").drop("msec")

In [16]:
# Also, most rain values are missing; let's assume those are zero-rain days
rain = wthr4["Rain (mm)"]
rain[ rain.isna() ] = 0

In [17]:
# ----------
# 6 - Join the weather data-per-day to the bike-starts-per-day
print "Merge Daily Weather with Bikes-Per-Day"
bpd_with_weather = bpd.merge(wthr4,allLeft=True,allRite=False)
bpd_with_weather.describe()
bpd_with_weather.show()


Merge Daily Weather with Bikes-Per-Day
Rows: 10,450 Cols: 10

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
CBS Bits 32 10.0 3.5 KB 0.82375145
C1 1-Byte Integers 32 10.0 12.3 KB 2.928121
C1N 1-Byte Integers (w/o NAs) 32 10.0 12.3 KB 2.928121
C1S 1-Byte Fractions 32 10.0 12.8 KB 3.0468602
C2 2-Byte Integers 64 20.0 45.1 KB 10.703204
C8D 64-bit Reals 128 40.0 335.1 KB 79.56994
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 421.1 KB 10450.0 32.0 320.0
mean 421.1 KB 10450.0 32.0 320.0
min 421.1 KB 10450.0 32.0 320.0
max 421.1 KB 10450.0 32.0 320.0
stddev 0 B 0.0 0.0 0.0
total 421.1 KB 10450.0 32.0 320.0
Column-by-Column Summary:

Days start station name bikes Month DayOfWeek Humidity Fraction Rain (mm) Temperature (C) WC1 Dew Point (C)
type int enum int enum enum real real real enum real
mins 15979.0 0.0 1.0 0.0 0.0 0.3485 0.0 9.4 2.0 -2.2
maxs 16010.0 329.0 553.0 1.0 6.0 0.8718 0.254 26.1 8.0 19.4
sigma 9.23370172444 95.4454360013 72.9721964301 0.174371128617 1.9765802868 0.149631413472 0.127095436911 4.29746634617 2.60111075432 6.49151146664
zero_count 0 32 0 328 1635 0 330 0 0 0
missing_count 0 0 0 0 0 0 9793 0 9134 0
First 10 rows and first 10 columns: 
Days start station name bikes Month DayOfWeek Humidity Fraction Rain (mm) Temperature (C) WC1 Dew Point (C)
15981 Fulton St & William St 89 10 Wed 0.4973 23.3 12.200000000000001
15980 9 Ave & W 18 St 137 10 Tue 0.5019 25.0 13.9
16007 W 27 St & 7 Ave 203 10 Mon 0.4071 10.600000000000001 -2.2
15979 Bedford Ave & S 9th St 8 9 Mon 0.4315 23.900000000000002 10.600000000000001
15995 Washington Ave & Park Ave 29 10 Wed 0.6765 20.6 14.4
16005 W 13 St & 6 Ave 138 10 Sat 0.38180000000000003 12.8 -1.1
15979 11 Ave & W 27 St 139 9 Mon 0.4315 23.900000000000002 10.600000000000001
15986 Central Park S & 6 Ave 123 10 Mon 0.48200000000000004 17.8 6.7
16004 John St & William St 60 10 Fri 0.4238 10.0 -2.2
15989 Allen St & Hester St 110 10 Thu 0.8631000000000001 0.254 16.7 light rain 14.4
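
H2O's merge joins on the commonly named column, here "Days".  A hypothetical
pandas equivalent of the same left join (sketch only, assuming local copies
bpd_pd and wthr4_pd of the two frames):

    # Keep every station/day row; attach weather columns where a day matches.
    bpd_with_weather_pd = bpd_pd.merge(wthr4_pd, on="Days", how="left")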

In [18]:
# 7 - Test/Train split again, model build again, this time with weather.
# (In the results below, note the GLM's near-perfect train R^2 alongside a
# negative test R^2: with the extra weather columns it now overfits badly.)
split_fit_predict(bpd_with_weather)


Training data has 10 columns and 6184 rows, test has 3221 rows, holdout has 1045

gbm Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%
Model R2 TRAIN R2 TEST R2 HOLDOUT Model Training Time (s)
GBM 0.998106707443 0.927889547812 0.913989238329 7.967
DRF 0.883439340907 0.780096711334 0.784022405792 15.76
GLM 0.999999999425 -0.108259021752 -0.0546717836679 0.254
DL 0.96876446219 0.924862385081 0.909027422251 7.984