In [1]:
import h2o
import time

In [2]:
# Explore a typical Data Science workflow with H2O and Python
#
# Goal: assist the manager of CitiBike of NYC to load-balance the bicycles
# across the CitiBike network of stations, by predicting the number of bike
# trips taken from each station each day.  Use 10 million rows of historical
# data, and eventually add weather data.


# Connect to a cluster
h2o.init()


H2O cluster uptime: 1 hours 54 minutes 16 seconds 936 milliseconds
H2O cluster version: 3.1.0.99999
H2O cluster name: spencer
H2O cluster total nodes: 1
H2O cluster total memory: 14.22 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [3]:
# Set this to True if you want to fetch the data directly from S3.
# This is useful if your cluster is running in EC2.
data_source_is_s3 = False

def mylocate(s):
    if data_source_is_s3:
        return "s3n://h2o-public-test-data/" + s
    else:
        return h2o.locate(s)

In [4]:
# Pick either the big or the small demo.
# The big dataset totals about 10 million rows across 14 monthly files.
small_test = [mylocate("bigdata/laptop/citibike-nyc/2013-10.csv")]
big_test =   [mylocate("bigdata/laptop/citibike-nyc/2013-07.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-08.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-09.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-10.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-11.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2013-12.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-01.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-02.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-03.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-04.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-05.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-06.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-07.csv"),
              mylocate("bigdata/laptop/citibike-nyc/2014-08.csv")]

# ----------

# 1- Load data - one row per bicycle trip, with columns for the start and end
# station, trip duration, and trip start time and day.  The larger dataset
# totals about 10 million rows.
print "Import and Parse bike data"
data = h2o.import_frame(path=small_test)


Import and Parse bike data

Parse Progress: [##################################################] 100%

Parsed 1,037,712 rows and 15 cols:

File1 /Users/spencer/0xdata/h2o-dev/bigdata/laptop/citibike-nyc/2013-10.csv

In [5]:
# ----------

# 2- light data munging: group the bike starts per-day, converting the 10M rows
# of trips to about 140,000 station&day combos - predicting the number of trip
# starts per-station-per-day.

# Convert start time to days since the epoch.  The timestamps are in
# milliseconds, so divide by milliseconds-per-day.
starttime = data["starttime"]
msecPerDay = 1000*60*60*24
data["Days"] = (starttime/msecPerDay).floor()
data.describe()


Rows: 1,037,712 Cols: 16

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 17 2.2135415 1.3 KB 0.0022871785
C1 1-Byte Integers 48 6.25 1016.6 KB 1.7506603
C1N 1-Byte Integers (w/o NAs) 48 6.25 1016.6 KB 1.7506603
C1S 1-Byte Fractions 79 10.286459 1.6 MB 2.8878725
C2 2-Byte Integers 243 31.640625 10.0 MB 17.696283
C2S 2-Byte Fractions 49 6.3802085 2.0 MB 3.5701983
C4 4-Byte Integers 32 4.166667 2.6 MB 4.6726856
C4S 4-Byte Fractions 39 5.078125 3.2 MB 5.6373096
C8 64-bit Integers 60 7.8125 9.9 MB 17.432673
C8D 64-bit Reals 153 19.921875 25.3 MB 44.59937
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 56.7 MB 1037712.0 48.0 768.0
mean 56.7 MB 1037712.0 48.0 768.0
min 56.7 MB 1037712.0 48.0 768.0
max 56.7 MB 1037712.0 48.0 768.0
stddev 0 B 0.0 0.0 0.0
total 56.7 MB 1037712.0 48.0 768.0
Column-by-Column Summary:

tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender Days
type int time time int enum real real int enum real real int enum int int int
mins 60.0 1.380610868e+12 1.380611083e+12 72.0 0.0 40.680342423 -74.01713445 72.0 0.0 40.680342423 -74.01713445 14529.0 0.0 1899.0 0.0 15979.0
maxs 1259480.0 1.383289197e+12 1.38341851e+12 3002.0 329.0 40.770513 -73.9500479759 3002.0 329.0 40.770513 -73.9500479759 20757.0 1.0 1997.0 2.0 16010.0
sigma 2000.3732323 778871729.131 778847387.504 354.434325075 100.299091797 0.0195734073053 0.0123161234106 357.398217058 100.415965049 0.0195578458116 0.0123855811965 1717.68112134 0.291696182123 11.1314906238 0.544380593291 9.02215033588
zero_count 0 0 0 0 5239 0 0 0 5449 0 0 0 97446 0 97498 0
missing_count 0 0 0 0 0 0 0 0 0 0 0 0 0 97445 0 0
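
A quick plain-Python sanity check of the epoch-day arithmetic (a sketch, not
part of the demo, using only the standard library):

    from datetime import datetime
    # 2013-10-01 is the first day of the small dataset; its epoch-day number
    # should match the "Days" minimum of 15979 in the summary above.
    epoch_days = (datetime(2013, 10, 1) - datetime(1970, 1, 1)).days
    print epoch_days   # -> 15979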

In [6]:
# Now do a monster Group-By.  Count bike starts per-station per-day.  With the
# full dataset this yields about 340 stations times 400 days (roughly 140,000
# rows).  This is what we want to predict.
group_by_cols = ["Days","start station name"]
# Aggregate spec: {output column name: [op, column index to aggregate, NA handling]}
aggregates = {"bikes": ["count", 0, "all"]}
bpd = data.group_by(cols=group_by_cols, aggregates=aggregates) # Compute bikes-per-day
bpd.show()
bpd.describe()
bpd.dim()


First 10 rows and first 3 columns: 
Days start station name bikes
15981 Fulton St & William St 89
15980 9 Ave & W 18 St 137
16007 W 27 St & 7 Ave 203
15979 Bedford Ave & S 9th St 8
15995 Washington Ave & Park Ave 29
16005 W 13 St & 6 Ave 138
15979 11 Ave & W 27 St 139
15986 Central Park S & 6 Ave 123
16004 John St & William St 60
15989 Allen St & Hester St 110
Rows: 10,450 Cols: 3

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C1S 1-Byte Fractions 32 33.333336 12.8 KB 22.15888
C2 2-Byte Integers 64 66.66667 45.1 KB 77.84112
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 57.9 KB 10450.0 32.0 96.0
mean 57.9 KB 10450.0 32.0 96.0
min 57.9 KB 10450.0 32.0 96.0
max 57.9 KB 10450.0 32.0 96.0
stddev 0 B 0.0 0.0 0.0
total 57.9 KB 10450.0 32.0 96.0
Column-by-Column Summary:

Days start station name bikes
type int enum int
mins 15979.0 0.0 1.0
maxs 16010.0 329.0 553.0
sigma 9.23370172444 95.4454360013 72.9721964301
zero_count 0 32 0
missing_count 0 0 0
Out[6]:
[10450, 3]
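
For readers who think in pandas, a rough local equivalent of this group-by is
the sketch below (hypothetical, not part of the demo; it assumes the same
monthly CSV is available on local disk):

    import pandas as pd
    df = pd.read_csv("2013-10.csv", parse_dates=["starttime"])
    df["Days"] = (df["starttime"] - pd.Timestamp("1970-01-01")).dt.days
    # One row per (day, station) pair, counting trips started there that day.
    bpd_pd = (df.groupby(["Days", "start station name"])
                .size().reset_index(name="bikes"))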

In [7]:
# Quantiles: the data is fairly unbalanced; some station/day combos are wildly
# more popular than others.
print "Quantiles of bikes-per-day"
bpd["bikes"].quantile().show()


Quantiles of bikes-per-day
First 9 rows and first 2 columns: 
Probs bikesQuantiles
0.01 4.490000000000002
0.1 19.0
0.25 43.0
0.333 57.0
0.5 87.0
0.667 118.0
0.75 137.0
0.9 192.0
0.99 334.51
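
The same probabilities can be checked locally with NumPy (a sketch, assuming
a hypothetical local list bikes_local of the per-station-per-day counts):

    import numpy as np
    probs = [0.01, 0.1, 0.25, 0.333, 0.5, 0.667, 0.75, 0.9, 0.99]
    print np.percentile(bikes_local, [100 * p for p in probs])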

In [8]:
# A little feature engineering
# Add in month-of-year (seasonality; fewer bike rides in winter than summer)
msec = bpd["Days"]*msecPerDay
bpd["Month"]     = msec.month().asfactor()
# Add in day-of-week (work-week; fewer bike rides on Sunday than on Monday)
bpd["DayOfWeek"] = msec.dayOfWeek()
print "Bikes-Per-Day"
bpd.describe()


Bikes-Per-Day
Rows: 10,450 Cols: 5

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
CBS Bits 32 20.0 3.5 KB 4.7066307
C1N 1-Byte Integers (w/o NAs) 32 20.0 12.3 KB 16.73027
C1S 1-Byte Fractions 32 20.0 12.8 KB 17.408703
C2 2-Byte Integers 64 40.0 45.1 KB 61.154396
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 73.7 KB 10450.0 32.0 160.0
mean 73.7 KB 10450.0 32.0 160.0
min 73.7 KB 10450.0 32.0 160.0
max 73.7 KB 10450.0 32.0 160.0
stddev 0 B 0.0 0.0 0.0
total 73.7 KB 10450.0 32.0 160.0
Column-by-Column Summary:

Days start station name bikes Month DayOfWeek
type int enum int enum enum
mins 15979.0 0.0 1.0 0.0 0.0
maxs 16010.0 329.0 553.0 1.0 6.0
sigma 9.23370172444 95.4454360013 72.9721964301 0.174371128617 1.9765802868
zero_count 0 32 0 328 1635
missing_count 0 0 0 0 0
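
The same derivation for a single epoch day in plain Python (a sketch; this
version computes in UTC, whereas H2O applies the cluster's time zone, so
values can shift by a day near day boundaries):

    from datetime import datetime
    # Epoch day -> seconds -> calendar fields.
    d = datetime.utcfromtimestamp(15980 * 24 * 60 * 60)
    print d.month, d.strftime("%a")   # -> 10 Wed (in UTC)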

In [9]:
# ----------
# 3- Fit a model on train; using test as validation

# Function for doing a classic test/train/holdout split
def split_fit_predict(data):
  global gbm0,drf0,glm0,dl0
  # Classic Test/Train split
  r = data['Days'].runif()   # Random UNIForm numbers, one per row
  train = data[  r  < 0.6]
  test  = data[(0.6 <= r) & (r < 0.9)]
  hold  = data[ 0.9 <= r ]
  print "Training data has",train.ncol(),"columns and",train.nrow(),"rows, test has",test.nrow(),"rows, holdout has",hold.nrow()
  
  # Run GBM
  s = time.time()
  gbm0 = h2o.gbm(x           =train.drop("bikes"),
                 y           =train     ["bikes"],
                 validation_x=test .drop("bikes"),
                 validation_y=test      ["bikes"],
                 ntrees=500, # 500 works well
                 max_depth=6,
                 learn_rate=0.1)
  gbm_elapsed = time.time() - s

  # Run DRF
  s = time.time()
  drf0 = h2o.random_forest(x =train.drop("bikes"),
                y           =train     ["bikes"],
                validation_x=test .drop("bikes"),
                validation_y=test      ["bikes"],
                ntrees=250,
                max_depth=30)
  drf_elapsed = time.time() - s 
    
    
  # Run GLM
  s = time.time()
  glm0 = h2o.glm(x           =train.drop("bikes"),
                 y           =train     ["bikes"],
                 validation_x=test .drop("bikes"),
                 validation_y=test      ["bikes"],
                 Lambda=[1e-5],
                 family="poisson")
  glm_elapsed = time.time() - s
  
  # Run DL
  s = time.time()
  dl0 = h2o.deeplearning(x  =train.drop("bikes"),
                y           =train     ["bikes"],
                validation_x=test .drop("bikes"),
                validation_y=test      ["bikes"],
                hidden=[50,50,50,50],
                epochs=50)
  dl_elapsed = time.time() - s
  
  # ----------
  # 4- Score on holdout set & report
  train_r2_gbm = gbm0.model_performance(train).r2()
  test_r2_gbm  = gbm0.model_performance(test ).r2()
  hold_r2_gbm  = gbm0.model_performance(hold ).r2()
#   print "GBM R2 TRAIN=",train_r2_gbm,", R2 TEST=",test_r2_gbm,", R2 HOLDOUT=",hold_r2_gbm
  
  train_r2_drf = drf0.model_performance(train).r2()
  test_r2_drf  = drf0.model_performance(test ).r2()
  hold_r2_drf  = drf0.model_performance(hold ).r2()
#   print "DRF R2 TRAIN=",train_r2_drf,", R2 TEST=",test_r2_drf,", R2 HOLDOUT=",hold_r2_drf
  
  train_r2_glm = glm0.model_performance(train).r2()
  test_r2_glm  = glm0.model_performance(test ).r2()
  hold_r2_glm  = glm0.model_performance(hold ).r2()
#   print "GLM R2 TRAIN=",train_r2_glm,", R2 TEST=",test_r2_glm,", R2 HOLDOUT=",hold_r2_glm
    
  train_r2_dl = dl0.model_performance(train).r2()
  test_r2_dl  = dl0.model_performance(test ).r2()
  hold_r2_dl  = dl0.model_performance(hold ).r2()
#   print " DL R2 TRAIN=",train_r2_dl,", R2 TEST=",test_r2_dl,", R2 HOLDOUT=",hold_r2_dl
    
  # make a pretty HTML table printout of the results

  header = ["Model", "R2 TRAIN", "R2 TEST", "R2 HOLDOUT", "Model Training Time (s)"]
  table  = [
            ["GBM", train_r2_gbm, test_r2_gbm, hold_r2_gbm, round(gbm_elapsed,3)],
            ["DRF", train_r2_drf, test_r2_drf, hold_r2_drf, round(drf_elapsed,3)],
            ["GLM", train_r2_glm, test_r2_glm, hold_r2_glm, round(glm_elapsed,3)],
            ["DL ", train_r2_dl,  test_r2_dl,  hold_r2_dl , round(dl_elapsed,3) ],
           ]
  h2o.H2ODisplay(table,header)
  # --------------
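
For reference, the r2() calls above report the usual coefficient of
determination; a minimal NumPy sketch of the same statistic (hypothetical
arrays y and yhat, not part of the demo):

    import numpy as np
    def r2(y, yhat):
        y, yhat = np.asarray(y, float), np.asarray(yhat, float)
        ss_res = np.sum((y - yhat) ** 2)        # residual sum of squares
        ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
        return 1.0 - ss_res / ss_tot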

In [10]:
# Split the data (train/test/holdout), fit some models and predict on the holdout data
split_fit_predict(bpd)
# Here we see an R^2 of about 0.92 for GBM and 0.84 for GLM.  This means that
# given just the station, the month, and the day-of-week we can predict over
# 90% of the variance of the bike-trip-starts.


Training data has 5 columns and 6292 rows, test has 3135 rows, holdout has 1023

gbm Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%
Model R2 TRAIN R2 TEST R2 HOLDOUT Model Training Time (s)
GBM 0.997349130174 0.924028146861 0.925064871886 6.851
DRF 0.840283339977 0.779143049269 0.783155364777 5.716
GLM 0.861354633367 0.840104405497 0.842517809915 0.228
DL 0.961708220985 0.915753071791 0.921853286131 6.858

In [11]:
# ----------
# 5- Now let's add some weather
# Load weather data
wthr1 = h2o.import_frame(path=[mylocate("bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2013.csv"),
                               mylocate("bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2014.csv")])
# Peek at the data
wthr1.describe()


Parse Progress: [##################################################] 100%

Parsed 17,520 rows and 50 cols:

File1 /Users/spencer/0xdata/h2o-dev/bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2013.csv
File2 /Users/spencer/0xdata/h2o-dev/bigdata/laptop/citibike-nyc/31081_New_York_City__Hourly_2014.csv
Rows: 17,520 Cols: 50

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 107 6.294118 8.4 KB 0.7889721
C0D Constant Reals 436 25.647058 34.1 KB 3.2148771
CXI Sparse Integers 17 1.0 1.5 KB 0.13991351
C1 1-Byte Integers 346 20.352942 197.4 KB 18.634672
C1N 1-Byte Integers (w/o NAs) 214 12.588236 122.3 KB 11.544063
C1S 1-Byte Fractions 214 12.588236 125.3 KB 11.822968
C2S 2-Byte Fractions 196 11.529412 214.5 KB 20.242111
C4S 4-Byte Fractions 170 10.0 356.1 KB 33.612423
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 1.0 MB 17520.0 34.0 1700.0
mean 1.0 MB 17520.0 34.0 1700.0
min 1.0 MB 17520.0 34.0 1700.0
max 1.0 MB 17520.0 34.0 1700.0
stddev 0 B 0.0 0.0 0.0
total 1.0 MB 17520.0 34.0 1700.0
Column-by-Column Summary:

Year Local Month Local Day Local Hour Local Year UTC Month UTC Day UTC Hour UTC Cavok Reported Cloud Ceiling (m) Cloud Cover Fraction Cloud Cover Fraction 1 Cloud Cover Fraction 2 Cloud Cover Fraction 3 Cloud Cover Fraction 4 Cloud Cover Fraction 5 Cloud Cover Fraction 6 Cloud Height (m) 1 Cloud Height (m) 2 Cloud Height (m) 3 Cloud Height (m) 4 Cloud Height (m) 5 Cloud Height (m) 6 Dew Point (C) Humidity Fraction Precipitation One Hour (mm) Pressure Altimeter (mbar) Pressure Sea Level (mbar) Pressure Station (mbar) Snow Depth (cm) Temperature (C) Visibility (km) Weather Code 1 Weather Code 1/ Description Weather Code 2 Weather Code 2/ Description Weather Code 3 Weather Code 3/ Description Weather Code 4 Weather Code 4/ Description Weather Code 5 Weather Code 5/ Description Weather Code 6 Weather Code 6/ Description Weather Code Most Severe / Icon Code Weather Code Most Severe Weather Code Most Severe / Description Wind Direction (degrees) Wind Gust (m/s) Wind Speed (m/s)
type int int int int int int int int int real real real real real int int int real real real int int int real real real real int int int real real int enum int enum int enum int enum int enum int enum int int enum int real real
mins 2013.0 1.0 1.0 0.0 2013.0 1.0 1.0 0.0 0.0 61.0 0.0 0.0 0.25 0.5 NaN NaN NaN 60.96 213.36 365.76 NaN NaN NaN -26.7 0.1251 0.0 983.2949 NaN NaN NaN -15.6 0.001 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 3.0 0.0 0.0 1.0 0.0 10.0 7.2 0.0
maxs 2014.0 12.0 31.0 23.0 2015.0 12.0 31.0 23.0 0.0 3657.6 1.0 1.0 1.0 1.0 NaN NaN NaN 3657.5999 3657.5999 3657.5999 NaN NaN NaN 24.4 1.0 26.924 1042.2113 NaN NaN NaN 36.1 16.0934 60.0 11.0 60.0 10.0 36.0 7.0 27.0 4.0 27.0 2.0 3.0 0.0 16.0 60.0 11.0 360.0 20.58 10.8
sigma 0.500014270017 3.44794972385 8.79649804852 6.92238411188 0.500584411716 3.44782405458 8.79561488868 6.92230165203 0.0 995.339856966 0.462720830993 0.42770569708 0.197155690367 0.0861015598104 NaN NaN NaN 962.743095854 916.73861349 887.215847511 NaN NaN NaN 10.9731282097 0.185792011866 2.56215129179 7.46451697179 NaN NaN NaN 10.0396739531 3.69893623033 5.70486576983 2.47814708663 6.13386253912 1.83976235335 5.80553286364 1.28967553698 3.12340844261 1.282164369 6.15223536611 0.60207972894 0.0 0.0 4.07386062702 5.70486576983 2.47814708663 106.350000031 1.81511871115 1.61469790524
zero_count 0 0 0 730 0 0 0 730 17455 0 8758 8758 0 0 0 0 0 0 0 0 0 0 0 268 0 501 0 0 0 0 269 0 0 17 0 30 0 13 0 20 0 12 0 2 14980 0 17 0 0 2768
missing_count 0 0 0 0 0 0 0 0 65 10780 375 375 14682 16535 17520 17520 17520 9103 14683 16535 17520 17520 17520 67 67 15660 360 17520 17520 17520 67 412 14980 14980 16477 16477 17181 17181 17433 17433 17504 17504 17518 17518 0 14980 14980 9382 14381 1283

In [12]:
# Lots of columns in there!  Let's plan on converting to time-since-epoch to
# do a 'join' with the bike data, plus gather weather info that might affect
# cyclists: rain, snow, temperature.  Alas, we drop the snow column since it's
# all NAs.  Also add in dew point and humidity just in case.  Slice out just
# the columns of interest and drop the rest.
wthr2 = wthr1[["Year Local","Month Local","Day Local","Hour Local","Dew Point (C)","Humidity Fraction","Precipitation One Hour (mm)","Temperature (C)","Weather Code 1/ Description"]]

wthr2.setName(wthr2.index("Precipitation One Hour (mm)"), "Rain (mm)")
wthr2.setName(wthr2.index("Weather Code 1/ Description"), "WC1")
wthr2.describe()
# Much better!


Rows: 17,520 Cols: 9

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 46 15.0326805 3.6 KB 1.780005
C1 1-Byte Integers 34 11.111112 19.4 KB 9.592678
C1N 1-Byte Integers (w/o NAs) 90 29.411766 51.5 KB 25.494701
C1S 1-Byte Fractions 42 13.725491 24.0 KB 11.894592
C2S 2-Byte Fractions 94 30.718956 103.4 KB 51.238026
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 201.9 KB 17520.0 34.0 306.0
mean 201.9 KB 17520.0 34.0 306.0
min 201.9 KB 17520.0 34.0 306.0
max 201.9 KB 17520.0 34.0 306.0
stddev 0 B 0.0 0.0 0.0
total 201.9 KB 17520.0 34.0 306.0
Column-by-Column Summary:

Year Local Month Local Day Local Hour Local Dew Point (C) Humidity Fraction Rain (mm) Temperature (C) WC1
type int int int int real real real real enum
mins 2013.0 1.0 1.0 0.0 -26.7 0.1251 0.0 -15.6 0.0
maxs 2014.0 12.0 31.0 23.0 24.4 1.0 26.924 36.1 11.0
sigma 0.500014270017 3.44794972385 8.79649804852 6.92238411188 10.9731282097 0.185792011866 2.56215129179 10.0396739531 2.47814708663
zero_count 0 0 0 730 268 0 501 269 17
missing_count 0 0 0 0 67 67 15660 67 14980

In [13]:
# Filter down to the weather at Noon
wthr3 = wthr2[ wthr2["Hour Local"]==12 ]

In [14]:
# Let's now get days since the epoch: we'll convert year/month/day into epoch
# time, and then back to epoch days.  mktime needs zero-based months and days,
# but the data is 1-based.
wthr3["msec"] = h2o.H2OFrame.mktime(year=wthr3["Year Local"], month=wthr3["Month Local"]-1, day=wthr3["Day Local"]-1, hour=wthr3["Hour Local"])
msecPerDay = 1000*60*60*24
wthr3["Days"] = (wthr3["msec"]/msecPerDay).floor()
wthr3.describe()
# msec looks sane (numbers like 1.3e12 are in the correct range for msec since
# 1970).  The epoch-day numbers match closely with those from the CitiBike
# dataset.


Rows: 730 Cols: 11

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
C0L Constant Integers 80 21.390373 6.3 KB 11.955688
C0D Constant Reals 13 3.4759357 1.0 KB 1.9427994
C1 1-Byte Integers 30 8.021391 2.6 KB 5.0176535
C1N 1-Byte Integers (w/o NAs) 56 14.973262 4.9 KB 9.375875
C1S 1-Byte Fractions 34 9.090909 3.5 KB 6.698922
C2S 2-Byte Fractions 34 9.090909 4.2 KB 8.062618
C8D 64-bit Reals 127 33.95722 29.8 KB 56.946445
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 52.3 KB 730.0 34.0 374.0
mean 52.3 KB 730.0 34.0 374.0
min 52.3 KB 730.0 34.0 374.0
max 52.3 KB 730.0 34.0 374.0
stddev 0 B 0.0 0.0 0.0
total 52.3 KB 730.0 34.0 374.0
Column-by-Column Summary:

Year Local Month Local Day Local Hour Local Dew Point (C) Humidity Fraction Rain (mm) Temperature (C) WC1 msec Days
type int int int int real real real real enum int int
mins 2013.0 1.0 1.0 12.0 -26.7 0.1723 0.0 -13.9 0.0 1.3570704e+12 15706.0
maxs 2014.0 12.0 31.0 12.0 23.3 1.0 12.446 34.4 10.0 1.420056e+12 16435.0
sigma 0.500342818004 3.45021529307 8.80227802701 0.0 11.1062964725 0.179945027923 2.36064248615 10.3989855149 2.74674726123 18219740080.4 210.877136425
zero_count 0 0 0 0 14 0 15 7 1 0 0
missing_count 0 0 0 0 3 3 660 3 620 0 0
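
A plain-Python cross-check of the conversion (a sketch; calendar.timegm works
in UTC, while the msec minimum above is 20:00 UTC, i.e. noon US Pacific, so
H2O's mktime apparently used the cluster's local time zone.  The floor-to-days
step absorbs that difference for a noon timestamp):

    import calendar
    # Noon on 2013-01-01, the first weather row, as msec since the epoch.
    msec = calendar.timegm((2013, 1, 1, 12, 0, 0)) * 1000
    print msec // (1000 * 60 * 60 * 24)   # -> 15706, the "Days" minimum above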

In [15]:
# Let's drop the extra time columns to make an easy-to-handle dataset.
wthr4 = wthr3.drop("Year Local").drop("Month Local").drop("Day Local").drop("Hour Local").drop("msec")

In [16]:
# Also, most rain values are missing; let's assume those are zero-rain days
rain = wthr4["Rain (mm)"]
rain[ rain.isna() ] = 0

In [17]:
# ----------
# 6 - Join the weather data-per-day to the bike-starts-per-day
print "Merge Daily Weather with Bikes-Per-Day"
bpd_with_weather = bpd.merge(wthr4,allLeft=True,allRite=False)
bpd_with_weather.describe()
bpd_with_weather.show()


Merge Daily Weather with Bikes-Per-Day
Rows: 10,450 Cols: 10

Chunk compression summary:

chunk_type chunk_name count count_percentage size size_percentage
CBS Bits 32 10.0 3.5 KB 0.82375145
C1 1-Byte Integers 32 10.0 12.3 KB 2.928121
C1N 1-Byte Integers (w/o NAs) 32 10.0 12.3 KB 2.928121
C1S 1-Byte Fractions 32 10.0 12.8 KB 3.0468602
C2 2-Byte Integers 64 20.0 45.1 KB 10.703204
C8D 64-bit Reals 128 40.0 335.1 KB 79.56994
Frame distribution summary:

size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.37:54321 421.1 KB 10450.0 32.0 320.0
mean 421.1 KB 10450.0 32.0 320.0
min 421.1 KB 10450.0 32.0 320.0
max 421.1 KB 10450.0 32.0 320.0
stddev 0 B 0.0 0.0 0.0
total 421.1 KB 10450.0 32.0 320.0
Column-by-Column Summary:

Days start station name bikes Month DayOfWeek Humidity Fraction Rain (mm) Temperature (C) WC1 Dew Point (C)
type int enum int enum enum real real real enum real
mins 15979.0 0.0 1.0 0.0 0.0 0.3485 0.0 9.4 2.0 -2.2
maxs 16010.0 329.0 553.0 1.0 6.0 0.8718 0.254 26.1 8.0 19.4
sigma 9.23370172444 95.4454360013 72.9721964301 0.174371128617 1.9765802868 0.149631413472 0.127095436911 4.29746634617 2.60111075432 6.49151146664
zero_count 0 32 0 328 1635 0 330 0 0 0
missing_count 0 0 0 0 0 0 9793 0 9134 0
First 10 rows and first 10 columns: 
Days start station name bikes Month DayOfWeek Humidity Fraction Rain (mm) Temperature (C) WC1 Dew Point (C)
15981 Fulton St & William St 89 10 Wed 0.4973 23.3 12.200000000000001
15980 9 Ave & W 18 St 137 10 Tue 0.5019 25.0 13.9
16007 W 27 St & 7 Ave 203 10 Mon 0.4071 10.600000000000001 -2.2
15979 Bedford Ave & S 9th St 8 9 Mon 0.4315 23.900000000000002 10.600000000000001
15995 Washington Ave & Park Ave 29 10 Wed 0.6765 20.6 14.4
16005 W 13 St & 6 Ave 138 10 Sat 0.38180000000000003 12.8 -1.1
15979 11 Ave & W 27 St 139 9 Mon 0.4315 23.900000000000002 10.600000000000001
15986 Central Park S & 6 Ave 123 10 Mon 0.48200000000000004 17.8 6.7
16004 John St & William St 60 10 Fri 0.4238 10.0 -2.2
15989 Allen St & Hester St 110 10 Thu 0.8631000000000001 0.254 16.7 light rain 14.4
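
H2O's merge joins on the commonly named column, here "Days".  A hypothetical
pandas equivalent of the same left join (sketch only, assuming local copies
bpd_pd and wthr4_pd of the two frames):

    # Keep every station/day row; attach weather columns where a day matches.
    bpd_with_weather_pd = bpd_pd.merge(wthr4_pd, on="Days", how="left")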

In [18]:
# 7 - Test/Train split again, model build again, this time with weather.
# (In the results below, note the GLM's near-perfect train R^2 alongside a
# negative test R^2: with the extra weather columns it now overfits badly.)
split_fit_predict(bpd_with_weather)


Training data has 10 columns and 6184 rows, test has 3221 rows, holdout has 1045

gbm Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%
Model R2 TRAIN R2 TEST R2 HOLDOUT Model Training Time (s)
GBM 0.998106707443 0.927889547812 0.913989238329 7.967
DRF 0.883439340907 0.780096711334 0.784022405792 15.76
GLM 0.999999999425 -0.108259021752 -0.0546717836679 0.254
DL 0.96876446219 0.924862385081 0.909027422251 7.984