Multi-GPU in TensorFlow and Deep Water

This notebook shows how to define a neural net with multi-GPU support using TensorFlow and Keras.

We then train the resulting graph with H2O Deep Water.


In [1]:
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.deepwater import H2ODeepWaterEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
import subprocess

h2o.init(nthreads=-1)
if not H2ODeepWaterEstimator.available(): exit()  # stop early if Deep Water is not available on this cluster


versionFromGradle='3.11.0',projectVersion='3.11.0.99999',branch='arno-automl-xgboost-deepwater',lastCommitHash='ab6c38eeb4a0673be164f680914f65c9922633d3',gitDescribe='jenkins-master-3860-326-gab6c38e',compiledOn='2017-05-01 16:43:47',compiledBy='arno'
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_121"; Java(TM) SE Runtime Environment (build 1.8.0_121-b13); Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
  Starting server from /home/dmitry/Desktop/venv/gtc/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpzavoyzs0
  JVM stdout: /tmp/tmpzavoyzs0/h2o_dmitry_started_from_python.out
  JVM stderr: /tmp/tmpzavoyzs0/h2o_dmitry_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
versionFromGradle='3.11.0',projectVersion='3.11.0.99999',branch='arno-automl-xgboost-deepwater',lastCommitHash='ab6c38eeb4a0673be164f680914f65c9922633d3',gitDescribe='jenkins-master-3860-326-gab6c38e',compiledOn='2017-05-01 16:43:47',compiledBy='arno'
H2O cluster uptime: 04 secs
H2O cluster version: 3.11.0.99999
H2O cluster version age: 1 day
H2O cluster name: H2O_from_python_dmitry_nlw06e
H2O cluster total nodes: 1
H2O cluster free memory: 26.67 Gb
H2O cluster total cores: 40
H2O cluster allowed cores: 40
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final

Data

As an example we are going to use the dataset from Kaggle's BNP Paribas Cardif Claims Management competition.

Competition goal

In this challenge, BNP Paribas Cardif is providing an anonymized database with two categories of claims:

  • claims for which approval could be accelerated leading to faster payments
  • claims for which additional information is required before approval

In machine learning terms this is a binary classification problem. As the performance metric we are going to use logarithmic loss (logloss).
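For reference, logloss is simply the negative mean log-likelihood of the predicted probabilities. A minimal sketch of how it can be computed by hand, assuming NumPy arrays of labels and predicted probabilities (illustrative only, not part of the notebook's pipeline):

import numpy as np

def binary_logloss(y_true, p_pred, eps=1e-15):
    # clip probabilities away from 0 and 1 so the logarithm stays finite
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# tiny usage example: perfect confidence gives ~0, random guessing ~0.693
print(binary_logloss([1, 0, 1], [0.9, 0.2, 0.6]))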

Data

Data can be downloaded here: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/download/train.csv.zip


In [2]:
# import the dataset into H2O and show a few rows as an example
df = h2o.import_file("train.csv")
df.show()
df.dim


Parse progress: |█████████████████████████████████████████████████████████| 100%
(df.show() output omitted: the first 10 rows of the 133-column frame, with columns ID, target, v1 ... v131; the v* predictors are a mix of numeric and categorical columns with many missing values)
Out[2]:
[114321, 133]

In [3]:
# "target" is a column we would like to predict
response = "target"
cols = []

# let's encode "target" column as enum (factor)
for i in cols + [response]: 
    df[i] = df[i].asfactor() 
predictors = list(set(df.names) - set([response, 'ID']))

In [4]:
# dataset split
r = df.runif(seed=42)
train = df[r  < 0.8]                 ## 80% for training
valid = df[(r >= 0.8) & (r < 0.9)]   ## 10% for early stopping (only enabled by default for Deep Water)
test  = df[r  >= 0.9]                ## 10% for final testing
print(train.dim)
print(valid.dim)
print(test.dim)


[91408, 133]
[11380, 133]
[11533, 133]
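As an aside, the same 80/10/10 split can also be produced with H2O's split_frame helper; a minimal sketch (the runif-based split above is what the rest of this notebook actually uses):

# ratios gives the fractions of the first partitions; the remainder goes to the last one
train_sf, valid_sf, test_sf = df.split_frame(ratios=[0.8, 0.1], seed=42)
print(train_sf.dim, valid_sf.dim, test_sf.dim)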

In [5]:
# Neural net definition.
# Please note how easily you can mix Keras layers and TensorFlow layers in the same graph definition.

import tensorflow as tf
import json
from keras.layers.core import Dense,  Activation, Dropout
from keras.layers.normalization import BatchNormalization
from keras import backend as K
from keras.objectives import categorical_crossentropy
from tensorflow.python.framework import ops

def keras_model(size, classes, n_gpus = 1, layers = 3, neurons = 256):
    # always create a new graph inside ipython or
    # the default one will be used and can lead to
    # unexpected behavior
    graph = tf.Graph() 
    with graph.as_default():
        # Input fed via H2O
        inp = tf.placeholder(tf.float32, [None, size])
        # Actual labels used for training fed via H2O
        labels = tf.placeholder(tf.float32, [None, classes])
        
        if n_gpus > 1:
            inp_arr = tf.split(inp, n_gpus, axis=0)
            labels_arr = tf.split(labels, n_gpus, axis=0)
        else:
            inp_arr = [ inp ]
            labels_arr = [ labels ]
        
        classes_arr = [ classes ] * n_gpus

        logits_arr = [ 0.0 ] * n_gpus
        predictions_arr = [ 0.0 ] * n_gpus
        
        for gpu in range(n_gpus):
            
            with tf.device('/gpu:'+str(gpu)):
                with tf.name_scope('tower_'+str(gpu)) as scope:


                    x = Dense(neurons)(inp_arr[gpu])
                    x = tf.contrib.layers.batch_norm(x)
                    x = Activation('relu')(x)
                    
                    for i in range(layers):
                        sl = x
                        x = Dense(neurons)(x)
                        x = tf.contrib.layers.batch_norm(x)
                        x = Activation('relu')(x)
                        x = tf.nn.dropout(x, 0.5)
                    out = Dense(classes)(x)
                    logits_arr[gpu] = out
                    predictions_arr[gpu] = tf.nn.softmax(out)
                    
        with tf.device('/cpu:0'):
            out = tf.concat(logits_arr, 0)
            predictions = tf.concat(predictions_arr, 0)
            loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(labels, out))
            train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

        init_op = tf.global_variables_initializer()

        # Metadata required by H2O
        tf.add_to_collection(ops.GraphKeys.INIT_OP, init_op.name)
        tf.add_to_collection(ops.GraphKeys.TRAIN_OP, train_step)
        tf.add_to_collection("logits", out)
        tf.add_to_collection("predictions", predictions)

        meta = json.dumps({
                "inputs": {"batch_image_input": inp.name,
                           "categorical_labels": labels.name
                          },
                "outputs": {"categorical_logits": out.name,
                            "layers": ','.join([m.name for m in tf.get_default_graph().get_operations()])},
                "parameters": {},
            })
        tf.add_to_collection("meta", meta)

        # Save the meta file with the graph
        saver = tf.train.Saver()
        filename = "/tmp/keras_tensorflow.meta"
        tf.train.export_meta_graph(filename, saver_def=saver.as_saver_def())

        return filename


Using TensorFlow backend.
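The graph above uses plain data parallelism: the incoming mini-batch is split into one slice per GPU tower, each tower computes its own logits, and the slices are concatenated back on the CPU before the softmax cross-entropy and the single Adam update. A minimal CPU-only sketch of that split/concat round trip (illustrative only, not part of the Deep Water run):

import numpy as np
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    inp = tf.placeholder(tf.float32, [None, 4])
    towers = tf.split(inp, 2, axis=0)      # one slice per "GPU"
    outs = [t * 1.0 for t in towers]       # stand-in for the per-tower forward passes
    merged = tf.concat(outs, 0)            # recombine in the original row order

with tf.Session(graph=g) as sess:
    batch = np.arange(8 * 4, dtype=np.float32).reshape(8, 4)
    np.testing.assert_array_equal(sess.run(merged, {inp: batch}), batch)

Note that tf.split with an integer number of slices needs the batch dimension to be divisible by the number of GPUs, which is one reason mini_batch_size is set to batch_size*NGPUS in the training cell below.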

In [6]:
NGPUS = int(subprocess.check_output("nvidia-smi -L | wc -l", shell=True))
print("GPUs:", NGPUS)


GPUs: 2

In [7]:
# training takes ~10 minutes to converge on 2 GPUs (GeForce 1080)
# 194 is the size of the input layer: all categorical columns are expanded with H2O's "binary" encoding
filename = keras_model(194, 2, NGPUS, layers = 5, neurons = 4096)
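For reference, the 194 inputs are the numeric predictors plus the extra columns produced by the binary categorical encoding. A rough sanity check, assuming each factor expands to roughly ceil(log2(number of levels)) binary columns (this approximates, but may not exactly match, H2O's encoder):

import math

# column types of the predictors as reported by H2O ('real'/'int' are numeric, 'enum' is categorical)
types = {c: t for c, t in df.types.items() if c in predictors}
numeric_cols = [c for c, t in types.items() if t in ("real", "int")]
factor_cols = [c for c, t in types.items() if t == "enum"]

approx_width = len(numeric_cols) + sum(
    max(1, math.ceil(math.log2(df[c].nlevels()[0]))) for c in factor_cols)
print("numeric:", len(numeric_cols), "factors:", len(factor_cols),
      "approximate encoded width:", approx_width)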

In [8]:
%%time

batch_size = 512

dw = H2ODeepWaterEstimator(
    seed=1234, 
    backend = "tensorflow", 
    epochs = 100,
    network_definition_file=filename,
    mini_batch_size = batch_size*NGPUS,
    categorical_encoding = "binary",
)  
dw.train(
    x=predictors, 
    y=response, 
    training_frame=train, 
    validation_frame=valid,
)
print("Validation Logloss:",dw.model_performance(valid=True).logloss())


deepwater Model Build progress: |█████████████████████████████████████████| 100%
Validation Logloss: 0.4952041208139977
CPU times: user 2.94 s, sys: 348 ms, total: 3.29 s
Wall time: 10min 16s

In [9]:
pdw = dw.predict(test)
print("Test LogLoss:", h2o.make_metrics(actual=test[response], predicted=pdw[2]).logloss())


deepwater prediction progress: |██████████████████████████████████████████| 100%
Test LogLoss: 0.5063574885230738
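The same metrics object exposes other binomial metrics as well, and the test predictions can be written out for a Kaggle-style submission; a minimal sketch (the output path is illustrative):

perf = h2o.make_metrics(actual=test[response], predicted=pdw[2])
print("Test AUC:", perf.auc())

# keep the row ID next to the predicted probability of class "1" and export it
submission = test["ID"].cbind(pdw[2])
h2o.export_file(submission, path="/tmp/deepwater_test_predictions.csv", force=True)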

In [ ]: