In [1]:
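library(h2o)
# Start (or connect to) a local H2O cluster
h2o.init()
# Upload the German credit data set into the cluster as an H2OFrame
data <- h2o.uploadFile("../data/german_credit.csv")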


----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmppCUDha/h2o_micio1970_started_from_r.out
    /tmp/RtmppCUDha/h2o_micio1970_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 526 milliseconds 
    H2O cluster version:        3.10.2.2 
    H2O cluster version age:    3 months and 5 days  
    H2O cluster name:           H2O_started_from_R_micio1970_hdk093 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.71 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.2 (2016-10-31) 

Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)

  |======================================================================| 100%

In [2]:
h2o.head(data)


Creditability  Account Balance  Duration of Credit (month)  Payment Status of Previous Credit
1              1                18                          4
1              1                9                           4
1              2                12                          2
1              1                12                          4
1              1                12                          4
1              1                10                          4

Purpose  Credit Amount  Value Savings/Stocks  Length of current employment
2        1049           1                     2
0        2799           1                     3
9        841            2                     4
0        2122           1                     3
0        2171           1                     3
0        2241           1                     2

Instalment per cent  Sex & Marital Status  Duration in Current address  Most valuable available asset
4                    2                     4                            2
2                    3                     2                            1
2                    2                     4                            1
3                    3                     2                            1
4                    3                     4                            2
1                    3                     3                            1

Age (years)  Concurrent Credits  Type of apartment  No of Credits at this Bank
21           3                   1                  1
36           3                   1                  2
23           3                   1                  1
39           3                   1                  2
38           1                   2                  2
48           3                   1                  2

Occupation  No of dependents  Telephone  Foreign Worker
3           1                 1          1
3           2                 1          1
2           1                 1          1
2           2                 1          2
2           1                 1          2
2           2                 1          2

In [3]:
h2o.describe(data)


Label                              Type  Missing  Zeros  PosInf  NegInf  Min  Max    Mean      Sigma              Cardinality
Creditability                      int   0        300    0       0       0    1      0.7       0.458486870270251  NA
Account Balance                    int   0        0      0       0       1    4      2.577     1.25763772711089   NA
Duration of Credit (month)         int   0        0      0       0       4    72     20.903    12.0588144527564   NA
Payment Status of Previous Credit  int   0        40     0       0       0    4      2.545     1.08311963704299   NA
Purpose                            int   0        234    0       0       0    10     2.828     2.74443945969809   NA
Credit Amount                      int   0        0      0       0       250  18424  3271.248  2822.75175989565   NA
Value Savings/Stocks               int   0        0      0       0       1    5      2.105     1.58002261739238   NA
Length of current employment       int   0        0      0       0       1    5      3.384     1.20830625422697   NA
Instalment per cent                int   0        0      0       0       1    4      2.973     1.11871467431268   NA
Sex & Marital Status               int   0        0      0       0       1    4      2.682     0.708080064242298  NA
Guarantors                         int   0        0      0       0       1    3      1.145     0.477706189203367  NA
Duration in Current address        int   0        0      0       0       1    4      2.845     1.10371789565685   NA
Most valuable available asset      int   0        0      0       0       1    4      2.358     1.05020899774232   NA
Age (years)                        int   0        0      0       0       19   75     35.542    11.3526701316967   NA
Concurrent Credits                 int   0        0      0       0       1    3      2.675     0.70560107204629   NA
Type of apartment                  int   0        0      0       0       1    3      1.928     0.530185908052163  NA
No of Credits at this Bank         int   0        0      0       0       1    4      1.407     0.5776544682461    NA
Occupation                         int   0        0      0       0       1    4      2.904     0.653613961915756  NA
No of dependents                   int   0        0      0       0       1    2      1.155     0.362085771753194  NA
Telephone                          int   0        0      0       0       1    2      1.404     0.490942995698101  NA
Foreign Worker                     int   0        0      0       0       1    2      1.037     0.18885620632287   NA

In [4]:
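### Convert numeric codes to categorical (enum) columns ###
# Columns 3, 6 and 14 (credit duration, credit amount, age) stay numeric.
# Column 21 (Foreign Worker) is not in the list, so it also stays int.
to_factors <- c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
for (i in to_factors) data[,i] <- h2o.asfactor(data[,i])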

In [5]:
h2o.describe(data)


Label                              Type  Missing  Zeros  PosInf  NegInf  Min  Max    Mean      Sigma              Cardinality
Creditability                      enum  0        300    0       0       0    1      0.7       0.458486870270251  2
Account Balance                    enum  0        274    0       0       0    3      NA        NA                 4
Duration of Credit (month)         int   0        0      0       0       4    72     20.903    12.0588144527564   NA
Payment Status of Previous Credit  enum  0        40     0       0       0    4      NA        NA                 5
Purpose                            enum  0        234    0       0       0    9      NA        NA                 10
Credit Amount                      int   0        0      0       0       250  18424  3271.248  2822.75175989565   NA
Value Savings/Stocks               enum  0        603    0       0       0    4      NA        NA                 5
Length of current employment       enum  0        62     0       0       0    4      NA        NA                 5
Instalment per cent                enum  0        136    0       0       0    3      NA        NA                 4
Sex & Marital Status               enum  0        50     0       0       0    3      NA        NA                 4
Guarantors                         enum  0        907    0       0       0    2      NA        NA                 3
Duration in Current address        enum  0        130    0       0       0    3      NA        NA                 4
Most valuable available asset      enum  0        282    0       0       0    3      NA        NA                 4
Age (years)                        int   0        0      0       0       19   75     35.542    11.3526701316967   NA
Concurrent Credits                 enum  0        139    0       0       0    2      NA        NA                 3
Type of apartment                  enum  0        179    0       0       0    2      NA        NA                 3
No of Credits at this Bank         enum  0        633    0       0       0    3      NA        NA                 4
Occupation                         enum  0        22     0       0       0    3      NA        NA                 4
No of dependents                   enum  0        845    0       0       0    1      0.155     0.362085771753194  2
Telephone                          enum  0        596    0       0       0    1      0.404     0.4909429956981    2
Foreign Worker                     int   0        0      0       0       1    2      1.037     0.18885620632287   NA

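After the conversion, the describe output for Account Balance shows Zeros = 274 and Max = 3, although the raw codes run from 1 to 4. One reading that fits the numbers is that H2O reports these statistics on zero-based level indices for enum columns. The base R sketch below checks that reading against the level counts that h2o.summary() prints for this column (1:274, 2:269, 3:63, 4:394). It is an illustration of the bookkeeping only, not H2O's internal code, and the variable names are made up for the example.

```r
# Rebuild the "Account Balance" column from the level counts printed by
# h2o.summary(): 1:274, 2:269, 3:63, 4:394.
account_balance <- rep(c(1, 2, 3, 4), times = c(274, 269, 63, 394))

# Base R factor codes are 1-based; subtracting 1 gives zero-based codes,
# which is the ASSUMED basis of the enum statistics in h2o.describe().
codes <- as.integer(factor(account_balance)) - 1L

zeros       <- sum(codes == 0L)                  # rows holding the first level, "1"
max_code    <- max(codes)                        # number of levels minus one
cardinality <- nlevels(factor(account_balance))  # number of distinct levels

print(c(zeros = zeros, max = max_code, cardinality = cardinality))
```

The three values agree with the Zeros, Max and Cardinality entries of the Account Balance row in the describe output.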
In [6]:
h2o.summary(data)


Warning message in h2o.summary(data):
“Approximated quantiles computed! If you are interested in exact quantiles, please pass the `exact_quantiles=TRUE` parameter.”
 Creditability Account Balance Duration of Credit (month)
 1:700         4:394           Min.   : 4.0              
 0:300         1:274           1st Qu.:12.0              
               2:269           Median :18.0              
               3: 63           Mean   :20.9              
                               3rd Qu.:24.0              
                               Max.   :72.0              
 Payment Status of Previous Credit Purpose Credit Amount   Value Savings/Stocks
 2:530                             3:280   Min.   :  250   1:603               
 4:293                             0:234   1st Qu.: 1359   5:183               
 3: 88                             2:181   Median : 2304   2:103               
 1: 49                             1:103   Mean   : 3271   3: 63               
 0: 40                             9: 97   3rd Qu.: 3958   4: 48               
                                   6: 50   Max.   :18424                       
 Length of current employment Instalment per cent Sex & Marital Status
 3:339                        4:476               3:548               
 5:253                        2:231               2:310               
 4:174                        3:157               4: 92               
 2:172                        1:136               1: 50               
 1: 62                                                                
                                                                      
 Guarantors Duration in Current address Most valuable available asset
 1:907      4:413                       3:332                        
 3: 52      2:308                       1:282                        
 2: 41      3:149                       2:232                        
            1:130                       4:154                        
                                                                     
                                                                     
 Age (years)     Concurrent Credits Type of apartment
 Min.   :19.00   3:814              2:714            
 1st Qu.:27.00   1:139              1:179            
 Median :33.00   2: 47              3:107            
 Mean   :35.54                                       
 3rd Qu.:42.00                                       
 Max.   :75.00                                       
 No of Credits at this Bank Occupation No of dependents Telephone
 1:633                      3:630      1:845            1:596    
 2:333                      2:200      2:155            2:404    
 3: 28                      4:148                                
 4:  6                      1: 22                                
                                                                 
                                                                 
 Foreign Worker 
 Min.   :1.000  
 1st Qu.:1.000  
 Median :1.000  
 Mean   :1.037  
 3rd Qu.:1.000  
 Max.   :2.000  

In [7]:
h2o.str(data)


Class 'H2OFrame' <environment: 0x314cbf0> 
 - attr(*, "op")= chr ":="
 - attr(*, "eval")= logi TRUE
 - attr(*, "id")= chr "RTMP_sid_9e76_19"
 - attr(*, "nrow")= int 1000
 - attr(*, "ncol")= int 21
 - attr(*, "types")=List of 21
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "int"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "int"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "int"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "enum"
  ..$ : chr "int"
 - attr(*, "data")='data.frame':	10 obs. of  21 variables:
  ..$ Creditability                    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2
  ..$ Account Balance                  : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2
  ..$ Duration of Credit (month)       : num  18 9 12 12 12 10 8 6 18 24
  ..$ Payment Status of Previous Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3
  ..$ Purpose                          : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4
  ..$ Credit Amount                    : num  1049 2799 841 2122 2171 ...
  ..$ Value Savings/Stocks             : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 3
  ..$ Length of current employment     : Factor w/ 5 levels "1","2","3","4",..: 2 3 4 3 3 2 4 2 1 1
  ..$ Instalment per cent              : Factor w/ 4 levels "1","2","3","4": 4 2 2 3 4 1 1 2 4 1
  ..$ Sex & Marital Status             : Factor w/ 4 levels "1","2","3","4": 2 3 2 3 3 3 3 3 2 2
  ..$ Guarantors                       : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1
  ..$ Duration in Current address      : Factor w/ 4 levels "1","2","3","4": 4 2 4 2 4 3 4 4 4 4
  ..$ Most valuable available asset    : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 1 1 1 3 4
  ..$ Age (years)                      : num  21 36 23 39 38 48 39 40 65 23
  ..$ Concurrent Credits               : Factor w/ 3 levels "1","2","3": 3 3 3 3 1 3 3 3 3 3
  ..$ Type of apartment                : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 1 2 2 2 1
  ..$ No of Credits at this Bank       : Factor w/ 4 levels "1","2","3","4": 1 2 1 2 2 2 2 1 2 1
  ..$ Occupation                       : Factor w/ 4 levels "1","2","3","4": 3 3 2 2 2 2 2 2 1 1
  ..$ No of dependents                 : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 1
  ..$ Telephone                        : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1
  ..$ Foreign Worker                   : num  1 1 1 2 2 2 2 2 1 1

In [8]:
# Count the rows in each class of the target
h2o.group_by(data, by = "Creditability", nrow("Creditability"))


  Creditability nrow_Creditability
1             0                300
2             1                700

[2 rows x 2 columns] 

In [9]:
h2o.hist(data[,"Credit Amount"])
#h2o.hist(data[,14])



In [10]:
data$credit_amount_trnsf <- h2o.log(data[,"Credit Amount"])
h2o.hist(data$credit_amount_trnsf)
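
The log transform is applied because Credit Amount is strongly right-skewed: its mean (3271) sits well above its median (2304). The base R sketch below uses only the quantiles printed by h2o.summary() to compare the length of the upper tail with the length of the lower tail, before and after taking the natural log (the same element-wise transform h2o.log() applies). The quantiles are the approximate ones from the summary, and the tail-ratio measure is just a convenient illustration, not a formal skewness statistic.

```r
# Five-number summary of "Credit Amount" as printed by h2o.summary().
credit_amount <- c(min = 250, q1 = 1359, median = 2304, q3 = 3958, max = 18424)

# Natural logarithm of each summary value.
log_amount <- log(credit_amount)

# Ratio of upper-tail length to lower-tail length around the median.
# A ratio far above 1 indicates a long right tail.
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

ratio_raw <- tail_ratio(credit_amount)
ratio_log <- tail_ratio(log_amount)

print(round(log_amount, 2))
print(c(raw = ratio_raw, log = ratio_log))
```

On the raw scale the upper tail is several times longer than the lower one; on the log scale the two tails are of similar length, which is why the second histogram looks far more symmetric.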



In [11]:
target <- "Creditability"

In [12]:
print(target)


[1] "Creditability"

In [13]:
# Use every column except the target as a predictor
features <- setdiff(colnames(data), target)
print(features)


 [1] "Account Balance"                   "Duration of Credit (month)"       
 [3] "Payment Status of Previous Credit" "Purpose"                          
 [5] "Credit Amount"                     "Value Savings/Stocks"             
 [7] "Length of current employment"      "Instalment per cent"              
 [9] "Sex & Marital Status"              "Guarantors"                       
[11] "Duration in Current address"       "Most valuable available asset"    
[13] "Age (years)"                       "Concurrent Credits"               
[15] "Type of apartment"                 "No of Credits at this Bank"       
[17] "Occupation"                        "No of dependents"                 
[19] "Telephone"                         "Foreign Worker"                   
[21] "credit_amount_trnsf"              

In [14]:
# R's set.seed() does not control H2O's random number generator, so the seed
# is passed to h2o.splitFrame() directly to make the split reproducible.

# Split the data, giving the training set roughly 75% of the rows
# (h2o.splitFrame() splits approximately, not exactly).
data.split <- h2o.splitFrame(data = data, ratios = 0.75, seed = 102)

# Create a training set from the 1st dataset in the split
data.train <- data.split[[1]]

# Create a testing set from the 2nd dataset in the split
data.test <- data.split[[2]]

In [15]:
nrow(data.train)


767
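
The training set has 767 rows rather than exactly 750 because h2o.splitFrame() performs a probabilistic, approximate split. The base R sketch below mimics that behaviour by routing each row according to an independent uniform draw. It is a stand-in to show why the counts drift around the requested ratio, not H2O's actual implementation, and the seed here only affects this local example.

```r
set.seed(102)  # base R seed; governs only this local sketch

n_rows <- 1000
ratio  <- 0.75

# Draw one uniform number per row and send the row to the training set
# when the draw falls below the requested ratio.
u        <- runif(n_rows)
in_train <- u < ratio

n_train <- sum(in_train)
n_test  <- sum(!in_train)

# The counts land near 750 / 250 but are rarely exactly equal to them.
print(c(train = n_train, test = n_test))
```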

In [16]:
glm_model1 <- h2o.glm(x = features, 
                      y = target, 
                      training_frame = data.train,
                      model_id = "glm_model1",
                      family = "binomial")


  |======================================================================| 100%

In [17]:
print(summary(glm_model1))


Model Details:
==============

H2OBinomialModel: glm
Model Key:  glm_model1 
GLM Model: summary
    family  link                               regularization
1 binomial logit Elastic Net (alpha = 0.5, lambda = 0.02283 )
  number_of_predictors_total number_of_active_predictors number_of_iterations
1                         71                          18                    5
    training_frame
1 RTMP_sid_9e76_24

H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.1621729
RMSE:  0.402707
LogLoss:  0.4934618
Mean Per-Class Error:  0.3228609
AUC:  0.8069031
Gini:  0.6138062
R^2:  0.233218
Null Deviance:  941.9281
Residual Deviance:  756.9705
AIC:  794.9705

Confusion Matrix for F1-optimal threshold:
         0   1    Error      Rate
0      100 133 0.570815  =133/233
1       40 494 0.074906   =40/534
Totals 140 627 0.225554  =173/767

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.528189 0.850991 307
2                       max f2  0.328556 0.924835 386
3                 max f0point5  0.655950 0.837428 229
4                 max accuracy  0.528189 0.774446 307
5                max precision  0.962081 1.000000   0
6                   max recall  0.328556 1.000000 386
7              max specificity  0.962081 1.000000   0
8             max absolute_mcc  0.633872 0.452292 243
9   max min_per_class_accuracy  0.675999 0.726592 217
10 max mean_per_class_accuracy  0.655950 0.737036 229

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`



Scoring History: 
            timestamp   duration iteration negative_log_likelihood objective
1 2017-04-17 19:38:36  0.000 sec         0               470.96404   0.61403
2 2017-04-17 19:38:36  0.037 sec         1               391.47936   0.54639
3 2017-04-17 19:38:36  0.051 sec         2               387.08134   0.54480
4 2017-04-17 19:38:37  0.068 sec         3               386.80849   0.54479
5 2017-04-17 19:38:37  0.106 sec         4               378.62245   0.54253
6 2017-04-17 19:38:37  0.112 sec         5               378.48524   0.54253

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Standardized Coefficient Magnitudes: standardized coefficient magnitudes
                                names coefficients sign
1                   Account Balance.4     0.834273  POS
2                   Account Balance.1     0.463571  NEG
3          Duration of Credit (month)     0.339717  NEG
4 Payment Status of Previous Credit.4     0.272301  POS
5               Instalment per cent.4     0.210977  NEG

---
                  names coefficients sign
66 Concurrent Credits.2     0.000000  POS
67   No of dependents.1     0.000000  POS
68   No of dependents.2     0.000000  POS
69          Telephone.1     0.000000  POS
70          Telephone.2     0.000000  POS
71  credit_amount_trnsf     0.000000  POS

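Several of the training metrics printed above are tied together by simple identities, and checking them is a quick way to read the report. The base R sketch below recomputes RMSE, Gini, LogLoss, AIC and the null deviance from the other printed values. The AIC line assumes that the parameter count is the 18 active predictors plus an intercept, which reproduces the printed figure but is inferred from the numbers rather than taken from H2O's documentation.

```r
# Values copied from the training-metrics printout of glm_model1.
n_train           <- 767
mse               <- 0.1621729
auc               <- 0.8069031
residual_deviance <- 756.9705
active_predictors <- 18     # "number_of_active_predictors" in the GLM summary
n_pos             <- 534    # actual class 1 rows (confusion matrix row total)
n_neg             <- 233    # actual class 0 rows

rmse    <- sqrt(mse)                           # root of the mean squared error
gini    <- 2 * auc - 1                         # Gini coefficient from AUC
logloss <- residual_deviance / (2 * n_train)   # binomial deviance = 2 * n * logloss

# ASSUMPTION: parameters counted as active predictors plus one intercept.
aic <- residual_deviance + 2 * (active_predictors + 1)

# Null deviance: -2 * log-likelihood of an intercept-only model that predicts
# the training base rate for every row.
p_hat         <- n_pos / n_train
null_deviance <- -2 * (n_pos * log(p_hat) + n_neg * log(1 - p_hat))

print(round(c(rmse = rmse, gini = gini, logloss = logloss,
              aic = aic, null_deviance = null_deviance), 4))
```

Half of the recomputed null deviance is also the negative log-likelihood shown at iteration 0 of the scoring history, the starting point before any coefficient is fitted.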
In [18]:
perf_obj <- h2o.performance(glm_model1, newdata = data.test)

In [19]:
print(perf_obj)


H2OBinomialMetrics: glm

MSE:  0.1831267
RMSE:  0.4279331
LogLoss:  0.5429005
Mean Per-Class Error:  0.46426
AUC:  0.7169574
Gini:  0.4339148
R^2:  0.1061169
Null Deviance:  279.8683
Residual Deviance:  252.9917
AIC:  290.9917

Confusion Matrix for F1-optimal threshold:
       0   1    Error     Rate
0      6  61 0.910448   =61/67
1      3 163 0.018072   =3/166
Totals 9 224 0.274678  =64/233

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.360782 0.835897 223
2                       max f2  0.211924 0.925307 232
3                 max f0point5  0.702929 0.810398 121
4                 max accuracy  0.472808 0.725322 203
5                max precision  0.939168 1.000000   0
6                   max recall  0.211924 1.000000 232
7              max specificity  0.939168 1.000000   0
8             max absolute_mcc  0.702929 0.362274 121
9   max min_per_class_accuracy  0.668232 0.668675 132
10 max mean_per_class_accuracy  0.702929 0.699874 121

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
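
The test-set confusion matrix explains the gap between the high F1 score and the poor mean per-class error: at the F1-optimal threshold the model labels almost every applicant as creditworthy. The base R sketch below recomputes the per-class errors, the mean per-class error and F1 from the four cell counts. The helper name confusion_metrics is made up for this example.

```r
# Summarise a 2x2 confusion matrix given as raw counts.
# tn/fp are the actual-0 row, fn/tp the actual-1 row.
confusion_metrics <- function(tn, fp, fn, tp) {
  error_0   <- fp / (tn + fp)   # share of actual 0 rows predicted as 1
  error_1   <- fn / (fn + tp)   # share of actual 1 rows predicted as 0
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  list(
    error_0              = error_0,
    error_1              = error_1,
    mean_per_class_error = (error_0 + error_1) / 2,
    total_error          = (fp + fn) / (tn + fp + fn + tp),
    f1                   = 2 * precision * recall / (precision + recall)
  )
}

# Cell counts from the test-set confusion matrix printed above.
test_metrics <- confusion_metrics(tn = 6, fp = 61, fn = 3, tp = 163)
print(round(unlist(test_metrics), 6))
```

Because class 1 makes up 166 of the 233 test rows, F1 stays high even though 61 of the 67 class 0 rows are misclassified; the mean per-class error weights both classes equally and exposes the problem.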

In [23]:
# Accuracy at threshold 0.939168, the "max precision" threshold in the table above
h2o.accuracy(perf_obj, 0.939168338305063)


  1. 0.291845493562232
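
An accuracy of about 0.29 looks alarming, but it follows from the threshold chosen. The value 0.939168 is the highest threshold in the metrics table (idx 0), where precision is 1. The base R sketch below works backwards from the reported figure, assuming that exactly one test row scores at or above that threshold and that it is a true positive. That assumption is inferred from the numbers, not read from the model.

```r
n_test <- 233
n_neg  <- 67    # actual class 0 rows in the test set
n_pos  <- 166   # actual class 1 rows in the test set

# Number of correctly classified rows implied by the reported accuracy.
reported_accuracy <- 0.291845493562232
correct_strict    <- round(reported_accuracy * n_test)

# ASSUMPTION: at this threshold one positive row is flagged as 1 and every
# other row is predicted 0, so all class 0 rows are correct.
tp <- 1
tn <- n_neg
accuracy_strict <- (tp + tn) / n_test

# For comparison, the "max accuracy" row of the table (threshold 0.472808).
correct_best <- round(0.725322 * n_test)

print(c(correct_strict = correct_strict, correct_best = correct_best))
print(accuracy_strict)
```

Under that assumption the strict threshold gets 68 rows right, against 169 at the accuracy-optimal threshold of 0.472808.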

In [21]:
pred_creditability <- h2o.predict(glm_model1,data.test)
pred_creditability


  |======================================================================| 100%
  predict        p0        p1
1       1 0.2970714 0.7029286
2       0 0.5300577 0.4699423
3       1 0.2398353 0.7601647
4       1 0.3139088 0.6860912
5       0 0.4860326 0.5139674
6       1 0.1663972 0.8336028

[233 rows x 3 columns] 
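
The predict column is not a simple 0.5 cut on p1: row 5 has p1 = 0.514 yet is labelled 0. The labels are consistent with the F1-optimal threshold from the training metrics (0.528189). The base R sketch below applies that threshold to the six printed probabilities. Treating the training max-F1 threshold as the cut-off is an assumption that matches these rows, not something the output states.

```r
# p1 values of the first six test rows, copied from the prediction output.
p1 <- c(0.7029286, 0.4699423, 0.7601647, 0.6860912, 0.5139674, 0.8336028)

# ASSUMPTION: labels use the "max f1" threshold reported on the training data.
threshold <- 0.528189

# Label a row 1 when its class-1 probability reaches the threshold.
predicted <- as.integer(p1 >= threshold)

# p0 is the complement of p1.
p0 <- 1 - p1

print(data.frame(predicted = predicted, p0 = round(p0, 7), p1 = p1))
```

The resulting labels match the predict column shown above for all six rows.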
