In [1]:
library(h2o)
h2o.init()
data<-h2o.uploadFile("../data/german_credit.csv")
----------------------------------------------------------------------
Your next step is to start H2O:
> h2o.init()
For H2O package documentation, ask for help:
> ??h2o
After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai
----------------------------------------------------------------------
Attaching package: ‘h2o’
The following objects are masked from ‘package:stats’:
cor, sd, var
The following objects are masked from ‘package:base’:
&&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
colnames<-, ifelse, is.character, is.factor, is.numeric, log,
log10, log1p, log2, round, signif, trunc
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmppCUDha/h2o_micio1970_started_from_r.out
/tmp/RtmppCUDha/h2o_micio1970_started_from_r.err
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 526 milliseconds
H2O cluster version: 3.10.2.2
H2O cluster version age: 3 months and 5 days
H2O cluster name: H2O_started_from_R_micio1970_hdk093
H2O cluster total nodes: 1
H2O cluster total memory: 1.71 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
R Version: R version 3.3.2 (2016-10-31)
Note: As started, H2O is limited to the CRAN default of 2 CPUs.
Shut down and restart H2O as shown below to use all your CPUs.
> h2o.shutdown()
> h2o.init(nthreads = -1)
|======================================================================| 100%
In [2]:
h2o.head(data)
Creditability | Account Balance | Duration of Credit (month) | Payment Status of Previous Credit | Purpose | Credit Amount | Value Savings/Stocks | Length of current employment | Instalment per cent | Sex & Marital Status | ⋯ | Duration in Current address | Most valuable available asset | Age (years) | Concurrent Credits | Type of apartment | No of Credits at this Bank | Occupation | No of dependents | Telephone | Foreign Worker
1 | 1 | 18 | 4 | 2 | 1049 | 1 | 2 | 4 | 2 | ⋯ | 4 | 2 | 21 | 3 | 1 | 1 | 3 | 1 | 1 | 1
1 | 1 | 9 | 4 | 0 | 2799 | 1 | 3 | 2 | 3 | ⋯ | 2 | 1 | 36 | 3 | 1 | 2 | 3 | 2 | 1 | 1
1 | 2 | 12 | 2 | 9 | 841 | 2 | 4 | 2 | 2 | ⋯ | 4 | 1 | 23 | 3 | 1 | 1 | 2 | 1 | 1 | 1
1 | 1 | 12 | 4 | 0 | 2122 | 1 | 3 | 3 | 3 | ⋯ | 2 | 1 | 39 | 3 | 1 | 2 | 2 | 2 | 1 | 2
1 | 1 | 12 | 4 | 0 | 2171 | 1 | 3 | 4 | 3 | ⋯ | 4 | 2 | 38 | 1 | 2 | 2 | 2 | 1 | 1 | 2
1 | 1 | 10 | 4 | 0 | 2241 | 1 | 2 | 1 | 3 | ⋯ | 3 | 1 | 48 | 3 | 1 | 2 | 2 | 2 | 1 | 2
In [3]:
h2o.describe(data)
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
Creditability int 0 300 0 0 0 1 0.7 0.458486870270251 NA
Account Balance int 0 0 0 0 1 4 2.577 1.25763772711089 NA
Duration of Credit (month) int 0 0 0 0 4 72 20.903 12.0588144527564 NA
Payment Status of Previous Credit int 0 40 0 0 0 4 2.545 1.08311963704299 NA
Purpose int 0 234 0 0 0 10 2.828 2.74443945969809 NA
Credit Amount int 0 0 0 0 250 18424 3271.248 2822.75175989565 NA
Value Savings/Stocks int 0 0 0 0 1 5 2.105 1.58002261739238 NA
Length of current employment int 0 0 0 0 1 5 3.384 1.20830625422697 NA
Instalment per cent int 0 0 0 0 1 4 2.973 1.11871467431268 NA
Sex & Marital Status int 0 0 0 0 1 4 2.682 0.708080064242298 NA
Guarantors int 0 0 0 0 1 3 1.145 0.477706189203367 NA
Duration in Current address int 0 0 0 0 1 4 2.845 1.10371789565685 NA
Most valuable available asset int 0 0 0 0 1 4 2.358 1.05020899774232 NA
Age (years) int 0 0 0 0 19 75 35.542 11.3526701316967 NA
Concurrent Credits int 0 0 0 0 1 3 2.675 0.70560107204629 NA
Type of apartment int 0 0 0 0 1 3 1.928 0.530185908052163 NA
No of Credits at this Bank int 0 0 0 0 1 4 1.407 0.5776544682461 NA
Occupation int 0 0 0 0 1 4 2.904 0.653613961915756 NA
No of dependents int 0 0 0 0 1 2 1.155 0.362085771753194 NA
Telephone int 0 0 0 0 1 2 1.404 0.490942995698101 NA
Foreign Worker int 0 0 0 0 1 2 1.037 0.18885620632287 NA
In [4]:
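### Convert numeric codes to categorical (factor) columns ###
# Columns 3 (Duration), 6 (Credit Amount) and 14 (Age) are true numerics and stay as int.
# Column 21 (Foreign Worker) is not in this list, so it also stays int.
to_factors <- c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
for(i in to_factors) data[,i] <- h2o.asfactor(data[,i])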
In [5]:
h2o.describe(data)
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
Creditability enum 0 300 0 0 0 1 0.7 0.458486870270251 2
Account Balance enum 0 274 0 0 0 3 NA NA 4
Duration of Credit (month) int 0 0 0 0 4 72 20.903 12.0588144527564 NA
Payment Status of Previous Credit enum 0 40 0 0 0 4 NA NA 5
Purpose enum 0 234 0 0 0 9 NA NA 10
Credit Amount int 0 0 0 0 250 18424 3271.248 2822.75175989565 NA
Value Savings/Stocks enum 0 603 0 0 0 4 NA NA 5
Length of current employment enum 0 62 0 0 0 4 NA NA 5
Instalment per cent enum 0 136 0 0 0 3 NA NA 4
Sex & Marital Status enum 0 50 0 0 0 3 NA NA 4
Guarantors enum 0 907 0 0 0 2 NA NA 3
Duration in Current address enum 0 130 0 0 0 3 NA NA 4
Most valuable available asset enum 0 282 0 0 0 3 NA NA 4
Age (years) int 0 0 0 0 19 75 35.542 11.3526701316967 NA
Concurrent Credits enum 0 139 0 0 0 2 NA NA 3
Type of apartment enum 0 179 0 0 0 2 NA NA 3
No of Credits at this Bank enum 0 633 0 0 0 3 NA NA 4
Occupation enum 0 22 0 0 0 3 NA NA 4
No of dependents enum 0 845 0 0 0 1 0.155 0.362085771753194 2
Telephone enum 0 596 0 0 0 1 0.404 0.4909429956981 2
Foreign Worker int 0 0 0 0 1 2 1.037 0.18885620632287 NA
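After the conversion, the Zeros, Min and Max columns no longer describe the original codes. For example, Account Balance now shows Min 0, Max 3 and Zeros 274, and 274 is the number of rows at level 1. A minimal base-R sketch of that re-indexing, assuming H2O reports categorical columns by zero-based level index (the toy vector below is illustrative, not taken from the data set):

```r
# Toy stand-in for a column coded 1..4 (illustrative values only)
account_balance <- c(1, 1, 2, 4, 3, 4, 1)

f <- factor(account_balance)
codes <- as.integer(f) - 1L      # zero-based level index, as assumed for H2O enums

levels(f)                        # the labels keep the original codes
range(codes)                     # the index runs from 0 to (number of levels - 1)
zeros <- sum(codes == 0)         # rows sitting at the first level
zeros
```

Read this way, the Zeros column for an enum is simply the row count of its first level.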
In [6]:
h2o.summary(data)
Warning message in h2o.summary(data):
“Approximated quantiles computed! If you are interested in exact quantiles, please pass the `exact_quantiles=TRUE` parameter.”
Creditability  Account Balance  Duration of Credit (month)
1:700          4:394            Min.   : 4.0
0:300          1:274            1st Qu.:12.0
               2:269            Median :18.0
               3: 63            Mean   :20.9
                                3rd Qu.:24.0
                                Max.   :72.0
Payment Status of Previous Credit  Purpose  Credit Amount  Value Savings/Stocks
2:530                              3:280    Min.   :  250  1:603
4:293                              0:234    1st Qu.: 1359  5:183
3: 88                              2:181    Median : 2304  2:103
1: 49                              1:103    Mean   : 3271  3: 63
0: 40                              9: 97    3rd Qu.: 3958  4: 48
                                   6: 50    Max.   :18424
Length of current employment  Instalment per cent  Sex & Marital Status
3:339                         4:476                3:548
5:253                         2:231                2:310
4:174                         3:157                4: 92
2:172                         1:136                1: 50
1: 62
Guarantors  Duration in Current address  Most valuable available asset
1:907       4:413                        3:332
3: 52       2:308                        1:282
2: 41       3:149                        2:232
            1:130                        4:154
Age (years)    Concurrent Credits  Type of apartment
Min.   :19.00  3:814               2:714
1st Qu.:27.00  1:139               1:179
Median :33.00  2: 47               3:107
Mean   :35.54
3rd Qu.:42.00
Max.   :75.00
No of Credits at this Bank  Occupation  No of dependents  Telephone
1:633                       3:630       1:845             1:596
2:333                       2:200       2:155             2:404
3: 28                       4:148
4:  6                       1: 22
Foreign Worker
Min.   :1.000
1st Qu.:1.000
Median :1.000
Mean   :1.037
3rd Qu.:1.000
Max.   :2.000
In [7]:
h2o.str(data)
Class 'H2OFrame' <environment: 0x314cbf0>
- attr(*, "op")= chr ":="
- attr(*, "eval")= logi TRUE
- attr(*, "id")= chr "RTMP_sid_9e76_19"
- attr(*, "nrow")= int 1000
- attr(*, "ncol")= int 21
- attr(*, "types")=List of 21
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
- attr(*, "data")='data.frame': 10 obs. of 21 variables:
..$ Creditability : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2
..$ Account Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2
..$ Duration of Credit (month) : num 18 9 12 12 12 10 8 6 18 24
..$ Payment Status of Previous Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3
..$ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4
..$ Credit Amount : num 1049 2799 841 2122 2171 ...
..$ Value Savings/Stocks : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 3
..$ Length of current employment : Factor w/ 5 levels "1","2","3","4",..: 2 3 4 3 3 2 4 2 1 1
..$ Instalment per cent : Factor w/ 4 levels "1","2","3","4": 4 2 2 3 4 1 1 2 4 1
..$ Sex & Marital Status : Factor w/ 4 levels "1","2","3","4": 2 3 2 3 3 3 3 3 2 2
..$ Guarantors : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1
..$ Duration in Current address : Factor w/ 4 levels "1","2","3","4": 4 2 4 2 4 3 4 4 4 4
..$ Most valuable available asset : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 1 1 1 3 4
..$ Age (years) : num 21 36 23 39 38 48 39 40 65 23
..$ Concurrent Credits : Factor w/ 3 levels "1","2","3": 3 3 3 3 1 3 3 3 3 3
..$ Type of apartment : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 1 2 2 2 1
..$ No of Credits at this Bank : Factor w/ 4 levels "1","2","3","4": 1 2 1 2 2 2 2 1 2 1
..$ Occupation : Factor w/ 4 levels "1","2","3","4": 3 3 2 2 2 2 2 2 1 1
..$ No of dependents : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 1
..$ Telephone : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1
..$ Foreign Worker : num 1 1 1 2 2 2 2 2 1 1
In [8]:
h2o.group_by(data, by="Creditability",nrow("Creditability"))
Creditability nrow_Creditability
1 0 300
2 1 700
[2 rows x 2 columns]
In [9]:
h2o.hist(data[,"Credit Amount"])
#h2o.hist(data[,14])
In [10]:
data$credit_amount_trnsf <- h2o.log(data[,"Credit Amount"])
h2o.hist(data$credit_amount_trnsf)
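The log transform is applied because Credit Amount is right-skewed: the summary above reports a mean (3271) well above the median (2304) and a maximum of 18424. A rough base-R sketch using only those summary values shows how the log pulls in the long right tail. The symmetry ratio used here is an illustrative measure chosen for this sketch, not something H2O computes.

```r
# Summary values of Credit Amount copied from the h2o.summary output
credit <- c(min = 250, median = 2304, max = 18424)
log_credit <- log(credit)

# Distance from median to max, relative to distance from min to median.
# A value near 1 means the two tails are about equally long.
tail_ratio <- function(x) unname((x["max"] - x["median"]) / (x["median"] - x["min"]))

raw_ratio <- tail_ratio(credit)
log_ratio <- tail_ratio(log_credit)
round(c(raw = raw_ratio, log = log_ratio), 2)
```

On the raw scale the upper tail is several times longer than the lower one, while on the log scale the two are roughly balanced.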
In [11]:
target <- "Creditability"
In [12]:
print(target)
[1] "Creditability"
In [13]:
a<-colnames(data)
features <- a[2:22]
print(features)
[1] "Account Balance" "Duration of Credit (month)"
[3] "Payment Status of Previous Credit" "Purpose"
[5] "Credit Amount" "Value Savings/Stocks"
[7] "Length of current employment" "Instalment per cent"
[9] "Sex & Marital Status" "Guarantors"
[11] "Duration in Current address" "Most valuable available asset"
[13] "Age (years)" "Concurrent Credits"
[15] "Type of apartment" "No of Credits at this Bank"
[17] "Occupation" "No of dependents"
[19] "Telephone" "Foreign Worker"
[21] "credit_amount_trnsf"
In [14]:
set.seed(102) # Seeds R's own RNG only; it does not control the H2O split below
# Split the data into roughly 75% training and 25% test rows.
# To make the split itself reproducible, pass seed= to h2o.splitFrame().
data.split <- h2o.splitFrame(data=data, ratios=0.75)
# The first frame in the split is the training set
data.train <- data.split[[1]]
# The second frame in the split is the test set
data.test <- data.split[[2]]
In [15]:
nrow(data.train)
767
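The training set has 767 rows rather than exactly 750 because h2o.splitFrame assigns rows probabilistically instead of cutting at an exact count. The base-R sketch below imitates that idea with a per-row uniform draw. It is an illustration of the principle only, not H2O's actual implementation, and it will not reproduce the 767/233 split.

```r
set.seed(102)                      # makes this base-R simulation repeatable
n <- 1000                          # number of rows in the credit data
in_train <- runif(n) < 0.75        # each row lands in train with probability 0.75

n_train <- sum(in_train)
n_test  <- n - n_train
c(train = n_train, test = n_test)  # close to 750/250, but rarely exact
```

The two parts always add up to the full data set, which matches the notebook: 767 training rows plus 233 test rows.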
In [16]:
glm_model1 <- h2o.glm(x = features,
y = target,
training_frame = data.train,
model_id = "glm_model1",
family = "binomial")
|======================================================================| 100%
In [17]:
print(summary(glm_model1))
Model Details:
==============
H2OBinomialModel: glm
Model Key: glm_model1
GLM Model: summary
family link regularization
1 binomial logit Elastic Net (alpha = 0.5, lambda = 0.02283 )
number_of_predictors_total number_of_active_predictors number_of_iterations
1 71 18 5
training_frame
1 RTMP_sid_9e76_24
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.1621729
RMSE: 0.402707
LogLoss: 0.4934618
Mean Per-Class Error: 0.3228609
AUC: 0.8069031
Gini: 0.6138062
R^2: 0.233218
Null Deviance: 941.9281
Residual Deviance: 756.9705
AIC: 794.9705
Confusion Matrix for F1-optimal threshold:
0 1 Error Rate
0 100 133 0.570815 =133/233
1 40 494 0.074906 =40/534
Totals 140 627 0.225554 =173/767
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.528189 0.850991 307
2 max f2 0.328556 0.924835 386
3 max f0point5 0.655950 0.837428 229
4 max accuracy 0.528189 0.774446 307
5 max precision 0.962081 1.000000 0
6 max recall 0.328556 1.000000 386
7 max specificity 0.962081 1.000000 0
8 max absolute_mcc 0.633872 0.452292 243
9 max min_per_class_accuracy 0.675999 0.726592 217
10 max mean_per_class_accuracy 0.655950 0.737036 229
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Scoring History:
timestamp duration iteration negative_log_likelihood objective
1 2017-04-17 19:38:36 0.000 sec 0 470.96404 0.61403
2 2017-04-17 19:38:36 0.037 sec 1 391.47936 0.54639
3 2017-04-17 19:38:36 0.051 sec 2 387.08134 0.54480
4 2017-04-17 19:38:37 0.068 sec 3 386.80849 0.54479
5 2017-04-17 19:38:37 0.106 sec 4 378.62245 0.54253
6 2017-04-17 19:38:37 0.112 sec 5 378.48524 0.54253
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Standardized Coefficient Magnitudes: standardized coefficient magnitudes
names coefficients sign
1 Account Balance.4 0.834273 POS
2 Account Balance.1 0.463571 NEG
3 Duration of Credit (month) 0.339717 NEG
4 Payment Status of Previous Credit.4 0.272301 POS
5 Instalment per cent.4 0.210977 NEG
---
names coefficients sign
66 Concurrent Credits.2 0.000000 POS
67 No of dependents.1 0.000000 POS
68 No of dependents.2 0.000000 POS
69 Telephone.1 0.000000 POS
70 Telephone.2 0.000000 POS
71 credit_amount_trnsf 0.000000 POS
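Several of the training metrics above are tied together by simple identities, which makes a quick sanity check possible. The sketch below recomputes them in base R from the printed values. Reading the AIC gap as 18 active predictors plus an intercept is an assumption of this sketch, not something the output states.

```r
# Values copied from the training summary and scoring history above
auc               <- 0.8069031
residual_deviance <- 756.9705
aic               <- 794.9705
final_nll         <- 378.48524   # last negative_log_likelihood in the scoring history
n_train           <- 767

gini     <- 2 * auc - 1                        # Gini = 2 * AUC - 1
deviance <- 2 * final_nll                      # binomial deviance = 2 * negative log-likelihood
logloss  <- residual_deviance / (2 * n_train)  # mean negative log-likelihood per row
n_params <- (aic - residual_deviance) / 2      # AIC = deviance + 2 * (number of parameters)

round(c(gini = gini, deviance = deviance, logloss = logloss, n_params = n_params), 4)
```

The recomputed Gini, deviance and LogLoss agree with the printed 0.6138062, 756.9705 and 0.4934618, and the parameter count comes out at 19.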
In [18]:
perf_obj <- h2o.performance(glm_model1, newdata = data.test)
In [19]:
print(perf_obj)
H2OBinomialMetrics: glm
MSE: 0.1831267
RMSE: 0.4279331
LogLoss: 0.5429005
Mean Per-Class Error: 0.46426
AUC: 0.7169574
Gini: 0.4339148
R^2: 0.1061169
Null Deviance: 279.8683
Residual Deviance: 252.9917
AIC: 290.9917
Confusion Matrix for F1-optimal threshold:
0 1 Error Rate
0 6 61 0.910448 =61/67
1 3 163 0.018072 =3/166
Totals 9 224 0.274678 =64/233
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.360782 0.835897 223
2 max f2 0.211924 0.925307 232
3 max f0point5 0.702929 0.810398 121
4 max accuracy 0.472808 0.725322 203
5 max precision 0.939168 1.000000 0
6 max recall 0.211924 1.000000 232
7 max specificity 0.939168 1.000000 0
8 max absolute_mcc 0.702929 0.362274 121
9 max min_per_class_accuracy 0.668232 0.668675 132
10 max mean_per_class_accuracy 0.702929 0.699874 121
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
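The headline test metrics can be recovered from the confusion matrix alone. A short base-R sketch using the four counts printed above (rows are actual classes, columns are predicted classes):

```r
# Counts from the test confusion matrix at the F1-optimal threshold
tn <- 6;  fp <- 61    # actual 0: predicted 0 / predicted 1
fn <- 3;  tp <- 163   # actual 1: predicted 0 / predicted 1
n  <- tn + fp + fn + tp

accuracy  <- (tp + tn) / n
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Average of the two per-class error rates shown in the Error column
mean_per_class_error <- (fp / (fp + tn) + fn / (fn + tp)) / 2

round(c(accuracy = accuracy, f1 = f1, mean_per_class_error = mean_per_class_error), 6)
```

These reproduce the reported max F1 (0.835897) and Mean Per-Class Error (0.46426). They also show that at this threshold the model labels 61 of the 67 bad credits as good.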
In [23]:
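# 0.939168 is the max-precision threshold from the table above, not the max-accuracy one (0.472808)
h2o.accuracy(perf_obj,0.939168338305063)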
- 0.291845493562232
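An accuracy of about 0.29 looks alarming, but it follows from the threshold chosen. The metrics table lists 0.939168 as the point of maximum precision and maximum specificity, both equal to 1. A small base-R sketch, using only numbers printed in this notebook, works out what that implies:

```r
n_test     <- 233                 # rows in the test frame
n_negative <- 67                  # actual class 0 rows, from the confusion matrix
accuracy   <- 0.291845493562232   # value returned by h2o.accuracy at threshold 0.939168

n_correct <- round(accuracy * n_test)   # total correctly classified rows

# Specificity is 1 at this threshold, so every negative is classified correctly;
# whatever remains must be correctly classified positives.
true_positives <- n_correct - n_negative
c(correct = n_correct, true_positives = true_positives)
```

At this threshold almost every applicant is predicted as class 0, so accuracy falls to roughly the share of class 0 rows in the test set.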
In [21]:
pred_creditability <- h2o.predict(glm_model1,data.test)
pred_creditability
|======================================================================| 100%
predict p0 p1
1 1 0.2970714 0.7029286
2 0 0.5300577 0.4699423
3 1 0.2398353 0.7601647
4 1 0.3139088 0.6860912
5 0 0.4860326 0.5139674
6 1 0.1663972 0.8336028
[233 rows x 3 columns]
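The predict column is not a plain 0.5 cut on p1: row 5 has p1 = 0.5139674 and is still labelled 0. The sketch below assumes H2O labels rows with the max-F1 threshold found on the training data (0.528189 in the training metrics), which is consistent with all six rows shown.

```r
# p1 values copied from the first six prediction rows
p1 <- c(0.7029286, 0.4699423, 0.7601647, 0.6860912, 0.5139674, 0.8336028)

threshold <- 0.528189                    # training max-F1 threshold (assumed cutoff)
predicted <- as.integer(p1 >= threshold)
predicted                                # compare with the predict column above
```

To apply a different cutoff, compare the p1 column against the threshold of your choice instead of relying on the predict column.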
Content source: micio1970/H2oaiMaster