Machine Learning with H2O - Tutorial 2: Basic Data Manipulation


Objective:

  • This tutorial demonstrates basic data manipulation with H2O.

Titanic Dataset:


Full Technical Reference:



In [1]:
# Start and connect to a local H2O cluster
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/Rtmpmut9K7/h2o_joe_started_from_r.out
    /tmp/Rtmpmut9K7/h2o_joe_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 992 milliseconds 
    H2O cluster version:        3.10.3.5 
    H2O cluster version age:    9 days  
    H2O cluster name:           H2O_started_from_R_joe_osm405 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.21 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.2 (2016-10-31) 


In [2]:
# Import Titanic data (local CSV)
titanic = h2o.importFile("kaggle_titanic.csv")


  |======================================================================| 100%

In [3]:
# Preview the first 10 rows of the dataset
head(titanic, 10)


PassengerId  Survived  Pclass  Name                                                 Sex     Age  SibSp  Parch  Ticket     Fare  Cabin  Embarked
          1         0       3  Braund, Mr. Owen Harris                              male     22      1      0     NaN   7.2500  NA     S
          2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female   38      1      0     NaN  71.2833  C85    C
          3         1       3  Heikkinen, Miss. Laina                               female   26      0      0     NaN   7.9250  NA     S
          4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)         female   35      1      0  113803  53.1000  C123   S
          5         0       3  Allen, Mr. William Henry                             male     35      0      0  373450   8.0500  NA     S
          6         0       3  Moran, Mr. James                                     male    NaN      0      0  330877   8.4583  NA     Q
          7         0       1  McCarthy, Mr. Timothy J                              male     54      0      0   17463  51.8625  E46    S
          8         0       3  Palsson, Master. Gosta Leonard                       male      2      3      1  349909  21.0750  NA     S
          9         1       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)    female   27      0      2  347742  11.1333  NA     S
         10         1       2  Nasser, Mrs. Nicholas (Adele Achem)                  female   14      1      0  237736  30.0708  NA     C


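Why transform 'Survived'? It is imported as an integer (0/1), but it is a class label rather than a quantity, so it should be stored as a categorical column. H2O then treats it as a classification target instead of a regression target.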
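
A minimal sketch of the type change, using a small made-up frame (the values in `toy` are hypothetical, not the Titanic data) and assuming a local H2O cluster can be started with `h2o.init()`. The column is checked with `is.factor()` before and after `as.factor()`:

```r
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)

# Hypothetical 0/1 column standing in for 'Survived' (not the real data)
toy <- as.h2o(data.frame(Survived = c(0L, 1L, 1L, 0L, 1L)))

# As created, the column is numeric, so it is not a factor
is_cat_before <- is.factor(toy[, "Survived"])

# After conversion, the same column is categorical
toy[, "Survived"] <- as.factor(toy[, "Survived"])
is_cat_after <- is.factor(toy[, "Survived"])

print(c(before = is_cat_before, after = is_cat_after))
```

The same check can be run on `titanic[, 'Survived']` around the conversion step in this tutorial.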



In [4]:
# Explore the column 'Survived'
h2o.describe(titanic[, 'Survived'])


Label     Type  Missing  Zeros  PosInf  NegInf  Min  Max  Mean               Sigma              Cardinality
Survived  int         0    549       0       0    0    1  0.383838383838384  0.486592454264857  NA

In [5]:
# Use h2o.hist() to create a histogram
h2o.hist(titanic[, 'Survived'])



In [6]:
# Use h2o.table() to count the 0s and 1s
h2o.table(titanic[, 'Survived'])


  Survived Count
1        0   549
2        1   342

[2 rows x 2 columns] 
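
The counts above can also be read as class proportions. The sketch below is plain R arithmetic on the two counts copied from the table (no H2O call is involved). The share of 1s should agree with the Mean reported by `h2o.describe()` for this column, because the mean of a 0/1 column is the fraction of 1s.

```r
# Counts copied from the h2o.table() output above
counts <- c("0" = 549, "1" = 342)

# Share of each class among all rows in the file
proportions <- counts / sum(counts)

print(round(proportions, 4))
```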

In [7]:
# Convert 'Survived' to categorical variable
titanic[, 'Survived'] = as.factor(titanic[, 'Survived'])

In [8]:
# Look at the summary of 'Survived' again
# The column type is now 'enum', H2O's name for a categorical variable (the term comes from Java)
h2o.describe(titanic[, 'Survived'])


Label     Type  Missing  Zeros  PosInf  NegInf  Min  Max  Mean               Sigma              Cardinality
Survived  enum        0    549       0       0    0    1  0.383838383838384  0.486592454264857  2
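
The Cardinality value is the number of distinct categories in the column. As a hedged sketch, the categories themselves can be listed with `h2o.levels()` and counted with `h2o.nlevels()`. The frame `toy` below is a made-up stand-in for the Titanic data, and a local H2O cluster is assumed to be available.

```r
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)

# Hypothetical 0/1 column standing in for 'Survived' (not the real data)
toy <- as.h2o(data.frame(Survived = c(0L, 1L, 1L, 0L, 1L)))
toy[, "Survived"] <- as.factor(toy[, "Survived"])

# List the category labels and count them
survived_levels <- h2o.levels(toy[, "Survived"])
n_levels <- h2o.nlevels(toy[, "Survived"])

print(survived_levels)
print(n_levels)
```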


Repeat the same steps for 'Pclass'.



In [9]:
# Explore the column 'Pclass'
h2o.describe(titanic[, 'Pclass'])


Label   Type  Missing  Zeros  PosInf  NegInf  Min  Max  Mean              Sigma              Cardinality
Pclass  int         0      0       0       0    1    3  2.30864197530864  0.836071240977049  NA

In [10]:
# Use h2o.hist() to create a histogram
h2o.hist(titanic[, 'Pclass'])



In [11]:
# Use h2o.table() to count the 1s, 2s and 3s
h2o.table(titanic[, 'Pclass'])


  Pclass Count
1      1   216
2      2   184
3      3   491

[3 rows x 2 columns] 

In [12]:
# Convert 'Pclass' to categorical variable
titanic[, 'Pclass'] = as.factor(titanic[, 'Pclass'])

In [13]:
# Look at the summary of 'Pclass' again
# The column type is now 'enum', H2O's name for a categorical variable (the term comes from Java)
h2o.describe(titanic[, 'Pclass'])


Label   Type  Missing  Zeros  PosInf  NegInf  Min  Max  Mean  Sigma  Cardinality
Pclass  enum        0    216       0       0    0    2  NA    NA     3
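
When several columns need the same conversion, the `as.factor()` step can be wrapped in a loop. The sketch below is one possible way to do this, not part of the original walkthrough. It uses a small made-up frame (`toy` and its values are hypothetical) and assumes a local H2O cluster is available.

```r
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)

# Hypothetical stand-in for the two integer-coded columns (not the real data)
toy <- as.h2o(data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L)
))

# Convert each listed column to a categorical variable in turn
categorical_cols <- c("Survived", "Pclass")
for (col in categorical_cols) {
  toy[, col] <- as.factor(toy[, col])
}

# Confirm that every listed column is now a factor
converted <- sapply(categorical_cols, function(col) is.factor(toy[, col]))
print(converted)
```

With the real data, the same loop would run over `titanic` with the column names used in this tutorial.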