In our first notebook we received an introduction to R4ML and conducted some exploratory data analysis with it. In this section, we will go over the typical operations a data scientist performs while cleaning and pre-processing the input data, and then look into dimensionality reduction. We start with the boilerplate code from Part I for loading and sampling the data; the following code is copied from Part I:
In [1]:
library(R4ML)
library(SparkR)
r4ml.session()
# read the airline dataset
airt <- airline
# testing, we just use the small dataset
airt <- airt[airt$Year >= "2007",]
air_hf <- as.r4ml.frame(airt)
airs <- r4ml.sample(air_hf, 0.1)[[1]]
#
total_feat <- c(
"Year", "Month", "DayofMonth", "DayOfWeek", "DepTime","CRSDepTime","ArrTime",
"CRSArrTime", "UniqueCarrier", "FlightNum", "TailNum", "ActualElapsedTime",
"CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay", "Origin", "Dest",
"Distance", "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode",
"Diverted", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
"LateAircraftDelay")
#categorical features
#"Year", "Month", "DayofMonth", "DayOfWeek",
cat_feat <- c("UniqueCarrier", "FlightNum", "TailNum", "Origin", "Dest",
"CancellationCode", "Diverted")
numeric_feat <- setdiff(total_feat, cat_feat)
# these features have no predictive power, as they are uniformly distributed, i.e., they carry little information
unif_feat <- c("Year", "Month", "DayofMonth", "DayOfWeek")
# these are constant features and can be ignored without much difference in the output
const_feat <- c("WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay")
col2rm <- c(unif_feat, const_feat, cat_feat)
rairs <- SparkR::as.data.frame(airs)
airs_names <- names(rairs)
rairs2_names <- setdiff(airs_names, col2rm)
rairs2 <- rairs[, rairs2_names]
# first we will create the imputation maps
# we impute all the columns
airs_ia <- total_feat
# imputation methods
airs_im <- lapply(total_feat,
  function(feat) {
    if (feat %in% numeric_feat) "global_mean" else "constant"
  })
# convert to vector
airs_im <- sapply(airs_im, function(e) {e})
#imputation values
airs_iv <- setNames(as.list(rep("CAT_NA", length(cat_feat))), cat_feat)
na.cols <- setdiff(total_feat, airs_ia)
# we cache the output so that the DAG is not re-executed
dummy <- cache(airs)
The pre-processing function is called r4ml.ml.preprocess, and it supports all of the major pre-processing and feature transformation tasks:
Method | R4ML options | Description |
---|---|---|
NA Removal | imputationMethod, imputationValues, omit.na | These options let one remove rows with missing data, or substitute the missing values with a constant or with the mean (for numeric columns). |
Binning | binningAttrs, numBins | A typical use case for binning: say we have people's heights in feet and inches, but we only care about three top-level categories, i.e., short, medium, and tall; this option lets the user bin the values into those buckets. |
Scaling and Centering | scalingAttrs | Many algorithms predict better when the data is normalized, i.e., the mean is subtracted and the result is divided by the standard deviation. |
Encoding (Recode) | recodeAttrs | Most machine learning algorithms boil down to matrix-based linear algebra. When categorical columns have an inherent order (like height or shirt_size), this option lets the user encode those string columns as ordinal numeric values. |
Encoding (OneHot or DummyCoding) | dummyCodeAttrs | When categorical columns have no inherent order (like a person's race or the state they live in), we would instead like to one-hot encode them. |
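The binning and scaling options are not exercised in the call below, so here is a minimal, hedged sketch of how they might be combined, assuming we bin Distance into four buckets and scale the two delay columns (these column choices are our own illustration, not part of the original pipeline):
# illustrative sketch only; depending on the data, missing values may need to
# be imputed first, as in the full call in the next cell
airs_bin_res <- r4ml.ml.preprocess(
  airs,
  binningAttrs = c("Distance"), numBins = 4,   # coarse-grain a numeric column
  scalingAttrs = c("ArrDelay", "DepDelay"),    # center and scale to unit stddev
  recodeAttrs = cat_feat                       # recode the categorical features
)
airs_binned <- airs_bin_res$data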
In [2]:
airs_t_res <- r4ml.ml.preprocess(
airs,
dummycodeAttrs = c(),
binningAttrs = c(), numBins=4,
missingAttrs = airs_ia,
imputationMethod = airs_im,
imputationValues = airs_iv,
omit.na=na.cols, # remove rows with NAs in these columns (empty here, so just a placeholder)
recodeAttrs=cat_feat # recode all the categorical features
)
# the transformed data frame
airs_t <- airs_t_res$data
# the relevant metadata list
airs_t_mdb <- airs_t_res$metadata
# cache the transformed data
dummy <- cache(airs_t)
showDF(airs_t, n = 2)
All the values are numeric now and we are ready for the next steps of the pipeline. However, we want to end this topic with a note that, using custom DML (explained later), advanced users can write their own data pre-processing steps independently.
Other data pre-processing steps that are typically done:
Input data transformation (Box-Cox). We saw in the previous section that the input data, when log-transformed, gave us a better feature. That was a special case of the Box-Cox transformation. Alternatively, statistical methods can be used to empirically identify an appropriate transformation: Box and Cox propose a family of transformations indexed by a parameter lambda (this feature will be available in R4ML in the future). Practically, one can calculate the kurtosis and skewness for various values of lambda, or run significance tests to check which lambda gives the better result.
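As a hedged, base-R illustration (this is not an R4ML API; the column choice of Distance and the lambda grid below are our own assumptions), one can write the Box-Cox family in a few lines and compare the skewness of the transformed variable across candidate values of lambda:
# Box-Cox family: (x^lambda - 1) / lambda for lambda != 0, log(x) for lambda == 0;
# it requires strictly positive input values
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
# simple moment-based skewness so that no extra packages are needed
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}
# scan a small lambda grid on the Distance column of the local sample
dist_col <- na.omit(rairs2$Distance)
dist_col <- dist_col[dist_col > 0]
lambdas <- c(-1, -0.5, 0, 0.5, 1)
skews <- sapply(lambdas, function(l) skewness(box_cox(dist_col, l)))
names(skews) <- lambdas
skews   # the lambda whose transform has skewness closest to zero is a reasonable pick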
Outlier Detection
Method | Description |
---|---|
Box Plot (univariate stats) | This way one can visually detect outliers in each dimension separately. It is not the best approach, however, since it does not account for relationships across dimensions. |
Mahalanobis distance (multivariate stats) | This is like calculating the number of units of variance in each dimension after rotating the axes in the direction of maximum variance. |
Use a classifier like logistic regression | Use the Mahalanobis distance to set a threshold (e.g., the 3-sigma distance), label anything beyond it as Y = 1, and then run logistic regression to find out which other variables drive the outliers. |
Spatial sign transformation | If a model is considered to be sensitive to outliers, one data transformation that can minimize the problem is the spatial sign. This procedure projects the predictor values onto a multidimensional sphere, which has the effect of making all the samples the same distance from the center of the sphere. Mathematically, each sample is divided by its (Euclidean) norm. |
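As a small, hedged sketch of the Mahalanobis and spatial sign ideas on the local sample (using base R only; the 0.999 chi-square cutoff and the choice of columns are our own assumptions):
# Mahalanobis-distance based outlier flagging on a few numeric columns
num_cols <- c("ArrDelay", "DepDelay", "Distance")
X <- na.omit(rairs2[, num_cols])
md2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared distances
cutoff <- qchisq(0.999, df = length(num_cols))              # roughly a 3-sigma style threshold
sum(md2 > cutoff)                                           # number of flagged rows

# spatial sign: divide each centered and scaled row by its Euclidean norm
Xs <- scale(X)
X_sign <- Xs / sqrt(rowSums(Xs^2))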
In this subsection, we saw how R4ML helps simplify data preparation, which reportedly consumes about 60% of a data scientist's time, by providing a unified and extensible API for pre-processing. We also saw a small example of R4ML in action and mentioned a few advanced techniques for power users.
Dimensionality reduction is choosing a basis or mathematical representation within which you can describe most but not all of the variance within your data, thereby retaining the relevant information, while reducing the amount of information necessary to represent it. There are a variety of techniques for doing this including but not limited to PCA, ICA, and Matrix Feature Factorization. These will take existing data and reduce it to the most discriminative components. All of these allow you to represent most of the information in your dataset with fewer, more discriminative features.
In terms of performance, having data of high dimensionality is problematic: it can mean high computational cost for learning and inference, and it often leads to overfitting, i.e., a model that performs well on the training data but poorly on unseen test data. Dimensionality reduction addresses both of these problems, while (hopefully) preserving most of the relevant information in the data needed to learn accurate, predictive models.
Also note that, in general, visualization of lower-dimensional data and its interpretation are more straightforward, and it can be used for getting insights into the data.
Principal component analysis (PCA) rotates the original data space such that the axes of the new coordinate system point into the directions of highest variance of the data. The axes or new variables are termed principal components (PCs) and are ordered by variance: the first component, PC 1, represents the direction of the highest variance of the data. The direction of the second component, PC 2, represents the highest of the remaining variance orthogonal to the first component. This can be naturally extended to obtain the required number of components which together span a component space covering the desired amount of variance. Since components describe specific directions in the data space, each component depends on a certain fraction of each of the original variables: each component is a linear combination of all original variables.
Low variance can often be assumed to represent undesired background noise. The dimensionality of the data can therefore be reduced, without loss of relevant information, by extracting a lower dimensional component space covering the highest variance. Using a lower number of principal components instead of the high-dimensional original data is a common pre-processing step that often improves results of subsequent analyses such as classification. For visualization, the first and second component can be plotted against each other to obtain a two-dimensional representation of the data that captures most of the variance (assumed to be most of the relevant information), useful to analyze and interpret the structure of a data set.
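Before running PCA at scale with R4ML in the next cell, here is a hedged local sketch of that last point, using base R's prcomp on the (already imputed, all-numeric) transformed sample; collecting the data locally and the plotting choices are our own illustration:
# local illustration only: collect the transformed sample and plot PC1 against PC2
local_df <- SparkR::as.data.frame(select(airs_t, names(rairs2)))
# guard against any constant column, which scale. = TRUE cannot handle
local_df <- local_df[, sapply(local_df, function(col) sd(col, na.rm = TRUE) > 0)]
local_pca <- prcomp(local_df, center = TRUE, scale. = TRUE)
summary(local_pca)           # proportion of variance captured by each component
plot(local_pca$x[, 1], local_pca$x[, 2],
     xlab = "PC 1", ylab = "PC 2",
     main = "Local sample projected onto the first two PCs")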
In [3]:
# since, from the exploratory data analysis, we knew that certain
# variables can be removed from the analysis, we will perform the PCA
# on the smaller set of features
airs_t3_tmp <- select(airs_t, names(rairs2)) # recall rairs2 from before
airs_t3 <- as.r4ml.matrix(airs_t3_tmp)
# do the PCA analysis with 12 components
airs_t3.pca <- r4ml.pca(airs_t3, center=T, scale=T, projData=F, k=12)
# the eigenvalues for each of the components; each eigenvalue is the square
# of the stddev and is equivalent to the variance along that component
airs_t3.pca@eigen.values
In [4]:
# the corresponding eigenvectors are
airs_t3.pca@eigen.vectors
In [5]:
sorted_evals <- sort(airs_t3.pca@eigen.values$EigenValues1, decreasing = T)
csum_sorted_evals <- cumsum(sorted_evals)
# find the cut-off, i.e., the number of principal components which capture
# 90% of the variance
evals_ratio <- csum_sorted_evals/max(csum_sorted_evals)
evals_ratio
In [7]:
pca.pc.count <- which(evals_ratio > 0.9)[1]
pca.pc.count
Analytically, we can see that we need the first six principal components. Let's also verify this graphically: here we will plot the variance of each principal component and shade the area spanned by the components that capture 90% of the variance.
In [8]:
library(ggplot2)
pca.plot.df <- data.frame(
index=1:length(sorted_evals),
PrincipalComponent=sprintf("PC-%02d", 1:length(sorted_evals)),
Variances = sorted_evals)
pca.g1 <- ggplot(data=pca.plot.df,
aes(x=PrincipalComponent, y=Variances, group=1, colour=Variances))
pca.g1 <- pca.g1 + geom_point() + geom_line()
# highlight the area containing 90% of the variance
# subset region and plot
pca.g_data <- ggplot_build(pca.g1)$data[[1]]
#plot the next shaded graph
pca.g2 <- pca.g1 + geom_area(data=subset(pca.g_data, x<=pca.pc.count),
aes(x=x, y=y), fill="red4", inherit.aes=F)
pca.g2 <- pca.g2 + geom_point() + geom_line()
pca.g2
In this subsection, we went over why dimensionality reduction is important and how R4ML helps in this task by providing a scalable PCA implementation. We also went through an example to understand it more.
In general, using custom algorithms, one can also implement other dimensionality reduction techniques such as t-SNE and kernel PCA.
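For instance, here is a hedged sketch (assuming the third-party Rtsne and kernlab packages are installed; neither is part of R4ML, and the sample size, perplexity, and kernel settings are our own choices) that collects a modest sample locally and runs t-SNE and kernel PCA on it:
library(Rtsne)     # assumed to be installed separately
library(kernlab)   # assumed to be installed separately

# collect a small local sample; both methods below scale poorly with n, so keep it modest
set.seed(42)
local_mat <- as.matrix(SparkR::as.data.frame(airs_t3_tmp))
local_mat <- unique(local_mat)                               # t-SNE dislikes exact duplicates
local_mat <- local_mat[sample(nrow(local_mat), min(2000, nrow(local_mat))), ]
local_mat <- local_mat[, apply(local_mat, 2, sd) > 0]        # drop constant columns
local_mat <- scale(local_mat)                                # put features on a comparable scale

# t-SNE embedding into two dimensions
tsne_fit <- Rtsne(local_mat, dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne_fit$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")

# kernel PCA with an RBF kernel, keeping two components
kpca_fit <- kpca(local_mat, kernel = "rbfdot", features = 2)
plot(rotated(kpca_fit), xlab = "KPC 1", ylab = "KPC 2")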