Data Preprocessing

Library and dataset loading


In [16]:
source("https://raw.githubusercontent.com/eogasawara/mylibrary/master/myPreprocessing.R")
loadlibrary("RColorBrewer")
loadlibrary("dplyr")
loadlibrary("gridExtra")
loadlibrary("reshape")

col.set <- brewer.pal(11, 'Spectral')
mycolors <- col.set[c(1,3,5,7,9)]

plot_size(4, 3)

Sampling

Comparing random sampling with stratified sampling.


In [17]:
sampler <- sample.random(iris)
head(sampler$sample)

samples <- sample.stratified(iris, "Species")

tbl <- rbind(table(iris$Species), table(sampler$sample$Species), table(samples$sample$Species))
rownames(tbl) <- c("dataset", "random sample", "stratified sample")
head(tbl)

# per-species difference between the dataset counts (row 1) and each sample
tbl <- t(tbl[1,] - t(tbl))

tbl <- tbl[2:3,]

head(tbl)


    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51           7.0         3.2          4.7         1.4 versicolor
88           6.3         2.3          4.4         1.3 versicolor
139          6.0         3.0          4.8         1.8  virginica
82           5.5         2.4          3.7         1.0 versicolor
86           6.0         3.4          4.5         1.6 versicolor
69           6.2         2.2          4.5         1.5 versicolor

                  setosa versicolor virginica
dataset               50         50        50
random sample         38         43        39
stratified sample     40         40        40

                  setosa versicolor virginica
random sample         12          7        11
stratified sample     10         10        10
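
For comparison, the difference between the two sampling schemes can be sketched in base R without the helper library. This is only an illustration: the 80% fraction is inferred from the counts above (120 of 150 rows), and the variable names are hypothetical rather than taken from `sample.random` or `sample.stratified`.

```r
# Base-R sketch; the 80% fraction and all names here are assumptions.
set.seed(1)
frac <- 0.8

# Simple random sample: draw row indices without regard to class.
idx_random <- sample(nrow(iris), size = round(frac * nrow(iris)))
random_sample <- iris[idx_random, ]

# Stratified sample: draw the same fraction inside each class.
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
idx_strat <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(frac * length(idx)))
}))
strat_sample <- iris[idx_strat, ]

rbind(random = table(random_sample$Species),
      stratified = table(strat_sample$Species))
```

The random sample drifts away from the class proportions of the dataset, while the stratified sample preserves them by construction.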

Sampling data into folds

Sampling k-folds with random and stratified techniques.


In [18]:
foldsr <- sample.random_kfold(iris, k=4)
foldss <- sample.stratified_kfold(iris, "Species", k=4)

tbls <- tblr <- NULL
for (i in (1:4)) {
    tblr <- rbind(tblr, table(foldsr[[i]]$Species))
}
rownames(tblr) <- rep("random sampling", 4)
head(tblr)

for (i in (1:4)) {
    tbls <- rbind(tbls, table(foldss[[i]]$Species))
}
rownames(tbls) <- rep("stratified sampling", 4)
head(tbls)


                setosa versicolor virginica
random sampling     12         12        13
random sampling     15          9        13
random sampling     12         11        15
random sampling     11         18         9

                    setosa versicolor virginica
stratified sampling     13         13        13
stratified sampling     13         13        13
stratified sampling     12         12        12
stratified sampling     12         12        12
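
A minimal base-R sketch of the same idea follows. It assigns a fold label to every row instead of returning a list of data frames, which is a simplification and not necessarily how `sample.random_kfold` or `sample.stratified_kfold` work internally; the variable names are hypothetical.

```r
# Base-R sketch; fold labels instead of a list of folds is an assumption.
set.seed(1)
k <- 4

# Random k-fold: shuffle a balanced vector of fold labels over all rows.
fold_random <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stratified k-fold: shuffle fold labels separately inside each class.
fold_strat <- integer(nrow(iris))
for (idx in split(seq_len(nrow(iris)), iris$Species)) {
  fold_strat[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

table(fold_random, iris$Species)
table(fold_strat, iris$Species)
```

With 50 rows per class and four folds, the stratified assignment can only produce per-class counts of 12 or 13 in each fold, which is the pattern seen in the output above.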

Outlier analysis

Under the box-plot rule, an outlier is a value lower than $Q_1 - 1.5 \cdot IQR$ or higher than $Q_3 + 1.5 \cdot IQR$, where $IQR = Q_3 - Q_1$.


In [19]:
out <- outliers.boxplot(iris)
myiris <- iris[!out,]
head(iris[out,])


   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
16          5.7         4.4          1.5         0.4     setosa
33          5.2         4.1          1.5         0.1     setosa
34          5.5         4.2          1.4         0.2     setosa
61          5.0         2.0          3.5         1.0 versicolor
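
The rule can be written out in a few lines of base R. The sketch below assumes that a row is flagged when any of its numeric columns falls outside the fences; whether `outliers.boxplot` combines columns in exactly this way is not confirmed by the notebook, and `is_boxplot_outlier` is a hypothetical helper name.

```r
# TRUE where x lies outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
is_boxplot_outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Assumption: flag a row when any numeric column holds an outlier.
numeric_cols <- sapply(iris, is.numeric)
flags <- sapply(iris[, numeric_cols], is_boxplot_outlier)
out_rows <- apply(flags, 1, any)

iris[out_rows, ]
```

On `iris`, only `Sepal.Width` has values beyond its fences, so all flagged rows come from that column.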

Normalization

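Normalization rescales variables to a comparable range so that no variable dominates merely because of its units or magnitude.

Several machine learning methods expect normalized inputs, so it is often applied before training.

Min-Max: rescales each variable so that its minimum maps to 0 and its maximum maps to 1.

Z-Score: rescales each variable to mean 0 and standard deviation 1.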
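
Both rescalings are short enough to write directly in base R. The sketch below is illustrative only: `minmax` and `zscore` are hypothetical helpers, and the `nmean` and `nsd` arguments merely mimic the names used by `normalize.zscore` in the next cell, under the assumption that they set the target mean and standard deviation.

```r
# Min-max: (x - min) / (max - min), so the result lies in [0, 1].
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Z-score: (x - mean) / sd, optionally shifted and scaled to a target
# mean and standard deviation (assumed meaning of nmean and nsd).
zscore <- function(x, nmean = 0, nsd = 1) (x - mean(x)) / sd(x) * nsd + nmean

sw <- iris$Sepal.Width
range(minmax(sw))
c(mean = mean(zscore(sw)), sd = sd(zscore(sw)))
```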


In [20]:
myirisM <- normalize.minmax(iris)

myirisZ <- normalize.zscore(iris)

myirisZS <- normalize.zscore(iris, nmean=0.5, nsd=0.5/2.698)


grfA <- plot.density(iris %>% select(variable="Sepal.Width", value=Sepal.Width), color=mycolors[1]) 
grfB <- plot.density(myirisM$data %>% select(variable="Sepal.Width", value=Sepal.Width), color=mycolors[1]) + xlim(0,1)
grfC <- plot.density(myirisZ$data %>% select(variable="Sepal.Width", value=Sepal.Width), color=mycolors[1]) 
grfD <- plot.density(myirisZS$data %>% select(variable="Sepal.Width", value=Sepal.Width), color=mycolors[1]) + xlim(0,1)

plot_size(8, 3)
grid.arrange(grfA, grfB, grfC, grfD, ncol=4)
plot_size(4, 3)


Warning message:
"Removed 1 rows containing non-finite values (stat_density)."

PCA

PCA (principal component analysis) finds an orthogonal projection whose first components capture the largest amount of variance in the data.


In [21]:
head(iris[,1:4])

mypca <- dt.pca(iris, "Species")
head(mypca$pca)

head(mypca$transf$pca.transf)

plot.scatter(mypca$pca %>% select(x=PC1, value=PC2, variable=Species), colors=mycolors[1:3])


Sepal.Length Sepal.Width Petal.Length Petal.Width
         5.1         3.5          1.4         0.2
         4.9         3.0          1.4         0.2
         4.7         3.2          1.3         0.2
         4.6         3.1          1.5         0.2
         5.0         3.6          1.4         0.2
         5.4         3.9          1.7         0.4

     PC1       PC2 Species
2.640270 -5.204041  setosa
2.670730 -4.666910  setosa
2.454606 -4.773636  setosa
2.545517 -4.648463  setosa
2.561228 -5.258629  setosa
2.975946 -5.707321  setosa

                    PC1         PC2
Sepal.Length  0.5210659 -0.37741762
Sepal.Width  -0.2693474 -0.92329566
Petal.Length  0.5804131 -0.02449161
Petal.Width   0.5648565 -0.06694199
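
For readers without the helper library, a comparable decomposition is available through `prcomp` from base R's `stats` package. The sketch below assumes centered and scaled inputs; the internals of `dt.pca` are not shown in the notebook, so the scores may differ from the ones above (for example, if the library does not center the data), and component signs are arbitrary.

```r
# PCA on the four numeric columns, centered and scaled to unit variance.
X <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component.
summary(pca)$importance["Proportion of Variance", ]

# Loadings (projection matrix) and scores for the first two components.
pca$rotation[, 1:2]
head(pca$x[, 1:2])
```

The first two components account for most of the variance, which is why the scatter plot of PC1 against PC2 separates the species reasonably well.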

Discretization & smoothing

Discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts.

Smoothing is a technique that creates an approximating function that attempts to capture important patterns in the data while leaving out noise or other fine-scale structures/rapid phenomena.

An important step in both discretization and smoothing is choosing the bins over which the approximation is computed.


In [22]:
bi <- smoothing.interval(iris$Sepal.Length, n=2)
bf <- smoothing.freq(iris$Sepal.Length, n=2)
bc <- smoothing.cluster(iris$Sepal.Length, n=2)

show_row(c('interval: ', sprintf("%.1f",bi$interval), 'entropy: ', sprintf("%.2f",entropy_group(bi$bins_factor, iris$Species))))
show_row(c('freq: ', sprintf("%.1f",bf$interval), 'entropy: ', sprintf("%.2f",entropy_group(bf$bins_factor, iris$Species))))
show_row(c('cluster: ', sprintf("%.1f",bc$interval), 'entropy: ', sprintf("%.2f",entropy_group(bc$bins_factor, iris$Species))))


interval: 4.3 6.1 7.9 entropy: 1.19
freq: 4.3 5.8 7.9 entropy: 1.10
cluster: 4.3 5.9 7.9 entropy: 1.10
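
The three binning strategies can be approximated in base R as follows. This is a sketch under assumptions: the break-point rules below (equal-width intervals, quantiles, and midpoints between sorted k-means centers) are common definitions, not a confirmed description of `smoothing.interval`, `smoothing.freq`, or `smoothing.cluster`.

```r
x <- iris$Sepal.Length
n <- 2

# Equal-width: n intervals of the same length between min and max.
breaks_width <- seq(min(x), max(x), length.out = n + 1)

# Equal-frequency: breaks at quantiles so bins hold similar counts.
breaks_freq <- quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE)

# Cluster-based (assumed rule): 1-D k-means, breaks midway between centers.
set.seed(1)
centers <- sort(kmeans(x, centers = n)$centers[, 1])
breaks_clust <- c(min(x), (head(centers, -1) + tail(centers, -1)) / 2, max(x))

# Assign each value to a bin and cross-tabulate against the class.
bins <- cut(x, breaks = breaks_width, include.lowest = TRUE)
table(bins, iris$Species)
```

The equal-width and equal-frequency breaks reproduce the interval and freq rows above; the cluster break depends on the clustering algorithm and its initialization, so it may differ slightly.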

Optimizing the smoothing using bins based on frequencies


In [23]:
bsl <- smoothing.opt(iris$Sepal.Length, smoothing=smoothing.freq)
bsw <- smoothing.opt(iris$Sepal.Width, smoothing=smoothing.freq)
bpl <- smoothing.opt(iris$Petal.Length, smoothing=smoothing.freq)
bpw <- smoothing.opt(iris$Petal.Width, smoothing=smoothing.freq)


show_row(c('Sepal.Length: ', sprintf("%.1f",bsl$interval), 'entropy: ', sprintf("%.2f",entropy_group(bsl$bins_factor, iris$Species))))
show_row(c('Sepal.Width: ', sprintf("%.1f",bsw$interval), 'entropy: ', sprintf("%.2f",entropy_group(bsw$bins_factor, iris$Species))))
show_row(c('Petal.Length: ', sprintf("%.1f",bpl$interval), 'entropy: ', sprintf("%.2f",entropy_group(bpl$bins_factor, iris$Species))))
show_row(c('Petal.Width: ', sprintf("%.1f",bpw$interval), 'entropy: ', sprintf("%.2f",entropy_group(bpw$bins_factor, iris$Species))))


Sepal.Length: 4.3 5.0 5.4 5.8 6.3 6.7 7.9 entropy: 0.87
Sepal.Width: 2.0 2.7 2.9 3.0 3.2 3.4 4.4 entropy: 1.19
Petal.Length: 1.0 1.5 3.9 4.6 5.3 6.9 entropy: 0.39
Petal.Width: 0.1 0.2 1.2 1.5 1.9 2.5 entropy: 0.38
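
The entropy printed next to each set of bins indicates how mixed the species are inside the bins, with lower values meaning purer bins. One plausible definition is the weighted average of the per-bin Shannon entropy in bits. The sketch below implements that definition under the hypothetical name `group_entropy`; it is an assumption about what `entropy_group` computes, not a copy of it.

```r
# Weighted average, over bins, of the Shannon entropy (in bits) of the
# class distribution inside each bin.
group_entropy <- function(bins, classes) {
  counts <- table(bins, classes)
  bin_entropy <- apply(counts, 1, function(row) {
    p <- row[row > 0] / sum(row)   # class proportions, skipping empty cells
    -sum(p * log2(p))
  })
  sum(rowSums(counts) / sum(counts) * bin_entropy)
}

# Example: two equal-frequency bins on Petal.Width.
x <- iris$Petal.Width
bins <- cut(x, breaks = quantile(x, probs = c(0, 0.5, 1)), include.lowest = TRUE)
group_entropy(bins, iris$Species)
```

Under this definition, bins that each contain a single species give 0, and a single bin holding all three balanced species gives $\log_2 3 \approx 1.58$.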

Optimizing the smoothing using bins based on clusters


In [24]:
bsl <- smoothing.opt(iris$Sepal.Length, smoothing=smoothing.cluster)
bsw <- smoothing.opt(iris$Sepal.Width, smoothing=smoothing.cluster)
bpl <- smoothing.opt(iris$Petal.Length, smoothing=smoothing.cluster)
bpw <- smoothing.opt(iris$Petal.Width, smoothing=smoothing.cluster)


show_row(c('Sepal.Length: ', sprintf("%.1f",bsl$interval), 'entropy: ', sprintf("%.2f",entropy_group(bsl$bins_factor, iris$Species))))
show_row(c('Sepal.Width: ', sprintf("%.1f",bsw$interval), 'entropy: ', sprintf("%.2f",entropy_group(bsw$bins_factor, iris$Species))))
show_row(c('Petal.Length: ', sprintf("%.1f",bpl$interval), 'entropy: ', sprintf("%.2f",entropy_group(bpl$bins_factor, iris$Species))))
show_row(c('Petal.Width: ', sprintf("%.1f",bpw$interval), 'entropy: ', sprintf("%.2f",entropy_group(bpw$bins_factor, iris$Species))))


Sepal.Length: 4.3 4.8 5.3 5.8 6.3 7.0 7.9 entropy: 0.85
Sepal.Width: 2.0 2.8 3.2 3.6 3.8 4.1 4.4 entropy: 1.22
Petal.Length: 1.0 2.4 3.8 4.5 5.4 6.9 entropy: 0.27
Petal.Width: 0.1 0.2 0.8 1.6 2.0 2.5 entropy: 0.24

Balancing datasets

The first statement artificially unbalances the dataset by keeping 20 setosa, 50 versicolor, and 11 virginica rows.

Oversampling and subsampling are used to correct the unbalanced dataset.


In [25]:
# forcing an unbalanced dataset
myiris <- iris[c(1:20,51:100, 110:120),]
myiris.bo <- balance.oversampling(myiris, "Species")
myiris.bs <- balance.subsampling(myiris, "Species")
tbl <- rbind(table(myiris$Species), table(myiris.bo$Species), table(myiris.bs$Species))
rownames(tbl) <- c('unbalanced', 'oversampling', 'subsampling')
head(tbl)


             setosa versicolor virginica
unbalanced       20         50        11
oversampling     50         50        50
subsampling      11         11        11
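
A simple way to obtain the same class counts in base R is random resampling. The sketch below duplicates rows with replacement for oversampling and drops rows at random for subsampling; the library may generate its extra rows differently (for instance, with synthetic examples), so this is an assumption, and the variable names are hypothetical.

```r
set.seed(1)
myiris <- iris[c(1:20, 51:100, 110:120), ]
idx_by_class <- split(seq_len(nrow(myiris)), myiris$Species)
sizes <- lengths(idx_by_class)

# Oversampling: add rows drawn with replacement up to the largest class size.
idx_over <- unlist(lapply(idx_by_class, function(idx) {
  extra <- sample.int(length(idx), max(sizes) - length(idx), replace = TRUE)
  c(idx, idx[extra])
}))

# Subsampling: keep a random subset sized to the smallest class.
idx_sub <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), min(sizes))]
}))

rbind(unbalanced   = table(myiris$Species),
      oversampling = table(myiris[idx_over, "Species"]),
      subsampling  = table(myiris[idx_sub, "Species"]))
```

Oversampling keeps every original row but repeats minority-class rows, whereas subsampling discards majority-class rows and therefore loses information.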

Categorical mapping

A categorical attribute with $n$ distinct values is mapped into $n$ binary attributes.

It is also possible to map into $n-1$ binary attributes. In that case, a row in which all binary attributes are zero represents the one categorical value that has no attribute of its own.


In [26]:
mycm <- dt.categ_mapping(sampler$sample, "Species")
head(mycm)


    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Speciessetosa Speciesversicolor Speciesvirginica
51           7.0         3.2          4.7         1.4 versicolor             0                 1                0
88           6.3         2.3          4.4         1.3 versicolor             0                 1                0
139          6.0         3.0          4.8         1.8  virginica             0                 0                1
82           5.5         2.4          3.7         1.0 versicolor             0                 1                0
86           6.0         3.4          4.5         1.6 versicolor             0                 1                0
69           6.2         2.2          4.5         1.5 versicolor             0                 1                0
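
Both encodings can be produced with `model.matrix` from base R's `stats` package. This is a sketch and not necessarily what `dt.categ_mapping` does internally. Note that with R's default treatment contrasts the $n-1$ encoding drops the first level (setosa here), so setosa becomes the all-zero row.

```r
# n binary attributes: one indicator column per level (no intercept).
full <- model.matrix(~ Species - 1, data = iris)

# n - 1 binary attributes: drop the intercept column; the first level
# (setosa) is then represented by zeros in every remaining column.
reduced <- model.matrix(~ Species, data = iris)[, -1]

head(cbind(iris["Species"], full))
head(cbind(iris["Species"], reduced))
```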
