Load the libraries
In [1]:
library(ggplot2)
library(dplyr)
library(FNN)
A function that takes the data and a sampling rate, runs the KNN algorithm for k = 3 to 20, and returns a data frame containing the k value, the sampling rate, and the misclassification rate (in percent) for each run
In [2]:
modifiedKNN = function(creditData, samplingRate){
  # One row per k value: k, sampling rate, misclassification rate (%)
  observations = data.frame(kVal = double(), samplingRateVal = double(), misClassificationRateVal = double())
  rowCount = nrow(creditData)
  # Row indices (not counts) of the training sample; floor() keeps the sample size a whole number
  trainingDataRowsCount = sample(1:rowCount, floor(samplingRate * rowCount), replace = FALSE)
  trainingData = subset(creditData[trainingDataRowsCount, ], select = c(Age, Job, Credit.amount, Duration))
  trainingLabels = creditData$Credit.Risks[trainingDataRowsCount]
  # Every row that was not sampled for training is used for testing
  testingDataRowsCount = setdiff(1:rowCount, trainingDataRowsCount)
  testingData = subset(creditData[testingDataRowsCount, ], select = c(Age, Job, Credit.amount, Duration))
  testingLabels = creditData$Credit.Risks[testingDataRowsCount]
  for(k in 3:20){
    predictedLabels = knn(trainingData, testingData, trainingLabels, k)
    # Share of test rows whose predicted label differs from the true label, in percent
    incorrectLabels = sum(predictedLabels != testingLabels)
    misClassificationRate = (incorrectLabels / length(testingDataRowsCount)) * 100
    tempResult = data.frame(kVal = k, samplingRateVal = samplingRate, misClassificationRateVal = misClassificationRate)
    observations = rbind(observations, tempResult)
  }
  return (observations)
}
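The split and the error calculation inside `modifiedKNN` can be checked in isolation. The following is a minimal sketch on made-up values (the row count, labels, and variable names `trainIdx`, `testIdx`, and `misRate` are assumptions for illustration, not part of the notebook); it uses base R only and does not call `knn`.

```r
set.seed(1)
rowCount <- 10        # hypothetical number of rows
samplingRate <- 0.5   # half of the rows go to training

# Draw training row indices without replacement; the rest become test rows.
trainIdx <- sample(1:rowCount, floor(samplingRate * rowCount), replace = FALSE)
testIdx  <- setdiff(1:rowCount, trainIdx)

# Misclassification rate in percent, computed the same way as in the loop:
# here one of three made-up predictions differs from the true label.
predicted <- c("good", "bad", "good")
actual    <- c("good", "good", "good")
misRate <- sum(predicted != actual) / length(actual) * 100
```

The two index sets are disjoint and together cover every row, so no observation is used for both training and testing.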
Load the German Credit Data
In [3]:
creditData = read.csv(file="dataSets/german_credit_data1.csv", header=TRUE, sep=",")
Show the data
In [4]:
head(creditData)
Show the summary of the data
In [5]:
summary(creditData)
Select only the relevant columns from the data
In [6]:
creditData = creditData %>% select(Age,Job,Credit.amount,Duration,Credit.Risks)
Remove all rows that contain NA values in the selected columns
In [7]:
creditData = na.omit(creditData)
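As a small illustration of what `na.omit` does, the sketch below uses a made-up three-row data frame (`toy` and its values are assumptions, not the credit data). It counts the missing values per column first, then drops every row that has at least one NA.

```r
# Hypothetical data with one NA in each column.
toy <- data.frame(Age = c(25, NA, 40), Duration = c(12, 24, NA))

# Number of missing values per column before cleaning.
naPerColumn <- colSums(is.na(toy))

# na.omit drops any row that contains at least one NA.
cleaned <- na.omit(toy)
nrow(cleaned)
```

Only the first row survives here, because each of the other two rows is missing one value.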
Show the data
In [8]:
head(creditData)
Summarize the data again to check what has changed
In [9]:
summary(creditData)
Set a seed value
In [10]:
set.seed(1234)
This is the main loop. It runs modifiedKNN for sampling rates from 0.5 to 0.9 and collects the misclassification rate for each combination of k value and sampling rate
In [11]:
observations = data.frame(kVal = double(),samplingRateVal = double(),misClassificationRateVal = double())
for(samplingRate in seq(0.5, 0.9, by = 0.1)){
  tempResult = modifiedKNN(creditData, samplingRate)
  observations = rbind(observations, tempResult)
}
Print the observations we get from different values of sampling rates and k values
In [12]:
observations
Plot the misclassification rate against the k value for all sampling rates
In [13]:
plot(observations$kVal,observations$misClassificationRateVal, pch=19, xlim=c(0,20), ylim=c(0,60),xlab="K Value", ylab="Mis-classification Rate",main="Plot between K Value and Mis-classification Rate")
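The plot above draws all sampling rates in one color, so points from different rates cannot be told apart. One possible variation, sketched here with base R graphics on made-up results (`toyObs` and its values are assumptions in the same shape as `observations`), colors the points by sampling rate and adds a legend.

```r
# Hypothetical results in the same shape as `observations`.
toyObs <- data.frame(
  kVal = rep(3:5, times = 2),
  samplingRateVal = rep(c(0.5, 0.9), each = 3),
  misClassificationRateVal = c(35, 33, 31, 30, 32, 29)
)

# Turn the sampling rate into a factor so each rate maps to one color index.
rateFactor <- factor(toyObs$samplingRateVal)

plot(toyObs$kVal, toyObs$misClassificationRateVal,
     pch = 19, col = as.integer(rateFactor),
     xlab = "K Value", ylab = "Mis-classification Rate (%)")
legend("topright", legend = levels(rateFactor),
       col = seq_along(levels(rateFactor)), pch = 19, title = "Sampling rate")
```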
Arrange the values obtained above in ascending order of misclassification rate
In [14]:
minObservations = observations %>%
  arrange(misClassificationRateVal)
Show the values
In [15]:
minObservations
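Once the results are sorted, it can be useful to see the best k for each sampling rate rather than only the overall minimum. The following is a minimal base R sketch on made-up results (`toyObs`, `sorted`, and `bestPerRate` are illustrative names, not part of the notebook): it sorts by error and keeps the first row seen for each sampling rate.

```r
# Hypothetical results in the same shape as `observations`.
toyObs <- data.frame(
  kVal = rep(3:5, times = 2),
  samplingRateVal = rep(c(0.5, 0.9), each = 3),
  misClassificationRateVal = c(35, 33, 31, 30, 32, 29)
)

# Sort ascending by error, then keep the first (lowest-error) row per sampling rate.
sorted <- toyObs[order(toyObs$misClassificationRateVal), ]
bestPerRate <- sorted[!duplicated(sorted$samplingRateVal), ]
bestPerRate
```

Because the error is measured on the same test rows that are used to pick k, the lowest value found this way is likely to be somewhat optimistic.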
Here we apply min-max normalization. As the data above shows, the Credit.amount and Job columns are on very different scales. Since KNN is based on distances between points, the column with the largest values can dominate the result, so normalizing the data may give a better fit
Define a Min Max normalization function
In [16]:
minMaxNormalization <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
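A short worked example shows what the function returns. The formula is x' = (x - min(x)) / (max(x) - min(x)), so the smallest value maps to 0 and the largest to 1. The guarded variant `safeMinMax` below is a hypothetical helper added for illustration only; it is an assumption, not part of the notebook.

```r
minMaxNormalization <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# (10 - 10) / 20, (20 - 10) / 20, (30 - 10) / 20
scaled <- minMaxNormalization(c(10, 20, 30))   # 0.0 0.5 1.0

# A constant column has max(x) == min(x), so every entry is 0 / 0, which R evaluates to NaN.
constantScaled <- minMaxNormalization(c(5, 5, 5))

# Hypothetical guarded variant: map a constant column to zeros instead of NaN.
safeMinMax <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))
  (x - min(x)) / rng
}
safeMinMax(c(5, 5, 5))   # 0 0 0
```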
Normalize every column except Credit.Risks, since that column holds our labels
In [17]:
normCreditData = as.data.frame(lapply(creditData[,1:4], minMaxNormalization))
normCreditData$Credit.Risks = creditData$Credit.Risks
Show the data after normalization
In [18]:
head(normCreditData)
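Note that the normalization above uses the minimum and maximum of the full data set, before it is split into training and test rows, so the test rows influence the scaling. One possible alternative, sketched here on made-up values, is to compute the range on the training rows only and reuse it for the test rows. The helper `scaleWith` and the data frames `train` and `test` are assumptions for illustration.

```r
# Hypothetical training and test rows.
train <- data.frame(Age = c(20, 30, 40), Credit.amount = c(1000, 3000, 5000))
test  <- data.frame(Age = c(25, 50),     Credit.amount = c(2000, 6000))

# Per-column minimum and maximum, taken from the training rows only.
mins <- sapply(train, min)
maxs <- sapply(train, max)

# Apply min-max scaling column by column with a fixed range.
scaleWith <- function(df, mins, maxs) {
  as.data.frame(Map(function(col, lo, hi) (col - lo) / (hi - lo), df, mins, maxs))
}

trainScaled <- scaleWith(train, mins, maxs)
testScaled  <- scaleWith(test, mins, maxs)
```

With this approach, test values can fall outside the interval from 0 to 1 when they lie outside the training range, as the Age value of 50 does here.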
Set the seed value
In [19]:
set.seed(2137)
This is the main loop again, now on the normalized data. It runs modifiedKNN for sampling rates from 0.5 to 0.9 and collects the misclassification rate for each combination of k value and sampling rate
In [20]:
observationsNorm = data.frame(kVal = double(),samplingRateVal = double(),misClassificationRateVal = double())
for(samplingRate in seq(0.5, 0.9, by = 0.1)){
  tempResultNorm = modifiedKNN(normCreditData, samplingRate)
  observationsNorm = rbind(observationsNorm, tempResultNorm)
}
Show the data obtained from the above run
In [21]:
observationsNorm
Plot the misclassification rate against the k value for all sampling rates
In [22]:
plot(observationsNorm$kVal,observationsNorm$misClassificationRateVal, pch=19, xlim=c(0,20), ylim=c(0,60),xlab="K Value", ylab="Mis-classification Rate",main="Plot between K Value and Mis-classification Rate after Normalization(Min-Max)")
Sort the data in ascending order of mis-classification rate
In [23]:
minObservationsNorm = observationsNorm %>%
  arrange(misClassificationRateVal)
Show the data
In [24]:
minObservationsNorm
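To compare the raw and normalized runs side by side, the two result tables can be joined on k value and sampling rate. The sketch below uses made-up numbers (`rawObs`, `normObs`, and their values are assumptions, not results from this notebook). Keep in mind that the two runs above use different seeds (1234 and 2137), so their train/test splits differ and part of any difference may come from the split rather than from normalization.

```r
# Hypothetical results in the same shape as `observations` and `observationsNorm`.
rawObs <- data.frame(kVal = 3:5, samplingRateVal = 0.5,
                     misClassificationRateVal = c(34, 33, 32))
normObs <- data.frame(kVal = 3:5, samplingRateVal = 0.5,
                      misClassificationRateVal = c(31, 33, 30))

# Join on both keys; the suffixes mark which run each error column came from.
comparison <- merge(rawObs, normObs,
                    by = c("kVal", "samplingRateVal"),
                    suffixes = c(".raw", ".norm"))

# Positive values mean the normalized run had the lower error.
comparison$improvement <- comparison$misClassificationRateVal.raw -
  comparison$misClassificationRateVal.norm
comparison
```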