DIC LAB 3 Problem 2: KNN algorithm on German Credit Data

Load the libraries


In [1]:
library(ggplot2)
library(dplyr)
library(FNN)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

A function that takes in the data and a sampling rate, runs the KNN algorithm for k = 3 to 20, and returns a data frame containing the sampling rate, the k value, and the resulting mis-classification rate


In [2]:
modifiedKNN = function(creditData, samplingRate){

    observations = data.frame(kVal = double(), samplingRateVal = double(), misClassificationRateVal = double())

    rowCount = nrow(creditData)

    # Sample row indices for the training set; the remaining rows form the test set
    trainingRows = sample(1:rowCount, samplingRate * rowCount, replace = FALSE)
    trainingData = subset(creditData[trainingRows, ], select = c(Age, Job, Credit.amount, Duration))
    trainingLabels = creditData$Credit.Risks[trainingRows]

    testingRows = setdiff(1:rowCount, trainingRows)
    testingData = subset(creditData[testingRows, ], select = c(Age, Job, Credit.amount, Duration))
    testingLabels = creditData$Credit.Risks[testingRows]

    # Run KNN for k = 3..20 and record the mis-classification rate as a percentage
    for(k in 3:20){
        predictedLabels = knn(trainingData, testingData, trainingLabels, k)
        incorrectLabels = sum(predictedLabels != testingLabels)
        misClassificationRate = (incorrectLabels / length(testingRows)) * 100

        tempResult = data.frame(kVal = k, samplingRateVal = samplingRate, misClassificationRateVal = misClassificationRate)
        observations = rbind(observations, tempResult)
    }
    return(observations)
}

Load the German Credit Data


In [3]:
creditData = read.csv(file="dataSets/german_credit_data1.csv", header=TRUE, sep=",")

Show the data


In [4]:
head(creditData)


X.  Age  Sex     Job  Housing  Saving.accounts  Checking.account  Credit.amount  Duration  Purpose              Credit.Risks
0   67   male    2    own      NA               little            1169           6         radio/TV             1
1   22   female  2    own      little           moderate          5951           48        radio/TV             2
2   49   male    1    own      little           NA                2096           12        education            1
3   45   male    2    free     little           little            7882           42        furniture/equipment  1
4   53   male    2    free     little           little            4870           24        car                  2
5   35   male    1    free     NA               NA                9055           36        education            1

Show the summary of the data


In [5]:
summary(creditData)


       X.             Age            Sex           Job        Housing   
 Min.   :  0.0   Min.   :19.00   female:310   Min.   :0.000   free:108  
 1st Qu.:249.8   1st Qu.:27.00   male  :690   1st Qu.:2.000   own :713  
 Median :499.5   Median :33.00                Median :2.000   rent:179  
 Mean   :499.5   Mean   :35.55                Mean   :1.904             
 3rd Qu.:749.2   3rd Qu.:42.00                3rd Qu.:2.000             
 Max.   :999.0   Max.   :75.00                Max.   :3.000             
                                                                        
   Saving.accounts Checking.account Credit.amount      Duration   
 little    :603    little  :274     Min.   :  250   Min.   : 4.0  
 moderate  :103    moderate:269     1st Qu.: 1366   1st Qu.:12.0  
 quite rich: 63    rich    : 63     Median : 2320   Median :18.0  
 rich      : 48    NA's    :394     Mean   : 3271   Mean   :20.9  
 NA's      :183                     3rd Qu.: 3972   3rd Qu.:24.0  
                                    Max.   :18424   Max.   :72.0  
                                                                  
                Purpose     Credit.Risks
 car                :337   Min.   :1.0  
 radio/TV           :280   1st Qu.:1.0  
 furniture/equipment:181   Median :1.0  
 business           : 97   Mean   :1.3  
 education          : 59   3rd Qu.:2.0  
 repairs            : 22   Max.   :2.0  
 (Other)            : 24                

Select only the relevant data fields from the given data


In [6]:
creditData = creditData %>% select(Age,Job,Credit.amount,Duration,Credit.Risks)

Remove all rows that contain NA values in the selected data


In [7]:
creditData = na.omit(creditData)

Show the data


In [8]:
head(creditData)


Age  Job  Credit.amount  Duration  Credit.Risks
67   2    1169           6         1
22   2    5951           48        2
49   1    2096           12        1
45   2    7882           42        1
53   2    4870           24        2
35   1    9055           36        1

Summarize the data again and check whether anything has changed


In [9]:
summary(creditData)


      Age             Job        Credit.amount      Duration     Credit.Risks
 Min.   :19.00   Min.   :0.000   Min.   :  250   Min.   : 4.0   Min.   :1.0  
 1st Qu.:27.00   1st Qu.:2.000   1st Qu.: 1366   1st Qu.:12.0   1st Qu.:1.0  
 Median :33.00   Median :2.000   Median : 2320   Median :18.0   Median :1.0  
 Mean   :35.55   Mean   :1.904   Mean   : 3271   Mean   :20.9   Mean   :1.3  
 3rd Qu.:42.00   3rd Qu.:2.000   3rd Qu.: 3972   3rd Qu.:24.0   3rd Qu.:2.0  
 Max.   :75.00   Max.   :3.000   Max.   :18424   Max.   :72.0   Max.   :2.0  

Set a seed value


In [10]:
set.seed(1234)

This is the main loop: it runs the KNN function for sampling rates from 0.5 to 0.9 and collects the mis-classification rate for every combination of k value and sampling rate


In [11]:
observations = data.frame(kVal = double(),samplingRateVal = double(),misClassificationRateVal = double())

for(samplingRate in seq(0.5, 0.9, by = 0.1)){
    tempResult = modifiedKNN(creditData,samplingRate)
    observations = rbind(observations,tempResult)
}

Print the observations obtained for the different sampling rates and k values


In [12]:
observations


kVal  samplingRateVal  misClassificationRateVal
3 0.5 38.20
4 0.5 34.80
5 0.5 39.00
6 0.5 33.40
7 0.5 35.00
8 0.5 30.60
9 0.5 33.40
10 0.5 31.40
11 0.5 32.40
12 0.5 30.60
13 0.5 30.80
14 0.5 30.20
15 0.5 30.60
16 0.5 29.80
17 0.5 30.40
18 0.5 30.00
19 0.5 31.20
20 0.5 30.40
3 0.6 34.50
4 0.6 30.25
5 0.6 30.75
6 0.6 30.75
7 0.6 30.75
8 0.6 29.75
9 0.6 31.00
10 0.6 30.00
11 0.6 31.25
12 0.6 30.00
13 0.6 30.00
14 0.6 30.25
9 0.8 33.0
10 0.8 30.5
11 0.8 35.5
12 0.8 32.5
13 0.8 35.5
14 0.8 27.5
15 0.8 30.0
16 0.8 27.5
17 0.8 30.5
18 0.8 30.0
19 0.8 30.5
20 0.8 30.5
3 0.9 44.0
4 0.9 34.0
5 0.9 35.0
6 0.9 33.0
7 0.9 33.0
8 0.9 31.0
9 0.9 30.0
10 0.9 29.0
11 0.9 30.0
12 0.9 29.0
13 0.9 28.0
14 0.9 29.0
15 0.9 30.0
16 0.9 31.0
17 0.9 33.0
18 0.9 31.0
19 0.9 33.0
20 0.9 32.0

Plot the mis-classification rate against the k value for all sampling rates


In [13]:
plot(observations$kVal,observations$misClassificationRateVal, pch=19, xlim=c(0,20), ylim=c(0,60),xlab="K Value", ylab="Mis-classification Rate",main="Plot between K Value and Mis-classification Rate")


Arrange the values obtained above in ascending order of mis-classification rate


In [14]:
minObservations = observations %>%
    arrange(misClassificationRateVal)

Show the values


In [15]:
minObservations


kVal  samplingRateVal  misClassificationRateVal
14 0.8 27.50000
16 0.8 27.50000
13 0.9 28.00000
20 0.6 28.75000
10 0.9 29.00000
12 0.9 29.00000
14 0.9 29.00000
20 0.7 29.33333
6 0.8 29.50000
12 0.7 29.66667
8 0.6 29.75000
16 0.5 29.80000
18 0.5 30.00000
10 0.6 30.00000
12 0.6 30.00000
13 0.6 30.00000
18 0.6 30.00000
13 0.7 30.00000
14 0.7 30.00000
15 0.7 30.00000
17 0.7 30.00000
18 0.7 30.00000
15 0.8 30.00000
18 0.8 30.00000
9 0.9 30.00000
11 0.9 30.00000
15 0.9 30.00000
14 0.5 30.20000
4 0.6 30.25000
14 0.6 30.25000
8 0.7 31.66667
15 0.6 32.00000
9 0.7 32.00000
20 0.9 32.00000
11 0.5 32.40000
5 0.8 32.50000
12 0.8 32.50000
7 0.7 32.66667
4 0.7 33.00000
9 0.8 33.00000
6 0.9 33.00000
7 0.9 33.00000
17 0.9 33.00000
19 0.9 33.00000
6 0.5 33.40000
9 0.5 33.40000
5 0.7 34.00000
4 0.9 34.00000
3 0.6 34.50000
3 0.7 34.66667
4 0.5 34.80000
7 0.5 35.00000
5 0.9 35.00000
7 0.8 35.50000
11 0.8 35.50000
13 0.8 35.50000
3 0.8 36.00000
3 0.5 38.20000
5 0.5 39.00000
3 0.9 44.00000

Conclusion

As we can see from the above data, the lowest mis-classification rate (27.5%) is obtained when the sampling rate is 0.8 and the k value is 14 or 16.
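Reading the best setting off the sorted table can also be done programmatically. A minimal sketch with dplyr, using a small hand-made frame shaped like `observations` (the values are copied from the table above): `filter` keeps every row that attains the minimum rate, so ties such as k = 14 and k = 16 are both reported.

```r
library(dplyr)

# A few rows shaped like the observations frame above (values copied from the sorted table)
observations <- data.frame(
    kVal = c(14, 16, 13, 20),
    samplingRateVal = c(0.8, 0.8, 0.9, 0.6),
    misClassificationRateVal = c(27.5, 27.5, 28.0, 28.75)
)

# Keep every row that attains the minimum mis-classification rate
bestObservations <- observations %>%
    filter(misClassificationRateVal == min(misClassificationRateVal))

bestObservations
# both k = 14 and k = 16 at sampling rate 0.8 are returned
```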

Normalization

Here we apply min-max normalization, which rescales each feature to the [0, 1] range via x' = (x - min(x)) / (max(x) - min(x)). As the summary above shows, Credit.amount spans a far larger range than Job, so the distances computed by KNN are dominated by Credit.amount. Rescaling the features may therefore give a better fit.

Define a Min Max normalization function


In [16]:
minMaxNormalization <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}
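A quick sanity check of the function on a toy vector (the numbers here are arbitrary, not from the data set): the minimum should map to 0, the maximum to 1, and everything else should fall in between.

```r
minMaxNormalization <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}

# Toy vector: the smallest value maps to 0, the largest to 1
normalized <- minMaxNormalization(c(250, 2320, 18424))
normalized
```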

Normalize the data except the Credit Risk data since those are our labels


In [17]:
normCreditData = as.data.frame(lapply(creditData[,1:4], minMaxNormalization))
normCreditData$Credit.Risks = creditData$Credit.Risks

Show the data after normalization


In [18]:
head(normCreditData)


Age         Job        Credit.amount  Duration    Credit.Risks
0.85714286  0.6666667  0.05056674     0.02941176  1
0.05357143  0.6666667  0.31368989     0.64705882  2
0.53571429  0.3333333  0.10157368     0.11764706  1
0.46428571  0.6666667  0.41994057     0.55882353  1
0.60714286  0.6666667  0.25420931     0.29411765  2
0.28571429  0.3333333  0.48448333     0.47058824  1

Set the seed value


In [19]:
set.seed(2137)

This is the main loop again: it runs the KNN function on the normalized data for sampling rates from 0.5 to 0.9 and collects the mis-classification rate for every combination of k value and sampling rate


In [20]:
observationsNorm = data.frame(kVal = double(),samplingRateVal = double(),misClassificationRateVal = double())

for(samplingRate in seq(0.5, 0.9, by = 0.1)){
    tempResultNorm = modifiedKNN(normCreditData,samplingRate)
    observationsNorm = rbind(observationsNorm,tempResultNorm)
}

Show the data obtained from the above run


In [21]:
observationsNorm


kVal  samplingRateVal  misClassificationRateVal
3 0.5 33.60
4 0.5 30.80
5 0.5 32.20
6 0.5 30.80
7 0.5 31.80
8 0.5 30.20
9 0.5 31.40
10 0.5 30.80
11 0.5 30.20
12 0.5 30.60
13 0.5 30.60
14 0.5 30.20
15 0.5 31.20
16 0.5 30.80
17 0.5 32.20
18 0.5 31.40
19 0.5 30.40
20 0.5 30.60
3 0.6 31.75
4 0.6 29.25
5 0.6 32.00
6 0.6 27.75
7 0.6 28.75
8 0.6 29.25
9 0.6 30.50
10 0.6 27.75
11 0.6 28.75
12 0.6 28.75
13 0.6 29.25
14 0.6 27.75
9 0.8 29.5
10 0.8 29.5
11 0.8 31.0
12 0.8 32.0
13 0.8 31.5
14 0.8 32.0
15 0.8 29.5
16 0.8 29.5
17 0.8 30.0
18 0.8 30.5
19 0.8 30.5
20 0.8 31.5
3 0.9 43.0
4 0.9 33.0
5 0.9 40.0
6 0.9 31.0
7 0.9 35.0
8 0.9 34.0
9 0.9 31.0
10 0.9 34.0
11 0.9 35.0
12 0.9 32.0
13 0.9 32.0
14 0.9 33.0
15 0.9 34.0
16 0.9 31.0
17 0.9 30.0
18 0.9 28.0
19 0.9 26.0
20 0.9 30.0

Plot the mis-classification rate against the k value for the normalized data


In [22]:
plot(observationsNorm$kVal,observationsNorm$misClassificationRateVal, pch=19, xlim=c(0,20), ylim=c(0,60),xlab="K Value", ylab="Mis-classification Rate",main="Plot between K Value and Mis-classification Rate after Normalization(Min-Max)")


Sort the data in ascending order of mis-classification rate


In [23]:
minObservationsNorm = observationsNorm %>% 
    arrange(misClassificationRateVal)

Show the data


In [24]:
minObservationsNorm


kVal  samplingRateVal  misClassificationRateVal
16 0.6 25.75000
19 0.9 26.00000
17 0.6 26.50000
18 0.6 27.00000
19 0.6 27.00000
20 0.6 27.25000
15 0.6 27.50000
6 0.6 27.75000
10 0.6 27.75000
14 0.6 27.75000
18 0.9 28.00000
8 0.8 28.50000
7 0.6 28.75000
11 0.6 28.75000
12 0.6 28.75000
4 0.6 29.25000
8 0.6 29.25000
13 0.6 29.25000
9 0.8 29.50000
10 0.8 29.50000
15 0.8 29.50000
16 0.8 29.50000
6 0.7 30.00000
17 0.8 30.00000
17 0.9 30.00000
20 0.9 30.00000
8 0.5 30.20000
11 0.5 30.20000
14 0.5 30.20000
17 0.7 30.33333
4 0.8 31.50000
13 0.8 31.50000
20 0.8 31.50000
8 0.7 31.66667
3 0.6 31.75000
7 0.5 31.80000
5 0.6 32.00000
15 0.7 32.00000
12 0.8 32.00000
14 0.8 32.00000
12 0.9 32.00000
13 0.9 32.00000
5 0.5 32.20000
17 0.5 32.20000
5 0.7 32.33333
11 0.7 32.33333
7 0.8 33.00000
4 0.9 33.00000
14 0.9 33.00000
3 0.5 33.60000
8 0.9 34.00000
10 0.9 34.00000
15 0.9 34.00000
5 0.8 35.00000
7 0.9 35.00000
11 0.9 35.00000
3 0.8 35.50000
3 0.7 38.00000
5 0.9 40.00000
3 0.9 43.00000

Conclusion

After normalization the lowest mis-classification rate drops from 27.5% to 25.75%, so our hypothesis that normalizing the data would improve the fit holds. As we can see from the above data, the lowest mis-classification rate is now obtained when the sampling rate is 0.6 and the k value is 16.
