DIC LAB 3 Problem 2: KNN algorithm on German Credit Data

Load the libraries


In [1]:
library(ggplot2)
library(dplyr)
library(FNN)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

A function that takes in the data and a sampling rate, runs the KNN algorithm for k = 3 to 20, and returns a data frame containing the sampling rate, the k value, and the resulting mis-classification rate


In [2]:
modifiedKNN = function(creditData, samplingRate){

    observations = data.frame(kVal = double(), samplingRateVal = double(), misClassificationRateVal = double())

    rowCount = nrow(creditData)

    # Sample row indices for the training set; the remaining rows form the test set
    trainingRows = sample(1:rowCount, samplingRate * rowCount, replace = FALSE)
    trainingData = subset(creditData[trainingRows, ], select = c(Age, Job, Credit.amount, Duration))
    trainingLabels = creditData$Credit.Risks[trainingRows]

    testingRows = setdiff(1:rowCount, trainingRows)
    testingData = subset(creditData[testingRows, ], select = c(Age, Job, Credit.amount, Duration))
    testingLabels = creditData$Credit.Risks[testingRows]

    # Run KNN for k = 3..20 and record the mis-classification rate as a percentage
    for(k in 3:20){
        predictedLabels = knn(trainingData, testingData, trainingLabels, k)
        incorrectLabels = sum(predictedLabels != testingLabels)
        misClassificationRate = (incorrectLabels / length(testingRows)) * 100

        tempResult = data.frame(kVal = k, samplingRateVal = samplingRate, misClassificationRateVal = misClassificationRate)
        observations = rbind(observations, tempResult)
    }
    return(observations)
}

Load the German Credit Data


In [3]:
creditData = read.csv(file="dataSets/german_credit_data1.csv", header=TRUE, sep=",")

Show the data


In [4]:
head(creditData)


X.  Age  Sex     Job  Housing  Saving.accounts  Checking.account  Credit.amount  Duration  Purpose              Credit.Risks
0   67   male    2    own      NA               little            1169           6         radio/TV             1
1   22   female  2    own      little           moderate          5951           48        radio/TV             2
2   49   male    1    own      little           NA                2096           12        education            1
3   45   male    2    free     little           little            7882           42        furniture/equipment  1
4   53   male    2    free     little           little            4870           24        car                  2
5   35   male    1    free     NA               NA                9055           36        education            1

Show the summary of the data


In [5]:
summary(creditData)


       X.             Age            Sex           Job        Housing   
 Min.   :  0.0   Min.   :19.00   female:310   Min.   :0.000   free:108  
 1st Qu.:249.8   1st Qu.:27.00   male  :690   1st Qu.:2.000   own :713  
 Median :499.5   Median :33.00                Median :2.000   rent:179  
 Mean   :499.5   Mean   :35.55                Mean   :1.904             
 3rd Qu.:749.2   3rd Qu.:42.00                3rd Qu.:2.000             
 Max.   :999.0   Max.   :75.00                Max.   :3.000             
                                                                        
   Saving.accounts Checking.account Credit.amount      Duration   
 little    :603    little  :274     Min.   :  250   Min.   : 4.0  
 moderate  :103    moderate:269     1st Qu.: 1366   1st Qu.:12.0  
 quite rich: 63    rich    : 63     Median : 2320   Median :18.0  
 rich      : 48    NA's    :394     Mean   : 3271   Mean   :20.9  
 NA's      :183                     3rd Qu.: 3972   3rd Qu.:24.0  
                                    Max.   :18424   Max.   :72.0  
                                                                  
                Purpose     Credit.Risks
 car                :337   Min.   :1.0  
 radio/TV           :280   1st Qu.:1.0  
 furniture/equipment:181   Median :1.0  
 business           : 97   Mean   :1.3  
 education          : 59   3rd Qu.:2.0  
 repairs            : 22   Max.   :2.0  
 (Other)            : 24                

Select only the relevant data fields from the given data


In [6]:
creditData = creditData %>% select(Age,Job,Credit.amount,Duration,Credit.Risks)

Remove all rows that contain NA values in the selected data


In [7]:
creditData = na.omit(creditData)

Show the data


In [8]:
head(creditData)


Age  Job  Credit.amount  Duration  Credit.Risks
67   2    1169           6         1
22   2    5951           48        2
49   1    2096           12        1
45   2    7882           42        1
53   2    4870           24        2
35   1    9055           36        1

Summarize the data again and check whether anything has changed


In [9]:
summary(creditData)


      Age             Job        Credit.amount      Duration     Credit.Risks
 Min.   :19.00   Min.   :0.000   Min.   :  250   Min.   : 4.0   Min.   :1.0  
 1st Qu.:27.00   1st Qu.:2.000   1st Qu.: 1366   1st Qu.:12.0   1st Qu.:1.0  
 Median :33.00   Median :2.000   Median : 2320   Median :18.0   Median :1.0  
 Mean   :35.55   Mean   :1.904   Mean   : 3271   Mean   :20.9   Mean   :1.3  
 3rd Qu.:42.00   3rd Qu.:2.000   3rd Qu.: 3972   3rd Qu.:24.0   3rd Qu.:2.0  
 Max.   :75.00   Max.   :3.000   Max.   :18424   Max.   :72.0   Max.   :2.0  

Set a seed value


In [10]:
set.seed(1234)

This is the main loop: it runs the KNN function for sampling rates from 0.5 to 0.9 and collects the mis-classification rate for every combination of k value and sampling rate


In [11]:
observations = data.frame(kVal = double(),samplingRateVal = double(),misClassificationRateVal = double())

for(samplingRate in seq(0.5, 0.9, by = 0.1)){
    tempResult = modifiedKNN(creditData,samplingRate)
    observations = rbind(observations,tempResult)
}

Print the observations obtained for the different sampling rates and k values


In [12]:
observations


kVal  samplingRateVal  misClassificationRateVal
3 0.5 38.20
4 0.5 34.80
5 0.5 39.00
6 0.5 33.40
7 0.5 35.00
8 0.5 30.60
9 0.5 33.40
10 0.5 31.40
11 0.5 32.40
12 0.5 30.60
13 0.5 30.80
14 0.5 30.20
15 0.5 30.60
16 0.5 29.80
17 0.5 30.40
18 0.5 30.00
19 0.5 31.20
20 0.5 30.40
3 0.6 34.50
4 0.6 30.25
5 0.6 30.75
6 0.6 30.75
7 0.6 30.75
8 0.6 29.75
9 0.6 31.00
10 0.6 30.00
11 0.6 31.25
12 0.6 30.00
13 0.6 30.00
14 0.6 30.25
9 0.8 33.0
10 0.8 30.5
11 0.8 35.5
12 0.8 32.5
13 0.8 35.5
14 0.8 27.5
15 0.8 30.0
16 0.8 27.5
17 0.8 30.5
18 0.8 30.0
19 0.8 30.5
20 0.8 30.5
3 0.9 44.0
4 0.9 34.0
5 0.9 35.0
6 0.9 33.0
7 0.9 33.0
8 0.9 31.0
9 0.9 30.0
10 0.9 29.0
11 0.9 30.0
12 0.9 29.0
13 0.9 28.0
14 0.9 29.0
15 0.9 30.0
16 0.9 31.0
17 0.9 33.0
18 0.9 31.0
19 0.9 33.0
20 0.9 32.0

Plot the mis-classification rate against the k value for all sampling rates


In [13]:
plot(observations$kVal,observations$misClassificationRateVal, pch=19, xlim=c(0,20), ylim=c(0,60),xlab="K Value", ylab="Mis-classification Rate",main="Plot between K Value and Mis-classification Rate")


Arrange the values obtained above in ascending order of mis-classification rate


In [14]:
minObservations = observations %>%
    arrange(misClassificationRateVal)

Show the values


In [15]:
minObservations


kVal  samplingRateVal  misClassificationRateVal
14 0.8 27.50000
16 0.8 27.50000
13 0.9 28.00000
20 0.6 28.75000
10 0.9 29.00000
12 0.9 29.00000
14 0.9 29.00000
20 0.7 29.33333
6 0.8 29.50000
12 0.7 29.66667
8 0.6 29.75000
16 0.5 29.80000
18 0.5 30.00000
10 0.6 30.00000
12 0.6 30.00000
13 0.6 30.00000
18 0.6 30.00000
13 0.7 30.00000
14 0.7 30.00000
15 0.7 30.00000
17 0.7 30.00000
18 0.7 30.00000
15 0.8 30.00000
18 0.8 30.00000
9 0.9 30.00000
11 0.9 30.00000
15 0.9 30.00000
14 0.5 30.20000
4 0.6 30.25000
14 0.6 30.25000
8 0.7 31.66667
15 0.6 32.00000
9 0.7 32.00000
20 0.9 32.00000
11 0.5 32.40000
5 0.8 32.50000
12 0.8 32.50000
7 0.7 32.66667
4 0.7 33.00000
9 0.8 33.00000
6 0.9 33.00000
7 0.9 33.00000
17 0.9 33.00000
19 0.9 33.00000
6 0.5 33.40000
9 0.5 33.40000
5 0.7 34.00000
4 0.9 34.00000
3 0.6 34.50000
3 0.7 34.66667
4 0.5 34.80000
7 0.5 35.00000
5 0.9 35.00000
7 0.8 35.50000
11 0.8 35.50000
13 0.8 35.50000
3 0.8 36.00000
3 0.5 38.20000
5 0.5 39.00000
3 0.9 44.00000

Conclusion

As we can see from the above data, the lowest mis-classification rate (27.5%) is obtained when the sampling rate is 0.8 and the k value is 14 or 16.
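Reading the best setting off the sorted table can also be done programmatically. A minimal sketch with dplyr, using a small hand-made frame shaped like `observations` (the values are copied from the table above): `filter` keeps every row that attains the minimum rate, so ties such as k = 14 and k = 16 are both reported.

```r
library(dplyr)

# A few rows shaped like the observations frame above (values copied from the sorted table)
observations <- data.frame(
    kVal = c(14, 16, 13, 20),
    samplingRateVal = c(0.8, 0.8, 0.9, 0.6),
    misClassificationRateVal = c(27.5, 27.5, 28.0, 28.75)
)

# Keep every row that attains the minimum mis-classification rate
bestObservations <- observations %>%
    filter(misClassificationRateVal == min(misClassificationRateVal))

bestObservations
# both k = 14 and k = 16 at sampling rate 0.8 are returned
```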

Normalization

Here we apply min-max normalization, which rescales each feature to the [0, 1] range via x' = (x - min(x)) / (max(x) - min(x)). As the summary above shows, Credit.amount spans a far larger range than Job, so the distances computed by KNN are dominated by Credit.amount. Rescaling the features may therefore give a better fit.

Define a Min Max normalization function


In [16]:
minMaxNormalization <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}
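A quick sanity check of the function on a toy vector (the numbers here are arbitrary, not from the data set): the minimum should map to 0, the maximum to 1, and everything else should fall in between.

```r
minMaxNormalization <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}

# Toy vector: the smallest value maps to 0, the largest to 1
normalized <- minMaxNormalization(c(250, 2320, 18424))
normalized
```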

Normalize the data except the Credit Risk data since those are our labels


In [17]:
normCreditData = as.data.frame(lapply(creditData[,1:4], minMaxNormalization))
normCreditData$Credit.Risks = creditData$Credit.Risks

Show the data after normalization


In [18]:
head(normCreditData)


Age         Job        Credit.amount  Duration    Credit.Risks
0.85714286  0.6666667  0.05056674     0.02941176  1
0.05357143  0.6666667  0.31368989     0.64705882  2
0.53571429  0.3333333  0.10157368     0.11764706  1
0.46428571  0.6666667  0.41994057     0.55882353  1
0.60714286  0.6666667  0.25420931     0.29411765  2
0.28571429  0.3333333  0.48448333     0.47058824  1

Set the seed value


In [19]:
set.seed(2137)

This is the main loop again: it runs the KNN function on the normalized data for sampling rates from 0.5 to 0.9 and collects the mis-classification rate for every combination of k value and sampling rate


In [20]:
observationsNorm = data.frame(kVal = double(),samplingRateVal = double(),misClassificationRateVal = double())

for(samplingRate in seq(0.5, 0.9, by = 0.1)){
    tempResultNorm = modifiedKNN(normCreditData,samplingRate)
    observationsNorm = rbind(observationsNorm,tempResultNorm)
}

Show the data obtained from the above run


In [21]:
observationsNorm


kVal  samplingRateVal  misClassificationRateVal
3 0.5 33.60
4 0.5 30.80
5 0.5 32.20
6 0.5 30.80
7 0.5 31.80
8 0.5 30.20
9 0.5 31.40
10 0.5 30.80
11 0.5 30.20
12 0.5 30.60
13 0.5 30.60
14 0.5 30.20
15 0.5 31.20
16 0.5 30.80
17 0.5 32.20
18 0.5 31.40
19 0.5 30.40
20 0.5 30.60
3 0.6 31.75
4 0.6 29.25
5 0.6 32.00
6 0.6 27.75
7 0.6 28.75
8 0.6 29.25
9 0.6 30.50
10 0.6 27.75
11 0.6 28.75
12 0.6 28.75
13 0.6 29.25
14 0.6 27.75
9 0.8 29.5
10 0.8 29.5
11 0.8 31.0
12 0.8 32.0
13 0.8 31.5
14 0.8 32.0
15 0.8 29.5
16 0.8 29.5
17 0.8 30.0
18 0.8 30.5
19 0.8 30.5
20 0.8 31.5
3 0.9 43.0
4 0.9 33.0
5 0.9 40.0
6 0.9 31.0
7 0.9 35.0
8 0.9 34.0
9 0.9 31.0
10 0.9 34.0
11 0.9 35.0
12 0.9 32.0
13 0.9 32.0
14 0.9 33.0
15 0.9 34.0
16 0.9 31.0
17 0.9 30.0
18 0.9 28.0
19 0.9 26.0
20 0.9 30.0

Plot the mis-classification rate against the k value for the normalized data


In [22]:
plot(observationsNorm$kVal,observationsNorm$misClassificationRateVal, pch=19, xlim=c(0,20), ylim=c(0,60),xlab="K Value", ylab="Mis-classification Rate",main="Plot between K Value and Mis-classification Rate after Normalization(Min-Max)")


Sort the data in ascending order of mis-classification rate


In [23]:
minObservationsNorm = observationsNorm %>% 
    arrange(misClassificationRateVal)

Show the data


In [24]:
minObservationsNorm


kVal  samplingRateVal  misClassificationRateVal
16 0.6 25.75000
19 0.9 26.00000
17 0.6 26.50000
18 0.6 27.00000
19 0.6 27.00000
20 0.6 27.25000
15 0.6 27.50000
6 0.6 27.75000
10 0.6 27.75000
14 0.6 27.75000
18 0.9 28.00000
8 0.8 28.50000
7 0.6 28.75000
11 0.6 28.75000
12 0.6 28.75000
4 0.6 29.25000
8 0.6 29.25000
13 0.6 29.25000
9 0.8 29.50000
10 0.8 29.50000
15 0.8 29.50000
16 0.8 29.50000
6 0.7 30.00000
17 0.8 30.00000
17 0.9 30.00000
20 0.9 30.00000
8 0.5 30.20000
11 0.5 30.20000
14 0.5 30.20000
17 0.7 30.33333
4 0.8 31.50000
13 0.8 31.50000
20 0.8 31.50000
8 0.7 31.66667
3 0.6 31.75000
7 0.5 31.80000
5 0.6 32.00000
15 0.7 32.00000
12 0.8 32.00000
14 0.8 32.00000
12 0.9 32.00000
13 0.9 32.00000
5 0.5 32.20000
17 0.5 32.20000
5 0.7 32.33333
11 0.7 32.33333
7 0.8 33.00000
4 0.9 33.00000
14 0.9 33.00000
3 0.5 33.60000
8 0.9 34.00000
10 0.9 34.00000
15 0.9 34.00000
5 0.8 35.00000
7 0.9 35.00000
11 0.9 35.00000
3 0.8 35.50000
3 0.7 38.00000
5 0.9 40.00000
3 0.9 43.00000

Conclusion

After normalization the lowest mis-classification rate drops from 27.5% to 25.75%, so our hypothesis that normalizing the data would improve the fit holds. As we can see from the above data, the lowest mis-classification rate is now obtained when the sampling rate is 0.6 and the k value is 16.
