This is one of the more popular sites for online datasets: http://archive.ics.uci.edu/ml/machine-learning-databases/ We are pulling in the iris dataset.


In [1]:
flowers <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
flowers


Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generated
    X5.1 X3.5 X1.4 X0.2 Iris.setosa
  1  4.9    3  1.4  0.2 Iris-setosa
  2  4.7  3.2  1.3  0.2 Iris-setosa
  3  4.6  3.1  1.5  0.2 Iris-setosa
  4    5  3.6  1.4  0.2 Iris-setosa
  5  5.4  3.9  1.7  0.4 Iris-setosa
  6  4.6  3.4  1.4  0.3 Iris-setosa
  7    5  3.4  1.5  0.2 Iris-setosa
  8  4.4  2.9  1.4  0.2 Iris-setosa
  9  4.9  3.1  1.5  0.1 Iris-setosa
 10  5.4  3.7  1.5  0.2 Iris-setosa
 11  4.8  3.4  1.6  0.2 Iris-setosa
 12  4.8    3  1.4  0.1 Iris-setosa
 13  4.3    3  1.1  0.1 Iris-setosa
 14  5.8    4  1.2  0.2 Iris-setosa
 15  5.7  4.4  1.5  0.4 Iris-setosa
 16  5.4  3.9  1.3  0.4 Iris-setosa
 17  5.1  3.5  1.4  0.3 Iris-setosa
 18  5.7  3.8  1.7  0.3 Iris-setosa
 19  5.1  3.8  1.5  0.3 Iris-setosa
 20  5.4  3.4  1.7  0.2 Iris-setosa
 21  5.1  3.7  1.5  0.4 Iris-setosa
 22  4.6  3.6    1  0.2 Iris-setosa
 23  5.1  3.3  1.7  0.5 Iris-setosa
 24  4.8  3.4  1.9  0.2 Iris-setosa
 25    5    3  1.6  0.2 Iris-setosa
 26    5  3.4  1.6  0.4 Iris-setosa
 27  5.2  3.5  1.5  0.2 Iris-setosa
 28  5.2  3.4  1.4  0.2 Iris-setosa
 29  4.7  3.2  1.6  0.2 Iris-setosa
 30  4.8  3.1  1.6  0.2 Iris-setosa
  ⋮    ⋮    ⋮    ⋮    ⋮ NA
120  6.9  3.2  5.7  2.3 Iris-virginica
121  5.6  2.8  4.9    2 Iris-virginica
122  7.7  2.8  6.7    2 Iris-virginica
123  6.3  2.7  4.9  1.8 Iris-virginica
124  6.7  3.3  5.7  2.1 Iris-virginica
125  7.2  3.2    6  1.8 Iris-virginica
126  6.2  2.8  4.8  1.8 Iris-virginica
127  6.1    3  4.9  1.8 Iris-virginica
128  6.4  2.8  5.6  2.1 Iris-virginica
129  7.2    3  5.8  1.6 Iris-virginica
130  7.4  2.8  6.1  1.9 Iris-virginica
131  7.9  3.8  6.4    2 Iris-virginica
132  6.4  2.8  5.6  2.2 Iris-virginica
133  6.3  2.8  5.1  1.5 Iris-virginica
134  6.1  2.6  5.6  1.4 Iris-virginica
135  7.7    3  6.1  2.3 Iris-virginica
136  6.3  3.4  5.6  2.4 Iris-virginica
137  6.4  3.1  5.5  1.8 Iris-virginica
138    6    3  4.8  1.8 Iris-virginica
139  6.9  3.1  5.4  2.1 Iris-virginica
140  6.7  3.1  5.6  2.4 Iris-virginica
141  6.9  3.1  5.1  2.3 Iris-virginica
142  5.8  2.7  5.1  1.9 Iris-virginica
143  6.8  3.2  5.9  2.3 Iris-virginica
144  6.7  3.3  5.7  2.5 Iris-virginica
145  6.7    3  5.2  2.3 Iris-virginica
146  6.3  2.5    5  1.9 Iris-virginica
147  6.5    3  5.2    2 Iris-virginica
148  6.2  3.4  5.4  2.3 Iris-virginica
149  5.9    3  5.1  1.8 Iris-virginica
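
The column names in the output above (X5.1, X3.5, and so on) suggest that read.csv treated the first record as a header, which would also explain why 149 rows are shown. A minimal sketch of one way to avoid that, using a two-line inline sample instead of the real download; the names sample_text, with_default and with_names are made up for this illustration:

```r
# Two sample records in the same comma-separated layout as iris.data.
# (Inline sample only; this sketch does not download the real file.)
sample_text <- "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Default header=TRUE: the first record is consumed as the column names.
with_default <- read.csv(text = sample_text)

# header=FALSE keeps every record as data; col.names labels the columns.
with_names <- read.csv(text = sample_text, header = FALSE,
                       col.names = c("F1", "F2", "F3", "F4", "Label"))

names(with_default)  # first record turned into names such as X5.1
nrow(with_default)   # 1
nrow(with_names)     # 2
```

The rest of this notebook keeps the original call, so the outputs below still reflect the 149-row version.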

By default, read.csv skips blank lines; setting blank.lines.skip=FALSE keeps them. Notice that there are now 150 rows: the blank line at the end of the file comes back as a final row of NA values.


In [2]:
flowers <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",blank.lines.skip=FALSE)
flowers


Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generated
    X5.1 X3.5 X1.4 X0.2 Iris.setosa
  1  4.9    3  1.4  0.2 Iris-setosa
  2  4.7  3.2  1.3  0.2 Iris-setosa
  3  4.6  3.1  1.5  0.2 Iris-setosa
  4    5  3.6  1.4  0.2 Iris-setosa
  5  5.4  3.9  1.7  0.4 Iris-setosa
  6  4.6  3.4  1.4  0.3 Iris-setosa
  7    5  3.4  1.5  0.2 Iris-setosa
  8  4.4  2.9  1.4  0.2 Iris-setosa
  9  4.9  3.1  1.5  0.1 Iris-setosa
 10  5.4  3.7  1.5  0.2 Iris-setosa
 11  4.8  3.4  1.6  0.2 Iris-setosa
 12  4.8    3  1.4  0.1 Iris-setosa
 13  4.3    3  1.1  0.1 Iris-setosa
 14  5.8    4  1.2  0.2 Iris-setosa
 15  5.7  4.4  1.5  0.4 Iris-setosa
 16  5.4  3.9  1.3  0.4 Iris-setosa
 17  5.1  3.5  1.4  0.3 Iris-setosa
 18  5.7  3.8  1.7  0.3 Iris-setosa
 19  5.1  3.8  1.5  0.3 Iris-setosa
 20  5.4  3.4  1.7  0.2 Iris-setosa
 21  5.1  3.7  1.5  0.4 Iris-setosa
 22  4.6  3.6    1  0.2 Iris-setosa
 23  5.1  3.3  1.7  0.5 Iris-setosa
 24  4.8  3.4  1.9  0.2 Iris-setosa
 25    5    3  1.6  0.2 Iris-setosa
 26    5  3.4  1.6  0.4 Iris-setosa
 27  5.2  3.5  1.5  0.2 Iris-setosa
 28  5.2  3.4  1.4  0.2 Iris-setosa
 29  4.7  3.2  1.6  0.2 Iris-setosa
 30  4.8  3.1  1.6  0.2 Iris-setosa
  ⋮    ⋮    ⋮    ⋮    ⋮ NA
121  5.6  2.8  4.9    2 Iris-virginica
122  7.7  2.8  6.7    2 Iris-virginica
123  6.3  2.7  4.9  1.8 Iris-virginica
124  6.7  3.3  5.7  2.1 Iris-virginica
125  7.2  3.2    6  1.8 Iris-virginica
126  6.2  2.8  4.8  1.8 Iris-virginica
127  6.1    3  4.9  1.8 Iris-virginica
128  6.4  2.8  5.6  2.1 Iris-virginica
129  7.2    3  5.8  1.6 Iris-virginica
130  7.4  2.8  6.1  1.9 Iris-virginica
131  7.9  3.8  6.4    2 Iris-virginica
132  6.4  2.8  5.6  2.2 Iris-virginica
133  6.3  2.8  5.1  1.5 Iris-virginica
134  6.1  2.6  5.6  1.4 Iris-virginica
135  7.7    3  6.1  2.3 Iris-virginica
136  6.3  3.4  5.6  2.4 Iris-virginica
137  6.4  3.1  5.5  1.8 Iris-virginica
138    6    3  4.8  1.8 Iris-virginica
139  6.9  3.1  5.4  2.1 Iris-virginica
140  6.7  3.1  5.6  2.4 Iris-virginica
141  6.9  3.1  5.1  2.3 Iris-virginica
142  5.8  2.7  5.1  1.9 Iris-virginica
143  6.8  3.2  5.9  2.3 Iris-virginica
144  6.7  3.3  5.7  2.5 Iris-virginica
145  6.7    3  5.2  2.3 Iris-virginica
146  6.3  2.5    5  1.9 Iris-virginica
147  6.5    3  5.2    2 Iris-virginica
148  6.2  3.4  5.4  2.3 Iris-virginica
149  5.9    3  5.1  1.8 Iris-virginica
150   NA   NA   NA   NA

That last row carries no data, so let's remove it explicitly with na.omit.


In [3]:
flowers <- na.omit(flowers)
flowers


Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generated
    X5.1 X3.5 X1.4 X0.2 Iris.setosa
  1  4.9    3  1.4  0.2 Iris-setosa
  2  4.7  3.2  1.3  0.2 Iris-setosa
  3  4.6  3.1  1.5  0.2 Iris-setosa
  4    5  3.6  1.4  0.2 Iris-setosa
  5  5.4  3.9  1.7  0.4 Iris-setosa
  6  4.6  3.4  1.4  0.3 Iris-setosa
  7    5  3.4  1.5  0.2 Iris-setosa
  8  4.4  2.9  1.4  0.2 Iris-setosa
  9  4.9  3.1  1.5  0.1 Iris-setosa
 10  5.4  3.7  1.5  0.2 Iris-setosa
 11  4.8  3.4  1.6  0.2 Iris-setosa
 12  4.8    3  1.4  0.1 Iris-setosa
 13  4.3    3  1.1  0.1 Iris-setosa
 14  5.8    4  1.2  0.2 Iris-setosa
 15  5.7  4.4  1.5  0.4 Iris-setosa
 16  5.4  3.9  1.3  0.4 Iris-setosa
 17  5.1  3.5  1.4  0.3 Iris-setosa
 18  5.7  3.8  1.7  0.3 Iris-setosa
 19  5.1  3.8  1.5  0.3 Iris-setosa
 20  5.4  3.4  1.7  0.2 Iris-setosa
 21  5.1  3.7  1.5  0.4 Iris-setosa
 22  4.6  3.6    1  0.2 Iris-setosa
 23  5.1  3.3  1.7  0.5 Iris-setosa
 24  4.8  3.4  1.9  0.2 Iris-setosa
 25    5    3  1.6  0.2 Iris-setosa
 26    5  3.4  1.6  0.4 Iris-setosa
 27  5.2  3.5  1.5  0.2 Iris-setosa
 28  5.2  3.4  1.4  0.2 Iris-setosa
 29  4.7  3.2  1.6  0.2 Iris-setosa
 30  4.8  3.1  1.6  0.2 Iris-setosa
  ⋮    ⋮    ⋮    ⋮    ⋮ NA
120  6.9  3.2  5.7  2.3 Iris-virginica
121  5.6  2.8  4.9    2 Iris-virginica
122  7.7  2.8  6.7    2 Iris-virginica
123  6.3  2.7  4.9  1.8 Iris-virginica
124  6.7  3.3  5.7  2.1 Iris-virginica
125  7.2  3.2    6  1.8 Iris-virginica
126  6.2  2.8  4.8  1.8 Iris-virginica
127  6.1    3  4.9  1.8 Iris-virginica
128  6.4  2.8  5.6  2.1 Iris-virginica
129  7.2    3  5.8  1.6 Iris-virginica
130  7.4  2.8  6.1  1.9 Iris-virginica
131  7.9  3.8  6.4    2 Iris-virginica
132  6.4  2.8  5.6  2.2 Iris-virginica
133  6.3  2.8  5.1  1.5 Iris-virginica
134  6.1  2.6  5.6  1.4 Iris-virginica
135  7.7    3  6.1  2.3 Iris-virginica
136  6.3  3.4  5.6  2.4 Iris-virginica
137  6.4  3.1  5.5  1.8 Iris-virginica
138    6    3  4.8  1.8 Iris-virginica
139  6.9  3.1  5.4  2.1 Iris-virginica
140  6.7  3.1  5.6  2.4 Iris-virginica
141  6.9  3.1  5.1  2.3 Iris-virginica
142  5.8  2.7  5.1  1.9 Iris-virginica
143  6.8  3.2  5.9  2.3 Iris-virginica
144  6.7  3.3  5.7  2.5 Iris-virginica
145  6.7    3  5.2  2.3 Iris-virginica
146  6.3  2.5    5  1.9 Iris-virginica
147  6.5    3  5.2    2 Iris-virginica
148  6.2  3.4  5.4  2.3 Iris-virginica
149  5.9    3  5.1  1.8 Iris-virginica
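
The blank-line and na.omit steps above can be tried on a small inline sample without downloading anything. This is only a sketch: the three-line sample text, the header=FALSE setting, the column names and the variable names raw and cleaned are all assumptions made for the illustration, not part of the original notebook.

```r
# Inline sample with a blank line between two records.
sample_text <- "5.1,3.5,1.4,0.2,Iris-setosa\n\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# blank.lines.skip=FALSE keeps the blank line as a row; its numeric
# fields are read as NA.
raw <- read.csv(text = sample_text, header = FALSE,
                blank.lines.skip = FALSE,
                col.names = c("F1", "F2", "F3", "F4", "Label"))

# na.omit drops every row that has an NA in any column.
cleaned <- na.omit(raw)

nrow(raw)      # 3: two records plus the blank line
nrow(cleaned)  # 2: the blank row is gone
```

Note that na.omit removes any row with a missing value, not just fully blank ones, so on a dataset with genuine gaps it may discard more than intended.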

Let's rename the columns.


In [4]:
colnames(flowers) <- c("F1", "F2", "F3", "F4", "Label")
summary(flowers)


       F1              F2              F3              F4       
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.400   Median :1.300  
 Mean   :5.848   Mean   :3.051   Mean   :3.774   Mean   :1.205  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
             Label   
                : 0  
 Iris-setosa    :49  
 Iris-versicolor:50  
 Iris-virginica :50  
                     
                     

We don't know anything about this dataset, so let's run k-means clustering to look for groups in it (much as we would eyeball a scatterplot).


In [5]:
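# Randomly pick 60% of the row indexes (no seed is set, so the split changes on every run)
indexes <- sample(1:nrow(flowers), size=0.6*nrow(flowers))
# Note: the negative index keeps the rows that were NOT sampled, so "train" is the remaining 40%
flowers.train <- flowers[-indexes,]
flowers.test <- flowers[indexes,]
# Cluster the four numeric columns into 5 groups
fit <- kmeans(flowers.train[,1:4],5)
fit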


K-means clustering with 5 clusters of sizes 4, 11, 20, 14, 11

Cluster means:
        F1       F2       F3        F4
1 7.700000 3.050000 6.600000 2.2000000
2 4.790909 3.127273 1.390909 0.2181818
3 6.485000 2.940000 5.285000 1.8900000
4 5.578571 2.700000 4.050000 1.2642857
5 5.272727 3.690909 1.527273 0.3181818

Clustering vector:
  1   3   4   5   6  10  11  14  16  21  22  23  25  27  31  35  37  41  44  45 
  2   2   5   5   2   5   2   5   5   5   2   5   2   5   5   2   2   2   5   2 
 46  49  50  53  54  58  59  61  66  73  77  79  81  82  83  84  89  90  94  95 
  5   2   3   4   3   3   4   4   4   4   3   4   4   4   3   4   4   4   4   4 
 98 108 111 112 113 116 117 118 120 122 123 124 126 132 134 135 136 139 143 148 
  4   3   3   3   3   3   1   1   3   1   3   3   3   3   3   1   3   3   3   3 

Within cluster sum of squares by cluster:
[1] 1.250000 1.836364 9.137000 4.570714 1.549091
 (between_SS / total_SS =  93.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
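
The run above asks for 5 clusters, while the summary earlier shows three labels. One way to check how well clusters line up with known labels is a cross-tabulation. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses R's built-in iris data frame (whose column names differ from the F1..F4/Label names used here), an arbitrary seed, 3 centers, and the made-up names features, fit3 and agreement.

```r
# Assumption: use R's built-in `iris` data frame (datasets package) so the
# sketch runs without a download.
set.seed(42)  # arbitrary seed so the random starts are repeatable

features <- iris[, 1:4]

# Three centers, one per species; nstart tries several random starts and
# keeps the best result.
fit3 <- kmeans(features, centers = 3, nstart = 20)

# Cross-tabulate cluster assignments against the known species labels.
agreement <- table(cluster = fit3$cluster, species = iris$Species)
agreement
```

Each row of the table is a cluster and each column a species, so a cluster that maps cleanly onto one species has a single large count in its row. Cluster numbers are arbitrary, so they need not follow the order of the species.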

Let's see what it looks like in graphical form.


In [6]:
plot(flowers.train[c("F1", "F2")], col=fit$cluster)
# One color per center (5 here) so each center matches the color of its cluster's points
points(fit$centers[,c("F1", "F2")], col=1:5, pch=8, cex=2)


