by:
Akinwande Atanda | University of Canterbury | New Zealand
Exploratory data analysis (EDA) is a statistical approach for examining data sets by summarising their key features and characteristics through visualization and descriptive statistics. An EDA might also include hypothesis testing and modelling.
One of the advantages of EDA techniques in the field of data mining and big data analytics is that they support the selection of appropriate statistical tools for fitting a dataset. The process of conducting EDA to make informed business or operational decisions is shown in the figure below:
The Iris data set is used to perform EDA. The analyses covered here include the following:
In [41]:
library(dplyr)
In [2]:
dim(iris) #Dimension (R x C) of the Dataset
head(iris,10)
List of unique species in the dataset
In [3]:
distinct(iris, Species)
In [4]:
summary(iris)
In [5]:
flowers <- group_by(iris, Species)
Average Sepal and Petal Length and Width for all the species
In [6]:
summarise(iris, Avg.SL= mean(Sepal.Length), Avg.SW = mean(Sepal.Width), Avg.PL=mean(Petal.Length), Avg.PW =mean(Petal.Width))
Variation of Sepal and Petal Length and Width across all species
In [7]:
summarise(iris, sd.SL= sd(Sepal.Length), sd.SW = sd(Sepal.Width), sd.PL=sd(Petal.Length), sd.PW =sd(Petal.Width))
Average Sepal and Petal Length and Width by Species
In [8]:
Avg.Features.Species <- summarise(flowers, count = n(), Avg.SL= mean(Sepal.Length), Avg.SW = mean(Sepal.Width), Avg.PL=mean(Petal.Length), Avg.PW =mean(Petal.Width))
In [9]:
Avg.Features.Species
Variation of Sepal and Petal Length and Width by Species
In [10]:
Features.Variation.Species <- summarise(flowers, count = n(), sd.SL= sd(Sepal.Length), sd.SW = sd(Sepal.Width), sd.PL=sd(Petal.Length), sd.PW =sd(Petal.Width))
In [11]:
Features.Variation.Species
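The per-species means and standard deviations above can also be cross-checked without dplyr. The following is a minimal sketch using base R's aggregate(); the names measurements, species_means and species_sds are illustrative and not part of the original analysis.

```r
# Select the four measurement columns by name, so the result is unaffected
# by any extra columns added to iris later in the notebook
measurements <- iris[, c("Sepal.Length", "Sepal.Width",
                         "Petal.Length", "Petal.Width")]

# One row per species, one column per measurement
species_means <- aggregate(measurements, by = list(Species = iris$Species), FUN = mean)
species_sds   <- aggregate(measurements, by = list(Species = iris$Species), FUN = sd)

species_means
species_sds
```

Agreement between these tables and the dplyr output is a quick sanity check that the grouping was applied as intended.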
Between species summary analysis
The following compares the sepal length of the flowers by species based on their median, first quartile, third quartile, maximum and minimum values.
The box plot indicates that the typical sepal length differs across the flower species. Setosa has the lowest median and the shortest sepal, while virginica has the highest median and the longest sepal.
In [12]:
boxplot(iris$Sepal.Length ~ iris$Species, ylab="Sepal Length(cm)", col="orange")
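The quantities that a box plot draws can also be tabulated directly. This is a small sketch using base R's fivenum(), which returns the minimum, lower hinge, median, upper hinge and maximum; the name sepal_summary is illustrative. Note that boxplot() draws points beyond 1.5 times the box length as outliers, so its whiskers need not reach the raw minimum and maximum.

```r
# Five-number summary of sepal length, one column per species
sepal_summary <- sapply(split(iris$Sepal.Length, iris$Species), fivenum)
rownames(sepal_summary) <- c("min", "lower.hinge", "median", "upper.hinge", "max")

sepal_summary
```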
In [13]:
plot(iris$Sepal.Length, iris$Sepal.Width,
xlab="Sepal Length(cm)", ylab="Sepal Width(cm)",
xlim=c(0,10),ylim=c(0,5), col=as.numeric(iris$Species))
legend(2,2,pch=16,col=1:3,c("setosa", "versicolor", "virginica"))
text(5,5,"Cross-Variation of Different Species by Length and Width of Sepal")
As shown in the plot below of petal length by species, setosa clearly has the shortest petals among the examined flowers. Setosa therefore has both the shortest sepals and the shortest petals, while virginica has the longest.
In [14]:
boxplot(iris$Petal.Length ~ iris$Species, ylab="Petal Length(cm)", col=3)
In [15]:
plot(iris$Petal.Length, iris$Petal.Width,
xlab="Petal Length(cm)", ylab="Petal Width(cm)",
xlim=c(0,10),ylim=c(0,5), col=as.numeric(iris$Species))
legend(7,2,pch=16,col=1:3,c("setosa", "versicolor", "virginica"))
text(5,5,"Cross-Variation of Different Species by Length and Width of Petal")
In [16]:
library(ggplot2)
In [17]:
countsp <- summarise(iris, count=n())
In [18]:
# Refer to columns by name inside aes(); geom_count() sizes each point by the number of observations at that location
ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_count(alpha = 1/2) + geom_smooth() + scale_size_area()
The fitted scatter plot above shows the wide variation in petal length relative to petal width across the flower species. Some species, such as setosa, cluster more tightly around their average than the others.
In [19]:
Species.Ver.Vir <- filter(iris, Species != 'setosa')
In [20]:
head(Species.Ver.Vir,10)
In [21]:
hist(Species.Ver.Vir$Sepal.Length, col='orange', main='')
In [22]:
t.test(Sepal.Length ~ Species, data=Species.Ver.Vir)
The test provides strong evidence of a statistically significant difference between the average sepal length of the versicolor and virginica species.
In [23]:
t.test(Sepal.Length ~ sample(Species), data=Species.Ver.Vir)
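Here sample(Species) shuffles the species labels rather than drawing a new sample of the data. Shuffling breaks any link between species and sepal length, so the re-test typically shows no significant difference. This does not contradict the previous finding: it shows what the test result looks like when the labels carry no information, that is, when the null hypothesis of no difference between versicolor and virginica holds.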
In [24]:
VerVir.t <- numeric(10000)
VerVir.pvalue <- numeric(10000)
In [25]:
VVtest <- t.test(Sepal.Length ~ sample(Species), data=Species.Ver.Vir)
names(VVtest)
In [26]:
for (i in 1:10000){
VV <- t.test(Sepal.Length ~ sample(Species), data=Species.Ver.Vir)
VerVir.t[i] <- VV$statistic
}
In [27]:
summary(VerVir.t)
In [28]:
hist(VerVir.t, col='orange', main='')
In [29]:
for (i in 1:10000){
VV <- t.test(Sepal.Length ~ sample(Species), data=Species.Ver.Vir)
VerVir.pvalue[i] <- VV$p.value
}
summary(VerVir.pvalue)
In [30]:
hist(VerVir.pvalue, col='darkgreen', main='')
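The simulation gives the distribution of the t statistic and the p-value when the species labels are shuffled, that is, under the null hypothesis of no difference between the average sepal length of versicolor and virginica. The shuffled t statistics centre on zero and the p-values are spread roughly evenly between 0 and 1. An observed t statistic that falls far out in the tails of this distribution is evidence of a real difference, so the simulation serves as a reference for the original test rather than as evidence against it.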
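One way to use the shuffled-label simulation is as a permutation test: the t statistic from the unshuffled data is compared with the statistics obtained after shuffling. The sketch below is illustrative only. The seed, the number of permutations (2000) and the names ver_vir, observed_t, perm_t and perm_p are assumptions made for this example, not part of the original analysis.

```r
set.seed(1)  # assumed seed, for reproducibility only

# Keep versicolor and virginica only
ver_vir <- droplevels(subset(iris, Species != "setosa"))

# t statistic from the real (unshuffled) labels
observed_t <- unname(t.test(Sepal.Length ~ Species, data = ver_vir)$statistic)

# t statistics after shuffling the labels, i.e. under the null hypothesis
n_perm <- 2000
perm_t <- replicate(n_perm, {
  shuffled <- sample(ver_vir$Species)
  unname(t.test(ver_vir$Sepal.Length ~ shuffled)$statistic)
})

# Two-sided permutation p-value: share of shuffled statistics at least as
# extreme as the observed one (the +1 avoids reporting exactly zero)
perm_p <- (sum(abs(perm_t) >= abs(observed_t)) + 1) / (n_perm + 1)
perm_p
```

A small perm_p means that shuffled labels rarely produce a statistic as extreme as the observed one, which points towards a genuine difference between the two species.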
In [31]:
iris$Length.Ratio <- iris$Sepal.Length/iris$Petal.Length
head(iris,5)
In [32]:
summarise(group_by(iris, Species), Avg.Length.Ratio = mean(Length.Ratio), Sd.Length.Ratio = sd(Length.Ratio), Var.Length.Ratio = var(Length.Ratio), count=n())
In [33]:
Avg.Length.Ratio <- tapply(iris$Length.Ratio, iris$Species, mean)
Sd.Length.Ratio <- tapply(iris$Length.Ratio, iris$Species, sd)
In [34]:
Lower.CI.Baseline <- round(Avg.Length.Ratio - 1.96*Sd.Length.Ratio/sqrt(50),2)
Upper.CI.Baseline <- round(Avg.Length.Ratio + 1.96*Sd.Length.Ratio/sqrt(50),2)
In [35]:
Lower.CI.Baseline; Upper.CI.Baseline
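The baseline interval above is the normal-approximation 95% interval, mean ± 1.96 × sd / sqrt(n), with n = 50 flowers per species. A minimal sketch of the same calculation as a reusable helper follows; the function name mean_ci and the variable names are hypothetical, and n is taken from the data rather than hard-coded.

```r
# Normal-approximation interval for a mean: mean +/- z * sd / sqrt(n)
mean_ci <- function(x, z = 1.96) {
  se <- sd(x) / sqrt(length(x))
  c(lower = mean(x) - z * se, mean = mean(x), upper = mean(x) + z * se)
}

# Sepal-to-petal length ratio, computed directly from the original columns
ratio <- iris$Sepal.Length / iris$Petal.Length

# One row per species: lower bound, mean, upper bound
ci_by_species <- t(sapply(split(ratio, iris$Species), mean_ci))
round(ci_by_species, 2)
```

Taking n from length(x) keeps the calculation valid if the groups ever have unequal sizes.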
Next, bootstrapping is used to check the baseline confidence interval: resampling the data with replacement gives an empirical distribution of the mean length ratio, which shows whether the baseline mean lies within the range shown above. For instance, it checks whether the sample is adequate to establish that the setosa mean length ratio (3.46) lies between 3.35 and 3.58.
In [36]:
Setosa.Sepal.Petal.Length.Boostrap <- numeric(10000)
In [37]:
for (i in 1:10000){
Setosa.Sepal.Petal.Length.Boostrap[i] <- mean(sample(iris$Length.Ratio[iris$Species=='setosa'],50,replace=TRUE))
}
In [38]:
head(Setosa.Sepal.Petal.Length.Boostrap,5)
In [39]:
hist(Setosa.Sepal.Petal.Length.Boostrap, col=3, main='')
In [40]:
quantile(Setosa.Sepal.Petal.Length.Boostrap,c(.025,.975)) # 2.5% and 97.5% quantiles give the 95% percentile interval
The bootstrap result is consistent with the baseline estimate: the sample supports an average setosa sepal-to-petal length ratio of about 3.46.
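To make the comparison with the baseline interval explicit, the bootstrap percentile interval and the normal-approximation interval can be printed side by side. This is a sketch under stated assumptions: the seed and the names setosa_ratio, boot_means, boot_ci and normal_ci are illustrative, and replicate() is used in place of the explicit loop.

```r
set.seed(42)  # assumed seed, for reproducibility only

setosa <- iris[iris$Species == "setosa", ]
setosa_ratio <- setosa$Sepal.Length / setosa$Petal.Length

# Bootstrap: resample with replacement and record the mean each time
boot_means <- replicate(10000, mean(sample(setosa_ratio, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# Normal-approximation interval, as in the baseline calculation
se <- sd(setosa_ratio) / sqrt(length(setosa_ratio))
normal_ci <- mean(setosa_ratio) + c(-1.96, 1.96) * se

round(rbind(bootstrap = unname(boot_ci), normal = normal_ci), 2)
```

If the two rows are close, the normal approximation used for the baseline interval is reasonable for this sample.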