For the final project, you will conduct your own exploratory data analysis and create an RMD file that explores the variables, structure, patterns, oddities, and underlying relationships of a data set of your choice.
The analysis should be almost like a stream-of-consciousness as you ask questions, create visualizations, and explore your data.
This project is open-ended in that we are not looking for one right answer. As John Tukey stated, "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." We want you to ask interesting questions about data and give you a chance to explore. We will provide some options of data sets to explore; however, you may choose to explore an entirely different data set. You should be aware that finding your own data set and cleaning that data set into a form that can be read into R can take considerable time and effort. This can add as much as a day, a week, or even months to your project so only adventure to find and clean a data set if you are truly prepared with programming and data wrangling skills.
In [1]:
library(ggplot2)
library(GGally)
In [2]:
redWineData <- read.csv('wineQualityReds.csv')
Let's start by looking at all columns of the data to get a general sense of what the data looks like.
In [3]:
# Drop the row-index column X that came in with the CSV
redWineData$X <- NULL
redWineData
In [4]:
str(redWineData)
In [5]:
names(redWineData)
In [6]:
summary(redWineData)
In [5]:
# Histogram of one column of redWineData; `variable` is the column name as a string
create_plot <- function(variable, binwidth = 0.01) {
  return(ggplot(aes_string(x = variable), data = redWineData) +
           geom_histogram(binwidth = binwidth))
}
In [6]:
create_plot('fixed.acidity', 0.25)
Fixed acidity peaks at around 8. It appears to be roughly normally distributed.
In [7]:
create_plot('volatile.acidity', 0.05)
Volatile acidity has a plateau in the range 0.3 to 0.6. In this plot it looks somewhat like a normal distribution. If I change the bin width, it will look more like a normal distribution.
In [11]:
create_plot('citric.acid', 0.04)
citric.acid has 2 peaks, one at 0 and one at 0.5.
In [14]:
create_plot('residual.sugar', 0.5)
Not much can be seen about the tail from this plot. I will have to change the scale of the data to see it.
In [20]:
create_plot('residual.sugar', 0.05) +
scale_x_log10()
The result is similar to the other fields: on a log scale, residual.sugar looks like a normal distribution.
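The effect of the log scale can also be checked numerically. The sketch below is not part of the original analysis. `skewness` and `compare_log_skewness` are hypothetical helpers (base R has no built-in skewness function) that compute the moment-based sample skewness, which is near 0 for a symmetric distribution and positive for a long right tail. If the log scale really brings residual.sugar closer to a normal shape, the skewness of the log-transformed values should be closer to 0 than that of the raw values.

```r
# Hypothetical helper (not in the original notebook): moment-based sample skewness.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^1.5
}

# Hypothetical helper: skewness before and after a log10 transform.
# Assumes all values are strictly positive.
compare_log_skewness <- function(x) {
  c(raw = skewness(x), log10 = skewness(log10(x)))
}
```

With the data loaded, `compare_log_skewness(redWineData$residual.sugar)` would return both values side by side.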
In [24]:
create_plot('chlorides', 0.05)
create_plot('chlorides', 0.05) +
scale_x_log10()
On a log scale, chlorides appear to be normally distributed.
In [26]:
create_plot('free.sulfur.dioxide', 2.5)
free.sulfur.dioxide has a peak near 5
In [48]:
create_plot('total.sulfur.dioxide', 2)
total.sulfur.dioxide peaks near 10.
In [42]:
create_plot('density', 0.0005)
Again a normal distribution
In [50]:
create_plot('pH', 0.04)
Again a normal distribution
In [52]:
create_plot('sulphates', 0.04)
In [57]:
create_plot('alcohol', 0.2)
This peaks around 9
In [62]:
create_plot('quality', 1)
The main feature of interest is quality. We need to find out which features affect it.
From the data so far it is hard to say which will be useful, but from the description of the variables I would consider the following
No, I did not create any new variables.
All of the data is numerical. The distributions looked approximately normal (some only on a log scale), so no changes were needed.
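The normality of the distributions was judged by eye from the histograms. One possible numeric cross-check, offered here as a sketch rather than as part of the original analysis, is the Shapiro-Wilk test from base R (`shapiro.test`). `normality_table` is a hypothetical helper name. With a sample as large as this one, formal tests tend to flag even small departures from normality, so the W statistic (closer to 1 means closer to normal) is probably more informative than the p-value.

```r
# Hypothetical helper: Shapiro-Wilk W statistic and p-value for every numeric column.
# shapiro.test() accepts between 3 and 5000 non-missing values.
normality_table <- function(df) {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  results <- lapply(numeric_cols, function(col) {
    test <- shapiro.test(df[[col]])
    data.frame(variable = col,
               W = unname(test$statistic),
               p.value = test$p.value,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, results)  # one row per numeric column
}
```

With the data loaded, `normality_table(redWineData)` would list the variables that depart most from a normal shape (lowest W).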
In [15]:
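# Box plot of one column per quality rating; the mean is marked with a cross
create_plot_box <- function(variable, ylim = c(0, 1)) {
  return(ggplot(aes_string(x = 'quality', y = variable, group = 'quality'),
                data = redWineData) +
           geom_boxplot() +
           # `fun` replaces the deprecated `fun.y` argument (ggplot2 3.3.0 and later)
           stat_summary(fun = mean, geom = 'point', shape = 4) +
           coord_cartesian(ylim = ylim))
}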
In [18]:
create_plot_box('fixed.acidity', c(4.5, 13))
There does not seem to be much of a relationship between quality and fixed.acidity. Maybe there is a weak one, with fixed.acidity being higher for higher-quality wine.
In [68]:
create_plot_box('volatile.acidity', c(0.1, 1.2))
Now this one is a very pronounced effect. It can be clearly seen that as quality increases, volatile.acidity decreases.
In [69]:
create_plot_box('citric.acid', c(0, 0.8))
The relationship between quality and citric.acid is direct and easily visible. Higher quality has higher citric.acid
In [70]:
create_plot_box('residual.sugar', c(1, 4.5))
There does not seem to be any relationship
In [73]:
create_plot_box('chlorides', c(0.04, 0.2))
Maybe a weak relationship that higher quality has lower chlorides
In [74]:
create_plot_box('total.sulfur.dioxide', c(0, 170))
The relationship looks bell-shaped across the quality ratings: medium-quality wines have higher total sulphur dioxide.
In [75]:
create_plot_box('free.sulfur.dioxide', c(0, 45))
The relationship looks bell-shaped across the quality ratings: medium-quality wines have higher free sulphur dioxide.
In [77]:
create_plot_box('density', c(0.991, 1.002))
Higher-quality wine has lower density.
In [78]:
create_plot_box('pH', c(2.9, 3.8))
Higher-quality wine has lower pH.
In [79]:
create_plot_box('sulphates', c(0.3, 1.05))
Higher quality wine has higher sulphates
In [80]:
create_plot_box('alcohol', c(8, 14))
Higher-quality wine has higher alcohol.
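The box plot observations above can be backed by numbers. As a minimal sketch (not part of the original analysis), the hypothetical helper `median_by_quality` computes the median of a column for each quality rating, which is the value the box plots draw as the middle line. A steadily rising or falling sequence would support the trends noted above, and a rise followed by a fall would match the bell-shaped pattern seen for sulphur dioxide.

```r
# Hypothetical helper: median of one column for each quality rating.
# `df` must contain a `quality` column; `variable` is a column name as a string.
median_by_quality <- function(df, variable) {
  tapply(df[[variable]], df$quality, median)
}
```

For example, `median_by_quality(redWineData, 'alcohol')` would return one median per quality rating, named by rating.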
Now let's make a grid to see the relationships between all variables to see whether I missed anything.
In [229]:
theme_set(theme_minimal(20))
set.seed(1836)
ggpairs(redWineData, axisLabels = 'internal')
Let's check the correlations between the variables as numbers, since the grid above isn't that clear and I want to see them more clearly.
In [14]:
cor(redWineData)
The plot and table above show me that there are various other relationships that can be explored. I am treating any correlation with a magnitude above 0.4 as something worth exploring. Examples are explored in the plots below.
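The 0.4 threshold can be applied to the correlation matrix programmatically instead of reading the table by eye. The following is a sketch under that assumption, and `strong_pairs` is a hypothetical helper name. It keeps only the upper triangle of the matrix so that each pair appears once, and sorts the pairs by the magnitude of the correlation.

```r
# Hypothetical helper: variable pairs whose correlation magnitude exceeds `threshold`.
strong_pairs <- function(df, threshold = 0.4) {
  m <- cor(df)
  # Upper triangle only: skips the diagonal and mirrored duplicates
  idx <- which(upper.tri(m) & abs(m) > threshold, arr.ind = TRUE)
  out <- data.frame(var1 = rownames(m)[idx[, "row"]],
                    var2 = colnames(m)[idx[, "col"]],
                    r = m[idx],
                    stringsAsFactors = FALSE)
  out[order(-abs(out$r)), ]  # strongest relationships first
}
```

Calling `strong_pairs(redWineData)` would list every pair that passes the threshold, including pairs that involve quality.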
In [61]:
create_scatter_plot <- function(x, y, alpha) {
  return(ggplot(aes_string(x = x, y = y), data = redWineData) +
           geom_point(alpha = alpha))
}
Let's make a plot of citric acid and fixed acidity, which have a correlation of 0.67170343
In [63]:
create_scatter_plot('citric.acid', 'fixed.acidity', 1/5)
As citric acid increases, fixed acidity also increases. But we can see that citric acid does not fully determine fixed acidity: even where citric acid is 0 there is fixed acidity present, so some other factor must contribute to it.
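One way to put a number on the acidity that remains when citric acid is 0 is the intercept of a straight-line fit. This is only a sketch and assumes a linear relationship, which the scatter plot does not guarantee. `fit_line` is a hypothetical helper name.

```r
# Hypothetical helper: least-squares line y ~ x for two columns given by name.
fit_line <- function(df, x, y) {
  model <- lm(reformulate(x, response = y), data = df)
  coef(model)  # named vector: "(Intercept)" and the slope for x
}
```

With the data loaded, `fit_line(redWineData, 'citric.acid', 'fixed.acidity')` would return the intercept and slope. An intercept well above 0 would be consistent with the observation that fixed acidity is present even without citric acid.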
Let's make a plot of density and fixed acidity which has a correlation of 0.66804729
In [68]:
create_scatter_plot('density', 'fixed.acidity', 1/5)
We can see that as density increases fixed acidity also increases.
Let's make a plot of fixed acidity and pH which has a correlation of -0.68297819
In [69]:
create_scatter_plot('fixed.acidity', 'pH', 1/5)
As fixed acidity increases pH decreases
Let's make a plot of citric acid and volatile acidity which has a correlation of -0.552495685
In [70]:
create_scatter_plot('citric.acid', 'volatile.acidity', 1/5)
As citric acid increases volatile acidity decreases
Let's make a plot of citric acid and pH which has correlation of -0.54190414
In [71]:
create_scatter_plot('citric.acid', 'pH', 1/5)
As citric acid increases, the pH decreases
Let's make a plot between free sulphur dioxide and total sulphur dioxide which has a correlation of 0.667666450
In [72]:
create_scatter_plot('free.sulfur.dioxide', 'total.sulfur.dioxide', 1/5)
Out of all the scatter plots that I have seen so far, this one interests me the most. The lower edge of the points seems to form a straight line, while all the other plots so far were more spread out. Maybe the total sulphur dioxide is always at least the amount of free sulphur dioxide.
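The guess that total sulphur dioxide is never below free sulphur dioxide can be tested directly on the rows. The sketch below is not part of the original analysis, and `bound_violations` is a hypothetical helper name. It returns the rows where the supposed lower bound exceeds the supposed upper bound, so an empty result means the bound holds for every wine.

```r
# Hypothetical helper: rows where the `lower` column exceeds the `upper` column,
# i.e. rows that violate the assumed bound. Column names are passed as strings.
bound_violations <- function(df, lower, upper) {
  df[df[[lower]] > df[[upper]], , drop = FALSE]
}
```

For example, `nrow(bound_violations(redWineData, 'free.sulfur.dioxide', 'total.sulfur.dioxide'))` would be 0 if the bound holds throughout the data set.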
These do not seem to have a direct relationship with quality
These seem to be related with quality
Of the 4 variables chosen initially, all except free.sulfur.dioxide seemed to have a strong relationship with the quality of the wine. sulphates were not expected to matter, but they also had a strong relationship.
Many pairs of variables had a correlation magnitude greater than 0.5, but none of the variables was correlated with quality that strongly.
Total sulphur dioxide seems to be bounded below by free sulphur dioxide.
Looking at the graphs and the correlations
Here I will explore whether variables that don't have a high correlation with quality by themselves show some kind of trend with quality when taken together.
In [116]:
create_multi_plot <- function(x, y, alpha) {
  return(ggplot(aes_string(x = x, y = y),
                data = redWineData) +
           facet_wrap(~quality) +
           geom_point(alpha = alpha) +
           theme(axis.text.x = element_text(angle = 90)))
}
I will start with acidity, as it seems to have some relationship with quality. But what about the different types of acidity taken together?
In [113]:
create_multi_plot('volatile.acidity', 'fixed.acidity', 0.1)
From the graphs it looks like higher-quality wine has lower volatile acidity, and its fixed acidity is grouped between 7 and 12.
Now I will see the combined effect of volatile acidity and citric acid on quality. Both of them are related to the taste of wine so trying to find out what kind of taste makes better wine
In [114]:
create_multi_plot('volatile.acidity', 'citric.acid', 0.08)
From this it looks like lower-quality wine has a more vinegar-like taste. As the quality increases, there seem to be 2 types of taste that work well, visible as 2 seemingly distinct groups of points. They are most apparent for wine rated 7 and slightly visible for wine rated 6.
So for good-quality wine, maybe one of citric acid and volatile acidity should be present in a medium quantity while the other is low or negligible, but not both at once.
Next I will look at fixed.acidity against density, split by quality, to see whether anything stands out.
In [122]:
create_multi_plot('density', 'fixed.acidity', 0.1)
Looking at this nothing actionable seems to come out.
Let me try citric acid and residual sugar. Both of them change taste
In [128]:
create_multi_plot('citric.acid', 'residual.sugar', 0.1)
Again looking at this nothing actionable seems to come out.
Let me try citric acid and free sulphur dioxide. They say older wine is better. I wonder how the components that add freshness and prevent microbial growth affect the quality
In [130]:
create_multi_plot('citric.acid', 'free.sulfur.dioxide', 0.1)
There doesn't seem to be anything interesting here.
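The faceted scatter plots above rely on spotting patterns by eye. A numeric companion, given here as a sketch that is not part of the original analysis, is the correlation between the two variables within each quality rating. `cor_by_quality` is a hypothetical helper name. If the correlation changes noticeably from one rating to the next, that pair of variables may be worth a closer look.

```r
# Hypothetical helper: Pearson correlation of two columns within each quality rating.
# `df` must contain a `quality` column; `x` and `y` are column names as strings.
cor_by_quality <- function(df, x, y) {
  groups <- split(df, df$quality)
  sapply(groups, function(g) cor(g[[x]], g[[y]]))
}
```

For example, `cor_by_quality(redWineData, 'volatile.acidity', 'citric.acid')` would return one correlation per rating. Ratings with very few wines give unstable values, so those should be read with caution.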
In [30]:
# Labelled box plot of one column per quality rating; the mean is marked with a cross
create_final_plot <- function(variable, ylabel = 'ylabel', title = 'title', ylim = c(0, 1)) {
  return(ggplot(aes_string(x = 'quality', y = variable, group = 'quality'), data = redWineData) +
           geom_boxplot() +
           xlab('Quality rating of Wine (1 to 10)') +
           ylab(ylabel) +
           labs(title = title) +
           # `fun` replaces the deprecated `fun.y` argument (ggplot2 3.3.0 and later)
           stat_summary(fun = mean, geom = 'point', shape = 4) +
           scale_x_continuous(breaks = 3:8) +
           coord_cartesian(ylim = ylim)
  )
}
In [32]:
create_final_plot(
'volatile.acidity',
'acetic acid (g / dm^3)',
'Effect of Acetic Acid on Quality of Wine',
c(0.1, 1.2)
)
This plot shows the relationship between acetic acid and the quality of wine. It struck me because the relationship was clear to the eye. Higher-quality wine had lower acetic acid on average.
In [33]:
create_final_plot(
'citric.acid',
'citric acid (g / dm^3)',
'Effect of citric acid on Quality of Wine',
c(0, 0.8)
)
This plot shows the relationship between citric acid and the quality of wine. It struck me because the relationship was clear to the eye. Higher-quality wine had higher citric acid on average.
In [132]:
create_multi_plot('volatile.acidity', 'citric.acid', 0.08) +
  labs(title = "Acetic Acid and Citric Acid's Effect on Wine") +
  xlab('acetic acid (g / dm^3)') +
  ylab('citric acid (g / dm^3)')
This struck me because good-quality wine seems to have one of citric acid and acetic acid in a medium quantity while the other is low or negligible.
While doing this project I first looked at the distributions of single variables. Not much came out of them, except that most of them looked roughly normal. I was thinking I wasn't going to get anything out of this dataset.
But when I used the bivariate analysis some insights about the correlation came out that can be very interesting. Looking at the various plots and correlations I drew the following general conclusions
But the multivariate analysis was a challenge. I was not really sure what kind of relationships I could explore. The ggpairs plot prepared earlier gave me some direction, but not much, so the multivariate analysis ended up rather weak. To improve it, I could check the original paper that did this wine analysis, as suggested somewhere in the Udacity forums, to see whether there really are no relationships or whether I just did not have enough imagination. That would be against the Honour code, though, so I will do it after the project is finalized.