For the final project, you will conduct your own exploratory data analysis and create an RMD file that explores the variables, structure, patterns, oddities, and underlying relationships of a data set of your choice.

The analysis should be almost like a stream-of-consciousness as you ask questions, create visualizations, and explore your data.

This project is open-ended in that we are not looking for one right answer. As John Tukey stated, "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." We want you to ask interesting questions about data and give you a chance to explore. We will provide some options of data sets to explore; however, you may choose to explore an entirely different data set. You should be aware that finding your own data set and cleaning it into a form that can be read into R can take considerable time and effort. This can add as much as a day, a week, or even months to your project, so only venture to find and clean a data set if you are truly prepared with programming and data wrangling skills.



Red Wine Quality

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).



Red Wine Quality Exploration by Aseem Bansal

In this report we explore the chemical properties of red wine and try to find which properties affect the quality of the wine. The data set contains 1,599 red wines with 11 chemical variables plus a quality rating.


In [1]:
library(ggplot2)
library(GGally)

In [2]:
redWineData <- read.csv('wineQualityReds.csv')

Let's start by looking at all columns of the data to get a general sense of what the data looks like.


In [3]:
redWineData$X <- NULL

redWineData


fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
7.4 0.700 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.8 0.880 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5
7.8 0.760 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5
11.2 0.280 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6
7.4 0.700 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.4 0.660 0.00 1.8 0.075 13 40 0.9978 3.51 0.56 9.4 5
7.9 0.600 0.06 1.6 0.069 15 59 0.9964 3.30 0.46 9.4 5
7.3 0.650 0.00 1.2 0.065 15 21 0.9946 3.39 0.47 10.0 7
7.8 0.580 0.02 2.0 0.073 9 18 0.9968 3.36 0.57 9.5 7
7.5 0.500 0.36 6.1 0.071 17 102 0.9978 3.35 0.80 10.5 5
6.7 0.580 0.08 1.8 0.097 15 65 0.9959 3.28 0.54 9.2 5
7.5 0.500 0.36 6.1 0.071 17 102 0.9978 3.35 0.80 10.5 5
5.6 0.615 0.00 1.6 0.089 16 59 0.9943 3.58 0.52 9.9 5
7.8 0.610 0.29 1.6 0.114 9 29 0.9974 3.26 1.56 9.1 5
8.9 0.620 0.18 3.8 0.176 52 145 0.9986 3.16 0.88 9.2 5
8.9 0.620 0.19 3.9 0.170 51 148 0.9986 3.17 0.93 9.2 5
8.5 0.280 0.56 1.8 0.092 35 103 0.9969 3.30 0.75 10.5 7
8.1 0.560 0.28 1.7 0.368 16 56 0.9968 3.11 1.28 9.3 5
7.4 0.590 0.08 4.4 0.086 6 29 0.9974 3.38 0.50 9.0 4
7.9 0.320 0.51 1.8 0.341 17 56 0.9969 3.04 1.08 9.2 6
8.9 0.220 0.48 1.8 0.077 29 60 0.9968 3.39 0.53 9.4 6
7.6 0.390 0.31 2.3 0.082 23 71 0.9982 3.52 0.65 9.7 5
7.9 0.430 0.21 1.6 0.106 10 37 0.9966 3.17 0.91 9.5 5
8.5 0.490 0.11 2.3 0.084 9 67 0.9968 3.17 0.53 9.4 5
6.9 0.400 0.14 2.4 0.085 21 40 0.9968 3.43 0.63 9.7 6
6.3 0.390 0.16 1.4 0.080 11 23 0.9955 3.34 0.56 9.3 5
7.6 0.410 0.24 1.8 0.080 4 11 0.9962 3.28 0.59 9.5 5
7.9 0.430 0.21 1.6 0.106 10 37 0.9966 3.17 0.91 9.5 5
7.1 0.710 0.00 1.9 0.080 14 35 0.9972 3.47 0.55 9.4 5
7.8 0.645 0.00 2.0 0.082 8 16 0.9964 3.38 0.59 9.8 6
6.2 0.510 0.14 1.9 0.056 15 34 0.99396 3.48 0.57 11.5 6
6.4 0.360 0.53 2.2 0.230 19 35 0.99340 3.37 0.93 12.4 6
6.4 0.380 0.14 2.2 0.038 15 25 0.99514 3.44 0.65 11.1 6
7.3 0.690 0.32 2.2 0.069 35 104 0.99632 3.33 0.51 9.5 5
6.0 0.580 0.20 2.4 0.075 15 50 0.99467 3.58 0.67 12.5 6
5.6 0.310 0.78 13.9 0.074 23 92 0.99677 3.39 0.48 10.5 6
7.5 0.520 0.40 2.2 0.060 12 20 0.99474 3.26 0.64 11.8 6
8.0 0.300 0.63 1.6 0.081 16 29 0.99588 3.30 0.78 10.8 6
6.2 0.700 0.15 5.1 0.076 13 27 0.99622 3.54 0.60 11.9 6
6.8 0.670 0.15 1.8 0.118 13 20 0.99540 3.42 0.67 11.3 6
6.2 0.560 0.09 1.7 0.053 24 32 0.99402 3.54 0.60 11.3 5
7.4 0.350 0.33 2.4 0.068 9 26 0.99470 3.36 0.60 11.9 6
6.2 0.560 0.09 1.7 0.053 24 32 0.99402 3.54 0.60 11.3 5
6.1 0.715 0.10 2.6 0.053 13 27 0.99362 3.57 0.50 11.9 5
6.2 0.460 0.29 2.1 0.074 32 98 0.99578 3.33 0.62 9.8 5
6.7 0.320 0.44 2.4 0.061 24 34 0.99484 3.29 0.80 11.6 7
7.2 0.390 0.44 2.6 0.066 22 48 0.99494 3.30 0.84 11.5 6
7.5 0.310 0.41 2.4 0.065 34 60 0.99492 3.34 0.85 11.4 6
5.8 0.610 0.11 1.8 0.066 18 28 0.99483 3.55 0.66 10.9 6
7.2 0.660 0.33 2.5 0.068 34 102 0.99414 3.27 0.78 12.8 6
6.6 0.725 0.20 7.8 0.073 29 79 0.99770 3.29 0.54 9.2 5
6.3 0.550 0.15 1.8 0.077 26 35 0.99314 3.32 0.82 11.6 6
5.4 0.740 0.09 1.7 0.089 16 26 0.99402 3.67 0.56 11.6 6
6.3 0.510 0.13 2.3 0.076 29 40 0.99574 3.42 0.75 11.0 6
6.8 0.620 0.08 1.9 0.068 28 38 0.99651 3.42 0.82 9.5 6
6.2 0.600 0.08 2.0 0.090 32 44 0.99490 3.45 0.58 10.5 5
5.9 0.550 0.10 2.2 0.062 39 51 0.99512 3.52 0.76 11.2 6
6.3 0.510 0.13 2.3 0.076 29 40 0.99574 3.42 0.75 11.0 6
5.9 0.645 0.12 2.0 0.075 32 44 0.99547 3.57 0.71 10.2 5
6.0 0.310 0.47 3.6 0.067 18 42 0.99549 3.39 0.66 11.0 6
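Printing the whole data frame fills the notebook with rows that are hard to scan. A minimal sketch of a lighter first look follows. The `wine_sample` data frame is a stand-in built from the first three rows printed above (four columns only), so the snippet runs without the CSV; with the real data, the same calls would be applied to `redWineData`.

```r
# Stand-in for redWineData: first three rows printed above, four columns only.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76),
  alcohol          = c(9.4, 9.8, 9.8),
  quality          = c(5L, 5L, 5L)
)

# Number of rows and columns
dim(wine_sample)

# Only the first two rows instead of the whole frame
first_rows <- head(wine_sample, n = 2)
first_rows
```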

In [4]:
str(redWineData)


'data.frame':	1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

In [5]:
names(redWineData)


  1. 'fixed.acidity'
  2. 'volatile.acidity'
  3. 'citric.acid'
  4. 'residual.sugar'
  5. 'chlorides'
  6. 'free.sulfur.dioxide'
  7. 'total.sulfur.dioxide'
  8. 'density'
  9. 'pH'
  10. 'sulphates'
  11. 'alcohol'
  12. 'quality'

In [6]:
summary(redWineData)


 fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
   chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
 Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
 1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
 Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
 Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
 3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
 Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
       pH          sulphates         alcohol         quality     
 Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
 1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
 Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
 Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
 3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
 Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000  

Now let's create some plots. I will plot each variable once to get a sense of all of them.

Histograms of various variables


In [5]:
create_plot <- function(variable, binwidth = 0.01) {
    # Histogram of one column of redWineData, named by the string `variable`.
    return(ggplot(aes_string(x = variable), data = redWineData) +
               geom_histogram(binwidth = binwidth))
    }
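`aes_string()` is soft-deprecated in ggplot2 3.0.0 and later. The sketch below shows one possible replacement using the `.data` pronoun. The name `create_plot_tidy`, its extra `data` argument, and the `toy` data frame are assumptions made for illustration and are not part of the original notebook.

```r
library(ggplot2)

# Hypothetical variant of create_plot: the data frame is passed in explicitly
# and the column is looked up by name through the .data pronoun.
create_plot_tidy <- function(data, variable, binwidth = 0.01) {
  ggplot(data, aes(x = .data[[variable]])) +
    geom_histogram(binwidth = binwidth) +
    xlab(variable)
}

# Toy data so the sketch runs without wineQualityReds.csv.
toy <- data.frame(fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.9))
p <- create_plot_tidy(toy, 'fixed.acidity', 0.25)
```

With the real data the call would be `create_plot_tidy(redWineData, 'fixed.acidity', 0.25)`.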

In [6]:
create_plot('fixed.acidity', 0.25)


The fixed acidity peaks at around 8. It appears roughly normal, although the summary above (mean 8.32 above the median 7.90, maximum 15.90) suggests a right tail.


In [7]:
create_plot('volatile.acidity', 0.05)


Volatile acidity has a plateau in the range 0.3 to 0.6. It looks somewhat like a normal distribution in this plot, and with a different bin width it would probably look more like one.
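Because the apparent shape depends on the bin width, it can help to start from a rule of thumb rather than trial and error. The sketch below uses the Freedman-Diaconis rule (bin width = 2 * IQR / n^(1/3)), which is a technique swapped in here and not something the notebook itself uses. `fd_binwidth` is a hypothetical helper name.

```r
# Freedman-Diaconis bin width: 2 * IQR / n^(1/3), ignoring missing values.
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1/3)
}

# First ten volatile.acidity values from the str() output above.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
fd_binwidth(volatile_acidity)
```

On the full data the result of `fd_binwidth(redWineData$volatile.acidity)` could be passed as the `binwidth` argument of `create_plot`.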


In [11]:
create_plot('citric.acid', 0.04)


citric.acid has two peaks, at 0 and around 0.5.


In [14]:
create_plot('residual.sugar', 0.5)


Not much can be seen about the tail from this plot. I will have to change the scale to see it.


In [20]:
create_plot('residual.sugar', 0.05) + 
    scale_x_log10()


As with the other fields, residual.sugar looks approximately normal once it is plotted on a log10 scale.
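One way to check that a log transform really removes the long tail is to compare the sample skewness before and after. The sketch below does this on synthetic log-normal values that merely stand in for residual.sugar; the `skewness` helper and the simulated numbers are assumptions for illustration, not results from the wine data.

```r
# Sample skewness (moment estimator): third central moment over variance^1.5.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
# Synthetic long-tailed values standing in for residual.sugar (not the real data).
sugar_like <- rlnorm(1000, meanlog = log(2.2), sdlog = 0.4)

skew_raw <- skewness(sugar_like)         # positive: long right tail
skew_log <- skewness(log10(sugar_like))  # much closer to zero after the transform
c(raw = skew_raw, log10 = skew_log)
```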


In [24]:
create_plot('chlorides', 0.05)

create_plot('chlorides', 0.05) + 
    scale_x_log10()


chlorides also appear approximately normal on the log10 scale.


In [26]:
create_plot('free.sulfur.dioxide', 2.5)


free.sulfur.dioxide has a peak near 5


In [48]:
create_plot('total.sulfur.dioxide', 2)


total.sulfur.dioxide peaks near 10.


In [42]:
create_plot('density', 0.0005)


Density again looks normally distributed.


In [50]:
create_plot('pH', 0.04)


pH also looks normally distributed.


In [52]:
create_plot('sulphates', 0.04)



In [57]:
create_plot('alcohol', 0.2)


Alcohol peaks around 9.


In [62]:
create_plot('quality', 1)


Univariate Analysis

What is the structure of your dataset?

  • The data set has 1,599 rows, one per wine.
  • 11 numerical variables are present. Most of them look roughly normal; the clearest exception is citric.acid, which has a bimodal distribution
  • 1 integer variable (quality) is present
  • Distribution of data
    • Roughly normal: fixed.acidity, volatile.acidity, residual.sugar (on a log10 scale), chlorides (on a log10 scale), density, pH, sulphates, quality
    • Not normal: citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, alcohol
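The split above is judged by eye from the histograms. A rough cross-check, sketched below, is to see how far the mean sits above the median in units of the interquartile range; a value near zero is consistent with a symmetric distribution. The numbers are copied from the `summary(redWineData)` output above for five of the variables, and the `mean_median_gap` indicator is an ad hoc choice for illustration, not a formal test.

```r
# Quartiles, medians, and means copied from the summary() output above.
stats <- data.frame(
  variable = c('fixed.acidity', 'residual.sugar', 'total.sulfur.dioxide',
               'pH', 'alcohol'),
  q1     = c(7.10, 1.900, 22.00, 3.210,  9.50),
  median = c(7.90, 2.200, 38.00, 3.310, 10.20),
  mean   = c(8.32, 2.539, 46.47, 3.311, 10.42),
  q3     = c(9.20, 2.600, 62.00, 3.400, 11.10)
)

# Crude skew indicator: (mean - median) / IQR. Positive values hint at a right tail.
stats$mean_median_gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Largest gap first
stats[order(-stats$mean_median_gap), c('variable', 'mean_median_gap')]
```

Among these five, residual.sugar shows the largest gap and pH almost none, which fits the need for a log scale on residual.sugar.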

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. We need to find which features affect it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the data so far it is hard to say which will be useful but from the description of the variables I would consider the following

  • volatile.acidity: too high of levels can lead to an unpleasant, vinegar taste
  • citric.acid: can add 'freshness' and flavor to wines
  • free.sulfur.dioxide: prevents microbial growth and the oxidation of wine
  • alcohol: the percent alcohol content of the wine

Did you create any new variables from existing variables in the dataset?

No, I did not create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

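All of the data is numerical. Most distributions were roughly normal. The exceptions were citric.acid, which is bimodal, and residual.sugar and chlorides, whose long tails I plotted on a log10 scale. I dropped the X index column when loading the data. No other changes were needed.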

Bivariate Plots Section

Let me plot each variable against quality to see what kind of correlation appears.

Box Plots of various variables vs. quality


In [15]:
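create_plot_box <- function(variable, ylim = c(0, 1)) {
    # Box plot of `variable` for each quality rating; an x marks the group mean.
    return(ggplot(aes_string(x = 'quality', y = variable, group = 'quality'), 
                  data = redWineData) +
           geom_boxplot() +
           # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
           stat_summary(fun = mean, geom = 'point', shape = 4) +
           # zoom in without dropping the points outside ylim
           coord_cartesian(ylim = ylim))
    }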

In [18]:
create_plot_box('fixed.acidity', c(4.5, 13))


There does not seem to be much of a relationship between quality and fixed.acidity. Maybe a weak one, with fixed.acidity slightly higher for higher-quality wine.


In [68]:
create_plot_box('volatile.acidity', c(0.1, 1.2))


Now this one is a very pronounced effect. It can be clearly seen that as quality increases volatile.acidity decreases
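The trend read off the box plot can also be put into numbers by summarising the variable within each quality rating. The sketch below uses only the first ten rows from the `str()` output above so that it runs without the CSV; ten rows are far too few to show the real trend, so treat the printed values as a demonstration of the call rather than a result.

```r
# First ten rows of volatile.acidity and quality from the str() output above.
first_ten <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean and median of volatile.acidity within each quality rating.
group_mean   <- tapply(first_ten$volatile.acidity, first_ten$quality, mean)
group_median <- tapply(first_ten$volatile.acidity, first_ten$quality, median)
rbind(mean = group_mean, median = group_median)
```

On the full data the same two `tapply` calls would take `redWineData$volatile.acidity` and `redWineData$quality`.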


In [69]:
create_plot_box('citric.acid', c(0, 0.8))


The relationship between quality and citric.acid is direct and easily visible. Higher quality has higher citric.acid


In [70]:
create_plot_box('residual.sugar', c(1, 4.5))


There does not seem to be any relationship


In [73]:
create_plot_box('chlorides', c(0.04, 0.2))


Maybe a weak relationship that higher quality has lower chlorides


In [74]:
create_plot_box('total.sulfur.dioxide', c(0, 170))


The relationship looks hump-shaped: medium-quality wines have higher total sulfur dioxide than both low- and high-quality wines.


In [75]:
create_plot_box('free.sulfur.dioxide', c(0, 45))


The relationship again looks hump-shaped: medium-quality wines have higher free sulfur dioxide than both low- and high-quality wines.


In [77]:
create_plot_box('density', c(0.991, 1.002))


Higher-quality wines have lower density.


In [78]:
create_plot_box('pH', c(2.9, 3.8))


Higher-quality wines have lower pH.


In [79]:
create_plot_box('sulphates', c(0.3, 1.05))


Higher-quality wines have higher sulphates.


In [80]:
create_plot_box('alcohol', c(8, 14))


Higher-quality wines have higher alcohol.

Now let's make a grid to see the relationships between all variables to see whether I missed anything.


In [229]:
theme_set(theme_minimal(20))

set.seed(1836)
ggpairs(redWineData, axisLabels = 'internal')


The grid above isn't that clear, so let's check the correlations between the variables directly.


In [14]:
cor(redWineData)


variable fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
fixed.acidity 1.00000000 -0.256130895 0.67170343 0.114776724 0.093705186 -0.153794193 -0.11318144 0.66804729 -0.68297819 0.183005664 -0.06166827 0.12405165
volatile.acidity -0.25613089 1.000000000 -0.55249568 0.001917882 0.061297772 -0.010503827 0.07647000 0.02202623 0.23493729 -0.260986685 -0.20228803 -0.39055778
citric.acid 0.67170343 -0.552495685 1.00000000 0.143577162 0.203822914 -0.060978129 0.03553302 0.36494718 -0.54190414 0.312770044 0.10990325 0.22637251
residual.sugar 0.11477672 0.001917882 0.14357716 1.000000000 0.055609535 0.187048995 0.20302788 0.35528337 -0.08565242 0.005527121 0.04207544 0.01373164
chlorides 0.09370519 0.061297772 0.20382291 0.055609535 1.000000000 0.005562147 0.04740047 0.20063233 -0.26502613 0.371260481 -0.22114054 -0.12890656
free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813 0.187048995 0.005562147 1.000000000 0.66766645 -0.02194583 0.07037750 0.051657572 -0.06940835 -0.05065606
total.sulfur.dioxide -0.11318144 0.076470005 0.03553302 0.203027882 0.047400468 0.667666450 1.00000000 0.07126948 -0.06649456 0.042946836 -0.20565394 -0.18510029
density 0.66804729 0.022026232 0.36494718 0.355283371 0.200632327 -0.021945831 0.07126948 1.00000000 -0.34169933 0.148506412 -0.49617977 -0.17491923
pH -0.68297819 0.234937294 -0.54190414 -0.085652422 -0.265026131 0.070377499 -0.06649456 -0.34169933 1.00000000 -0.196647602 0.20563251 -0.05773139
sulphates 0.18300566 -0.260986685 0.31277004 0.005527121 0.371260481 0.051657572 0.04294684 0.14850641 -0.19664760 1.000000000 0.09359475 0.25139708
alcohol -0.06166827 -0.202288027 0.10990325 0.042075437 -0.221140545 -0.069408354 -0.20565394 -0.49617977 0.20563251 0.093594750 1.00000000 0.47616632
quality 0.12405165 -0.390557780 0.22637251 0.013731637 -0.128906560 -0.050656057 -0.18510029 -0.17491923 -0.05773139 0.251397079 0.47616632 1.00000000

The plot and table above show that there are various other relationships worth exploring. I am treating any correlation with a magnitude above 0.4 as worth a look. Examples:

  • fixed.acidity and pH had correlation of -0.683
  • citric.acid and pH had correlation of -0.542
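Picking pairs out of a 12 by 12 table by eye is error-prone, so the threshold rule can be applied in code. The sketch below uses a four-variable excerpt of the correlation table above, rounded to three decimals. `strong_pairs` is a hypothetical helper written for this illustration; the 0.4 default mirrors the threshold stated in the text.

```r
# Correlations copied (rounded) from the cor(redWineData) table above,
# restricted to four variables to keep the sketch short.
vars <- c('fixed.acidity', 'citric.acid', 'density', 'pH')
cm <- matrix(c(
   1.000,  0.672,  0.668, -0.683,
   0.672,  1.000,  0.365, -0.542,
   0.668,  0.365,  1.000, -0.342,
  -0.683, -0.542, -0.342,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: list each pair once when |r| exceeds the threshold,
# strongest first. Only the upper triangle is scanned to skip duplicates.
strong_pairs <- function(cor_matrix, threshold = 0.4) {
  idx <- which(upper.tri(cor_matrix) & abs(cor_matrix) > threshold,
               arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_matrix)[idx[, 'row']],
    var2 = colnames(cor_matrix)[idx[, 'col']],
    r    = cor_matrix[idx]
  )
  pairs[order(-abs(pairs$r)), ]
}

result <- strong_pairs(cm)
result
```

With the full data, `strong_pairs(cor(redWineData))` would scan every pair at once.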

In [61]:
create_scatter_plot <- function(x, y, alpha) {
    return(ggplot(aes_string(x = x, y =  y), data = redWineData) +
               geom_point(alpha = alpha))
}

Let's make a plot of citric acid and fixed acidity that has a correlation of 0.67170343


In [63]:
create_scatter_plot('citric.acid', 'fixed.acidity', 1/5)


As citric acid increases, fixed acidity also increases, but the relationship is not one-to-one. Wines with zero citric acid still have fixed acidity, so some other component must contribute to it.

Let's make a plot of density and fixed acidity which has a correlation of 0.66804729


In [68]:
create_scatter_plot('density', 'fixed.acidity', 1/5)


We can see that as density increases fixed acidity also increases.

Let's make a plot of fixed acidity and pH which has a correlation of -0.68297819


In [69]:
create_scatter_plot('fixed.acidity', 'pH', 1/5)


As fixed acidity increases pH decreases

Let's make a plot of citric acid and volatile acidity which has a correlation of -0.552495685


In [70]:
create_scatter_plot('citric.acid', 'volatile.acidity', 1/5)


As citric acid increases volatile acidity decreases

Let's make a plot of citric acid and pH which has correlation of -0.54190414


In [71]:
create_scatter_plot('citric.acid', 'pH', 1/5)


As citric acid increases, the pH decreases.

Let's make a plot between free sulphur dioxide and total sulphur dioxide which has a correlation of 0.667666450


In [72]:
create_scatter_plot('free.sulfur.dioxide', 'total.sulfur.dioxide', 1/5)


Of all the scatter plots so far, this one interests me the most. The lower edge of the points forms a straight line, while the other plots were more spread out. Maybe the total sulphur dioxide is always at least the amount of free sulphur dioxide.
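This guess can be tested directly instead of being read off the plot. The sketch below checks it on the first ten rows from the `str()` output above. The interpretation of the difference as bound sulphur dioxide assumes that the total is defined as free plus bound forms, which the notebook does not state.

```r
# First ten rows from the str() output above.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# TRUE if no wine has more free than total sulphur dioxide.
all(total_so2 >= free_so2)

# Remainder; this is the bound part if total = free + bound (an assumption).
bound_so2 <- total_so2 - free_so2
range(bound_so2)
```

On the full data the check would be `all(redWineData$total.sulfur.dioxide >= redWineData$free.sulfur.dioxide)`.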

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Weak or No Relationships

These do not seem to have a direct relationship with quality

  • fixed.acidity
  • residual.sugar
  • chlorides: The higher the quality of the wine, the lower the mean chlorides. But the relationship did not seem strong, as the decrease was small

Medium or Strong Relationships

These seem to be related with quality

  • volatile.acidity: The higher the quality of the wine, the lower the mean volatile.acidity
  • citric.acid: The higher the quality of the wine, the higher the mean citric.acid
  • free.sulfur.dioxide: Medium-quality wines had the highest levels of this chemical
  • total.sulfur.dioxide: Medium-quality wines had the highest levels of this chemical
  • density: The mean density decreases as quality increases
  • pH: The mean pH decreases as quality increases
  • sulphates: Higher-quality wines have higher sulphates
  • alcohol: Higher-quality wines have higher alcohol

From the chosen 4 initial variables

  • volatile.acidity
  • citric.acid
  • free.sulfur.dioxide
  • alcohol

All except free.sulfur.dioxide seemed to have a strong relationship with the quality of the wine. Sulphates were not among my initial picks, but they also had a strong relationship.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Many variables had correlation magnitude greater than 0.5

  • fixed.acidity and pH had correlation of -0.683
  • citric.acid and pH had correlation of -0.542
  • citric.acid and fixed.acidity had correlation of 0.672
  • density and fixed.acidity had correlation of 0.668
  • citric.acid and volatile.acidity had correlation of -0.552
  • total.sulfur.dioxide and free.sulfur.dioxide had correlation of 0.668

But no variable was correlated with quality that strongly. The largest correlations with quality were

  • alcohol 0.476
  • volatile.acidity -0.391
  • sulphates 0.251
  • citric.acid 0.226

Total sulphur dioxide seems to be bounded below by free sulphur dioxide.

What was the strongest relationship you found?

Looking at the graphs and the correlations

  • volatile.acidity with quality
  • citric.acid with quality
  • sulphates with quality
  • alcohol with quality

Multivariate Plots Section

Here I will explore whether variables that don't have a high correlation with quality by themselves show some kind of trend with quality when taken together.


In [116]:
create_multi_plot <- function(x, y, alpha) {
    return(ggplot(aes_string(x = x, y =  y), 
                  data = redWineData) +
           facet_wrap(~quality) +
           geom_point(alpha = alpha) +
           theme(axis.text.x = element_text(angle = 90)))
}

I will start with the acidity as it seems that it has some relationships with quality. But what about different type of acidity taken together?


In [113]:
create_multi_plot('volatile.acidity', 'fixed.acidity', 0.1)


Looking at the graphs, higher-quality wines have lower volatile acidity, and their fixed acidity is concentrated between 7 and 12.

Now I will see the combined effect of volatile acidity and citric acid on quality. Both of them are related to the taste of wine so trying to find out what kind of taste makes better wine


In [114]:
create_multi_plot('volatile.acidity', 'citric.acid', 0.08)


From this it looks like lower-quality wine has a more vinegar-like taste. As the quality increases, there seem to be two kinds of taste that work well, visible as two seemingly distinct groups of points. They are most apparent at rating 7 and slightly visible at rating 6.

  • Low or negligible citric acid with low volatile acidity.
  • Medium citric acid with low or negligible volatile acidity

So for good-quality wine, maybe one of citric acid and volatile acidity should be present in medium quantity while the other is low or negligible, but not both.
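The "medium in one, low in the other" idea can be examined by binning both acids into levels and averaging quality in each cell. The sketch below does this on the first ten rows from the `str()` output above. The cut points (0.4 and 0.7 for volatile acidity, 0.1 and 0.4 for citric acid) are hypothetical values chosen only to make the example run, and ten rows cannot confirm or refute the pattern.

```r
# First ten rows from the str() output above.
first_ten <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical cut points; pick real ones from the histograms, not this sketch.
va_level <- cut(first_ten$volatile.acidity, breaks = c(0, 0.4, 0.7, Inf),
                labels = c('low', 'medium', 'high'))
ca_level <- cut(first_ten$citric.acid, breaks = c(-Inf, 0.1, 0.4, Inf),
                labels = c('low', 'medium', 'high'))

# Mean quality in each (volatile acidity, citric acid) cell; NA marks empty cells.
mean_quality <- tapply(first_ten$quality,
                       list(volatile = va_level, citric = ca_level), mean)
mean_quality
```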

I will explore the relationship of fixed.acidity and density on the basis of quality to see if I can see anything


In [122]:
create_multi_plot('density', 'fixed.acidity', 0.1)


Looking at this nothing actionable seems to come out.

Let me try citric acid and residual sugar. Both of them change taste


In [128]:
create_multi_plot('citric.acid', 'residual.sugar', 0.1)


Again looking at this nothing actionable seems to come out.

Let me try citric acid and free sulphur dioxide. They say older wine is better. I wonder how the components that add freshness and prevent microbial growth affect the quality


In [130]:
create_multi_plot('citric.acid', 'free.sulfur.dioxide', 0.1)


There doesn't seem to be anything interesting here.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • Good-quality wine seems to have one of citric acid and volatile acidity in medium quantity while the other is low or negligible, but not both
  • More acidic wine seems to have higher quality. In particular, acidity that does not evaporate matters, as shown by the positive correlation (0.124) of fixed.acidity and the negative correlation (-0.391) of volatile.acidity with quality. The pH itself didn't really matter, as shown by its very small correlation with quality (-0.058): these two components affected quality in opposite directions, so overall acidity did not matter

Were there any interesting or surprising interactions between features?

  • citric.acid and fixed.acidity seem to have a quadratic relationship between them
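Whether a curved fit is really better than a straight line can be checked by fitting both and comparing them. The sketch below shows the mechanics on synthetic data with a built-in quadratic term; the data, the coefficients, and the use of AIC are all assumptions for illustration and say nothing about the actual citric.acid and fixed.acidity relationship.

```r
set.seed(7)
# Synthetic curved relationship (not the wine data).
x <- runif(200, 0, 1)
y <- 7 + 2 * x + 4 * x^2 + rnorm(200, sd = 0.5)

linear_fit    <- lm(y ~ x)
quadratic_fit <- lm(y ~ poly(x, 2))

# Lower AIC suggests a better trade-off between fit and complexity.
AIC(linear_fit, quadratic_fit)

# F-test for whether the quadratic term improves on the linear model.
anova(linear_fit, quadratic_fit)
```

With the real data the two models would be `lm(fixed.acidity ~ citric.acid, data = redWineData)` and the same formula with `poly(citric.acid, 2)`.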

Final Plots and Summary

Plot One


In [30]:
create_final_plot <- function(variable, ylabel = 'ylabel', title = 'title', ylim = c(0, 1)) {
    # Labelled box plot of `variable` per quality rating; an x marks the group mean.
    return(ggplot(aes_string(x = 'quality', y = variable, group = 'quality'), data = redWineData) +
               geom_boxplot() +
               xlab('Quality rating of Wine (0 to 10)') +
               ylab(ylabel) +
               labs(title = title) +
               # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
               stat_summary(fun = mean, geom = 'point', shape = 4) +
               # only ratings 3 to 8 occur in the data
               scale_x_continuous(breaks = 3:8) +
               coord_cartesian(ylim = ylim)
          )
    }

In [32]:
create_final_plot(
    'volatile.acidity', 
    'acetic acid (g / dm^3)', 
    'Effect of Acetic Acid on Quality of Wine', 
    c(0.1, 1.2)
)


This plot shows the relationship between acetic acid and the quality of wine. It struck me because the correlation was clear to the eye: higher-quality wine had less acetic acid on average.


In [33]:
create_final_plot(
    'citric.acid', 
    'citric acid (g / dm^3)', 
    'Effect of citric acid on Quality of Wine', 
    c(0, 0.8)
)


This plot shows the relationship between citric acid and the quality of wine. It struck me because the correlation was clear to the eye: higher-quality wine had more citric acid on average.


In [132]:
create_multi_plot('volatile.acidity', 'citric.acid', 0.08) +
    labs(title = "Acetic Acid and citric acid's effect on Wine") + 
    xlab('Acetic acid (g / dm^3)') +
    ylab('citric acid (g / dm^3)')


It struck me that good-quality wine seems to have one of citric acid and acetic acid in medium quantity while the other is low or negligible.

Reflection

While doing this project I first looked at the distributions of single variables. Not much came out of them except that most of them were roughly normal. I was thinking I wasn't going to get anything out of this dataset.

But when I used the bivariate analysis some insights about the correlation came out that can be very interesting. Looking at the various plots and correlations I drew the following general conclusions

  • More acidic wine seems to have higher quality. In particular, acidity that does not evaporate matters, as shown by the positive correlation (0.124) of fixed.acidity and the negative correlation (-0.391) of volatile.acidity with quality. The pH itself didn't really matter, as shown by its very small correlation with quality (-0.058): these two components affected quality in opposite directions, so overall acidity did not matter. Based on this, if I take this exploration further and build a model, I will consider the following while choosing features
    • more alcohol means better wine
    • acidity that stays matters. citric.acid seems to help this a lot while sulphates do minor contributions
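As a starting point for such a model, the sketch below fits a linear regression with the candidate features named in this report. It runs on synthetic data whose coefficients are invented for illustration, so the printed estimates say nothing about real wine; the point is only the shape of the `lm` call.

```r
set.seed(1)
n <- 300
# Synthetic stand-in for redWineData; all coefficients below are invented.
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.0),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  sulphates        = rnorm(n, mean = 0.66, sd = 0.17),
  citric.acid      = runif(n, min = 0, max = 0.8)
)
toy$quality <- 2 + 0.3 * toy$alcohol - 1.0 * toy$volatile.acidity +
  0.8 * toy$sulphates + rnorm(n, sd = 0.6)

# Linear model with the four candidate features.
fit <- lm(quality ~ alcohol + volatile.acidity + sulphates + citric.acid,
          data = toy)
coef(fit)
```

With the real data the same formula would be passed with `data = redWineData`. Since quality is an integer rating, an ordinal model may be a better fit than plain linear regression.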

But the multivariate analysis was a challenge. I was not really sure what kind of relationships I could explore. The ggpairs plot prepared earlier gave me some direction, but not much, so the multivariate analysis ended up weak. To improve it, I could check the original paper that did this wine analysis, as suggested somewhere in the Udacity forums, to see whether there really are no relationships or whether I just did not have enough imagination. That would be against the Honour code, so I will do it after the project is finalized.