For the final project, you will conduct your own exploratory data analysis and create an RMD file that explores the variables, structure, patterns, oddities, and underlying relationships of a data set of your choice.

The analysis should be almost like a stream-of-consciousness as you ask questions, create visualizations, and explore your data.

This project is open-ended in that we are not looking for one right answer. As John Tukey stated, "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." We want you to ask interesting questions about data and give you a chance to explore. We will provide some options of data sets to explore; however, you may choose to explore an entirely different data set. You should be aware that finding your own data set and cleaning that data set into a form that can be read into R can take considerable time and effort. This can add as much as a day, a week, or even months to your project, so only venture to find and clean a data set if you are truly prepared with programming and data wrangling skills.

Red Wine Quality

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Red Wine Quality Exploration by Aseem Bansal

In this report we explore the chemical properties of red wine and try to find which properties affect the quality of the wine. This data set contains data for 1599 red wines with 11 chemical variables plus a quality rating.



In [1]:

    
library(ggplot2)
library(GGally)



In [2]:

    
redWineData <- read.csv('wineQualityReds.csv')

Let's start by looking at all columns of the data to get a general sense of what the data looks like.



In [3]:

    
# Drop the X column, which appears to be a row index rather than a wine property
redWineData$X <- NULL

redWineData









    





fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality

	 7.4  0.700 0.00  1.9   0.076 11     34   0.9978 3.51  0.56   9.4  5     
	 7.8  0.880 0.00  2.6   0.098 25     67   0.9968 3.20  0.68   9.8  5     
	 7.8  0.760 0.04  2.3   0.092 15     54   0.9970 3.26  0.65   9.8  5     
	11.2  0.280 0.56  1.9   0.075 17     60   0.9980 3.16  0.58   9.8  6     
	 7.4  0.700 0.00  1.9   0.076 11     34   0.9978 3.51  0.56   9.4  5     
	 7.4  0.660 0.00  1.8   0.075 13     40   0.9978 3.51  0.56   9.4  5     
	 7.9  0.600 0.06  1.6   0.069 15     59   0.9964 3.30  0.46   9.4  5     
	 7.3  0.650 0.00  1.2   0.065 15     21   0.9946 3.39  0.47  10.0  7     
	 7.8  0.580 0.02  2.0   0.073  9     18   0.9968 3.36  0.57   9.5  7     
	 7.5  0.500 0.36  6.1   0.071 17    102   0.9978 3.35  0.80  10.5  5     
	 6.7  0.580 0.08  1.8   0.097 15     65   0.9959 3.28  0.54   9.2  5     
	 7.5  0.500 0.36  6.1   0.071 17    102   0.9978 3.35  0.80  10.5  5     
	 5.6  0.615 0.00  1.6   0.089 16     59   0.9943 3.58  0.52   9.9  5     
	 7.8  0.610 0.29  1.6   0.114  9     29   0.9974 3.26  1.56   9.1  5     
	 8.9  0.620 0.18  3.8   0.176 52    145   0.9986 3.16  0.88   9.2  5     
	 8.9  0.620 0.19  3.9   0.170 51    148   0.9986 3.17  0.93   9.2  5     
	 8.5  0.280 0.56  1.8   0.092 35    103   0.9969 3.30  0.75  10.5  7     
	 8.1  0.560 0.28  1.7   0.368 16     56   0.9968 3.11  1.28   9.3  5     
	 7.4  0.590 0.08  4.4   0.086  6     29   0.9974 3.38  0.50   9.0  4     
	 7.9  0.320 0.51  1.8   0.341 17     56   0.9969 3.04  1.08   9.2  6     
	 8.9  0.220 0.48  1.8   0.077 29     60   0.9968 3.39  0.53   9.4  6     
	 7.6  0.390 0.31  2.3   0.082 23     71   0.9982 3.52  0.65   9.7  5     
	 7.9  0.430 0.21  1.6   0.106 10     37   0.9966 3.17  0.91   9.5  5     
	 8.5  0.490 0.11  2.3   0.084  9     67   0.9968 3.17  0.53   9.4  5     
	 6.9  0.400 0.14  2.4   0.085 21     40   0.9968 3.43  0.63   9.7  6     
	 6.3  0.390 0.16  1.4   0.080 11     23   0.9955 3.34  0.56   9.3  5     
	 7.6  0.410 0.24  1.8   0.080  4     11   0.9962 3.28  0.59   9.5  5     
	 7.9  0.430 0.21  1.6   0.106 10     37   0.9966 3.17  0.91   9.5  5     
	 7.1  0.710 0.00  1.9   0.080 14     35   0.9972 3.47  0.55   9.4  5     
	 7.8  0.645 0.00  2.0   0.082  8     16   0.9964 3.38  0.59   9.8  6     
	⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
	6.2    0.510  0.14    1.9   0.056  15      34    0.99396 3.48   0.57   11.5   6      
	6.4    0.360  0.53    2.2   0.230  19      35    0.99340 3.37   0.93   12.4   6      
	6.4    0.380  0.14    2.2   0.038  15      25    0.99514 3.44   0.65   11.1   6      
	7.3    0.690  0.32    2.2   0.069  35     104    0.99632 3.33   0.51    9.5   5      
	6.0    0.580  0.20    2.4   0.075  15      50    0.99467 3.58   0.67   12.5   6      
	5.6    0.310  0.78   13.9   0.074  23      92    0.99677 3.39   0.48   10.5   6      
	7.5    0.520  0.40    2.2   0.060  12      20    0.99474 3.26   0.64   11.8   6      
	8.0    0.300  0.63    1.6   0.081  16      29    0.99588 3.30   0.78   10.8   6      
	6.2    0.700  0.15    5.1   0.076  13      27    0.99622 3.54   0.60   11.9   6      
	6.8    0.670  0.15    1.8   0.118  13      20    0.99540 3.42   0.67   11.3   6      
	6.2    0.560  0.09    1.7   0.053  24      32    0.99402 3.54   0.60   11.3   5      
	7.4    0.350  0.33    2.4   0.068   9      26    0.99470 3.36   0.60   11.9   6      
	6.2    0.560  0.09    1.7   0.053  24      32    0.99402 3.54   0.60   11.3   5      
	6.1    0.715  0.10    2.6   0.053  13      27    0.99362 3.57   0.50   11.9   5      
	6.2    0.460  0.29    2.1   0.074  32      98    0.99578 3.33   0.62    9.8   5      
	6.7    0.320  0.44    2.4   0.061  24      34    0.99484 3.29   0.80   11.6   7      
	7.2    0.390  0.44    2.6   0.066  22      48    0.99494 3.30   0.84   11.5   6      
	7.5    0.310  0.41    2.4   0.065  34      60    0.99492 3.34   0.85   11.4   6      
	5.8    0.610  0.11    1.8   0.066  18      28    0.99483 3.55   0.66   10.9   6      
	7.2    0.660  0.33    2.5   0.068  34     102    0.99414 3.27   0.78   12.8   6      
	6.6    0.725  0.20    7.8   0.073  29      79    0.99770 3.29   0.54    9.2   5      
	6.3    0.550  0.15    1.8   0.077  26      35    0.99314 3.32   0.82   11.6   6      
	5.4    0.740  0.09    1.7   0.089  16      26    0.99402 3.67   0.56   11.6   6      
	6.3    0.510  0.13    2.3   0.076  29      40    0.99574 3.42   0.75   11.0   6      
	6.8    0.620  0.08    1.9   0.068  28      38    0.99651 3.42   0.82    9.5   6      
	6.2    0.600  0.08    2.0   0.090  32      44    0.99490 3.45   0.58   10.5   5      
	5.9    0.550  0.10    2.2   0.062  39      51    0.99512 3.52   0.76   11.2   6      
	6.3    0.510  0.13    2.3   0.076  29      40    0.99574 3.42   0.75   11.0   6      
	5.9    0.645  0.12    2.0   0.075  32      44    0.99547 3.57   0.71   10.2   5      
	6.0    0.310  0.47    3.6   0.067  18      42    0.99549 3.39   0.66   11.0   6



In [4]:

    
str(redWineData)









    



'data.frame':	1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...



In [5]:

    
names(redWineData)









    





	'fixed.acidity'
	'volatile.acidity'
	'citric.acid'
	'residual.sugar'
	'chlorides'
	'free.sulfur.dioxide'
	'total.sulfur.dioxide'
	'density'
	'pH'
	'sulphates'
	'alcohol'
	'quality'



In [6]:

    
summary(redWineData)









    





 fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
   chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
 Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
 1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
 Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
 Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
 3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
 Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
       pH          sulphates         alcohol         quality     
 Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
 1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
 Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
 Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
 3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
 Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Now let's start by creating some plots. I will plot each variable once to get a sense of all of them.

Histograms of various variables



In [5]:

    
create_plot <- function(variable, binwidth = 0.01) {
    return(ggplot(aes_string(x = variable), data = redWineData) +
               geom_histogram(binwidth = binwidth))
    }



In [6]:

    
create_plot('fixed.acidity', 0.25)

The fixed acidity peaks at around 8. It appears to be roughly normally distributed.



In [7]:

    
create_plot('volatile.acidity', 0.05)

Volatile acidity has a plateau in the range 0.3 to 0.6. It looks somewhat like a normal distribution in this plot. If I change the bin width, it may look more like a normal distribution.
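As a rough sketch of what changing the bin width does, the counts per bin can be compared without drawing anything. This uses only the volatile acidity values of the first 30 wines printed earlier, so it is an illustration and not the full distribution. The helper name bin_counts and its from/to limits are my own assumptions.

```r
# Volatile acidity for the first 30 wines printed earlier in this report.
volatile_acidity <- c(0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650,
                      0.580, 0.500, 0.580, 0.500, 0.615, 0.610, 0.620, 0.620,
                      0.280, 0.560, 0.590, 0.320, 0.220, 0.390, 0.430, 0.490,
                      0.400, 0.390, 0.410, 0.430, 0.710, 0.645)

# Hypothetical helper: count the values that fall in each bin of a given width.
# The from/to limits are assumptions chosen to cover the 30 values above.
bin_counts <- function(x, binwidth, from = 0.2, to = 0.9) {
  breaks <- seq(from, to, by = binwidth)
  table(cut(x, breaks = breaks))
}

narrow <- bin_counts(volatile_acidity, 0.05)  # same bin width as the plot above
wide <- bin_counts(volatile_acidity, 0.10)    # twice as wide, so fewer bins

print(narrow)
print(wide)
```

Wider bins smooth out small dips and bumps, which is why a plateau at one bin width can look like a single peak at another.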



In [11]:

    
create_plot('citric.acid', 0.04)

citric.acid has 2 peaks at 0 and 0.5



In [14]:

    
create_plot('residual.sugar', 0.5)

This plot shows little about the tail. I will have to change the scale of the x-axis to see it.



In [20]:

    
create_plot('residual.sugar', 0.05) + 
    scale_x_log10()

On a log scale the result is similar to the other fields: residual.sugar looks normally distributed.



In [24]:

    
create_plot('chlorides', 0.05)

create_plot('chlorides', 0.05) + 
    scale_x_log10()

On the log scale, chlorides appear to be normally distributed.



In [26]:

    
create_plot('free.sulfur.dioxide', 2.5)

free.sulfur.dioxide has a peak near 5



In [48]:

    
create_plot('total.sulfur.dioxide', 2)

total.sulfur.dioxide peaks near 10.



In [42]:

    
create_plot('density', 0.0005)

Again a normal distribution



In [50]:

    
create_plot('pH', 0.04)

Again a normal distribution



In [52]:

    
create_plot('sulphates', 0.04)



In [57]:

    
create_plot('alcohol', 0.2)

This peaks around 9



In [62]:

    
create_plot('quality', 1)

Univariate Analysis

What is the structure of your dataset?

The data set has 1599 rows, one per wine.
11 numerical variables are present. Most of them have a roughly normal distribution; citric.acid is bimodal.
1 integer variable (quality) is present.
Distribution of data
- normal distribution - fixed.acidity, volatile.acidity, residual.sugar (on a log scale), chlorides (on a log scale), density, pH, sulphates, quality
- not normal distribution - citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, alcohol

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. We need to find which features affect it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the data so far it is hard to say which variables will be useful, but from the description of the variables I would consider the following:

volatile.acidity: too high of levels can lead to an unpleasant, vinegar taste
citric.acid: can add 'freshness' and flavor to wines
free.sulfur.dioxide: prevents microbial growth and the oxidation of wine
alcohol: the percent alcohol content of the wine

Did you create any new variables from existing variables in the dataset?

No, I did not create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

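All of the data is numerical, so no tidying was needed. Most distributions looked roughly normal. The exceptions were citric.acid, which was bimodal, and residual.sugar and chlorides, which only looked normal after I put the x-axis on a log10 scale to see their long tails.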
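One rough way to cross-check the histograms is to compare the mean with the median for each variable, since a long right tail pulls the mean above the median. The sketch below uses only numbers copied from the summary() output earlier in this report. The mean_median_gap indicator is a heuristic of my own and not a formal normality test.

```r
# Quartiles, medians, and means copied from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  q1     = c(7.10, 0.39, 0.090, 1.9, 0.070, 7, 22, 0.9956, 3.21, 0.55, 9.5),
  median = c(7.90, 0.52, 0.260, 2.2, 0.079, 14, 38, 0.9968, 3.31, 0.62, 10.2),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47, 0.9967,
             3.311, 0.6581, 10.42),
  q3     = c(9.20, 0.64, 0.420, 2.6, 0.090, 21, 62, 0.9978, 3.40, 0.73, 11.1),
  stringsAsFactors = FALSE
)

# Hypothetical indicator: distance of the mean from the median, in units of
# the interquartile range. Larger positive values hint at a longer right tail.
stats$mean_median_gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Sort so that the most right-skewed candidates come first.
ranked <- stats[order(-stats$mean_median_gap), c("variable", "mean_median_gap")]
print(ranked)
```

By this measure residual.sugar and chlorides come out on top, which matches the two variables whose histograms needed a log scale.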

Bivariate Plots Section

Let me make plots of all the variables versus quality to see what kind of correlation appears.

Box Plots of various variables vs. quality



In [15]:

    
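create_plot_box <- function(variable, ylim = c(0, 1)) {
    # Box plot of one variable per quality rating; the x marks the group mean.
    # `fun` replaces the `fun.y` argument that ggplot2 3.3.0 deprecated.
    return(ggplot(aes_string(x = 'quality', y = variable, group = 'quality'), 
                  data = redWineData) +
           geom_boxplot() +
           stat_summary(fun = mean, geom = 'point', shape = 4) +
           coord_cartesian(ylim = ylim))
    }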



In [18]:

    
create_plot_box('fixed.acidity', c(4.5, 13))

There does not seem to be much of a relationship between quality and fixed.acidity. Maybe there is a weak one, with fixed.acidity slightly higher for higher quality wine.



In [68]:

    
create_plot_box('volatile.acidity', c(0.1, 1.2))

Now this one is a very pronounced effect. It can be clearly seen that as quality increases volatile.acidity decreases



In [69]:

    
create_plot_box('citric.acid', c(0, 0.8))

The relationship between quality and citric.acid is direct and easily visible. Higher quality has higher citric.acid



In [70]:

    
create_plot_box('residual.sugar', c(1, 4.5))

There does not seem to be any relationship



In [73]:

    
create_plot_box('chlorides', c(0.04, 0.2))

Maybe a weak relationship that higher quality has lower chlorides



In [74]:

    
create_plot_box('total.sulfur.dioxide', c(0, 170))

The relationship looks bell-shaped: medium quality wines have higher total sulphur dioxide.



In [75]:

    
create_plot_box('free.sulfur.dioxide', c(0, 45))

The relationship looks bell-shaped: medium quality wines have higher free sulphur dioxide.



In [77]:

    
create_plot_box('density', c(0.991, 1.002))

Higher quality wines have lower density.



In [78]:

    
create_plot_box('pH', c(2.9, 3.8))

Higher quality wines have lower pH.



In [79]:

    
create_plot_box('sulphates', c(0.3, 1.05))

Higher quality wine has higher sulphates



In [80]:

    
create_plot_box('alcohol', c(8, 14))

Higher quality wines have higher alcohol.

Now let's make a grid to see the relationships between all variables to see whether I missed anything.



In [229]:

    
theme_set(theme_minimal(20))

set.seed(1836)
ggpairs(redWineData, axisLabels = 'internal')

Let's check the correlations between the variables directly, as the grid above isn't that clear and I want to see the values.



In [14]:

    
cor(redWineData)









    





fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality

	fixed.acidity  1.00000000 -0.256130895  0.67170343  0.114776724  0.093705186 -0.153794193 -0.11318144  0.66804729 -0.68297819  0.183005664 -0.06166827  0.12405165 
	volatile.acidity -0.25613089  1.000000000 -0.55249568  0.001917882  0.061297772 -0.010503827  0.07647000  0.02202623  0.23493729 -0.260986685 -0.20228803 -0.39055778 
	citric.acid  0.67170343 -0.552495685  1.00000000  0.143577162  0.203822914 -0.060978129  0.03553302  0.36494718 -0.54190414  0.312770044  0.10990325  0.22637251 
	residual.sugar  0.11477672  0.001917882  0.14357716  1.000000000  0.055609535  0.187048995  0.20302788  0.35528337 -0.08565242  0.005527121  0.04207544  0.01373164 
	chlorides  0.09370519  0.061297772  0.20382291  0.055609535  1.000000000  0.005562147  0.04740047  0.20063233 -0.26502613  0.371260481 -0.22114054 -0.12890656 
	free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813  0.187048995  0.005562147  1.000000000  0.66766645 -0.02194583  0.07037750  0.051657572 -0.06940835 -0.05065606 
	total.sulfur.dioxide -0.11318144  0.076470005  0.03553302  0.203027882  0.047400468  0.667666450  1.00000000  0.07126948 -0.06649456  0.042946836 -0.20565394 -0.18510029 
	density  0.66804729  0.022026232  0.36494718  0.355283371  0.200632327 -0.021945831  0.07126948  1.00000000 -0.34169933  0.148506412 -0.49617977 -0.17491923 
	pH -0.68297819  0.234937294 -0.54190414 -0.085652422 -0.265026131  0.070377499 -0.06649456 -0.34169933  1.00000000 -0.196647602  0.20563251 -0.05773139 
	sulphates  0.18300566 -0.260986685  0.31277004  0.005527121  0.371260481  0.051657572  0.04294684  0.14850641 -0.19664760  1.000000000  0.09359475  0.25139708 
	alcohol -0.06166827 -0.202288027  0.10990325  0.042075437 -0.221140545 -0.069408354 -0.20565394 -0.49617977  0.20563251  0.093594750  1.00000000  0.47616632 
	quality  0.12405165 -0.390557780  0.22637251  0.013731637 -0.128906560 -0.050656057 -0.18510029 -0.17491923 -0.05773139  0.251397079  0.47616632  1.00000000

The plot and table above show that there are various other relationships that can be explored. I am taking any correlation with magnitude above 0.4 as worth exploring. Examples would be

fixed.acidity and pH had correlation of -0.683
citric.acid and pH had correlation of -0.542
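The 0.4 cut-off can also be applied in code instead of reading the table by eye. This is a minimal sketch: the helper name strong_pairs is hypothetical, and the small matrix holds only four variables with values rounded from the table above. On the full data one would pass cor(redWineData) instead.

```r
# Hypothetical helper: list variable pairs whose correlation magnitude exceeds
# a threshold, strongest first. Only the upper triangle is scanned so that
# each pair appears once and the diagonal is skipped.
strong_pairs <- function(cor_mat, threshold = 0.4) {
  idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Four-variable excerpt of the correlation table, rounded to four decimals.
vars <- c("fixed.acidity", "citric.acid", "pH", "residual.sugar")
cor_excerpt <- matrix(c(
   1.0000,  0.6717, -0.6830,  0.1148,
   0.6717,  1.0000, -0.5419,  0.1436,
  -0.6830, -0.5419,  1.0000, -0.0857,
   0.1148,  0.1436, -0.0857,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

strong <- strong_pairs(cor_excerpt, threshold = 0.4)
print(strong)
```

Filtering on the absolute value matters here, because the two example pairs above are both negative correlations.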



In [61]:

    
create_scatter_plot <- function(x, y, alpha) {
    return(ggplot(aes_string(x = x, y =  y), data = redWineData) +
               geom_point(alpha = alpha))
}

Let's make a plot of citric acid and fixed acidity that has a correlation of 0.67170343



In [63]:

    
create_scatter_plot('citric.acid', 'fixed.acidity', 1/5)

As citric acid increases, fixed acidity also increases. But we can see that this is not a direct relationship. Even where citric acid is 0 there is fixed acidity present, so some other factor must contribute to it.

Let's make a plot of density and fixed acidity which has a correlation of 0.66804729



In [68]:

    
create_scatter_plot('density', 'fixed.acidity', 1/5)

We can see that as density increases fixed acidity also increases.

Let's make a plot of fixed acidity and pH which has a correlation of -0.68297819



In [69]:

    
create_scatter_plot('fixed.acidity', 'pH', 1/5)

As fixed acidity increases pH decreases

Let's make a plot of citric acid and volatile acidity which has a correlation of -0.552495685



In [70]:

    
create_scatter_plot('citric.acid', 'volatile.acidity', 1/5)

As citric acid increases volatile acidity decreases

Let's make a plot of citric acid and pH which has correlation of -0.54190414



In [71]:

    
create_scatter_plot('citric.acid', 'pH', 1/5)

As citric acid increases, the pH decreases.

Let's make a plot between free sulphur dioxide and total sulphur dioxide which has a correlation of 0.667666450



In [72]:

    
create_scatter_plot('free.sulfur.dioxide', 'total.sulfur.dioxide', 1/5)

Out of all the scatter plots that I have seen so far, this one interests me the most. The lower edge of the points forms a straight line, while all the other plots so far were more spread out. Maybe the total sulphur dioxide is always at least the amount of free sulphur dioxide.
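The guess that total sulphur dioxide is never below free sulphur dioxide can be checked directly. The sketch below uses only the first ten wines, with values copied from the str() output earlier, so it shows the check and does not prove the bound for all 1599 rows. It would make chemical sense, since the free form is one part of the total.

```r
# First ten wines, values copied from the str() output earlier in this report.
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Rows that would break the suspected bound (total below free).
violations <- so2[so2$total.sulfur.dioxide < so2$free.sulfur.dioxide, ]
bound_holds <- nrow(violations) == 0

print(bound_holds)
```

On the full data the same two lines could be run against redWineData. Any rows left in violations would be worth a closer look.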

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Weak or No Relationships

These do not seem to have a direct relationship with quality

fixed.acidity
residual.sugar
chlorides: The higher the quality of the wine, the lower the mean chlorides. But the relationship did not seem strong, as the decrease was small

Medium or Strong Relationships

These seem to be related with quality

volatile.acidity: The higher the quality of the wine, the lower the mean volatile.acidity
citric.acid: The higher the quality of the wine, the higher the mean citric.acid
free.sulfur.dioxide: Good quality wines seemed to have medium quantity of this chemical
total.sulfur.dioxide: Good quality wines seemed to have medium quantity of this chemical
density: The mean density seems to decrease as quality increases
pH: The mean pH seems to decrease as quality increases

From the chosen 4 initial variables

volatile.acidity
citric.acid
free.sulfur.dioxide
alcohol

all except free.sulfur.dioxide seemed to have a strong relationship with the quality of the wine. sulphates was not among my initial picks, but it also had a strong relationship with quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Many variables had correlation magnitude greater than 0.5

fixed.acidity and pH had correlation of -0.683
citric.acid and pH had correlation of -0.542
citric.acid and fixed.acidity had correlation of 0.672
density and fixed.acidity had correlation of 0.668
citric.acid and volatile.acidity had correlation of -0.552
total.sulfur.dioxide and free.sulfur.dioxide had correlation of 0.668

But no variable was correlated with quality that strongly. The largest correlations with quality were

alcohol 0.476
volatile.acidity -0.391
sulphates 0.251
citric.acid 0.226
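The ranking above can be reproduced by sorting the quality row of the correlation table by magnitude. This sketch copies the values from that row by hand. With the full data frame loaded, cor(redWineData)[, 'quality'] should give the same numbers, plus the 1 for quality itself.

```r
# Correlation of each variable with quality, copied from the table above.
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by absolute value so strong negative correlations rank alongside
# strong positive ones.
ranked <- quality_cor[order(-abs(quality_cor))]
print(round(ranked, 3))
```

Sorting by magnitude keeps volatile.acidity in second place even though its correlation is negative.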

Total sulphur dioxide seems to be bounded below by free sulphur dioxide.

What was the strongest relationship you found?

Looking at the graphs and the correlations

volatile.acidity with quality
citric.acid with quality
sulphates with quality
alcohol with quality

Multivariate Plots Section

Here I will explore whether variables that don't have a high correlation with quality by themselves show some kind of trend with quality when taken together.



In [116]:

    
create_multi_plot <- function(x, y, alpha) {
    return(ggplot(aes_string(x = x, y =  y), 
                  data = redWineData) +
           facet_wrap(~quality) +
           geom_point(alpha = alpha) +
           theme(axis.text.x = element_text(angle = 90)))
}

I will start with acidity, as it seems to have some relationship with quality. But what about the different types of acidity taken together?



In [113]:

    
create_multi_plot('volatile.acidity', 'fixed.acidity', 0.1)

The graphs suggest that higher quality wines have lower volatile acidity, with fixed acidity concentrated between 7 and 12.

Now I will see the combined effect of volatile acidity and citric acid on quality. Both of them are related to the taste of wine so trying to find out what kind of taste makes better wine



In [114]:

    
create_multi_plot('volatile.acidity', 'citric.acid', 0.08)

From this it looks like lower quality wine has a more vinegar-like taste. As the quality increases, there seem to be 2 types of taste that are good, as seen in 2 seemingly different groups of points. These are most apparent in wines rated 7 but also slightly visible in wines rated 6.

Low or negligible citric acid with low volatile acidity.
Medium citric acid with low or negligible volatile acidity.

So for good quality wine, maybe we need one of citric acid and volatile acidity in medium quantity while the other should be low or negligible, but not both in medium quantity.

I will explore the relationship of fixed.acidity and density on the basis of quality to see if I can see anything



In [122]:

    
create_multi_plot('density', 'fixed.acidity', 0.1)

Looking at this nothing actionable seems to come out.

Let me try citric acid and residual sugar. Both of them change taste



In [128]:

    
create_multi_plot('citric.acid', 'residual.sugar', 0.1)

Again looking at this nothing actionable seems to come out.

Let me try citric acid and free sulphur dioxide. They say older wine is better. I wonder how the components that add freshness and prevent microbial growth affect the quality



In [130]:

    
create_multi_plot('citric.acid', 'free.sulfur.dioxide', 0.1)

There doesn't seem to be anything interesting here.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Good quality wine seems to have one of citric acid and volatile acidity in medium quantity while the other should be low or negligible, but not both in medium quantity.
Acidity seems to matter for quality, particularly the acidity that does not evaporate. This was shown by the positive correlation (0.124) of fixed.acidity and the negative correlation (-0.391) of volatile.acidity with quality. The pH itself hardly mattered, as shown by its very small correlation with quality (-0.0577): these two components affected quality in opposite directions, so overall acidity did not matter.

Were there any interesting or surprising interactions between features?

citric.acid and fixed.acidity seem to have a quadratic relationship between them

Final Plots and Summary

Plot One



In [30]:

    
create_final_plot <- function(variable, ylabel = 'ylabel', title = 'title', ylim = c(0, 1)) {
    # Labelled box plot of one variable per quality rating; the x marks the group mean.
    # `fun` replaces the `fun.y` argument that ggplot2 3.3.0 deprecated.
    return(ggplot(aes_string(x = 'quality', y = variable, group = 'quality'), data = redWineData) +
               geom_boxplot() +
               xlab('Quality rating of Wine (0 to 10)') +
               ylab(ylabel) +
               labs(title = title) +
               stat_summary(fun = mean, geom = 'point', shape = 4) +
               scale_x_continuous(breaks = 3:8) +
               coord_cartesian(ylim = ylim)
          )
    }



In [32]:

    
create_final_plot(
    'volatile.acidity', 
    'acetic acid (g / dm^3)', 
    'Effect of Acetic Acid on Quality of Wine', 
    c(0.1, 1.2)
)

This plot shows the relationship between acetic acid and quality of wine. It struck me because the trend was clear to the eye. Higher quality wine had lower acetic acid on average.



In [33]:

    
create_final_plot(
    'citric.acid', 
    'citric acid (g / dm^3)', 
    'Effect of citric acid on Quality of Wine', 
    c(0, 0.8)
)

This plot shows the relationship between citric acid and quality of wine. It struck me because the trend was clear to the eye. Higher quality wine had higher citric acid on average.



In [132]:

    
create_multi_plot('volatile.acidity', 'citric.acid', 0.08) +
    labs(title = "Acetic Acid and citric acid's effect on Wine") + 
    xlab('Acetic acid - g / dm^3') +
    ylab('citric acid (g / dm^3)')

This struck me because good quality wine seems to have one of citric acid and acetic acid in medium quantity while the other should be low or negligible.

Reflection

While doing this project I first looked at the distributions of single variables. Not much came out of them, except that most of them were roughly normal. I was thinking I wasn't going to get anything out of this dataset.

But the bivariate analysis produced some interesting insights about the correlations. Looking at the various plots and correlations I drew the following general conclusions.

Acidity seems to matter for quality, particularly the acidity that does not evaporate. This was shown by the positive correlation (0.124) of fixed.acidity and the negative correlation (-0.391) of volatile.acidity with quality. The pH itself hardly mattered, as shown by its very small correlation with quality (-0.0577): these two components affected quality in opposite directions, so overall acidity did not matter. Based on this, if I want to take this exploration further and build a model, I will consider the following while choosing features for my model
- more alcohol means better wine
- acidity that stays matters. citric.acid seems to help this a lot while sulphates make a minor contribution

The multivariate analysis was a challenge. I was not really sure what kind of relationships I could explore. The ggpairs plot prepared earlier gave me some direction but not much, so the multivariate analysis came out weak. To improve it I could check the original paper that did this wine analysis, as suggested somewhere in the Udacity forums, to see whether there really are no relationships or whether I just did not have enough imagination. But that would be against the Honour code, so I will do that after the project is finalized.
