Below are the results from our wine tasting. It had 55 participants and 30 wines selected randomly from the shelves at Trader Joes's.


In [1]:
%matplotlib inline
import pylab as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy.stats
import datetime as dt
import random
from IPython.display import display

In [2]:
sns.set_context("notebook", font_scale=2)

In [3]:
raw_data = pd.read_csv('/Users/tom/Downloads/Wine Tasting Data - Sheet1.csv')

In [96]:
raw_data['tasting_order'] = raw_data.groupby('person_name').cumcount() + 1

In [97]:
wine_mapping = pd.read_csv('/Users/tom/Downloads/Wine Mapping - Sheet1.csv')

In [98]:
def make_display_name(wines):
    if len(wines.variety.unique()) == 1:
        res = wines.wine_name.values
    else:
        res = wines.wine_name.values + ' - ' + wines.variety.values
    return pd.Series(res, index=wines.number, name='wine_display_name')
display_names = wine_mapping.groupby('wine_name').apply(make_display_name)
#wine_mapping.merge(wine_mapping.groupby('wine_name').apply(make_display_name), left_on='number', right_index=True)

In [99]:
display_names = display_names.reset_index()
del display_names['wine_name']

In [100]:
wine_mapping = wine_mapping.merge(display_names, on='number')

In [101]:
data = raw_data.merge(wine_mapping, left_on='wine_number', right_on='number')

Counts of Wine Varieties


In [76]:
wine_mapping.variety.value_counts().plot(kind='barh', title='histogram of wine varieties')
plt.xlabel('count')


Out[76]:
<matplotlib.text.Text at 0x10e885110>

Best and worst varieties of wine


In [73]:
data.groupby('variety').score.mean().sort_values(ascending=False).plot(kind='barh', figsize=(10,6), title='Score by variety of wine')
plt.xlabel('average score')


Out[73]:
<matplotlib.text.Text at 0x10ce6f310>

Best and worst wines


In [70]:
sorted_wines = data.groupby('wine_display_name').score.mean().sort_values(ascending=False)

In [72]:
sorted_wines.plot(kind='barh', figsize=(16,10), title='Average score for each wine')
plt.xlabel('average score')


Out[72]:
<matplotlib.text.Text at 0x10cc95210>

Red vs. white


In [59]:
data['type'] = data['type'].str.strip()

In [64]:
red_vs_white = pd.DataFrame(data.groupby('type').score.mean().sort_values())
red_vs_white.columns = ['average score']
red_vs_white


Out[64]:
average score
type
red 2.805085
white 2.806557

Who liked the wines the most?


In [81]:
data.groupby('person_name').score.mean().sort_values(ascending=False).plot(kind='barh', figsize=(16, 20), title='average score by person')
plt.xlabel('average score')


Out[81]:
<matplotlib.text.Text at 0x10eec8350>

Relationship of price and score


In [115]:
sns.lmplot(x="price", y="score", data=data, size=8, x_jitter=0.3, y_jitter=0.3)
plt.title('Relationship of price and score')
plt.ylim(ymin=0)


Out[115]:
(0, 6.0)

The shaded blue area is the 95% confidence interval for the linear regression fit. There's no significant correlation.


In [116]:
sns.lmplot(x="price", y="score", data=data[data.price < 10.0], size=8, x_jitter=0.3, y_jitter=0.3)
plt.title('Relationship of price and score, wines under $10')
plt.ylim(ymin=0)


Out[116]:
(0, 6.0)

Similarly, no significant correlation


In [434]:
sns.lmplot(x="price", y="score", data=data[data.price >= 10], size=8, x_jitter=0.3, y_jitter=0.3)
plt.title('Relationship of price and score, wines $10+')


Out[434]:
<matplotlib.text.Text at 0x12c9a9550>

There's a significant negative correlation, as you go above $10, wines get worse on average (for this sample of wines).

Relationship of price and price guess


In [85]:
sns.lmplot(x="price", y="price_guess", data=data, size=8, x_jitter=0.2, y_jitter=0.2)
plt.title('Relationship of price and price guess')
plt.ylim(ymin=0, ymax=60)


Out[85]:
(0, 60)

No significant correlation

Best price guessers


In [86]:
data['abs_price_error'] = (data['price'] - data['price_guess']).abs()

In [87]:
price_guesses = data.groupby('person_name').abs_price_error.agg(['mean', 'count']).sort_values('mean')
price_guesses.columns = ['Avg. Price Error', 'Number of Wines Priced']

In [89]:
price_guesses.dropna()


Out[89]:
Avg. Price Error Number of Wines Priced
person_name
Anna R. 3.005000 8
Amanda M. 3.255000 4
Ana 5.802000 5
Tiana 6.051379 29
Lori 6.273077 13
Rachel D. 6.438000 30
anon 4 6.554667 30
Mike B. 6.802000 15
Erin D. 6.930000 7
Tom Q 6.994706 17
anon 5 7.056000 30
Rachel A. 7.212069 29
anon 13 7.323333 30
Paul 7.403333 15
Randy 7.556000 30
Maggie 7.690833 24
Rose K. 7.715000 12
anon 14 7.856000 30
Bridget 8.319474 19
anon 11 8.703333 15
anon 1 8.878333 12
Camille 8.963333 24
anon 15 8.971333 15
Luca 9.173333 18
Megan 9.199231 13
Evan 9.266296 27
anon 16 9.846000 25
Alex 10.090000 12
Spam 10.114211 19
Kris 10.222667 30
Michelle N. 10.240526 19
Teresa C. 10.789333 30
anon 12 10.840000 12
anon 2 10.991333 30
Ivena 11.281852 27
anon 6 11.351538 13
James B. 11.739231 13
anon 9 11.965714 14
Lee K 12.257500 8
Zak 13.370909 22
Noah 13.675333 15
Joeseph 13.990667 30
Jeremy G. 15.065556 18
Shayna 17.458000 10
Jessica P. 18.892000 30
Casey O. 19.048333 12
Colin 20.095882 17
anon 7 21.792000 30
anon 8 23.222667 30
anon 10 25.667000 10
Andrew S. 31.959333 30

Score as the night wore on


In [102]:
data.groupby('tasting_order').score.mean().plot()
plt.ylabel('average score')


Out[102]:
<matplotlib.text.Text at 0x115207b50>

There's no obvious trend in score as people tasted more wines, it looks random.

Tasting dedication


In [105]:
data.groupby('tasting_order').wine_number.count().plot(ylim=0)


Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x11538d890>

An impressive number of people tasted all the wines!

Trying the same wine twice


In [107]:
def get_avg_score_diff(df):
    if len(df) > 1:
        return df.score.diff().abs().mean()
same_wine_avg_diff = data.groupby(['person_name', 'wine_name', 'variety']).apply(get_avg_score_diff).dropna().mean()

In [108]:
same_person_avg_diff = data.groupby(['person_name']).apply(get_avg_score_diff).dropna().mean()

In [117]:
same_wine_diffs = pd.Series([same_wine_avg_diff, same_person_avg_diff], index=['same person same wine', 'same person different wine'])

In [118]:
same_wine_diffs = pd.DataFrame(same_wine_diffs, columns=['average score difference'])
same_wine_diffs


Out[118]:
average score difference
same person same wine 1.037791
same person different wine 1.155291

On average, when a single person tasted the same wine twice and didn't know it, they gave it a score that differed by 1.04 points. This is only slightly smaller than the average difference of a person scoring two different wines, which was 1.16 points. This indicates that most of the variation in scores is due to factors other than the wine: randomness, what wine you tasted before, etc...


In [114]:
data.to_csv('full_wine_tasting_results.csv', index=False)