Process the Wine Review file from Kaggle

Do some very high level exploration of the data
Filter data down to only include wines from Australia
Other minor changes to support plotting

This notebook is a companion to the post at Practical Business Python



In [1]:

    
import pandas as pd
from bokeh.palettes import viridis



In [2]:

    
# Download and unpack the file from https://www.kaggle.com/zynicide/wine-reviews/data
df = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)



In [3]:

    
df.head()









    Out[3]:







  
    
      
      country
      description
      designation
      points
      price
      province
      region_1
      region_2
      taster_name
      taster_twitter_handle
      title
      variety
      winery
    
  
  
    
      0
      Italy
      Aromas include tropical fruit, broom, brimston...
      Vulkà Bianco
      87
      NaN
      Sicily & Sardinia
      Etna
      NaN
      Kerin O’Keefe
      @kerinokeefe
      Nicosia 2013 Vulkà Bianco  (Etna)
      White Blend
      Nicosia
    
    
      1
      Portugal
      This is ripe and fruity, a wine that is smooth...
      Avidagos
      87
      15.0
      Douro
      NaN
      NaN
      Roger Voss
      @vossroger
      Quinta dos Avidagos 2011 Avidagos Red (Douro)
      Portuguese Red
      Quinta dos Avidagos
    
    
      2
      US
      Tart and snappy, the flavors of lime flesh and...
      NaN
      87
      14.0
      Oregon
      Willamette Valley
      Willamette Valley
      Paul Gregutt
      @paulgwine
      Rainstorm 2013 Pinot Gris (Willamette Valley)
      Pinot Gris
      Rainstorm
    
    
      3
      US
      Pineapple rind, lemon pith and orange blossom ...
      Reserve Late Harvest
      87
      13.0
      Michigan
      Lake Michigan Shore
      NaN
      Alexander Peartree
      NaN
      St. Julian 2013 Reserve Late Harvest Riesling ...
      Riesling
      St. Julian
    
    
      4
      US
      Much like the regular bottling from 2012, this...
      Vintner's Reserve Wild Child Block
      87
      65.0
      Oregon
      Willamette Valley
      Willamette Valley
      Paul Gregutt
      @paulgwine
      Sweet Cheeks 2012 Vintner's Reserve Wild Child...
      Pinot Noir
      Sweet Cheeks



In [4]:

    
# Let's see the top 15 Countries
df.country.value_counts()[:15]









    Out[4]:





US              54504
France          22093
Italy           19540
Spain            6645
Portugal         5691
Chile            4472
Argentina        3800
Austria          3345
Australia        2329
Germany          2165
New Zealand      1419
South Africa     1401
Israel            505
Greece            466
Canada            257
Name: country, dtype: int64



In [5]:

    
# Parse out just the Australian wines
df_Aussie = df[df.country == "Australia"].copy()



In [6]:

    
# Let's see what the price and point distribution looks like
df_Aussie.describe()









    Out[6]:







  
    
      
      points
      price
    
  
  
    
      count
      2329.000000
      2294.000000
    
    
      mean
      88.580507
      35.437663
    
    
      std
      2.989900
      49.049458
    
    
      min
      80.000000
      5.000000
    
    
      25%
      87.000000
      15.000000
    
    
      50%
      89.000000
      21.000000
    
    
      75%
      91.000000
      38.000000
    
    
      max
      100.000000
      850.000000



In [7]:

    
# Let's see where null values might be
df_Aussie.isnull().any()









    Out[7]:





country                  False
description              False
designation               True
points                   False
price                     True
province                 False
region_1                  True
region_2                  True
taster_name               True
taster_twitter_handle     True
title                    False
variety                  False
winery                   False
dtype: bool

For this analysis, price and points are going to be most important. Let's see how many null prices there are



In [8]:

    
len(df_Aussie[df_Aussie.price.isnull()])









    Out[8]:





35



In [9]:

    
# We can drop these or fill with the average. For the sake of this analysis, I'll fill with the mean
df_Aussie['price'].fillna((df_Aussie['price'].mean()), inplace=True)



In [10]:

    
# Double check
len(df_Aussie[df_Aussie.price.isnull()])









    Out[10]:





0



In [11]:

    
# There are a couple of titles that are duplicated and it causes minor problems when plotting. 
df_Aussie.drop_duplicates(subset=['title'], inplace=True)



In [12]:

    
# One way we will filter the data is by province. Let's see how many there are
df_Aussie.province.value_counts()









    Out[12]:





South Australia      1268
Victoria              293
Western Australia     274
Australia Other       230
New South Wales        82
Tasmania               36
Name: province, dtype: int64



In [13]:

    
# It will be nice to color code the plot by the different varieties
df_Aussie.variety.value_counts()[:15]









    Out[13]:





Shiraz                       583
Chardonnay                   378
Cabernet Sauvignon           255
Riesling                     142
Pinot Noir                   119
Grenache                      72
Red Blend                     58
Sauvignon Blanc               50
Shiraz-Viognier               38
Viognier                      32
Sparkling Blend               31
Shiraz-Cabernet Sauvignon     29
Bordeaux-style Red Blend      28
Rosé                          25
Pinot Grigio                  25
Name: variety, dtype: int64



In [14]:

    
# Use the viridis pallette and create a column that contains the desired color
varieties = list(df_Aussie.variety.unique())
colors = viridis(len(varieties))
color_map = dict(zip(varieties, colors))
df_Aussie["variety_color"] = df_Aussie["variety"].map(color_map)



In [15]:

    
df_Aussie.head()









    Out[15]:







  
    
      
      country
      description
      designation
      points
      price
      province
      region_1
      region_2
      taster_name
      taster_twitter_handle
      title
      variety
      winery
      variety_color
    
  
  
    
      77
      Australia
      This medium-bodied Chardonnay features aromas ...
      Made With Organic Grapes
      86
      18.0
      South Australia
      South Australia
      NaN
      Joe Czerwinski
      @JoeCz
      Yalumba 2016 Made With Organic Grapes Chardonn...
      Chardonnay
      Yalumba
      #440154
    
    
      83
      Australia
      Pale copper in hue, this wine exudes passion f...
      Jester Sangiovese
      86
      20.0
      South Australia
      McLaren Vale
      NaN
      Joe Czerwinski
      @JoeCz
      Mitolo 2016 Jester Sangiovese Rosé (McLaren Vale)
      Rosé
      Mitolo
      #450558
    
    
      123
      Australia
      The blend is roughly two-thirds Shiraz and one...
      Parson's Flat
      92
      40.0
      South Australia
      Padthaway
      NaN
      Joe Czerwinski
      @JoeCz
      Henry's Drive Vignerons 2006 Parson's Flat Shi...
      Shiraz-Cabernet Sauvignon
      Henry's Drive Vignerons
      #460B5E
    
    
      191
      Australia
      From the little-known region of Padthaway, thi...
      The Trial of John Montford
      87
      30.0
      South Australia
      Padthaway
      NaN
      Joe Czerwinski
      @JoeCz
      Henry's Drive Vignerons 2006 The Trial of John...
      Cabernet Sauvignon
      Henry's Drive Vignerons
      #471163
    
    
      232
      Australia
      Lifted cedar and pine notes interspersed with ...
      Red Belly Black
      85
      12.0
      South Australia
      South Australia
      NaN
      NaN
      NaN
      Angove's 2006 Red Belly Black Shiraz (South Au...
      Shiraz
      Angove's
      #471669



In [16]:

    
# Save the filtered and processed file to be used in our bokeh plot
df_Aussie.to_csv("Aussie_Wines_Plotting.csv")



In [ ]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

	points	price
count	2329.000000	2294.000000
mean	88.580507	35.437663
std	2.989900	49.049458
min	80.000000	5.000000
25%	87.000000	15.000000
50%	89.000000	21.000000
75%	91.000000	38.000000
max	100.000000	850.000000

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery	variety_color
77	Australia	This medium-bodied Chardonnay features aromas ...	Made With Organic Grapes	86	18.0	South Australia	South Australia	NaN	Joe Czerwinski	@JoeCz	Yalumba 2016 Made With Organic Grapes Chardonn...	Chardonnay	Yalumba	#440154
83	Australia	Pale copper in hue, this wine exudes passion f...	Jester Sangiovese	86	20.0	South Australia	McLaren Vale	NaN	Joe Czerwinski	@JoeCz	Mitolo 2016 Jester Sangiovese Rosé (McLaren Vale)	Rosé	Mitolo	#450558
123	Australia	The blend is roughly two-thirds Shiraz and one...	Parson's Flat	92	40.0	South Australia	Padthaway	NaN	Joe Czerwinski	@JoeCz	Henry's Drive Vignerons 2006 Parson's Flat Shi...	Shiraz-Cabernet Sauvignon	Henry's Drive Vignerons	#460B5E
191	Australia	From the little-known region of Padthaway, thi...	The Trial of John Montford	87	30.0	South Australia	Padthaway	NaN	Joe Czerwinski	@JoeCz	Henry's Drive Vignerons 2006 The Trial of John...	Cabernet Sauvignon	Henry's Drive Vignerons	#471163
232	Australia	Lifted cedar and pine notes interspersed with ...	Red Belly Black	85	12.0	South Australia	South Australia	NaN	NaN	NaN	Angove's 2006 Red Belly Black Shiraz (South Au...	Shiraz	Angove's	#471669