This notebook is a companion to the post at Practical Business Python
In [1]:
import pandas as pd
from bokeh.palettes import viridis
In [2]:
# Download and unpack the file from https://www.kaggle.com/zynicide/wine-reviews/data
df = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
In [3]:
df.head()
Out[3]:
In [4]:
# Let's see the top 15 Countries
df.country.value_counts()[:15]
Out[4]:
In [5]:
# Parse out just the Australian wines
df_Aussie = df[df.country == "Australia"].copy()
In [6]:
# Let's see what the price and point distribution looks like
df_Aussie.describe()
Out[6]:
In [7]:
# Let's see where null values might be
df_Aussie.isnull().any()
Out[7]:
For this analysis, price and points are going to be most important. Let's see how many null prices there are
In [8]:
len(df_Aussie[df_Aussie.price.isnull()])
Out[8]:
In [9]:
# We can drop these or fill with the average. For the sake of this analysis, I'll fill with the mean
df_Aussie['price'].fillna((df_Aussie['price'].mean()), inplace=True)
In [10]:
# Double check
len(df_Aussie[df_Aussie.price.isnull()])
Out[10]:
In [11]:
# There are a couple of titles that are duplicated and it causes minor problems when plotting.
df_Aussie.drop_duplicates(subset=['title'], inplace=True)
In [12]:
# One way we will filter the data is by province. Let's see how many there are
df_Aussie.province.value_counts()
Out[12]:
In [13]:
# It will be nice to color code the plot by the different varieties
df_Aussie.variety.value_counts()[:15]
Out[13]:
In [14]:
# Use the viridis pallette and create a column that contains the desired color
varieties = list(df_Aussie.variety.unique())
colors = viridis(len(varieties))
color_map = dict(zip(varieties, colors))
df_Aussie["variety_color"] = df_Aussie["variety"].map(color_map)
In [15]:
df_Aussie.head()
Out[15]:
In [16]:
# Save the filtered and processed file to be used in our bokeh plot
df_Aussie.to_csv("Aussie_Wines_Plotting.csv")
In [ ]: