"I think, therefore I am"
Questions we will answer on alcohol topic across countries
"Data is the new oil"
We will using the Global Information System on Alcohol and Health (GISAH) maintained by WHO to answer the questions.
The WHO Global Information System on Alcohol and Health (GISAH) provides easy and rapid access to a wide range of alcohol-related health indicators. It is an essential tool for assessing and monitoring the health situation and trends related to alcohol consumption, alcohol-related harm, and policy responses in countries.
You can see an overview at http://www.who.int/gho/alcohol/en/.
The datasets from GISAH are available at http://apps.who.int/gho/data/node.main.GISAH?lang=en&showonly=GISAH
We want the alcohol consumption by country
Recorded alcohol per capita consumption, 1960-1979 by country - http://apps.who.int/gho/data/node.main.A1025?lang=en&showonly=GISAH
Recorded alcohol per capita consumption, 1980-1999 by country - http://apps.who.int/gho/data/node.main.A1024?lang=en&showonly=GISAH
Recorded alcohol per capita consumption, 2000 onwards by country http://apps.who.int/gho/data/node.main.A1026?lang=en&showonly=GISAH
In [1]:
# Import the libraries we need, which is Pandas and Numpy
import pandas as pd
import numpy as np
In [2]:
df1 = pd.read_csv('data/drinks2000.csv')
In [3]:
df1.head()
Out[3]:
In [4]:
df1.shape
Out[4]:
In [5]:
years1 = list(range(2015, 1999, -1))
In [6]:
years1
Out[6]:
In [7]:
header1 = ['description']
In [8]:
header1.extend(years1)
In [9]:
header1
Out[9]:
In [10]:
df1.columns = header1
In [11]:
df1.head()
Out[11]:
In [12]:
df2 = pd.read_csv('data/drinks1980.csv')
years2 = list(range(1999, 1979, -1))
header2 = ['description']
header2.extend(years2)
df2.columns = header2
df2.head()
Out[12]:
In [13]:
df3 = pd.read_csv('data/drinks1960.csv')
years3 = list(range(1979, 1959, -1))
header3 = ['description']
header3.extend(years3)
df3.columns = header3
df3.head()
Out[13]:
In [ ]:
"Data is messy"
We will be performing the following operation on our Onion price to refine it
Other stuff you may need to do to refine are...
In [14]:
df1.head()
Out[14]:
In [15]:
df1 = pd.melt(df1, id_vars=['description'], var_name='year')
In [16]:
df1.head()
Out[16]:
In [17]:
df2 = pd.melt(df2, id_vars=['description'], var_name='year')
df3 = pd.melt(df3, id_vars=['description'], var_name='year')
In [18]:
df1.shape
Out[18]:
In [19]:
df2.shape
Out[19]:
In [20]:
df = df1.append(df2)
In [21]:
df.shape
Out[21]:
In [22]:
df = df.append(df3)
In [23]:
df.shape
Out[23]:
In [24]:
df.head()
Out[24]:
In [25]:
df['country'] = df.description.str.split(';').str[0]
In [26]:
df.head()
Out[26]:
In [27]:
df['beverage'] = df.description.str.split(";").str[-1]
In [28]:
df.tail()
Out[28]:
We can now drop the description column from our dataframe
In [29]:
df.drop('description', axis = 1, inplace= True)
In [30]:
df.head()
Out[30]:
In [31]:
df.dtypes
Out[31]:
In [32]:
df.year.unique()
Out[32]:
In [33]:
df.year = pd.to_numeric(df.year)
In [34]:
df.dtypes
Out[34]:
In [35]:
df.head()
Out[35]:
Lets check in the value whether we have numeric or not
In [36]:
df.value.unique()
Out[36]:
In [37]:
df[df.value.str.isnumeric() == False].shape
Out[37]:
We will use pd.to_numeric
which will coerce to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25
In [38]:
df.value = pd.to_numeric(df.value, errors='coerce')
In [39]:
df.value.unique()
Out[39]:
In [40]:
df.dtypes
Out[40]:
In [41]:
df.country.unique()
Out[41]:
In [42]:
df.beverage.unique()
Out[42]:
In [43]:
# Convert from an np array to a list
beverage_old = df.beverage.unique().tolist()
In [44]:
beverage_old
Out[44]:
In [45]:
# Create a new list with white space removed and shorter names
beverage_new = ['all', 'beer', 'wine', 'spirits', 'others']
In [46]:
beverage_new
Out[46]:
In [47]:
df.beverage = df.beverage.replace(beverage_old, beverage_new)
It is hard to think of alcohol in terms of 'litres of pure alcohol content'. It is easy to understand in terms of number of typical serving of drinks.
For one standard serving -
1 US fluid ounce (fl oz) = 0.0295735 litres (l) ~ 30ml
So for:
12% * 5 * 0.03
= 0.018
litres of pure alcohol5% * 12 * 0.03
= 0.018
litres of pure alcohol40% * 1.5 * 0.03
= 0.018
litres of pure alcohol
In [48]:
df.dtypes
Out[48]:
In [49]:
df['serving'] = round(df['value']/0.018, 0)
In [50]:
df.head()
Out[50]:
In [51]:
df2010 = df[df.year == 2010]
In [52]:
df2010.head()
Out[52]:
In [53]:
dfSing = df[df.country == 'Singapore']
In [54]:
df2010Sing = df[(df.year == 2010) & (df.country == 'Singapore')]
In [55]:
df2010Sing.head()
Out[55]:
In [56]:
# Let us create a pivot for just serving in 2010
df2010Serving = pd.pivot_table(df2010, values = "serving", columns = "beverage", index = "country")
In [57]:
df2010Serving = df2010Serving.reset_index()
In [58]:
df2010Serving.head()
Out[58]:
In [59]:
dfSing.head()
Out[59]:
In [60]:
# Let us create a pivot for just serving in 2010
dfSingServing = pd.pivot_table(dfSing, values = "serving", columns = "beverage", index = "year")
In [61]:
dfSingServing = dfSingServing.reset_index()
In [62]:
dfSingServing.head()
Out[62]:
"I don't know, what I don't know"
We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.
In [63]:
# Load the visualisation libraries - Matplotlib
import matplotlib.pyplot as plt
In [64]:
# Let us see the output plots in the notebook itself
%matplotlib inline
In [65]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)
In [66]:
# Sort them on Beer values
df2010Serving.sort_values(by = "spirits", ascending = False, inplace=True)
In [67]:
df2010Serving.head(10)
Out[67]:
In [68]:
# Plot the Data
df2010Serving.head(40).plot(kind ="barh", x = 'country', y = 'beer')
Out[68]:
In [69]:
# Plot the histogram
df2010Serving.beer.plot(kind ="hist",bins = 30, alpha=0.5)
Out[69]:
In [70]:
# Plots the Singapore Data
dfSingServing.plot(kind = "line", x = "year", y = ['all', 'beer', 'wine', 'spirits', 'others'])
Out[70]:
We want to test whether the alcohol consumption has really declined in recent times. To do that we will do a t-test, in which we will check whether the consumption before and after 1990 are really different.
https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test https://en.wikipedia.org/wiki/Welch%27s_t-test
In [71]:
dfSing.head()
Out[71]:
In [72]:
dfSingAll = dfSing[dfSing.beverage == 'all'].copy()
In [73]:
dfSingAll.head()
Out[73]:
In [74]:
# Create a new column
dfSingAll['split'] = dfSingAll.year < 1990
In [75]:
dfSingAll.head()
Out[75]:
In [76]:
# Let us plot the two samples
dfSingAll.hist(column = "serving", by = "split", sharex = True, sharey= True)
Out[76]:
In [77]:
from scipy import stats
np.random.seed(12345678)
In [78]:
sampleA = dfSingAll[dfSingAll.split == True].serving
sampleB = dfSingAll[dfSingAll.split == False].serving
In [79]:
sampleA.shape
Out[79]:
In [80]:
sampleB.shape
Out[80]:
In [81]:
stats.ttest_ind(sampleA, sampleB, equal_var = False, nan_policy = 'omit')
Out[81]:
In [82]:
df2010Serving.head()
Out[82]:
In [83]:
beerMean = df2010Serving.beer.mean()
beerMean
Out[83]:
In [84]:
wineMean = df2010Serving.wine.mean()
wineMean
Out[84]:
In [85]:
df2010Serving.plot(kind = "scatter", x ="beer", y= "wine", s = df2010Serving['all'], alpha = 0.7)
plt.axvline(beerMean, color='r')
plt.axhline(wineMean, color='r')
Out[85]: