1. Frame the Problem

"I think, therefore I am"

  • What type of questions can be answered?
  • Developing a hypothesis drive approach.
  • Making the case.

Questions we will answer on alcohol topic across countries

Descriptive

  • Which are the leading alcohol drinking countries by beer, wine and spirit?
  • What is the trend in alcohol consumption in Singapore across the years?

Exploratory / Inferential

  • Has alcohol consumption declined in Singapore in the recent years?

2. Acquire the Data

"Data is the new oil"

  • Download from an internal system
  • Obtained from client, or other 3rd party
  • Extracted from a web-based API
  • Scraped from a website
  • Extracted from a PDF file
  • Gathered manually and recorded

We will using the Global Information System on Alcohol and Health (GISAH) maintained by WHO to answer the questions.

The WHO Global Information System on Alcohol and Health (GISAH) provides easy and rapid access to a wide range of alcohol-related health indicators. It is an essential tool for assessing and monitoring the health situation and trends related to alcohol consumption, alcohol-related harm, and policy responses in countries.

You can see an overview at http://www.who.int/gho/alcohol/en/.

Principle: Load the Data

The datasets from GISAH are available at http://apps.who.int/gho/data/node.main.GISAH?lang=en&showonly=GISAH

We want the alcohol consumption by country


In [1]:
# Import the libraries we need, which is Pandas and Numpy
import pandas as pd
import numpy as np

In [2]:
df1 = pd.read_csv('data/drinks2000.csv')

In [3]:
df1.head()


Out[3]:
Country; Data Source; Beverage Types Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2015 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2014 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2013 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2012 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2011 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2010 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2009 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2008 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2007 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2006 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2005 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2004 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2003 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2002 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2001 Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2000
0 Afghanistan; Data source; All types NaN NaN 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 0.02 0.02 0.01 0.01 0 0
1 Afghanistan; Data source; Beer NaN NaN 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0 0 0 0
2 Afghanistan; Data source; Wine NaN NaN 0.00 0.00 0.00 0 0 0 0.01 0.01 0 0 0 0 0 0
3 Afghanistan; Data source; Spirits NaN NaN 0.00 0.00 0.00 0 0 0.02 0 0.01 0.01 0 0.01 0 0 0
4 Afghanistan; Data source; Other alcoholic beve... NaN NaN 0.00 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0

In [4]:
df1.shape


Out[4]:
(984, 17)

Principle: Fix the Column Header


In [5]:
years1 = list(range(2015, 1999, -1))

In [6]:
years1


Out[6]:
[2015,
 2014,
 2013,
 2012,
 2011,
 2010,
 2009,
 2008,
 2007,
 2006,
 2005,
 2004,
 2003,
 2002,
 2001,
 2000]

In [7]:
header1 = ['description']

In [8]:
header1.extend(years1)

In [9]:
header1


Out[9]:
['description',
 2015,
 2014,
 2013,
 2012,
 2011,
 2010,
 2009,
 2008,
 2007,
 2006,
 2005,
 2004,
 2003,
 2002,
 2001,
 2000]

In [10]:
df1.columns = header1

In [11]:
df1.head()


Out[11]:
description 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000
0 Afghanistan; Data source; All types NaN NaN 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 0.02 0.02 0.01 0.01 0 0
1 Afghanistan; Data source; Beer NaN NaN 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0 0 0 0
2 Afghanistan; Data source; Wine NaN NaN 0.00 0.00 0.00 0 0 0 0.01 0.01 0 0 0 0 0 0
3 Afghanistan; Data source; Spirits NaN NaN 0.00 0.00 0.00 0 0 0.02 0 0.01 0.01 0 0.01 0 0 0
4 Afghanistan; Data source; Other alcoholic beve... NaN NaN 0.00 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0

Exercise

  1. Load the drinks1960 and drinks1980 csv files and fix the column header

In [12]:
df2 = pd.read_csv('data/drinks1980.csv')
years2 = list(range(1999, 1979, -1))
header2 = ['description']
header2.extend(years2)
df2.columns = header2
df2.head()


Out[12]:
description 1999 1998 1997 1996 1995 1994 1993 1992 1991 ... 1989 1988 1987 1986 1985 1984 1983 1982 1981 1980
0 Afghanistan; Data source; All types 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 Afghanistan; Data source; Beer 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Afghanistan; Data source; Wine 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Afghanistan; Data source; Spirits 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 Afghanistan; Data source; Other alcoholic beve... 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 21 columns


In [13]:
df3 = pd.read_csv('data/drinks1960.csv')
years3 = list(range(1979, 1959, -1))
header3 = ['description']
header3.extend(years3)
df3.columns = header3
df3.head()


Out[13]:
description 1979 1978 1977 1976 1975 1974 1973 1972 1971 ... 1969 1968 1967 1966 1965 1964 1963 1962 1961 1960
0 Afghanistan; Data source; All types 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 NaN
1 Afghanistan; Data source; Beer 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 NaN
2 Afghanistan; Data source; Wine 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 NaN
3 Afghanistan; Data source; Spirits 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 NaN
4 Afghanistan; Data source; Other alcoholic beve... 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 NaN

5 rows × 21 columns


In [ ]:

3. Refine the Data

"Data is messy"

We will be performing the following operation on our Onion price to refine it

  • Remove e.g. remove redundant data from the data frame
  • Derive e.g. Country and Beverage from the description field
  • Missing e.g. Check for missing or incomplete data
  • Merge e.g. Take the three dataframes and make them one
  • Filter e.g. exclude based on location

Other stuff you may need to do to refine are...

  • Parse e.g. extract date from year and month column
  • Quality e.g. Check for duplicates, accuracy, unusual data
  • Convert e.g. free text to coded value
  • Calculate e.g. percentages, proportion
  • Aggregate e.g. rollup by year, cluster by area
  • Sample e.g. extract a representative data
  • Summary e.g. show summary stats like mean

Principle: melt to convert from Wide format to Tall format

We will need to convert the data frame from wide format to tall format (and vice versa). This is needed as we want to combine the three data frame and we can only do that once we have the data in a tall format


In [14]:
df1.head()


Out[14]:
description 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000
0 Afghanistan; Data source; All types NaN NaN 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 0.02 0.02 0.01 0.01 0 0
1 Afghanistan; Data source; Beer NaN NaN 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0 0 0 0
2 Afghanistan; Data source; Wine NaN NaN 0.00 0.00 0.00 0 0 0 0.01 0.01 0 0 0 0 0 0
3 Afghanistan; Data source; Spirits NaN NaN 0.00 0.00 0.00 0 0 0.02 0 0.01 0.01 0 0.01 0 0 0
4 Afghanistan; Data source; Other alcoholic beve... NaN NaN 0.00 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0

In [15]:
df1 = pd.melt(df1, id_vars=['description'], var_name='year')

In [16]:
df1.head()


Out[16]:
description year value
0 Afghanistan; Data source; All types 2015 NaN
1 Afghanistan; Data source; Beer 2015 NaN
2 Afghanistan; Data source; Wine 2015 NaN
3 Afghanistan; Data source; Spirits 2015 NaN
4 Afghanistan; Data source; Other alcoholic beve... 2015 NaN

In [17]:
df2 = pd.melt(df2, id_vars=['description'], var_name='year')
df3 = pd.melt(df3, id_vars=['description'], var_name='year')

Principle: append one dataframe to another


In [18]:
df1.shape


Out[18]:
(15744, 3)

In [19]:
df2.shape


Out[19]:
(18900, 3)

In [20]:
df = df1.append(df2)

In [21]:
df.shape


Out[21]:
(34644, 3)

In [22]:
df = df.append(df3)

In [23]:
df.shape


Out[23]:
(48244, 3)

Principle: str to extract text from a strings

String manipulation is very common and we often need to extract a substring from a long string. In this case, we want to get the country and type of beverage from the description


In [24]:
df.head()


Out[24]:
description year value
0 Afghanistan; Data source; All types 2015 NaN
1 Afghanistan; Data source; Beer 2015 NaN
2 Afghanistan; Data source; Wine 2015 NaN
3 Afghanistan; Data source; Spirits 2015 NaN
4 Afghanistan; Data source; Other alcoholic beve... 2015 NaN

In [25]:
df['country'] = df.description.str.split(';').str[0]

In [26]:
df.head()


Out[26]:
description year value country
0 Afghanistan; Data source; All types 2015 NaN Afghanistan
1 Afghanistan; Data source; Beer 2015 NaN Afghanistan
2 Afghanistan; Data source; Wine 2015 NaN Afghanistan
3 Afghanistan; Data source; Spirits 2015 NaN Afghanistan
4 Afghanistan; Data source; Other alcoholic beve... 2015 NaN Afghanistan

In [27]:
df['beverage'] = df.description.str.split(";").str[-1]

In [28]:
df.tail()


Out[28]:
description year value country beverage
13595 Zimbabwe; Data source; All types 1960 NaN Zimbabwe All types
13596 Zimbabwe; Data source; Beer 1960 NaN Zimbabwe Beer
13597 Zimbabwe; Data source; Wine 1960 NaN Zimbabwe Wine
13598 Zimbabwe; Data source; Spirits 1960 NaN Zimbabwe Spirits
13599 Zimbabwe; Data source; Other alcoholic beverages 1960 NaN Zimbabwe Other alcoholic beverages

We can now drop the description column from our dataframe


In [29]:
df.drop('description', axis = 1, inplace= True)

In [30]:
df.head()


Out[30]:
year value country beverage
0 2015 NaN Afghanistan All types
1 2015 NaN Afghanistan Beer
2 2015 NaN Afghanistan Wine
3 2015 NaN Afghanistan Spirits
4 2015 NaN Afghanistan Other alcoholic beverages

Principle: Dealing with Missing Values

By “missing” it simply mean null or “not present for whatever reason”. Many data sets have missing data, either because it exists and was not collected or it never existed. Pandas default way for treating missing value is to mark it as NaN


In [31]:
df.dtypes


Out[31]:
year        object
value       object
country     object
beverage    object
dtype: object

In [32]:
df.year.unique()


Out[32]:
array([2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
       2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994,
       1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983,
       1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972,
       1971, 1970, 1969, 1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961,
       1960], dtype=object)

In [33]:
df.year = pd.to_numeric(df.year)

In [34]:
df.dtypes


Out[34]:
year         int64
value       object
country     object
beverage    object
dtype: object

In [35]:
df.head()


Out[35]:
year value country beverage
0 2015 NaN Afghanistan All types
1 2015 NaN Afghanistan Beer
2 2015 NaN Afghanistan Wine
3 2015 NaN Afghanistan Spirits
4 2015 NaN Afghanistan Other alcoholic beverages

Lets check in the value whether we have numeric or not


In [36]:
df.value.unique()


Out[36]:
array([nan, 8.7, 3.29, ..., '20.56', '19.24', '17.24'], dtype=object)

In [37]:
df[df.value.str.isnumeric() == False].shape


Out[37]:
(34011, 4)

We will use pd.to_numeric which will coerce to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25


In [38]:
df.value = pd.to_numeric(df.value, errors='coerce')

In [39]:
df.value.unique()


Out[39]:
array([   nan,   8.7 ,   3.29, ...,  20.56,  19.24,  17.24])

In [40]:
df.dtypes


Out[40]:
year          int64
value       float64
country      object
beverage     object
dtype: object

In [41]:
df.country.unique()


Out[41]:
array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', "Cote d'Ivoire", 'Cabo Verde',
       'Cambodia', 'Cameroon', 'Canada', 'Central African Republic',
       'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo',
       'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus',
       'Czech Republic', "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji',
       'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany',
       'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hungary',
       'Iceland', 'India', 'Indonesia', 'Iran (Islamic Republic of)',
       'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan',
       'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan',
       "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho',
       'Liberia', 'Libya', 'Lithuania', 'Luxembourg', 'Madagascar',
       'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
       'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico',
       'Micronesia (Federated States of)', 'Monaco', 'Mongolia',
       'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
       'Nauru', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua',
       'Niger', 'Nigeria', 'Niue', 'Norway', 'Oman', 'Pakistan', 'Palau',
       'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Qatar', 'Republic of Korea',
       'Republic of Moldova', 'Romania', 'Russian Federation', 'Rwanda',
       'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia',
       'Solomon Islands', 'Somalia', 'South Africa', 'Spain', 'Sri Lanka',
       'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland',
       'Syrian Arab Republic', 'Tajikistan', 'Thailand',
       'The former Yugoslav republic of Macedonia', 'Timor-Leste', 'Togo',
       'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan',
       'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America',
       'Uruguay', 'Uzbekistan', 'Vanuatu',
       'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Yemen', 'Zambia',
       'Zimbabwe'], dtype=object)

In [42]:
df.beverage.unique()


Out[42]:
array([' All types', ' Beer', ' Wine', ' Spirits',
       ' Other alcoholic beverages'], dtype=object)

In [43]:
# Convert from an np array to a list
beverage_old = df.beverage.unique().tolist()

In [44]:
beverage_old


Out[44]:
[' All types', ' Beer', ' Wine', ' Spirits', ' Other alcoholic beverages']

In [45]:
# Create a new list with white space removed and shorter names
beverage_new = ['all', 'beer', 'wine', 'spirits', 'others']

In [46]:
beverage_new


Out[46]:
['all', 'beer', 'wine', 'spirits', 'others']

In [47]:
df.beverage = df.beverage.replace(beverage_old, beverage_new)

Principle: mutate to create new variables

It is hard to think of alcohol in terms of 'litres of pure alcohol content'. It is easy to understand in terms of number of typical serving of drinks.

http://rethinkingdrinking.niaaa.nih.gov/How-much-is-too-much/what-counts-as-a-drink/whats-A-Standard-drink.aspx

For one standard serving -

  • One glasses of wine: 12% alcohol in 5fl oz of standard serving of wine glass (0.6 fl oz)
  • One can of beer: 5% alcohol in 12fl oz of a standard serving of beer can (0.6 fl oz)
  • One shot of spirits: 40% alcohol in 1.5fl oz of standard shot of spirit (0.6 fl oz)

1 US fluid ounce (fl oz) = 0.0295735 litres (l) ~ 30ml

So for:

  • 1 Standard serving of Wine = 12% * 5 * 0.03= 0.018 litres of pure alcohol
  • 1 Standard serving of Beer = 5% * 12 * 0.03= 0.018 litres of pure alcohol
  • 1 Standard serving of Spirit = 40% * 1.5 * 0.03 = 0.018 litres of pure alcohol

In [48]:
df.dtypes


Out[48]:
year          int64
value       float64
country      object
beverage     object
dtype: object

In [49]:
df['serving'] = round(df['value']/0.018, 0)

In [50]:
df.head()


Out[50]:
year value country beverage serving
0 2015 NaN Afghanistan all NaN
1 2015 NaN Afghanistan beer NaN
2 2015 NaN Afghanistan wine NaN
3 2015 NaN Afghanistan spirits NaN
4 2015 NaN Afghanistan others NaN

Principle: filter for rows in a dataframe

  • To select the rows from the dataframe


In [51]:
df2010 = df[df.year == 2010]

In [52]:
df2010.head()


Out[52]:
year value country beverage serving
4920 2010 0.01 Afghanistan all 1.0
4921 2010 0.01 Afghanistan beer 1.0
4922 2010 0.00 Afghanistan wine 0.0
4923 2010 0.00 Afghanistan spirits 0.0
4924 2010 0.00 Afghanistan others 0.0

In [53]:
dfSing = df[df.country == 'Singapore']

In [54]:
df2010Sing = df[(df.year == 2010) & (df.country == 'Singapore')]

In [55]:
df2010Sing.head()


Out[55]:
year value country beverage serving
5709 2010 1.84 Singapore all 102.0
5710 2010 1.29 Singapore beer 72.0
5711 2010 0.25 Singapore wine 14.0
5712 2010 0.27 Singapore spirits 15.0
5713 2010 0.03 Singapore others 2.0

PRINCIPLE: Pivot Table

Pivot table is a way to summarize data frame data into rows, columns and value


In [56]:
# Let us create a pivot for just serving in 2010
df2010Serving = pd.pivot_table(df2010, values = "serving", columns = "beverage", index = "country")

In [57]:
df2010Serving = df2010Serving.reset_index()

In [58]:
df2010Serving.head()


Out[58]:
beverage country all beer others spirits wine
0 Afghanistan 1.0 1.0 0.0 0.0 0.0
1 Albania 293.0 89.0 6.0 124.0 73.0
2 Algeria 25.0 9.0 0.0 2.0 13.0
3 Andorra 589.0 197.0 0.0 149.0 244.0
4 Angola 433.0 284.0 4.0 36.0 108.0

In [59]:
dfSing.head()


Out[59]:
year value country beverage serving
789 2015 1.79 Singapore all 99.0
790 2015 1.24 Singapore beer 69.0
791 2015 0.27 Singapore wine 15.0
792 2015 0.25 Singapore spirits 14.0
793 2015 0.03 Singapore others 2.0

In [60]:
# Let us create a pivot for just serving in 2010
dfSingServing = pd.pivot_table(dfSing, values = "serving", columns = "beverage", index = "year")

In [61]:
dfSingServing = dfSingServing.reset_index()

In [62]:
dfSingServing.head()


Out[62]:
beverage year all beer others spirits wine
0 1960 NaN NaN NaN NaN NaN
1 1961 122.0 84.0 10.0 26.0 2.0
2 1962 109.0 56.0 27.0 25.0 2.0
3 1963 123.0 64.0 27.0 29.0 2.0
4 1964 104.0 61.0 18.0 23.0 2.0

4. Explore the Data

"I don't know, what I don't know"

  • Understand Data Structure & Types
  • Explore single variable graphs - (Quantitative, Categorical)
  • Explore dual variable graphs - (Q & Q, Q & C, C & C)
  • Explore multi variable graphs

We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.


In [63]:
# Load the visualisation libraries - Matplotlib
import matplotlib.pyplot as plt

In [64]:
# Let us see the output plots in the notebook itself
%matplotlib inline

In [65]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

In [66]:
# Sort them on Beer values
df2010Serving.sort_values(by = "spirits", ascending = False, inplace=True)

In [67]:
df2010Serving.head(10)


Out[67]:
beverage country all beer others spirits wine
15 Belarus 802.0 139.0 248.0 374.0 42.0
50 Dominica 414.0 27.0 4.0 342.0 41.0
39 Cook Islands 596.0 115.0 0.0 322.0 159.0
72 Guyana 418.0 97.0 2.0 319.0 1.0
73 Haiti 320.0 1.0 0.0 318.0 1.0
144 Saint Lucia 604.0 182.0 33.0 308.0 81.0
141 Russian Federation 614.0 228.0 7.0 307.0 71.0
25 Bulgaria 600.0 223.0 2.0 279.0 96.0
44 Cyprus 629.0 210.0 0.0 270.0 148.0
155 Slovakia 563.0 190.0 19.0 256.0 98.0

In [68]:
# Plot the Data
df2010Serving.head(40).plot(kind ="barh", x = 'country', y = 'beer')


Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x10e5552e8>

In [69]:
# Plot the histogram
df2010Serving.beer.plot(kind ="hist",bins = 30, alpha=0.5)


Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d914240>

In [70]:
# Plots the Singapore Data
dfSingServing.plot(kind = "line", x = "year", y = ['all', 'beer', 'wine', 'spirits', 'others'])


Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dcf1e10>

Model the Data

"All models are wrong, Some of them are useful"

  • Statistical testing
  • The power and limits of models
  • Tradeoff between Prediction Accuracy and Model Interpretability
  • Assessing Model Accuracy
  • Regression models (Simple, Multiple)
  • Classification model

We want to test whether the alcohol consumption has really declined in recent times. To do that we will do a t-test, in which we will check whether the consumption before and after 1990 are really different.

https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test https://en.wikipedia.org/wiki/Welch%27s_t-test


In [71]:
dfSing.head()


Out[71]:
year value country beverage serving
789 2015 1.79 Singapore all 99.0
790 2015 1.24 Singapore beer 69.0
791 2015 0.27 Singapore wine 15.0
792 2015 0.25 Singapore spirits 14.0
793 2015 0.03 Singapore others 2.0

In [72]:
dfSingAll = dfSing[dfSing.beverage == 'all'].copy()

In [73]:
dfSingAll.head()


Out[73]:
year value country beverage serving
789 2015 1.79 Singapore all 99.0
1773 2014 1.83 Singapore all 102.0
2757 2013 1.83 Singapore all 102.0
3741 2012 1.89 Singapore all 105.0
4725 2011 1.80 Singapore all 100.0

In [74]:
# Create a new column
dfSingAll['split'] = dfSingAll.year < 1990

In [75]:
dfSingAll.head()


Out[75]:
year value country beverage serving split
789 2015 1.79 Singapore all 99.0 False
1773 2014 1.83 Singapore all 102.0 False
2757 2013 1.83 Singapore all 102.0 False
3741 2012 1.89 Singapore all 105.0 False
4725 2011 1.80 Singapore all 100.0 False

In [76]:
# Let us plot the two samples
dfSingAll.hist(column = "serving", by = "split", sharex = True, sharey= True)


Out[76]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x10dd951d0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10e18b978>], dtype=object)

In [77]:
from scipy import stats
np.random.seed(12345678)

In [78]:
sampleA = dfSingAll[dfSingAll.split == True].serving
sampleB = dfSingAll[dfSingAll.split == False].serving

In [79]:
sampleA.shape


Out[79]:
(30,)

In [80]:
sampleB.shape


Out[80]:
(26,)

In [81]:
stats.ttest_ind(sampleA, sampleB, equal_var = False, nan_policy = 'omit')


Out[81]:
Ttest_indResult(statistic=5.7188435645402418, pvalue=6.8300746615946395e-07)

Share the Insight

"The goal is to turn data into insight"

  • Why do we need to communicate insight?
  • Types of communication - Exploration vs. Explanation
  • Explanation: Telling a story with data
  • Exploration: Building an interface for people to find stories

In [82]:
df2010Serving.head()


Out[82]:
beverage country all beer others spirits wine
15 Belarus 802.0 139.0 248.0 374.0 42.0
50 Dominica 414.0 27.0 4.0 342.0 41.0
39 Cook Islands 596.0 115.0 0.0 322.0 159.0
72 Guyana 418.0 97.0 2.0 319.0 1.0
73 Haiti 320.0 1.0 0.0 318.0 1.0

In [83]:
beerMean = df2010Serving.beer.mean()
beerMean


Out[83]:
108.95238095238095

In [84]:
wineMean = df2010Serving.wine.mean()
wineMean


Out[84]:
56.798941798941797

In [85]:
df2010Serving.plot(kind = "scatter", x ="beer", y= "wine", s = df2010Serving['all'], alpha = 0.7)
plt.axvline(beerMean, color='r')
plt.axhline(wineMean, color='r')


Out[85]:
<matplotlib.lines.Line2D at 0x111cf3240>