1. Frame the Problem

"I think, therefore I am"

What type of questions can be answered?
Developing a hypothesis drive approach.
Making the case.

Questions we will answer on alcohol topic across countries

Descriptive

Which are the leading alcohol drinking countries by beer, wine and spirit?
What is the trend in alcohol consumption in Singapore across the years?

Exploratory / Inferential

Has alcohol consumption declined in Singapore in the recent years?

2. Acquire the Data

"Data is the new oil"

Download from an internal system
Obtained from client, or other 3rd party
Extracted from a web-based API
Scraped from a website
Extracted from a PDF file
Gathered manually and recorded

We will using the Global Information System on Alcohol and Health (GISAH) maintained by WHO to answer the questions.

The WHO Global Information System on Alcohol and Health (GISAH) provides easy and rapid access to a wide range of alcohol-related health indicators. It is an essential tool for assessing and monitoring the health situation and trends related to alcohol consumption, alcohol-related harm, and policy responses in countries.

You can see an overview at http://www.who.int/gho/alcohol/en/.

Principle: Load the Data

The datasets from GISAH are available at http://apps.who.int/gho/data/node.main.GISAH?lang=en&showonly=GISAH

We want the alcohol consumption by country

Recorded alcohol per capita consumption, 1960-1979 by country - http://apps.who.int/gho/data/node.main.A1025?lang=en&showonly=GISAH
Recorded alcohol per capita consumption, 1980-1999 by country - http://apps.who.int/gho/data/node.main.A1024?lang=en&showonly=GISAH
Recorded alcohol per capita consumption, 2000 onwards by country http://apps.who.int/gho/data/node.main.A1026?lang=en&showonly=GISAH



In [1]:

    
# Import the libraries we need, which is Pandas and Numpy
import pandas as pd
import numpy as np



In [2]:

    
df1 = pd.read_csv('data/drinks2000.csv')



In [3]:

    
df1.head()









    Out[3]:






  
    
      
      Country; Data Source; Beverage Types
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2015
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2014
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2013
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2012
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2011
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2010
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2009
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2008
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2007
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2006
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2005
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2004
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2003
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2002
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2001
      Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2000
    
  
  
    
      0
      Afghanistan; Data source; All types
      NaN
      NaN
      0.01
      0.01
      0.01
      0.01
      0.01
      0.03
      0.02
      0.03
      0.02
      0.02
      0.01
      0.01
      0
      0
    
    
      1
      Afghanistan; Data source; Beer
      NaN
      NaN
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.02
      0
      0
      0
      0
    
    
      2
      Afghanistan; Data source; Wine
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0
      0.01
      0.01
      0
      0
      0
      0
      0
      0
    
    
      3
      Afghanistan; Data source; Spirits
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0.02
      0
      0.01
      0.01
      0
      0.01
      0
      0
      0
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0



In [4]:

    
df1.shape









    Out[4]:





(984, 17)

Principle: Fix the Column Header



In [5]:

    
years1 = list(range(2015, 1999, -1))



In [6]:

    
years1



In [7]:

    
header1 = ['description']



In [8]:

    
header1.extend(years1)



In [9]:

    
header1









    Out[9]:





['description',
 2015,
 2014,
 2013,
 2012,
 2011,
 2010,
 2009,
 2008,
 2007,
 2006,
 2005,
 2004,
 2003,
 2002,
 2001,
 2000]



In [10]:

    
df1.columns = header1



In [11]:

    
df1.head()









    Out[11]:






  
    
      
      description
      2015
      2014
      2013
      2012
      2011
      2010
      2009
      2008
      2007
      2006
      2005
      2004
      2003
      2002
      2001
      2000
    
  
  
    
      0
      Afghanistan; Data source; All types
      NaN
      NaN
      0.01
      0.01
      0.01
      0.01
      0.01
      0.03
      0.02
      0.03
      0.02
      0.02
      0.01
      0.01
      0
      0
    
    
      1
      Afghanistan; Data source; Beer
      NaN
      NaN
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.02
      0
      0
      0
      0
    
    
      2
      Afghanistan; Data source; Wine
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0
      0.01
      0.01
      0
      0
      0
      0
      0
      0
    
    
      3
      Afghanistan; Data source; Spirits
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0.02
      0
      0.01
      0.01
      0
      0.01
      0
      0
      0
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0

Exercise

Load the drinks1960 and drinks1980 csv files and fix the column header



In [12]:

    
df2 = pd.read_csv('data/drinks1980.csv')
years2 = list(range(1999, 1979, -1))
header2 = ['description']
header2.extend(years2)
df2.columns = header2
df2.head()









    Out[12]:






  
    
      
      description
      1999
      1998
      1997
      1996
      1995
      1994
      1993
      1992
      1991
      ...
      1989
      1988
      1987
      1986
      1985
      1984
      1983
      1982
      1981
      1980
    
  
  
    
      0
      Afghanistan; Data source; All types
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      Afghanistan; Data source; Beer
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      Afghanistan; Data source; Wine
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      Afghanistan; Data source; Spirits
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

5 rows × 21 columns



In [13]:

    
df3 = pd.read_csv('data/drinks1960.csv')
years3 = list(range(1979, 1959, -1))
header3 = ['description']
header3.extend(years3)
df3.columns = header3
df3.head()









    Out[13]:






  
    
      
      description
      1979
      1978
      1977
      1976
      1975
      1974
      1973
      1972
      1971
      ...
      1969
      1968
      1967
      1966
      1965
      1964
      1963
      1962
      1961
      1960
    
  
  
    
      0
      Afghanistan; Data source; All types
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      NaN
    
    
      1
      Afghanistan; Data source; Beer
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      NaN
    
    
      2
      Afghanistan; Data source; Wine
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      NaN
    
    
      3
      Afghanistan; Data source; Spirits
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      NaN
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      NaN
    
  

5 rows × 21 columns



In [ ]:

3. Refine the Data

"Data is messy"

We will be performing the following operation on our Onion price to refine it

Remove e.g. remove redundant data from the data frame
Derive e.g. Country and Beverage from the description field
Missing e.g. Check for missing or incomplete data
Merge e.g. Take the three dataframes and make them one
Filter e.g. exclude based on location

Other stuff you may need to do to refine are...

Parse e.g. extract date from year and month column
Quality e.g. Check for duplicates, accuracy, unusual data
Convert e.g. free text to coded value
Calculate e.g. percentages, proportion
Aggregate e.g. rollup by year, cluster by area
Sample e.g. extract a representative data
Summary e.g. show summary stats like mean

Principle: `melt` to convert from Wide format to Tall format

We will need to convert the data frame from wide format to tall format (and vice versa). This is needed as we want to combine the three data frame and we can only do that once we have the data in a tall format



In [14]:

    
df1.head()









    Out[14]:






  
    
      
      description
      2015
      2014
      2013
      2012
      2011
      2010
      2009
      2008
      2007
      2006
      2005
      2004
      2003
      2002
      2001
      2000
    
  
  
    
      0
      Afghanistan; Data source; All types
      NaN
      NaN
      0.01
      0.01
      0.01
      0.01
      0.01
      0.03
      0.02
      0.03
      0.02
      0.02
      0.01
      0.01
      0
      0
    
    
      1
      Afghanistan; Data source; Beer
      NaN
      NaN
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.01
      0.02
      0
      0
      0
      0
    
    
      2
      Afghanistan; Data source; Wine
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0
      0.01
      0.01
      0
      0
      0
      0
      0
      0
    
    
      3
      Afghanistan; Data source; Spirits
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0.02
      0
      0.01
      0.01
      0
      0.01
      0
      0
      0
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      NaN
      NaN
      0.00
      0.00
      0.00
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0



In [15]:

    
df1 = pd.melt(df1, id_vars=['description'], var_name='year')



In [16]:

    
df1.head()









    Out[16]:






  
    
      
      description
      year
      value
    
  
  
    
      0
      Afghanistan; Data source; All types
      2015
      NaN
    
    
      1
      Afghanistan; Data source; Beer
      2015
      NaN
    
    
      2
      Afghanistan; Data source; Wine
      2015
      NaN
    
    
      3
      Afghanistan; Data source; Spirits
      2015
      NaN
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      2015
      NaN



In [17]:

    
df2 = pd.melt(df2, id_vars=['description'], var_name='year')
df3 = pd.melt(df3, id_vars=['description'], var_name='year')

Principle: `append` one dataframe to another



In [18]:

    
df1.shape









    Out[18]:





(15744, 3)



In [19]:

    
df2.shape









    Out[19]:





(18900, 3)



In [20]:

    
df = df1.append(df2)



In [21]:

    
df.shape









    Out[21]:





(34644, 3)



In [22]:

    
df = df.append(df3)



In [23]:

    
df.shape









    Out[23]:





(48244, 3)

Principle: `str` to extract text from a strings

String manipulation is very common and we often need to extract a substring from a long string. In this case, we want to get the country and type of beverage from the description



In [24]:

    
df.head()









    Out[24]:






  
    
      
      description
      year
      value
    
  
  
    
      0
      Afghanistan; Data source; All types
      2015
      NaN
    
    
      1
      Afghanistan; Data source; Beer
      2015
      NaN
    
    
      2
      Afghanistan; Data source; Wine
      2015
      NaN
    
    
      3
      Afghanistan; Data source; Spirits
      2015
      NaN
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      2015
      NaN



In [25]:

    
df['country'] = df.description.str.split(';').str[0]



In [26]:

    
df.head()









    Out[26]:






  
    
      
      description
      year
      value
      country
    
  
  
    
      0
      Afghanistan; Data source; All types
      2015
      NaN
      Afghanistan
    
    
      1
      Afghanistan; Data source; Beer
      2015
      NaN
      Afghanistan
    
    
      2
      Afghanistan; Data source; Wine
      2015
      NaN
      Afghanistan
    
    
      3
      Afghanistan; Data source; Spirits
      2015
      NaN
      Afghanistan
    
    
      4
      Afghanistan; Data source; Other alcoholic beve...
      2015
      NaN
      Afghanistan



In [27]:

    
df['beverage'] = df.description.str.split(";").str[-1]



In [28]:

    
df.tail()









    Out[28]:






  
    
      
      description
      year
      value
      country
      beverage
    
  
  
    
      13595
      Zimbabwe; Data source; All types
      1960
      NaN
      Zimbabwe
      All types
    
    
      13596
      Zimbabwe; Data source; Beer
      1960
      NaN
      Zimbabwe
      Beer
    
    
      13597
      Zimbabwe; Data source; Wine
      1960
      NaN
      Zimbabwe
      Wine
    
    
      13598
      Zimbabwe; Data source; Spirits
      1960
      NaN
      Zimbabwe
      Spirits
    
    
      13599
      Zimbabwe; Data source; Other alcoholic beverages
      1960
      NaN
      Zimbabwe
      Other alcoholic beverages

We can now drop the description column from our dataframe



In [29]:

    
df.drop('description', axis = 1, inplace= True)



In [30]:

    
df.head()









    Out[30]:






  
    
      
      year
      value
      country
      beverage
    
  
  
    
      0
      2015
      NaN
      Afghanistan
      All types
    
    
      1
      2015
      NaN
      Afghanistan
      Beer
    
    
      2
      2015
      NaN
      Afghanistan
      Wine
    
    
      3
      2015
      NaN
      Afghanistan
      Spirits
    
    
      4
      2015
      NaN
      Afghanistan
      Other alcoholic beverages

Principle: Dealing with Missing Values

By “missing” it simply mean null or “not present for whatever reason”. Many data sets have missing data, either because it exists and was not collected or it never existed. Pandas default way for treating missing value is to mark it as NaN



In [31]:

    
df.dtypes









    Out[31]:





year        object
value       object
country     object
beverage    object
dtype: object



In [32]:

    
df.year.unique()









    Out[32]:





array([2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
       2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994,
       1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983,
       1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972,
       1971, 1970, 1969, 1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961,
       1960], dtype=object)



In [33]:

    
df.year = pd.to_numeric(df.year)



In [34]:

    
df.dtypes









    Out[34]:





year         int64
value       object
country     object
beverage    object
dtype: object



In [35]:

    
df.head()









    Out[35]:






  
    
      
      year
      value
      country
      beverage
    
  
  
    
      0
      2015
      NaN
      Afghanistan
      All types
    
    
      1
      2015
      NaN
      Afghanistan
      Beer
    
    
      2
      2015
      NaN
      Afghanistan
      Wine
    
    
      3
      2015
      NaN
      Afghanistan
      Spirits
    
    
      4
      2015
      NaN
      Afghanistan
      Other alcoholic beverages

Lets check in the value whether we have numeric or not



In [36]:

    
df.value.unique()









    Out[36]:





array([nan, 8.7, 3.29, ..., '20.56', '19.24', '17.24'], dtype=object)



In [37]:

    
df[df.value.str.isnumeric() == False].shape









    Out[37]:





(34011, 4)

We will use pd.to_numeric which will coerce to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25



In [38]:

    
df.value = pd.to_numeric(df.value, errors='coerce')



In [39]:

    
df.value.unique()









    Out[39]:





array([   nan,   8.7 ,   3.29, ...,  20.56,  19.24,  17.24])



In [40]:

    
df.dtypes









    Out[40]:





year          int64
value       float64
country      object
beverage     object
dtype: object



In [41]:

    
df.country.unique()









    Out[41]:





array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', "Cote d'Ivoire", 'Cabo Verde',
       'Cambodia', 'Cameroon', 'Canada', 'Central African Republic',
       'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo',
       'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus',
       'Czech Republic', "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji',
       'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany',
       'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hungary',
       'Iceland', 'India', 'Indonesia', 'Iran (Islamic Republic of)',
       'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan',
       'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan',
       "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho',
       'Liberia', 'Libya', 'Lithuania', 'Luxembourg', 'Madagascar',
       'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
       'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico',
       'Micronesia (Federated States of)', 'Monaco', 'Mongolia',
       'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
       'Nauru', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua',
       'Niger', 'Nigeria', 'Niue', 'Norway', 'Oman', 'Pakistan', 'Palau',
       'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Qatar', 'Republic of Korea',
       'Republic of Moldova', 'Romania', 'Russian Federation', 'Rwanda',
       'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia',
       'Solomon Islands', 'Somalia', 'South Africa', 'Spain', 'Sri Lanka',
       'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland',
       'Syrian Arab Republic', 'Tajikistan', 'Thailand',
       'The former Yugoslav republic of Macedonia', 'Timor-Leste', 'Togo',
       'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan',
       'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America',
       'Uruguay', 'Uzbekistan', 'Vanuatu',
       'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Yemen', 'Zambia',
       'Zimbabwe'], dtype=object)



In [42]:

    
df.beverage.unique()









    Out[42]:





array([' All types', ' Beer', ' Wine', ' Spirits',
       ' Other alcoholic beverages'], dtype=object)



In [43]:

    
# Convert from an np array to a list
beverage_old = df.beverage.unique().tolist()



In [44]:

    
beverage_old









    Out[44]:





[' All types', ' Beer', ' Wine', ' Spirits', ' Other alcoholic beverages']



In [45]:

    
# Create a new list with white space removed and shorter names
beverage_new = ['all', 'beer', 'wine', 'spirits', 'others']



In [46]:

    
beverage_new









    Out[46]:





['all', 'beer', 'wine', 'spirits', 'others']



In [47]:

    
df.beverage = df.beverage.replace(beverage_old, beverage_new)

Principle: mutate to create new variables

It is hard to think of alcohol in terms of 'litres of pure alcohol content'. It is easy to understand in terms of number of typical serving of drinks.

http://rethinkingdrinking.niaaa.nih.gov/How-much-is-too-much/what-counts-as-a-drink/whats-A-Standard-drink.aspx

For one standard serving -

One glasses of wine: 12% alcohol in 5fl oz of standard serving of wine glass (0.6 fl oz)
One can of beer: 5% alcohol in 12fl oz of a standard serving of beer can (0.6 fl oz)
One shot of spirits: 40% alcohol in 1.5fl oz of standard shot of spirit (0.6 fl oz)

1 US fluid ounce (fl oz) = 0.0295735 litres (l) ~ 30ml

So for:

1 Standard serving of Wine = 12% * 5 * 0.03= 0.018 litres of pure alcohol
1 Standard serving of Beer = 5% * 12 * 0.03= 0.018 litres of pure alcohol
1 Standard serving of Spirit = 40% * 1.5 * 0.03 = 0.018 litres of pure alcohol



In [48]:

    
df.dtypes









    Out[48]:





year          int64
value       float64
country      object
beverage     object
dtype: object



In [49]:

    
df['serving'] = round(df['value']/0.018, 0)



In [50]:

    
df.head()









    Out[50]:






  
    
      
      year
      value
      country
      beverage
      serving
    
  
  
    
      0
      2015
      NaN
      Afghanistan
      all
      NaN
    
    
      1
      2015
      NaN
      Afghanistan
      beer
      NaN
    
    
      2
      2015
      NaN
      Afghanistan
      wine
      NaN
    
    
      3
      2015
      NaN
      Afghanistan
      spirits
      NaN
    
    
      4
      2015
      NaN
      Afghanistan
      others
      NaN

Principle: filter for rows in a dataframe

To select the rows from the dataframe



In [51]:

    
df2010 = df[df.year == 2010]



In [52]:

    
df2010.head()









    Out[52]:






  
    
      
      year
      value
      country
      beverage
      serving
    
  
  
    
      4920
      2010
      0.01
      Afghanistan
      all
      1.0
    
    
      4921
      2010
      0.01
      Afghanistan
      beer
      1.0
    
    
      4922
      2010
      0.00
      Afghanistan
      wine
      0.0
    
    
      4923
      2010
      0.00
      Afghanistan
      spirits
      0.0
    
    
      4924
      2010
      0.00
      Afghanistan
      others
      0.0



In [53]:

    
dfSing = df[df.country == 'Singapore']



In [54]:

    
df2010Sing = df[(df.year == 2010) & (df.country == 'Singapore')]



In [55]:

    
df2010Sing.head()









    Out[55]:






  
    
      
      year
      value
      country
      beverage
      serving
    
  
  
    
      5709
      2010
      1.84
      Singapore
      all
      102.0
    
    
      5710
      2010
      1.29
      Singapore
      beer
      72.0
    
    
      5711
      2010
      0.25
      Singapore
      wine
      14.0
    
    
      5712
      2010
      0.27
      Singapore
      spirits
      15.0
    
    
      5713
      2010
      0.03
      Singapore
      others
      2.0

PRINCIPLE: Pivot Table

Pivot table is a way to summarize data frame data into rows, columns and value



In [56]:

    
# Let us create a pivot for just serving in 2010
df2010Serving = pd.pivot_table(df2010, values = "serving", columns = "beverage", index = "country")



In [57]:

    
df2010Serving = df2010Serving.reset_index()



In [58]:

    
df2010Serving.head()









    Out[58]:






  
    
      beverage
      country
      all
      beer
      others
      spirits
      wine
    
  
  
    
      0
      Afghanistan
      1.0
      1.0
      0.0
      0.0
      0.0
    
    
      1
      Albania
      293.0
      89.0
      6.0
      124.0
      73.0
    
    
      2
      Algeria
      25.0
      9.0
      0.0
      2.0
      13.0
    
    
      3
      Andorra
      589.0
      197.0
      0.0
      149.0
      244.0
    
    
      4
      Angola
      433.0
      284.0
      4.0
      36.0
      108.0



In [59]:

    
dfSing.head()









    Out[59]:






  
    
      
      year
      value
      country
      beverage
      serving
    
  
  
    
      789
      2015
      1.79
      Singapore
      all
      99.0
    
    
      790
      2015
      1.24
      Singapore
      beer
      69.0
    
    
      791
      2015
      0.27
      Singapore
      wine
      15.0
    
    
      792
      2015
      0.25
      Singapore
      spirits
      14.0
    
    
      793
      2015
      0.03
      Singapore
      others
      2.0



In [60]:

    
# Let us create a pivot for just serving in 2010
dfSingServing = pd.pivot_table(dfSing, values = "serving", columns = "beverage", index = "year")



In [61]:

    
dfSingServing = dfSingServing.reset_index()



In [62]:

    
dfSingServing.head()

4. Explore the Data

"I don't know, what I don't know"

Understand Data Structure & Types
Explore single variable graphs - (Quantitative, Categorical)
Explore dual variable graphs - (Q & Q, Q & C, C & C)
Explore multi variable graphs

We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.



In [63]:

    
# Load the visualisation libraries - Matplotlib
import matplotlib.pyplot as plt



In [64]:

    
# Let us see the output plots in the notebook itself
%matplotlib inline



In [65]:

    
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)



In [66]:

    
# Sort them on Beer values
df2010Serving.sort_values(by = "spirits", ascending = False, inplace=True)



In [67]:

    
df2010Serving.head(10)









    Out[67]:






  
    
      beverage
      country
      all
      beer
      others
      spirits
      wine
    
  
  
    
      15
      Belarus
      802.0
      139.0
      248.0
      374.0
      42.0
    
    
      50
      Dominica
      414.0
      27.0
      4.0
      342.0
      41.0
    
    
      39
      Cook Islands
      596.0
      115.0
      0.0
      322.0
      159.0
    
    
      72
      Guyana
      418.0
      97.0
      2.0
      319.0
      1.0
    
    
      73
      Haiti
      320.0
      1.0
      0.0
      318.0
      1.0
    
    
      144
      Saint Lucia
      604.0
      182.0
      33.0
      308.0
      81.0
    
    
      141
      Russian Federation
      614.0
      228.0
      7.0
      307.0
      71.0
    
    
      25
      Bulgaria
      600.0
      223.0
      2.0
      279.0
      96.0
    
    
      44
      Cyprus
      629.0
      210.0
      0.0
      270.0
      148.0
    
    
      155
      Slovakia
      563.0
      190.0
      19.0
      256.0
      98.0



In [68]:

    
# Plot the Data
df2010Serving.head(40).plot(kind ="barh", x = 'country', y = 'beer')









    Out[68]:





<matplotlib.axes._subplots.AxesSubplot at 0x10e5552e8>



In [69]:

    
# Plot the histogram
df2010Serving.beer.plot(kind ="hist",bins = 30, alpha=0.5)









    Out[69]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d914240>



In [70]:

    
# Plots the Singapore Data
dfSingServing.plot(kind = "line", x = "year", y = ['all', 'beer', 'wine', 'spirits', 'others'])









    Out[70]:





<matplotlib.axes._subplots.AxesSubplot at 0x10dcf1e10>

Model the Data

"All models are wrong, Some of them are useful"

Statistical testing
The power and limits of models
Tradeoff between Prediction Accuracy and Model Interpretability
Assessing Model Accuracy
Regression models (Simple, Multiple)
Classification model

We want to test whether the alcohol consumption has really declined in recent times. To do that we will do a t-test, in which we will check whether the consumption before and after 1990 are really different.

https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test https://en.wikipedia.org/wiki/Welch%27s_t-test



In [71]:

    
dfSing.head()









    Out[71]:






  
    
      
      year
      value
      country
      beverage
      serving
    
  
  
    
      789
      2015
      1.79
      Singapore
      all
      99.0
    
    
      790
      2015
      1.24
      Singapore
      beer
      69.0
    
    
      791
      2015
      0.27
      Singapore
      wine
      15.0
    
    
      792
      2015
      0.25
      Singapore
      spirits
      14.0
    
    
      793
      2015
      0.03
      Singapore
      others
      2.0



In [72]:

    
dfSingAll = dfSing[dfSing.beverage == 'all'].copy()



In [73]:

    
dfSingAll.head()









    Out[73]:






  
    
      
      year
      value
      country
      beverage
      serving
    
  
  
    
      789
      2015
      1.79
      Singapore
      all
      99.0
    
    
      1773
      2014
      1.83
      Singapore
      all
      102.0
    
    
      2757
      2013
      1.83
      Singapore
      all
      102.0
    
    
      3741
      2012
      1.89
      Singapore
      all
      105.0
    
    
      4725
      2011
      1.80
      Singapore
      all
      100.0



In [74]:

    
# Create a new column
dfSingAll['split'] = dfSingAll.year < 1990



In [75]:

    
dfSingAll.head()









    Out[75]:






  
    
      
      year
      value
      country
      beverage
      serving
      split
    
  
  
    
      789
      2015
      1.79
      Singapore
      all
      99.0
      False
    
    
      1773
      2014
      1.83
      Singapore
      all
      102.0
      False
    
    
      2757
      2013
      1.83
      Singapore
      all
      102.0
      False
    
    
      3741
      2012
      1.89
      Singapore
      all
      105.0
      False
    
    
      4725
      2011
      1.80
      Singapore
      all
      100.0
      False



In [76]:

    
# Let us plot the two samples
dfSingAll.hist(column = "serving", by = "split", sharex = True, sharey= True)









    Out[76]:





array([<matplotlib.axes._subplots.AxesSubplot object at 0x10dd951d0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10e18b978>], dtype=object)



In [77]:

    
from scipy import stats
np.random.seed(12345678)



In [78]:

    
sampleA = dfSingAll[dfSingAll.split == True].serving
sampleB = dfSingAll[dfSingAll.split == False].serving



In [79]:

    
sampleA.shape









    Out[79]:





(30,)



In [80]:

    
sampleB.shape









    Out[80]:





(26,)



In [81]:

    
stats.ttest_ind(sampleA, sampleB, equal_var = False, nan_policy = 'omit')









    Out[81]:





Ttest_indResult(statistic=5.7188435645402418, pvalue=6.8300746615946395e-07)

"The goal is to turn data into insight"

Why do we need to communicate insight?
Types of communication - Exploration vs. Explanation
Explanation: Telling a story with data
Exploration: Building an interface for people to find stories



In [82]:

    
df2010Serving.head()









    Out[82]:






  
    
      beverage
      country
      all
      beer
      others
      spirits
      wine
    
  
  
    
      15
      Belarus
      802.0
      139.0
      248.0
      374.0
      42.0
    
    
      50
      Dominica
      414.0
      27.0
      4.0
      342.0
      41.0
    
    
      39
      Cook Islands
      596.0
      115.0
      0.0
      322.0
      159.0
    
    
      72
      Guyana
      418.0
      97.0
      2.0
      319.0
      1.0
    
    
      73
      Haiti
      320.0
      1.0
      0.0
      318.0
      1.0



In [83]:

    
beerMean = df2010Serving.beer.mean()
beerMean









    Out[83]:





108.95238095238095



In [84]:

    
wineMean = df2010Serving.wine.mean()
wineMean









    Out[84]:





56.798941798941797



In [85]:

    
df2010Serving.plot(kind = "scatter", x ="beer", y= "wine", s = df2010Serving['all'], alpha = 0.7)
plt.axvline(beerMean, color='r')
plt.axhline(wineMean, color='r')









    Out[85]:





<matplotlib.lines.Line2D at 0x111cf3240>

	Country; Data Source; Beverage Types	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2015	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2014	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2013	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2012	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2011	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2010	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2009	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2008	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2007	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2006	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2005	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2004	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2003	Recorded alcohol per capita (15+) consumption (in litres of pure alcohol); 2002
0	Afghanistan; Data source; All types	NaN	NaN	0.01	0.01	0.01	0.01	0.01	0.03	0.02	0.03	0.02	0.02	0.01	0.01
1	Afghanistan; Data source; Beer	NaN	NaN	0.01	0.01	0.01	0.01	0.01	0.01	0.01	0.01	0.01	0.02	0	0
2	Afghanistan; Data source; Wine	NaN	NaN	0.00	0.00	0.00	0	0	0	0.01	0.01	0	0	0	0
3	Afghanistan; Data source; Spirits	NaN	NaN	0.00	0.00	0.00	0	0	0.02	0	0.01	0.01	0	0.01	0
4	Afghanistan; Data source; Other alcoholic beve...	NaN	NaN	0.00	0.00	0.00	0	0	0	0	0	0	0	0	0

	description	year	value
0	Afghanistan; Data source; All types	2015	NaN
1	Afghanistan; Data source; Beer	2015	NaN
2	Afghanistan; Data source; Wine	2015	NaN
3	Afghanistan; Data source; Spirits	2015	NaN
4	Afghanistan; Data source; Other alcoholic beve...	2015	NaN

	description	year	value	country
0	Afghanistan; Data source; All types	2015	NaN	Afghanistan
1	Afghanistan; Data source; Beer	2015	NaN	Afghanistan
2	Afghanistan; Data source; Wine	2015	NaN	Afghanistan
3	Afghanistan; Data source; Spirits	2015	NaN	Afghanistan
4	Afghanistan; Data source; Other alcoholic beve...	2015	NaN	Afghanistan

	description	year	value	country	beverage
13595	Zimbabwe; Data source; All types	1960	NaN	Zimbabwe	All types
13596	Zimbabwe; Data source; Beer	1960	NaN	Zimbabwe	Beer
13597	Zimbabwe; Data source; Wine	1960	NaN	Zimbabwe	Wine
13598	Zimbabwe; Data source; Spirits	1960	NaN	Zimbabwe	Spirits
13599	Zimbabwe; Data source; Other alcoholic beverages	1960	NaN	Zimbabwe	Other alcoholic beverages

	year	value	country	beverage	serving
4920	2010	0.01	Afghanistan	all	1.0
4921	2010	0.01	Afghanistan	beer	1.0
4922	2010	0.00	Afghanistan	wine	0.0
4923	2010	0.00	Afghanistan	spirits	0.0
4924	2010	0.00	Afghanistan	others	0.0

	year	value	country	beverage	serving
5709	2010	1.84	Singapore	all	102.0
5710	2010	1.29	Singapore	beer	72.0
5711	2010	0.25	Singapore	wine	14.0
5712	2010	0.27	Singapore	spirits	15.0
5713	2010	0.03	Singapore	others	2.0

beverage	country	all	beer	others	spirits	wine
0	Afghanistan	1.0	1.0	0.0	0.0	0.0
1	Albania	293.0	89.0	6.0	124.0	73.0
2	Algeria	25.0	9.0	0.0	2.0	13.0
3	Andorra	589.0	197.0	0.0	149.0	244.0
4	Angola	433.0	284.0	4.0	36.0	108.0

	year	value	country	beverage	serving
789	2015	1.79	Singapore	all	99.0
790	2015	1.24	Singapore	beer	69.0
791	2015	0.27	Singapore	wine	15.0
792	2015	0.25	Singapore	spirits	14.0
793	2015	0.03	Singapore	others	2.0

beverage	year	all	beer	others	spirits	wine
0	1960	NaN	NaN	NaN	NaN	NaN
1	1961	122.0	84.0	10.0	26.0	2.0
2	1962	109.0	56.0	27.0	25.0	2.0
3	1963	123.0	64.0	27.0	29.0	2.0
4	1964	104.0	61.0	18.0	23.0	2.0

beverage	country	all	beer	others	spirits	wine
15	Belarus	802.0	139.0	248.0	374.0	42.0
50	Dominica	414.0	27.0	4.0	342.0	41.0
39	Cook Islands	596.0	115.0	0.0	322.0	159.0
72	Guyana	418.0	97.0	2.0	319.0	1.0
73	Haiti	320.0	1.0	0.0	318.0	1.0
144	Saint Lucia	604.0	182.0	33.0	308.0	81.0
141	Russian Federation	614.0	228.0	7.0	307.0	71.0
25	Bulgaria	600.0	223.0	2.0	279.0	96.0
44	Cyprus	629.0	210.0	0.0	270.0	148.0
155	Slovakia	563.0	190.0	19.0	256.0	98.0