1. Week 3 assignment

The dataset chosen is Gapminder. You can check the codebook clicking here.

As we saw in Week 2 assignment, a lot of countries have missing values in the variables chosen. There is no workaround here like in the videos because they are not "not aplicable" values, just values that the Gapminder was not able to compute, either the data is not available in the source country, either the source is not trusted. But, in order to try to fulfil the assignment requirements, I won't drop countries with missing values this time, and decide case by case what to do.

To use this program you must have Python 3.3+ and IPython Notebook 1.0+ installed.



In [1]:

    
%pylab inline

# This package is very useful to data analysis in Python.
import pandas as pd

# Read the csv file to a dataframe object.
df = pd.read_csv('data/gapminder.csv')

# Convert all number values to float.
df = df.convert_objects(convert_numeric=True)

# Define the Country as the unique id of the dataframe.
df.index = df.country
del df['country']

# List of the variables selected.
vars_sel = ['polityscore', 'oilperperson', 'relectricperperson', 'employrate', 
            'lifeexpectancy', 'armedforcesrate', 'urbanrate', 'femaleemployrate']


# Dataframe with only the variables selected.
dfs = df[vars_sel]









    



Populating the interactive namespace from numpy and matplotlib






    



/Users/sergio/anaconda3/lib/python3.4/site-packages/IPython/kernel/__main__.py:10: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

1.1 Calculate new variables

1.1.1 New variable `oil_consumed`

In order to fulfil the assignment let's create a new variable combining oilperperson and lifeexpectancy variables. Let's calculate the expected tonnes of oil a person can consume in his entire life, supposing this consumptions begins with 20 years. This new variable, oil_consumed, is then stated as:

$$oil\_consumed = oilperperson * (lifeexpectancy-20)$$



In [2]:

    
# Calculate the new variable.
df['oil_consumed'] = df.oilperperson * (df.lifeexpectancy - 20)

# Print an overview of this new variable.
df.oil_consumed.describe()









    Out[2]:





count     62.000000
mean      85.523760
std      110.124175
min        1.579984
25%       25.343760
50%       50.492618
75%       98.855828
max      747.488154
Name: oil_consumed, dtype: float64

1.1.2 New variable `womem_army_rate`

Another variable is the number of womem in the army. We take the assumption that this can be calculated as follows:

$$womem\_army\_rate = armedforcesrate * femaleemployrate/100$$



In [3]:

    
# Calculate the new variable.
df['womem_army_rate'] = df.armedforcesrate * df.femaleemployrate/100

# Print an overview of this new variable.
df.womem_army_rate.describe()









    Out[3]:





count    163.000000
mean       0.598049
std        0.760118
min        0.000000
25%        0.228521
50%        0.408831
75%        0.671941
max        5.957572
Name: womem_army_rate, dtype: float64

As we can see above, nearly 75% of countries army have less than 1% womem in their ranks.

1.1.3 New variable `urban_level`

This variable will be a categorization of the continuous variable urbanrate.



In [4]:

    
df['urban_level'] = pd.cut(df.urbanrate, [ 0, 20, 50, 70, 90,100], 
                           labels=['very low', 'low','medium','high', 'very high'])

1.2 Overview of the variables created

Here I show you an overview of the variables create. Here you can see the extreme values, mean, standard deviation, and and quatiles 25%, 50% (median) and 75%. Note that the count row actualy count the non missing values. If there weren't missing values, count would be 213, the number of countries in this dataset. Finally, the variabe urban_level is not shown here because it's categorical and pandas describe() function only display numeric data. Well see its Frequency Table in the properly section.



In [5]:

    
df[['oil_consumed','womem_army_rate']].describe()









    Out[5]:






  
    
      
      oil_consumed
      womem_army_rate
    
  
  
    
      count
      62.000000
      163.000000
    
    
      mean
      85.523760
      0.598049
    
    
      std
      110.124175
      0.760118
    
    
      min
      1.579984
      0.000000
    
    
      25%
      25.343760
      0.228521
    
    
      50%
      50.492618
      0.408831
    
    
      75%
      98.855828
      0.671941
    
    
      max
      747.488154
      5.957572

1.3 Counting the missing values in each variable

Pandas DataFrame has a nice feature to count the missin values. The function isnull() returns a boolean array where the value TRUE is set if the variable is missing and FALSE otherwise. As Python takes the TRUE value boolean value as the integer 1 and FALSE the integer 0, we just have to call sum() to wrap up the results.



In [12]:

    
df.isnull().sum()









    Out[12]:





incomeperperson          23
alcconsumption           26
armedforcesrate          49
breastcancerper100th     40
co2emissions             13
femaleemployrate         35
hivrate                  66
internetuserate          21
lifeexpectancy           22
oilperperson            150
polityscore              52
relectricperperson       77
suicideper100th          22
employrate               35
urbanrate                10
oil_consumed            151
womem_army_rate          50
urban_level              10
dtype: int64

The Gapminder Dataset has 213 countries listed. Let's calculate the percentage of missing countries in each variable.



In [7]:

    
# Count the missing variables.
missing_values = dfs.isnull().sum()

# Calculate the respective percentage.
percent = (missing_values/missing_values.sum()*100).round(2)

# Create a pandas.DataFrame object to pretty print the results.
mdf = pd.DataFrame({'Count':missing_values, 'Percentage': percent})
mdf.index.name = 'Variable'
mdf









    Out[7]:






  
    
      
      Count
      Percentage
    
    
      Variable
      
      
    
  
  
    
      polityscore
      52
      12.09
    
    
      oilperperson
      150
      34.88
    
    
      relectricperperson
      77
      17.91
    
    
      employrate
      35
      8.14
    
    
      lifeexpectancy
      22
      5.12
    
    
      armedforcesrate
      49
      11.40
    
    
      urbanrate
      10
      2.33
    
    
      femaleemployrate
      35
      8.14

We can see above that the oilperperson is the most missing varible, accounting to drop 150 countries. Maybe we shouldn't use use this variable anymore.

1.4 Frequency Tables



In [8]:

    
# Number of bins to bucket te continuous variables.
NUM_BUCKETS = 10

# Helper function to print the frequency values as an HTML table.
def print_freq_table(series):
    # Count the frequency of values. 
    #This is a pandas.Series object.
    x = series.value_counts()
    
    # Sort the table by the values taken, rather than 
    # the default descending order of frequencies.
    x = x.sort_index()
    
    # Convert the pandas.Series.object to pandas.DataFrame in order to
    # name the values and frequencies properly. 
    x = pd.DataFrame(x)
    x['% Frequency'] = (x/x.sum()).round(2)*100
    x.columns = ['Frequency', '% Frequency']
    x.index.name = 'Values'
    
    # Finally, return the object. If run in a IPython Notebook, it will
    # print a nice HTML table.
    return x

1.5 Frequency tables

1.5.1 Variable "oil_consumed"



In [9]:

    
print_freq_table(pd.cut(df.oil_consumed, NUM_BUCKETS))









    Out[9]:






  
    
      
      Frequency
      % Frequency
    
    
      Values
      
      
    
  
  
    
      (0.834, 76.171]
      38
      61
    
    
      (76.171, 150.762]
      16
      26
    
    
      (150.762, 225.352]
      3
      5
    
    
      (225.352, 299.943]
      3
      5
    
    
      (299.943, 374.534]
      1
      2
    
    
      (374.534, 449.125]
      0
      0
    
    
      (449.125, 523.716]
      0
      0
    
    
      (523.716, 598.307]
      0
      0
    
    
      (598.307, 672.897]
      0
      0
    
    
      (672.897, 747.488]
      1
      2

1.5.2 Variable "womem_army_rate"



In [10]:

    
NUM_BUCKETS = 10
print_freq_table(pd.cut(df.womem_army_rate, NUM_BUCKETS))









    Out[10]:






  
    
      
      Frequency
      % Frequency
    
    
      Values
      
      
    
  
  
    
      (-0.00596, 0.596]
      114
      70
    
    
      (0.596, 1.192]
      35
      21
    
    
      (1.192, 1.787]
      6
      4
    
    
      (1.787, 2.383]
      2
      1
    
    
      (2.383, 2.979]
      2
      1
    
    
      (2.979, 3.575]
      2
      1
    
    
      (3.575, 4.17]
      0
      0
    
    
      (4.17, 4.766]
      0
      0
    
    
      (4.766, 5.362]
      1
      1
    
    
      (5.362, 5.958]
      1
      1

1.5.3 Variable "urban_level"



In [11]:

    
print_freq_table(df['urban_level'])









    Out[11]:






  
    
      
      Frequency
      % Frequency
    
    
      Values
      
      
    
  
  
    
      very low
      13
      6
    
    
      low
      68
      33
    
    
      medium
      58
      29
    
    
      high
      45
      22
    
    
      very high
      19
      9

End of assignment.

	oil_consumed	womem_army_rate
count	62.000000	163.000000
mean	85.523760	0.598049
std	110.124175	0.760118
min	1.579984	0.000000
25%	25.343760	0.228521
50%	50.492618	0.408831
75%	98.855828	0.671941
max	747.488154	5.957572

	Count	Percentage
Variable
polityscore	52	12.09
oilperperson	150	34.88
relectricperperson	77	17.91
employrate	35	8.14
lifeexpectancy	22	5.12
armedforcesrate	49	11.40
urbanrate	10	2.33
femaleemployrate	35	8.14

	Frequency	% Frequency
Values
(0.834, 76.171]	38	61
(76.171, 150.762]	16	26
(150.762, 225.352]	3	5
(225.352, 299.943]	3	5
(299.943, 374.534]	1	2
(374.534, 449.125]	0	0
(449.125, 523.716]	0	0
(523.716, 598.307]	0	0
(598.307, 672.897]	0	0
(672.897, 747.488]	1	2

	Frequency	% Frequency
Values
(-0.00596, 0.596]	114	70
(0.596, 1.192]	35	21
(1.192, 1.787]	6	4
(1.787, 2.383]	2	1
(2.383, 2.979]	2	1
(2.979, 3.575]	2	1
(3.575, 4.17]	0	0
(4.17, 4.766]	0	0
(4.766, 5.362]	1	1
(5.362, 5.958]	1	1

Table of Contents