1. Week 3 assignment

The dataset chosen is Gapminder. You can check the codebook clicking here.

As we saw in Week 2 assignment, a lot of countries have missing values in the variables chosen. There is no workaround here like in the videos because they are not "not aplicable" values, just values that the Gapminder was not able to compute, either the data is not available in the source country, either the source is not trusted. But, in order to try to fulfil the assignment requirements, I won't drop countries with missing values this time, and decide case by case what to do.

To use this program you must have Python 3.3+ and IPython Notebook 1.0+ installed.


In [1]:
%pylab inline

# This package is very useful to data analysis in Python.
import pandas as pd

# Read the csv file to a dataframe object.
df = pd.read_csv('data/gapminder.csv')

# Convert all number values to float.
df = df.convert_objects(convert_numeric=True)

# Define the Country as the unique id of the dataframe.
df.index = df.country
del df['country']

# List of the variables selected.
vars_sel = ['polityscore', 'oilperperson', 'relectricperperson', 'employrate', 
            'lifeexpectancy', 'armedforcesrate', 'urbanrate', 'femaleemployrate']


# Dataframe with only the variables selected.
dfs = df[vars_sel]


Populating the interactive namespace from numpy and matplotlib
/Users/sergio/anaconda3/lib/python3.4/site-packages/IPython/kernel/__main__.py:10: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

1.1 Calculate new variables

1.1.1 New variable oil_consumed

In order to fulfil the assignment let's create a new variable combining oilperperson and lifeexpectancy variables. Let's calculate the expected tonnes of oil a person can consume in his entire life, supposing this consumptions begins with 20 years. This new variable, oil_consumed, is then stated as:

$$oil\_consumed = oilperperson * (lifeexpectancy-20)$$

In [2]:
# Calculate the new variable.
df['oil_consumed'] = df.oilperperson * (df.lifeexpectancy - 20)

# Print an overview of this new variable.
df.oil_consumed.describe()


Out[2]:
count     62.000000
mean      85.523760
std      110.124175
min        1.579984
25%       25.343760
50%       50.492618
75%       98.855828
max      747.488154
Name: oil_consumed, dtype: float64

1.1.2 New variable womem_army_rate

Another variable is the number of womem in the army. We take the assumption that this can be calculated as follows:

$$womem\_army\_rate = armedforcesrate * femaleemployrate/100$$

In [3]:
# Calculate the new variable.
df['womem_army_rate'] = df.armedforcesrate * df.femaleemployrate/100

# Print an overview of this new variable.
df.womem_army_rate.describe()


Out[3]:
count    163.000000
mean       0.598049
std        0.760118
min        0.000000
25%        0.228521
50%        0.408831
75%        0.671941
max        5.957572
Name: womem_army_rate, dtype: float64

As we can see above, nearly 75% of countries army have less than 1% womem in their ranks.

1.1.3 New variable urban_level

This variable will be a categorization of the continuous variable urbanrate.


In [4]:
df['urban_level'] = pd.cut(df.urbanrate, [ 0, 20, 50, 70, 90,100], 
                           labels=['very low', 'low','medium','high', 'very high'])

1.2 Overview of the variables created

Here I show you an overview of the variables create. Here you can see the extreme values, mean, standard deviation, and and quatiles 25%, 50% (median) and 75%. Note that the count row actualy count the non missing values. If there weren't missing values, count would be 213, the number of countries in this dataset. Finally, the variabe urban_level is not shown here because it's categorical and pandas describe() function only display numeric data. Well see its Frequency Table in the properly section.


In [5]:
df[['oil_consumed','womem_army_rate']].describe()


Out[5]:
oil_consumed womem_army_rate
count 62.000000 163.000000
mean 85.523760 0.598049
std 110.124175 0.760118
min 1.579984 0.000000
25% 25.343760 0.228521
50% 50.492618 0.408831
75% 98.855828 0.671941
max 747.488154 5.957572

1.3 Counting the missing values in each variable

Pandas DataFrame has a nice feature to count the missin values. The function isnull() returns a boolean array where the value TRUE is set if the variable is missing and FALSE otherwise. As Python takes the TRUE value boolean value as the integer 1 and FALSE the integer 0, we just have to call sum() to wrap up the results.


In [12]:
df.isnull().sum()


Out[12]:
incomeperperson          23
alcconsumption           26
armedforcesrate          49
breastcancerper100th     40
co2emissions             13
femaleemployrate         35
hivrate                  66
internetuserate          21
lifeexpectancy           22
oilperperson            150
polityscore              52
relectricperperson       77
suicideper100th          22
employrate               35
urbanrate                10
oil_consumed            151
womem_army_rate          50
urban_level              10
dtype: int64

The Gapminder Dataset has 213 countries listed. Let's calculate the percentage of missing countries in each variable.


In [7]:
# Count the missing variables.
missing_values = dfs.isnull().sum()

# Calculate the respective percentage.
percent = (missing_values/missing_values.sum()*100).round(2)

# Create a pandas.DataFrame object to pretty print the results.
mdf = pd.DataFrame({'Count':missing_values, 'Percentage': percent})
mdf.index.name = 'Variable'
mdf


Out[7]:
Count Percentage
Variable
polityscore 52 12.09
oilperperson 150 34.88
relectricperperson 77 17.91
employrate 35 8.14
lifeexpectancy 22 5.12
armedforcesrate 49 11.40
urbanrate 10 2.33
femaleemployrate 35 8.14

We can see above that the oilperperson is the most missing varible, accounting to drop 150 countries. Maybe we shouldn't use use this variable anymore.

1.4 Frequency Tables


In [8]:
# Number of bins to bucket te continuous variables.
NUM_BUCKETS = 10

# Helper function to print the frequency values as an HTML table.
def print_freq_table(series):
    # Count the frequency of values. 
    #This is a pandas.Series object.
    x = series.value_counts()
    
    # Sort the table by the values taken, rather than 
    # the default descending order of frequencies.
    x = x.sort_index()
    
    # Convert the pandas.Series.object to pandas.DataFrame in order to
    # name the values and frequencies properly. 
    x = pd.DataFrame(x)
    x['% Frequency'] = (x/x.sum()).round(2)*100
    x.columns = ['Frequency', '% Frequency']
    x.index.name = 'Values'
    
    # Finally, return the object. If run in a IPython Notebook, it will
    # print a nice HTML table.
    return x

1.5 Frequency tables

1.5.1 Variable "oil_consumed"


In [9]:
print_freq_table(pd.cut(df.oil_consumed, NUM_BUCKETS))


Out[9]:
Frequency % Frequency
Values
(0.834, 76.171] 38 61
(76.171, 150.762] 16 26
(150.762, 225.352] 3 5
(225.352, 299.943] 3 5
(299.943, 374.534] 1 2
(374.534, 449.125] 0 0
(449.125, 523.716] 0 0
(523.716, 598.307] 0 0
(598.307, 672.897] 0 0
(672.897, 747.488] 1 2

1.5.2 Variable "womem_army_rate"


In [10]:
NUM_BUCKETS = 10
print_freq_table(pd.cut(df.womem_army_rate, NUM_BUCKETS))


Out[10]:
Frequency % Frequency
Values
(-0.00596, 0.596] 114 70
(0.596, 1.192] 35 21
(1.192, 1.787] 6 4
(1.787, 2.383] 2 1
(2.383, 2.979] 2 1
(2.979, 3.575] 2 1
(3.575, 4.17] 0 0
(4.17, 4.766] 0 0
(4.766, 5.362] 1 1
(5.362, 5.958] 1 1

1.5.3 Variable "urban_level"


In [11]:
print_freq_table(df['urban_level'])


Out[11]:
Frequency % Frequency
Values
very low 13 6
low 68 33
medium 58 29
high 45 22
very high 19 9

End of assignment.