The dataset chosen is Gapminder. You can check the codebook clicking here.
As we saw in Week 2 assignment, a lot of countries have missing values in the variables chosen. There is no workaround here like in the videos because they are not "not aplicable" values, just values that the Gapminder was not able to compute, either the data is not available in the source country, either the source is not trusted. But, in order to try to fulfil the assignment requirements, I won't drop countries with missing values this time, and decide case by case what to do.
To use this program you must have Python 3.3+ and IPython Notebook 1.0+ installed.
In [1]:
%pylab inline
# This package is very useful to data analysis in Python.
import pandas as pd
# Read the csv file to a dataframe object.
df = pd.read_csv('data/gapminder.csv')
# Convert all number values to float.
df = df.convert_objects(convert_numeric=True)
# Define the Country as the unique id of the dataframe.
df.index = df.country
del df['country']
# List of the variables selected.
vars_sel = ['polityscore', 'oilperperson', 'relectricperperson', 'employrate',
'lifeexpectancy', 'armedforcesrate', 'urbanrate', 'femaleemployrate']
# Dataframe with only the variables selected.
dfs = df[vars_sel]
In order to fulfil the assignment let's create a new variable combining oilperperson
and lifeexpectancy
variables. Let's calculate the expected tonnes of oil a person can consume in his entire life, supposing this consumptions begins with 20 years. This new variable, oil_consumed
, is then stated as:
In [2]:
# Calculate the new variable.
df['oil_consumed'] = df.oilperperson * (df.lifeexpectancy - 20)
# Print an overview of this new variable.
df.oil_consumed.describe()
Out[2]:
Another variable is the number of womem in the army. We take the assumption that this can be calculated as follows:
$$womem\_army\_rate = armedforcesrate * femaleemployrate/100$$
In [3]:
# Calculate the new variable.
df['womem_army_rate'] = df.armedforcesrate * df.femaleemployrate/100
# Print an overview of this new variable.
df.womem_army_rate.describe()
Out[3]:
As we can see above, nearly 75% of countries army have less than 1% womem in their ranks.
This variable will be a categorization of the continuous variable urbanrate
.
In [4]:
df['urban_level'] = pd.cut(df.urbanrate, [ 0, 20, 50, 70, 90,100],
labels=['very low', 'low','medium','high', 'very high'])
Here I show you an overview of the variables create. Here you can see the extreme values, mean, standard deviation, and and quatiles 25%, 50% (median) and 75%. Note that the count
row actualy count the non missing values. If there weren't missing values, count would be 213, the number of countries in this dataset. Finally, the variabe urban_level
is not shown here because it's categorical and pandas describe()
function only display numeric data. Well see its Frequency Table in the properly section.
In [5]:
df[['oil_consumed','womem_army_rate']].describe()
Out[5]:
Pandas DataFrame has a nice feature to count the missin values. The function isnull()
returns a boolean array where the value TRUE
is set if the variable is missing and FALSE
otherwise. As Python takes the TRUE
value boolean value as the integer 1 and FALSE
the integer 0, we just have to call sum()
to wrap up the results.
In [12]:
df.isnull().sum()
Out[12]:
The Gapminder Dataset has 213 countries listed. Let's calculate the percentage of missing countries in each variable.
In [7]:
# Count the missing variables.
missing_values = dfs.isnull().sum()
# Calculate the respective percentage.
percent = (missing_values/missing_values.sum()*100).round(2)
# Create a pandas.DataFrame object to pretty print the results.
mdf = pd.DataFrame({'Count':missing_values, 'Percentage': percent})
mdf.index.name = 'Variable'
mdf
Out[7]:
We can see above that the oilperperson
is the most missing varible, accounting to drop 150 countries. Maybe we shouldn't use use this variable anymore.
In [8]:
# Number of bins to bucket te continuous variables.
NUM_BUCKETS = 10
# Helper function to print the frequency values as an HTML table.
def print_freq_table(series):
# Count the frequency of values.
#This is a pandas.Series object.
x = series.value_counts()
# Sort the table by the values taken, rather than
# the default descending order of frequencies.
x = x.sort_index()
# Convert the pandas.Series.object to pandas.DataFrame in order to
# name the values and frequencies properly.
x = pd.DataFrame(x)
x['% Frequency'] = (x/x.sum()).round(2)*100
x.columns = ['Frequency', '% Frequency']
x.index.name = 'Values'
# Finally, return the object. If run in a IPython Notebook, it will
# print a nice HTML table.
return x
In [9]:
print_freq_table(pd.cut(df.oil_consumed, NUM_BUCKETS))
Out[9]:
In [10]:
NUM_BUCKETS = 10
print_freq_table(pd.cut(df.womem_army_rate, NUM_BUCKETS))
Out[10]:
In [11]:
print_freq_table(df['urban_level'])
Out[11]:
End of assignment.