The dataset chosen was Gapminder. You can check the codebook by clicking here.
To use this program you must have Python 2.7+ and IPython Notebook 1.0+ installed.
In [1]:
# This package is very useful for data analysis in Python.
import pandas as pd
# Read the csv file into a dataframe object.
df = pd.read_csv('data/gapminder.csv')
# Define the country as the unique id (index) of the dataframe.
df = df.set_index('country')
# Convert all number values to float; convert_objects is deprecated,
# so use to_numeric (non-numeric entries become NaN).
df = df.apply(pd.to_numeric, errors='coerce')
# List of the variables selected.
vars_sel = ['polityscore', 'oilperperson', 'relectricperperson', 'employrate',
'lifeexpectancy', 'armedforcesrate', 'urbanrate', 'femaleemployrate']
# Dataframe with only the variables selected.
dfs = df[vars_sel]
# Number of countries before the removal of those with missing variables.
n_countries_before = dfs.shape[0]
# Remove all countries that have at least one variable missing.
dfs = dfs.dropna()
# Number of countries after the removal of those with missing variables.
n_countries_after = dfs.shape[0]
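As a quick sanity check (a hypothetical snippet, not part of the original assignment), `dropna()` behaves the same way on a miniature frame, so you can report how many rows the filter removed:

```python
import pandas as pd
import numpy as np

# Hypothetical miniature frame standing in for the Gapminder subset.
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
before = demo.shape[0]
# Drop every row with at least one missing value.
demo = demo.dropna()
after = demo.shape[0]
print('Removed %d of %d rows' % (before - after, before))  # Removed 2 of 3 rows
```

The same `before - after` difference on the real dataframe tells you how many countries the missing-data filter discarded.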
Here I show you the frequency tables for the variables I've chosen. Please note that in this dataset almost all the variables take continuous values, which makes the frequency tables show several lines with a frequency of 1.
Below I construct a helper function to pretty-print the tables, using the default HTML output of pandas.DataFrame objects when displayed in an IPython Notebook.
In [2]:
# Helper function to print the frequency values as an HTML table.
def print_freq_table(series):
    # Count the frequency of values.
    # This is a pandas.Series object.
    x = series.value_counts()
    # Sort the table by the values taken, rather than
    # the default descending order of frequencies.
    x = x.sort_index()
    # Compute the relative frequency before converting to a DataFrame.
    pct = (x / x.sum()).round(2) * 100
    # Convert the pandas.Series object to a pandas.DataFrame in order to
    # name the values and frequencies properly.
    x = pd.DataFrame(x)
    x['% Frequency'] = pct
    x.columns = ['Frequency', '% Frequency']
    x.index.name = 'Values'
    # Finally, return the object. If run in an IPython Notebook, it will
    # print a nice HTML table.
    return x
This variable takes integer values between -10 and 10. After the filter applied above, which removed all countries with at least one variable missing, a large share of the remaining countries, 60%, score 8 or above.
In [3]:
print_freq_table(dfs['polityscore'])
Out[3]:
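That 60% figure can also be computed directly with a boolean mean. The snippet below is a sketch on made-up scores; on the real data you would apply the same expression to `dfs['polityscore']`:

```python
import pandas as pd

# Made-up polity-style scores for illustration only;
# the real values live in dfs['polityscore'].
scores = pd.Series([10, 9, 8, 7, -2, 8, 9, 6, 10, 5])
# Mean of a boolean series is the share of True values.
share_high = round((scores >= 8).mean() * 100, 1)
print(share_high)  # 60.0
```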
The values span a continuous interval, so the frequency table is not a good tool to visualize the distribution. Let's organize the values into 10 buckets, converting the continuous values to categorical ones. The result is that 57% of countries have a life expectancy above 74 years. This raises the suspicion that filtering out countries with any missing value biases the selection toward healthy countries.
In [4]:
NUM_BUCKETS = 10
print_freq_table(pd.cut(dfs.lifeexpectancy, NUM_BUCKETS))
Out[4]:
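The share above a cutoff can be cross-checked against the raw values rather than read off the buckets. This is a sketch on synthetic numbers, not the actual lifeexpectancy column:

```python
import pandas as pd

# Synthetic life-expectancy values for illustration only.
life = pd.Series([55.0, 62.0, 70.0, 75.0, 76.5, 78.0, 80.0])
# pd.cut splits the range into equal-width buckets.
buckets = pd.cut(life, 5)
print(buckets.value_counts().sort_index())
# Share of observations above a cutoff, computed from the raw values.
share_above_74 = (life > 74).mean() * 100
print(round(share_above_74, 1))
```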
Another variable that spans a continuous interval. This time 87% of countries consume less than 2.47 tonnes of oil per person per year, and only one country falls in the top bucket of oil consumption.
In [5]:
print_freq_table(pd.cut(dfs.oilperperson, NUM_BUCKETS))
Out[5]:
For this variable we see that the majority of countries commit 2.12% or less of their labor force to the armed forces.
In [6]:
print_freq_table(pd.cut(dfs.armedforcesrate, NUM_BUCKETS))
Out[6]:
Here I show you an overview of the variables selected. You can see the extreme values, mean, standard deviation, and the 25%, 50% (median), and 75% quartiles.
In [7]:
dfs.describe()
Out[7]:
End of the assignment.