1. Week 2 assignment

The dataset chosen was Gapminder. You can check the codebook by clicking here.

To use this program you must have Python 2.7+ and IPython Notebook 1.0+ installed.


In [1]:
# pandas is the standard package for data analysis in Python.
import pandas as pd

# Read the csv file to a dataframe object.
df = pd.read_csv('data/gapminder.csv')

# Define the country as the unique id of the dataframe.
df = df.set_index('country')

# Convert all values to numeric; non-numeric entries become NaN.
# (pd.to_numeric replaces the deprecated convert_objects.)
df = df.apply(pd.to_numeric, errors='coerce')

# List of the variables selected.
vars_sel = ['polityscore', 'oilperperson', 'relectricperperson', 'employrate', 
            'lifeexpectancy', 'armedforcesrate', 'urbanrate', 'femaleemployrate']


# Dataframe with only the variables selected.
dfs = df[vars_sel]

# Number of countries before removing those with missing values.
n_countries_before = dfs.shape[0]

# Remove all countries that have at least one variable missing.
dfs = dfs.dropna()

# Number of countries after removing those with missing values.
n_countries_after = dfs.shape[0]


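The cell above counts the rows before and after `dropna` but never displays them. A minimal, self-contained sketch of that bookkeeping, using a made-up toy frame rather than the real Gapminder data (all names and values here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Gapminder selection above
# (the real data lives in data/gapminder.csv; these values are made up).
toy = pd.DataFrame(
    {'polityscore':    [10,   np.nan, 8,      -7],
     'lifeexpectancy': [81.0, 75.0,   np.nan, 55.0]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='country'))

n_before = toy.shape[0]
toy = toy.dropna()          # drop every row with at least one missing value
n_after = toy.shape[0]

print(n_before, n_after, n_before - n_after)
```

The difference of the two counts is exactly how many countries the missing-value filter cost us.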

1.1 Frequency tables

Here I show the frequency tables for four variables I've chosen. Please note that almost all the variables in this dataset take continuous values, which makes the frequency tables show many rows with a frequency of 1.

Below I construct a helper function to pretty-print the tables, using the default HTML output of pandas.DataFrame objects when displayed in IPython Notebook.


In [2]:
# Helper function to print the frequency values as an HTML table.
def print_freq_table(series):
    # Count the frequency of each value.
    # The result is a pandas.Series object.
    x = series.value_counts()

    # Sort the table by the values taken, rather than
    # the default descending order of frequencies.
    x = x.sort_index()

    # Convert the pandas.Series to a pandas.DataFrame in order to
    # name the values and frequencies properly.
    x = pd.DataFrame(x)
    x['% Frequency'] = (x/x.sum()).round(2)*100
    x.columns = ['Frequency', '% Frequency']
    x.index.name = 'Values'

    # Finally, return the object. If run in an IPython Notebook, it
    # will print a nice HTML table.
    return x
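Before applying the helper to the Gapminder variables, here is how the same logic behaves on a toy Series. This is a condensed, self-contained variant of the helper (the function name and input values are illustrative, not part of the assignment):

```python
import pandas as pd

def freq_table(series):
    # Same logic as print_freq_table above, condensed for the example.
    counts = series.value_counts().sort_index()
    table = pd.DataFrame({'Frequency': counts})
    table['% Frequency'] = (counts / counts.sum()).round(2) * 100
    table.index.name = 'Values'
    return table

toy = pd.Series([1, 1, 2, 3, 3, 3])
table = freq_table(toy)
print(table)
```

With six observations, the value 3 appears three times, so its row shows a frequency of 3 and a relative frequency of 50%.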

1.1.1 Variable "polityscore"

This variable takes integer values between -10 and 10. After the filter applied above, which removed every country with at least one missing variable, 60% of the remaining countries score 8 or above.


In [3]:
print_freq_table(dfs['polityscore'])


Out[3]:
Frequency % Frequency
Values
-10 2 3
-9 2 3
-8 1 2
-7 5 8
-6 1 2
-3 2 3
-2 1 2
2 1 2
4 2 3
5 3 5
6 1 2
7 3 5
8 7 11
9 6 10
10 24 39
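The 60% figure quoted above is just the share of scores at or above 8. The computation pattern, sketched on toy values rather than the real polityscore column:

```python
import pandas as pd

# Toy stand-in for dfs['polityscore'] (illustrative values only).
scores = pd.Series([10, 10, 9, 8, 7, -7, 5, 10, -10, 8])

# The mean of a boolean Series is the fraction of True values,
# i.e. the share of countries scoring 8 or above.
share = (scores >= 8).mean()
print(round(share * 100), '%')
```

In the real table this corresponds to summing the % Frequency column over the rows for values 8, 9 and 10.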

1.1.2 Variable "lifeexpectancy"

The values span a continuous interval, so a raw frequency table is not a good tool to visualize the distribution. Let's organize the values into 10 buckets to convert from continuous to categorical values. The result is that 57% of countries have a life expectancy above 74 years. This raises the suspicion that filtering out countries with any missing value biases the selection toward healthier countries.


In [4]:
NUM_BUCKETS = 10
print_freq_table(pd.cut(dfs.lifeexpectancy, NUM_BUCKETS))


Out[4]:
Frequency % Frequency
Values
(52.766, 55.857] 1 2
(55.857, 58.916] 0 0
(58.916, 61.976] 0 0
(61.976, 65.0358] 1 2
(65.0358, 68.0955] 3 5
(68.0955, 71.155] 8 13
(71.155, 74.215] 13 21
(74.215, 77.275] 10 16
(77.275, 80.334] 9 15
(80.334, 83.394] 16 26

1.1.3 Variable "oilperperson"

Another variable spanning a continuous interval. This time 87% of countries consume less than 2.47 tonnes of oil per person per year, and only one country falls in the top consumption bucket.


In [5]:
print_freq_table(pd.cut(dfs.oilperperson, NUM_BUCKETS))


Out[5]:
Frequency % Frequency
Values
(0.0201, 1.252] 38 62
(1.252, 2.472] 15 25
(2.472, 3.691] 3 5
(3.691, 4.911] 3 5
(4.911, 6.13] 0 0
(6.13, 7.35] 1 2
(7.35, 8.57] 0 0
(8.57, 9.789] 0 0
(9.789, 11.00901] 0 0
(11.00901, 12.229] 1 2

1.1.4 Variable "armedforcesrate"

In this variable we see that the majority of countries, 85%, have 2.12% or less of their labor force in the armed forces.


In [6]:
print_freq_table(pd.cut(dfs.armedforcesrate, NUM_BUCKETS))


Out[6]:
Frequency % Frequency
Values
(0.282, 0.899] 26 43
(0.899, 1.509] 16 26
(1.509, 2.12] 10 16
(2.12, 2.731] 4 7
(2.731, 3.341] 3 5
(3.341, 3.952] 1 2
(3.952, 4.563] 0 0
(4.563, 5.174] 0 0
(5.174, 5.784] 0 0
(5.784, 6.395] 1 2
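The "majority at or below 2.12%" reading is a cumulative frequency over the first three buckets. The pattern, sketched with toy values standing in for the armedforcesrate column:

```python
import pandas as pd

# Toy stand-in for dfs.armedforcesrate (illustrative values only).
rates = pd.Series([0.3, 0.5, 0.9, 1.2, 1.5, 2.0, 2.5, 3.3, 6.4])

# Bucket the continuous values, then accumulate bucket counts to get
# the share of observations up to the top edge of each bucket.
buckets = pd.cut(rates, 3)
cum_share = buckets.value_counts().sort_index().cumsum() / len(rates)
print(cum_share)
```

The last entry of the cumulative share is always 1.0; the first entries tell you how quickly the distribution's mass concentrates at the low end.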

1.2 Overview of the variables selected

Here is an overview of the variables selected: the extreme values, mean, standard deviation, and the 25%, 50% (median) and 75% quantiles.


In [7]:
dfs.describe()


Out[7]:
polityscore oilperperson relectricperperson employrate lifeexpectancy armedforcesrate urbanrate femaleemployrate
count 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000
mean 5.032787 1.462429 1739.275506 57.868853 75.283246 1.289931 69.059672 46.413114
std 6.845844 1.851035 2086.279787 7.704253 5.708741 1.046546 16.971473 11.104631
min -10.000000 0.032281 68.115229 41.099998 52.797000 0.287892 27.140000 17.700001
25% 4.000000 0.504659 498.165305 52.500000 73.131000 0.560987 61.000000 41.700001
50% 8.000000 0.890813 921.562111 58.500000 75.446000 0.972378 70.360000 47.599998
75% 10.000000 1.593223 2051.802338 62.400002 80.414000 1.649451 81.460000 53.900002
max 10.000000 12.228645 11154.755033 76.000000 83.394000 6.394936 100.000000 68.900002

End of the assignment.