The dataset chosen was Gapminder. You can check the codebook by clicking here.
To use this program you must have Python 2.7+ and IPython Notebook 1.0+ installed.
In [1]:
# This package is very useful for data analysis in Python.
import pandas as pd
# Read the csv file into a dataframe object.
df = pd.read_csv('data/gapminder.csv')
# Define the country as the unique id (index) of the dataframe.
df = df.set_index('country')
# Convert all number values to float; convert_objects is deprecated,
# so use to_numeric (non-numeric entries become NaN).
df = df.apply(pd.to_numeric, errors='coerce')
# List of the variables selected.
vars_sel = ['polityscore', 'oilperperson', 'relectricperperson', 'employrate',
'lifeexpectancy', 'armedforcesrate', 'urbanrate', 'femaleemployrate']
# Dataframe with only the variables selected.
dfs = df[vars_sel]
# Number of countries before the removal of those with missing variables.
n_countries_before = dfs.shape[0]
# Remove all countries that have at least one variable missing.
dfs = dfs.dropna()
# Number of countries after the removal of those with missing variables.
n_countries_after = dfs.shape[0]
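As a quick sanity check (a hypothetical snippet, not part of the original assignment), `dropna()` behaves the same way on a miniature frame, so you can report how many rows the filter removed:

```python
import pandas as pd
import numpy as np

# Hypothetical miniature frame standing in for the Gapminder subset.
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
before = demo.shape[0]
# Drop every row with at least one missing value.
demo = demo.dropna()
after = demo.shape[0]
print('Removed %d of %d rows' % (before - after, before))  # Removed 2 of 3 rows
```

The same `before - after` difference on the real dataframe tells you how many countries the missing-data filter discarded.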
Here I show you the frequency tables for the variables I've chosen. Please note that in this dataset almost all the variables take continuous values, which makes the frequency tables show several lines with a frequency of 1.
Below I construct a helper function to pretty-print the tables, using the default HTML output of pandas.DataFrame objects when displayed in an IPython Notebook.
In [2]:
# Helper function to print the frequency values as an HTML table.
def print_freq_table(series):
    # Count the frequency of values.
    # This is a pandas.Series object.
    x = series.value_counts()
    # Sort the table by the values taken, rather than
    # the default descending order of frequencies.
    x = x.sort_index()
    # Compute the relative frequency before converting to a DataFrame.
    pct = (x / x.sum()).round(2) * 100
    # Convert the pandas.Series object to a pandas.DataFrame in order to
    # name the values and frequencies properly.
    x = pd.DataFrame(x)
    x['% Frequency'] = pct
    x.columns = ['Frequency', '% Frequency']
    x.index.name = 'Values'
    # Finally, return the object. If run in an IPython Notebook, it will
    # print a nice HTML table.
    return x
This variable takes integer values between -10 and 10. After the filter applied above, which removed all countries with at least one variable missing, a large share of the remaining countries, 60%, score 8 or above.
In [3]:
print_freq_table(dfs['polityscore'])
Out[3]:
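That 60% figure can also be computed directly with a boolean mean. The snippet below is a sketch on made-up scores; on the real data you would apply the same expression to `dfs['polityscore']`:

```python
import pandas as pd

# Made-up polity-style scores for illustration only;
# the real values live in dfs['polityscore'].
scores = pd.Series([10, 9, 8, 7, -2, 8, 9, 6, 10, 5])
# Mean of a boolean series is the share of True values.
share_high = round((scores >= 8).mean() * 100, 1)
print(share_high)  # 60.0
```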
The values span a continuous interval, so the frequency table is not a good tool to visualize the distribution. Let's organize the values into 10 buckets, converting the continuous values to categorical ones. The result is that 57% of countries have a life expectancy above 74 years. This raises the suspicion that filtering out countries with any missing value biases the selection toward healthy countries.
In [4]:
NUM_BUCKETS = 10
print_freq_table(pd.cut(dfs.lifeexpectancy, NUM_BUCKETS))
Out[4]:
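The share above a cutoff can be cross-checked against the raw values rather than read off the buckets. This is a sketch on synthetic numbers, not the actual lifeexpectancy column:

```python
import pandas as pd

# Synthetic life-expectancy values for illustration only.
life = pd.Series([55.0, 62.0, 70.0, 75.0, 76.5, 78.0, 80.0])
# pd.cut splits the range into equal-width buckets.
buckets = pd.cut(life, 5)
print(buckets.value_counts().sort_index())
# Share of observations above a cutoff, computed from the raw values.
share_above_74 = (life > 74).mean() * 100
print(round(share_above_74, 1))
```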
Another variable that spans a continuous interval. This time 87% of countries consume less than 2.47 tonnes of oil per person per year, and only one country falls in the top bucket of oil consumption.
In [5]:
print_freq_table(pd.cut(dfs.oilperperson, NUM_BUCKETS))
Out[5]:
For this variable we see that the majority of countries commit 2.12% or less of their labor force to the armed forces.
In [6]:
print_freq_table(pd.cut(dfs.armedforcesrate, NUM_BUCKETS))
Out[6]:
Here I show you an overview of the variables selected. You can see the extreme values, mean, standard deviation, and the 25%, 50% (median), and 75% quartiles.
In [7]:
dfs.describe()
Out[7]:
End of the assignment.