Writing your first program - Python

Below is the Python program I wrote for the second assignment of the Data Management and Visualization online course.

I decided to use Jupyter Notebook because it is a neat way to write code and present the results side by side.

Research question

Using the Gapminder database, I would like to see whether increasing Internet usage results in an increasing suicide rate. A study shows that other factors, such as unemployment, could have a great impact.

So for this second assignment, the frequencies of the following three variables will be analyzed:

  • Internet Usage Rate (per 100 people)
  • Suicide Rate (per 100 000 people)
  • Employment Rate (% of the population of age 15+)

Data analysis


In [1]:
# Load useful Python libraries for handling data
import pandas as pd
from IPython.display import Markdown, display

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')

display(Markdown("General information on the Gapminder data"))
display(Markdown("Number of countries: {0}".format(len(data))))
display(Markdown("Number of variables: {0}".format(len(data.columns))))


General information on the Gapminder data

Number of countries: 213

Number of variables: 15


In [3]:
display(Markdown("The first records of the data."))
data.head()


The first records of the data.

Out[3]:
incomeperperson alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate internetuserate lifeexpectancy oilperperson polityscore relectricperperson suicideper100th employrate urbanrate
country
Afghanistan .03 .5696534 26.8 75944000 25.6000003814697 3.65412162280064 48.673 0 6.68438529968262 55.7000007629394 24.04
Albania 1914.99655094922 7.29 1.0247361 57.4 223747333.333333 42.0999984741211 44.9899469578783 76.918 9 636.341383366604 7.69932985305786 51.4000015258789 46.72
Algeria 2231.99333515006 .69 2.306817 23.5 2932108666.66667 31.7000007629394 .1 12.5000733055148 73.131 .42009452521537 2 590.509814347428 4.8487696647644 50.5 65.22
Andorra 21943.3398976022 10.17 81 5.36217880249023 88.92
Angola 1381.00426770244 5.57 1.4613288 23.1 248358000 69.4000015258789 2 9.99995388324075 51.093 -2 172.999227388199 14.5546770095825 75.6999969482422 56.7

In [4]:
# Convert the variables of interest to numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')
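
As a minimal sketch of what the conversion above does, the example below uses a small hypothetical series (not the actual Gapminder column) in which some entries are not numbers. With errors='coerce', anything that cannot be parsed becomes NaN instead of raising an error, which is how missing data ends up counted later on.

```python
import pandas as pd

# Hypothetical raw column: values read as strings, two of them unusable
raw = pd.Series(["3.65", "", "44.99", "unknown"], name="internetuserate")

# Unparseable entries are turned into NaN rather than raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# Count how many entries ended up missing
missing_count = int(numeric.isna().sum())
print(numeric)
print("Missing values:", missing_count)
```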

We will now have a look at the frequencies of the variables.

Internet use rate frequencies

First, for the Internet usage, the counts are (including missing data):


In [5]:
data['internetuserate'].value_counts(sort=False, dropna=False)


Out[5]:
 0.720009      1
 1.400061      1
 2.100213      1
 3.654122      1
 4.999875      1
 5.999836      1
NaN           21
 7.232224      1
 29.879921     1
 9.999954      1
 1.259934      1
 11.090765     1
 12.500073     1
 13.598876     1
 14.830736     1
 15.899970     1
 90.703555     1
 0.829997      1
 18.795114     1
 2.300027      1
 20.001710     1
 31.050013     1
 16.780037     1
 24.999946     1
 25.899797     1
 26.740025     1
 2.259976      1
 3.129962      1
 29.999940     1
 76.587538     1
              ..
 27.851822     1
 1.699985      1
 42.692335     1
 81.338393     1
 12.006692     1
 2.450362      1
 81.590397     1
 80.000000     1
 36.422772     1
 39.820178     1
 13.000111     1
 48.516818     1
 28.289701     1
 9.007736      1
 53.024745     1
 43.055067     1
 61.987413     1
 7.930096      1
 75.200000     1
 44.585355     1
 14.000247     1
 53.740217     1
 0.210066      1
 44.570074     1
 40.020095     1
 2.471948      1
 6.965038      1
 31.568098     1
 20.663156     1
 28.999477     1
Name: internetuserate, dtype: int64

This is not useful because the variable is continuous: almost every value occurs exactly once. So before counting frequencies, I will group the data into intervals 5 units wide (the rate is expressed per 100 people) using the pandas cut function.


In [6]:
import numpy as np
display(Markdown("Internet Use Rate (min, max) = ({0:.2f}, {1:.2f})".format(data['internetuserate'].min(), data['internetuserate'].max())))


Internet Use Rate (min, max) = (0.21, 95.64)


In [7]:
internetuserate_bins = pd.cut(data['internetuserate'], bins=np.linspace(0, 100., num=21))
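
To illustrate how the binning behaves, here is a small sketch on hypothetical values (the series below is made up for the example). It assumes the default behaviour of pd.cut, where intervals are closed on the right, so a value of exactly 5 lands in (0, 5] and a value of exactly 0 would fall outside the first bin unless include_lowest=True is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical rates, including a bin edge (5.0) and a missing value
rates = pd.Series([0.21, 5.0, 5.01, 95.64, np.nan])

# 21 evenly spaced edges from 0 to 100 give 20 bins of width 5
edges = np.linspace(0, 100.0, num=21)
binned = pd.cut(rates, bins=edges)
print(binned)

# A value of exactly 0 is not in (0, 5]; include_lowest=True keeps it
zero_default = pd.cut(pd.Series([0.0]), bins=edges)
zero_included = pd.cut(pd.Series([0.0]), bins=edges, include_lowest=True)
```

This does not matter for the Gapminder data here, since the minimum Internet use rate is 0.21, but it is worth keeping in mind for variables that can be exactly zero.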

Counts of Internet Use Rate:


In [8]:
internetuserate_bins.value_counts(sort=False, dropna=False)


Out[8]:
(0, 5]       26
(5, 10]      23
(10, 15]     19
(15, 20]      8
(20, 25]      6
(25, 30]     11
(30, 35]      8
(35, 40]     10
(40, 45]     17
(45, 50]      8
(50, 55]      7
(55, 60]      2
(60, 65]      7
(65, 70]      7
(70, 75]      8
(75, 80]      8
(80, 85]     10
(85, 90]      2
(90, 95]      4
(95, 100]     1
NaN          21
dtype: int64

Percentages of Internet Use Rate (expressed as fractions of all 213 countries, missing data included):


In [9]:
internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False)


Out[9]:
(0, 5]       0.122066
(5, 10]      0.107981
(10, 15]     0.089202
(15, 20]     0.037559
(20, 25]     0.028169
(25, 30]     0.051643
(30, 35]     0.037559
(35, 40]     0.046948
(40, 45]     0.079812
(45, 50]     0.037559
(50, 55]     0.032864
(55, 60]     0.009390
(60, 65]     0.032864
(65, 70]     0.032864
(70, 75]     0.037559
(75, 80]     0.037559
(80, 85]     0.046948
(85, 90]     0.009390
(90, 95]     0.018779
(95, 100]    0.004695
NaN          0.098592
dtype: float64

In [10]:
display(Markdown("Cumulative sum for Internet use rate percentages"))
internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False).cumsum()


Cumulative sum for Internet use rate percentages

Out[10]:
(0, 5]       0.122066
(5, 10]      0.230047
(10, 15]     0.319249
(15, 20]     0.356808
(20, 25]     0.384977
(25, 30]     0.436620
(30, 35]     0.474178
(35, 40]     0.521127
(40, 45]     0.600939
(45, 50]     0.638498
(50, 55]     0.671362
(55, 60]     0.680751
(60, 65]     0.713615
(65, 70]     0.746479
(70, 75]     0.784038
(75, 80]     0.821596
(80, 85]     0.868545
(85, 90]     0.877934
(90, 95]     0.896714
(95, 100]    0.901408
NaN          1.000000
dtype: float64
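
The cumulative sum above includes the countries with missing data in the total. As an alternative sketch, on hypothetical values rather than the Gapminder column, the frequencies can be computed over the valid data only and scaled to a 0 to 100 range:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
values = pd.Series([2.0, 7.5, 12.0, 3.0, np.nan])
bins = pd.cut(values, bins=[0, 5, 10, 15])

# value_counts drops NaN by default; sort_index keeps the interval order
counts = bins.value_counts().sort_index()

# Percentages of the valid data only, then their running total
percent = 100 * counts / counts.sum()
cumulative = percent.cumsum()
print(cumulative)
```

Whether missing data should be part of the total depends on the question being asked; both conventions are reasonable as long as the choice is stated.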

Suicide rate per 100,000 frequencies


In [11]:
display(Markdown("Suicide rate (min, max) = ({0:.2f}, {1:.2f})".format(data['suicideper100th'].min(), data['suicideper100th'].max())))


Suicide rate (min, max) = (0.20, 35.75)

Counts of Suicide Rate:


In [12]:
suiciderate_bins = pd.cut(data['suicideper100th'], bins=np.linspace(0, 40., num=21))
suiciderate_bins.value_counts(sort=False, dropna=False)


Out[12]:
(0, 2]      11
(2, 4]      16
(4, 6]      32
(6, 8]      29
(8, 10]     26
(10, 12]    24
(12, 14]    18
(14, 16]    13
(16, 18]     4
(18, 20]     4
(20, 22]     4
(22, 24]     2
(24, 26]     1
(26, 28]     3
(28, 30]     2
(30, 32]     0
(32, 34]     1
(34, 36]     1
(36, 38]     0
(38, 40]     0
NaN         22
dtype: int64

Percentages of Suicide Rate:


In [13]:
suiciderate_bins.value_counts(sort=False, normalize=True, dropna=False)


Out[13]:
(0, 2]      0.051643
(2, 4]      0.075117
(4, 6]      0.150235
(6, 8]      0.136150
(8, 10]     0.122066
(10, 12]    0.112676
(12, 14]    0.084507
(14, 16]    0.061033
(16, 18]    0.018779
(18, 20]    0.018779
(20, 22]    0.018779
(22, 24]    0.009390
(24, 26]    0.004695
(26, 28]    0.014085
(28, 30]    0.009390
(30, 32]    0.000000
(32, 34]    0.004695
(34, 36]    0.004695
(36, 38]    0.000000
(38, 40]    0.000000
NaN         0.103286
dtype: float64

Employment rate frequencies


In [14]:
display(Markdown("Employment rate (min, max) = ({0:.2f}, {1:.2f})".format(data['employrate'].min(), data['employrate'].max())))


Employment rate (min, max) = (32.00, 83.20)

Counts of Employment Rate:


In [15]:
employment_bins = pd.cut(data['employrate'], bins=np.linspace(0, 100., num=21))
employment_bins.value_counts(sort=False, dropna=False)


Out[15]:
(0, 5]        0
(5, 10]       0
(10, 15]      0
(15, 20]      0
(20, 25]      0
(25, 30]      0
(30, 35]      2
(35, 40]      3
(40, 45]     14
(45, 50]     18
(50, 55]     23
(55, 60]     44
(60, 65]     32
(65, 70]     15
(70, 75]     13
(75, 80]      8
(80, 85]      6
(85, 90]      0
(90, 95]      0
(95, 100]     0
NaN          35
dtype: int64

Percentages of Employment Rate:


In [16]:
employment_bins.value_counts(sort=False, normalize=True, dropna=False)


Out[16]:
(0, 5]       0.000000
(5, 10]      0.000000
(10, 15]     0.000000
(15, 20]     0.000000
(20, 25]     0.000000
(25, 30]     0.000000
(30, 35]     0.009390
(35, 40]     0.014085
(40, 45]     0.065728
(45, 50]     0.084507
(50, 55]     0.107981
(55, 60]     0.206573
(60, 65]     0.150235
(65, 70]     0.070423
(70, 75]     0.061033
(75, 80]     0.037559
(80, 85]     0.028169
(85, 90]     0.000000
(90, 95]     0.000000
(95, 100]    0.000000
NaN          0.164319
dtype: float64

Summary

The Gapminder database provides information for 213 countries. Unfortunately, data are missing for all three variables analyzed here (Internet use rate, suicide rate and employment rate); e.g. the Internet use rate is unknown for 21 countries.

The distribution of the variables is as follows:

  • Internet Use Rate per 100 people
    • Data missing for 21 countries
    • Rate ranges from 0.21 to 95.64
    • The majority of the countries have a rate below 50
  • Suicide Rate per 100 000
    • Data missing for 22 countries
    • Rate ranges from 0.2 to 35.75
    • The rate is most often between 4 and 12
  • Employment Rate for age 15+
    • Data missing for 35 countries
    • Rate ranges from 32 to 83.2
    • For most of the countries the rate lies between 50 and 65
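
The missing counts and ranges listed above can also be gathered in a single table. The sketch below uses a small hypothetical DataFrame as a stand-in for the Gapminder data; the column names match the ones used in this post, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Gapminder data
sample = pd.DataFrame({
    "internetuserate": [3.65, 44.99, np.nan, 81.0],
    "employrate": [55.7, np.nan, np.nan, 61.5],
})

# One row per variable: number of missing values, minimum and maximum
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "min": sample.min(),
    "max": sample.max(),
})
print(summary)
```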

A possible refinement would be to analyze only the countries with a high employment rate (e.g. above 60%) to reduce the influence of economic factors. The data filtering would then be done as follows:


In [17]:
high_employment_set = data[data['employrate'] > 60.0]
high_employment_set.head()


Out[17]:
incomeperperson alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate internetuserate lifeexpectancy oilperperson polityscore relectricperperson suicideper100th employrate urbanrate
country
Angola 1381.00426770244 5.57 1.4613288 23.1 248358000 69.4000015258789 2 9.999954 51.093 -2 172.999227388199 14.554677 75.699997 56.7
Australia 25249.98606148 10.21 .4862799 83.2 12970092666.6667 54.5999984741211 .1 75.895654 81.907 1.91302610912404 10 2825.39109539914 8.470030 61.500000 88.74
Azerbaijan 2344.89691619809 13.34 1.9767462 31.5 511107666.666667 56.2000007629394 .1 46.679702 70.739 .35917260997566 -7 921.562110759901 1.380965 60.900002 51.92
Bahamas 19630.5405471267 8.65 .5452863 54.4 137555000 60.7000007629394 3.1 42.984580 75.62 3.374416 66.599998 83.7
Bahrain 12505.2125447354 4.19 5.2311429 40.2 503994333.333333 30.2000007629394 54.992809 75.057 -7 7314.3556367342 4.414990 60.400002 88.52
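
One detail of this filter is worth spelling out: a comparison with NaN evaluates to False, so countries whose employment rate is missing are dropped along with those below the threshold. A minimal sketch on a small hypothetical sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample indexed by country, with one missing employment rate
sample = pd.DataFrame(
    {"employrate": [75.7, 61.5, 55.7, np.nan]},
    index=pd.Index(["Angola", "Australia", "Afghanistan", "Andorra"], name="country"),
)

# Boolean mask: True only where the rate is known and above 60
mask = sample["employrate"] > 60.0
high_employment = sample[mask]
print(high_employment)
```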

If you are interested in the subject, follow me on Tumblr.