Assignment: Making Data Management Decisions - Python

Following is the Python program I wrote to fulfill the third assignment of the Data Management and Visualization online course.

I decided to use Jupyter Notebook as it is a pretty way to write code and present results.

Research question

Using the Gapminder database, I would like to see if an increasing Internet usage results in an increasing suicide rate. A study shows that other factors like unemployment could have a great impact.

So for this third assignment, the three following variables will be analyzed:

  • Internet Usage Rate (per 100 people)
  • Suicide Rate (per 100 000 people)
  • Unemployment Rate (% of the population of age 15+)

Data management

For the question, I'm interested in the countries for which data are missing will be discarded. As missing data in Gapminder database are replace directly by NaN no special data treatment is needed.


In [1]:
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
from IPython.display import Markdown, display

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')

General information on the Gapminder data


In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))


Number of countries: 213

Number of variables: 15


In [4]:
# Convert interesting variables in numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')

But the unemployment rate is not provided directly. In the database, the employment rate (% of the popluation) is available. So the unemployement rate will be computed as 100 - employment rate:


In [5]:
data['unemployrate'] = 100. - data['employrate']

The first records of the data restricted to the three analyzed variables are:


In [6]:
subdata = data[['internetuserate', 'suicideper100th', 'unemployrate']]
subdata.head(10)


Out[6]:
internetuserate suicideper100th unemployrate
country
Afghanistan 3.654122 6.684385 44.299999
Albania 44.989947 7.699330 48.599998
Algeria 12.500073 4.848770 49.500000
Andorra 81.000000 5.362179 NaN
Angola 9.999954 14.554677 24.300003
Antigua and Barbuda 80.645455 2.161843 NaN
Argentina 36.000335 7.765584 41.599998
Armenia 44.001025 3.741588 59.900002
Aruba 41.800889 NaN NaN
Australia 75.895654 8.470030 38.500000

Data analysis

We will now have a look at the frequencies of the variables after grouping them as all three are continuous variables. I will group the data in intervals using the cut function.

Internet use rate frequencies


In [7]:
display(Markdown("Internet Use Rate (min, max) = ({0:.2f}, {1:.2f})".format(subdata['internetuserate'].min(), subdata['internetuserate'].max())))


Internet Use Rate (min, max) = (0.21, 95.64)


In [8]:
internetuserate_bins = pd.cut(subdata['internetuserate'], 
                              bins=np.linspace(0, 100., num=21))

counts1 = internetuserate_bins.value_counts(sort=False, dropna=False)
percentage1 = internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
    'Counts' : counts1,
    'Cumulative counts' : counts1.cumsum(),
    'Percentages' : percentage1,
    'Cumulative percentages' : percentage1.cumsum()
}

internetrate_summary = pd.DataFrame(data_struct)
internetrate_summary.index.name = 'Internet use rate (per 100 people)'
(internetrate_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
                     .style.set_precision(3)
                           .set_properties(**{'text-align':'right'}))


Out[8]:
Counts Cumulative counts Percentages Cumulative percentages
Internet use rate (per 100 people)
(0, 5] 26 26 0.122 0.122
(5, 10] 23 49 0.108 0.23
(10, 15] 19 68 0.0892 0.319
(15, 20] 8 76 0.0376 0.357
(20, 25] 6 82 0.0282 0.385
(25, 30] 11 93 0.0516 0.437
(30, 35] 8 101 0.0376 0.474
(35, 40] 10 111 0.0469 0.521
(40, 45] 17 128 0.0798 0.601
(45, 50] 8 136 0.0376 0.638
(50, 55] 7 143 0.0329 0.671
(55, 60] 2 145 0.00939 0.681
(60, 65] 7 152 0.0329 0.714
(65, 70] 7 159 0.0329 0.746
(70, 75] 8 167 0.0376 0.784
(75, 80] 8 175 0.0376 0.822
(80, 85] 10 185 0.0469 0.869
(85, 90] 2 187 0.00939 0.878
(90, 95] 4 191 0.0188 0.897
(95, 100] 1 192 0.00469 0.901
nan 21 213 0.0986 1

Suicide per 100,000 people frequencies


In [9]:
display(Markdown("Suicide per 100,000 people (min, max) = ({:.2f}, {:.2f})".format(subdata['suicideper100th'].min(), subdata['suicideper100th'].max())))


Suicide per 100,000 people (min, max) = (0.20, 35.75)


In [10]:
suiciderate_bins = pd.cut(subdata['suicideper100th'], 
                          bins=np.linspace(0, 40., num=21))

counts2 = suiciderate_bins.value_counts(sort=False, dropna=False)
percentage2 = suiciderate_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
    'Counts' : counts2,
    'Cumulative counts' : counts2.cumsum(),
    'Percentages' : percentage2,
    'Cumulative percentages' : percentage2.cumsum()
}

suiciderate_summary = pd.DataFrame(data_struct)
suiciderate_summary.index.name = 'Suicide (per 100 000 people)'
(suiciderate_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
                     .style.set_precision(3)
                           .set_properties(**{'text-align':'right'}))


Out[10]:
Counts Cumulative counts Percentages Cumulative percentages
Suicide (per 100 000 people)
(0, 2] 11 11 0.0516 0.0516
(2, 4] 16 27 0.0751 0.127
(4, 6] 32 59 0.15 0.277
(6, 8] 29 88 0.136 0.413
(8, 10] 26 114 0.122 0.535
(10, 12] 24 138 0.113 0.648
(12, 14] 18 156 0.0845 0.732
(14, 16] 13 169 0.061 0.793
(16, 18] 4 173 0.0188 0.812
(18, 20] 4 177 0.0188 0.831
(20, 22] 4 181 0.0188 0.85
(22, 24] 2 183 0.00939 0.859
(24, 26] 1 184 0.00469 0.864
(26, 28] 3 187 0.0141 0.878
(28, 30] 2 189 0.00939 0.887
(30, 32] 0 189 0 0.887
(32, 34] 1 190 0.00469 0.892
(34, 36] 1 191 0.00469 0.897
(36, 38] 0 191 0 0.897
(38, 40] 0 191 0 0.897
nan 22 213 0.103 1

Unemployment rate frequencies


In [11]:
display(Markdown("Unemployment rate (min, max) = ({0:.2f}, {1:.2f})".format(subdata['unemployrate'].min(), subdata['unemployrate'].max())))


Unemployment rate (min, max) = (16.80, 68.00)


In [12]:
unemployment_bins = pd.cut(subdata['unemployrate'], 
                         bins=np.linspace(0, 100., num=21))


counts3 = unemployment_bins.value_counts(sort=False, dropna=False)
percentage3 = unemployment_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
    'Counts' : counts3,
    'Cumulative counts' : counts3.cumsum(),
    'Percentages' : percentage3,
    'Cumulative percentages' : percentage3.cumsum()
}

unemployment_summary = pd.DataFrame(data_struct)
unemployment_summary.index.name = 'Unemployement rate (% population age 15+)'
(unemployment_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
                     .style.set_precision(3)
                           .set_properties(**{'text-align':'right'}))


Out[12]:
Counts Cumulative counts Percentages Cumulative percentages
Unemployement rate (% population age 15+)
(0, 5] 0 0 0 0
(5, 10] 0 0 0 0
(10, 15] 0 0 0 0
(15, 20] 6 6 0.0282 0.0282
(20, 25] 8 14 0.0376 0.0657
(25, 30] 13 27 0.061 0.127
(30, 35] 18 45 0.0845 0.211
(35, 40] 29 74 0.136 0.347
(40, 45] 44 118 0.207 0.554
(45, 50] 23 141 0.108 0.662
(50, 55] 18 159 0.0845 0.746
(55, 60] 14 173 0.0657 0.812
(60, 65] 3 176 0.0141 0.826
(65, 70] 2 178 0.00939 0.836
(70, 75] 0 178 0 0.836
(75, 80] 0 178 0 0.836
(80, 85] 0 178 0 0.836
(85, 90] 0 178 0 0.836
(90, 95] 0 178 0 0.836
(95, 100] 0 178 0 0.836
nan 35 213 0.164 1

Summary

The Gapminder data based provides information for 213 countries.

As the unemployment rate is not provided directly in the database, it was computed as 100 - employment rate.

The distributions of the variables are as follow:

  • Internet Use Rate per 100 people
    • Data missing for 21 countries
    • Rate ranges from 0.21 to 95.64
    • The majority of the countries (64%) have a rate below 50
  • Suicide Rate per 100 000
    • Data missing for 22 countries
    • Rate ranges from 0.2 to 35.75
    • The rate is more often between 4 and 12
  • Unemployment Rate for age 15+
    • Data missing for 35 countries
    • Rate ranges from 16.8 to 68
    • For the majority of the countries the rate lies below 45

From those data, I was surprised that so few people have access to the internet especially now that smartphones are cheap.

Another astonishing facts is the high unemployment rate, I was expected much less; especially in so called developped countries. But I presume that long school time and retirement can explain those high values as people of age 15+ are considered here.

If you are interested by the subject, follow me on Tumblr.