Following is the Python program I wrote to fulfill the third assignment of the Data Management and Visualization online course.
I decided to use Jupyter Notebook as it is a pretty way to write code and present results.
Using the Gapminder database, I would like to see if increasing Internet usage results in an increasing suicide rate. A study shows that other factors, such as unemployment, could have a great impact.
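So for this third assignment, the following three variables will be analyzed: the Internet use rate (per 100 people), the suicide rate (per 100,000 people) and the unemployment rate (% of the population aged 15+).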
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are replaced directly by NaN, no special data treatment is needed.
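To illustrate how incomplete countries could be discarded, here is a minimal sketch on a made-up table. The country labels and values are invented for the example and are not taken from Gapminder. It assumes that a country is dropped as soon as one of the three variables is missing.

```python
import numpy as np
import pandas as pd

# Toy data: labels and values are invented for illustration only
toy = pd.DataFrame(
    {
        'internetuserate': [10.0, np.nan, 55.5],
        'suicideper100th': [4.2, 7.1, np.nan],
        'employrate': [60.0, 52.3, 58.1],
    },
    index=pd.Index(['A', 'B', 'C'], name='country'),
)

# Keep only the countries for which all three variables are available
complete = toy.dropna(how='any')
```

Only country A survives here, since B and C each miss one value.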
In [1]:
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
from IPython.display import Markdown, display
In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))
In [4]:
# Convert the variables of interest to numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')
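To see what errors='coerce' does in the conversion above, here is a small sketch on an invented Series (the raw strings are made up for the example). Entries that cannot be parsed as numbers become NaN instead of raising an error.

```python
import pandas as pd

# Invented raw values, as they might look when a CSV column is read as text
raw = pd.Series(['12.5', 'unknown', '40', 'n/a'])

# Unparseable entries are turned into NaN rather than raising a ValueError
numeric = pd.to_numeric(raw, errors='coerce')
```

The result is a float column in which the second and fourth entries are NaN.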
However, the unemployment rate is not provided directly. The database gives the employment rate (% of the population), so the unemployment rate will be computed as 100 - employment rate:
In [5]:
data['unemployrate'] = 100. - data['employrate']
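As a quick check of this arithmetic, here is a sketch on invented employment rates (the labels and values are assumptions for the example). It also shows that a missing employment rate stays missing after the subtraction, so no extra handling is needed.

```python
import numpy as np
import pandas as pd

# Invented employment rates (% of the population), for illustration only
employ = pd.Series([60.0, np.nan, 45.5], index=['A', 'B', 'C'])

# Same computation as in the notebook: NaN propagates through the subtraction
unemploy = 100. - employ
```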
The first records of the data restricted to the three analyzed variables are:
In [6]:
subdata = data[['internetuserate', 'suicideper100th', 'unemployrate']]
subdata.head(10)
Out[6]:
In [7]:
display(Markdown("Internet Use Rate (min, max) = ({0:.2f}, {1:.2f})".format(subdata['internetuserate'].min(), subdata['internetuserate'].max())))
In [8]:
internetuserate_bins = pd.cut(subdata['internetuserate'],
bins=np.linspace(0, 100., num=21))
counts1 = internetuserate_bins.value_counts(sort=False, dropna=False)
percentage1 = internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
'Counts' : counts1,
'Cumulative counts' : counts1.cumsum(),
'Percentages' : percentage1,
'Cumulative percentages' : percentage1.cumsum()
}
internetrate_summary = pd.DataFrame(data_struct)
internetrate_summary.index.name = 'Internet use rate (per 100 people)'
# Display with 3 decimals (Styler.format(precision=...) needs pandas >= 1.3)
(internetrate_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
    .style.format(precision=3)
    .set_properties(**{'text-align':'right'}))
Out[8]:
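To make the binning more concrete, here is a small sketch with invented values. np.linspace(0, 100., num=21) gives 21 edges, that is 20 bins of width 5. By default pd.cut builds intervals that are open on the left and closed on the right, so a value of exactly 0 falls outside the first bin and is reported as missing.

```python
import numpy as np
import pandas as pd

# 21 edges from 0 to 100 give 20 bins of width 5
edges = np.linspace(0, 100., num=21)

# Invented values chosen to sit on or near bin edges
values = pd.Series([0.0, 5.0, 5.1, 100.0])

# Default behaviour: right-closed intervals such as (0, 5] and (5, 10]
binned = pd.cut(values, bins=edges)
```

If values of exactly 0 should be kept, pd.cut accepts include_lowest=True.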
In [9]:
display(Markdown("Suicide per 100,000 people (min, max) = ({:.2f}, {:.2f})".format(subdata['suicideper100th'].min(), subdata['suicideper100th'].max())))
In [10]:
suiciderate_bins = pd.cut(subdata['suicideper100th'],
bins=np.linspace(0, 40., num=21))
counts2 = suiciderate_bins.value_counts(sort=False, dropna=False)
percentage2 = suiciderate_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
'Counts' : counts2,
'Cumulative counts' : counts2.cumsum(),
'Percentages' : percentage2,
'Cumulative percentages' : percentage2.cumsum()
}
suiciderate_summary = pd.DataFrame(data_struct)
suiciderate_summary.index.name = 'Suicide (per 100 000 people)'
# Display with 3 decimals (Styler.format(precision=...) needs pandas >= 1.3)
(suiciderate_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
    .style.format(precision=3)
    .set_properties(**{'text-align':'right'}))
Out[10]:
In [11]:
display(Markdown("Unemployment rate (min, max) = ({0:.2f}, {1:.2f})".format(subdata['unemployrate'].min(), subdata['unemployrate'].max())))
In [12]:
unemployment_bins = pd.cut(subdata['unemployrate'],
bins=np.linspace(0, 100., num=21))
counts3 = unemployment_bins.value_counts(sort=False, dropna=False)
percentage3 = unemployment_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
'Counts' : counts3,
'Cumulative counts' : counts3.cumsum(),
'Percentages' : percentage3,
'Cumulative percentages' : percentage3.cumsum()
}
unemployment_summary = pd.DataFrame(data_struct)
unemployment_summary.index.name = 'Unemployment rate (% population age 15+)'
# Display with 3 decimals (Styler.format(precision=...) needs pandas >= 1.3)
(unemployment_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
    .style.format(precision=3)
    .set_properties(**{'text-align':'right'}))
Out[12]:
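The three frequency tables above follow the same pattern, so they could be produced by one helper. The following is only a sketch: the function name frequency_table and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def frequency_table(series, edges, index_name):
    """Return counts and percentages per bin, with their cumulative sums."""
    bins = pd.cut(series, bins=edges)
    # dropna=False keeps a row for the countries with missing data
    counts = bins.value_counts(sort=False, dropna=False)
    percentages = bins.value_counts(sort=False, normalize=True, dropna=False)
    table = pd.DataFrame({
        'Counts': counts,
        'Cumulative counts': counts.cumsum(),
        'Percentages': percentages,
        'Cumulative percentages': percentages.cumsum(),
    })
    table.index.name = index_name
    return table

# Invented values, for illustration only
toy = pd.Series([2.0, 7.5, 8.0, np.nan])
toy_table = frequency_table(toy, np.linspace(0, 10., num=3), 'Toy rate')
```

With the notebook data, a call such as frequency_table(subdata['internetuserate'], np.linspace(0, 100., num=21), 'Internet use rate (per 100 people)') should give the same table as the first summary.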
The Gapminder database provides information for 213 countries.
As the unemployment rate is not provided directly in the database, it was computed as 100 - employment rate.
The distributions of the variables are as follows:
From those data, I was surprised that so few people have access to the Internet, especially now that smartphones are cheap.
Another astonishing fact is the high unemployment rate. I was expecting much less, especially in so-called developed countries. But I presume that long schooling and retirement can explain those high values, as all people aged 15+ are considered here.
If you are interested in the subject, follow me on Tumblr.