Following is the Python program I wrote to fulfill the second assignment of the Data Management and Visualization online course.
I decided to use a Jupyter Notebook as it is a convenient way to write code and present the results together.
Using the Gapminder database, I would like to see whether increasing Internet usage is associated with an increasing suicide rate. Studies show that other factors, such as unemployment, could also have a great impact.
So for this second assignment, the frequencies of the following three variables will be analyzed: Internet Use Rate, Suicide Rate and Employment Rate.
In [1]:
# Load useful Python libraries for handling data
import pandas as pd
from IPython.display import Markdown, display
In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
display(Markdown("General information on the Gapminder data"))
display(Markdown("Number of countries: {0}".format(len(data))))
display(Markdown("Number of variables: {0}".format(len(data.columns))))
In [3]:
display(Markdown("The first records of the data."))
data.head()
Out[3]:
In [4]:
# Convert interesting variables in numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')
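As a side note, the errors='coerce' option is what makes this conversion robust: any value that cannot be parsed as a number becomes NaN instead of raising an error. A minimal sketch with made-up values (the strings below are hypothetical, not taken from the Gapminder file):

```python
import pandas as pd

# Hypothetical column read as strings, with one non-numeric entry
raw = pd.Series(['12.5', '47.8', 'n/a'])
converted = pd.to_numeric(raw, errors='coerce')

# 'n/a' cannot be parsed, so coercion turns it into NaN
print(converted.isnull().sum())  # one missing value
```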
In [5]:
data['internetuserate'].value_counts(sort=False, dropna=False)
Out[5]:
These raw counts are not informative as the variable does not take discrete values. So before computing the frequency counts on the data, I will group the values into intervals of 5% using the cut function.
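To make the binning behavior concrete, here is a small sketch with made-up rates (not actual Gapminder values): np.linspace(0, 100., num=21) produces the 21 edges 0, 5, 10, ..., 100, and pd.cut assigns each value to a half-open interval.

```python
import pandas as pd
import numpy as np

# Hypothetical rates to illustrate binning into twenty 5%-wide intervals
rates = pd.Series([2.1, 7.3, 98.5])
bins = pd.cut(rates, bins=np.linspace(0, 100., num=21))

# Each value falls into one interval, closed on the right: (0.0, 5.0], (5.0, 10.0], ...
print(bins.tolist())
```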
In [6]:
import numpy as np
display(Markdown("Internet Use Rate (min, max) = ({0:.2f}, {1:.2f})".format(data['internetuserate'].min(), data['internetuserate'].max())))
In [7]:
internetuserate_bins = pd.cut(data['internetuserate'], bins=np.linspace(0, 100., num=21))
Counts of Internet Use Rate:
In [8]:
internetuserate_bins.value_counts(sort=False, dropna=False)
Out[8]:
Percentages of Internet Use Rate:
In [9]:
internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False)
Out[9]:
In [10]:
display(Markdown("Cumulative sum for Internet use rate percentages"))
internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False).cumsum()
Out[10]:
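The cumulative sum above simply accumulates the normalized counts, so the last entry reaches 1.0 (up to the share of missing values). A toy illustration with a made-up categorical series:

```python
import pandas as pd

# Hypothetical series to show normalized counts and their cumulative sum
s = pd.Series(['a', 'a', 'b', 'c'])
pct = s.value_counts(sort=False, normalize=True)

# cumsum() turns the percentages into a cumulative distribution ending at 1.0
print(pct.cumsum())
```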
In [11]:
display(Markdown("Suicide rate (min, max) = ({0:.2f}, {1:.2f})".format(data['suicideper100th'].min(), data['suicideper100th'].max())))
Counts of Suicide Rate:
In [12]:
suiciderate_bins = pd.cut(data['suicideper100th'], bins=np.linspace(0, 40., num=21))
suiciderate_bins.value_counts(sort=False, dropna=False)
Out[12]:
Percentages of Suicide Rate:
In [13]:
suiciderate_bins.value_counts(sort=False, normalize=True, dropna=False)
Out[13]:
In [14]:
display(Markdown("Employment rate (min, max) = ({0:.2f}, {1:.2f})".format(data['employrate'].min(), data['employrate'].max())))
Counts of Employment Rate:
In [15]:
employment_bins = pd.cut(data['employrate'], bins=np.linspace(0, 100., num=21))
employment_bins.value_counts(sort=False, dropna=False)
Out[15]:
Percentages of Employment Rate:
In [16]:
employment_bins.value_counts(sort=False, normalize=True, dropna=False)
Out[16]:
The Gapminder database provides information for 213 countries. Unfortunately, for all three variables analyzed here (Internet Use Rate, Suicide Rate and Employment Rate) some data are missing; e.g. the Internet Use Rate is unknown for 21 countries.
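Such missing values can be counted per column with isnull().sum(). A minimal sketch on a hypothetical frame mimicking two of the Gapminder columns (the numbers are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical frame with NaNs produced by the earlier coercion step
df = pd.DataFrame({
    'internetuserate': [10.0, np.nan, 55.0],
    'suicideper100th': [np.nan, np.nan, 8.0],
})

# isnull().sum() counts the missing entries in each column
print(df.isnull().sum())
```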
The distribution of the variables is as follows:
A possible refinement could be to analyze only the countries with a high employment rate (e.g. > 60%) to suppress the economic influence. The data filtering would then be done as follows:
In [17]:
high_employment_set = data[data['employrate'] > 60.0]
high_employment_set.head()
Out[17]:
If you are interested in the subject, follow me on Tumblr.