Assignment: Creating Graphs for Your Data - Python

Following is the Python program I wrote to fulfill the four assignment of the Data Management and Visualization online course.

I decided to use Jupyter Notebook as it is a pretty way to write code and present results.

Research question

Using the Gapminder database, I would like to see if an increasing Internet usage results in an increasing suicide rate. A study shows that other factors like unemployment could have a great impact.

So for this assignment, the three following variables will be analyzed:

  • Internet Usage Rate (per 100 people)
  • Suicide Rate (per 100 000 people)
  • Unemployment Rate (% of the population of age 15+)

Data management

For the question, I'm interested in the countries for which data are missing will be discarded. As missing data in Gapminder database are replace directly by NaN no special data treatment is needed.


In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')

General information on the Gapminder data


In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))


Number of countries: 213

Number of variables: 15


In [4]:
# Convert interesting variables in numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')

But the unemployment rate is not provided directly. In the database, the employment rate (% of the popluation) is available. So the unemployement rate will be computed as 100 - employment rate:


In [5]:
data['unemployrate'] = 100. - data['employrate']

The first records of the data restricted to the three analyzed variables are:


In [6]:
subdata = data[['internetuserate', 'suicideper100th', 'unemployrate']]
subdata.head(10)


Out[6]:
internetuserate suicideper100th unemployrate
country
Afghanistan 3.654122 6.684385 44.299999
Albania 44.989947 7.699330 48.599998
Algeria 12.500073 4.848770 49.500000
Andorra 81.000000 5.362179 NaN
Angola 9.999954 14.554677 24.300003
Antigua and Barbuda 80.645455 2.161843 NaN
Argentina 36.000335 7.765584 41.599998
Armenia 44.001025 3.741588 59.900002
Aruba 41.800889 NaN NaN
Australia 75.895654 8.470030 38.500000

Data analysis

We will now have a look at the frequencies of the variables after grouping them as all three are continuous variables. I will group the data in intervals using the cut function.

Internet use rate distribution

As the internet use rate is a quantitative variable, the distplot of seaborn package will be used.


In [7]:
sns.distplot(subdata['internetuserate'].dropna(), kde=False)
plt.xlabel('Internet use rate (%)')
_ = plt.title('Internet use rate distribution in the Gapminder data set')



In [8]:
subdata['internetuserate'].describe()


Out[8]:
count    192.000000
mean      35.632716
std       27.780285
min        0.210066
25%        9.999604
50%       31.810121
75%       56.416046
max       95.638113
Name: internetuserate, dtype: float64

From the bar chart and the descriptive information, there is a obvious concentration of countries having less than 20% internet use rate. A quarter of the countries have even 10% or less internet use rate.

The mean is 35.6% with a important standard deviation of 27.8%.

The first mode is about 8%. And there is also a second smaller mode around 40%.

This distribution is bimodal skewed-right.

The distribution range is high; roughly 95%.

Example of analysis for a categorical variable

As shown during the last assignement, a categorical variable from the internet use rate can be constructed using the function cut.

So to train myself with categorical variable, here follows the same chart and description but for the categorical version of the quantitative variable.


In [9]:
internetuserate_bins = pd.cut(subdata['internetuserate'], 
                              bins=np.linspace(0, 100., num=11))

counts1 = internetuserate_bins.value_counts(sort=False, dropna=False)
percentage1 = internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
    'Counts' : counts1,
    'Cumulative counts' : counts1.cumsum(),
    'Percentages' : percentage1,
    'Cumulative percentages' : percentage1.cumsum()
}

internetrate_summary = pd.DataFrame(data_struct)
internetrate_summary.index.name = 'Internet use rate (per 100 people)'
(internetrate_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
                     .style.set_precision(3)
                           .set_properties(**{'text-align':'right'}))


Out[9]:
Counts Cumulative counts Percentages Cumulative percentages
Internet use rate (per 100 people)
(0, 10] 49 49 0.23 0.23
(10, 20] 27 76 0.127 0.357
(20, 30] 17 93 0.0798 0.437
(30, 40] 18 111 0.0845 0.521
(40, 50] 25 136 0.117 0.638
(50, 60] 9 145 0.0423 0.681
(60, 70] 14 159 0.0657 0.746
(70, 80] 16 175 0.0751 0.822
(80, 90] 12 187 0.0563 0.878
(90, 100] 5 192 0.0235 0.901
nan 21 213 0.0986 1

In [10]:
sns.countplot(internetuserate_bins)
plt.xlabel('Internet use rate (%)')
_ = plt.title('Internet use rate distribution in the Gapminder data set')



In [11]:
internetuserate_bins.describe()


Out[11]:
count         192
unique         10
top       (0, 10]
freq           49
Name: internetuserate, dtype: object

The conclusions drawn form the quantitative variable are of course confirmed here.

Suicide per 100,000 people distribution

As the suicide per 100,000 people is a quantitative variable, the distplot of seaborn package will be used.


In [12]:
sns.distplot(subdata['suicideper100th'].dropna(), kde=False)
plt.xlabel('Suicide per 100 000 people (-)')
_ = plt.title('Suicide per 100 000 people in the Gapminder data set')



In [13]:
subdata['suicideper100th'].describe()


Out[13]:
count    191.000000
mean       9.640839
std        6.300178
min        0.201449
25%        4.988449
50%        8.262893
75%       12.328551
max       35.752872
Name: suicideper100th, dtype: float64

From the bar chart and the descriptive information, 75% of the countries have less than 12 suicide per 100,000 people.

The mean is 9.6 with a standard deviation of 6.3.

This distribution is unimodal and skewed-right.

The distribution range is about 35.

Unemployment rate distribution

As the unemployement rate is a quantitative variable, the distplot of seaborn package will be used.


In [14]:
sns.distplot(subdata['unemployrate'].dropna(), kde=False)
plt.xlabel('Unemployement rate (% population age 15+) (-)')
_ = plt.title('Unemployement rate (% population age 15+) in the Gapminder data set')



In [15]:
subdata['unemployrate'].describe()


Out[15]:
count    178.000000
mean      41.364045
std       10.519454
min       16.800003
25%       35.025000
50%       41.300001
75%       48.775000
max       68.000000
Name: unemployrate, dtype: float64

There is a clear peak around the mean (~42%).

The mean is 41.4% with a standard deviation of 10.5%.

This distribution is unimodal and symmetric.

The distribution range is roughly 51%.

Graphing decisions

Now that the univariate distribution as be plotted and described, the bivariate graphics will be plotted in order to test our research hypothesis.

Let's first focus on the primary research question;

  • The explanatory variable is the internet use rate (quantitative variable)
  • The response variable is the suicide per 100,000 people (quantitative variable)

Due to the variable types, a scatterplot seems to be the graphical solution to apply.


In [16]:
sns.regplot(x='internetuserate', y='suicideper100th', data=subdata)
plt.xlabel('Internet use rate (%)')
plt.ylabel('Suicide per 100 000 people (-)')
_ = plt.title('Scatterplot for the association between the Internet use rate and suicide per 100,000 people')


The regression line is only slightly positive. So it is unclear whether there is a link or not. But as most of the countries have no or very low internet use rate, an effect is maybe seen only on the countries having the higher internet use rate.


In [17]:
subdata2 = subdata.assign(internet_grp4 = pd.qcut(subdata.internetuserate, 4, 
                                       labels=["1=25th%tile", "2=50th%tile", 
                                               "3=75th%tile", "4=100th%tile"]))
sns.factorplot(x='internet_grp4', y='suicideper100th', data=subdata2, 
               kind="bar", ci=None)
plt.xlabel('Internet use rate (%)')
plt.ylabel('Suicide per 100 000 people (-)')
_ = plt.title('Average suicide per 100,000 people per internet use rate quartile')


By grouping the data in quartile, the hypothesis of an increasing of suicide with internet use rate may be valid only for countries with a heavy use of internet.

It is time to look at the second potential explanatory variable : the unemployment rate. As that variable is quantitative again, the scatterplot will be used.


In [18]:
sns.regplot(x='unemployrate', y='suicideper100th', data=subdata)
plt.xlabel('Unemployment rate (%)')
plt.ylabel('Suicide per 100 000 people (-)')
_ = plt.title('Scatterplot for the association between the unemployment rate and suicide per 100,000 people')


It seems there is no correlation between unemployment rate and suicide.

Summary

The Gapminder data based provides information for 213 countries.

As the unemployment rate is not provided directly in the database, it was computed as 100 - employment rate.

The distributions of the variables are as follow:

  • Internet Use Rate per 100 people
    • Data missing for 21 countries
    • Rate ranges from 0.21 to 95.64
    • The majority of the countries (64%) have a rate below 50
    • The distribution is bimodal (first mode ~8% and second ~40%) and skewed-right
  • Suicide Rate per 100 000
    • Data missing for 22 countries
    • Rate ranges from 0.2 to 35.75
    • The rate is more often between 4 and 12
    • The distribution is unimode (mode ~9) and skewed-right
  • Unemployment Rate for age 15+
    • Data missing for 35 countries
    • Rate ranges from 16.8 to 68
    • For the majority of the countries the rate lies below 45
    • The distribution is unimode (mean ~41.4) and symmetric

From the bivariate graphics, the internet use rate may have a slight effect on suicide. But if it's the case, it seems only true for countries having an important internet use rate. And the unemployment rate seems to have no influence on suicide. I have to admit those conclusions surprised me as I was a priori thinking that unemployment rate would have a stronger effect on suicide compare to internet use rate.

If you are interested by the subject, follow me on Tumblr.