Following is the Python program I wrote to fulfill the second assignment of the Data Management and Visualization online course.
I decided to use a Jupyter Notebook as it is a convenient way to write code and present the results together.
Using the Gapminder database, I would like to see whether increasing Internet usage is associated with an increasing suicide rate. Studies show that other factors, such as unemployment, could also have a great impact.
So for this second assignment, the frequencies of the following three variables will be analyzed: Internet Use Rate, Suicide Rate and Employment Rate.
In [1]:
# Load useful Python libraries for handling data
import pandas as pd
from IPython.display import Markdown, display
In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
display(Markdown("General information on the Gapminder data"))
display(Markdown("Number of countries: {0}".format(len(data))))
display(Markdown("Number of variables: {0}".format(len(data.columns))))
In [3]:
display(Markdown("The first records of the data."))
data.head()
Out[3]:
In [4]:
# Convert interesting variables in numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')
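As a side note, the errors='coerce' option is what makes this conversion robust: any value that cannot be parsed as a number becomes NaN instead of raising an error. A minimal sketch with made-up values (the strings below are hypothetical, not taken from the Gapminder file):

```python
import pandas as pd

# Hypothetical column read as strings, with one non-numeric entry
raw = pd.Series(['12.5', '47.8', 'n/a'])
converted = pd.to_numeric(raw, errors='coerce')

# 'n/a' cannot be parsed, so coercion turns it into NaN
print(converted.isnull().sum())  # one missing value
```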
In [5]:
data['internetuserate'].value_counts(sort=False, dropna=False)
Out[5]:
These raw counts are not informative as the variable does not take discrete values. So before computing the frequency counts on the data, I will group the values into intervals of 5% using the cut function.
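To make the binning behavior concrete, here is a small sketch with made-up rates (not actual Gapminder values): np.linspace(0, 100., num=21) produces the 21 edges 0, 5, 10, ..., 100, and pd.cut assigns each value to a half-open interval.

```python
import pandas as pd
import numpy as np

# Hypothetical rates to illustrate binning into twenty 5%-wide intervals
rates = pd.Series([2.1, 7.3, 98.5])
bins = pd.cut(rates, bins=np.linspace(0, 100., num=21))

# Each value falls into one interval, closed on the right: (0.0, 5.0], (5.0, 10.0], ...
print(bins.tolist())
```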
In [6]:
import numpy as np
display(Markdown("Internet Use Rate (min, max) = ({0:.2f}, {1:.2f})".format(data['internetuserate'].min(), data['internetuserate'].max())))
In [7]:
internetuserate_bins = pd.cut(data['internetuserate'], bins=np.linspace(0, 100., num=21))
Counts of Internet Use Rate:
In [8]:
internetuserate_bins.value_counts(sort=False, dropna=False)
Out[8]:
Percentages of Internet Use Rate:
In [9]:
internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False)
Out[9]:
In [10]:
display(Markdown("Cumulative sum for Internet use rate percentages"))
internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False).cumsum()
Out[10]:
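The cumulative sum above simply accumulates the normalized counts, so the last entry reaches 1.0 (up to the share of missing values). A toy illustration with a made-up categorical series:

```python
import pandas as pd

# Hypothetical series to show normalized counts and their cumulative sum
s = pd.Series(['a', 'a', 'b', 'c'])
pct = s.value_counts(sort=False, normalize=True)

# cumsum() turns the percentages into a cumulative distribution ending at 1.0
print(pct.cumsum())
```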
In [11]:
display(Markdown("Suicide rate (min, max) = ({0:.2f}, {1:.2f})".format(data['suicideper100th'].min(), data['suicideper100th'].max())))
Counts of Suicide Rate:
In [12]:
suiciderate_bins = pd.cut(data['suicideper100th'], bins=np.linspace(0, 40., num=21))
suiciderate_bins.value_counts(sort=False, dropna=False)
Out[12]:
Percentages of Suicide Rate:
In [13]:
suiciderate_bins.value_counts(sort=False, normalize=True, dropna=False)
Out[13]:
In [14]:
display(Markdown("Employment rate (min, max) = ({0:.2f}, {1:.2f})".format(data['employrate'].min(), data['employrate'].max())))
Counts of Employment Rate:
In [15]:
employment_bins = pd.cut(data['employrate'], bins=np.linspace(0, 100., num=21))
employment_bins.value_counts(sort=False, dropna=False)
Out[15]:
Percentages of Employment Rate:
In [16]:
employment_bins.value_counts(sort=False, normalize=True, dropna=False)
Out[16]:
The Gapminder database provides information for 213 countries. Unfortunately, for all three variables analyzed here (Internet Use Rate, Suicide Rate and Employment Rate) some data are missing; e.g. the Internet Use Rate is unknown for 21 countries.
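Such missing values can be counted per column with isnull().sum(). A minimal sketch on a hypothetical frame mimicking two of the Gapminder columns (the numbers are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical frame with NaNs produced by the earlier coercion step
df = pd.DataFrame({
    'internetuserate': [10.0, np.nan, 55.0],
    'suicideper100th': [np.nan, np.nan, 8.0],
})

# isnull().sum() counts the missing entries in each column
print(df.isnull().sum())
```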
The distribution of the variables is as follows:
A possible refinement could be to analyze only the countries with a high employment rate (e.g. > 60%) to suppress the economic influence. The data filtering would then be done as follows:
In [17]:
high_employment_set = data[data['employrate'] > 60.0]
high_employment_set.head()
Out[17]:
If you are interested in the subject, follow me on Tumblr.