Assignment: Making Data Management Decisions - Python

Following is the Python program I wrote to fulfill the third assignment of the Data Management and Visualization online course.

I decided to use Jupyter Notebook as it is a pretty way to write code and present results.

Research question

Using the Gapminder database, I would like to see if an increasing Internet usage results in an increasing suicide rate. A study shows that other factors like unemployment could have a great impact.

So for this third assignment, the three following variables will be analyzed:

Internet Usage Rate (per 100 people)
Suicide Rate (per 100 000 people)
Unemployment Rate (% of the population of age 15+)

Data management

For the question, I'm interested in the countries for which data are missing will be discarded. As missing data in Gapminder database are replace directly by NaN no special data treatment is needed.



In [1]:

    
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
from IPython.display import Markdown, display



In [2]:

    
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')

General information on the Gapminder data



In [3]:

    
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))









    




Number of countries: 213







    




Number of variables: 15



In [4]:

    
# Convert interesting variables in numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')

But the unemployment rate is not provided directly. In the database, the employment rate (% of the popluation) is available. So the unemployement rate will be computed as 100 - employment rate:



In [5]:

    
data['unemployrate'] = 100. - data['employrate']

The first records of the data restricted to the three analyzed variables are:



In [6]:

    
subdata = data[['internetuserate', 'suicideper100th', 'unemployrate']]
subdata.head(10)









    Out[6]:






  
    
      
      internetuserate
      suicideper100th
      unemployrate
    
    
      country
      
      
      
    
  
  
    
      Afghanistan
      3.654122
      6.684385
      44.299999
    
    
      Albania
      44.989947
      7.699330
      48.599998
    
    
      Algeria
      12.500073
      4.848770
      49.500000
    
    
      Andorra
      81.000000
      5.362179
      NaN
    
    
      Angola
      9.999954
      14.554677
      24.300003
    
    
      Antigua and Barbuda
      80.645455
      2.161843
      NaN
    
    
      Argentina
      36.000335
      7.765584
      41.599998
    
    
      Armenia
      44.001025
      3.741588
      59.900002
    
    
      Aruba
      41.800889
      NaN
      NaN
    
    
      Australia
      75.895654
      8.470030
      38.500000

Data analysis

We will now have a look at the frequencies of the variables after grouping them as all three are continuous variables. I will group the data in intervals using the cut function.

Internet use rate frequencies



In [7]:

    
display(Markdown("Internet Use Rate (min, max) = ({0:.2f}, {1:.2f})".format(subdata['internetuserate'].min(), subdata['internetuserate'].max())))









    




Internet Use Rate (min, max) = (0.21, 95.64)



In [8]:

    
internetuserate_bins = pd.cut(subdata['internetuserate'], 
                              bins=np.linspace(0, 100., num=21))

counts1 = internetuserate_bins.value_counts(sort=False, dropna=False)
percentage1 = internetuserate_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
    'Counts' : counts1,
    'Cumulative counts' : counts1.cumsum(),
    'Percentages' : percentage1,
    'Cumulative percentages' : percentage1.cumsum()
}

internetrate_summary = pd.DataFrame(data_struct)
internetrate_summary.index.name = 'Internet use rate (per 100 people)'
(internetrate_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
                     .style.set_precision(3)
                           .set_properties(**{'text-align':'right'}))









    Out[8]:





        

        
        

        
            
            
                
                
                
                Counts
                
                Cumulative counts
                
                Percentages
                
                Cumulative percentages
                
            
            
            
                
                Internet use rate (per 100 people)
                
                
                
                
                
                
                
                
                
            
            
        
        
            
            
                
                
                    (0, 5]
                
                
                    26
                
                
                    26
                
                
                    0.122
                
                
                    0.122
                
            
            
            
                
                
                    (5, 10]
                
                
                    23
                
                
                    49
                
                
                    0.108
                
                
                    0.23
                
            
            
            
                
                
                    (10, 15]
                
                
                    19
                
                
                    68
                
                
                    0.0892
                
                
                    0.319
                
            
            
            
                
                
                    (15, 20]
                
                
                    8
                
                
                    76
                
                
                    0.0376
                
                
                    0.357
                
            
            
            
                
                
                    (20, 25]
                
                
                    6
                
                
                    82
                
                
                    0.0282
                
                
                    0.385
                
            
            
            
                
                
                    (25, 30]
                
                
                    11
                
                
                    93
                
                
                    0.0516
                
                
                    0.437
                
            
            
            
                
                
                    (30, 35]
                
                
                    8
                
                
                    101
                
                
                    0.0376
                
                
                    0.474
                
            
            
            
                
                
                    (35, 40]
                
                
                    10
                
                
                    111
                
                
                    0.0469
                
                
                    0.521
                
            
            
            
                
                
                    (40, 45]
                
                
                    17
                
                
                    128
                
                
                    0.0798
                
                
                    0.601
                
            
            
            
                
                
                    (45, 50]
                
                
                    8
                
                
                    136
                
                
                    0.0376
                
                
                    0.638
                
            
            
            
                
                
                    (50, 55]
                
                
                    7
                
                
                    143
                
                
                    0.0329
                
                
                    0.671
                
            
            
            
                
                
                    (55, 60]
                
                
                    2
                
                
                    145
                
                
                    0.00939
                
                
                    0.681
                
            
            
            
                
                
                    (60, 65]
                
                
                    7
                
                
                    152
                
                
                    0.0329
                
                
                    0.714
                
            
            
            
                
                
                    (65, 70]
                
                
                    7
                
                
                    159
                
                
                    0.0329
                
                
                    0.746
                
            
            
            
                
                
                    (70, 75]
                
                
                    8
                
                
                    167
                
                
                    0.0376
                
                
                    0.784
                
            
            
            
                
                
                    (75, 80]
                
                
                    8
                
                
                    175
                
                
                    0.0376
                
                
                    0.822
                
            
            
            
                
                
                    (80, 85]
                
                
                    10
                
                
                    185
                
                
                    0.0469
                
                
                    0.869
                
            
            
            
                
                
                    (85, 90]
                
                
                    2
                
                
                    187
                
                
                    0.00939
                
                
                    0.878
                
            
            
            
                
                
                    (90, 95]
                
                
                    4
                
                
                    191
                
                
                    0.0188
                
                
                    0.897
                
            
            
            
                
                
                    (95, 100]
                
                
                    1
                
                
                    192
                
                
                    0.00469
                
                
                    0.901
                
            
            
            
                
                
                    nan
                
                
                    21
                
                
                    213
                
                
                    0.0986
                
                
                    1

Suicide per 100,000 people frequencies



In [9]:

    
display(Markdown("Suicide per 100,000 people (min, max) = ({:.2f}, {:.2f})".format(subdata['suicideper100th'].min(), subdata['suicideper100th'].max())))









    




Suicide per 100,000 people (min, max) = (0.20, 35.75)



In [10]:

    
suiciderate_bins = pd.cut(subdata['suicideper100th'], 
                          bins=np.linspace(0, 40., num=21))

counts2 = suiciderate_bins.value_counts(sort=False, dropna=False)
percentage2 = suiciderate_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
    'Counts' : counts2,
    'Cumulative counts' : counts2.cumsum(),
    'Percentages' : percentage2,
    'Cumulative percentages' : percentage2.cumsum()
}

suiciderate_summary = pd.DataFrame(data_struct)
suiciderate_summary.index.name = 'Suicide (per 100 000 people)'
(suiciderate_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
                     .style.set_precision(3)
                           .set_properties(**{'text-align':'right'}))









    Out[10]:





        

        
        

        
            
            
                
                
                
                Counts
                
                Cumulative counts
                
                Percentages
                
                Cumulative percentages
                
            
            
            
                
                Suicide (per 100 000 people)
                
                
                
                
                
                
                
                
                
            
            
        
        
            
            
                
                
                    (0, 2]
                
                
                    11
                
                
                    11
                
                
                    0.0516
                
                
                    0.0516
                
            
            
            
                
                
                    (2, 4]
                
                
                    16
                
                
                    27
                
                
                    0.0751
                
                
                    0.127
                
            
            
            
                
                
                    (4, 6]
                
                
                    32
                
                
                    59
                
                
                    0.15
                
                
                    0.277
                
            
            
            
                
                
                    (6, 8]
                
                
                    29
                
                
                    88
                
                
                    0.136
                
                
                    0.413
                
            
            
            
                
                
                    (8, 10]
                
                
                    26
                
                
                    114
                
                
                    0.122
                
                
                    0.535
                
            
            
            
                
                
                    (10, 12]
                
                
                    24
                
                
                    138
                
                
                    0.113
                
                
                    0.648
                
            
            
            
                
                
                    (12, 14]
                
                
                    18
                
                
                    156
                
                
                    0.0845
                
                
                    0.732
                
            
            
            
                
                
                    (14, 16]
                
                
                    13
                
                
                    169
                
                
                    0.061
                
                
                    0.793
                
            
            
            
                
                
                    (16, 18]
                
                
                    4
                
                
                    173
                
                
                    0.0188
                
                
                    0.812
                
            
            
            
                
                
                    (18, 20]
                
                
                    4
                
                
                    177
                
                
                    0.0188
                
                
                    0.831
                
            
            
            
                
                
                    (20, 22]
                
                
                    4
                
                
                    181
                
                
                    0.0188
                
                
                    0.85
                
            
            
            
                
                
                    (22, 24]
                
                
                    2
                
                
                    183
                
                
                    0.00939
                
                
                    0.859
                
            
            
            
                
                
                    (24, 26]
                
                
                    1
                
                
                    184
                
                
                    0.00469
                
                
                    0.864
                
            
            
            
                
                
                    (26, 28]
                
                
                    3
                
                
                    187
                
                
                    0.0141
                
                
                    0.878
                
            
            
            
                
                
                    (28, 30]
                
                
                    2
                
                
                    189
                
                
                    0.00939
                
                
                    0.887
                
            
            
            
                
                
                    (30, 32]
                
                
                    0
                
                
                    189
                
                
                    0
                
                
                    0.887
                
            
            
            
                
                
                    (32, 34]
                
                
                    1
                
                
                    190
                
                
                    0.00469
                
                
                    0.892
                
            
            
            
                
                
                    (34, 36]
                
                
                    1
                
                
                    191
                
                
                    0.00469
                
                
                    0.897
                
            
            
            
                
                
                    (36, 38]
                
                
                    0
                
                
                    191
                
                
                    0
                
                
                    0.897
                
            
            
            
                
                
                    (38, 40]
                
                
                    0
                
                
                    191
                
                
                    0
                
                
                    0.897
                
            
            
            
                
                
                    nan
                
                
                    22
                
                
                    213
                
                
                    0.103
                
                
                    1

Unemployment rate frequencies



In [11]:

    
display(Markdown("Unemployment rate (min, max) = ({0:.2f}, {1:.2f})".format(subdata['unemployrate'].min(), subdata['unemployrate'].max())))









    




Unemployment rate (min, max) = (16.80, 68.00)



In [12]:

    
unemployment_bins = pd.cut(subdata['unemployrate'], 
                         bins=np.linspace(0, 100., num=21))


counts3 = unemployment_bins.value_counts(sort=False, dropna=False)
percentage3 = unemployment_bins.value_counts(sort=False, normalize=True, dropna=False)
data_struct = {
    'Counts' : counts3,
    'Cumulative counts' : counts3.cumsum(),
    'Percentages' : percentage3,
    'Cumulative percentages' : percentage3.cumsum()
}

unemployment_summary = pd.DataFrame(data_struct)
unemployment_summary.index.name = 'Unemployement rate (% population age 15+)'
(unemployment_summary[['Counts', 'Cumulative counts', 'Percentages', 'Cumulative percentages']]
                     .style.set_precision(3)
                           .set_properties(**{'text-align':'right'}))









    Out[12]:





        

        
        

        
            
            
                
                
                
                Counts
                
                Cumulative counts
                
                Percentages
                
                Cumulative percentages
                
            
            
            
                
                Unemployement rate (% population age 15+)
                
                
                
                
                
                
                
                
                
            
            
        
        
            
            
                
                
                    (0, 5]
                
                
                    0
                
                
                    0
                
                
                    0
                
                
                    0
                
            
            
            
                
                
                    (5, 10]
                
                
                    0
                
                
                    0
                
                
                    0
                
                
                    0
                
            
            
            
                
                
                    (10, 15]
                
                
                    0
                
                
                    0
                
                
                    0
                
                
                    0
                
            
            
            
                
                
                    (15, 20]
                
                
                    6
                
                
                    6
                
                
                    0.0282
                
                
                    0.0282
                
            
            
            
                
                
                    (20, 25]
                
                
                    8
                
                
                    14
                
                
                    0.0376
                
                
                    0.0657
                
            
            
            
                
                
                    (25, 30]
                
                
                    13
                
                
                    27
                
                
                    0.061
                
                
                    0.127
                
            
            
            
                
                
                    (30, 35]
                
                
                    18
                
                
                    45
                
                
                    0.0845
                
                
                    0.211
                
            
            
            
                
                
                    (35, 40]
                
                
                    29
                
                
                    74
                
                
                    0.136
                
                
                    0.347
                
            
            
            
                
                
                    (40, 45]
                
                
                    44
                
                
                    118
                
                
                    0.207
                
                
                    0.554
                
            
            
            
                
                
                    (45, 50]
                
                
                    23
                
                
                    141
                
                
                    0.108
                
                
                    0.662
                
            
            
            
                
                
                    (50, 55]
                
                
                    18
                
                
                    159
                
                
                    0.0845
                
                
                    0.746
                
            
            
            
                
                
                    (55, 60]
                
                
                    14
                
                
                    173
                
                
                    0.0657
                
                
                    0.812
                
            
            
            
                
                
                    (60, 65]
                
                
                    3
                
                
                    176
                
                
                    0.0141
                
                
                    0.826
                
            
            
            
                
                
                    (65, 70]
                
                
                    2
                
                
                    178
                
                
                    0.00939
                
                
                    0.836
                
            
            
            
                
                
                    (70, 75]
                
                
                    0
                
                
                    178
                
                
                    0
                
                
                    0.836
                
            
            
            
                
                
                    (75, 80]
                
                
                    0
                
                
                    178
                
                
                    0
                
                
                    0.836
                
            
            
            
                
                
                    (80, 85]
                
                
                    0
                
                
                    178
                
                
                    0
                
                
                    0.836
                
            
            
            
                
                
                    (85, 90]
                
                
                    0
                
                
                    178
                
                
                    0
                
                
                    0.836
                
            
            
            
                
                
                    (90, 95]
                
                
                    0
                
                
                    178
                
                
                    0
                
                
                    0.836
                
            
            
            
                
                
                    (95, 100]
                
                
                    0
                
                
                    178
                
                
                    0
                
                
                    0.836
                
            
            
            
                
                
                    nan
                
                
                    35
                
                
                    213
                
                
                    0.164
                
                
                    1

Summary

The Gapminder data based provides information for 213 countries.

As the unemployment rate is not provided directly in the database, it was computed as 100 - employment rate.

The distributions of the variables are as follow:

Internet Use Rate per 100 people
- Data missing for 21 countries
- Rate ranges from 0.21 to 95.64
- The majority of the countries (64%) have a rate below 50
Suicide Rate per 100 000
- Data missing for 22 countries
- Rate ranges from 0.2 to 35.75
- The rate is more often between 4 and 12
Unemployment Rate for age 15+
- Data missing for 35 countries
- Rate ranges from 16.8 to 68
- For the majority of the countries the rate lies below 45

From those data, I was surprised that so few people have access to the internet especially now that smartphones are cheap.

Another astonishing facts is the high unemployment rate, I was expected much less; especially in so called developped countries. But I presume that long school time and retirement can explain those high values as people of age 15+ are considered here.

If you are interested by the subject, follow me on Tumblr.

	internetuserate	suicideper100th	unemployrate
country
Afghanistan	3.654122	6.684385	44.299999
Albania	44.989947	7.699330	48.599998
Algeria	12.500073	4.848770	49.500000
Andorra	81.000000	5.362179	NaN
Angola	9.999954	14.554677	24.300003
Antigua and Barbuda	80.645455	2.161843	NaN
Argentina	36.000335	7.765584	41.599998
Armenia	44.001025	3.741588	59.900002
Aruba	41.800889	NaN	NaN
Australia	75.895654	8.470030	38.500000

	Counts	Cumulative counts	Percentages	Cumulative percentages
Internet use rate (per 100 people)
(0, 5]	26	26	0.122	0.122
(5, 10]	23	49	0.108	0.23
(10, 15]	19	68	0.0892	0.319
(15, 20]	8	76	0.0376	0.357
(20, 25]	6	82	0.0282	0.385
(25, 30]	11	93	0.0516	0.437
(30, 35]	8	101	0.0376	0.474
(35, 40]	10	111	0.0469	0.521
(40, 45]	17	128	0.0798	0.601
(45, 50]	8	136	0.0376	0.638
(50, 55]	7	143	0.0329	0.671
(55, 60]	2	145	0.00939	0.681
(60, 65]	7	152	0.0329	0.714
(65, 70]	7	159	0.0329	0.746
(70, 75]	8	167	0.0376	0.784
(75, 80]	8	175	0.0376	0.822
(80, 85]	10	185	0.0469	0.869
(85, 90]	2	187	0.00939	0.878
(90, 95]	4	191	0.0188	0.897
(95, 100]	1	192	0.00469	0.901
nan	21	213	0.0986	1

	Counts	Cumulative counts	Percentages	Cumulative percentages
Suicide (per 100 000 people)
(0, 2]	11	11	0.0516	0.0516
(2, 4]	16	27	0.0751	0.127
(4, 6]	32	59	0.15	0.277
(6, 8]	29	88	0.136	0.413
(8, 10]	26	114	0.122	0.535
(10, 12]	24	138	0.113	0.648
(12, 14]	18	156	0.0845	0.732
(14, 16]	13	169	0.061	0.793
(16, 18]	4	173	0.0188	0.812
(18, 20]	4	177	0.0188	0.831
(20, 22]	4	181	0.0188	0.85
(22, 24]	2	183	0.00939	0.859
(24, 26]	1	184	0.00469	0.864
(26, 28]	3	187	0.0141	0.878
(28, 30]	2	189	0.00939	0.887
(30, 32]	0	189	0	0.887
(32, 34]	1	190	0.00469	0.892
(34, 36]	1	191	0.00469	0.897
(36, 38]	0	191	0	0.897
(38, 40]	0	191	0	0.897
nan	22	213	0.103	1

	Counts	Cumulative counts	Percentages	Cumulative percentages
Unemployement rate (% population age 15+)
(0, 5]	0	0	0	0
(5, 10]	0	0	0	0
(10, 15]	0	0	0	0
(15, 20]	6	6	0.0282	0.0282
(20, 25]	8	14	0.0376	0.0657
(25, 30]	13	27	0.061	0.127
(30, 35]	18	45	0.0845	0.211
(35, 40]	29	74	0.136	0.347
(40, 45]	44	118	0.207	0.554
(45, 50]	23	141	0.108	0.662
(50, 55]	18	159	0.0845	0.746
(55, 60]	14	173	0.0657	0.812
(60, 65]	3	176	0.0141	0.826
(65, 70]	2	178	0.00939	0.836
(70, 75]	0	178	0	0.836
(75, 80]	0	178	0	0.836
(80, 85]	0	178	0	0.836
(85, 90]	0	178	0	0.836
(90, 95]	0	178	0	0.836
(95, 100]	0	178	0	0.836
nan	35	213	0.164	1