In [2]:

    
# Data Source: https://www.kaggle.com/worldbank/world-development-indicators
# Folder: 'world-development-indicators'

Matplotlib: Exploring

Data Visualization

World Development Indicators

This week, we will be using an open dataset from Kaggle. It is The World Development Indicators dataset obtained from the World Bank containing over a thousand annual indicators of economic development from hundreds of countries around the world.

This is a slightly modified version of the original dataset from The World Bank

List of the available indicators and a list of the available countries.

Step 1: Initial exploration of the Dataset



In [3]:

    
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt



In [4]:

    
data = pd.read_csv(r'C:\Users\hrao\Documents\Personal\HK\Python\world-development-indicators\Indicators.csv')
data.shape









    Out[4]:





(5656458, 6)

This is a really large dataset, at least in terms of the number of rows. But with 6 columns, what does this hold?



In [5]:

    
data.head(10)









    Out[5]:







  
    
      
      CountryName
      CountryCode
      IndicatorName
      IndicatorCode
      Year
      Value
    
  
  
    
      0
      Arab World
      ARB
      Adolescent fertility rate (births per 1,000 wo...
      SP.ADO.TFRT
      1960
      1.335609e+02
    
    
      1
      Arab World
      ARB
      Age dependency ratio (% of working-age populat...
      SP.POP.DPND
      1960
      8.779760e+01
    
    
      2
      Arab World
      ARB
      Age dependency ratio, old (% of working-age po...
      SP.POP.DPND.OL
      1960
      6.634579e+00
    
    
      3
      Arab World
      ARB
      Age dependency ratio, young (% of working-age ...
      SP.POP.DPND.YG
      1960
      8.102333e+01
    
    
      4
      Arab World
      ARB
      Arms exports (SIPRI trend indicator values)
      MS.MIL.XPRT.KD
      1960
      3.000000e+06
    
    
      5
      Arab World
      ARB
      Arms imports (SIPRI trend indicator values)
      MS.MIL.MPRT.KD
      1960
      5.380000e+08
    
    
      6
      Arab World
      ARB
      Birth rate, crude (per 1,000 people)
      SP.DYN.CBRT.IN
      1960
      4.769789e+01
    
    
      7
      Arab World
      ARB
      CO2 emissions (kt)
      EN.ATM.CO2E.KT
      1960
      5.956399e+04
    
    
      8
      Arab World
      ARB
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1960
      6.439635e-01
    
    
      9
      Arab World
      ARB
      CO2 emissions from gaseous fuel consumption (%...
      EN.ATM.CO2E.GF.ZS
      1960
      5.041292e+00

Looks like it has different indicators for different countries with the year and value of the indicator.

How many UNIQUE country names are there ?



In [6]:

    
countries = data['CountryName'].unique().tolist()
len(countries)









    Out[6]:





247

Are there same number of country codes ?



In [7]:

    
# How many unique country codes are there ? (should be the same #)
countryCodes = data['CountryCode'].unique().tolist()
len(countryCodes)









    Out[7]:





247

Are there many indicators or few ?



In [8]:

    
# How many unique indicators are there ? (should be the same #)
indicators = data['IndicatorName'].unique().tolist()
len(indicators)









    Out[8]:





1344

How many years of data do we have ?



In [9]:

    
# How many years of data do we have ?
years = data['Year'].unique().tolist()
len(years)









    Out[9]:





56

What's the range of years?



In [10]:

    
print(min(years)," to ",max(years))









    



1960  to  2015

Matplotlib: Basic Plotting, Part 1

Lets pick a country and an indicator to explore: CO2 Emissions per capita and the USA



In [11]:

    
# select CO2 emissions for the United States
hist_indicator = 'CO2 emissions \(metric'
hist_country = 'USA'

mask1 = data['IndicatorName'].str.contains(hist_indicator) 
mask2 = data['CountryCode'].str.contains(hist_country)

# stage is just those indicators matching the USA for country code and CO2 emissions over time.
stage = data[mask1 & mask2]



In [12]:

    
stage.head()









    Out[12]:







  
    
      
      CountryName
      CountryCode
      IndicatorName
      IndicatorCode
      Year
      Value
    
  
  
    
      22232
      United States
      USA
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1960
      15.999779
    
    
      48708
      United States
      USA
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1961
      15.681256
    
    
      77087
      United States
      USA
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1962
      16.013937
    
    
      105704
      United States
      USA
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1963
      16.482762
    
    
      134742
      United States
      USA
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1964
      16.968119

Let's see how emissions have changed over time using MatplotLib



In [13]:

    
# get the years
years = stage['Year'].values
# get the values 
co2 = stage['Value'].values

# create
plt.bar(years,co2)
plt.show()

Turns out emissions per capita have dropped a bit over time, but let's make this graphic a bit more appealing before we continue to explore it.



In [14]:

    
# switch to a line plot
plt.plot(stage['Year'].values, stage['Value'].values)

# Label the axes
plt.xlabel('Year')
plt.ylabel(stage['IndicatorName'].iloc[0])

#label the figure
plt.title('CO2 Emissions in USA')

# to make more honest, start they y axis at 0
plt.axis([1959, 2011,0,25])

plt.show()

Using Histograms to explore the distribution of values

We could also visualize this data as a histogram to better explore the ranges of values in CO2 production per year.



In [15]:

    
# If you want to just include those within one standard deviation fo the mean, you could do the following
# lower = stage['Value'].mean() - stage['Value'].std()
# upper = stage['Value'].mean() + stage['Value'].std()
# hist_data = [x for x in stage[:10000]['Value'] if x>lower and x<upper ]

# Otherwise, let's look at all the data
hist_data = stage['Value'].values



In [16]:

    
print(len(hist_data))



In [17]:

    
# the histogram of the data
plt.hist(hist_data, 10, normed=False, facecolor='green')

plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Years')
plt.title('Histogram Example')

plt.grid(True)

plt.show()

So the USA has many years where it produced between 19-20 metric tons per capita with outliers on either side.

But how do the USA's numbers relate to those of other countries?



In [18]:

    
# select CO2 emissions for all countries in 2011
hist_indicator = 'CO2 emissions \(metric'
hist_year = 2011

mask1 = data['IndicatorName'].str.contains(hist_indicator) 
mask2 = data['Year'].isin([hist_year])

# apply our mask
co2_2011 = data[mask1 & mask2]
co2_2011.head()









    Out[18]:







  
    
      
      CountryName
      CountryCode
      IndicatorName
      IndicatorCode
      Year
      Value
    
  
  
    
      5026275
      Arab World
      ARB
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      2011
      4.724500
    
    
      5026788
      Caribbean small states
      CSS
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      2011
      9.692960
    
    
      5027295
      Central Europe and the Baltics
      CEB
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      2011
      6.911131
    
    
      5027870
      East Asia & Pacific (all income levels)
      EAS
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      2011
      5.859548
    
    
      5028456
      East Asia & Pacific (developing only)
      EAP
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      2011
      5.302499

For how many countries do we have CO2 per capita emissions data in 2011



In [19]:

    
print(len(co2_2011))



In [20]:

    
# let's plot a histogram of the emmissions per capita by country

# subplots returns a touple with the figure, axis attributes.
fig, ax = plt.subplots()

ax.annotate("USA",
            xy=(18, 5), xycoords='data',
            xytext=(18, 30), textcoords='data',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"),
            )

plt.hist(co2_2011['Value'], 10, normed=False, facecolor='green')

plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Countries')
plt.title('Histogram of CO2 Emissions Per Capita')

#plt.axis([10, 22, 0, 14])
plt.grid(True)

plt.show()

So the USA, at ~18 CO2 emissions (metric tons per capital) is quite high among all countries.

An interesting next step, which we'll save for you, would be to explore how this relates to other industrialized nations and to look at the outliers with those values in the 40s!

Matplotlib: Basic Plotting, Part 2

Relationship between GPD and CO2 Emissions in USA



In [21]:

    
# select GDP Per capita emissions for the United States
hist_indicator = 'GDP per capita \(constant 2005'
hist_country = 'USA'

mask1 = data['IndicatorName'].str.contains(hist_indicator) 
mask2 = data['CountryCode'].str.contains(hist_country)

# stage is just those indicators matching the USA for country code and CO2 emissions over time.
gdp_stage = data[mask1 & mask2]

#plot gdp_stage vs stage



In [22]:

    
gdp_stage.head(2)









    Out[22]:







  
    
      
      CountryName
      CountryCode
      IndicatorName
      IndicatorCode
      Year
      Value
    
  
  
    
      22282
      United States
      USA
      GDP per capita (constant 2005 US$)
      NY.GDP.PCAP.KD
      1960
      15482.707760
    
    
      48759
      United States
      USA
      GDP per capita (constant 2005 US$)
      NY.GDP.PCAP.KD
      1961
      15578.409657



In [23]:

    
stage.head(2)









    Out[23]:







  
    
      
      CountryName
      CountryCode
      IndicatorName
      IndicatorCode
      Year
      Value
    
  
  
    
      22232
      United States
      USA
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1960
      15.999779
    
    
      48708
      United States
      USA
      CO2 emissions (metric tons per capita)
      EN.ATM.CO2E.PC
      1961
      15.681256



In [24]:

    
# switch to a line plot
plt.plot(gdp_stage['Year'].values, gdp_stage['Value'].values)

# Label the axes
plt.xlabel('Year')
plt.ylabel(gdp_stage['IndicatorName'].iloc[0])

#label the figure
plt.title('GDP Per Capita USA')

# to make more honest, start they y axis at 0
#plt.axis([1959, 2011,0,25])

plt.show()

So although we've seen a decline in the CO2 emissions per capita, it does not seem to translate to a decline in GDP per capita

ScatterPlot for comparing GDP against CO2 emissions (per capita)

First, we'll need to make sure we're looking at the same time frames



In [25]:

    
print("GDP Min Year = ", gdp_stage['Year'].min(), "max: ", gdp_stage['Year'].max())
print("CO2 Min Year = ", stage['Year'].min(), "max: ", stage['Year'].max())









    



GDP Min Year =  1960 max:  2014
CO2 Min Year =  1960 max:  2011

We have 3 extra years of GDP data, so let's trim those off so the scatterplot has equal length arrays to compare (this is actually required by scatterplot)



In [26]:

    
gdp_stage_trunc = gdp_stage[gdp_stage['Year'] < 2012]
print(len(gdp_stage_trunc))
print(len(stage))



In [27]:

    
%matplotlib inline
import matplotlib.pyplot as plt

fig, axis = plt.subplots()
# Grid lines, Xticks, Xlabel, Ylabel

axis.yaxis.grid(True)
axis.set_title('CO2 Emissions vs. GDP \(per capita\)',fontsize=10)
axis.set_xlabel(gdp_stage_trunc['IndicatorName'].iloc[0],fontsize=10)
axis.set_ylabel(stage['IndicatorName'].iloc[0],fontsize=10)

X = gdp_stage_trunc['Value']
Y = stage['Value']

axis.scatter(X, Y)
plt.show()

This doesn't look like a strong relationship. We can test this by looking at correlation.



In [28]:

    
np.corrcoef(gdp_stage_trunc['Value'],stage['Value'])









    Out[28]:





array([[ 1.        ,  0.07676005],
       [ 0.07676005,  1.        ]])

A correlation of 0.07 is pretty weak, but you'll learn more about correlation in the next course.

You could continue to explore this to see if other countries have a closer relationship between CO2 emissions and GDP. Perhaps it is stronger for developing countries?

Want more ?

Matplotlib Examples Library

http://matplotlib.org/examples/index.html



In [29]:

    
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

	CountryName	CountryCode	IndicatorName	IndicatorCode	Year	Value
0	Arab World	ARB	Adolescent fertility rate (births per 1,000 wo...	SP.ADO.TFRT	1960	1.335609e+02
1	Arab World	ARB	Age dependency ratio (% of working-age populat...	SP.POP.DPND	1960	8.779760e+01
2	Arab World	ARB	Age dependency ratio, old (% of working-age po...	SP.POP.DPND.OL	1960	6.634579e+00
3	Arab World	ARB	Age dependency ratio, young (% of working-age ...	SP.POP.DPND.YG	1960	8.102333e+01
4	Arab World	ARB	Arms exports (SIPRI trend indicator values)	MS.MIL.XPRT.KD	1960	3.000000e+06
5	Arab World	ARB	Arms imports (SIPRI trend indicator values)	MS.MIL.MPRT.KD	1960	5.380000e+08
6	Arab World	ARB	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	1960	4.769789e+01
7	Arab World	ARB	CO2 emissions (kt)	EN.ATM.CO2E.KT	1960	5.956399e+04
8	Arab World	ARB	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	1960	6.439635e-01
9	Arab World	ARB	CO2 emissions from gaseous fuel consumption (%...	EN.ATM.CO2E.GF.ZS	1960	5.041292e+00

	CountryName	CountryCode	IndicatorName	IndicatorCode	Year	Value
22232	United States	USA	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	1960	15.999779
48708	United States	USA	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	1961	15.681256
77087	United States	USA	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	1962	16.013937
105704	United States	USA	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	1963	16.482762
134742	United States	USA	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	1964	16.968119

	CountryName	CountryCode	IndicatorName	IndicatorCode	Year	Value
5026275	Arab World	ARB	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	2011	4.724500
5026788	Caribbean small states	CSS	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	2011	9.692960
5027295	Central Europe and the Baltics	CEB	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	2011	6.911131
5027870	East Asia & Pacific (all income levels)	EAS	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	2011	5.859548
5028456	East Asia & Pacific (developing only)	EAP	CO2 emissions (metric tons per capita)	EN.ATM.CO2E.PC	2011	5.302499

	CountryName	CountryCode	IndicatorName	IndicatorCode	Year	Value
22282	United States	USA	GDP per capita (constant 2005 US$)	NY.GDP.PCAP.KD	1960	15482.707760
48759	United States	USA	GDP per capita (constant 2005 US$)	NY.GDP.PCAP.KD	1961	15578.409657