In [2]:
# Data Source: https://www.kaggle.com/worldbank/world-development-indicators
# Folder: 'world-development-indicators'
Matplotlib: Exploring Data Visualization
This week, we will use an open dataset from Kaggle: the World Development Indicators dataset from the World Bank, which contains over a thousand annual indicators of economic development for hundreds of countries around the world.
This is a slightly modified version of the original dataset from The World Bank. A list of the available indicators and a list of the available countries are also provided.
In [3]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
In [4]:
data = pd.read_csv(r'C:\Users\hrao\Documents\Personal\HK\Python\world-development-indicators\Indicators.csv')
data.shape
Out[4]:
This is a really large dataset, at least in terms of the number of rows. But with 6 columns, what does this hold?
In [5]:
data.head(10)
Out[5]:
Looks like it has different indicators for different countries with the year and value of the indicator.
In [6]:
countries = data['CountryName'].unique().tolist()
len(countries)
Out[6]:
In [7]:
# How many unique country codes are there? (should match the number of country names)
countryCodes = data['CountryCode'].unique().tolist()
len(countryCodes)
Out[7]:
In [8]:
# How many unique indicators are there?
indicators = data['IndicatorName'].unique().tolist()
len(indicators)
Out[8]:
In [9]:
# How many years of data do we have ?
years = data['Year'].unique().tolist()
len(years)
Out[9]:
In [10]:
print(min(years)," to ",max(years))
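The counts above were gathered one column at a time. As a minimal sketch, the same checks can be done in one call with `nunique()`. The `toy` frame below is a made-up stand-in for `data` (its rows are illustrative only), so treat the numbers as placeholders.

```python
import pandas as pd

# Toy stand-in for the Indicators table; the rows are made up for illustration.
toy = pd.DataFrame({
    'CountryName': ['Aruba', 'Aruba', 'Brazil', 'Brazil'],
    'CountryCode': ['ABW', 'ABW', 'BRA', 'BRA'],
    'IndicatorName': ['Population, total', 'CO2 emissions (kt)',
                      'Population, total', 'CO2 emissions (kt)'],
    'Year': [1960, 1961, 1960, 1961],
})

# Number of distinct values in each column, computed in one call.
unique_counts = toy.nunique()
print(unique_counts)

# Range of years covered, as plain Python ints.
year_range = (int(toy['Year'].min()), int(toy['Year'].max()))
print(year_range[0], "to", year_range[1])
```

On the real table, `data.nunique()` would report the country, indicator, and year counts side by side, which makes a mismatch between names and codes easy to spot.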
Matplotlib: Basic Plotting, Part 1
In [11]:
# select CO2 emissions for the United States
hist_indicator = r'CO2 emissions \(metric'
hist_country = 'USA'
mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['CountryCode'].str.contains(hist_country)
# stage is just those indicators matching the USA for country code and CO2 emissions over time.
stage = data[mask1 & mask2]
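A note on the pattern used above: `str.contains` treats its argument as a regular expression by default, which is why the parenthesis in the indicator name is escaped. The sketch below shows three equivalent ways to match a literal parenthesis. The `names` series is a made-up sample, not the full indicator list.

```python
import re
import pandas as pd

# Toy indicator names, made up for illustration.
names = pd.Series([
    'CO2 emissions (metric tons per capita)',
    'CO2 emissions (kt)',
    'GDP per capita (constant 2005 US$)',
])

# Option 1: escape the parenthesis by hand in a raw string.
mask_escaped = names.str.contains(r'CO2 emissions \(metric')

# Option 2: let re.escape build the escaped pattern.
mask_re_escape = names.str.contains(re.escape('CO2 emissions (metric'))

# Option 3: turn regex matching off and match the literal substring.
mask_literal = names.str.contains('CO2 emissions (metric', regex=False)

print(mask_escaped.tolist())
```

All three masks select the same rows; `regex=False` is probably the least error-prone choice when the search text is a fixed string.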
In [12]:
stage.head()
Out[12]:
In [13]:
# get the years
years = stage['Year'].values
# get the values
co2 = stage['Value'].values
# create a bar chart
plt.bar(years,co2)
plt.show()
Turns out emissions per capita have dropped a bit over time, but let's make this graphic a bit more appealing before we continue to explore it.
In [14]:
# switch to a line plot
plt.plot(stage['Year'].values, stage['Value'].values)
# Label the axes
plt.xlabel('Year')
plt.ylabel(stage['IndicatorName'].iloc[0])
#label the figure
plt.title('CO2 Emissions in USA')
# to make it more honest, start the y axis at 0
plt.axis([1959, 2011,0,25])
plt.show()
In [15]:
# If you want to include only those within one standard deviation of the mean, you could do the following
# lower = stage['Value'].mean() - stage['Value'].std()
# upper = stage['Value'].mean() + stage['Value'].std()
# hist_data = [x for x in stage[:10000]['Value'] if x>lower and x<upper ]
# Otherwise, let's look at all the data
hist_data = stage['Value'].values
In [16]:
print(len(hist_data))
In [17]:
# the histogram of the data
plt.hist(hist_data, 10, density=False, facecolor='green')
plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Years')
plt.title('Histogram Example')
plt.grid(True)
plt.show()
So the USA has many years in which it produced between 19 and 20 metric tons per capita, with outliers on either side.
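If you'd rather read the bin counts than eyeball the bars, `np.histogram` computes equal-width bins and returns the counts together with the bin edges. A minimal sketch follows. The `values` array is made up and only stands in for the USA series.

```python
import numpy as np

# Made-up per-capita values standing in for the USA series.
values = np.array([16.0, 17.2, 18.5, 19.1, 19.3, 19.6, 19.8, 20.2, 21.0, 22.5])

# 10 equal-width bins over the data range, mirroring the 10 bins used in the plot.
counts, edges = np.histogram(values, bins=10)

# Locate the fullest bin and report its range.
fullest = int(np.argmax(counts))
print("bin counts:", counts.tolist())
print("fullest bin:", edges[fullest], "to", edges[fullest + 1])
```

In the notebook, passing `hist_data` instead of `values` should give the numbers behind the green bars.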
In [18]:
# select CO2 emissions for all countries in 2011
hist_indicator = r'CO2 emissions \(metric'
hist_year = 2011
mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['Year'].isin([hist_year])
# apply our mask
co2_2011 = data[mask1 & mask2]
co2_2011.head()
Out[18]:
For how many countries do we have CO2 per capita emissions data in 2011?
In [19]:
print(len(co2_2011))
In [20]:
# let's plot a histogram of the emissions per capita by country
# subplots returns a tuple with the figure and axes objects.
fig, ax = plt.subplots()
ax.annotate("USA",
            xy=(18, 5), xycoords='data',
            xytext=(18, 30), textcoords='data',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"),
            )
plt.hist(co2_2011['Value'], 10, density=False, facecolor='green')
plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Countries')
plt.title('Histogram of CO2 Emissions Per Capita')
#plt.axis([10, 22, 0, 14])
plt.grid(True)
plt.show()
So the USA, at roughly 18 metric tons of CO2 per capita, is quite high among all countries.
An interesting next step, which we'll save for you, would be to explore how this relates to other industrialized nations and to look at the outliers with those values in the 40s!
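As a starting point for that exercise, here is a minimal sketch of how the outliers could be pulled out with a boolean filter. The `co2_toy` frame, its country labels, and its values are placeholders, not real data. In the notebook you would filter `co2_2011` instead.

```python
import pandas as pd

# Hypothetical 2011 values; country labels and numbers are placeholders, not real data.
co2_toy = pd.DataFrame({
    'CountryName': ['Country A', 'Country B', 'Country C', 'Country D'],
    'Value': [4.9, 17.0, 44.0, 0.3],
})

# Keep only rows above the threshold and list the largest first.
threshold = 40
outliers = co2_toy[co2_toy['Value'] > threshold].sort_values('Value', ascending=False)
print(outliers)
```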
Matplotlib: Basic Plotting, Part 2
In [21]:
# select GDP per capita for the United States
hist_indicator = r'GDP per capita \(constant 2005'
hist_country = 'USA'
mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['CountryCode'].str.contains(hist_country)
# gdp_stage is just those rows matching the USA country code and GDP per capita over time.
gdp_stage = data[mask1 & mask2]
#plot gdp_stage vs stage
In [22]:
gdp_stage.head(2)
Out[22]:
In [23]:
stage.head(2)
Out[23]:
In [24]:
# switch to a line plot
plt.plot(gdp_stage['Year'].values, gdp_stage['Value'].values)
# Label the axes
plt.xlabel('Year')
plt.ylabel(gdp_stage['IndicatorName'].iloc[0])
#label the figure
plt.title('GDP Per Capita USA')
# to make it more honest, start the y axis at 0
#plt.axis([1959, 2011,0,25])
plt.show()
So although we've seen a decline in CO2 emissions per capita, it does not seem to translate into a decline in GDP per capita.
In [25]:
print("GDP Min Year = ", gdp_stage['Year'].min(), "max: ", gdp_stage['Year'].max())
print("CO2 Min Year = ", stage['Year'].min(), "max: ", stage['Year'].max())
We have 3 extra years of GDP data, so let's trim those off so that the scatter plot compares arrays of equal length (scatter requires x and y to be the same size).
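Trimming by a cutoff year works here because the two series start in the same year and have no gaps. A slightly more defensive alternative, sketched below under that caveat, is to join the two tables on `Year` so that only years present in both survive. The `gdp_toy` and `co2_toy` frames are made-up stand-ins for `gdp_stage` and `stage`.

```python
import pandas as pd

# Made-up stand-ins for gdp_stage and stage; GDP runs three years longer.
gdp_toy = pd.DataFrame({'Year': [2009, 2010, 2011, 2012, 2013, 2014],
                        'Value': [100.0, 102.0, 103.0, 105.0, 106.0, 108.0]})
co2_toy = pd.DataFrame({'Year': [2009, 2010, 2011],
                        'Value': [17.2, 17.5, 17.0]})

# An inner join on Year keeps only the years present in both tables.
paired = gdp_toy.merge(co2_toy, on='Year', suffixes=('_gdp', '_co2'))
print(paired)
```

The resulting `Value_gdp` and `Value_co2` columns are guaranteed to line up row by row, so they can be handed straight to a scatter plot.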
In [26]:
gdp_stage_trunc = gdp_stage[gdp_stage['Year'] < 2012]
print(len(gdp_stage_trunc))
print(len(stage))
In [27]:
%matplotlib inline
import matplotlib.pyplot as plt
fig, axis = plt.subplots()
# Grid lines, Xticks, Xlabel, Ylabel
axis.yaxis.grid(True)
axis.set_title('CO2 Emissions vs. GDP (per capita)',fontsize=10)
axis.set_xlabel(gdp_stage_trunc['IndicatorName'].iloc[0],fontsize=10)
axis.set_ylabel(stage['IndicatorName'].iloc[0],fontsize=10)
X = gdp_stage_trunc['Value']
Y = stage['Value']
axis.scatter(X, Y)
plt.show()
This doesn't look like a strong relationship. We can test this by looking at correlation.
In [28]:
np.corrcoef(gdp_stage_trunc['Value'],stage['Value'])
Out[28]:
A correlation of 0.07 is pretty weak, but you'll learn more about correlation in the next course.
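A quick note on reading that output: `np.corrcoef` returns a 2x2 matrix rather than a single number. The diagonal holds each series' correlation with itself, and the off-diagonal entries hold the coefficient we care about. A minimal sketch with made-up values:

```python
import numpy as np

# Made-up values for illustration; y is an exact linear function of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# corrcoef returns a 2x2 matrix: diagonal entries are 1, off-diagonals are r.
matrix = np.corrcoef(x, y)
r = matrix[0, 1]
print(matrix.shape, round(r, 6))
```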
You could continue to explore this to see if other countries have a closer relationship between CO2 emissions and GDP. Perhaps it is stronger for developing countries?
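One possible way to set up that comparison, sketched here on made-up data, is to pivot the long table so each indicator becomes a column and then compute the correlation per country. The country codes `'AAA'` and `'BBB'` and the short indicator names `'GDP'` and `'CO2'` are placeholders. With the real table you would first filter `data` down to the two indicators of interest.

```python
import pandas as pd

# Made-up long-format rows mimicking the Indicators layout; 'AAA' and 'BBB' are placeholder codes.
rows = []
for year, gdp, co2_a, co2_b in [(2000, 1.0, 1.0, 3.0), (2001, 2.0, 2.0, 2.0), (2002, 3.0, 3.0, 1.0)]:
    rows += [
        {'CountryCode': 'AAA', 'IndicatorName': 'GDP', 'Year': year, 'Value': gdp},
        {'CountryCode': 'AAA', 'IndicatorName': 'CO2', 'Year': year, 'Value': co2_a},
        {'CountryCode': 'BBB', 'IndicatorName': 'GDP', 'Year': year, 'Value': gdp},
        {'CountryCode': 'BBB', 'IndicatorName': 'CO2', 'Year': year, 'Value': co2_b},
    ]
toy = pd.DataFrame(rows)

# Reshape to one row per (country, year) with one column per indicator.
wide = toy.pivot_table(index=['CountryCode', 'Year'], columns='IndicatorName', values='Value')

# Pearson correlation between the two indicator columns, computed per country.
corr_by_country = {
    code: group['GDP'].corr(group['CO2'])
    for code, group in wide.groupby(level='CountryCode')
}
print(corr_by_country)
```

Sorting the resulting values would show which countries have the tightest link between the two indicators.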
In [29]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;