Week 5 - Data Visualization

(2017-07-29 21:27)

5.1. Introduction to Data Visualization

5.1.1. Data Visualization

5.1.2. Role of Visualization

"The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades... Because now we really do have essentially free and ubiquitous data."
- Hal Varian, Google's Chief Economist

"The representation and presentation of data to facilitate understanding." [Kirk, 2016]

5.1.3. Types of Visualizations

Two key categories:

  • Conceptual or data-driven (a demand-supply curve is a conceptual example)
  • Declarative or exploratory

5.1.4. Key Design Principles

Good data visualization is:

  • Trustworthy
  • Accessible
  • Elegant

5.1.5. Visualization Discussion

5.2. Matplotlib and Other Libraries

5.2.1. Notebooks for Week 5

5.2.2. Matplotlib

Why Matplotlib: "It tries to make easy things easy and hard things possible." Other libraries include Seaborn, ggplot, Altair, Bokeh, Plotly, and Folium.

5.2.3. World Development Indicators


In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('./Indicators.csv.zip')
data.shape


Out[2]:
(5656458, 6)

In [3]:
data.head(10)


Out[3]:
CountryName CountryCode IndicatorName IndicatorCode Year Value
0 Arab World ARB Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 1960 1.335609e+02
1 Arab World ARB Age dependency ratio (% of working-age populat... SP.POP.DPND 1960 8.779760e+01
2 Arab World ARB Age dependency ratio, old (% of working-age po... SP.POP.DPND.OL 1960 6.634579e+00
3 Arab World ARB Age dependency ratio, young (% of working-age ... SP.POP.DPND.YG 1960 8.102333e+01
4 Arab World ARB Arms exports (SIPRI trend indicator values) MS.MIL.XPRT.KD 1960 3.000000e+06
5 Arab World ARB Arms imports (SIPRI trend indicator values) MS.MIL.MPRT.KD 1960 5.380000e+08
6 Arab World ARB Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 1960 4.769789e+01
7 Arab World ARB CO2 emissions (kt) EN.ATM.CO2E.KT 1960 5.956399e+04
8 Arab World ARB CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 1960 6.439635e-01
9 Arab World ARB CO2 emissions from gaseous fuel consumption (%... EN.ATM.CO2E.GF.ZS 1960 5.041292e+00

In [4]:
# How many unique country names are there?

countries = data['CountryName'].unique().tolist()
len(countries)


Out[4]:
247

In [5]:
# Are there the same number of country codes?

countryCodes = data['CountryCode'].unique().tolist()
len(countryCodes)


Out[5]:
247

In [6]:
# Are there many indicators or few?

indicators = data['IndicatorName'].unique().tolist()
len(indicators)


Out[6]:
1344

In [7]:
# How many years of data do we have?

years = data['Year'].unique().tolist()
len(years)


Out[7]:
56

In [8]:
# What is the range of years?

print(min(years), 'to', max(years))


1960 to 2015

5.2.4. Basic Plotting in Matplotlib: Part 1

Let's pick a country and an indicator to explore: CO2 emissions per capita in the USA.


In [9]:
# select CO2 emissions for USA
# raw string: str.contains treats the pattern as a regex, so the parenthesis is escaped
hist_indicator = r'CO2 emissions \(metric'
hist_country = 'USA'

mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['CountryCode'].str.contains(hist_country)

# stage is just those indicators matching the USA for country code and Indicator
stage = data[mask1 & mask2]
stage.head()


Out[9]:
CountryName CountryCode IndicatorName IndicatorCode Year Value
22232 United States USA CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 1960 15.999779
48708 United States USA CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 1961 15.681256
77087 United States USA CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 1962 16.013937
105704 United States USA CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 1963 16.482762
134742 United States USA CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 1964 16.968119
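
The masks above rely on `str.contains`, which does pattern matching (a regular expression by default), which is why the parenthesis has to be escaped. As a hedged alternative, the sketch below selects rows by exact equality on `IndicatorCode` and `CountryCode` instead. The tiny `sample` frame is hypothetical: it only mimics the column layout of the Indicators table, using three rows copied from the outputs shown earlier.

```python
import pandas as pd

# Hypothetical miniature of the Indicators table (same column names as the real file)
sample = pd.DataFrame({
    'CountryName': ['United States', 'Arab World', 'Arab World'],
    'CountryCode': ['USA', 'ARB', 'ARB'],
    'IndicatorName': ['CO2 emissions (metric tons per capita)',
                      'CO2 emissions (kt)',
                      'CO2 emissions (metric tons per capita)'],
    'IndicatorCode': ['EN.ATM.CO2E.PC', 'EN.ATM.CO2E.KT', 'EN.ATM.CO2E.PC'],
    'Year': [1960, 1960, 1960],
    'Value': [15.999779, 59563.99, 0.6439635],
})

# Exact comparisons: no regex escaping needed, and no accidental partial matches
is_co2_per_capita = sample['IndicatorCode'] == 'EN.ATM.CO2E.PC'
is_usa = sample['CountryCode'] == 'USA'

usa_co2 = sample[is_co2_per_capita & is_usa]
print(usa_co2[['CountryCode', 'IndicatorCode', 'Year', 'Value']])
```

Matching on the code columns assumes you already know the indicator code; `str.contains` remains handy for exploring indicator names you have not seen yet.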

Let's see how emissions have changed over time using Matplotlib


In [10]:
# get years
years = stage['Year'].values
# get values
co2 = stage['Value'].values

# create a bar chart of emissions by year
plt.bar(years, co2)
plt.show()


Let's make the graphic more appealing


In [11]:
# switch to a line plot
plt.plot(stage['Year'].values, stage['Value'].values)

# label the axes
plt.xlabel('Year')
plt.ylabel(stage['IndicatorName'].iloc[0])

# label the figure
plt.title('CO2 Emissions in USA')

# to make it more honest, start the y-axis at 0
# (the x-axis runs from 1959 to 2011 so that all years from 1960 are visible)
plt.axis([1959, 2011, 0, 25])

plt.show()
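
The axis limits above are hard-coded. As a minimal sketch of the same idea, the block below uses Matplotlib's object-oriented interface and derives the limits from the data, while still pinning the y-axis at 0. The five values are the first rows of the USA output shown earlier; the `Agg` backend line is only an assumption so that the sketch runs without a display, and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# First five USA rows from the table above (year, metric tons per capita)
years = [1960, 1961, 1962, 1963, 1964]
co2 = [15.999779, 15.681256, 16.013937, 16.482762, 16.968119]

fig, ax = plt.subplots()
ax.plot(years, co2)

ax.set_xlabel('Year')
ax.set_ylabel('CO2 emissions (metric tons per capita)')
ax.set_title('CO2 Emissions in USA')

# Pin the lower y-limit at 0 and leave 20% headroom above the largest value
ax.set_ylim(0, max(co2) * 1.2)
# Pad the x-axis by one year on each side of the data
ax.set_xlim(min(years) - 1, max(years) + 1)
```

In a notebook, `plt.show()` (or simply ending the cell with the figure) displays the result.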


Using histograms to explore the distribution of values


In [12]:
# if you want to include just those values within one standard deviation of the mean:
# lower = stage['Value'].mean() - stage['Value'].std()
# upper = stage['Value'].mean() + stage['Value'].std()
# hist_data = [x for x in stage[:10000]['Value'] if x>lower and x<upper]

# Otherwise, let's look at all the data
hist_data = stage['Value'].values

In [13]:
len(hist_data)


Out[13]:
52

In [14]:
# histogram of the data
# 'density' replaces the 'normed' argument, which newer Matplotlib versions no longer accept
plt.hist(hist_data, bins = 10, density = False, facecolor = 'green')

plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Years')
plt.title('Histogram Example')

plt.grid(True)
plt.show()
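
To see the numbers behind a histogram, the binning can be reproduced with `np.histogram`, which is what `plt.hist` uses internally. This is a small sketch on the five USA values shown earlier, not on the full 52-year series, so the counts will differ from the plot above.

```python
import numpy as np

# First five USA values from the table above (metric tons per capita)
values = np.array([15.999779, 15.681256, 16.013937, 16.482762, 16.968119])

# 10 equal-width bins between the smallest and largest value
counts, edges = np.histogram(values, bins=10)

# Each bin is described by its left and right edge
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f'{left:6.2f} to {right:6.2f}: {count} year(s)')
```

There is always one more edge than there are bins, and the counts add up to the number of observations.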


But how do the USA's numbers relate to those of other countries?


In [15]:
# select CO2 emissions for all countries in 2011
hist_indicator = r'CO2 emissions \(metric'
hist_year = 2011

mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['Year'].isin([hist_year])

# apply our mask
co2_2011 = data[mask1 & mask2]
co2_2011.head()


Out[15]:
CountryName CountryCode IndicatorName IndicatorCode Year Value
5026275 Arab World ARB CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 4.724500
5026788 Caribbean small states CSS CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 9.692960
5027295 Central Europe and the Baltics CEB CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 6.911131
5027870 East Asia & Pacific (all income levels) EAS CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 5.859548
5028456 East Asia & Pacific (developing only) EAP CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 5.302499

In [16]:
len(co2_2011)


Out[16]:
232

In [17]:
# let's plot a histogram of the emissions per capita by country

# subplots returns a tuple with the figure and axis objects
fig, ax = plt.subplots()

ax.annotate('USA', xy = (18, 5), xycoords = 'data', xytext = (18, 30), textcoords = 'data',
           arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3'),)

plt.hist(co2_2011['Value'], 10, density = False, facecolor = 'green')

plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Countries')
plt.title('Histogram of CO2 Emissions Per Capita')

plt.grid(True)

plt.show()
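
The annotation above places the arrow at a hand-picked x position. A hedged variant is to look the value up in the data, so that the arrow follows the data if the year or country changes. In the sketch below the `sample_2011` frame is hypothetical: three values come from the 2011 output above, while the USA value of 17.0 is a placeholder chosen only for illustration. The `Agg` backend line is likewise an assumption to let the sketch run without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature of co2_2011; the USA value is a placeholder, not real data
sample_2011 = pd.DataFrame({
    'CountryCode': ['ARB', 'CSS', 'CEB', 'USA'],
    'Value': [4.724500, 9.692960, 6.911131, 17.0],
})

# Look up the x position for the annotation instead of hard-coding it
usa_value = sample_2011.loc[sample_2011['CountryCode'] == 'USA', 'Value'].iloc[0]

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(sample_2011['Value'], bins=10, facecolor='green')

# Arrow tip sits at the USA's value; the label is placed above the tallest bar
label = ax.annotate('USA', xy=(usa_value, 1), xycoords='data',
                    xytext=(usa_value, counts.max() + 1), textcoords='data',
                    ha='center',
                    arrowprops=dict(arrowstyle='->', connectionstyle='arc3'))

ax.set_ylim(0, counts.max() + 2)
ax.set_xlabel('CO2 emissions (metric tons per capita)')
ax.set_ylabel('# of Countries')
```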


5.2.5. Basic Plotting in Matplotlib: Part 2

Relationship between GDP and CO2 Emissions in USA


In [18]:
# select GDP per capita for USA
hist_indicator = r'GDP per capita \(constant 2005'
hist_country = 'USA'

mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['CountryCode'].str.contains(hist_country)

# gdp_stage is just those rows matching USA and GDP per capita
gdp_stage = data[mask1 & mask2]
gdp_stage.head()


Out[18]:
CountryName CountryCode IndicatorName IndicatorCode Year Value
22282 United States USA GDP per capita (constant 2005 US$) NY.GDP.PCAP.KD 1960 15482.707760
48759 United States USA GDP per capita (constant 2005 US$) NY.GDP.PCAP.KD 1961 15578.409657
77142 United States USA GDP per capita (constant 2005 US$) NY.GDP.PCAP.KD 1962 16276.426685
105760 United States USA GDP per capita (constant 2005 US$) NY.GDP.PCAP.KD 1963 16749.789436
134798 United States USA GDP per capita (constant 2005 US$) NY.GDP.PCAP.KD 1964 17476.822248

In [19]:
# switch to a line plot
plt.plot(gdp_stage['Year'].values, gdp_stage['Value'].values)

# label the axes
plt.xlabel('Year')
plt.ylabel(gdp_stage['IndicatorName'].iloc[0])

# label the figure
plt.title('GDP Per Capita USA')

plt.show()


Scatter plot comparing GDP against CO2 emissions


In [20]:
# make sure we're looking at the same time frames

print('GDP Min Year:', gdp_stage['Year'].min(), ', Max Year:', gdp_stage['Year'].max())
print('CO2 Min Year:', stage['Year'].min(), 'Max Year:', stage['Year'].max())


GDP Min Year: 1960 , Max Year: 2014
CO2 Min Year: 1960 Max Year: 2011

In [21]:
gdp_stage_trunc = gdp_stage[gdp_stage['Year'] < 2012]
print(len(gdp_stage_trunc))
print(len(stage))


52
52
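
Truncating by year works here because both series start in 1960 and have no gaps. A more defensive option, sketched below under that caveat, is an inner merge on `Year`, which keeps only the years present in both frames and guarantees that each GDP value is paired with the CO2 value of the same year. The two small frames are hypothetical stand-ins for `gdp_stage` and `stage`, filled with values from the outputs above.

```python
import pandas as pd

# Hypothetical stand-ins: GDP covers one more year than CO2
gdp = pd.DataFrame({
    'Year': [1960, 1961, 1962, 1963],
    'Value': [15482.707760, 15578.409657, 16276.426685, 16749.789436],
})
co2 = pd.DataFrame({
    'Year': [1960, 1961, 1962],
    'Value': [15.999779, 15.681256, 16.013937],
})

# The inner merge keeps only years found in both frames;
# suffixes disambiguate the two 'Value' columns
merged = gdp.merge(co2, on='Year', how='inner', suffixes=('_gdp', '_co2'))
print(merged)
```

The resulting `Value_gdp` and `Value_co2` columns have the same length by construction, so they can be passed straight to a scatter plot or a correlation.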

In [22]:
%matplotlib inline

fig, axis = plt.subplots()

# Grid lines, xticks, xlabels, ylabel
axis.yaxis.grid(True)
axis.set_title('CO2 Emissions vs. GDP (per capita)', fontsize = 10)
axis.set_xlabel(gdp_stage_trunc['IndicatorName'].iloc[0], fontsize = 10)
axis.set_ylabel(stage['IndicatorName'].iloc[0], fontsize = 10)

X = gdp_stage_trunc['Value']
Y = stage['Value']

axis.scatter(X, Y)
plt.show()



In [23]:
np.corrcoef(gdp_stage_trunc['Value'], stage['Value'])


Out[23]:
array([[ 1.        ,  0.07676005],
       [ 0.07676005,  1.        ]])

A correlation of 0.07 is pretty weak!
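
The off-diagonal entry returned by `np.corrcoef` is the Pearson correlation coefficient, which measures only linear association. As a sketch of what that number means, the block below computes it by hand from the deviations around each mean and compares it with NumPy's result. It uses only the first five years of each series shown earlier, so it does not reproduce the 0.07 figure for the full 52 years.

```python
import numpy as np

# First five USA rows from the tables above
gdp = np.array([15482.707760, 15578.409657, 16276.426685, 16749.789436, 17476.822248])
co2 = np.array([15.999779, 15.681256, 16.013937, 16.482762, 16.968119])

# Deviations of each series from its own mean
d_gdp = gdp - gdp.mean()
d_co2 = co2 - co2.mean()

# Pearson r: sum of deviation products, scaled by both sums of squares
r_manual = (d_gdp * d_co2).sum() / np.sqrt((d_gdp ** 2).sum() * (d_co2 ** 2).sum())

# Same quantity as the off-diagonal entry of the correlation matrix
r_numpy = np.corrcoef(gdp, co2)[0, 1]

print(r_manual, r_numpy)
```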

5.2.6. Matplotlib Additional Examples

The additional notebooks contain examples of 3D plots, color-coded bubble plots, and boxplots.


In [24]:
country_geo = './world-countries.json'

In [25]:
data.head()


Out[25]:
CountryName CountryCode IndicatorName IndicatorCode Year Value
0 Arab World ARB Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 1960 1.335609e+02
1 Arab World ARB Age dependency ratio (% of working-age populat... SP.POP.DPND 1960 8.779760e+01
2 Arab World ARB Age dependency ratio, old (% of working-age po... SP.POP.DPND.OL 1960 6.634579e+00
3 Arab World ARB Age dependency ratio, young (% of working-age ... SP.POP.DPND.YG 1960 8.102333e+01
4 Arab World ARB Arms exports (SIPRI trend indicator values) MS.MIL.XPRT.KD 1960 3.000000e+06

In [26]:
# pull out CO2 emissions for 2011

hist_indicator = r'CO2 emissions \(metric'
hist_year = 2011

mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['Year'].isin([hist_year])

# apply our masks
stage = data[mask1 & mask2]
stage.head()


Out[26]:
CountryName CountryCode IndicatorName IndicatorCode Year Value
5026275 Arab World ARB CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 4.724500
5026788 Caribbean small states CSS CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 9.692960
5027295 Central Europe and the Baltics CEB CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 6.911131
5027870 East Asia & Pacific (all income levels) EAS CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 5.859548
5028456 East Asia & Pacific (developing only) EAP CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC 2011 5.302499

In [27]:
# create a frame with just the country code and the values we want to plot
plot_data = stage[['CountryCode', 'Value']]
plot_data.head()


Out[27]:
CountryCode Value
5026275 ARB 4.724500
5026788 CSS 9.692960
5027295 CEB 6.911131
5027870 EAS 5.859548
5028456 EAP 5.302499

In [28]:
# label for the legend
hist_indicator = stage.iloc[0]['IndicatorName']

Visualize CO2 emissions per capita using Folium


In [29]:
# set up a folium map at a high-level zoom; location is the [latitude, longitude] of the map center
import folium

map = folium.Map(location = [100, 0], zoom_start = 1.5)

In [30]:
# choropleth maps bind Pandas DataFrames and JSON geometries
# (folium.Choropleth replaces the older map.choropleth(geo_path = ...) call)
folium.Choropleth(geo_data = country_geo, data = plot_data,
                  columns = ['CountryCode', 'Value'],
                  key_on = 'feature.id',
                  fill_color = 'YlGnBu', fill_opacity = 0.7, line_opacity = 0.2,
                  legend_name = hist_indicator).add_to(map)

In [31]:
map.save('plot_data.html')

In [32]:
# embed the interactive HTML file saved by folium in the notebook
from IPython.display import HTML

HTML('<iframe src="plot_data.html" width="900" height="700"></iframe>')


Out[32]:

5.2.8. Visualization Libraries

Specific use cases:

Specialized statistical plots, such as automatically fitting a linear regression with a confidence interval, or scatter plots color-coded by category.
    seaborn: it builds on top of Matplotlib and can also be used alongside plain Matplotlib code just for an easier way to specify color palettes and plotting aesthetics

Grammar-of-graphics plotting: if you find the Matplotlib interface too verbose, Python offers packages built on a different plotting paradigm, modeled on R's ggplot2.
    ggplot: it provides similar functionality to Matplotlib and is also based on Matplotlib, but provides a different interface.
    altair: it has a simpler interface than ggplot and generates JavaScript-based plots that are easy to embed in the Jupyter Notebook or export as PNG.

Interactive plots (e.g., with pan and zoom) that work in Jupyter Notebooks and can also be exported as JavaScript to work standalone on a web page.
    bokeh: maintained by Continuum Analytics, the company behind Anaconda
    plotly: is both a library and a cloud service where you can store and share your visualizations (it has free/paid accounts)

Interactive map visualization
    folium: creates HTML pages that include the Leaflet.js JavaScript mapping library to display data on top of maps.
    plotly: it supports color-coded country/world maps embedded in the Jupyter Notebook.

Real-time plots that update with streaming data, optionally integrated into a dashboard with user interaction.
    bokeh plot server: it is part of Bokeh but requires launching a separate Python process that responds to events from the user interface or to streaming data updates.

3D plots are not easy to interpret; it is worth first considering whether a combination of 2D plots could provide better insight into the data.
    mplot3d: Matplotlib toolkit for 3D visualization

5.2.9. Coding Practice

5.2.10. Dataset Discussion

5.3. Case Studies

5.3.1. Cholera

Cholera is a bacterial infection that causes severe diarrhea and can lead to death by dehydration. It remains a public health threat, with an estimated 1.3 to 4 million cases and 21,000 to 143,000 deaths worldwide each year. It spreads through poor sanitation, when sewage contaminates the water and food supply. John Snow was a London anesthesiologist who discovered how cholera spreads.

The Ghost Map - the story of London's most terrifying epidemic, how it changed science, cities, and the modern world - by Steven Johnson.

5.3.2. Napoleon's March

Minard's graphic of the French campaign in Russia in 1812.

5.3.3. Interactive Visualization World Data

5.4. Week 5: Assessment

(2017-07-30 22:15)