(2017-07-29 21:27)
"The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades... Because now we really do have essentially free and ubiquitous data."
- Hal Varian, Google's Chief Economist
"The representation and presentation of data to facilitate understanding." [Kirk, 2016]
Good data visualization is:
In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
In [2]:
data = pd.read_csv('./Indicators.csv.zip')
data.shape
Out[2]:
In [3]:
data.head(10)
Out[3]:
In [4]:
# How many unique country names are there?
countries = data['CountryName'].unique().tolist()
len(countries)
Out[4]:
In [5]:
# Are there the same number of country codes?
countryCodes = data['CountryCode'].unique().tolist()
len(countryCodes)
Out[5]:
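Comparing the two counts only shows that the totals agree. As a minimal sketch on a made-up miniature table (the rows below are hypothetical, not taken from Indicators.csv), one way to check that each name really pairs with exactly one code is to count distinct partners per group:

```python
import pandas as pd

# Hypothetical miniature of the indicators table (made-up rows, not the real data).
sample = pd.DataFrame({
    'CountryName': ['United States', 'United States', 'Canada', 'Canada', 'Mexico'],
    'CountryCode': ['USA', 'USA', 'CAN', 'CAN', 'MEX'],
})

# Number of distinct codes seen for each country name.
codes_per_name = sample.groupby('CountryName')['CountryCode'].nunique()
# Number of distinct names seen for each country code.
names_per_code = sample.groupby('CountryCode')['CountryName'].nunique()

# The mapping is one-to-one only if every group has exactly one partner.
is_one_to_one = bool((codes_per_name == 1).all() and (names_per_code == 1).all())
print(is_one_to_one)
```

The same two groupby calls should work on the full frame by replacing `sample` with `data`.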
In [6]:
# Are there many indicators or few?
indicators = data['IndicatorName'].unique().tolist()
len(indicators)
Out[6]:
In [7]:
# How many years of data do we have?
years = data['Year'].unique().tolist()
len(years)
Out[7]:
In [8]:
# What is the range of years?
print(min(years), 'to', max(years))
In [9]:
# select CO2 emissions for USA
hist_indicator = r'CO2 emissions \(metric'
hist_country = 'USA'
mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['CountryCode'].str.contains(hist_country)
# stage is just those indicators matching the USA for country code and Indicator
stage = data[mask1 & mask2]
stage.head()
Out[9]:
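The indicator filter escapes the opening parenthesis because `str.contains` treats its pattern as a regular expression by default. The sketch below, on a few hypothetical indicator names, shows three ways to match the literal text that should give the same mask; `names` and `pattern_text` are illustrative names only.

```python
import re
import pandas as pd

# Hypothetical indicator names, shaped like the ones in the dataset.
names = pd.Series([
    'CO2 emissions (metric tons per capita)',
    'CO2 emissions (kt)',
    'GDP per capita (constant 2005 US$)',
])

pattern_text = 'CO2 emissions (metric'

# Option 1: escape regex metacharacters by hand, using a raw string.
mask_manual = names.str.contains(r'CO2 emissions \(metric')
# Option 2: let re.escape build the escaped pattern.
mask_escaped = names.str.contains(re.escape(pattern_text))
# Option 3: turn regex matching off and search for the literal substring.
mask_literal = names.str.contains(pattern_text, regex=False)

print(mask_literal.tolist())
```

Option 3 is probably the least error-prone when no regex features are needed.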
Let's see how emissions have changed over time using Matplotlib
In [10]:
# get years
years = stage['Year'].values
# get values
co2 = stage['Value'].values
# create a bar chart of emissions by year
plt.bar(years, co2)
plt.show()
Let's make the graphic more appealing
In [11]:
# switch to a line plot
plt.plot(stage['Year'].values, stage['Value'].values)
# label the axes
plt.xlabel('Year')
plt.ylabel(stage['IndicatorName'].iloc[0])
# label the figure
plt.title('CO2 Emissions in USA')
# to make it more honest, start the y-axis at 0
plt.axis([1969, 2011, 0, 25])
plt.show()
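Hard-coding all four axis limits works, but the limits have to be updated by hand whenever the data changes. A possible alternative, sketched here with made-up values, is to pin only the lower y-limit and let Matplotlib choose the rest. The `Agg` backend line is only there so the sketch runs outside a notebook and can be dropped when plotting inline.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up values for illustration only, not the real CO2 series.
years = [2000, 2001, 2002, 2003, 2004]
values = [20.2, 19.7, 19.6, 19.6, 19.7]

fig, ax = plt.subplots()
ax.plot(years, values)
ax.set_xlabel('Year')
ax.set_ylabel('Value (illustrative)')
# Pin only the lower y-limit to zero; the other limits still follow the data.
ax.set_ylim(bottom=0)

print(ax.get_ylim())
```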
Using histograms to explore the distribution of values
In [12]:
# if you want to include just those within one s.d.:
# lower = stage['Value'].mean() - stage['Value'].std()
# upper = stage['Value'].mean() + stage['Value'].std()
# hist_data = [x for x in stage[:10000]['Value'] if x>lower and x<upper]
# Otherwise, let's look at all the data
hist_data = stage['Value'].values
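The commented-out filter keeps only values within one standard deviation of the mean, using a list comprehension. A vectorized version of the same idea might look like the sketch below; the numbers are made up, and `values` stands in for `stage['Value']`.

```python
import pandas as pd

# Made-up values for illustration only.
values = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 40.0])

mean, std = values.mean(), values.std()  # pandas uses the sample s.d. (ddof=1)
lower, upper = mean - std, mean + std

# Boolean mask with strict bounds, matching the x > lower and x < upper test.
within_one_sd = values[(values > lower) & (values < upper)]
print(len(within_one_sd), 'of', len(values), 'values kept')
```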
In [13]:
len(hist_data)
Out[13]:
In [14]:
# histogram of the data
plt.hist(hist_data, bins = 10, density = False, facecolor = 'green')
plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Years')
plt.title('Histogram Example')
plt.grid(True)
plt.show()
But how do the USA's numbers relate to those of other countries?
In [15]:
# select CO2 emissions for all countries in 2011
hist_indicator = r'CO2 emissions \(metric'
hist_year = 2011
mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['Year'].isin([hist_year])
# apply our mask
co2_2011 = data[mask1 & mask2]
co2_2011.head()
Out[15]:
In [16]:
len(co2_2011)
Out[16]:
In [17]:
# let's plot a histogram of the emissions per capita by country
# subplots returns a tuple with the figure, axis attributes
fig, ax = plt.subplots()
ax.annotate('USA', xy = (18, 5), xycoords = 'data', xytext = (18, 30), textcoords = 'data',
            arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3'))
plt.hist(co2_2011['Value'], 10, density = False, facecolor = 'green')
plt.xlabel(stage['IndicatorName'].iloc[0])
plt.ylabel('# of Countries')
plt.title('Histogram of CO2 Emissions Per Capita')
plt.grid(True)
plt.show()
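The annotation above places the USA arrow at hand-picked coordinates. As a rough sketch of how the position could be derived instead, `np.histogram` returns the same counts and bin edges that `plt.hist` draws, so the bar containing a given value can be looked up. The values and `usa_value` below are hypothetical.

```python
import numpy as np

# Made-up per-capita values for a handful of hypothetical countries.
values = np.array([0.5, 1.2, 2.8, 4.1, 6.3, 7.7, 9.4, 17.0, 20.0])
usa_value = 17.0  # hypothetical stand-in for the USA's figure

# np.histogram computes the same counts and bin edges that plt.hist draws.
counts, edges = np.histogram(values, bins=10)

# Index of the bin containing usa_value (clipped so the maximum lands in the last bin).
bin_index = min(np.searchsorted(edges, usa_value, side='right') - 1, len(counts) - 1)
bar_center = (edges[bin_index] + edges[bin_index + 1]) / 2
bar_height = counts[bin_index]

print(bar_center, bar_height)
```

`bar_center` and `bar_height` could then be passed as the `xy` argument of `ax.annotate`.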
In [18]:
# select GDP per capita for USA
hist_indicator = r'GDP per capita \(constant 2005'
hist_country = 'USA'
mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['CountryCode'].str.contains(hist_country)
# stage is just those indicators matching USA and GDP
gdp_stage = data[mask1 & mask2]
gdp_stage.head()
Out[18]:
In [19]:
# switch to a line plot
plt.plot(gdp_stage['Year'].values, gdp_stage['Value'].values)
# label the axes
plt.xlabel('Year')
plt.ylabel(gdp_stage['IndicatorName'].iloc[0])
# label the figure
plt.title('GDP Per Capita USA')
plt.show()
Scatter plot comparing GDP against CO2 emissions
In [20]:
# make sure we're looking at the same time frames
print('GDP Min Year:', gdp_stage['Year'].min(), 'Max Year:', gdp_stage['Year'].max())
print('CO2 Min Year:', stage['Year'].min(), 'Max Year:', stage['Year'].max())
In [21]:
gdp_stage_trunc = gdp_stage[gdp_stage['Year'] < 2012]
print(len(gdp_stage_trunc))
print(len(stage))
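Truncating by year and comparing lengths assumes the two frames list the same years in the same order. A more defensive option, sketched below with made-up numbers, is to join the two series on `Year` so each GDP value is paired with the CO2 value from the same year. The frame names `gdp`, `co2` and `paired` are illustrative.

```python
import pandas as pd

# Made-up values for illustration; the two frames cover different year ranges.
gdp = pd.DataFrame({'Year': [2009, 2010, 2011, 2012, 2013],
                    'Value': [100.0, 102.0, 103.0, 105.0, 106.0]})
co2 = pd.DataFrame({'Year': [2008, 2009, 2010, 2011],
                    'Value': [18.5, 17.2, 17.5, 17.0]})

# An inner merge keeps only the years present in both frames and pairs
# each GDP value with the CO2 value from the same year.
paired = gdp.merge(co2, on='Year', how='inner', suffixes=('_gdp', '_co2'))
print(paired)
```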
In [22]:
%matplotlib inline
fig, axis = plt.subplots()
# Grid lines, xticks, xlabels, ylabel
axis.yaxis.grid(True)
axis.set_title('CO2 Emissions vs. GDP (per capita)', fontsize = 10)
axis.set_xlabel(gdp_stage_trunc['IndicatorName'].iloc[0], fontsize = 10)
axis.set_ylabel(stage['IndicatorName'].iloc[0], fontsize = 10)
X = gdp_stage_trunc['Value']
Y = stage['Value']
axis.scatter(X, Y)
plt.show()
In [23]:
np.corrcoef(gdp_stage_trunc['Value'], stage['Value'])
Out[23]:
A correlation coefficient of 0.07 is pretty weak!
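`np.corrcoef` returns a 2x2 matrix rather than a single number; the off-diagonal entry is the Pearson coefficient for the pair. The sketch below uses made-up data (not the GDP and CO2 series) to pull that entry out and check it against the textbook definition.

```python
import numpy as np

# Made-up paired observations for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r for (x, y).
r_matrix = np.corrcoef(x, y)
r = r_matrix[0, 1]

# The same coefficient from the definition: the sum of products of deviations
# divided by the square root of the product of the summed squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r, 3), round(r_manual, 3))
```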
Use the Folium library for geographic overlays. The country boundaries come from a GeoJSON file in the Folium repository: https://github.com/python-visualization/folium/raw/master/examples/data/world-countries.json
Source page: https://github.com/python-visualization/folium/blob/master/examples/data/world-countries.json
In [24]:
country_geo = './world-countries.json'
In [25]:
data.head()
Out[25]:
In [26]:
# pull out CO2 emissions for 2011
hist_indicator = r'CO2 emissions \(metric'
hist_year = 2011
mask1 = data['IndicatorName'].str.contains(hist_indicator)
mask2 = data['Year'].isin([hist_year])
# apply our masks
stage = data[mask1 & mask2]
stage.head()
Out[26]:
In [27]:
# create a frame with just the country code and the values we want to plot
plot_data = stage[['CountryCode', 'Value']]
plot_data.head()
Out[27]:
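The choropleth joins each row to a map polygon by matching `CountryCode` against the `id` of each GeoJSON feature. Codes without a matching feature are not drawn, so it can help to list them first. The sketch below assumes a tiny hand-written GeoJSON and made-up values; `WLD` is a hypothetical aggregate code with no polygon.

```python
import pandas as pd

# Hypothetical stand-in for the parsed GeoJSON; only the 'id' fields matter here.
geo = {'type': 'FeatureCollection',
       'features': [{'type': 'Feature', 'id': 'USA'},
                    {'type': 'Feature', 'id': 'CAN'},
                    {'type': 'Feature', 'id': 'MEX'}]}

# Made-up values; 'WLD' stands for a hypothetical aggregate row with no polygon.
plot_rows = pd.DataFrame({'CountryCode': ['USA', 'CAN', 'WLD'],
                          'Value': [17.0, 14.1, 4.9]})

geo_ids = {feature['id'] for feature in geo['features']}

# Rows whose code has a matching polygon, and codes that would be left unmapped.
mappable = plot_rows[plot_rows['CountryCode'].isin(geo_ids)]
unmatched = sorted(set(plot_rows['CountryCode']) - geo_ids)
print(unmatched)
```

With the real file, `geo` would come from `json.load` on the downloaded world-countries.json.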
In [28]:
# label for the legend
hist_indicator = stage.iloc[0]['IndicatorName']
Visualize CO2 emissions per capita using Folium
In [29]:
# set up a Folium map at a high-level zoom; location is [latitude, longitude]
import folium
map = folium.Map(location = [100, 0], zoom_start = 1.5)
In [30]:
# choropleth maps bind Pandas DataFrames to JSON geometries
# (older Folium releases exposed this as map.choropleth(geo_path = ...); newer ones use folium.Choropleth)
folium.Choropleth(geo_data = country_geo, data = plot_data,
                  columns = ['CountryCode', 'Value'],
                  key_on = 'feature.id',
                  fill_color = 'YlGnBu', fill_opacity = 0.7, line_opacity = 0.2,
                  legend_name = hist_indicator).add_to(map)
In [31]:
map.save('plot_data.html')
In [32]:
# import the folium interactive html file
from IPython.display import HTML
HTML('<iframe src="plot_data.html" width="900" height="700"></iframe>')
Out[32]:
Specific use cases:
Specialized statistical plots, such as automatically fitting a linear regression with a confidence interval, or scatter plots color-coded by category.
seaborn: builds on top of Matplotlib; it can also be used in place of Matplotlib's defaults simply as an easier way to specify color palettes and plotting aesthetics.
Grammar-of-graphics plotting: if you find the Matplotlib interface too verbose, Python offers packages built on a different plotting paradigm, modeled on R's ggplot2.
ggplot: provides similar functionality to Matplotlib and is itself built on Matplotlib, but exposes a different interface.
altair: has a simpler interface than ggplot and generates JavaScript-based plots that are easy to embed in a Jupyter Notebook or export as PNG.
Interactive plots (e.g. pan and zoom) that work in Jupyter Notebooks but can also be exported as JavaScript to work standalone on a web page.
bokeh: maintained by Continuum Analytics, the company behind Anaconda
plotly: both a library and a cloud service where you can store and share your visualizations (it has free and paid accounts)
Interactive map visualization
folium: creates HTML pages that include the Leaflet.js JavaScript library to display data on top of maps.
plotly: supports color-coded country/world maps embedded in the Jupyter Notebook.
Real-time plots that update with streaming data, optionally integrated into a dashboard with user interaction.
bokeh plot server: part of Bokeh, but it requires launching a separate Python process that responds to events from the user interface or from streaming data updates.
3D plots are not easy to interpret; it is worth first considering whether a combination of 2D plots could provide better insight into the data.
mplot3d: Matplotlib toolkit for 3D visualization
Cholera is a bacterial infection. It causes severe diarrhea, which can lead to death by dehydration. It remains a public health threat, with an estimated 1.3 to 4 million cases and 21,000 to 143,000 deaths worldwide each year. It spreads through poor sanitation, when sewage contaminates the water and food supply. John Snow was a London anesthesiologist who discovered how cholera spread.
The Ghost Map - the story of London's most terrifying epidemic, how it changed science, cities, and the modern world - by Steven Johnson.
(2017-07-30 22:15)