Arup Sydney Python Club - Session 8
In [7]:
from IPython.display import Image
Image(filename='C:/Users/oliver.lock/Desktop/data-visualization-sample.jpg')
Out[7]:
Today's demand for data visualisation tools goes beyond the standard charts and graphs used in Excel spreadsheets, extending to more sophisticated displays such as infographics, dials and gauges, geographic maps, heat maps, and detailed bar, pie and rose charts. This demand also covers interactive capabilities: there is an expectation that we can create visualisations that let users manipulate and drill into the data for querying and analysis.
Using scripting and programmatic techniques for data analysis and visualisation lets us overcome a number of barriers found in traditional software packages (again, Excel), especially around speed, display, customisation and data size. Python has several options for creating both static and interactive data analysis and visualisation environments.
The Python scientific stack (http://scikit-learn.org/stable/) is fairly mature, and there are libraries for a variety of use cases, including machine learning (http://scikit-learn.org/stable/), network analysis (https://networkx.github.io/) and data analysis (http://pandas.pydata.org/).
Many new Python data visualisation libraries have been created in the past few years to close the gap. This session will focus on 'matplotlib', one of the many libraries available. Some alternative libraries are described below.
Simply put, matplotlib is a graphing library for Python. It has an array of tools that you can use to create anything from simple scatter plots, to sine curves, to 3D graphs. It is used heavily in the scientific Python community for data visualisation. Its plotting interface is designed to closely resemble that of MATLAB (which is used in many engineering courses).
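As a minimal sketch of the sine curve case mentioned above, the cell below plots one period of sin(x). The variable names and the choice of 200 sample points are illustrative assumptions, not part of the original session.

```python
import numpy as np
import matplotlib.pyplot as plt

# 200 evenly spaced sample points over one full period, 0 to 2*pi
theta = np.linspace(0, 2 * np.pi, 200)
sine = np.sin(theta)

# plt.plot returns a list of the line objects it created
lines = plt.plot(theta, sine, label='sin(x)')
plt.xlabel('x (radians)')
plt.ylabel('sin(x)')
plt.legend(loc='best')
```

Using more sample points gives a smoother curve; with only a handful, the straight segments between points become visible.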
You can read more about the ideas behind matplotlib on their website, but I especially recommend taking a look at their gallery to see the amazing things you can pull off with this library.
http://matplotlib.org/gallery.html
Some of the many alternatives to matplotlib include seaborn, ggplot and plotly.
Seaborn: The main difference between matplotlib and seaborn is seaborn's default styles and palettes, which are more aesthetically pleasing and modern. Seaborn is built on top of matplotlib, so it doesn't hurt to learn matplotlib first.
ggplot: ggplot is based on ggplot2 from the statistics package 'R' and on concepts from the 'Grammar of Graphics'. The outputs are much easier to create and much better looking, but it isn't designed for high levels of customisation. It trades fine-grained control for a simple method of plotting (kind of like iOS vs Android). One of its advantages is its close integration with 'pandas', a popular data analysis library. If you are familiar with R or SQL it might be useful to look at this.
Plotly: Plotly is an online platform for data visualisation, and within Python one can use it to create interactive graphs and charts. Plotly’s forte is making interactive plots, and it offers some charts you won’t find in most libraries, like contour plots, dendrograms, and 3D charts. If you are thinking of making something interactive or for the web with your Python code, this would be worth investigating.
In [8]:
# Bus network analysis with a 7 terabyte San Francisco data set spanning ~2008 - 2014.
# The extracts below are from a tool that allows quick graphing, mapping and
# visualisation of any time span, at stop, route and network level, for the very large dataset generated by the
# bus positioning and passenger counting system.
# See: http://pandas.pydata.org/pandas-docs/stable/visualization.html
from IPython.display import Image
Image(filename='C:/Users/oliver.lock/Desktop/bus_loads.png')
Out[8]:
In [74]:
#This combines the bus data with the analytical package 'networkx' to study the network using graph theory
# Generates a visualisation of how network stops perform over a specified time period
#See: https://networkx.github.io/
from IPython.display import Image
Image(filename='C:/Users/oliver.lock/Desktop/bus_graph_theory_degree.png')
Out[74]:
In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines 'graphical displays' and sets out principles for effective graphical display. Many of these ideas have stuck in the industry to date, and the principles below are a staple '101' introduction to data visualisation.
A good data visualisation should:
In [3]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [76]:
x = [1,2,3,4,5,6,7,8,9,10]
y = [1,2,4,8,16,32,64,128,256,512]
In [77]:
#plot with all default settings
plt.plot(x,y)
#! plt.show()
Out[77]:
In [78]:
#flip x and y axis
plt.plot(y,x)
#! plt.show()
Out[78]:
In [79]:
# red dashed line
plt.plot(x,y,'r--')
#!plt.show()
Out[79]:
In [80]:
#a line plot can easily become a scatter plot by changing markers
#plot with blue markers
plt.plot(x,y,'bo')
#!plt.show()
Out[80]:
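The format strings used above, such as 'r--' and 'bo', are shorthand that combines a colour code with a line style or a marker. As a sketch, the calls below spell out the same styling with the keyword arguments color, linestyle and marker, using the same x and y lists as above. The variable names holding the returned lines are illustrative only.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

# 'r--' means: red ('r'), dashed line ('--'), no markers
short_line, = plt.plot(x, y, 'r--')
long_line, = plt.plot(x, y, color='red', linestyle='--')

# 'bo' means: blue ('b'), circle markers ('o'), no connecting line
short_dots, = plt.plot(x, y, 'bo')
long_dots, = plt.plot(x, y, color='blue', marker='o', linestyle='None')
```

The keyword form is longer but easier to read back later, and it also accepts colours that have no one-letter code, such as hex strings.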
In [81]:
#Two datasets, same plot
x1 = [1,2,3,4,5,6,7,8,9,10]
y1 = [1,2,4,8,16,32,64,128,256,512]
x2 = [1,2,3,4,5,6,7,8,9,10,11]
y2 = [10,20,40,89,160,300,640,450,500,510,700]
plt.plot(x1,y1,x2,y2,marker='o',linestyle='None')
#!plt.show()
Out[81]:
In [82]:
#More Control, plotting more than one dataset with alternate styling
x1 = [1,2,3,4,5,6,7,8,9,10,11]
y1 = [1,2,4,8,16,32,64,128,256,512,1024]
x2 = [1,2,3,4,5,6,7,8,9,10,11]
y2 = [10,20,40,89,160,300,640,450,500,510,700]
plt.plot(x1,y1, 'go-', label='Expected', linewidth=2)
plt.plot(x2,y2, 'rs', label='Sampled')
plt.axis([0, 12, 0, 800])
plt.legend(loc='best')
#!plt.show()
Out[82]:
In [1]:
#if you run this command it will give you documentation on styles etc
#!help(plt)
In [4]:
#example bar chart with positive and negative values
import numpy as np
import matplotlib.pyplot as plt
n = 12
X = np.arange(n)
Y1 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)
Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)
plt.bar(X, +Y1, facecolor='#9999ff', edgecolor='white')
plt.bar(X, -Y2, facecolor='#ff9999', edgecolor='white')
#!plt.show()
Out[4]:
In [84]:
#Create a range of years
x = range(1940,2016)
#Create a fluctuating random series (a cumulative sum of random values) the same length as that range
y = np.random.randn(len(x)).cumsum()
In [85]:
plt.plot(x, y)
#!plt.show()
Out[85]:
In [86]:
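# Plot first, with a label, so that plt.legend() has an entry to show
plt.plot(x, y, label='Sea Temperature')
plt.title('Fake Global Sea Temperature over Time')
plt.xlabel('Date')
plt.ylabel('Sea Temperature')
plt.grid(True)
plt.figtext(0.995, 0.01, 'Data Source: Bureau of Astrology, 2016', ha='right', va='bottom')
plt.legend(loc='best', framealpha=0.5, prop={'size':'small'})
# Set the figure size before tight_layout so the layout is fitted to the final size
plt.gcf().set_size_inches(8, 4)
plt.tight_layout(pad=1)
#!plt.show()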
Out[86]:
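The exercises that follow ask for images, so it may help to write a figure to disk rather than only showing it inline. This is a minimal sketch: plt.savefig and its dpi argument are part of pyplot, while the file name sea_temperature.png, the temporary directory and the random seed are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Same kind of fake random-walk series as above, seeded so it is repeatable
np.random.seed(0)
years = range(1940, 2016)
values = np.random.randn(len(years)).cumsum()

plt.figure(figsize=(8, 4))
plt.plot(years, values, label='Fake sea temperature')
plt.title('Fake Global Sea Temperature over Time')
plt.xlabel('Date')
plt.ylabel('Sea Temperature')
plt.legend(loc='best')
plt.tight_layout()

# Hypothetical output location; replace with any folder you can write to
out_path = os.path.join(tempfile.gettempdir(), 'sea_temperature.png')
plt.savefig(out_path, dpi=150)
```

The file extension decides the output format, so changing the name to end in .pdf or .svg should give a vector image instead of a PNG.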
After your previous exercises with similar weather station data, you have become a well-known international expert.
You have been requested to produce the following visualisations.
The data can be found here:
https://github.com/oliverclock/pythonviz/blob/master/weather_places.csv
1) Produce an image of all of the temperature or rainfall values (regardless of geographic data)
2) Produce an image that explains the potential relationship between latitude and rainfall
3) Produce an image which shows the countries in their geographic location (using supplied latitude/longitude field)
4) Produce the previous image this time with the plot points coloured or scaled by the temperature attribute
5) Extension - Visualise these values on a world map basemap
6) Extension - Visualise these values as a choropleth map (i.e. a map that shows country shapes rather than just points)
Hints: Reading CSVs - https://docs.python.org/2/library/csv.html
Extracting column from CSV - https://www.raspberrypi.org/learning/astro-pi-flight-data-analysis/graphing/ (read 1 & 2 here)
Styling data - run help(plt)
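As a sketch of the hints above, the cell below reads columns from CSV text with the standard csv module and draws a scatter plot coloured by a third value. The column names (latitude, longitude, temperature) and the three sample rows are invented for illustration and are not taken from weather_places.csv; check the header row of the real file for the actual names, and swap the io.StringIO object for open('weather_places.csv') to read it.

```python
import csv
import io

import matplotlib.pyplot as plt

# Invented sample rows standing in for the real file; column names are assumptions
sample_csv = io.StringIO(
    "place,latitude,longitude,temperature\n"
    "A,-33.9,151.2,18.0\n"
    "B,51.5,-0.1,11.0\n"
    "C,1.3,103.8,27.0\n"
)

latitudes, longitudes, temperatures = [], [], []
for row in csv.DictReader(sample_csv):
    # csv returns every value as a string, so convert before plotting
    latitudes.append(float(row['latitude']))
    longitudes.append(float(row['longitude']))
    temperatures.append(float(row['temperature']))

# Longitude on x and latitude on y places points roughly where they sit on a map;
# c= colours each point by its temperature value
points = plt.scatter(longitudes, latitudes, c=temperatures, cmap='coolwarm')
plt.colorbar(points, label='Temperature')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
```

To scale the points rather than colour them, plt.scatter also accepts an s= argument giving each marker's size.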
Some of these links may be interesting / useful.
http://matplotlib.org/users/pyplot_tutorial.html
http://pandas.pydata.org/pandas-docs/stable/visualization.html
https://www.raspberrypi.org/learning/astro-pi-flight-data-analysis/graphing/
Any questions, please email oliver.lock@arup.com! :-)