From spreadsheets to pandas
In [22]:
Graphical abstract
Out[22]:
ToC
This is a brief tutorial for anyone who is interested in how Python can facilitate their data analyses. The tutorial is aimed at people who currently use a spreadsheet program as their primary data analyses tool, and that have no previous programming experience. If you want to code along, a simple way to install Python is to follow these instructions, but I encourage you to just read through this tutorial on a conceptual level at first.
Spreadsheet software is great for viewing and entering small data sets and creating simple visualizations fast. However, it can be tricky to create publication-ready figures, automate reproducible analysis workflows, perform advanced calculations, and clean data sets robustly. Even when using a spreadsheet program to record data, it is often beneficial to pick up some basic programming skills to facilitate the analyses of that data.
Spreadsheet programs, such as MS Excel and Libre/OpenOffice, have their functionality sectioned into menus. In programming languages, all the functionality is accessed by typing the name of functions directly instead of finding the functions in the menu hierarchy. Initially this might seem intimidating and non-intuitive for people who are used to the menu-driven approach.
However, think of it as learning a new natural language. Initially, you will slowly string together sentences by looking up individual words in the dictionary. As you improve, you will only reference the dictionary occasionally since you already know most of the words. Practicing the language whenever you can, and receiving immediate feedback, is often the fastest way to learn. Sitting at home trying to learn every word in the dictionary before engaging in conversation, is destined to kill the joy of learning any language, natural or formal.
In my experience, learning programming is similar to learning a foreign language, and you will often learn the most from just trying to do something and receiving feedback from the computer! When there is something you can't wrap you head around, or if you are actively trying to find a new way of expressing a thought, then look it up, just as you would with a natural language.
Just like in spreadsheet software, the basic installation of Python includes fundamental math operations, e.g. adding numbers together:
In [1]:
4 + 4
Out[1]:
In [2]:
a = 5
a * 2
Out[2]:
In [3]:
my_variable_name = 4
a - my_variable_name
Out[3]:
Variables can also hold more data types than just numbers, for example a sequence of characters surrounded by single or double quotation marks (called a string). In Python, it is intuitive to append string by adding them together:
In [4]:
b = 'Hello'
c = 'universe'
b + c
Out[4]:
A space can be added to separate the words:
In [5]:
b + ' ' + c
Out[5]:
In [6]:
list_of_things = [1, 55, 'Hi']
list_of_things
Out[6]:
I can index the list to access a specific item. Note that numbering in Python starts at 0, so the third item 'Hi'
, has index 2.
In [7]:
list_of_things[2]
Out[7]:
In [8]:
fruit_colors = {'tangerine':'organge', 'banana':'yellow', 'apple':['green', 'red']}
fruit_colors['banana']
Out[8]:
In [9]:
fruit_colors['apple']
Out[9]:
In [10]:
1 == 1
Out[10]:
In [11]:
1 == 0
Out[11]:
In [12]:
1 > 0
Out[12]:
In [13]:
'hey' == 'Hey'
Out[13]:
In [14]:
'hey' == 'hey'
Out[14]:
In [15]:
a >= 2 * 2
Out[15]:
When we start working with spreadsheet-like data, we will see that these comparisons are really useful to extract subsets of data, for example observations from a certain time period.
To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analyses tools in a third, and so on. Programming languages such as Python have so many tools and functions so that they would not fit in a menu. Instead of clicking File -> Open
and chose the file, you would type something similar to file.open('<filename>')
in a programming language. Don't worry if you forget the exact expression, it is often enough to just type the few first letters and then hit Tab, to show the available options, more on that later.
Since there are so many esoteric tools and functions available in Python, it is unnecessary to include all of them with the basics that are loaded by default when you start the programming language (it would be as if your new phone came with every single app preinstalled). Instead, more advanced functionality is grouped into separate packages, which can be accessed by typing import <package_name>
in Python. You can think of this as that you are telling the program which menu items you want to use (similar to how Excel hides the Developer menu
by default since most people rarely use it and you need activate it in the settings if you want to access its functionality). Some packages needs to be downloaded before they can be used, just like downloading an addon to a browser or mobile phone.
Just like in spreadsheet software menus, there are lots of different tools within each Python package. For example, if I want to use numerical Python functions, I can import the numerical python module, numpy
. I can then access any function by writing numpy.<function_name>
.
In [16]:
import numpy
numpy.mean([1, 2, 3, 4, 5])
Out[16]:
Once you start out using Python, you don't know what functions are availble within each package. Luckily, in the Jupyter Notebook, you can type numpy.
Tab (that is numpy + period + tab-key) and a small menu will pop up that shows you all the available functions in that module. This is analogous to clicking a 'numpy-menu' and then going through the list of functions. As I mentioned earlier, there are plenty of available functions and it can be helpful to filter the menu by typing the initial letters of the function name.
To write this tutorial, I am using a software called Jupyter Notebook, and I took a screenshot of how the tab-completion appears in the notebook. Below I use the Image
function from the IPython.display
package to display this screenshot.
In [17]:
from IPython.display import Image
Image('./tab-completion.png')
Out[17]:
To get more info on the function you want to use, you can type out the full name and then press Shift + Tab once to bring up a help dialogue and again to expand that dialogue. Below I am displaying a screenshot for how this looks in the numpy.mean
function. We can see that to use this function, we need to supply it with the argument a
, which should be 'array-like'. An array is essentially just a sequence of numbers. In our previous example, we numbers enclosed in brackets []
, which in Python means that these numbers are in a list, which is a sequence of number just like an array.
We could also have defined a
as a tuple: (1, 2, 3, 4, 5)
, or the specific numpy array format: numpy.array([1, 2, 3, 4, 5])
. For now, it is not important to know all the differences between these formats, and I am mainly including them so you get a better understanding of what it means for something to be 'array-like'.
In [18]:
Image('./shift-tab-help.png')
Out[18]:
If you need a more extensive help dialog, you can click Shift + Tab four times or just type ?numpy.mean
.
When you start getting familiar with typing function names, you will notice that this is often faster than looking for functions in menus. It is similar to getting fluent in a language. I know what English words I want to type right now, and it is much easier for me to type them out, than to select each one from a menu. However, sometimes I forget and it is useful to get some hints as described above.
It is common to give packages nicknames, so that it is faster to type. This is not necessary, but can save some work in long files and make code less verbose so that it is easier to read:
In [19]:
import numpy as np
np.std([1, 2, 3, 4, 5])
Out[19]:
The Python package that is most commonly used to work with spreadsheet-like data is called pandas
, the name is derived from "panel data", an econometrics term for multidimensional structured data sets. Data are easily loaded into pandas from .csv
or other spreadsheet formats. The format pandas uses to store this data is called a data frame.
I do not have any good data set lying around, so I will load a public data set from the web (you can view the data by pasting the url into your browser). I store my data frame in a variable called iris
, that I can use to access the data from now on.
In [20]:
import pandas as pd
pd.options.display.max_rows = 10 # Shortens the output from cells
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
To have a quick peak at the data, I can type iris.head(<number_of_rows>)
. Notice that I do not have to use the pd.
-syntax for this. iris
is now a pandas data frame, and all pandas data frames have a number of built-in functions (called methods) that can be appended directly to the data frame instead of by calling pandas
separately.
In [21]:
iris.head() # The default is to show 5 rows
Out[21]:
As you can see, the data frame row numbering starts at 0, which might be confusing at first, but a standard in many programming languages and you get used to it quickly. I can easily find out how many rows and columns my data set has:
In [22]:
iris.shape
Out[22]:
And the name of those five columns:
In [23]:
iris.columns
Out[23]:
To select a column we can index the data frame with the column name.
In [24]:
iris['sepal_length']
Out[24]:
The output is here rendered slightly differently from before, because when we are looking only at one column, it is no longer a data frame, but a series. The differences are not important for this lecture, so this is all you need to know about that for now.
We could now create a new column if we wanted:
In [25]:
iris['sepal_length_x2'] = iris['sepal_length'] * 2
iris['sepal_length_x2']
Out[25]:
And delete that column again:
In [26]:
iris = iris.drop('sepal_length_x2', axis=1)
There are some built-in methods that make it convenient to calculate common operation on data frame columns.
In [27]:
iris['sepal_length'].mean()
Out[27]:
In [28]:
iris['sepal_length'].median()
Out[28]:
It is also possible to use these methods on all columns at the same time without having to type the same thing over and over again.
In [29]:
iris.mean()
Out[29]:
Similarly, you can get a statistical summary of the data frame:
In [30]:
iris.describe()
Out[30]:
A common task is to subset the data into only those observations that match a criteria. For example, we might be interest in only studying one specific species. First let's find out how many different species there are in our data set:
In [31]:
iris['species'].unique()
Out[31]:
Let's arbitrarily choose setosa as the one to study! To select only observations from this species in the original data frame, we index the data frame with a comparison:
In [32]:
iris['species'] == 'setosa'
Out[32]:
In [33]:
iris[iris['species'] == 'setosa']
Out[33]:
Now we can easily perform computation on this subset of the data:
In [34]:
iris[iris['species'] == 'setosa'].mean()
Out[34]:
In [54]:
# The below calculates the mean for each row, but that is not very valuable on this data set
# iris[iris['species'] == 'setosa'].mean(axis=1)
We could also compare all groups within the data against each other, by using the split-apply-combine workflow. This splits data into groups, applies an operation on each group, and then combines the results into a table for display.
In pandas, we split into groups with the group_by
command and then we apply an operation to the grouped data frame, e.g. .mean()
.
In [35]:
iris.groupby('species').mean()
Out[35]:
We can also easily count the number of observations in each group:
In [36]:
iris.groupby('species').size()
Out[36]:
Pandas interfaces with one of Python's most powerful data visualization libraries, matplotlib
, to enable simple visualizations at minimal effort.
In [37]:
# Prevent plots from popping up in a new window
%matplotlib inline
species_comparison = iris.groupby('species').mean() # Assign to a variable
species_comparison.plot(kind='bar')
Out[37]:
Depending on what you are interesting in showing, it could be useful to have the species as the different colors and the columns along the x-axis. We can easily achieve this by transposing (.T
) our data frame.
In [38]:
species_comparison.T.plot(kind='bar')
Out[38]:
In [39]:
import seaborn as sns
sns.swarmplot('species', 'sepal_length', data = iris)
Out[39]:
That looks a bit small, so let's change the style we are using for plotting.
In [40]:
sns.set(style='ticks', context='talk', rc={'figure.figsize':(8, 5)}) # This applies to all subseque
sns.swarmplot('species', 'sepal_length', data=iris)
sns.despine() # Remove the right and top axes lines
We can use the same syntax to create many of the common plots in seaborn
.
In [41]:
sns.barplot('species', 'sepal_length', data=iris)
sns.despine() # I might set this in sns.set() to not have to type it all the time...
Bar charts are a common, but not very useful way of presenting data aggregations (e.g. the mean). A better way is to use the points as we did above, or a plot that capture the distribution of the data, such as a boxplot, or a violin plot:
In [42]:
sns.violinplot('species', 'sepal_length', data = iris)
sns.despine()
We can also combine two plots, by simply adding the two line after each other. There is also a more advanced figure interface available in matplotlib
to explicitly indicate which figure and axes you want the plot to appear in, but this is outside the scope of this tutorial (more info here and here).
In [43]:
sns.violinplot('species', 'sepal_length', data=iris, inner=None)
sns.swarmplot('species', 'sepal_length', data=iris, color='black', size=4)
sns.despine()
Instead of plotting one categorical variable vs a numerical variable, we can also plot two numerical values against each other to explore potential correlations between these two variables:
In [44]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, size=6)
Out[44]:
There is a regression line plotted by default to indicate the trend in the data. Let's turn that off for now and look at only the data points.
In [45]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, fit_reg=False, size=6)
Out[45]:
There appears to be some structure in this data. At least two clusters of points seem to be present. Let's color according to species and see if that explains what we see.
In [46]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, hue='species', fit_reg=False, size=6)
Out[46]:
Now we can add back the regression line, but this time one for each group.
In [47]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, hue='species', fit_reg=True, size=6)
Out[47]:
Instead of creating a plot for each variable against each other, we can easily create a grid of subplots for all variables with a single command:
In [48]:
sns.pairplot(iris, hue="species", size=3.5)
Out[48]:
Many visualizations are easier to create if we first reshape our data frame into the tidy format, which is what seaborn
prefers. This is also referred to as changing the data frame format from wide (many columns) to long (many rows), since it moves information from columns to rows:
In [49]:
Image('./molten-data.png') # Image from http://vita.had.co.nz/papers/tidy-data.pdf
Out[49]:
We can use pandas built-in melt
-function to "melt" the wide data frame into the long format. The new columns will be given the names variable
and value
by default (see the help of melt
if you would like to change these names).
In [50]:
iris_long = pd.melt(iris, id_vars='species')
iris_long
Out[50]:
We do not need to call groupby
or mean
on the long iris
data frame when plotting with seaborn
. Instead we control these options from seaborn
with the plot type we chose (barplot = mean automatically) and the hue
-parameter, which is analogous to groupby
.
In [51]:
sns.set(context='poster', style='white', rc={'figure.figsize':(10, 6)})
sns.swarmplot(x='variable', y='value', hue='species', data=iris_long, dodge=True, palette='Set2', size=4)
sns.despine() # removes the right and top black border
In [52]:
sns.set(context='poster', style='darkgrid', rc={'figure.figsize':(12, 6)})
sns.boxplot(y='variable', x='value', hue='species', data=iris_long, color='c', )
sns.stripplot(y='variable', x='value', hue='species', data=iris_long, size=2.5, palette=['k']*3, jitter=True, dodge=True)
sns.despine(trim=True)
It is also possible to get too fancy and accidentally hide important messages in the data. However, the fact that you now have access to several ways to plot your data forces you to consider what is actually important and how you can best communicate that message, rather than always making the same plot without considering its strengths and weaknesses.
The documentation for these packages are great resources to learn more.
pandas
pandas
usersseaborn
galleryseaborn
tutorialsmatplotlib
gallery is another great resource. We did not practice using matplotlib
explicitly in this notebook, but the plotting techniques from pandas
and seaborn
are layers on top of the matplotlib
library, so we have been using it indirectly.