This notebook combines two notebooks (with minor modifications by Amanda Easson) from previous UofT Coders sessions:
Intro Python (authors: Madeleine Bonsma-Fisher, heavily borrowing from Lina Tran and Charles Zhu): https://github.com/UofTCoders/studyGroup/blob/gh-pages/lessons/python/intro/IntroPython-MB.ipynb
Pandas (authors: Joel Ostblom and Luke Johnston): https://github.com/UofTCoders/studyGroup/blob/gh-pages/lessons/python/intro-data-analysis/from-spreadsheets-to-pandas-extended.ipynb
Additional numpy content was added by Amanda Easson, primarily based on the Python Data Science Handbook by Jake VanderPlas: https://jakevdp.github.io/PythonDataScienceHandbook/
This is a brief tutorial for anyone who is interested in how Python can facilitate their data analyses. The tutorial is aimed at people who currently use a spreadsheet program as their primary data analysis tool and who have no previous programming experience. If you want to code along, a simple way to install Python is to follow these instructions, but I encourage you to just read through this tutorial on a conceptual level at first.
Spreadsheet software is great for viewing and entering small data sets and creating simple visualizations fast. However, it can be tricky to create publication-ready figures, automate reproducible analysis workflows, perform advanced calculations, and clean data sets robustly. Even when using a spreadsheet program to record data, it is often beneficial to pick up some basic programming skills to facilitate the analysis of that data.
Spreadsheet programs, such as MS Excel and Libre/OpenOffice, have their functionality sectioned into menus. In programming languages, all the functionality is accessed by typing the name of functions directly instead of finding the functions in the menu hierarchy. Initially this might seem intimidating and non-intuitive for people who are used to the menu-driven approach.
However, think of it as learning a new natural language. Initially, you will slowly string together sentences by looking up individual words in the dictionary. As you improve, you will only reference the dictionary occasionally since you already know most of the words. Practicing the language whenever you can, and receiving immediate feedback, is often the fastest way to learn. Sitting at home trying to learn every word in the dictionary before engaging in conversation is destined to kill the joy of learning any language, natural or formal.
In my experience, learning programming is similar to learning a foreign language, and you will often learn the most from just trying to do something and receiving feedback from the computer! When there is something you can't wrap your head around, or if you are actively trying to find a new way of expressing a thought, then look it up, just as you would with a natural language.
Just like in spreadsheet software, the basic installation of Python includes fundamental math operations, e.g. adding numbers together:
In [ ]:
4 + 4
In [ ]:
4**2 # 4 to the power of 2
In [ ]:
3*5; # semi-colon suppresses output
In [ ]:
a = 5
a * 2
In [ ]:
my_variable_name = 4
a - my_variable_name
Variables can also hold more data types than just numbers, for example a sequence of characters surrounded by single or double quotation marks (called a string). In Python, it is intuitive to append strings by adding them together:
In [ ]:
b = 'Hello'
c = 'universe'
b + c
A space can be added to separate the words:
In [ ]:
b + ' ' + c
In [ ]:
print(b)
print(type(b))
In [ ]:
list_of_things = [1, 55, 'Hi', ['apple', 'orange', 'banana']]
list_of_things
In [ ]:
list_of_things.append('Toronto')
list_of_things
In [ ]:
list_of_things.remove(55)
list_of_things
In [ ]:
print(list_of_things + list_of_things)
A "tuple" is an immutable list (nothing can be added or removed) whose elements also can't be reassigned.
In [ ]:
tup1 = (1,5,2,1)
print(tup1)
tup1[0] = 3  # raises a TypeError: tuples do not support item assignment
In [ ]:
fruit_colors = {'tangerine':'orange', 'banana':'yellow', 'apple':['green', 'red']}
fruit_colors['banana']
In [ ]:
fruit_colors['apple']
list(fruit_colors.keys())
In [ ]:
participant_info = {'ID': 50964, 'age': 15.25, 'sex': 'F', 'IQ': 105, 'medications': None, 'questionnaires': ['Vineland', 'CBCL', 'SRS']}
In [ ]:
print(participant_info['ID'])
print(participant_info['questionnaires'])
In [ ]:
1 == 1
In [ ]:
1 == 0
In [ ]:
1 > 0
In [ ]:
'hey' == 'Hey'
In [ ]:
'hey' == 'hey'
In [ ]:
a >= 2 * 2 # we defined a = 5 above
In [ ]:
#indexing in Python starts at 0, not 1 (like in Matlab or Oracle)
fruits = ['apples', 'oranges', 'bananas']
print(fruits[0])
In [ ]:
print(fruits[1])
In [ ]:
# strings are sequences, so they can be indexed just like lists
s = 'This_is_a_string.'
In [ ]:
print(s[0])
In [ ]:
# use -1 to get the last element
print(fruits[-1])
In [ ]:
print(fruits[-2])
In [ ]:
# to get a slice of the string, use the : symbol
# s[0:x] returns the characters up to, but not including, index x
print(s[0:4])
In [ ]:
print(s[:4])
In [ ]:
s = 'This_is_a_string.'
print(s[5:7])
In [ ]:
print(s[7:])
print(s[7:len(s)])
In [ ]:
s2 = [19034, 23]
# You will always need to start with an 'if' line
# You do not need the elif or else statements
# You can have as many elif statements as needed
if type(s2) == str:
    print('s2 is a string')
elif type(s2) == int:
    print('s2 is an integer')
elif type(s2) == float:
    print('s2 is a float')
else:
    print('s2 is not a string, integer or float')
In [ ]:
nums = [23, 56, 1, 10, 15, 0]
In [ ]:
# in this case, 'n' is a dummy variable that will be used by the for loop
# you do not need to assign it ahead of time
for n in nums:
    if n % 2 == 0:
        print('even')
    else:
        print('odd')
In [ ]:
# for loops can iterate over strings as well
vowels = 'aeiou'
for vowel in vowels:
    print(vowel)
In [ ]:
my_colours = ['pink', 'purple', 'blue', 'green', 'orange']
my_light_colours = ['light ' + colour for colour in my_colours]
print(my_light_colours)
In [ ]:
# always use descriptive naming for functions, variables, arguments etc.
def sum_of_squares(num1, num2):
    """
    Input: two numbers
    Output: the sum of the squares of the two numbers
    """
    ss = num1**2 + num2**2
    return ss
# The stuff inside """ """ is called the "docstring". It can be accessed by typing help(sum_of_squares)
In [ ]:
help(sum_of_squares)
In [ ]:
print(sum_of_squares(4,2))
In [ ]:
# the return statement in a function allows us to store the output of a function call in a variable for later use
ss1 = sum_of_squares(5,5)
In [ ]:
print(ss1)
When we start working with spreadsheet-like data, we will see that these comparisons are really useful to extract subsets of data, for example observations from a certain time period.
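As a small preview, a comparison can act as a filter. The sketch below uses a made-up list of years (plain Python, no pandas yet) to keep only observations from a certain time period:

```python
# Made-up illustrative data: observation years
years = [1995, 1999, 2003, 2008, 2012]

# The comparison y >= 2000 is True or False for each year;
# the list comprehension keeps only the years where it is True
recent = [y for y in years if y >= 2000]
print(recent)  # → [2003, 2008, 2012]
```

Later in this notebook, the same idea is used to index pandas data frames with a boolean comparison.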
To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open
and choosing the file, you would type something similar to file.open('<filename>')
in a programming language. Don't worry if you forget the exact expression; it is often enough to just type the first few letters and then hit Tab to show the available options. More on that later.
Since there are so many esoteric tools and functions available in Python, it is unnecessary to include all of them with the basics that are loaded by default when you start the programming language (it would be as if your new phone came with every single app preinstalled). Instead, more advanced functionality is grouped into separate packages, which can be accessed by typing import <package_name>
in Python. You can think of this as telling the program which menu items you want to use (similar to how Excel hides the Developer menu
by default since most people rarely use it, and you need to activate it in the settings if you want to access its functionality). Some packages need to be downloaded before they can be used, just like downloading an addon to a browser or mobile phone.
Just like in spreadsheet software menus, there are lots of different tools within each Python package. For example, if I want to use numerical Python functions, I can import the numerical python module, numpy
. I can then access any function by writing numpy.<function_name>
.
If the package name is long, it is common to import the package "as" another name, like a nickname. For instance, numpy
is often imported as np
. All you have to do is type import numpy as np
. This makes it faster to type and also makes the code a bit easier to read.
In [ ]:
import numpy as np
Once you start out using Python, you don't know what functions are available within each package. Luckily, in the Jupyter Notebook, you can type numpy.
Tab (that is numpy + period + tab-key) and a small menu will pop up that shows you all the available functions in that module. This is analogous to clicking a 'numpy-menu' and then going through the list of functions. As I mentioned earlier, there are plenty of available functions, and it can be helpful to filter the menu by typing the initial letters of the function name.
To get more info on the function you want to use, you can type out the full name and then press Shift + Tab once to bring up a help dialogue, and again to expand that dialogue.
If you need a more extensive help dialogue, you can press Shift + Tab four times or just type ?
after the function name.
In [ ]:
# use a package by importing it, you can also give it a "nickname", in this case 'np'
import numpy as np
np.mean?
In [ ]:
array = np.arange(15)
lst = list(range(15))
In [ ]:
print(array)
print(lst)
In [ ]:
print(type(array))
print(type(lst))
In [ ]:
# numpy arrays allow for vectorized calculations
print(array*2)
print(lst*2)
In [ ]:
array = array.reshape([5,3])
print(array)
In [ ]:
# for each row, take the mean across the 3 columns (using axis=1)
array.mean(axis=1)
In [ ]:
# max value in each column
array.max(axis=0)
In [ ]:
list2array = np.array(lst)
type(list2array)
list2array
In [ ]:
array2d = np.array([range(i, i + 3) for i in [2, 4, 6]])
array2d
In [ ]:
my_zeros = np.zeros((10,1), dtype=int)
my_zeros
In [ ]:
np.ones((5,2), dtype=float)
In [ ]:
np.full((3,3), np.pi)
In [ ]:
# Create an array filled with a linear sequence: start, stop, step
# (goes up to, but not including, the stop value)
# similar to "range"
print(np.arange(0, 20, 2))
print(list(range(0, 20, 2)))
In [ ]:
# linspace: evenly spaced values: start, stop, number of values
# (includes the stop value)
print(np.linspace(0, 20.2, 11))
In [ ]:
# uniformly distributed random numbers between 0-1
np.random.random((3,3))
In [ ]:
# normal distribution
# mean, standard deviation, array size
np.random.seed(0)
my_array = np.random.normal(0, 1, (10000,1))
print(my_array.mean())
print(my_array.std())
In [ ]:
iq = np.random.normal(100, 15, (30,1))
print(np.mean(iq))
print(np.median(iq))
print(np.max(iq))
print(np.min(iq))
print('\n')
print(iq.mean())
print(iq.max())
# note: NumPy arrays have no .median() method, so iq.median() raises an
# AttributeError; use np.median(iq) instead
print(np.median(iq))
In [ ]:
x = np.random.normal(100,15,(3,4,5))
print("# dimensions:", x.ndim)
print("Shape:", x.shape)
print("Size:", x.size)
In [ ]:
# indexing
x[0,0,:]
In [ ]:
# change elements
x[0,0,2] = 108
x[0,0,2]
In [ ]:
x = np.arange(10)
print(x)
print("First five elements:", x[:5]) # first five elements
print("elements after index 5:", x[5:]) # elements after index 5
print("every fourth element:", x[::4]) # every fourth element
print("every other element, starting at index 3:", x[3::2]) # every other element, starting at index 3
print("all elements, reversed:",x[::-1]) # all elements, reversed
In [ ]:
# plain assignment does not copy an array: x2 refers to the same data as x
print(x)
x2 = x
x2[0] = 100
print(x2)
print(x)
In [ ]:
x3 = np.arange(10)
x4 = x3.copy()
x4[0] = 100
print(x4)
print(x3)
In [ ]:
# concatenate arrays
x1 = np.array([1,2,3])
x2 = np.array([4,5,6])
x3 = np.concatenate([x1,x2])
print(x3)
In [ ]:
# or stack vertically or horizontally
# make sure dimensions agree
x4 = np.vstack([x1,x2])
print(x4)
x5 = np.hstack([x1,x2])
print(x5)
In [ ]:
# numpy arithmetic
x = np.arange(5)
print(x)
print(x+5)
print(x*2)
print(x/2)
print(x//2) # "floor" division, i.e. round down to nearest integer
# alternatively:
print('\n')
print(np.add(x,5))
print(np.multiply(x,2))
print(np.divide(x,2))
print(np.floor_divide(x,2))
print(np.power(x,3))
print(np.mod(x,2))
In [ ]:
# add all elements of x
print(np.add.reduce(x))
print(np.sum(x))
# add all elements of x cumulatively
print(np.add.accumulate(x))
In [ ]:
# broadcasting
a = np.eye(5)
b = np.ones((5, 1))  # b is broadcast across the columns of a when added
print(a)
print(b)
In [ ]:
print(a+b)
In [ ]:
# broadcasting: practical example: mean-centering data
data = np.random.normal(100, 15, (30, 6))
print(data.shape)
data_mean = data.mean(axis=0)
print(data_mean.shape)
data_mean_centered = data - data_mean
data_mean_centered.mean(axis=0)
The Python package that is most commonly used to work with spreadsheet-like data is called pandas
, the name is derived from "panel data", an econometrics term for multidimensional structured data sets. Data are easily loaded into pandas from .csv
or other spreadsheet formats. The format pandas uses to store this data is called a data frame.
To have a quick peek at the data, I can type df.head(<number_of_rows>)
(where "df" is the name of the dataframe). Notice that I do not have to use the pd.
-syntax for this. iris
is now a pandas data frame, and all pandas data frames have a number of built-in functions (called methods) that can be appended directly to the data frame instead of by calling pandas
separately.
In [ ]:
import pandas as pd
In [ ]:
# this will read in a csv file into a pandas DataFrame
# this csv has data of country spending on healthcare
#data = pd.read_csv('health.csv', header=0, index_col=0, encoding="ISO-8859-1")
# load several datasets
data = pd.read_csv('https://tinyurl.com/uoftcode-health', header=0, index_col=0,
encoding="ISO-8859-1")
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
In [ ]:
# the .head() function will allow us to look at the first few lines of the dataframe
data.head(10) # default is 5 rows
In [ ]:
# by default, rows are indicated first, followed by the column: [row, column]
data.loc['Canada', '2008']
In [ ]:
# you can also slice a dataframe
data.loc['Canada':'Chile', '1999':'2001']
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
In [ ]:
# the .plot() function will create a simple graph for you to quickly visualize your data
data.loc['Denmark'].plot()
data.loc['Canada'].plot()
data.loc['India'].plot()
plt.legend()
plt.show()
In [ ]:
iris.head()
In [ ]:
iris.shape # rows, columns
In [ ]:
iris.columns # names of columns
To select a column we can index the data frame with the column name.
In [ ]:
iris['sepal_length']
The output here is rendered slightly differently from before, because when we are looking at only one column, it is no longer a data frame but a *series*. The differences are not important for this lecture, so this is all you need to know about that for now.
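To see the difference without loading any data, here is a minimal sketch with a small made-up data frame: indexing with a single column name gives a Series, while indexing with a list of column names gives back a DataFrame.

```python
import pandas as pd

# A tiny made-up data frame for illustration
df = pd.DataFrame({'sepal_length': [5.1, 4.9], 'species': ['setosa', 'setosa']})

print(type(df['sepal_length']))    # a single column is a pandas Series
print(type(df[['sepal_length']]))  # a list of columns stays a DataFrame
```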
We could now create a new column if we wanted:
In [ ]:
iris['sepal_length_x2'] = iris['sepal_length'] * 2
iris['sepal_length_x2']
And delete that column again:
In [ ]:
iris = iris.drop('sepal_length_x2', axis=1) # axis 0 = index, axis 1 = columns
There are some built-in methods that make it convenient to calculate common operations on data frame columns.
In [ ]:
iris['sepal_length'].mean()
In [ ]:
iris['sepal_length'].median()
It is also possible to use these methods on all columns at the same time without having to type the same thing over and over again.
In [ ]:
iris.mean()
Similarly, you can get a statistical summary of the data frame:
In [ ]:
iris.describe()
A common task is to subset the data into only those observations that match a criterion. For example, we might be interested in only studying one specific species. First let's find out how many different species there are in our data set:
In [ ]:
iris['species'].unique()
Let's arbitrarily choose setosa as the one to study! To select only observations from this species in the original data frame, we index the data frame with a comparison:
In [ ]:
iris['species'] == 'setosa'
In [ ]:
iris[iris['species'] == 'setosa']
Now we can easily perform computation on this subset of the data:
In [ ]:
iris[iris['species'] == 'setosa'].mean(axis=0)
We could also compare all groups within the data against each other, by using the split-apply-combine workflow. This splits data into groups, applies an operation on each group, and then combines the results into a table for display.
In pandas, we split into groups with the groupby
command and then we apply an operation to the grouped data frame, e.g. .mean()
.
In [ ]:
iris.groupby('species').mean()
We can also easily count the number of observations in each group:
In [ ]:
iris.groupby('species').size()
Pandas interfaces with one of Python's most powerful data visualization libraries, matplotlib
, to enable simple visualizations at minimal effort.
In [ ]:
# Prevent plots from popping up in a new window
%matplotlib inline
species_comparison = iris.groupby('species').mean() # Assign to a variable
species_comparison.plot(kind='bar')
Depending on what you are interested in showing, it could be useful to have the species as the different colors and the columns along the x-axis. We can easily achieve this by transposing (.T
) our data frame.
In [ ]:
species_comparison.T.plot(kind='bar')
In [ ]:
import seaborn as sns
sns.swarmplot('species', 'sepal_length', data = iris)
The labels on this plot look a bit small, so let's change the style we are using for plotting.
In [ ]:
sns.set(style='ticks', context='talk', rc={'figure.figsize':(8, 5),'axes.spines.right':False, 'axes.spines.top':False}) # This applies to all subsequent plots
# styles: darkgrid, whitegrid, dark, white, and ticks
# contexts: paper, notebook, talk, poster
#sns.axes_style()
In [ ]:
sns.swarmplot('species', 'sepal_length', data=iris)
We can use the same syntax to create many of the common plots in seaborn
.
In [ ]:
sns.barplot('species', 'sepal_length', data=iris)
Bar charts are a common but not very useful way of presenting data aggregations (e.g. the mean). A better way is to use the points as we did above, or a plot that captures the distribution of the data, such as a boxplot or a violin plot:
In [ ]:
sns.violinplot('species', 'sepal_length', data = iris)
We can also combine two plots by simply adding the two lines after each other. There is also a more advanced figure interface available in matplotlib
to explicitly indicate which figure and axes you want the plot to appear in, but this is outside the scope of this tutorial (more info here and here).
In [ ]:
sns.violinplot('species', 'sepal_length', data=iris, inner=None)
sns.swarmplot('species', 'sepal_length', data=iris, color='black', size=4)
Instead of plotting one categorical variable vs a numerical variable, we can also plot two numerical values against each other to explore potential correlations between these two variables:
In [ ]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, size=6)
There is a regression line plotted by default to indicate the trend in the data. Let's turn that off for now and look at only the data points.
In [ ]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, fit_reg=False, size=6)
There appears to be some structure in this data. At least two clusters of points seem to be present. Let's color according to species and see if that explains what we see.
In [ ]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, hue='species', fit_reg=False, size=6)
Now we can add back the regression line, but this time one for each group.
In [ ]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, hue='species', fit_reg=True, size=6)
Instead of creating a plot for each variable against each other, we can easily create a grid of subplots for all variables with a single command:
In [ ]:
sns.pairplot(iris, hue="species", size=3.5)
Many visualizations are easier to create if we first reshape our data frame into the tidy format, which is what seaborn
prefers. This is also referred to as changing the data frame format from wide (many columns) to long (many rows), since it moves information from columns to rows:
We can use pandas built-in melt
-function to "melt" the wide data frame into the long format. The new columns will be given the names variable
and value
by default (see the help of melt
if you would like to change these names).
In [ ]:
iris_long = pd.melt(iris, id_vars = 'species')
iris_long
We do not need to call groupby
or mean
on the long iris
data frame when plotting with seaborn
. Instead we control these options from seaborn
with the plot type we chose (barplot = mean automatically) and the hue
-parameter, which is analogous to groupby
.
In [ ]:
sns.set(context='poster', style='white', rc={'figure.figsize':(10, 6), 'axes.spines.right':False, 'axes.spines.top':False})
sns.swarmplot(x='variable', y='value', hue = 'species', data=iris_long, dodge=True, palette='Set2', size=4)
In [ ]:
sns.set(context='poster', style='darkgrid', rc={'figure.figsize':(12, 6)})
# stripplot: scatterplot where one variable is categorical
sns.boxplot(y='variable', x='value', hue='species', data=iris_long, color='c')
sns.stripplot(y='variable', x='value', hue='species', data=iris_long, size=2.5, palette=['k']*3, jitter=True, dodge=True)
plt.xlim([0, 10])
It is also possible to get too fancy and accidentally hide important messages in the data. However, the fact that you now have access to several ways to plot your data forces you to consider what is actually important and how you can best communicate that message, rather than always making the same plot without considering its strengths and weaknesses.
The documentation for these packages is a great resource to learn more: the pandas documentation (including its guide for new users), the seaborn gallery, and the seaborn tutorials. The matplotlib gallery is another great resource. We did not practice using matplotlib explicitly in this notebook, but the plotting techniques from pandas and seaborn are layers on top of the matplotlib library, so we have been using it indirectly.