Data analysis in Python

Contributors:

This notebook combines two notebooks (with minor modifications by Amanda Easson) from previous UofT Coders sessions:

Intro Python (authors: Madeleine Bonsma-Fisher, heavily borrowing from Lina Tran and Charles Zhu): https://github.com/UofTCoders/studyGroup/blob/gh-pages/lessons/python/intro/IntroPython-MB.ipynb

Pandas (authors: Joel Ostblom and Luke Johnston): https://github.com/UofTCoders/studyGroup/blob/gh-pages/lessons/python/intro-data-analysis/from-spreadsheets-to-pandas-extended.ipynb

Additional numpy content was added by Amanda Easson, primarily based off of the Python Data Science Handbook by Jake VanderPlas: https://jakevdp.github.io/PythonDataScienceHandbook/

Preamble

This is a brief tutorial for anyone who is interested in how Python can facilitate their data analyses. The tutorial is aimed at people who currently use a spreadsheet program as their primary data analysis tool and who have no previous programming experience. If you want to code along, a simple way to install Python is to follow these instructions, but I encourage you to just read through this tutorial on a conceptual level at first.

Motivation

Spreadsheet software is great for viewing and entering small data sets and creating simple visualizations fast. However, it can be tricky to create publication-ready figures, automate reproducible analysis workflows, perform advanced calculations, and clean data sets robustly. Even when using a spreadsheet program to record data, it is often beneficial to pick up some basic programming skills to facilitate the analyses of that data.

Conceptual understanding

Spreadsheet programs, such as MS Excel and Libre/OpenOffice, have their functionality sectioned into menus. In programming languages, all the functionality is accessed by typing the name of functions directly instead of finding the functions in the menu hierarchy. Initially this might seem intimidating and non-intuitive for people who are used to the menu-driven approach.

However, think of it as learning a new natural language. Initially, you will slowly string together sentences by looking up individual words in the dictionary. As you improve, you will only reference the dictionary occasionally since you already know most of the words. Practicing the language whenever you can, and receiving immediate feedback, is often the fastest way to learn. Sitting at home trying to learn every word in the dictionary before engaging in conversation is destined to kill the joy of learning any language, natural or formal.

In my experience, learning programming is similar to learning a foreign language, and you will often learn the most from just trying to do something and receiving feedback from the computer! When there is something you can't wrap your head around, or if you are actively trying to find a new way of expressing a thought, then look it up, just as you would with a natural language.

Programming basics

Just like in spreadsheet software, the basic installation of Python includes fundamental math operations, e.g. adding numbers together:


In [ ]:
4 + 4

In [ ]:
4**2 # 4 to the power of 2

In [ ]:
3*5; # semi-colon suppresses output

Variable assignment

It is possible to assign values to variables:


In [ ]:
a = 5
a * 2

In [ ]:
my_variable_name = 4
a - my_variable_name

Variables can also hold more data types than just numbers, for example a sequence of characters surrounded by single or double quotation marks (called a string). In Python, strings can be joined (concatenated) by adding them together:


In [ ]:
b = 'Hello'
c = 'universe'
b + c

A space can be added to separate the words:


In [ ]:
b + ' ' + c

In [ ]:
print(b)
print(type(b))

Assigning multiple values to variables

Lists

Variables can also store more than one value, for example in a list of values:


In [ ]:
list_of_things = [1, 55, 'Hi', ['apple', 'orange', 'banana']]
list_of_things

In [ ]:
list_of_things.append('Toronto')
list_of_things

In [ ]:
list_of_things.remove(55)
list_of_things

In [ ]:
print(list_of_things + list_of_things)

A "tuple" is an immutable list (nothing can be added or subtracted) whose elements also can't be reassigned.


In [ ]:
tup1 = (1,5,2,1)
print(tup1)
tup1[0] = 3 # raises a TypeError because tuples are immutable

Dictionaries

In a dictionary, values are paired with names, called keys. Values are accessed by their key name rather than by a position number. (Since Python 3.7, dictionaries remember the order in which keys were inserted, but they still cannot be indexed by position.)


In [ ]:
fruit_colors = {'tangerine':'orange', 'banana':'yellow', 'apple':['green', 'red']}
fruit_colors['banana']

In [ ]:
fruit_colors['apple']
list(fruit_colors.keys())

In [ ]:
participant_info = {'ID': 50964, 'age': 15.25, 'sex': 'F', 'IQ': 105, 'medications': None, 'questionnaires': ['Vineland', 'CBCL', 'SRS']}

In [ ]:
print(participant_info['ID']) 
print(participant_info['questionnaires'])
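
As a small sketch of how a dictionary can be changed and looped over (the dictionary fruit_colors_demo and the values added to it are made up for illustration):

```python
# Hypothetical example dictionary, modelled on fruit_colors above
fruit_colors_demo = {'tangerine': 'orange', 'banana': 'yellow'}

# Assigning to a key adds a new pair or overwrites an existing one
fruit_colors_demo['cherry'] = 'red'
fruit_colors_demo['banana'] = 'green'

# Loop over keys and values together
for fruit, colour in fruit_colors_demo.items():
    print(fruit, 'is', colour)

# Check whether a key exists before using it
print('apple' in fruit_colors_demo)
```

Looking up a key that does not exist raises a KeyError, so the `in` check is a simple way to avoid that.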

Comparisons

Python can compare values and assess whether the expression is true or false.


In [ ]:
1 == 1

In [ ]:
1 == 0

In [ ]:
1 > 0

In [ ]:
'hey' == 'Hey'

In [ ]:
'hey' == 'hey'

In [ ]:
a >= 2 * 2 # we defined a = 5 above

Indexing and Slicing


In [ ]:
# indexing in Python starts at 0, not at 1 as in Matlab or Oracle
fruits = ['apples', 'oranges', 'bananas']
print(fruits[0])

In [ ]:
print(fruits[1])

In [ ]:
# strings can be indexed like lists: each character has a position
s = 'This_is_a_string.'

In [ ]:
print(s[0])

In [ ]:
# use -1 to get the last element
print(fruits[-1])

In [ ]:
print(fruits[-2])

In [ ]:
# to get a slice of the string use the : symbol
# s[0:x] will print up to the xth element, or the element with index x-1
print(s[0:4])

In [ ]:
print(s[:4])

In [ ]:
s = 'This_is_a_string.'
print(s[5:7])

In [ ]:
print(s[7:])
print(s[7:len(s)])
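
Slices also accept an optional third number, the step. The sketch below uses a copy of the string from above under the made-up name s_demo:

```python
s_demo = 'This_is_a_string.'

# every second character, starting at index 0
every_second = s_demo[::2]
print(every_second)

# the last seven characters (negative indices count from the end)
last_seven = s_demo[-7:]
print(last_seven)

# a step of -1 walks through the string backwards
reversed_s = s_demo[::-1]
print(reversed_s)
```

The same slicing syntax works on lists.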

If Statements


In [ ]:
s2 = [19034, 23]

# You will always need to start with an 'if' line
# You do not need the elif or else statements
# You can have as many elif statements as needed

if type(s2) == str:
    print('s2 is a string')
elif type(s2) == int:
    print('s2 is an integer')
elif type(s2) == float:
    print('s2 is a float')
else:
    print('s2 is not a string, integer or float')
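
Comparing type(...) with == works, but a common alternative is the built-in isinstance, which can also test several types at once. A minimal sketch (the variable names value and description are only for illustration):

```python
value = 2.5

if isinstance(value, str):
    description = 'a string'
elif isinstance(value, (int, float)):   # a tuple of types means "any of these"
    description = 'a number'
else:
    description = 'something else'

print('value is', description)
```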

For Loops


In [ ]:
nums = [23, 56, 1, 10, 15, 0]

In [ ]:
# in this case, 'n' is a dummy variable that will be used by the for loop
# you do not need to assign it ahead of time

for n in nums:
    if n%2 == 0:
        print('even')
    else:
        print('odd')

In [ ]:
# for loops can iterate over strings as well
vowels = 'aeiou'
for vowel in vowels:
    print(vowel)

List comprehensions

Format:

mylist = [altered_thing for thing in list_of_things]


In [ ]:
my_colours = ['pink', 'purple', 'blue', 'green', 'orange']

my_light_colours = ['light ' + colour for colour in my_colours]

print(my_light_colours)
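
A list comprehension can also filter with an if at the end. The sketch below (with made-up variable names) shows a comprehension next to the for loop that does the same thing:

```python
nums_demo = [23, 56, 1, 10, 15, 0]

# keep only the even numbers and square them
even_squares = [n**2 for n in nums_demo if n % 2 == 0]

# the same result written as an ordinary for loop
even_squares_loop = []
for n in nums_demo:
    if n % 2 == 0:
        even_squares_loop.append(n**2)

print(even_squares)
print(even_squares == even_squares_loop)
```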

Functions


In [ ]:
# always use descriptive naming for functions, variables, arguments etc.
def sum_of_squares(num1, num2):
    """
    Input: two numbers
    Output: the sum of the squares of the two numbers
    """
    ss = num1**2 + num2**2
    return ss

# The stuff inside """ """ is called the "docstring". It can be accessed by typing help(sum_of_squares)

In [ ]:
help(sum_of_squares)

In [ ]:
print(sum_of_squares(4,2))

In [ ]:
# the return statement in a function allows us to store the output of a function call in a variable for later use
ss1 = sum_of_squares(5,5)

In [ ]:
print(ss1)

When we start working with spreadsheet-like data, we will see that the comparisons introduced earlier are really useful for extracting subsets of data, for example observations from a certain time period.

Using functions

To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open and choosing the file, you would type something similar to file.open('<filename>') in a programming language. Don't worry if you forget the exact expression: it is often enough to type the first few letters and then hit Tab to show the available options (more on that later).

Packages

Since there are so many esoteric tools and functions available in Python, it is unnecessary to include all of them with the basics that are loaded by default when you start the programming language (it would be as if your new phone came with every single app preinstalled). Instead, more advanced functionality is grouped into separate packages, which can be accessed by typing import <package_name> in Python. You can think of this as telling the program which menu items you want to use (similar to how Excel hides the Developer menu by default since most people rarely use it, and you need to activate it in the settings if you want to access its functionality). Some packages need to be downloaded before they can be used, just like downloading an add-on to a browser or an app to a mobile phone.

Just like in spreadsheet software menus, there are lots of different tools within each Python package. For example, if I want to use numerical Python functions, I can import the numerical python module, numpy. I can then access any function by writing numpy.<function_name>.

If the package name is long, it is common to import the package "as" another name, like a nickname. For instance, numpy is often imported as np. All you have to do is type import numpy as np. This makes it faster to type, saves a bit of work, and also makes the code a bit easier to read.


In [ ]:
import numpy as np
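
To illustrate the two import styles described above, here is a small sketch that calls the same numpy function (sqrt, the square root) through the full package name and through the nickname:

```python
import numpy          # access functions as numpy.<function_name>
import numpy as np    # access the same functions as np.<function_name>

print(numpy.sqrt(16))
print(np.sqrt(16))

# both names refer to the very same package
print(np is numpy)
```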

How to get help

When you start out using Python, you don't know what functions are available within each package. Luckily, in the Jupyter Notebook, you can type numpy.Tab (that is numpy + period + tab-key, or np.Tab if you imported numpy as np) and a small menu will pop up that shows you all the available functions in that module. This is analogous to clicking a 'numpy-menu' and then going through the list of functions. As I mentioned earlier, there are plenty of available functions and it can be helpful to filter the menu by typing the initial letters of the function name.

To get more info on the function you want to use, you can type out the full name and then press Shift + Tab once to bring up a help dialogue and again to expand that dialogue.

If you need a more extensive help dialog, you can press Shift + Tab four times or just type ? after the function name.

numpy: numerical python


In [ ]:
# use a package by importing it, you can also give it a "nickname", in this case 'np'
import numpy as np
np.mean?

In [ ]:
array = np.arange(15)
lst = list(range(15))

In [ ]:
print(array)
print(lst)

In [ ]:
print(type(array))
print(type(lst))

In [ ]:
# numpy arrays allow for vectorized calculations
print(array*2)
print(lst*2)
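
The two lines above behave very differently: multiplying an array by 2 doubles every element, while multiplying a list by 2 repeats the list. A minimal sketch with made-up variable names:

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

doubled_array = values_array * 2                 # element-wise: each value is doubled
repeated_list = values_list * 2                  # the list is repeated twice
doubled_list = [v * 2 for v in values_list]      # doubling a list needs a loop or comprehension

print(doubled_array)
print(repeated_list)
print(doubled_list)
```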

In [ ]:
array = array.reshape([5,3])
print(array)

In [ ]:
# for each row, take the mean across the 3 columns (using axis=1)
array.mean(axis=1)

In [ ]:
# max value in each column
array.max(axis=0)

In [ ]:
list2array = np.array(lst)
type(list2array)
list2array

In [ ]:
array2d = np.array([range(i, i + 3) for i in [2, 4, 6]])
array2d

In [ ]:
my_zeros = np.zeros((10,1), dtype=int)
my_zeros

In [ ]:
np.ones((5,2), dtype=float)

In [ ]:
np.full((3,3), np.pi)

In [ ]:
# Create an array filled with a linear sequence: start, stop, step
# up to, but not including, the stop value
# similar to "range"
print(np.arange(0, 20, 2))
print(list(range(0, 20, 2)))

In [ ]:
# linspace: evenly spaced values: start, stop, number of values
# up to stop value
print(np.linspace(0, 20.2, 11))

In [ ]:
# uniformly distributed random numbers between 0-1
np.random.random((3,3))

In [ ]:
# normal distribution
# mean, standard deviation, array size
np.random.seed(0)
my_array = np.random.normal(0, 1, (10000,1))
print(my_array.mean())
print(my_array.std())

In [ ]:
iq = np.random.normal(100, 15, (30,1))
print(np.mean(iq))
print(np.median(iq))
print(np.max(iq))
print(np.min(iq))

print('\n')

print(iq.mean())
print(iq.max())
print(np.median(iq)) # arrays have no .median() method, so use np.median()

In [ ]:
x = np.random.normal(100,15,(3,4,5))

print("# dimensions:", x.ndim)
print("Shape:", x.shape)
print("Size:", x.size)

In [ ]:
# indexing
x[0,0,:]

In [ ]:
# change elements
x[0,0,2] = 108
x[0,0,2]

In [ ]:
x = np.arange(10)
print(x)
print("First five elements:", x[:5]) # first five elements
print("elements from index 5 onward:", x[5:]) # elements from index 5 onward
print("every other element:", x[::2]) # every other element
print("every other element, starting at index 3:", x[3::2]) # every other element, starting at index 3
print("all elements, reversed:",x[::-1]) # all elements, reversed

In [ ]:
# copying arrays: plain assignment does not copy, x2 is just another name for x
print(x)
x2 = x
x2[0] = 100
print(x2)
print(x)

In [ ]:
x3 = np.arange(10)
x4 = x3.copy()
x4[0] = 100
print(x4)
print(x3)
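
The same caution applies to slices: in numpy a slice is a "view" onto the original data, not a copy. The sketch below (variable names are made up) uses np.shares_memory to check which arrays are linked:

```python
import numpy as np

original = np.arange(5)
alias = original               # a second name for the same array
view = original[1:4]           # a slice is a view onto the same data
independent = original.copy()  # a real copy with its own data

view[0] = 99                   # this also changes original (and alias)

print(original)
print(independent)
print(np.shares_memory(original, view))
print(np.shares_memory(original, independent))
```

If you want to change a subset without touching the original array, call .copy() on the slice first.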

In [ ]:
# concatenate arrays

x1 = np.array([1,2,3])
x2 = np.array([4,5,6])
x3 = np.concatenate([x1,x2])
print(x3)

In [ ]:
# or stack vertically or horizontally
# make sure dimensions agree
x4 = np.vstack([x1,x2])
print(x4)

x5 = np.hstack([x1,x2])
print(x5)

In [ ]:
# numpy arithmetic
x = np.arange(5)
print(x)
print(x+5)
print(x*2)
print(x/2)
print(x//2)      # "floor" division, i.e. round down to nearest integer

# alternatively:
print('\n')
print(np.add(x,5))
print(np.multiply(x,2))
print(np.divide(x,2))
print(np.floor_divide(x,2))
print(np.power(x,3))
print(np.mod(x,2))

In [ ]:
# add all elements of x
print(np.add.reduce(x))
print(np.sum(x))

# add all elements of x cumulatively
print(np.add.accumulate(x))

In [ ]:
# broadcasting
a = np.eye(5)
b = np.ones((5,1)) # shape (5, 1): this column is stretched across the 5 columns of a
print(a)
print(b)

In [ ]:
print(a+b)
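
Broadcasting means that numpy stretches dimensions of length 1 (or missing dimensions) so that the shapes of the two arrays match. A small sketch of which shapes combine, using made-up variable names:

```python
import numpy as np

identity = np.eye(3)                    # shape (3, 3)
column = np.arange(3).reshape((3, 1))   # shape (3, 1)
row = np.arange(3)                      # shape (3,)

# (3, 3) + (3, 1): the column is repeated across the columns
print((identity + column).shape)

# (3, 1) + (3,): both are stretched to (3, 3)
table = column + row
print(table.shape)
print(table)
```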

In [ ]:
# broadcasting: practical example: mean-centering data

data = np.random.normal(100, 15, (30, 6))
print(data.shape)
data_mean = data.mean(axis=0)
print(data_mean.shape)


data_mean_centered = data - data_mean

data_mean_centered.mean(axis=0)
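
The same idea extends to standardizing (z-scoring) each column: subtract the column mean and divide by the column standard deviation. This is a sketch on simulated data; the variable names and the use of np.random.default_rng are my own choices, not part of the original lesson:

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded random number generator
scores = rng.normal(100, 15, (30, 6))     # 30 rows, 6 columns of simulated data

col_mean = scores.mean(axis=0)            # shape (6,)
col_std = scores.std(axis=0)              # shape (6,)

# (30, 6) and (6,) are broadcast together, column by column
z_scores = (scores - col_mean) / col_std

# each column should now have mean ~0 and standard deviation ~1
print(np.allclose(z_scores.mean(axis=0), 0))
print(np.allclose(z_scores.std(axis=0), 1))
```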

pandas ("panel data"): python data analysis library

The Python package that is most commonly used to work with spreadsheet-like data is called pandas. The name is derived from "panel data", an econometrics term for multidimensional structured data sets. Data are easily loaded into pandas from .csv or other spreadsheet formats. The format pandas uses to store this data is called a data frame.

To have a quick peek at the data, I can type df.head(<number_of_rows>) (where "df" is the name of the data frame). Notice that I do not have to use the pd.-syntax for this. Once a data set such as iris has been loaded, it is a pandas data frame, and all pandas data frames have a number of built-in functions (called methods) that can be called directly on the data frame instead of going through pandas separately.


In [ ]:
import pandas as pd

In [ ]:
# this will read in a csv file into a pandas DataFrame
# this csv has data of country spending on healthcare
#data = pd.read_csv('health.csv', header=0, index_col=0, encoding="ISO-8859-1")

# load several datasets 
data = pd.read_csv('https://tinyurl.com/uoftcode-health', header=0, index_col=0, 
                   encoding="ISO-8859-1")
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

In [ ]:
# the .head() function will allow us to look at first few lines of the dataframe
data.head(10) # default is 5 rows
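
Data frames do not have to come from a file. As a sketch, a small one can be typed in by hand from a dictionary of columns (the name flowers and the few rows in it are only for illustration):

```python
import pandas as pd

# A small hand-typed table, only for illustration
flowers = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.1, 4.9, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.3, 2.7],
})

print(flowers.head(2))   # first two rows
print(flowers.shape)     # (rows, columns)
```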

In [ ]:
# by default, rows are indicated first, followed by the column: [row, column]
data.loc['Canada', '2008']

In [ ]:
# you can also slice a dataframe
data.loc['Canada':'Chile', '1999':'2001']
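
Note that .loc selects by label, and label slices include both endpoints. Its counterpart .iloc selects by position instead. A minimal sketch on a made-up table (the name toy and its numbers are placeholders, not real spending data):

```python
import pandas as pd

# Placeholder numbers, only to show the indexing syntax
toy = pd.DataFrame(
    {'2000': [1.0, 2.0, 3.0], '2001': [1.5, 2.5, 3.5]},
    index=['A', 'B', 'C'],
)

print(toy.loc['B', '2001'])      # by label: row 'B', column '2001'
print(toy.iloc[1, 1])            # by position: second row, second column
print(toy.loc['A':'B', '2000'])  # label slice: includes both 'A' and 'B'
print(toy.iloc[0:2, 0])          # position slice: excludes the end, as in lists
```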

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

In [ ]:
# the .plot() function will create a simple graph for you to quickly visualize your data
data.loc['Denmark'].plot()
data.loc['Canada'].plot()
data.loc['India'].plot()
plt.legend()
plt.show()

In [ ]:
iris.head()

In [ ]:
iris.shape # rows, columns

In [ ]:
iris.columns # names of columns

To select a column we can index the data frame with the column name.


In [ ]:
iris['sepal_length']

The output is here rendered slightly differently from before, because when we are looking only at one column, it is no longer a data frame, but a *series*. The differences are not important for this lecture, so this is all you need to know about that for now.
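
A quick way to see the difference, sketched on a small hand-typed data frame (the name flowers and its values are only for illustration): single brackets return a series, while double brackets (a list of column names) return a data frame.

```python
import pandas as pd

# A small hand-typed table, only for illustration
flowers = pd.DataFrame({
    'species': ['setosa', 'virginica'],
    'sepal_length': [5.1, 6.3],
    'sepal_width': [3.5, 3.3],
})

one_column = flowers['sepal_length']        # a series
still_a_frame = flowers[['sepal_length']]   # a data frame with one column

print(type(one_column))
print(type(still_a_frame))
```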

We could now create a new column if we wanted:


In [ ]:
iris['sepal_length_x2'] = iris['sepal_length'] * 2
iris['sepal_length_x2']

And delete that column again:


In [ ]:
iris = iris.drop('sepal_length_x2', axis=1) # axis 0 = index, axis 1 = columns

There are some built-in methods that make it convenient to calculate common operations on data frame columns.


In [ ]:
iris['sepal_length'].mean()

In [ ]:
iris['sepal_length'].median()

It is also possible to use these methods on all columns at the same time without having to type the same thing over and over again.


In [ ]:
iris.mean(numeric_only=True) # skip the non-numeric species column

Similarly, you can get a statistical summary of the data frame:


In [ ]:
iris.describe()

Subsetting data

A common task is to subset the data into only those observations that match a criterion. For example, we might be interested in studying only one specific species. First let's find out how many different species there are in our data set:


In [ ]:
iris['species'].unique()

Let's arbitrarily choose setosa as the one to study! To select only observations from this species in the original data frame, we index the data frame with a comparison:


In [ ]:
iris['species'] == 'setosa'

In [ ]:
iris[iris['species'] == 'setosa']
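
Several comparisons can be combined: & means "and", | means "or". The sketch below uses a small hand-typed data frame (the name flowers, its values, and the threshold 5.0 are only for illustration) and stores each comparison in its own variable to keep the line readable:

```python
import pandas as pd

# A small hand-typed table, only for illustration
flowers = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.1, 4.9, 6.3, 5.8],
})

is_setosa = flowers['species'] == 'setosa'   # one True/False value per row
is_long = flowers['sepal_length'] > 5.0

long_setosa = flowers[is_setosa & is_long]       # both conditions must hold
setosa_or_long = flowers[is_setosa | is_long]    # at least one condition holds

print(long_setosa)
print(setosa_or_long)
```

If the comparisons are written directly inside the brackets, each one needs its own parentheses, e.g. flowers[(flowers['species'] == 'setosa') & (flowers['sepal_length'] > 5.0)].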

Now we can easily perform computation on this subset of the data:


In [ ]:
iris[iris['species'] == 'setosa'].mean(axis=0, numeric_only=True)

We could also compare all groups within the data against each other, by using the split-apply-combine workflow. This splits data into groups, applies an operation on each group, and then combines the results into a table for display.

In pandas, we split into groups with the groupby method and then apply an operation to the grouped data frame, e.g. .mean().


In [ ]:
iris.groupby('species').mean()
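
To make the three steps concrete, the sketch below does split-apply-combine by hand with a loop and compares it with groupby. It uses a small hand-typed data frame (the name flowers and its values are only for illustration):

```python
import pandas as pd

# A small hand-typed table, only for illustration
flowers = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.1, 4.9, 6.3, 5.8],
})

manual = {}
for name in flowers['species'].unique():
    group = flowers[flowers['species'] == name]    # split
    manual[name] = group['sepal_length'].mean()    # apply
manual = pd.Series(manual)                         # combine

# groupby performs the same three steps in one line
grouped = flowers.groupby('species')['sepal_length'].mean()

print(manual)
print(grouped)
```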

We can also easily count the number of observations in each group:


In [ ]:
iris.groupby('species').size()

Data visualization

We can see that there are clear differences between species, but they might be even clearer if we display them graphically in a chart.

Plotting with pandas

Pandas interfaces with one of Python's most powerful data visualization libraries, matplotlib, to enable simple visualizations at minimal effort.


In [ ]:
# Prevent plots from popping up in a new window
%matplotlib inline

species_comparison = iris.groupby('species').mean() # Assign to a variable
species_comparison.plot(kind='bar')

Depending on what you are interested in showing, it could be useful to have the species as the different colors and the columns along the x-axis. We can easily achieve this by transposing (.T) our data frame.


In [ ]:
species_comparison.T.plot(kind='bar')

Plotting with seaborn

Another plotting library is seaborn, which also builds upon matplotlib, and extends it by adding new styles, additional plot types and some commonly performed statistical measures.


In [ ]:
import seaborn as sns

sns.swarmplot(x='species', y='sepal_length', data=iris)

The labels on this plot look a bit small, so let's change the style we are using for plotting.


In [ ]:
sns.set(style='ticks', context='talk', rc={'figure.figsize':(8, 5),'axes.spines.right':False, 'axes.spines.top':False}) # This applies to all subsequent plots
# styles: darkgrid, whitegrid, dark, white, and ticks
# contexts: paper, notebook, talk, poster
#sns.axes_style()

In [ ]:
sns.swarmplot(x='species', y='sepal_length', data=iris)

We can use the same syntax to create many of the common plots in seaborn.


In [ ]:
sns.barplot(x='species', y='sepal_length', data=iris)

Bar charts are a common, but not very useful, way of presenting data aggregations (e.g. the mean). A better way is to use the points as we did above, or a plot that captures the distribution of the data, such as a boxplot or a violin plot:


In [ ]:
sns.violinplot(x='species', y='sepal_length', data=iris)

We can also combine two plots by simply placing the two lines one after the other. There is also a more advanced figure interface available in matplotlib to explicitly indicate which figure and axes you want the plot to appear in, but this is outside the scope of this tutorial (more info here and here).


In [ ]:
sns.violinplot(x='species', y='sepal_length', data=iris, inner=None)
sns.swarmplot(x='species', y='sepal_length', data=iris, color='black', size=4)

Instead of plotting one categorical variable vs a numerical variable, we can also plot two numerical values against each other to explore potential correlations between these two variables:


In [ ]:
sns.lmplot(x='sepal_width', y='sepal_length', data=iris, height=6)

There is a regression line plotted by default to indicate the trend in the data. Let's turn that off for now and look at only the data points.


In [ ]:
sns.lmplot(x='sepal_width', y='sepal_length', data=iris, fit_reg=False, height=6)

There appears to be some structure in this data. At least two clusters of points seem to be present. Let's color according to species and see if that explains what we see.


In [ ]:
sns.lmplot(x='sepal_width', y='sepal_length', data=iris, hue='species', fit_reg=False, height=6)

Now we can add back the regression line, but this time one for each group.


In [ ]:
sns.lmplot(x='sepal_width', y='sepal_length', data=iris, hue='species', fit_reg=True, height=6)

Instead of creating a plot for each variable against each other, we can easily create a grid of subplots for all variables with a single command:


In [ ]:
sns.pairplot(iris, hue="species", height=3.5)

More complex visualizations

Many visualizations are easier to create if we first reshape our data frame into the tidy format, which is what seaborn prefers. This is also referred to as changing the data frame format from wide (many columns) to long (many rows), since it moves information from columns to rows:

We can use pandas built-in melt-function to "melt" the wide data frame into the long format. The new columns will be given the names variable and value by default (see the help of melt if you would like to change these names).


In [ ]:
iris_long = pd.melt(iris, id_vars = 'species')
iris_long
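
As a sketch of how the default column names can be changed, melt accepts the var_name and value_name arguments. The small data frame wide and the new names measurement and cm are made up for illustration:

```python
import pandas as pd

# A small hand-typed wide table, only for illustration
wide = pd.DataFrame({
    'species': ['setosa', 'virginica'],
    'sepal_length': [5.1, 6.3],
    'sepal_width': [3.5, 3.3],
})

# one row per (species, measurement) pair instead of one column per measurement
long = pd.melt(wide, id_vars='species', var_name='measurement', value_name='cm')

print(wide.shape)
print(long.shape)
print(long)
```

The wide table has 2 rows and 3 columns, while the long table has 4 rows: each measurement column has been stacked underneath the previous one.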

We do not need to call groupby or mean on the long iris data frame when plotting with seaborn. Instead we control these options from seaborn with the plot type we choose (barplot = mean automatically) and the hue-parameter, which is analogous to groupby.


In [ ]:
sns.set(context='poster', style='white', rc={'figure.figsize':(10, 6), 'axes.spines.right':False, 'axes.spines.top':False})

sns.swarmplot(x='variable', y='value', hue = 'species', data=iris_long, dodge=True, palette='Set2', size=4)

In [ ]:
sns.set(context='poster', style='darkgrid', rc={'figure.figsize':(12, 6)})
# stripplot: scatterplot where one variable is categorical 

sns.boxplot(y='variable', x='value', hue='species', data=iris_long, color='c')
sns.stripplot(y='variable', x='value', hue='species', data=iris_long, size=2.5, palette=['k']*3, jitter=True, dodge=True)
plt.xlim([0, 10])

It is also possible to get too fancy and accidentally hide important messages in the data. However, the fact that you now have access to several ways to plot your data forces you to consider what is actually important and how you can best communicate that message, rather than always making the same plot without considering its strengths and weaknesses.

Resources to learn more

The documentation for each of these packages is a great resource to learn more.