First steps with pandas

This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.

If you haven't done so already please read through the Introduction to this course, which covers:

  1. What Python is and why it is of interest;
  2. Learning outcomes for the course;
  3. The course structure and support facilities;
  4. An introduction to Jupyter Notebooks;
  5. Information on course exercises.

This lesson covers:

  1. Loading data from a CSV file into a pandas DataFrame;
  2. Inspecting and summarising a DataFrame;
  3. Plotting data with pandas, matplotlib and seaborn;
  4. Installing and importing packages;
  5. The current working directory and importing your own data;
  6. Scripts and reproducible analysis.

Lesson setup code

Run the following Notebook cell every time you load this lesson (but do not edit it). Don't be concerned with what this code does at this stage.

Note: this lesson assumes that the numpy and pandas packages are already installed and available.


In [ ]:
____ = 0

# Modules used throughout this lesson
import os
import numpy as np
# Testing helpers used to check exercise answers
from numpy.testing import assert_almost_equal, assert_array_equal
import pandas as pd

Loading data from a CSV file

The file ../seaborn-data/iris.csv contains the classic iris dataset: measurements of the sepals and petals of 150 iris flowers, along with the species of each. We can load it into Python using pandas' read_csv function:


In [ ]:
df = pd.read_csv('../seaborn-data/iris.csv')

Here we loaded the data from the CSV into a pandas DataFrame, which is a table of data plus a label for each row and column.
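
We can inspect those labels directly: the column labels are stored in the DataFrame's columns attribute and the row labels in its index attribute:


In [ ]:
print(df.columns)
print(df.index)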

How big is this dataset?


In [ ]:
df.shape

Given that it has 150 rows (and 5 columns) we might not want to view it all at once but instead might just want to see the first few rows:

Note that head is a method of the DataFrame rather than a function imported from a package. Both are called using the same foo.bar() dot notation, but a method belongs to, and acts on, the particular object it is attached to (here, our DataFrame df).


In [ ]:
df.head()

We now see that each row corresponds to a flower sample and each column to a flower attribute. Note the names above each column and the index values to the left of each row. Those index values were not present in the raw data but were automatically added when we imported the CSV file.

Each column also has a data type (its dtype). We can check these via the DataFrame's dtypes attribute; here the four measurement columns hold floating-point numbers, while the species column holds strings (which pandas reports as object):
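
In [ ]:
df.dtypes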

We might also want to view a statistical summary of the dataset, to learn, for example, the mean and standard deviation of each flower attribute:


In [ ]:
df.describe()

Here std is the standard deviation, and 25%, 50% and 75% are percentiles of the data. Note also that the summary includes only the columns that contain numerical data.
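
We can also compute any of these statistics individually. For example, the quantile method returns whichever percentile we ask for; this reproduces the 25% figure for sepal length from the summary above:


In [ ]:
df['sepal_length'].quantile(0.25)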

If we want to extract just one column, we can append the column name, wrapped in single quotes and square brackets, to the DataFrame name. For example, we can calculate the mean sepal length like this:


In [ ]:
df['sepal_length'].mean()
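
The same pattern works for other statistics; for example, the median:


In [ ]:
df['sepal_length'].median()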

How many unique species do we have in our dataset?

(Note that unique returns a NumPy array of objects, which is why the output of the Notebook cell below looks a little noisy.)


In [ ]:
df['species'].unique()

or, to get the count of distinct species directly:


In [ ]:
df['species'].nunique()

Plotting data

There are several ways we could plot these data; we will look at a few, each producing one scatter plot of petal length against petal width per species.

The first, fairly compact, approach uses groupby to split the DataFrame into one sub-DataFrame per species, then calls each sub-DataFrame's own plot method. A bonus is that plot gives us titles and axis labels with very little effort:


In [ ]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# Display plots inline in the Notebook
%matplotlib inline

# groupby yields a (species name, sub-DataFrame) pair for each species
for species_name, species_specific_df in df.groupby('species'):
    species_specific_df.plot(kind='scatter', x='petal_length', y='petal_width', 
                             title=species_name);

A second approach creates each figure manually with matplotlib and isolates each species by indexing the DataFrame with a boolean Series. It is more verbose, but the same pattern transfers directly to plotting plain NumPy arrays. It is also a good opportunity to see the value of commenting your code.
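
To see what indexing by a boolean Series means, compare the Series of True/False values produced by a comparison with the sub-DataFrame we get when we use it as an index (setosa is one of the three species, so 50 rows survive):


In [ ]:
mask = df['species'] == 'setosa'  # a boolean Series: True for setosa rows
df[mask].shape                    # indexing with it keeps only those rows

Now the full plotting loop; note how the comments document what each step does: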


In [ ]:
# for each distinct species in our dataset
for species in df['species'].unique():
    # Isolate all samples of just that species and store the result in a new DataFrame
    df_for_species = df[df['species'] == species]
    
    # Create a blank figure
    plt.figure()
    
    # Create a scatter plot for petal length against width for the current species only
    plt.scatter(df_for_species['petal_length'], df_for_species['petal_width'])
    
    # Add a species-specific title
    plt.title(species)
    
    # Add x and y axis labels
    plt.xlabel('Petal length')
    plt.ylabel('Petal width')

A third approach uses the seaborn package, which builds on top of matplotlib and pandas and hides away much of the manual work. A further advantage is that it puts all of the species on a single plot:


In [ ]:
import seaborn as sns
g = sns.FacetGrid(data=df, hue='species', height=6)
g.map(plt.scatter, 'petal_width', 'petal_length')
g.add_legend();

Finally, seaborn's lmplot function can produce much the same plot even more concisely:


In [ ]:
import seaborn as sns
sns.lmplot(x='petal_width', y='petal_length', data=df, hue='species', fit_reg=False, height=6);

Exercise: Anscombe's quartet

Try summarising and plotting a different dataset using the commands you've learned. The dataset to investigate is Anscombe's quartet, the data for which can be found in the CSV file ../seaborn-data/anscombe.csv.

Packages

Python itself has many functions built in, but there are many thousands of freely available add-on packages, distributed via the Python Package Index (PyPI), that provide thousands more. Once you know the name of a package, you can install it very easily from the command line using pip.

For example, the seaborn package we used above is widely used to create high-quality statistical graphics. To install seaborn:

pip install seaborn

We make all of the seaborn functions available to our Python session with an import statement:

import seaborn as sns

Among other things, this makes the lmplot function available to us, which we used earlier as an alternative to calling matplotlib directly.

To get help about the functionality in the seaborn package:

help(sns)

Exercise (Packages)

The seaborn package also bundles a collection of classic datasets (the same collection as the seaborn-data directory used earlier in this lesson), which can be loaded by name with its load_dataset function and used to develop your Python skills.

  1. Check that the seaborn package is available on your machine by importing it.
  2. Explore the list of bundled datasets at https://github.com/mwaskom/seaborn-data and find one that interests you.
  3. Load the dataset you chose in part (2) into your session with sns.load_dataset().
  4. Take a look at that dataset using what you've learned so far.

The current working directory

So far we have loaded files using paths relative to wherever this Notebook is running, and for real-life work it's vital that you can reliably import your own data. Before we do more of this, we must learn where Python expects to find your files. It does this using the concept of the current working directory.

To see what the current working directory is, we can use a function from the os module (part of Python's standard library):


In [ ]:
import os

os.getcwd()

We can list the contents of the directory with:


In [ ]:
os.listdir()

You can create a new directory using the mkdir function:


In [ ]:
os.mkdir('mydata')
os.listdir()

Move into this new directory using the chdir function, then view its contents:


In [ ]:
os.chdir('mydata')
os.listdir()

The current working directory is where Python looks for files first, and where it will put any files it creates, unless you tell it otherwise.
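
Since the working directory affects how relative paths are resolved, let's move back to where we started. The special name '..' refers to the directory one level up (as in the ../seaborn-data path we used earlier):


In [ ]:
os.chdir('..')   # move up one level, back to where we started
os.getcwd()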

Importing your own data

We have already met pandas' read_csv function when loading the iris dataset; in this section you'll see two more ways of importing your own data from the common .csv (comma separated values) format.

Download the file example_data.csv to your current working directory.

You can either do this manually using your web browser and then run the following to load the data into a pandas DataFrame:


In [ ]:
example_data = pd.read_csv('example_data.csv')

Or you can supply a URL as the first argument to the read_csv function from pandas to download and instantiate a DataFrame in a single step:


In [ ]:
example_data = pd.read_csv('https://raw.githubusercontent.com/mikecroucher/Code_cafe/master/First_steps_with_R/example_data.csv')

Exercise: example_data

  • Show the first few lines of example_data
  • Create a plot of the example_data
  • Show summary statistics of example_data

Scripts

In the simplest terms, a script is just a text file containing a list of Python commands. We can run the whole list, in order, with a single command: python myscript.py from the shell, or %run myscript.py from within a Jupyter Notebook.

An alternative way to think of a script is as a permanent, repeatable, annotated, shareable, cross-platform archive [1] of your analysis! Everything required to repeat your analysis is available in a single place. The only extra required ingredient is a computer.

For example, the article at http://www.walkingrandomly.com/?p=5254 considers finding the parameters p1 and p2 such that the curve p1*cos(p2*xdata) + p2*sin(p1*xdata) is a best fit for the example_data described earlier. The details of the fitting are beyond the scope of this course, but the sketch below shows how such an analysis can be captured in a script that anyone can download and run.
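
Here is a minimal sketch of what such a script might look like in Python, using scipy's curve_fit (the filename best_fit.py, the initial parameter guess, and the example_data column names x and y are all assumptions, not part of the original material):


In [ ]:
# best_fit.py -- sketch of a reproducible curve-fitting analysis.
# Assumes example_data.csv has columns named 'x' and 'y'.
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def model(x, p1, p2):
    # The curve we are trying to fit to the data
    return p1 * np.cos(p2 * x) + p2 * np.sin(p1 * x)

data = pd.read_csv('example_data.csv')
params, _ = curve_fit(model, data['x'], data['y'], p0=[1.0, 0.2])  # p0: initial guess
print('Best-fit parameters p1, p2:', params)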

Running a script like this lets anyone reproduce an analysis exactly: they can check and extend the results, or apply the code to their own work. Making code and data publicly available like this is the foundation of Open Data Science.

Further reading and next steps

In this session we showed you how to import data from a file but not how to export it. pandas DataFrames have a to_csv method for this, which mirrors the read_csv function we used above.
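
For example, this would write our iris DataFrame back out to a new CSV file in the current working directory (the filename here is just an illustration):


In [ ]:
df.to_csv('iris_copy.csv', index=False)  # index=False omits the automatic row index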

There are many resources online and in print for learning Python. Here are some recommendations:

  • Data Carpentry's introductory Python for Ecologists tutorial, which provides an introduction to using Python for automating data analysis tasks.
  • Software Carpentry's Programming with Python tutorial, which follows on from the Data Carpentry course and offers a broader introduction to the fundamentals of programming with Python.
  • Dive Into Python 3 by Mark Pilgrim. Freely available online under a CC-BY-SA license and also published as a book by Apress (2009, ISBN: 978-1430224150). Provides an introduction to Python as a general-purpose programming language. Each chapter starts with a block of (initially unreadable) code that supposedly does something useful; this is then picked apart to introduce the reader to different aspects of the language and its core libraries.
  • The Python for Data Analysis book by Wes McKinney (O'Reilly Media, 2012, ISBN: 978-1-4493-1979-3). Wes is the original author of pandas, one of the most popular and powerful Python libraries for manipulating tabular data; this book focusses on data wrangling with pandas, NumPy and IPython.
  • The Learn Python the Hard Way book by Zed Shaw (3rd ed, Addison Wesley, 2013, ISBN: 978-0321884916). Another book on Python as a general-purpose language. Readers can view the book online before they buy.
  • The Project Euler website (https://projecteuler.net), a large collection of mathematical programming challenges to practise your new skills on.

Getting help

When you get stuck, there are several good places to turn:

  • Tab completion in Jupyter: type the start of a name and press the Tab key to see possible completions (note that this does not work on indexed objects such as my_list[0].<Tab>, or on objects that have not yet been instantiated);
  • Online documentation (e.g. for numpy) or the documentation browser in your IDE;
  • Stack Overflow;
  • Mailing lists such as pydata, pystatsmodels, numpy, scipy and matplotlib;
  • IRC channels;
  • The Sheffield Python group.
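
Jupyter also lets you view an object's documentation inline by appending a question mark to its name:


In [ ]:
df.head?   # shows the documentation for the head method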

References


[1] Beckerman, A. and Petchey, O., Getting Started with R: An Introduction for Biologists, Oxford University Press.