pandas material

This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.
If you haven't done so already, please read through the Introduction to this course, which covers:
This lesson covers:
In [ ]:
import os
import numpy as np
from numpy.testing import assert_almost_equal, assert_array_equal
import pandas as pd
In [ ]:
df = pd.read_csv('../seaborn-data/iris.csv')
Here we loaded the data from the CSV into a pandas DataFrame, which is a table of data plus a label for each row and column.
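To make the idea of row and column labels concrete, a DataFrame can also be built by hand from a dictionary of columns. This is a small sketch with made-up values, not part of the iris data:

```python
import pandas as pd

# Each dictionary key becomes a column label; pandas adds the
# integer row index (0, 1, 2, ...) automatically, just as it
# did when we imported the CSV file.
mini = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'species': ['setosa', 'setosa', 'setosa'],
})

print(mini.shape)         # (3, 2)
print(list(mini.columns))  # ['sepal_length', 'species']
print(list(mini.index))    # [0, 1, 2]
```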
How big is this dataset?
In [ ]:
df.shape
In [ ]:
df.head()
We now see that each row corresponds to a flower sample and each column to a flower attribute. Note the names above each column and the index values to the left of each row. Here those index values were not present in the raw data but were automatically added when we imported the CSV file.
We might also want to view a statistical summary of the dataset to learn the mean and variance of each flower attribute:
In [ ]:
df.describe()
Here std is the standard deviation, and 25%, 50% and 75% are percentiles of the data.
Also, note that the summary is only of the columns that contain numerical data.
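If you also want the non-numerical columns summarised, describe accepts an include argument. A small sketch on a made-up table (the column names echo the iris data but the values are invented):

```python
import pandas as pd

tiny = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})

# By default, describe() summarises only the numeric columns
print(tiny.describe().columns.tolist())               # ['sepal_length']

# include='all' adds count/unique/top/freq rows for text columns too
print(tiny.describe(include='all').columns.tolist())  # ['sepal_length', 'species']
```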
If we want to extract just one column, we can append the column name, as a quoted string in square brackets, to the DataFrame name. For example, we can calculate the mean sepal length like this:
In [ ]:
df['sepal_length'].mean()
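The same pattern works for pandas' other summary statistics, such as the median. A sketch on a small made-up sample rather than the full iris data:

```python
import pandas as pd

sample = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7, 5.0, 5.4]})

# median() and min() work just like mean(): select the column,
# then call the statistic as a method
print(sample['sepal_length'].median())  # 5.0
print(sample['sepal_length'].min())     # 4.7
```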
In [ ]:
df['species'].unique()
This lists the distinct species. If we just want to know how many distinct species there are, nunique gives the count directly:
In [ ]:
df['species'].nunique()
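Closely related is value_counts, which tallies how many samples there are of each species. Another sketch with invented values:

```python
import pandas as pd

obs = pd.DataFrame({'species': ['setosa', 'setosa', 'setosa', 'versicolor']})

# value_counts() counts the rows for each distinct value,
# most frequent first
counts = obs['species'].value_counts()
print(counts['setosa'])      # 3
print(counts['versicolor'])  # 1
```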
In [ ]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
for species_name, species_specific_df in df.groupby('species'):
    species_specific_df.plot(kind='scatter', x='petal_length', y='petal_width',
                             title=species_name);
In [ ]:
# For each distinct species in our dataset
for species in df['species'].unique():
    # Isolate all samples of just that species and store the result in a new DataFrame
    df_for_species = df[df['species'] == species]
    # Create a blank figure
    plt.figure()
    # Create a scatter plot of petal length against width for the current species only
    plt.scatter(df_for_species['petal_length'], df_for_species['petal_width'])
    # Add a species-specific title
    plt.title(species)
    # Add x and y axis labels
    plt.xlabel('Petal length')
    plt.ylabel('Petal width')
Alternatively, we can use the seaborn library, which is designed to work well with pandas DataFrames. It hides some of the matplotlib complexity it is built on, which can make it harder to see what is going on underneath, but it conveniently puts the data for all three species on a single plot:
In [ ]:
import seaborn as sns
g = sns.FacetGrid(data=df, hue='species', height=6)
g.map(plt.scatter, 'petal_width', 'petal_length')
g.add_legend();
seaborn's lmplot offers another, even more concise, route to the same plot; the same trade-offs apply:
In [ ]:
import seaborn as sns
sns.lmplot(x='petal_width', y='petal_length', data=df, hue='species', fit_reg=False, height=6);
Try summarising and plotting a different dataset using the commands you've learned. The dataset to investigate is Anscombe's quartet, the data for which can be found in the CSV file ../seaborn-data/anscombe.csv.
R has many functions built in, but there are over 8,000 freely available add-on packages that provide thousands more. Once you know the name of a package, you can install it very easily.
For example, a package called ggplot2 is widely used to create high quality graphics. To install ggplot2:
install.packages("ggplot2")
We make all of the ggplot2 functions available to our R session with the library command
library(ggplot2)
Among other things, this makes the qplot function available to us. We can use this as an alternative to the basic plot command described above
qplot(iris$Petal.Length, iris$Petal.Width,col=iris$Species)
Alternatively, we can save ourselves typing iris$ a lot by telling qplot that the data we are referring to is the iris data
qplot(data=iris,Petal.Length, Petal.Width,col=Species)
To get help about the functionality in the ggplot2 package:
help(package=ggplot2)
A very popular R package is MASS which was created to support the book Modern Applied Statistics with S. This contains many more classic data sets which can be used to develop your R skills.
Working with built-in datasets is great for practice, but for real-life work it's vital that you can import your own data. Before we do this, we must learn where Python expects to find your files. It does this using the concept of the current working directory.
To see what the current working directory is, we can use a function from the os package (which is always distributed with Python):
In [ ]:
import os
os.getcwd()
We can list the contents of the directory with:
In [ ]:
os.listdir()
You can create a new directory using the mkdir function:
In [ ]:
os.mkdir('mydata')
os.listdir()
Move into this new directory using the chdir function, then view its contents:
In [ ]:
os.chdir('mydata')
os.listdir()
The current working directory is where Python is currently preferentially looking for files and also where it will put any files it creates unless you tell it otherwise.
In this section, you'll learn how to import data into Python from the common .csv (comma separated values) format.
Download the file example_data.csv to your current working directory.
You can do this manually using your web browser, and then run the following to load the data into a pandas DataFrame:
In [ ]:
example_data = pd.read_csv('example_data.csv')
Or you can supply a URL as the first argument to the read_csv function from pandas to download and instantiate a DataFrame in a single step:
In [ ]:
example_data = pd.read_csv('https://raw.githubusercontent.com/mikecroucher/Code_cafe/master/First_steps_with_R/example_data.csv')
In the simplest terms, a script is just a text file containing a list of R commands. We can run this list in order with a single command called source().
An alternative way to think of a script is as a permanent, repeatable, annotated, shareable, cross-platform archive[1] of your analysis! Everything required to repeat your analysis is available in a single place. The only extra required ingredient is a computer.
For example, based on the article at http://www.walkingrandomly.com/?p=5254, we have created a script called best_fit.R that finds the parameters p1 and p2 such that the curve p1*cos(p2*xdata) + p2*sin(p1*xdata) is a best fit for the example_data described earlier. The details of this are beyond the scope of this course but you can easily download and run this analysis yourself.
download.file('https://raw.githubusercontent.com/mikecroucher/Code_cafe/master/First_steps_with_R/best_fit.R',destfile='best_fit.R')
source('best_fit.R')
By doing this, you have reproduced the analysis that we did. You are able to check and extend our results or apply the code to your own work. Making code and data publicly available like this is the foundation of Open Data Science.
In this session, we told you how to import data from a file but not how to export it. The following link will teach you how to export to .csv. Tutorial: Exporting an R data frame to a .csv file
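For the pandas DataFrames used in this lesson, the equivalent export is the to_csv method. A minimal sketch with an invented table and a hypothetical filename:

```python
import os
import pandas as pd

results = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# index=False stops the automatic row index being written
# to the file as an extra column
results.to_csv('exported.csv', index=False)

# Reading the file back recovers the original table
round_trip = pd.read_csv('exported.csv')
print(round_trip.equals(results))  # True

os.remove('exported.csv')  # tidy up
```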
There are many resources online and in print for learning Python. Here are some recommendations:
[1] Getting Started with R: An Introduction for Biologists. Beckerman and Petchey.