In [1]:
import pandas as pd
import numpy as np

Table of Contents


In [2]:
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')


DataFrame basics

Difficulty: easy

A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

Consider the following Python dictionary data and Python list labels:


In [3]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Task: Create a DataFrame df from this dictionary data which has the index labels.


In [4]:
# Answer
df = pd.DataFrame(data, index=labels)

Task: Display a summary of the basic information about this DataFrame and its data.


In [5]:
# Answer
df.info()

# ...or...

df.describe()


<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
age         8 non-null float64
animal      10 non-null object
priority    10 non-null object
visits      10 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
Out[5]:
age visits
count 8.000000 10.000000
mean 3.437500 1.900000
std 2.007797 0.875595
min 0.500000 1.000000
25% 2.375000 1.000000
50% 3.000000 2.000000
75% 4.625000 2.750000
max 7.000000 3.000000

Task: Return the first 3 rows of the DataFrame df.


In [6]:
# Answer
df.iloc[:3]

# or equivalently

df.head(3)


Out[6]:
age animal priority visits
a 2.5 cat yes 1
b 3.0 cat yes 3
c 0.5 snake no 2

Task: Select just the 'animal' and 'age' columns from the DataFrame df.


In [7]:
# Answer
df.loc[:, ['animal', 'age']]

# or

df[['animal', 'age']]


Out[7]:
animal age
a cat 2.5
b cat 3.0
c snake 0.5
d dog NaN
e dog 5.0
f cat 2.0
g snake 4.5
h cat NaN
i dog 7.0
j dog 3.0

Task: Select the data in rows ["d", "e", "i"] and in columns ['animal', 'age'].


In [8]:
# Answer
df.loc[["d", "e", "i"], ["animal", "age"]]


Out[8]:
animal age
d dog NaN
e dog 5.0
i dog 7.0

Task: Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].


In [9]:
# Answer
# .ix is deprecated (and removed in recent pandas versions), so instead either
df.iloc[[3, 4, 8]][['animal', 'age']]
# or, resolving the column labels to positions explicitly,
df.iloc[[3, 4, 8], df.columns.get_indexer(['animal', 'age'])]


Out[9]:
animal age
d dog NaN
e dog 5.0
i dog 7.0

Task: Select only the rows where the number of visits is greater than 2.


In [10]:
# Answer
df[df['visits'] > 2]


Out[10]:
age animal priority visits
b 3.0 cat yes 3
d NaN dog yes 3
f 2.0 cat no 3

Task: Select the rows where the age is missing, i.e. is NaN.


In [11]:
# Answer
df[df['age'].isnull()]


Out[11]:
age animal priority visits
d NaN dog yes 3
h NaN cat yes 1

Task: Select the rows where the animal is a cat and the age is less than 3.


In [12]:
# Answer
df[(df['animal'] == 'cat') & (df['age'] < 3)]


Out[12]:
age animal priority visits
a 2.5 cat yes 1
f 2.0 cat no 3

Task: Change the age in row 'f' to 1.5.


In [13]:
# Answer
df.loc['f', 'age'] = 1.5

Task: Calculate the sum of all visits (the total number of visits).


In [14]:
# Answer
df['visits'].sum()


Out[14]:
19

Task: Calculate the mean age for each different animal in df.


In [15]:
# Answer
df.groupby('animal')['age'].mean()


Out[15]:
animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

Task: Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.


In [16]:
# Answer
df.loc['k'] = [5.5, 'dog', 'no', 2]

# and then deleting the new row...

df = df.drop('k')

Task: Count the number of each type of animal in df.


In [17]:
# Answer
df['animal'].value_counts()


Out[17]:
cat      4
dog      4
snake    2
Name: animal, dtype: int64

Task: Sort df first by the values in the 'age' column in descending order, then by the values in the 'visits' column in ascending order.


In [18]:
# Answer
df.sort_values(by=['age', 'visits'], ascending=[False, True])


Out[18]:
age animal priority visits
i 7.0 dog no 2
e 5.0 dog no 2
g 4.5 snake no 1
j 3.0 dog no 1
b 3.0 cat yes 3
a 2.5 cat yes 1
f 1.5 cat no 3
c 0.5 snake no 2
h NaN cat yes 1
d NaN dog yes 3

Task: The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.


In [19]:
# Answer
df['priority'] = df['priority'].map({'yes': True, 'no': False})
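An equivalent alternative (not in the original notebook) skips the map and compares against 'yes' directly, since an elementwise comparison already yields booleans:

```python
import pandas as pd

# miniature illustrative frame; the real df has the same kind of 'priority' column
df = pd.DataFrame({'priority': ['yes', 'yes', 'no', 'no']})

# elementwise comparison produces True for 'yes' and False for everything else
df['priority'] = df['priority'] == 'yes'
```

One difference worth knowing: with map, any value other than 'yes'/'no' becomes NaN, whereas the comparison silently maps it to False.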

Task: In the 'animal' column, change the 'snake' entries to 'python'.


In [20]:
# Answer
df['animal'] = df['animal'].replace('snake', 'python')

Task: For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).
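An answer cell appears to be missing for this task; a sketch using pivot_table (shown here on the original, unmodified ages for clarity) could look like:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog',
                   'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]}
df = pd.DataFrame(data, index=list('abcdefghij'))

# one row per animal, one column per visit count, values = mean age
result = df.pivot_table(index='animal', columns='visits',
                        values='age', aggfunc='mean')
print(result)
```

Cells with no matching (animal, visits) pair come out as NaN, since NaN ages are dropped before averaging.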

DataFrames: beyond the basics

Slightly trickier: you may need to combine two or more methods to get the right answer

Difficulty: medium

The previous section was a tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

Task: How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?


In [21]:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

In [22]:
# Answer
len(df) - df.duplicated(keep=False).sum()

# or perhaps more simply...

len(df.drop_duplicates(keep=False))


Out[22]:
4

Task: A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For each group, find the sum of the three greatest values.


In [23]:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

In [24]:
# Answer
df.groupby('grps')['vals'].nlargest(3).sum(level=0)
# note: Series.sum(level=...) is deprecated in newer pandas versions;
# the equivalent there is .nlargest(3).groupby(level=0).sum()


Out[24]:
grps
a    409
b    156
c    345
Name: vals, dtype: int64

DataFrames: harder problems

These might require a bit of thinking outside the box...

...but all are solvable using just the usual pandas/NumPy methods (and so avoid using explicit for loops).

Difficulty: hard

Task: Consider a DataFrame consisting of purely numerical data. Create a list of the row-column indices of the 3 largest values.


In [25]:
df = pd.DataFrame.from_dict({'A': {1: 0, 2: 6, 3: 12, 4: 18, 5: 24},
 'B': {1: 1, 2: 7, 3: 13, 4: 19, 5: 25},
 'C': {1: 2, 2: 8, 3: 14, 4: 20, 5: 26},
 'D': {1: 3, 2: 9, 3: 15, 4: 21, 5: 27},
 'E': {1: 4, 2: 10, 3: 16, 4: 22, 5: 28},
 'F': {1: 5, 2: 11, 3: 17, 4: 23, 5: 29}})
df.head()


Out[25]:
A B C D E F
1 0 1 2 3 4 5
2 6 7 8 9 10 11
3 12 13 14 15 16 17
4 18 19 20 21 22 23
5 24 25 26 27 28 29

In [26]:
# Answer
df.unstack().sort_values()[-3:].index.tolist()
# http://stackoverflow.com/questions/14941261/index-and-column-for-the-max-value-in-pandas-dataframe/
# credit: DSM


Out[26]:
[('D', 5), ('E', 5), ('F', 5)]
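As an aside (not in the original notebook), Series.nlargest gives the same three index pairs without sorting the whole Series, just ordered largest-first:

```python
import pandas as pd

df = pd.DataFrame.from_dict({'A': {1: 0, 2: 6, 3: 12, 4: 18, 5: 24},
                             'B': {1: 1, 2: 7, 3: 13, 4: 19, 5: 25},
                             'C': {1: 2, 2: 8, 3: 14, 4: 20, 5: 26},
                             'D': {1: 3, 2: 9, 3: 15, 4: 21, 5: 27},
                             'E': {1: 4, 2: 10, 3: 16, 4: 22, 5: 28},
                             'F': {1: 5, 2: 11, 3: 17, 4: 23, 5: 29}})

# unstack flattens the frame to a Series with a (column, row) MultiIndex;
# nlargest picks the top 3 values, largest first
result = df.unstack().nlargest(3).index.tolist()
```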

Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.


In [27]:
import pandas as pd

In [28]:
df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=["groups", "vals"])
df


Out[28]:
groups vals
0 1 1
1 1 -1
2 2 1
3 2 2

In [29]:
# Answer
def replace_within_group(group):
    mask = group < 0
    # Select those values where it is < 0, and replace
    # them with the mean of the values which are not < 0.
    group[mask] = group[~mask].mean() # "~" is the "invert" or "complement" operation
    return group

df.groupby(['groups'])['vals'].transform(replace_within_group)
# http://stackoverflow.com/questions/14760757/replacing-values-with-groupby-means/
# credit: unutbu


Out[29]:
0    1
1    1
2    1
3    2
Name: vals, dtype: int64
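A vectorized variant (an alternative sketch, not the notebook's answer) computes each group's mean of the non-negative values once and then fills the negatives with mask:

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [1, -1], [2, 1], [2, 2]],
                  columns=['groups', 'vals'])

# per-group mean of the non-negative values, broadcast back to every row
group_means = df.groupby('groups')['vals'].transform(
    lambda s: s[s >= 0].mean())

# replace only the negative entries with their group's mean
df['vals'] = df['vals'].mask(df['vals'] < 0, group_means)
```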

Cleaning Data

Making a DataFrame easier to work with

Difficulty: easy/medium

It happens all the time: someone gives you data containing malformed strings, Python lists and missing data. How do you tidy it up so you can get on with the analysis?


In [30]:
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})

Task: Some values in the FlightNumber column are missing. These numbers are meant to increase by 10 with each row. Therefore the numbers 10055 and 10075 need to replace the missing values. Fill in these missing numbers and make the column an integer column (instead of a float column).


In [31]:
# Answer
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)

Apply

Task: The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to make two new columns with the correct values. Assign the correct column names to a new temporary DataFrame called temp.


In [32]:
# Answer
# temp = df.From_To.str.split('_', expand=True)
# temp.columns = ['From', 'To']

# or

temp = pd.DataFrame()
temp["From"], temp["To"] = zip(*df["From_To"].apply(lambda s: s.split("_")))

Task: Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame. Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London".)


In [33]:
# Answer
# temp['From'] = temp['From'].str.capitalize()
# temp['To'] = temp['To'].str.capitalize()

# or

temp['From'] = temp['From'].apply(lambda s: s.capitalize())
temp['To'] = temp['To'].apply(lambda s: s.capitalize())

Task: Delete the From_To column from df and attach the temporary DataFrame from the previous questions.


In [34]:
# Answer
df = df.drop('From_To', axis=1)
df = df.join(temp)

Task: In the Airline column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways. )' should become 'British Airways'.


In [35]:
# Answer
df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()

Task: In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.

Expand the Series of lists into a DataFrame named delays, rename the columns delay_1, delay_2, etc. and replace the unwanted RecentDelays column in df with delays.


In [36]:
# Answer
# there are several ways to do this, but the following approach is possibly the simplest

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)

The DataFrame should look much better now.

Rename, delete, rank and pivot


In [37]:
fn = "data/Saliva.txt"
df = pd.read_csv(fn, sep='\t')
df


Out[37]:
TaxName species Abundance SampleCategory frequency rank
0 Rothia mucilaginosa 43675 9.861078 Caries 10.0 1.0
1 Veillonella atypica 39777 7.030844 Caries 10.0 2.0
2 Prevotella histicola 470565 6.433449 Caries 10.0 3.0
3 Veillonella dispar 39778 5.008735 Caries 10.0 4.0
4 Streptococcus salivarius 1304 4.653603 Caries 10.0 5.0
5 Rothia mucilaginosa 43675 10.521432 Healthy 10.0 1.0
6 Prevotella histicola 470565 5.224614 Healthy 10.0 2.0
7 Veillonella atypica 39777 4.792011 Healthy 10.0 3.0
8 Prevotella melaninogenica 28132 4.042395 Healthy 10.0 4.0
9 Veillonella dispar 39778 3.970057 Healthy 10.0 5.0
10 Rothia mucilaginosa 43675 7.273407 Perio 9.0 1.0
11 Prevotella melaninogenica 28132 4.673063 Perio 10.0 2.0
12 Veillonella atypica 39777 4.401230 Perio 10.0 3.0
13 Prevotella histicola 470565 4.328944 Perio 10.0 4.0
14 Veillonella dispar 39778 3.835228 Perio 10.0 5.0

Task: Rename species to NCBI_TaxID


In [38]:
# Answer:
df = df.rename(columns={"species": "NCBI_TaxID"})

# or

# columns = df.columns.tolist()
# df.columns = [columns[0]] + ["NCBI_TaxID"] + columns[2:]

Task: Delete the columns named frequency and rank


In [39]:
# Answer:
df = df.drop(["rank", "frequency"], axis=1)
df


Out[39]:
TaxName NCBI_TaxID Abundance SampleCategory
0 Rothia mucilaginosa 43675 9.861078 Caries
1 Veillonella atypica 39777 7.030844 Caries
2 Prevotella histicola 470565 6.433449 Caries
3 Veillonella dispar 39778 5.008735 Caries
4 Streptococcus salivarius 1304 4.653603 Caries
5 Rothia mucilaginosa 43675 10.521432 Healthy
6 Prevotella histicola 470565 5.224614 Healthy
7 Veillonella atypica 39777 4.792011 Healthy
8 Prevotella melaninogenica 28132 4.042395 Healthy
9 Veillonella dispar 39778 3.970057 Healthy
10 Rothia mucilaginosa 43675 7.273407 Perio
11 Prevotella melaninogenica 28132 4.673063 Perio
12 Veillonella atypica 39777 4.401230 Perio
13 Prevotella histicola 470565 4.328944 Perio
14 Veillonella dispar 39778 3.835228 Perio

Task: Select the top 2 most abundant taxa per sample category. Write a function that expects a DataFrame, a column name, and top-n (an integer indicating the top n most abundant entries within the given column).


In [40]:
# Answer:
def top_n(df, colname, n=3):
    return df.sort_values(colname, ascending=False)[:n]

df.groupby("SampleCategory").apply(top_n, "Abundance", 2)


Out[40]:
TaxName NCBI_TaxID Abundance SampleCategory
SampleCategory
Caries 0 Rothia mucilaginosa 43675 9.861078 Caries
1 Veillonella atypica 39777 7.030844 Caries
Healthy 5 Rothia mucilaginosa 43675 10.521432 Healthy
6 Prevotella histicola 470565 5.224614 Healthy
Perio 10 Rothia mucilaginosa 43675 7.273407 Perio
11 Prevotella melaninogenica 28132 4.673063 Perio

Task: Create a column named Rank that ranks the taxa by their abundance within each SampleCategory in descending order (most abundant taxa get the lowest rank).


In [41]:
# Answer:
df["Rank"] = df.groupby("SampleCategory")["Abundance"].rank(ascending=False)
df


Out[41]:
TaxName NCBI_TaxID Abundance SampleCategory Rank
0 Rothia mucilaginosa 43675 9.861078 Caries 1.0
1 Veillonella atypica 39777 7.030844 Caries 2.0
2 Prevotella histicola 470565 6.433449 Caries 3.0
3 Veillonella dispar 39778 5.008735 Caries 4.0
4 Streptococcus salivarius 1304 4.653603 Caries 5.0
5 Rothia mucilaginosa 43675 10.521432 Healthy 1.0
6 Prevotella histicola 470565 5.224614 Healthy 2.0
7 Veillonella atypica 39777 4.792011 Healthy 3.0
8 Prevotella melaninogenica 28132 4.042395 Healthy 4.0
9 Veillonella dispar 39778 3.970057 Healthy 5.0
10 Rothia mucilaginosa 43675 7.273407 Perio 1.0
11 Prevotella melaninogenica 28132 4.673063 Perio 2.0
12 Veillonella atypica 39777 4.401230 Perio 3.0
13 Prevotella histicola 470565 4.328944 Perio 4.0
14 Veillonella dispar 39778 3.835228 Perio 5.0

Task: Reshape the DataFrame so that you can compare the values of Rank and Abundance from the three sample categories by placing them next to each other in one row per taxon. In other words, reshape in a way that you get one row per taxon, placing Rank and Abundance values from the three sample categories next to each other (from "long" to "wide" format).


In [42]:
# Answer:
abu = df.pivot_table(values='Abundance', index='TaxName', columns='SampleCategory')
rank = df.pivot_table(values='Rank', index='TaxName', columns='SampleCategory')
dfp = pd.concat([rank, abu], axis=1).reset_index()
dfp


Out[42]:
SampleCategory TaxName Caries Healthy Perio Caries Healthy Perio
0 Prevotella histicola 3.0 2.0 4.0 6.433449 5.224614 4.328944
1 Prevotella melaninogenica NaN 4.0 2.0 NaN 4.042395 4.673063
2 Rothia mucilaginosa 1.0 1.0 1.0 9.861078 10.521432 7.273407
3 Streptococcus salivarius 5.0 NaN NaN 4.653603 NaN NaN
4 Veillonella atypica 2.0 3.0 3.0 7.030844 4.792011 4.401230
5 Veillonella dispar 4.0 5.0 5.0 5.008735 3.970057 3.835228