In [1]:
import pandas as pd
import numpy as np

Table of Contents


In [2]:
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')


DataFrame basics

Difficulty: easy

A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

Consider the following Python dictionary data and Python list labels:


In [3]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Task: Create a DataFrame df from this dictionary data which has the index labels.


In [4]:
# Answer
df = pd.DataFrame(data, index=labels)

Task: Display a summary of the basic information about this DataFrame and its data.


In [5]:
# Answer
df.info()

# ...or...

df.describe()


<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
age         8 non-null float64
animal      10 non-null object
priority    10 non-null object
visits      10 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
Out[5]:
age visits
count 8.000000 10.000000
mean 3.437500 1.900000
std 2.007797 0.875595
min 0.500000 1.000000
25% 2.375000 1.000000
50% 3.000000 2.000000
75% 4.625000 2.750000
max 7.000000 3.000000

Task: Return the first 3 rows of the DataFrame df.


In [6]:
# Answer
df.iloc[:3]

# or equivalently

df.head(3)


Out[6]:
age animal priority visits
a 2.5 cat yes 1
b 3.0 cat yes 3
c 0.5 snake no 2

Task: Select just the 'animal' and 'age' columns from the DataFrame df.


In [7]:
# Answer
df.loc[:, ['animal', 'age']]

# or

df[['animal', 'age']]


Out[7]:
animal age
a cat 2.5
b cat 3.0
c snake 0.5
d dog NaN
e dog 5.0
f cat 2.0
g snake 4.5
h cat NaN
i dog 7.0
j dog 3.0

Task: Select the data in rows ["d", "e", "i"] and in columns ['animal', 'age'].


In [8]:
# Answer
df.loc[["d", "e", "i"], ["animal", "age"]]


Out[8]:
animal age
d dog NaN
e dog 5.0
i dog 7.0

Task: Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].


In [9]:
# Answer
# .ix is deprecated (and removed in recent pandas versions), so instead either
df.iloc[[3, 4, 8]][['animal', 'age']]
# or, resolving the column labels to positions explicitly,
df.iloc[[3, 4, 8], df.columns.get_indexer(['animal', 'age'])]


Out[9]:
animal age
d dog NaN
e dog 5.0
i dog 7.0

Task: Select only the rows where the number of visits is greater than 2.


In [10]:
# Answer
df[df['visits'] > 2]


Out[10]:
age animal priority visits
b 3.0 cat yes 3
d NaN dog yes 3
f 2.0 cat no 3

Task: Select the rows where the age is missing, i.e. is NaN.


In [11]:
# Answer
df[df['age'].isnull()]


Out[11]:
age animal priority visits
d NaN dog yes 3
h NaN cat yes 1

Task: Select the rows where the animal is a cat and the age is less than 3.


In [12]:
# Answer
df[(df['animal'] == 'cat') & (df['age'] < 3)]


Out[12]:
age animal priority visits
a 2.5 cat yes 1
f 2.0 cat no 3

Task: Change the age in row 'f' to 1.5.


In [13]:
# Answer
df.loc['f', 'age'] = 1.5

Task: Calculate the sum of all visits (the total number of visits).


In [14]:
# Answer
df['visits'].sum()


Out[14]:
19

Task: Calculate the mean age for each different animal in df.


In [15]:
# Answer
df.groupby('animal')['age'].mean()


Out[15]:
animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

Task: Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.


In [16]:
# Answer
df.loc['k'] = [5.5, 'dog', 'no', 2]

# and then deleting the new row...

df = df.drop('k')

Task: Count the number of each type of animal in df.


In [17]:
# Answer
df['animal'].value_counts()


Out[17]:
cat      4
dog      4
snake    2
Name: animal, dtype: int64

Task: Sort df first by the values in the 'age' column in descending order, then by the values in the 'visits' column in ascending order.


In [18]:
# Answer
df.sort_values(by=['age', 'visits'], ascending=[False, True])


Out[18]:
age animal priority visits
i 7.0 dog no 2
e 5.0 dog no 2
g 4.5 snake no 1
j 3.0 dog no 1
b 3.0 cat yes 3
a 2.5 cat yes 1
f 1.5 cat no 3
c 0.5 snake no 2
h NaN cat yes 1
d NaN dog yes 3

Task: The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.


In [19]:
# Answer
df['priority'] = df['priority'].map({'yes': True, 'no': False})
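An equivalent alternative (not in the original notebook) skips the map and compares against 'yes' directly, since an elementwise comparison already yields booleans:

```python
import pandas as pd

# miniature illustrative frame; the real df has the same kind of 'priority' column
df = pd.DataFrame({'priority': ['yes', 'yes', 'no', 'no']})

# elementwise comparison produces True for 'yes' and False for everything else
df['priority'] = df['priority'] == 'yes'
```

One difference worth knowing: with map, any value other than 'yes'/'no' becomes NaN, whereas the comparison silently maps it to False.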

Task: In the 'animal' column, change the 'snake' entries to 'python'.


In [20]:
# Answer
df['animal'] = df['animal'].replace('snake', 'python')

Task: For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).
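An answer cell appears to be missing for this task; a sketch using pivot_table (shown here on the original, unmodified ages for clarity) could look like:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog',
                   'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]}
df = pd.DataFrame(data, index=list('abcdefghij'))

# one row per animal, one column per visit count, values = mean age
result = df.pivot_table(index='animal', columns='visits',
                        values='age', aggfunc='mean')
print(result)
```

Cells with no matching (animal, visits) pair come out as NaN, since NaN ages are dropped before averaging.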

DataFrames: beyond the basics

Slightly trickier: you may need to combine two or more methods to get the right answer

Difficulty: medium

The previous section was a tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

Task: How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?


In [21]:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

In [22]:
# Answer
len(df) - df.duplicated(keep=False).sum()

# or perhaps more simply...

len(df.drop_duplicates(keep=False))


Out[22]:
4

Task: A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For each group, find the sum of the three greatest values.


In [23]:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

In [24]:
# Answer
df.groupby('grps')['vals'].nlargest(3).sum(level=0)
# note: Series.sum(level=...) is deprecated in newer pandas versions;
# the equivalent there is .nlargest(3).groupby(level=0).sum()


Out[24]:
grps
a    409
b    156
c    345
Name: vals, dtype: int64

DataFrames: harder problems

These might require a bit of thinking outside the box...

...but all are solvable using just the usual pandas/NumPy methods (and so avoid using explicit for loops).

Difficulty: hard

Task: Consider a DataFrame consisting of purely numerical data. Create a list of the row-column indices of the 3 largest values.


In [25]:
df = pd.DataFrame.from_dict({'A': {1: 0, 2: 6, 3: 12, 4: 18, 5: 24},
 'B': {1: 1, 2: 7, 3: 13, 4: 19, 5: 25},
 'C': {1: 2, 2: 8, 3: 14, 4: 20, 5: 26},
 'D': {1: 3, 2: 9, 3: 15, 4: 21, 5: 27},
 'E': {1: 4, 2: 10, 3: 16, 4: 22, 5: 28},
 'F': {1: 5, 2: 11, 3: 17, 4: 23, 5: 29}})
df.head()


Out[25]:
A B C D E F
1 0 1 2 3 4 5
2 6 7 8 9 10 11
3 12 13 14 15 16 17
4 18 19 20 21 22 23
5 24 25 26 27 28 29

In [26]:
# Answer
df.unstack().sort_values()[-3:].index.tolist()
# http://stackoverflow.com/questions/14941261/index-and-column-for-the-max-value-in-pandas-dataframe/
# credit: DSM


Out[26]:
[('D', 5), ('E', 5), ('F', 5)]
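As an aside (not in the original notebook), Series.nlargest gives the same three index pairs without sorting the whole Series, just ordered largest-first:

```python
import pandas as pd

df = pd.DataFrame.from_dict({'A': {1: 0, 2: 6, 3: 12, 4: 18, 5: 24},
                             'B': {1: 1, 2: 7, 3: 13, 4: 19, 5: 25},
                             'C': {1: 2, 2: 8, 3: 14, 4: 20, 5: 26},
                             'D': {1: 3, 2: 9, 3: 15, 4: 21, 5: 27},
                             'E': {1: 4, 2: 10, 3: 16, 4: 22, 5: 28},
                             'F': {1: 5, 2: 11, 3: 17, 4: 23, 5: 29}})

# unstack flattens the frame to a Series with a (column, row) MultiIndex;
# nlargest picks the top 3 values, largest first
result = df.unstack().nlargest(3).index.tolist()
```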

Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.


In [27]:
import pandas as pd

In [28]:
df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=["groups", "vals"])
df


Out[28]:
groups vals
0 1 1
1 1 -1
2 2 1
3 2 2

In [29]:
# Answer
def replace_within_group(group):
    mask = group < 0
    # Select those values where it is < 0, and replace
    # them with the mean of the values which are not < 0.
    group[mask] = group[~mask].mean() # "~" is the "invert" or "complement" operation
    return group

df.groupby(['groups'])['vals'].transform(replace_within_group)
# http://stackoverflow.com/questions/14760757/replacing-values-with-groupby-means/
# credit: unutbu


Out[29]:
0    1
1    1
2    1
3    2
Name: vals, dtype: int64
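A vectorized variant (an alternative sketch, not the notebook's answer) computes each group's mean of the non-negative values once and then fills the negatives with mask:

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [1, -1], [2, 1], [2, 2]],
                  columns=['groups', 'vals'])

# per-group mean of the non-negative values, broadcast back to every row
group_means = df.groupby('groups')['vals'].transform(
    lambda s: s[s >= 0].mean())

# replace only the negative entries with their group's mean
df['vals'] = df['vals'].mask(df['vals'] < 0, group_means)
```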

Cleaning Data

Making a DataFrame easier to work with

Difficulty: easy/medium

It happens all the time: someone gives you data containing malformed strings, Python lists and missing data. How do you tidy it up so you can get on with the analysis?


In [30]:
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})

Task: Some values in the FlightNumber column are missing. These numbers are meant to increase by 10 with each row. Therefore the numbers 10055 and 10075 need to replace the missing values. Fill in these missing numbers and make the column an integer column (instead of a float column).


In [31]:
# Answer
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)

Apply

Task: The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to make two new columns with the correct values. Assign the correct column names to a new temporary DataFrame called temp.


In [32]:
# Answer
# temp = df.From_To.str.split('_', expand=True)
# temp.columns = ['From', 'To']

# or

temp = pd.DataFrame()
temp["From"], temp["To"] = zip(*df["From_To"].apply(lambda s: s.split("_")))

Task: Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame. Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London".)


In [33]:
# Answer
# temp['From'] = temp['From'].str.capitalize()
# temp['To'] = temp['To'].str.capitalize()

# or

temp['From'] = temp['From'].apply(lambda s: s.capitalize())
temp['To'] = temp['To'].apply(lambda s: s.capitalize())

Task: Delete the From_To column from df and attach the temporary DataFrame from the previous questions.


In [34]:
# Answer
df = df.drop('From_To', axis=1)
df = df.join(temp)

Task: In the Airline column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways. )' should become 'British Airways'.


In [35]:
# Answer
df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()

Task: In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.

Expand the Series of lists into a DataFrame named delays, rename the columns delay_1, delay_2, etc. and replace the unwanted RecentDelays column in df with delays.


In [36]:
# Answer
# there are several ways to do this, but the following approach is possibly the simplest

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)

The DataFrame should look much better now.

Rename, delete, rank and pivot


In [37]:
fn = "data/Saliva.txt"
df = pd.read_csv(fn, sep='\t')
df


Out[37]:
TaxName species Abundance SampleCategory frequency rank
0 Rothia mucilaginosa 43675 9.861078 Caries 10.0 1.0
1 Veillonella atypica 39777 7.030844 Caries 10.0 2.0
2 Prevotella histicola 470565 6.433449 Caries 10.0 3.0
3 Veillonella dispar 39778 5.008735 Caries 10.0 4.0
4 Streptococcus salivarius 1304 4.653603 Caries 10.0 5.0
5 Rothia mucilaginosa 43675 10.521432 Healthy 10.0 1.0
6 Prevotella histicola 470565 5.224614 Healthy 10.0 2.0
7 Veillonella atypica 39777 4.792011 Healthy 10.0 3.0
8 Prevotella melaninogenica 28132 4.042395 Healthy 10.0 4.0
9 Veillonella dispar 39778 3.970057 Healthy 10.0 5.0
10 Rothia mucilaginosa 43675 7.273407 Perio 9.0 1.0
11 Prevotella melaninogenica 28132 4.673063 Perio 10.0 2.0
12 Veillonella atypica 39777 4.401230 Perio 10.0 3.0
13 Prevotella histicola 470565 4.328944 Perio 10.0 4.0
14 Veillonella dispar 39778 3.835228 Perio 10.0 5.0

Task: Rename species to NCBI_TaxID


In [38]:
# Answer:
df = df.rename(columns={"species": "NCBI_TaxID"})

# or

# columns = df.columns.tolist()
# df.columns = [columns[0]] + ["NCBI_TaxID"] + columns[2:]

Task: Delete the columns named frequency and rank


In [39]:
# Answer:
df = df.drop(["rank", "frequency"], axis=1)
df


Out[39]:
TaxName NCBI_TaxID Abundance SampleCategory
0 Rothia mucilaginosa 43675 9.861078 Caries
1 Veillonella atypica 39777 7.030844 Caries
2 Prevotella histicola 470565 6.433449 Caries
3 Veillonella dispar 39778 5.008735 Caries
4 Streptococcus salivarius 1304 4.653603 Caries
5 Rothia mucilaginosa 43675 10.521432 Healthy
6 Prevotella histicola 470565 5.224614 Healthy
7 Veillonella atypica 39777 4.792011 Healthy
8 Prevotella melaninogenica 28132 4.042395 Healthy
9 Veillonella dispar 39778 3.970057 Healthy
10 Rothia mucilaginosa 43675 7.273407 Perio
11 Prevotella melaninogenica 28132 4.673063 Perio
12 Veillonella atypica 39777 4.401230 Perio
13 Prevotella histicola 470565 4.328944 Perio
14 Veillonella dispar 39778 3.835228 Perio

Task: Select the top 2 most abundant taxa per sample category. Write a function that expects a DataFrame, a column name, and top-n (an integer indicating the top n most abundant entries within the given column).


In [40]:
# Answer:
def top_n(df, colname, n=3):
    return df.sort_values(colname, ascending=False)[:n]

df.groupby("SampleCategory").apply(top_n, "Abundance", 2)


Out[40]:
TaxName NCBI_TaxID Abundance SampleCategory
SampleCategory
Caries 0 Rothia mucilaginosa 43675 9.861078 Caries
1 Veillonella atypica 39777 7.030844 Caries
Healthy 5 Rothia mucilaginosa 43675 10.521432 Healthy
6 Prevotella histicola 470565 5.224614 Healthy
Perio 10 Rothia mucilaginosa 43675 7.273407 Perio
11 Prevotella melaninogenica 28132 4.673063 Perio

Task: Create a column named Rank that ranks the taxa by their abundance within each SampleCategory in descending order (most abundant taxa get the lowest rank).


In [41]:
# Answer:
df["Rank"] = df.groupby("SampleCategory")["Abundance"].rank(ascending=False)
df


Out[41]:
TaxName NCBI_TaxID Abundance SampleCategory Rank
0 Rothia mucilaginosa 43675 9.861078 Caries 1.0
1 Veillonella atypica 39777 7.030844 Caries 2.0
2 Prevotella histicola 470565 6.433449 Caries 3.0
3 Veillonella dispar 39778 5.008735 Caries 4.0
4 Streptococcus salivarius 1304 4.653603 Caries 5.0
5 Rothia mucilaginosa 43675 10.521432 Healthy 1.0
6 Prevotella histicola 470565 5.224614 Healthy 2.0
7 Veillonella atypica 39777 4.792011 Healthy 3.0
8 Prevotella melaninogenica 28132 4.042395 Healthy 4.0
9 Veillonella dispar 39778 3.970057 Healthy 5.0
10 Rothia mucilaginosa 43675 7.273407 Perio 1.0
11 Prevotella melaninogenica 28132 4.673063 Perio 2.0
12 Veillonella atypica 39777 4.401230 Perio 3.0
13 Prevotella histicola 470565 4.328944 Perio 4.0
14 Veillonella dispar 39778 3.835228 Perio 5.0

Task: Reshape the DataFrame so that you can compare the values of Rank and Abundance from the three sample categories by placing them next to each other in one row per taxon. In other words, reshape in a way that you get one row per taxon, placing Rank and Abundance values from the three sample categories next to each other (from "long" to "wide" format).


In [42]:
# Answer:
abu = df.pivot_table(values='Abundance', index='TaxName', columns='SampleCategory')
rank = df.pivot_table(values='Rank', index='TaxName', columns='SampleCategory')
dfp = pd.concat([rank, abu], axis=1).reset_index()
dfp


Out[42]:
SampleCategory TaxName Caries Healthy Perio Caries Healthy Perio
0 Prevotella histicola 3.0 2.0 4.0 6.433449 5.224614 4.328944
1 Prevotella melaninogenica NaN 4.0 2.0 NaN 4.042395 4.673063
2 Rothia mucilaginosa 1.0 1.0 1.0 9.861078 10.521432 7.273407
3 Streptococcus salivarius 5.0 NaN NaN 4.653603 NaN NaN
4 Veillonella atypica 2.0 3.0 3.0 7.030844 4.792011 4.401230
5 Veillonella dispar 4.0 5.0 5.0 5.008735 3.970057 3.835228