In [ ]:
import pandas as pd
import numpy as np

Table of Contents


In [ ]:
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')

DataFrame basics

Difficulty: easy

A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

Consider the following Python dictionary data and Python list labels:


In [ ]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Task: Create a DataFrame df from this dictionary data which has the index labels.


In [ ]:
# Write your answer here

Task: Display a summary of the basic information about this DataFrame and its data.


In [ ]:
# Write your answer here

Task: Return the first 3 rows of the DataFrame df.


In [ ]:
# Write your answer here

Task: Select just the 'animal' and 'age' columns from the DataFrame df.


In [ ]:
# Write your answer here

Task: Select the data in rows ["d", "e", "i"] and in columns ['animal', 'age'].


In [ ]:
# Write your answer here

Task: Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].


In [ ]:
# Write your answer here

Task: Select only the rows where the number of visits is greater than 2.


In [ ]:
# Write your answer here

Task: Select the rows where the age is missing, i.e. is NaN.


In [ ]:
# Write your answer here

Task: Select the rows where the animal is a cat and the age is less than 3.


In [ ]:
# Write your answer here

Task: Change the age in row 'f' to 1.5.


In [ ]:
# Write your answer here

Task: Calculate the sum of all visits (the total number of visits).


In [ ]:
# Write your answer here

Task: Calculate the mean age for each different animal in df.


In [ ]:
# Write your answer here

Task: Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.


In [ ]:
# Write your answer here

Task: Count the number of each type of animal in df.


In [ ]:
# Write your answer here

Task: Sort df first by the values in the 'age' in decending order, then by the value in the 'visit' column in ascending order.


In [ ]:
# Write your answer here

Task: The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.


In [ ]:
# Write your answer here

Task: In the 'animal' column, change the 'snake' entries to 'python'.


In [ ]:
# Write your answer here

Task: For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

DataFrames: beyond the basics

Slightly trickier: you may need to combine two or more methods to get the right answer

Difficulty: medium

The previous section was tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

Task: How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?


In [ ]:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

In [ ]:
# Write your answer here

Task: A DataFrame has a column of groups 'grps' and and column of numbers 'vals'. For each group, find the sum of the three greatest values.


In [ ]:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

In [ ]:
# Write your answer here

DataFrames: harder problems

These might require a bit of thinking outside the box...

...but all are solvable using just the usual pandas/NumPy methods (and so avoid using explicit for loops).

Difficulty: hard

Task: Consider a DataFrame consisting of purely numerical data. Create a list of the row-column indices of the 3 largest values.


In [ ]:
df = pd.DataFrame.from_dict({'A': {1: 0, 2: 6, 3: 12, 4: 18, 5: 24},
 'B': {1: 1, 2: 7, 3: 13, 4: 19, 5: 25},
 'C': {1: 2, 2: 8, 3: 14, 4: 20, 5: 26},
 'D': {1: 3, 2: 9, 3: 15, 4: 21, 5: 27},
 'E': {1: 4, 2: 10, 3: 16, 4: 22, 5: 28},
 'F': {1: 5, 2: 11, 3: 17, 4: 23, 5: 29}})
df.head()

In [ ]:
# Write your answer here

Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.


In [ ]:
import pandas as pd

In [ ]:
df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=["groups", "vals"])
df

In [ ]:
# Write your answer here

Cleaning Data

Making a DataFrame easier to work with

Difficulty: easy/medium

It happens all the time: someone gives you data containing malformed strings, Python, lists and missing data. How do you tidy it up so you can get on with the analysis?


In [ ]:
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})

Task: Some values in the the FlightNumber column are missing. These numbers are meant to increase by 10 with each row. Therefore the numbers 10055 and 10075 need to replace the missing values. Fill in these missing numbers and make the column an integer column (instead of a float column).


In [ ]:
# Write your answer here

Apply

Task: The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to make two new columns with the correct values. Assign the correct column names to a new temporary DataFrame called temp.


In [ ]:
# Write your answer here

Task: Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame. Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London".)


In [ ]:
# Write your answer here

Task: Delete the From_To column from df and attach the temporary DataFrame from the previous questions.


In [ ]:
# Write your answer here

Task: In the Airline column, you can see some extra puctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways. )' should become 'British Airways'.


In [ ]:
# Write your answer here

Task:. In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.

Expand the Series of lists into a DataFrame named delays, rename the columns delay_1, delay_2, etc. and replace the unwanted RecentDelays column in df with delays.


In [ ]:
# Write your answer here

The DataFrame should look much better now.

Rename, delete, rank and pivot


In [ ]:
fn = "data/Saliva.txt"
df = pd.read_csv(fn, sep='\t')
df

Task: Rename species to NCBI_TaxID


In [ ]:
# Write your answer here

Task: Delete the columns named frequency and rank


In [ ]:
# Write your answer here

Task: Select the top 2 most abundant taxa per sample category. Write a function that expects a DataFrame, a column-name, and top-n (an Integer indicating the top n most abundant things within the given column-name).


In [ ]:
# Write your answer here

Task: Create a column named Rank that ranks the taxa by their abundance within each SampleCategory in descending order (most abundant taxa get the lowest rank).


In [ ]:
# Write your answer here

Task: Reshape the DataFrame so that you can compare the values of Rank and Abundance from the three sample categories by placing them next to each other in one row per taxon. In other words, reshape in a way that you get one row per taxon, placing Rank and Abundance values from the three sample categories next to each other (from "long" to "wide" format).


In [ ]:
# Write your answer here