In [1]:
import pandas as pd
import numpy as np
In [2]:
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')
In [3]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Task: Create a DataFrame df from this dictionary data, using labels as the index.
In [4]:
# Answer
df = pd.DataFrame(data, index=labels)
Task: Display a summary of the basic information about this DataFrame and its data.
In [5]:
# Answer
df.info()      # index, column dtypes, non-null counts and memory usage
# ...or...
df.describe()  # summary statistics for the numeric columns
Out[5]:
Task: Return the first 3 rows of the DataFrame df.
In [6]:
# Answer
df.iloc[:3]
# or equivalently
df.head(3)
Out[6]:
Task: Select just the 'animal' and 'age' columns from the DataFrame df.
In [7]:
# Answer
df.loc[:, ['animal', 'age']]
# or
df[['animal', 'age']]
Out[7]:
Task: Select the data in rows ["d", "e", "i"] and in columns ['animal', 'age'].
In [8]:
# Answer
df.loc[["d", "e", "i"], ["animal", "age"]]
Out[8]:
Task: Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].
In [9]:
# Answer
# df.ix has been removed from pandas, so use label- or position-based indexing instead
# either
df.loc[df.index[[3, 4, 8]], ['animal', 'age']]
# or, purely positional (assuming 'animal' and 'age' are the first two columns)
df.iloc[[3, 4, 8], [0, 1]]
Out[9]:
Task: Select only the rows where the number of visits is greater than 2.
In [10]:
# Answer
df[df['visits'] > 2]
Out[10]:
Task: Select the rows where the age is missing, i.e. is NaN.
In [11]:
# Answer
df[df['age'].isnull()]
Out[11]:
Task: Select the rows where the animal is a cat and the age is less than 3.
In [12]:
# Answer
df[(df['animal'] == 'cat') & (df['age'] < 3)]
Out[12]:
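An equivalent formulation (not part of the original answer) uses DataFrame.query, which can read more naturally for compound conditions:
df.query("animal == 'cat' and age < 3")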
Task: Change the age in row 'f' to 1.5.
In [13]:
# Answer
df.loc['f', 'age'] = 1.5
Task: Calculate the sum of all visits (the total number of visits).
In [14]:
# Answer
df['visits'].sum()
Out[14]:
Task: Calculate the mean age for each different animal in df.
In [15]:
# Answer
df.groupby('animal')['age'].mean()
Out[15]:
Task: Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.
In [16]:
# Answer
# the values must follow the column order: animal, age, visits, priority
df.loc['k'] = ['dog', 5.5, 2, 'no']
# and then deleting the new row...
df = df.drop('k')
Task: Count the number of each type of animal in df.
In [17]:
# Answer
df['animal'].value_counts()
Out[17]:
Task: Sort df first by the values in the 'age' column in descending order, then by the values in the 'visits' column in ascending order.
In [18]:
# Answer
df.sort_values(by=['age', 'visits'], ascending=[False, True])
Out[18]:
Task: The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.
In [19]:
# Answer
df['priority'] = df['priority'].map({'yes': True, 'no': False})
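A shorter alternative (my addition, not the original answer) would be the comparison itself, which already yields a boolean Series:
# df['priority'] = df['priority'] == 'yes'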
Task: In the 'animal' column, change the 'snake' entries to 'python'.
In [20]:
# Answer
df['animal'] = df['animal'].replace('snake', 'python')
Task: For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).
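No answer cell is present for this task; a minimal sketch (my suggestion, not the original solution) would use pivot_table:
# possible answer: mean age per animal (rows) and per number of visits (columns)
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')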
Slightly trickier: you may need to combine two or more methods to get the right answer
Difficulty: medium
The previous section was a tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.
Task: How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?
In [21]:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
In [22]:
# Answer
len(df) - df.duplicated(keep=False).sum()
# or perhaps more simply...
len(df.drop_duplicates(keep=False))
Out[22]:
Task: A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For each group, find the sum of the three greatest values.
In [23]:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
In [24]:
# Answer
# sum(level=0) is deprecated/removed in newer pandas; group on the first index level instead
df.groupby('grps')['vals'].nlargest(3).groupby(level=0).sum()
Out[24]:
Task: Consider a DataFrame consisting of purely numerical data. Create a list of the row-column indices of the 3 largest values.
In [25]:
df = pd.DataFrame.from_dict({'A': {1: 0, 2: 6, 3: 12, 4: 18, 5: 24},
                             'B': {1: 1, 2: 7, 3: 13, 4: 19, 5: 25},
                             'C': {1: 2, 2: 8, 3: 14, 4: 20, 5: 26},
                             'D': {1: 3, 2: 9, 3: 15, 4: 21, 5: 27},
                             'E': {1: 4, 2: 10, 3: 16, 4: 22, 5: 28},
                             'F': {1: 5, 2: 11, 3: 17, 4: 23, 5: 29}})
df.head()
Out[25]:
In [26]:
# Answer
df.unstack().sort_values()[-3:].index.tolist()
# note: DataFrame.unstack() puts the column label first, so these are (column, row) pairs
# http://stackoverflow.com/questions/14941261/index-and-column-for-the-max-value-in-pandas-dataframe/
# credit: DSM
Out[26]:
Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.
In [27]:
import pandas as pd
In [28]:
df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=["groups", "vals"])
df
Out[28]:
In [29]:
# Answer
def replace_within_group(group):
    mask = group < 0
    # Select those values where it is < 0, and replace
    # them with the mean of the values which are not < 0.
    group[mask] = group[~mask].mean()  # "~" is the "invert" or "complement" operation
    return group

df.groupby(['groups'])['vals'].transform(replace_within_group)
# http://stackoverflow.com/questions/14760757/replacing-values-with-groupby-means/
# credit: unutbu
Out[29]:
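Note that transform returns a Series aligned with df's index, so to actually overwrite the column you would assign the result back (a usage sketch, not in the original):
df['vals'] = df.groupby('groups')['vals'].transform(replace_within_group)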
In [30]:
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
Task: Some values in the FlightNumber column are missing. These numbers are meant to increase by 10 with each row, so the values 10055 and 10075 need to replace the missing ones. Fill in these missing numbers and make the column an integer column (instead of a float column).
In [31]:
# Answer
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
Task: The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to make two new columns with the correct values. Assign the correct column names to a new temporary DataFrame called temp.
In [32]:
# Answer
# temp = df.From_To.str.split('_', expand=True)
# temp.columns = ['From', 'To']
# or
temp = pd.DataFrame()
temp["From"], temp["To"] = zip(*df["From_To"].apply(lambda s: s.split("_")))
Task: Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame. Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London".)
In [33]:
# Answer
# temp['From'] = temp['From'].str.capitalize()
# temp['To'] = temp['To'].str.capitalize()
# or
temp['From'] = temp['From'].apply(lambda s: s.capitalize())
temp['To'] = temp['To'].apply(lambda s: s.capitalize())
Task: Delete the From_To column from df and attach the temporary DataFrame from the previous questions.
In [34]:
# Answer
df = df.drop('From_To', axis=1)
df = df.join(temp)
Task: In the Airline column, you can see that some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways. )' should become 'British Airways'.
In [35]:
# Answer
# a raw string avoids the invalid-escape warning for \s in newer Python versions
df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
Task: In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.
Expand the Series of lists into a DataFrame named delays, rename the columns delay_1, delay_2, etc. and replace the unwanted RecentDelays column in df with delays.
In [36]:
# Answer
# there are several ways to do this, but the following approach is possibly the simplest
delays = df['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)]
df = df.drop('RecentDelays', axis=1).join(delays)
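An alternative for the expansion step (my suggestion, not shown above) is to build the new frame directly from the lists, which avoids the comparatively slow apply(pd.Series):
# delays = pd.DataFrame(df['RecentDelays'].tolist(), index=df.index)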
The DataFrame should look much better now.
In [37]:
fn = "data/Saliva.txt"
df = pd.read_csv(fn, sep='\t')
df
Out[37]:
Task: Rename species to NCBI_TaxID
In [38]:
# Answer:
df = df.rename(columns={"species": "NCBI_TaxID"})
# or
# columns = df.columns.tolist()
# df.columns = [columns[0]] + ["NCBI_TaxID"] + columns[2:]
Task: Delete the columns named frequency and rank
In [39]:
# Answer:
df = df.drop(["rank", "frequency"], axis=1)
df
Out[39]:
Task: Select the top 2 most abundant taxa per sample category. Write a function that expects a DataFrame, a column name, and top-n (an integer indicating how many of the most abundant entries in the given column to return).
In [40]:
# Answer:
def top_n(df, colname, n=3):
    return df.sort_values(colname, ascending=False)[:n]

df.groupby("SampleCategory").apply(top_n, "Abundance", 2)
Out[40]:
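If you prefer the selected rows without the extra SampleCategory index level that apply adds, groupby accepts group_keys=False (a variation, not in the original):
df.groupby("SampleCategory", group_keys=False).apply(top_n, "Abundance", 2)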
Task: Create a column named Rank that ranks the taxa by their abundance within each SampleCategory in descending order (most abundant taxa get the lowest rank).
In [41]:
# Answer:
df["Rank"] = df.groupby("SampleCategory")["Abundance"].rank(ascending=False)
df
Out[41]:
Task: Reshape the DataFrame from "long" to "wide" format so that you can compare the Rank and Abundance values from the three sample categories side by side: the result should have one row per taxon, with the Rank and Abundance values of each sample category placed next to each other.
In [42]:
# Answer:
abu = df.pivot_table(values='Abundance', index='TaxName', columns='SampleCategory')
rank = df.pivot_table(values='Rank', index='TaxName', columns='SampleCategory')
dfp = pd.concat([rank, abu], axis=1).reset_index()
dfp
Out[42]:
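A more compact alternative (my sketch, not the original answer) pivots both value columns in a single call, which yields a hierarchical column index grouped by Rank and Abundance:
dfp = df.pivot_table(values=['Abundance', 'Rank'], index='TaxName', columns='SampleCategory').reset_index()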