In [ ]:

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')





In [ ]:

%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')



# Getting and Knowing your Data



In [ ]:

fn = r"data/drinks.csv"




In [ ]:



Task: See the first 10 entries
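One possible sketch. In the notebook you would read the real file with `df = pd.read_csv(fn)`; a tiny stand-in frame (with assumed drinks.csv column names) keeps the example self-contained:

```python
import pandas as pd

# Stand-in for df = pd.read_csv(fn); column names are assumed
df = pd.DataFrame({
    "country": ["Belarus", "France", "USA"],
    "beer_servings": [142, 127, 249],
    "total_litres_of_pure_alcohol": [14.4, 11.8, 8.7],
})
first_10 = df.head(10)  # first 10 rows (only 3 exist in this stand-in)
```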



In [ ]:



Task: Which country has the highest alcohol consumption (total litres of pure alcohol)?
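A possible approach, shown on a stand-in frame: `idxmax` returns the row label of the maximum, which can then be used to look up the country.

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the drinks DataFrame
df = pd.DataFrame({
    "country": ["Belarus", "France", "USA"],
    "total_litres_of_pure_alcohol": [14.4, 11.8, 8.7],
})
top = df.loc[df["total_litres_of_pure_alcohol"].idxmax(), "country"]
```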



In [ ]:



# Groupby

Task: Which continent drinks the most beer on average?
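One way to sketch this (stand-in data; continent codes are made up): group by continent, average the beer column, then take `idxmax`.

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "SA", "SA"],
    "beer_servings": [100, 200, 240, 260],
})
beer_avg = df.groupby("continent")["beer_servings"].mean()
top_continent = beer_avg.idxmax()  # continent with the highest average
```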



In [ ]:





In [ ]:



Task: Which countries have missing values in the continent column?
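A sketch with a boolean mask on `isnull()`. (In the real file, North American rows may show up here because the string "NA" is parsed as NaN by `read_csv` unless `keep_default_na=False` is passed.)

```python
import pandas as pd

# Stand-in frame with a couple of missing continents
df = pd.DataFrame({
    "country": ["France", "USA", "Mexico"],
    "continent": ["EU", None, None],
})
missing = df.loc[df["continent"].isnull(), "country"]
```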



In [ ]:





In [ ]:



Task: For each continent, print the summary statistics (using df.describe()) for wine consumption.
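`describe()` also works on a grouped column, giving one row of summary statistics per group (stand-in data again):

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "wine_servings": [370, 42, 5, 9],
})
wine_stats = df.groupby("continent")["wine_servings"].describe()
```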



In [ ]:



Task: Print the median alcohol consumption per continent for every column.
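A possible one-liner on a stand-in frame; `numeric_only=True` restricts the median to the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS"],
    "beer_servings": [100, 200, 20],
    "wine_servings": [300, 100, 5],
})
medians = df.groupby("continent").median(numeric_only=True)
```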



In [ ]:



Task: Print the mean, min and max values for spirit consumption.
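Assuming (as in the surrounding tasks) the statistics are wanted per continent, `agg` with a list of function names gives all three at once:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS"],
    "spirit_servings": [100, 300, 16],
})
spirit = df.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
```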



In [ ]:



Task: GroupBy Continent and create a Boxplot. (Hint: using e.g. figsize=(12, 9), rot=90 might help with legibility.)
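One way to sketch the boxplot with `DataFrame.boxplot(by=...)` (stand-in data; the Agg backend is only so the sketch also runs outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "beer_servings": [100, 200, 20, 40],
})
ax = df.boxplot(column="beer_servings", by="continent", figsize=(12, 9), rot=90)
```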



In [ ]:



# Concatenate, Merge & Join

Task: Import the datasets cars1 and cars2. Assign each to a variable called cars1 and cars2.
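In the notebook this is presumably `cars1 = pd.read_csv("data/cars1.csv")` and the same for cars2 (paths assumed); `io.StringIO` stand-ins keep the sketch runnable here. Note the trailing commas in the first header, which mimic the blank columns of the real file:

```python
import io
import pandas as pd

# Stand-ins for pd.read_csv("data/cars1.csv") and ".../cars2.csv"
cars1 = pd.read_csv(io.StringIO("mpg,car,,\n18.0,chevrolet chevelle,,\n15.0,buick skylark,,\n"))
cars2 = pd.read_csv(io.StringIO("mpg,car\n33.0,datsun 710\n"))
```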



In [ ]:



Task: It seems our first dataset has some unnamed blank columns, fix cars1.
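Blank header fields come in as "Unnamed: N" columns, so one fix is to drop every column whose name starts with "Unnamed" (sketched on a small stand-in):

```python
import io
import pandas as pd

# cars1-like frame with trailing blank columns, as the real CSV produces
cars1 = pd.read_csv(io.StringIO("mpg,car,,\n18.0,chevrolet chevelle,,\n"))
# Keep only the properly named columns
cars1 = cars1.loc[:, ~cars1.columns.str.startswith("Unnamed")]
```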



In [ ]:



Task: Join cars1 and cars2 into a single DataFrame called cars
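Since both frames share the same columns, stacking the rows with `pd.concat` is one option:

```python
import pandas as pd

cars1 = pd.DataFrame({"mpg": [18.0], "car": ["chevrolet chevelle"]})
cars2 = pd.DataFrame({"mpg": [33.0], "car": ["datsun 710"]})
# Stack the rows; ignore_index renumbers them 0..n-1
cars = pd.concat([cars1, cars2], ignore_index=True)
```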



In [ ]:



### Apply (interspersed)

Task: Create a function that returns the first word of the string in the "car" column (the manufacturer name). Use the "apply" method to create a new column in the DataFrame.
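A sketch on a stand-in frame: the function splits on whitespace and takes the first token, and `apply` maps it over the column.

```python
import pandas as pd

cars = pd.DataFrame({"car": ["chevrolet chevelle malibu", "datsun 710"]})

def first_word(s):
    # Manufacturer is the first whitespace-separated token
    return s.split()[0]

cars["manufacturer"] = cars["car"].apply(first_word)
```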



In [ ]:



Consider the following DataFrames for the next exercises



In [ ]:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])



Task: Concatenate the three DataFrames along the rows.
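A possible solution; abbreviated two-row versions of df1..df3 stand in for the frames defined above so the sketch is self-contained:

```python
import pandas as pd

# Abbreviated versions of df1..df3 from the cell above
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[4, 5])
df3 = pd.DataFrame({"A": ["A8", "A9"], "B": ["B8", "B9"]}, index=[8, 9])
result = pd.concat([df1, df2, df3])  # axis=0 (rows) is the default
```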



In [ ]:



Task: How many missing values (NaNs) are produced if you concatenate along the other axis (appending the columns)?
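Along `axis=1` the disjoint row indices do not align, so NaNs fill the gaps; counting them is a double `sum` (abbreviated stand-ins again):

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"]}, index=[4, 5])
wide = pd.concat([df1, df2], axis=1)   # columns appended, rows unioned
n_missing = wide.isnull().sum().sum()  # NaNs per column, then total
```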



In [ ]:



Let's consider another data set to do some more Merge, Join & Concatenate exercises.



In [ ]:

raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8', '9', '10'],
    'first_name': ['Alice', 'Ayoung', 'Bran', 'Bryce', 'Betty', 'Jane', np.nan],
    'last_name': ['Aoni', 'Atiches', 'Balwner', 'Brice', 'Btisan', np.nan, 'Doe']}

raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

data1 = pd.DataFrame(raw_data_1, columns=['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns=['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns=['subject_id', 'test_id'])



Task: Join the two DataFrames, data1 and data2, along rows and assign the result to all_data. Make sure that the row index is unique.
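A sketch on abbreviated versions of data1 and data2; `ignore_index=True` renumbers the rows so the resulting index is unique:

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
data2 = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Alice", "Ayoung"]})
all_data = pd.concat([data1, data2], ignore_index=True)
```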



In [ ]:



Task: Join the two DataFrames, data1 and data2, along columns and assign the result to all_data_col.
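The same `pd.concat`, but with `axis=1` (abbreviated stand-ins; the rows align on the shared 0..n RangeIndex):

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
data2 = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Alice", "Ayoung"]})
all_data_col = pd.concat([data1, data2], axis=1)
```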



In [ ]:



Task: Merge all_data and data3 on the subject_id column.
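With `pd.merge` the join key is given via `on=`; the default join type is inner (abbreviated stand-ins):

```python
import pandas as pd

all_data = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
data3 = pd.DataFrame({"subject_id": ["1", "4", "7"], "test_id": [51, 61, 14]})
merged = pd.merge(all_data, data3, on="subject_id")  # inner join by default
```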



In [ ]:



Task: How many test_ids have missing values in the first or last name column?
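One way to count: build a boolean mask over the two name columns with `isnull()` and `|`, then sum it (stand-in data):

```python
import numpy as np
import pandas as pd

merged = pd.DataFrame({
    "test_id": [51, 15, 1],
    "first_name": ["Alex", np.nan, "Jane"],
    "last_name": ["Anderson", "Doe", np.nan],
})
mask = merged["first_name"].isnull() | merged["last_name"].isnull()
n_missing_names = int(mask.sum())  # True counts as 1
```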



In [ ]:



Task: Merge only the data that has the same 'subject_id' in both data1 and data2.
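An inner join keeps only the subject_ids present in both frames (abbreviated stand-ins; note the overlapping name columns get `_x`/`_y` suffixes):

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
data2 = pd.DataFrame({"subject_id": ["4", "6"], "first_name": ["Alice", "Bran"]})
common = pd.merge(data1, data2, on="subject_id", how="inner")
```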



In [ ]:



# Transform

The transform method returns an object that is indexed the same (same size) as the one being grouped.

Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.
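A sketch: because `transform` returns a Series aligned with the original frame, it can be combined element-wise with `where`. (Here the mean is taken over all values in each group; a variant computes it over the non-negative values only.)

```python
import pandas as pd

df = pd.DataFrame({"groups": ["a", "a", "b", "b"], "vals": [1, -1, 10, -2]})
# Per-row group mean, aligned with df
group_mean = df.groupby("groups")["vals"].transform("mean")
# Keep non-negative values, replace negatives with the group mean
df["vals"] = df["vals"].where(df["vals"] >= 0, group_mean)
```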



In [ ]:




In [ ]:



Task: Use groupby in conjunction with transform across multiple columns: We want to group by one to n columns and apply a function on these groups across two columns.

1. Calculate the sum of a and b and assign it to a column named e.
2. Group by 'c' and 'd', and calculate the sum of e.


In [ ]:

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})
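The two steps can be sketched as follows (the frame is repeated so the block is self-contained; the new column name e_sum is an arbitrary choice):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [1, 2, 3, 4, 5, 6],
                   "c": ["q", "q", "q", "q", "w", "w"],
                   "d": ["z", "z", "z", "o", "o", "o"]})
df["e"] = df["a"] + df["b"]                                  # step 1
df["e_sum"] = df.groupby(["c", "d"])["e"].transform("sum")   # step 2
```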




In [ ]:



Task: Normalize (standardize) the data by calculating the z-score. Group the data by year and calculate the z-score per group. z = (value - mean) / standard_deviation

$$z=\frac{x-\mu}{\sigma}$$


In [ ]:

index = pd.date_range('10/1/1999', periods=1100)
ser = pd.Series(np.random.normal(0.5, 2, 1100), index=index)
ser = ser.rolling(window=100, min_periods=100).mean().dropna()




In [ ]:

key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ser.groupby(key).transform(zscore)



Task: We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check. Calculate the mean and standard deviation within each group.
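The check can be sketched by grouping the transformed series by year again and aggregating (a shorter synthetic series stands in for the rolling-mean series above):

```python
import numpy as np
import pandas as pd

index = pd.date_range("10/1/1999", periods=400)
ser = pd.Series(np.random.normal(0.5, 2, 400), index=index)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ser.groupby(lambda d: d.year).transform(zscore)
# Per-group mean should be ~0 and std ~1 after standardization
check = transformed.groupby(lambda d: d.year).agg(["mean", "std"])
```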



In [ ]:



Task: Visually compare the original and transformed data sets.
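One option is to put both series in a DataFrame and plot them as subplots (synthetic stand-in data; the Agg backend is only needed outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import numpy as np
import pandas as pd

index = pd.date_range("10/1/1999", periods=400)
ser = pd.Series(np.random.normal(0.5, 2, 400), index=index)
transformed = ser.groupby(lambda d: d.year).transform(
    lambda x: (x - x.mean()) / x.std())
compare = pd.DataFrame({"original": ser, "transformed": transformed})
axes = compare.plot(subplots=True, figsize=(12, 6))
```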



In [ ]:



# Pivot

Task: Let's reshape this small example DataFrame of ICD10 codes. Each person has different code associations; only positive associations are listed. Transform (reshape) the DataFrame into a wide format (one column per code) that lists positive and negative (missing) associations as Booleans.



In [ ]:

df = pd.DataFrame({"Person": ["a", "a", "a", "b", "c", "c"], "Code": ["D99", "E32", "A41", "D99", "D99", "A41"]}, columns=["Person", "Code"])
df
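One way to sketch the reshape: `pd.crosstab` counts Person/Code pairs, and casting the counts to bool yields True for listed (positive) and False for missing (negative) associations. (An alternative is `df.assign(x=True).pivot(index="Person", columns="Code", values="x").fillna(False)`.)

```python
import pandas as pd

df = pd.DataFrame({"Person": ["a", "a", "a", "b", "c", "c"],
                   "Code": ["D99", "E32", "A41", "D99", "D99", "A41"]})
wide = pd.crosstab(df["Person"], df["Code"]).astype(bool)
```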




In [ ]:



# Combine DataFrames

Task: In the data/microbiome subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each. Write code that imports each of the data spreadsheets and combines them into a single DataFrame, adding the identifying information from the metadata spreadsheet as columns in the combined DataFrame.
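The combine pattern can be sketched as follows. In-memory frames stand in for the spreadsheets (which the notebook would read with `pd.read_excel`), and the key and metadata column names (MID barcodes, BARCODE, GROUP) are assumptions about the files' layout:

```python
import pandas as pd

# Stand-ins for the 9 spreadsheets, keyed by a hypothetical barcode
sheets = {
    "MID1": pd.DataFrame({"Taxon": ["Firmicutes"], "Count": [557]}),
    "MID2": pd.DataFrame({"Taxon": ["Firmicutes"], "Count": [433]}),
}
# Stand-in for the metadata sheet describing each file
metadata = pd.DataFrame({"BARCODE": ["MID1", "MID2"],
                         "GROUP": ["EXTRACTION CONTROL", "NEC 1"]})

# Tag each frame with its identifier, stack them, then attach the metadata
frames = [frame.assign(BARCODE=barcode) for barcode, frame in sheets.items()]
combined = pd.concat(frames, ignore_index=True).merge(metadata, on="BARCODE")
```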



In [ ]:



# GroupBy Titanic data

Load the dataset in titanic.xls. It contains data on all the passengers that travelled on the Titanic.

1. Use the groupby method to calculate the proportion of passengers that survived by sex.
2. Calculate the same proportion, but by class and sex.
3. Create age categories: children (under 14 years), adolescents (14-20), adults (21-64), and seniors (65+), and calculate survival proportions by age category, class and sex. Additionally, count the number of passengers per group.
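The three steps can be sketched as below. A tiny stand-in frame replaces titanic.xls (read with `pd.read_excel` in the notebook), and the column names survived/sex/pclass/age assume the common Titanic schema:

```python
import pandas as pd

# Stand-in for titanic.xls; column names are assumed
titanic = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "pclass": [1, 3, 1, 3],
    "survived": [1, 1, 0, 1],
    "age": [29, 2, 35, 70],
})
by_sex = titanic.groupby("sex")["survived"].mean()                    # 1.
by_class_sex = titanic.groupby(["pclass", "sex"])["survived"].mean()  # 2.
titanic["age_cat"] = pd.cut(titanic["age"], [0, 14, 20, 64, 120],     # 3.
                            labels=["child", "adolescent", "adult", "senior"])
by_age = (titanic.groupby(["age_cat", "pclass", "sex"], observed=True)
          ["survived"].agg(["mean", "count"]))
```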


In [ ]:



Let's plot the number of survivors grouped by sex and passenger class.
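A possible sketch: sum the survived flags per (sex, class) group and bar-plot the result (stand-in data; Agg backend only needed outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 1, 3, 3],
    "survived": [1, 0, 1, 1],
})
survivors = titanic.groupby(["sex", "pclass"])["survived"].sum()
ax = survivors.plot(kind="bar")
```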



In [ ]:



Task: Let's also look at the deaths (and not only the survivors) within the groups and create a stacked Barplot of survivors vs. deaths grouped by sex and passenger-class (as before).

1. Convert the "survived" column to boolean values
2. Compute the cross tabulation (a.k.a. contingency table) of passenger-class and sex vs. survived. Assign the result to the variable name "death_counts" (hint: check out pd.crosstab()).
3. Create a stacked barplot of the computed death_counts
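The three steps can be sketched on the same stand-in frame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 1, 3, 3],
    "survived": [1, 0, 1, 1],
})
titanic["survived"] = titanic["survived"].astype(bool)            # step 1
death_counts = pd.crosstab([titanic["pclass"], titanic["sex"]],   # step 2
                           titanic["survived"])
ax = death_counts.plot(kind="bar", stacked=True)                  # step 3
```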


In [ ]:



Task: Another way of comparing the groups is to look at the survival rate, by adjusting for the number of people in each group. Create a stacked, horizontal Barplot of the adjusted death counts

1. Sum the death_counts per passenger-class and sex, and convert to data type float (for Python 2.x division purposes).
2. Compute the adjusted survival rate by dividing the death_counts by the result from 1.
3. Plot a stacked, horizontal Barplot from the result of step 2.
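A sketch of the adjustment, starting from a small hypothetical death_counts table (rows: class, columns: survived False/True):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

# Hypothetical crosstab result standing in for death_counts
death_counts = pd.DataFrame({False: [1, 2], True: [3, 4]},
                            index=pd.Index(["1st", "3rd"], name="pclass"))
totals = death_counts.sum(axis=1).astype(float)   # step 1: group sizes
rates = death_counts.div(totals, axis=0)          # step 2: rows now sum to 1
ax = rates.plot(kind="barh", stacked=True)        # step 3
```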


In [ ]: