In [ ]:

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')





In [ ]:

%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')



# Getting and Knowing your Data



In [ ]:

fn = r"data/drinks.csv"




In [ ]:



Task: See the first 10 entries
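One possible sketch. In the notebook you would read the real file with `df = pd.read_csv(fn)`; a tiny stand-in frame (with assumed drinks.csv column names) keeps the example self-contained:

```python
import pandas as pd

# Stand-in for df = pd.read_csv(fn); column names are assumed
df = pd.DataFrame({
    "country": ["Belarus", "France", "USA"],
    "beer_servings": [142, 127, 249],
    "total_litres_of_pure_alcohol": [14.4, 11.8, 8.7],
})
first_10 = df.head(10)  # first 10 rows (only 3 exist in this stand-in)
```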



In [ ]:



Task: Which country has the highest alcohol consumption (total litres of pure alcohol)?
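A possible approach, shown on a stand-in frame: `idxmax` returns the row label of the maximum, which can then be used to look up the country.

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the drinks DataFrame
df = pd.DataFrame({
    "country": ["Belarus", "France", "USA"],
    "total_litres_of_pure_alcohol": [14.4, 11.8, 8.7],
})
top = df.loc[df["total_litres_of_pure_alcohol"].idxmax(), "country"]
```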



In [ ]:



# Groupby

Task: Which continent drinks the most beer on average?
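One way to sketch this (stand-in data; continent codes are made up): group by continent, average the beer column, then take `idxmax`.

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "SA", "SA"],
    "beer_servings": [100, 200, 240, 260],
})
beer_avg = df.groupby("continent")["beer_servings"].mean()
top_continent = beer_avg.idxmax()  # continent with the highest average
```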



In [ ]:





In [ ]:



Task: Which countries have missing values in the continent column?
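A sketch with a boolean mask on `isnull()`. (In the real file, North American rows may show up here because the string "NA" is parsed as NaN by `read_csv` unless `keep_default_na=False` is passed.)

```python
import pandas as pd

# Stand-in frame with a couple of missing continents
df = pd.DataFrame({
    "country": ["France", "USA", "Mexico"],
    "continent": ["EU", None, None],
})
missing = df.loc[df["continent"].isnull(), "country"]
```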



In [ ]:





In [ ]:



Task: For each continent, print the summary statistics (using df.describe()) for wine consumption.
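`describe()` also works on a grouped column, giving one row of summary statistics per group (stand-in data again):

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "wine_servings": [370, 42, 5, 9],
})
wine_stats = df.groupby("continent")["wine_servings"].describe()
```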



In [ ]:



Task: Print the median alcohol consumption per continent for every column.
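A possible one-liner on a stand-in frame; `numeric_only=True` restricts the median to the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS"],
    "beer_servings": [100, 200, 20],
    "wine_servings": [300, 100, 5],
})
medians = df.groupby("continent").median(numeric_only=True)
```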



In [ ]:



Task: Print the mean, min and max values for spirit consumption.
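Assuming (as in the surrounding tasks) the statistics are wanted per continent, `agg` with a list of function names gives all three at once:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS"],
    "spirit_servings": [100, 300, 16],
})
spirit = df.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
```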



In [ ]:



Task: GroupBy Continent and create a Boxplot. (Hint: using e.g. figsize=(12, 9), rot=90 might help with legibility.)
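One way to sketch the boxplot with `DataFrame.boxplot(by=...)` (stand-in data; the Agg backend is only so the sketch also runs outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "beer_servings": [100, 200, 20, 40],
})
ax = df.boxplot(column="beer_servings", by="continent", figsize=(12, 9), rot=90)
```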



In [ ]:



# Concatenate, Merge & Join

Task: Import the datasets cars1 and cars2. Assign each to a variable called cars1 and cars2.
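In the notebook this is presumably `cars1 = pd.read_csv("data/cars1.csv")` and the same for cars2 (paths assumed); `io.StringIO` stand-ins keep the sketch runnable here. Note the trailing commas in the first header, which mimic the blank columns of the real file:

```python
import io
import pandas as pd

# Stand-ins for pd.read_csv("data/cars1.csv") and ".../cars2.csv"
cars1 = pd.read_csv(io.StringIO("mpg,car,,\n18.0,chevrolet chevelle,,\n15.0,buick skylark,,\n"))
cars2 = pd.read_csv(io.StringIO("mpg,car\n33.0,datsun 710\n"))
```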



In [ ]:



Task: It seems our first dataset has some unnamed blank columns, fix cars1.
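Blank header fields come in as "Unnamed: N" columns, so one fix is to drop every column whose name starts with "Unnamed" (sketched on a small stand-in):

```python
import io
import pandas as pd

# cars1-like frame with trailing blank columns, as the real CSV produces
cars1 = pd.read_csv(io.StringIO("mpg,car,,\n18.0,chevrolet chevelle,,\n"))
# Keep only the properly named columns
cars1 = cars1.loc[:, ~cars1.columns.str.startswith("Unnamed")]
```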



In [ ]:



Task: Join cars1 and cars2 into a single DataFrame called cars
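Since both frames share the same columns, stacking the rows with `pd.concat` is one option:

```python
import pandas as pd

cars1 = pd.DataFrame({"mpg": [18.0], "car": ["chevrolet chevelle"]})
cars2 = pd.DataFrame({"mpg": [33.0], "car": ["datsun 710"]})
# Stack the rows; ignore_index renumbers them 0..n-1
cars = pd.concat([cars1, cars2], ignore_index=True)
```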



In [ ]:



### Apply (interspersed)

Task: Create a function that returns the first word of the string in the "car" column (the manufacturer name). Use the "apply" method to create a new column in the DataFrame.
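A sketch on a stand-in frame: the function splits on whitespace and takes the first token, and `apply` maps it over the column.

```python
import pandas as pd

cars = pd.DataFrame({"car": ["chevrolet chevelle malibu", "datsun 710"]})

def first_word(s):
    # Manufacturer is the first whitespace-separated token
    return s.split()[0]

cars["manufacturer"] = cars["car"].apply(first_word)
```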



In [ ]:



Consider the following DataFrames for the next exercises



In [ ]:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])



Task: Concatenate the three DataFrames along the rows.
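A possible solution; abbreviated two-row versions of df1..df3 stand in for the frames defined above so the sketch is self-contained:

```python
import pandas as pd

# Abbreviated versions of df1..df3 from the cell above
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[4, 5])
df3 = pd.DataFrame({"A": ["A8", "A9"], "B": ["B8", "B9"]}, index=[8, 9])
result = pd.concat([df1, df2, df3])  # axis=0 (rows) is the default
```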



In [ ]:



Task: How many missing values (NaNs) are produced if you concatenate along the other axis (appending the columns)?
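Along `axis=1` the disjoint row indices do not align, so NaNs fill the gaps; counting them is a double `sum` (abbreviated stand-ins again):

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"]}, index=[4, 5])
wide = pd.concat([df1, df2], axis=1)   # columns appended, rows unioned
n_missing = wide.isnull().sum().sum()  # NaNs per column, then total
```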



In [ ]:



Let's consider another data set to do some more Merge, Join & Concatenate exercises.



In [ ]:

raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8', '9', '10'],
    'first_name': ['Alice', 'Ayoung', 'Bran', 'Bryce', 'Betty', 'Jane', np.nan],
    'last_name': ['Aoni', 'Atiches', 'Balwner', 'Brice', 'Btisan', np.nan, 'Doe']}

raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

data1 = pd.DataFrame(raw_data_1, columns=['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns=['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns=['subject_id', 'test_id'])



Task: Join the two DataFrames, data1 and data2, along rows and assign the result to all_data. Make sure that the row index is unique.
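A sketch on abbreviated versions of data1 and data2; `ignore_index=True` renumbers the rows so the resulting index is unique:

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
data2 = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Alice", "Ayoung"]})
all_data = pd.concat([data1, data2], ignore_index=True)
```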



In [ ]:



Task: Join the two DataFrames, data1 and data2, along columns and assign the result to all_data_col.
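The same `pd.concat`, but with `axis=1` (abbreviated stand-ins; the rows align on the shared 0..n RangeIndex):

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
data2 = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Alice", "Ayoung"]})
all_data_col = pd.concat([data1, data2], axis=1)
```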



In [ ]:



Task: Merge all_data and data3 on the subject_id column.
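With `pd.merge` the join key is given via `on=`; the default join type is inner (abbreviated stand-ins):

```python
import pandas as pd

all_data = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
data3 = pd.DataFrame({"subject_id": ["1", "4", "7"], "test_id": [51, 61, 14]})
merged = pd.merge(all_data, data3, on="subject_id")  # inner join by default
```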



In [ ]:



Task: How many test_ids have missing values in the first or last name column?
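One way to count: build a boolean mask over the two name columns with `isnull()` and `|`, then sum it (stand-in data):

```python
import numpy as np
import pandas as pd

merged = pd.DataFrame({
    "test_id": [51, 15, 1],
    "first_name": ["Alex", np.nan, "Jane"],
    "last_name": ["Anderson", "Doe", np.nan],
})
mask = merged["first_name"].isnull() | merged["last_name"].isnull()
n_missing_names = int(mask.sum())  # True counts as 1
```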



In [ ]:



Task: Merge only the data that has the same 'subject_id' in both data1 and data2.
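An inner join keeps only the subject_ids present in both frames (abbreviated stand-ins; note the overlapping name columns get `_x`/`_y` suffixes):

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
data2 = pd.DataFrame({"subject_id": ["4", "6"], "first_name": ["Alice", "Bran"]})
common = pd.merge(data1, data2, on="subject_id", how="inner")
```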



In [ ]:



# Transform

The transform method returns an object that is indexed the same (same size) as the one being grouped.

Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.
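A sketch: because `transform` returns a Series aligned with the original frame, it can be combined element-wise with `where`. (Here the mean is taken over all values in each group; a variant computes it over the non-negative values only.)

```python
import pandas as pd

df = pd.DataFrame({"groups": ["a", "a", "b", "b"], "vals": [1, -1, 10, -2]})
# Per-row group mean, aligned with df
group_mean = df.groupby("groups")["vals"].transform("mean")
# Keep non-negative values, replace negatives with the group mean
df["vals"] = df["vals"].where(df["vals"] >= 0, group_mean)
```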



In [ ]:




In [ ]:



Task: Use groupby in conjunction with transform across multiple columns: We want to group by one to n columns and apply a function on these groups across two columns.

1. Calculate the sum of a and b and assign it to a column named e.
2. Group by 'c' and 'd', and calculate the sum of e.


In [ ]:

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})
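The two steps can be sketched as follows (the frame is repeated so the block is self-contained; the new column name e_sum is an arbitrary choice):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [1, 2, 3, 4, 5, 6],
                   "c": ["q", "q", "q", "q", "w", "w"],
                   "d": ["z", "z", "z", "o", "o", "o"]})
df["e"] = df["a"] + df["b"]                                  # step 1
df["e_sum"] = df.groupby(["c", "d"])["e"].transform("sum")   # step 2
```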




In [ ]:



Task: Normalize (standardize) the data by calculating the z-score. Group the data by year and calculate the z-score per group. z = (value - mean) / standard_deviation

$$z=\frac{x-\mu}{\sigma}$$


In [ ]:

index = pd.date_range('10/1/1999', periods=1100)
ser = pd.Series(np.random.normal(0.5, 2, 1100), index=index)
ser = ser.rolling(window=100, min_periods=100).mean().dropna()




In [ ]:

key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ser.groupby(key).transform(zscore)



Task: We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check. Calculate the mean and standard deviation within each group.
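The check can be sketched by grouping the transformed series by year again and aggregating (a shorter synthetic series stands in for the rolling-mean series above):

```python
import numpy as np
import pandas as pd

index = pd.date_range("10/1/1999", periods=400)
ser = pd.Series(np.random.normal(0.5, 2, 400), index=index)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ser.groupby(lambda d: d.year).transform(zscore)
# Per-group mean should be ~0 and std ~1 after standardization
check = transformed.groupby(lambda d: d.year).agg(["mean", "std"])
```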



In [ ]:



Task: Visually compare the original and transformed data sets.
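One option is to put both series in a DataFrame and plot them as subplots (synthetic stand-in data; the Agg backend is only needed outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import numpy as np
import pandas as pd

index = pd.date_range("10/1/1999", periods=400)
ser = pd.Series(np.random.normal(0.5, 2, 400), index=index)
transformed = ser.groupby(lambda d: d.year).transform(
    lambda x: (x - x.mean()) / x.std())
compare = pd.DataFrame({"original": ser, "transformed": transformed})
axes = compare.plot(subplots=True, figsize=(12, 6))
```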



In [ ]:



# Pivot

Task: Let's reshape this small example DataFrame of ICD10 codes. Each person has different code associations; only positive associations are listed. Transform (reshape) the DataFrame into a wide format (one column per code) that lists positive and negative (missing) associations as Booleans.



In [ ]:

df = pd.DataFrame({"Person": ["a", "a", "a", "b", "c", "c"], "Code": ["D99", "E32", "A41", "D99", "D99", "A41"]}, columns=["Person", "Code"])
df
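One way to sketch the reshape: `pd.crosstab` counts Person/Code pairs, and casting the counts to bool yields True for listed (positive) and False for missing (negative) associations. (An alternative is `df.assign(x=True).pivot(index="Person", columns="Code", values="x").fillna(False)`.)

```python
import pandas as pd

df = pd.DataFrame({"Person": ["a", "a", "a", "b", "c", "c"],
                   "Code": ["D99", "E32", "A41", "D99", "D99", "A41"]})
wide = pd.crosstab(df["Person"], df["Code"]).astype(bool)
```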




In [ ]:



# Combine DataFrames

Task: In the data/microbiome subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each. Write code that imports each of the data spreadsheets and combines them into a single DataFrame, adding the identifying information from the metadata spreadsheet as columns in the combined DataFrame.
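The combine pattern can be sketched as follows. In-memory frames stand in for the spreadsheets (which the notebook would read with `pd.read_excel`), and the key and metadata column names (MID barcodes, BARCODE, GROUP) are assumptions about the files' layout:

```python
import pandas as pd

# Stand-ins for the 9 spreadsheets, keyed by a hypothetical barcode
sheets = {
    "MID1": pd.DataFrame({"Taxon": ["Firmicutes"], "Count": [557]}),
    "MID2": pd.DataFrame({"Taxon": ["Firmicutes"], "Count": [433]}),
}
# Stand-in for the metadata sheet describing each file
metadata = pd.DataFrame({"BARCODE": ["MID1", "MID2"],
                         "GROUP": ["EXTRACTION CONTROL", "NEC 1"]})

# Tag each frame with its identifier, stack them, then attach the metadata
frames = [frame.assign(BARCODE=barcode) for barcode, frame in sheets.items()]
combined = pd.concat(frames, ignore_index=True).merge(metadata, on="BARCODE")
```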



In [ ]:



# GroupBy Titanic data

Load the dataset in titanic.xls. It contains data on all the passengers that travelled on the Titanic.

1. Use the groupby method to calculate the proportion of passengers that survived by sex.
2. Calculate the same proportion, but by class and sex.
3. Create age categories: children (under 14 years), adolescents (14-20), adults (21-64), and seniors (65+), and calculate survival proportions by age category, class and sex. Additionally, count the number of passengers per group.
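The three steps can be sketched as below. A tiny stand-in frame replaces titanic.xls (read with `pd.read_excel` in the notebook), and the column names survived/sex/pclass/age assume the common Titanic schema:

```python
import pandas as pd

# Stand-in for titanic.xls; column names are assumed
titanic = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "pclass": [1, 3, 1, 3],
    "survived": [1, 1, 0, 1],
    "age": [29, 2, 35, 70],
})
by_sex = titanic.groupby("sex")["survived"].mean()                    # 1.
by_class_sex = titanic.groupby(["pclass", "sex"])["survived"].mean()  # 2.
titanic["age_cat"] = pd.cut(titanic["age"], [0, 14, 20, 64, 120],     # 3.
                            labels=["child", "adolescent", "adult", "senior"])
by_age = (titanic.groupby(["age_cat", "pclass", "sex"], observed=True)
          ["survived"].agg(["mean", "count"]))
```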


In [ ]:



Let's plot the number of survivors grouped by sex and passenger class.
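A possible sketch: sum the survived flags per (sex, class) group and bar-plot the result (stand-in data; Agg backend only needed outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 1, 3, 3],
    "survived": [1, 0, 1, 1],
})
survivors = titanic.groupby(["sex", "pclass"])["survived"].sum()
ax = survivors.plot(kind="bar")
```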



In [ ]:



Task: Let's also look at the deaths (and not only the survivors) within the groups and create a stacked Barplot of survivors vs. deaths grouped by sex and passenger-class (as before).

1. Convert the "survived" column to boolean values
2. Compute the cross tabulation (a.k.a. contingency table) of passenger-class and sex vs. survived. Assign the result to the variable name "death_counts" (hint: check out pd.crosstab()).
3. Create a stacked barplot of the computed death_counts
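The three steps can be sketched on the same stand-in frame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 1, 3, 3],
    "survived": [1, 0, 1, 1],
})
titanic["survived"] = titanic["survived"].astype(bool)            # step 1
death_counts = pd.crosstab([titanic["pclass"], titanic["sex"]],   # step 2
                           titanic["survived"])
ax = death_counts.plot(kind="bar", stacked=True)                  # step 3
```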


In [ ]:



Task: Another way of comparing the groups is to look at the survival rate, by adjusting for the number of people in each group. Create a stacked, horizontal Barplot of the adjusted death counts

1. Sum the death_counts per passenger-class and sex, and convert to data type float (for Python 2.x division purposes).
2. Compute the adjusted survival rate by dividing the death_counts by the result from 1.
3. Plot a stacked, horizontal Barplot from the result of step 2.
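A sketch of the adjustment, starting from a small hypothetical death_counts table (rows: class, columns: survived False/True):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

# Hypothetical crosstab result standing in for death_counts
death_counts = pd.DataFrame({False: [1, 2], True: [3, 4]},
                            index=pd.Index(["1st", "3rd"], name="pclass"))
totals = death_counts.sum(axis=1).astype(float)   # step 1: group sizes
rates = death_counts.div(totals, axis=0)          # step 2: rows now sum to 1
ax = rates.plot(kind="barh", stacked=True)        # step 3
```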


In [ ]: