In [ ]:

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')





In [ ]:

%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')



# Getting and Knowing your Data



In [ ]:

fn = r"data/drinks.csv"




In [ ]:



Task: See the first 10 entries
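One possible sketch. In the notebook you would read the real file with `df = pd.read_csv(fn)`; a tiny stand-in frame (with assumed drinks.csv column names) keeps the example self-contained:

```python
import pandas as pd

# Stand-in for df = pd.read_csv(fn); column names are assumed
df = pd.DataFrame({
    "country": ["Belarus", "France", "USA"],
    "beer_servings": [142, 127, 249],
    "total_litres_of_pure_alcohol": [14.4, 11.8, 8.7],
})
first_10 = df.head(10)  # first 10 rows (only 3 exist in this stand-in)
```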



In [ ]:



Task: Which country has the highest alcohol consumption (total litres of pure alcohol)?
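A possible approach, shown on a stand-in frame: `idxmax` returns the row label of the maximum, which can then be used to look up the country.

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the drinks DataFrame
df = pd.DataFrame({
    "country": ["Belarus", "France", "USA"],
    "total_litres_of_pure_alcohol": [14.4, 11.8, 8.7],
})
top = df.loc[df["total_litres_of_pure_alcohol"].idxmax(), "country"]
```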



In [ ]:



# Groupby

Task: Which continent drinks the most beer on average?
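One way to sketch this (stand-in data; continent codes are made up): group by continent, average the beer column, then take `idxmax`.

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "SA", "SA"],
    "beer_servings": [100, 200, 240, 260],
})
beer_avg = df.groupby("continent")["beer_servings"].mean()
top_continent = beer_avg.idxmax()  # continent with the highest average
```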



In [ ]:





In [ ]:



Task: Which countries have missing values in the continent column?
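A sketch with a boolean mask on `isnull()`. (In the real file, North American rows may show up here because the string "NA" is parsed as NaN by `read_csv` unless `keep_default_na=False` is passed.)

```python
import pandas as pd

# Stand-in frame with a couple of missing continents
df = pd.DataFrame({
    "country": ["France", "USA", "Mexico"],
    "continent": ["EU", None, None],
})
missing = df.loc[df["continent"].isnull(), "country"]
```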



In [ ]:





In [ ]:



Task: For each continent, print the summary statistics (using df.describe()) for wine consumption.
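`describe()` also works on a grouped column, giving one row of summary statistics per group (stand-in data again):

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "wine_servings": [370, 42, 5, 9],
})
wine_stats = df.groupby("continent")["wine_servings"].describe()
```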



In [ ]:



Task: Print the median alcohol consumption per continent for every column.
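A possible one-liner on a stand-in frame; `numeric_only=True` restricts the median to the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS"],
    "beer_servings": [100, 200, 20],
    "wine_servings": [300, 100, 5],
})
medians = df.groupby("continent").median(numeric_only=True)
```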



In [ ]:



Task: Print the mean, min and max values for spirit consumption.
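Assuming (as in the surrounding tasks) the statistics are wanted per continent, `agg` with a list of function names gives all three at once:

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS"],
    "spirit_servings": [100, 300, 16],
})
spirit = df.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
```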



In [ ]:



Task: GroupBy Continent and create a Boxplot. (Hint: using e.g. figsize=(12, 9), rot=90 might help with legibility.)
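One way to sketch the boxplot with `DataFrame.boxplot(by=...)` (stand-in data; the Agg backend is only so the sketch also runs outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

df = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "beer_servings": [100, 200, 20, 40],
})
ax = df.boxplot(column="beer_servings", by="continent", figsize=(12, 9), rot=90)
```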



In [ ]:



# Concatenate, Merge & Join

Task: Import the datasets cars1 and cars2. Assign each to a variable called cars1 and cars2.
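In the notebook this is presumably `cars1 = pd.read_csv("data/cars1.csv")` and the same for cars2 (paths assumed); `io.StringIO` stand-ins keep the sketch runnable here. Note the trailing commas in the first header, which mimic the blank columns of the real file:

```python
import io
import pandas as pd

# Stand-ins for pd.read_csv("data/cars1.csv") and ".../cars2.csv"
cars1 = pd.read_csv(io.StringIO("mpg,car,,\n18.0,chevrolet chevelle,,\n15.0,buick skylark,,\n"))
cars2 = pd.read_csv(io.StringIO("mpg,car\n33.0,datsun 710\n"))
```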



In [ ]:



Task: It seems our first dataset has some unnamed blank columns, fix cars1.
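Blank header fields come in as "Unnamed: N" columns, so one fix is to drop every column whose name starts with "Unnamed" (sketched on a small stand-in):

```python
import io
import pandas as pd

# cars1-like frame with trailing blank columns, as the real CSV produces
cars1 = pd.read_csv(io.StringIO("mpg,car,,\n18.0,chevrolet chevelle,,\n"))
# Keep only the properly named columns
cars1 = cars1.loc[:, ~cars1.columns.str.startswith("Unnamed")]
```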



In [ ]:



Task: Join cars1 and cars2 into a single DataFrame called cars
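Since both frames share the same columns, stacking the rows with `pd.concat` is one option:

```python
import pandas as pd

cars1 = pd.DataFrame({"mpg": [18.0], "car": ["chevrolet chevelle"]})
cars2 = pd.DataFrame({"mpg": [33.0], "car": ["datsun 710"]})
# Stack the rows; ignore_index renumbers them 0..n-1
cars = pd.concat([cars1, cars2], ignore_index=True)
```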



In [ ]:



### Apply (interspersed)

Task: Create a function that returns the first word of the string in the "car" column (the manufacturer name). Use the "apply" method to create a new column in the DataFrame.
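A sketch on a stand-in frame: the function splits on whitespace and takes the first token, and `apply` maps it over the column.

```python
import pandas as pd

cars = pd.DataFrame({"car": ["chevrolet chevelle malibu", "datsun 710"]})

def first_word(s):
    # Manufacturer is the first whitespace-separated token
    return s.split()[0]

cars["manufacturer"] = cars["car"].apply(first_word)
```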



In [ ]:



Consider the following DataFrames for the next exercises



In [ ]:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])



Task: Concatenate the three DataFrames along the rows.
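A possible solution; abbreviated two-row versions of df1..df3 stand in for the frames defined above so the sketch is self-contained:

```python
import pandas as pd

# Abbreviated versions of df1..df3 from the cell above
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[4, 5])
df3 = pd.DataFrame({"A": ["A8", "A9"], "B": ["B8", "B9"]}, index=[8, 9])
result = pd.concat([df1, df2, df3])  # axis=0 (rows) is the default
```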



In [ ]:



Task: How many missing values (NaNs) are produced if you concatenate along the other axis (appending the columns)?
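Along `axis=1` the disjoint row indices do not align, so NaNs fill the gaps; counting them is a double `sum` (abbreviated stand-ins again):

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"]}, index=[4, 5])
wide = pd.concat([df1, df2], axis=1)   # columns appended, rows unioned
n_missing = wide.isnull().sum().sum()  # NaNs per column, then total
```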



In [ ]:



Let's consider another data set to do some more Merge, Join & Concatenate exercises.



In [ ]:

raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8', '9', '10'],
    'first_name': ['Alice', 'Ayoung', 'Bran', 'Bryce', 'Betty', 'Jane', np.nan],
    'last_name': ['Aoni', 'Atiches', 'Balwner', 'Brice', 'Btisan', np.nan, 'Doe']}

raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

data1 = pd.DataFrame(raw_data_1, columns=['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns=['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns=['subject_id', 'test_id'])



Task: Join the two DataFrames, data1 and data2, along rows and assign the result to all_data. Make sure that the row index is unique.
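A sketch on abbreviated versions of data1 and data2; `ignore_index=True` renumbers the rows so the resulting index is unique:

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
data2 = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Alice", "Ayoung"]})
all_data = pd.concat([data1, data2], ignore_index=True)
```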



In [ ]:



Task: Join the two DataFrames, data1 and data2, along columns and assign the result to all_data_col.
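The same `pd.concat`, but with `axis=1` (abbreviated stand-ins; the rows align on the shared 0..n RangeIndex):

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
data2 = pd.DataFrame({"subject_id": ["4", "5"], "first_name": ["Alice", "Ayoung"]})
all_data_col = pd.concat([data1, data2], axis=1)
```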



In [ ]:



Task: Merge all_data and data3 on the subject_id column.
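With `pd.merge` the join key is given via `on=`; the default join type is inner (abbreviated stand-ins):

```python
import pandas as pd

all_data = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
data3 = pd.DataFrame({"subject_id": ["1", "4", "7"], "test_id": [51, 61, 14]})
merged = pd.merge(all_data, data3, on="subject_id")  # inner join by default
```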



In [ ]:



Task: How many test_ids have missing values in the first or last name column?
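One way to count: build a boolean mask over the two name columns with `isnull()` and `|`, then sum it (stand-in data):

```python
import numpy as np
import pandas as pd

merged = pd.DataFrame({
    "test_id": [51, 15, 1],
    "first_name": ["Alex", np.nan, "Jane"],
    "last_name": ["Anderson", "Doe", np.nan],
})
mask = merged["first_name"].isnull() | merged["last_name"].isnull()
n_missing_names = int(mask.sum())  # True counts as 1
```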



In [ ]:



Task: Merge only the data that has the same 'subject_id' in both data1 and data2.
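An inner join keeps only the subject_ids present in both frames (abbreviated stand-ins; note the overlapping name columns get `_x`/`_y` suffixes):

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
data2 = pd.DataFrame({"subject_id": ["4", "6"], "first_name": ["Alice", "Bran"]})
common = pd.merge(data1, data2, on="subject_id", how="inner")
```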



In [ ]:



# Transform

The transform method returns an object that is indexed the same (same size) as the one being grouped.

Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.
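A sketch: because `transform` returns a Series aligned with the original frame, it can be combined element-wise with `where`. (Here the mean is taken over all values in each group; a variant computes it over the non-negative values only.)

```python
import pandas as pd

df = pd.DataFrame({"groups": ["a", "a", "b", "b"], "vals": [1, -1, 10, -2]})
# Per-row group mean, aligned with df
group_mean = df.groupby("groups")["vals"].transform("mean")
# Keep non-negative values, replace negatives with the group mean
df["vals"] = df["vals"].where(df["vals"] >= 0, group_mean)
```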



In [ ]:




In [ ]:



Task: Use groupby in conjunction with transform across multiple columns: We want to group by one to n columns and apply a function on these groups across two columns.

1. Calculate the sum of a and b and assign it to a column named e.
2. Group by 'c' and 'd', and calculate the sum of e.


In [ ]:

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})
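The two steps can be sketched as follows (the frame is repeated so the block is self-contained; the new column name e_sum is an arbitrary choice):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [1, 2, 3, 4, 5, 6],
                   "c": ["q", "q", "q", "q", "w", "w"],
                   "d": ["z", "z", "z", "o", "o", "o"]})
df["e"] = df["a"] + df["b"]                                  # step 1
df["e_sum"] = df.groupby(["c", "d"])["e"].transform("sum")   # step 2
```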




In [ ]:



Task: Normalize (standardize) the data by calculating the z-score. Group the data by year and calculate the z-score per group. z = (value - mean) / standard_deviation

$$z=\frac{x-\mu}{\sigma}$$


In [ ]:

index = pd.date_range('10/1/1999', periods=1100)
ser = pd.Series(np.random.normal(0.5, 2, 1100), index=index)
ser = ser.rolling(window=100, min_periods=100).mean().dropna()




In [ ]:

key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ser.groupby(key).transform(zscore)



Task: We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check. Calculate the mean and standard deviation within each group.
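The check can be sketched by grouping the transformed series by year again and aggregating (a shorter synthetic series stands in for the rolling-mean series above):

```python
import numpy as np
import pandas as pd

index = pd.date_range("10/1/1999", periods=400)
ser = pd.Series(np.random.normal(0.5, 2, 400), index=index)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ser.groupby(lambda d: d.year).transform(zscore)
# Per-group mean should be ~0 and std ~1 after standardization
check = transformed.groupby(lambda d: d.year).agg(["mean", "std"])
```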



In [ ]:



Task: Visually compare the original and transformed data sets.
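One option is to put both series in a DataFrame and plot them as subplots (synthetic stand-in data; the Agg backend is only needed outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import numpy as np
import pandas as pd

index = pd.date_range("10/1/1999", periods=400)
ser = pd.Series(np.random.normal(0.5, 2, 400), index=index)
transformed = ser.groupby(lambda d: d.year).transform(
    lambda x: (x - x.mean()) / x.std())
compare = pd.DataFrame({"original": ser, "transformed": transformed})
axes = compare.plot(subplots=True, figsize=(12, 6))
```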



In [ ]:



# Pivot

Task: Let's reshape this small example DataFrame of ICD10 codes. Each person has different code associations; only positive associations are listed. Transform (reshape) the DataFrame into a wide format (one column per code) that lists positive and negative (missing) associations as Booleans.



In [ ]:

df = pd.DataFrame({"Person": ["a", "a", "a", "b", "c", "c"], "Code": ["D99", "E32", "A41", "D99", "D99", "A41"]}, columns=["Person", "Code"])
df
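One way to sketch the reshape: `pd.crosstab` counts Person/Code pairs, and casting the counts to bool yields True for listed (positive) and False for missing (negative) associations. (An alternative is `df.assign(x=True).pivot(index="Person", columns="Code", values="x").fillna(False)`.)

```python
import pandas as pd

df = pd.DataFrame({"Person": ["a", "a", "a", "b", "c", "c"],
                   "Code": ["D99", "E32", "A41", "D99", "D99", "A41"]})
wide = pd.crosstab(df["Person"], df["Code"]).astype(bool)
```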




In [ ]:



# Combine DataFrames

Task: In the data/microbiome subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each. Write code that imports each of the data spreadsheets and combines them into a single DataFrame, adding the identifying information from the metadata spreadsheet as columns in the combined DataFrame.
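The combine pattern can be sketched as follows. In-memory frames stand in for the spreadsheets (which the notebook would read with `pd.read_excel`), and the key and metadata column names (MID barcodes, BARCODE, GROUP) are assumptions about the files' layout:

```python
import pandas as pd

# Stand-ins for the 9 spreadsheets, keyed by a hypothetical barcode
sheets = {
    "MID1": pd.DataFrame({"Taxon": ["Firmicutes"], "Count": [557]}),
    "MID2": pd.DataFrame({"Taxon": ["Firmicutes"], "Count": [433]}),
}
# Stand-in for the metadata sheet describing each file
metadata = pd.DataFrame({"BARCODE": ["MID1", "MID2"],
                         "GROUP": ["EXTRACTION CONTROL", "NEC 1"]})

# Tag each frame with its identifier, stack them, then attach the metadata
frames = [frame.assign(BARCODE=barcode) for barcode, frame in sheets.items()]
combined = pd.concat(frames, ignore_index=True).merge(metadata, on="BARCODE")
```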



In [ ]:



# GroupBy Titanic data

Load the dataset in titanic.xls. It contains data on all the passengers that travelled on the Titanic.

1. Use the groupby method to calculate the proportion of passengers that survived by sex.
2. Calculate the same proportion, but by class and sex.
3. Create age categories: children (under 14 years), adolescents (14-20), adults (21-64), and seniors (65+), and calculate survival proportions by age category, class and sex. Additionally, count the number of passengers per group.
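The three steps can be sketched as below. A tiny stand-in frame replaces titanic.xls (read with `pd.read_excel` in the notebook), and the column names survived/sex/pclass/age assume the common Titanic schema:

```python
import pandas as pd

# Stand-in for titanic.xls; column names are assumed
titanic = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "pclass": [1, 3, 1, 3],
    "survived": [1, 1, 0, 1],
    "age": [29, 2, 35, 70],
})
by_sex = titanic.groupby("sex")["survived"].mean()                    # 1.
by_class_sex = titanic.groupby(["pclass", "sex"])["survived"].mean()  # 2.
titanic["age_cat"] = pd.cut(titanic["age"], [0, 14, 20, 64, 120],     # 3.
                            labels=["child", "adolescent", "adult", "senior"])
by_age = (titanic.groupby(["age_cat", "pclass", "sex"], observed=True)
          ["survived"].agg(["mean", "count"]))
```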


In [ ]:



Let's plot the number of survivors grouped by sex and passenger class.
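A possible sketch: sum the survived flags per (sex, class) group and bar-plot the result (stand-in data; Agg backend only needed outside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 1, 3, 3],
    "survived": [1, 0, 1, 1],
})
survivors = titanic.groupby(["sex", "pclass"])["survived"].sum()
ax = survivors.plot(kind="bar")
```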



In [ ]:



Task: Let's also look at the deaths (and not only the survivors) within the groups and create a stacked Barplot of survivors vs. deaths grouped by sex and passenger-class (as before).

1. Convert the "survived" column to boolean values
2. Compute the cross tabulation (a.k.a. contingency table) of passenger-class and sex vs. survived. Assign the result to the variable name "death_counts" (hint: check out pd.crosstab()).
3. Create a stacked barplot of the computed death_counts
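The three steps can be sketched on the same stand-in frame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

titanic = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 1, 3, 3],
    "survived": [1, 0, 1, 1],
})
titanic["survived"] = titanic["survived"].astype(bool)            # step 1
death_counts = pd.crosstab([titanic["pclass"], titanic["sex"]],   # step 2
                           titanic["survived"])
ax = death_counts.plot(kind="bar", stacked=True)                  # step 3
```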


In [ ]:



Task: Another way of comparing the groups is to look at the survival rate, by adjusting for the number of people in each group. Create a stacked, horizontal Barplot of the adjusted death counts

1. Sum the death_counts per passenger-class and sex, and convert to data type float (for Python 2.x division purposes).
2. Compute the adjusted survival rate by dividing the death_counts by the result from 1.
3. Plot a stacked, horizontal Barplot from the result of step 2.
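A sketch of the adjustment, starting from a small hypothetical death_counts table (rows: class, columns: survived False/True):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

# Hypothetical crosstab result standing in for death_counts
death_counts = pd.DataFrame({False: [1, 2], True: [3, 4]},
                            index=pd.Index(["1st", "3rd"], name="pclass"))
totals = death_counts.sum(axis=1).astype(float)   # step 1: group sizes
rates = death_counts.div(totals, axis=0)          # step 2: rows now sum to 1
ax = rates.plot(kind="barh", stacked=True)        # step 3
```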


In [ ]: