In [ ]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')
In [ ]:
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')
Task: load the following file as a data frame
In [ ]:
fn = r"data/drinks.csv"
In [ ]:
# Write your answer here
Task: See the first 10 entries
In [ ]:
# Write your answer here
Task: Which country has the highest alcohol consumption (total litres of pure alcohol)?
In [ ]:
# Write your answer here
Task: Which continent drinks most beer on average?
In [ ]:
# Write your answer here
Task: List all unique continents.
In [ ]:
# Write your answer here
Task: Which countries have missing values in the continent column?
In [ ]:
# Write your answer here
Task: Set "the" missing continent with a name of your choice.
In [ ]:
# Write your answer here
Task: For each continent print "the" statistics (summary stats using "df.describe()") for wine consumption.
In [ ]:
# Write your answer here
Task: Print the median alcoohol consumption per continent for every column
In [ ]:
# Write your answer here
Task: Print the mean, min and max values for spirit consumption.
In [ ]:
# Write your answer here
Task: GroupBy Continent and create a Boxplot. (Hint: using e.g. figsize=(12, 9), rot=90
might help with legibility.)
In [ ]:
# Write your answer here
Task: Import the first dataset cars1 and cars2. Assign each to a to a variable called cars1 and cars2.
In [ ]:
# Write your answer here
Task: It seems our first dataset has some unnamed blank columns, fix cars1.
In [ ]:
# Write your answer here
Task: Join cars1 and cars2 into a single DataFrame called cars
In [ ]:
# Write your answer here
Task: Create function that returns the first word of the string in the "car" column, the manufacturer name. Use the "apply" method to create a new column in the DataFrame.
In [ ]:
# Write your answer here
Consider the following DataFrames for the next exercises
In [ ]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
Task: Concatenate the three DataFrames along the rows.
In [ ]:
# Write your answer here
Task: How many missing values (NaNs) are produced if you concatenate along the other axis (appending the columns)?
In [ ]:
# Write your answer here
Let's consider another data set to do some more Merge, Join & Concatenate exerciseses
In [ ]:
raw_data_1 = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8', '9', '10'],
'first_name': ['Alice', 'Ayoung', 'Bran', 'Bryce', 'Betty', 'Jane', np.nan],
'last_name': ['Aoni', 'Atiches', 'Balwner', 'Brice', 'Btisan', np.nan, 'Doe']}
raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
Task: Join the two dataframes, data1
and data2
, along rows and assign all_data
. Make sure that the row index is unique.
In [ ]:
# Write your answer here
Task: Join the two dataframes, data1
and data2
, along columns and assing to all_data_col
.
In [ ]:
# Write your answer here
Task: Merge all_data
and data3
along the subject_id value.
In [ ]:
# Write your answer here
Task: How many test_ids have missing values in the first or last name column?
In [ ]:
# Write your answer here
Task: Merge only the data that has the same 'subject_id' in both data1
and data2
.
In [ ]:
# Write your answer here
The transform method returns an object that is indexed the same (same size) as the one being grouped.
Task: Given a DataFrame with a column of group IDs, 'groups', and a column of corresponding integer values, 'vals', replace any negative values in 'vals' with the group mean.
In [ ]:
# Write your answer here
In [ ]:
# Write your answer here
Task: Use groupby in conjunction with transform across multiple columns: We want to group by one to n columns and apply a function on these groups across two columns.
1. Calculate the sum of `a` and `b` and assign it to a column named `e`.
2. Group by 'c' and `d`, and calculate the sum of `e`
In [ ]:
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})
In [ ]:
# Write your answer here
Task: Normalize (standardize) the data by calculating the z-score. Group the data by year
and calculate the z-score per group. z = (value - mean) / standard_deviation
In [ ]:
index = pd.date_range('10/1/1999', periods=1100)
ser = pd.Series(np.random.normal(0.5, 2, 1100), index=index)
ser = ser.rolling(window=100,min_periods=100).mean().dropna()
In [ ]:
# Answer:
key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ser.groupby(key).transform(zscore)
Task: We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check. Calculate the mean and standard deviation within each group.
In [ ]:
# Write your answer here
Task: Visually compare the original and transformed data sets.
In [ ]:
# Write your answer here
Task: Let's reshape this small example DataFrame of ICD10 codes. Each person has different code-associations. Only positive associations are listed. Transform (reshape) the DataFrame to a wide format (one column per code) that lists positive and negative (missing) associations as Booleans.
In [ ]:
df = pd.DataFrame({"Person": ["a", "a", "a", "b", "c", "c"], "Code": ["D99", "E32", "A41", "D99", "D99", "A41"]}, columns=["Person", "Code"])
df
In [ ]:
# Write your answer here
Task: In the data/microbiome subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each. Write code that imports each of the data spreadsheets and combines them into a single DataFrame
, adding the identifying information from the metadata spreadsheet as columns in the combined DataFrame
.
In [ ]:
# Write your answer here
Load the dataset in titanic.xls
. It contains data on all the passengers that travelled on the Titanic.
Task: Women and children first?
groupby
method to calculate the proportion of passengers that survived by sex.
In [ ]:
# Write your answer here
Let's plot the number of survivors grouped by sex and passenger class.
In [ ]:
# Write your answer here
Task: Let's also look at the deaths (and not only at the survivors) within the groups and create a stacked Barplot of survivers vs. deaths grouped by sex and passenger-class (as before).
In [ ]:
# Write your answer here
Task: Another way of comparing the groups is to look at the survival rate, by adjusting for the number of people in each group. Create a stacked, horizontal Barplot of the adjusted death counts
In [ ]:
# Write your answer here