Pandas Munging Exercises

See also: Pandas API Reference

See also: Reshaping and Pivot Tables



In [93]:
# 1. Import pandas as pd and read_csv() simple.csv into a dataframe 'df' (optionally, auto-convert dates).

import pandas as pd

df = pd.read_csv('data/simple.csv', infer_datetime_format=True, parse_dates=['Date'])
df.head(5)


Out[93]:
Date Count Weird Date Weird Count
0 2016-12-01 11.0 \tThursday, December 01, 2016 11
1 2016-12-02 NaN Friday, December 02, 2016 NaN
2 2016-12-03 49.0 Saturday, December 03, 2016 49
3 2016-12-04 48.0 NaN 48
4 2016-12-05 29.0 Monday, December 05, 2016 29
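
Note: parse_dates=['Date'] is what does the conversion here; in pandas 2.0+ the infer_datetime_format keyword is deprecated (format inference is on by default), so, assuming a recent pandas, an equivalent read is simply:

pd.read_csv('data/simple.csv', parse_dates=['Date'])  # same result as above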

In [94]:
# 2. Use df.dropna() to drop any sample (row) that contains any NA/NaN values.

df.dropna().head(5)


Out[94]:
Date Count Weird Date Weird Count
0 2016-12-01 11.0 \tThursday, December 01, 2016 11
2 2016-12-03 49.0 Saturday, December 03, 2016 49
4 2016-12-05 29.0 Monday, December 05, 2016 29
6 2016-12-07 36.0 Wednesday, December 07, 2016 36
7 2016-12-08 45.0 Thursday, December 08, 2016 45
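
Note: dropna() also accepts how and thresh keywords to control how aggressive the drop is; a quick sketch (not required by the exercise):

df.dropna(how='all')   # drop only rows where every column is NA
df.dropna(thresh=3)    # keep rows that have at least 3 non-NA values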

In [95]:
# 3. Use df.dropna() with the subset keyword argument to drop those rows without a Count.

df.dropna(subset=['Count']).head(5)


Out[95]:
Date Count Weird Date Weird Count
0 2016-12-01 11.0 \tThursday, December 01, 2016 11
2 2016-12-03 49.0 Saturday, December 03, 2016 49
3 2016-12-04 48.0 NaN 48
4 2016-12-05 29.0 Monday, December 05, 2016 29
6 2016-12-07 36.0 Wednesday, December 07, 2016 36

In [96]:
# 4. Use df.fillna() to fill the NaN values in Count with 0.

df.fillna(0).head(5)


Out[96]:
Date Count Weird Date Weird Count
0 2016-12-01 11.0 \tThursday, December 01, 2016 11
1 2016-12-02 0.0 Friday, December 02, 2016 0
2 2016-12-03 49.0 Saturday, December 03, 2016 49
3 2016-12-04 48.0 0 48
4 2016-12-05 29.0 Monday, December 05, 2016 29
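
Note: fillna() also accepts a dict mapping column names to fill values, which leaves the other columns untouched; a sketch:

df.fillna({'Count': 0}).head(5)   # fill NaN in Count only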

In [97]:
# 5. Use pd.to_numeric() to convert "Weird Count" to numbers. If it raises an error, retry with the keyword errors='coerce'.

new_series = pd.to_numeric(df['Weird Count'], errors='coerce')
df['Weird Count'] = new_series
df.head(5)


Out[97]:
Date Count Weird Date Weird Count
0 2016-12-01 11.0 \tThursday, December 01, 2016 11.0
1 2016-12-02 NaN Friday, December 02, 2016 NaN
2 2016-12-03 49.0 Saturday, December 03, 2016 49.0
3 2016-12-04 48.0 NaN 48.0
4 2016-12-05 29.0 Monday, December 05, 2016 29.0

In [98]:
# 6. Use str.replace('\t', '') on the column Weird Date to delete any tabs. 

df['Weird Date'] = df['Weird Date'].str.replace('\t', '')
df.head(5)


Out[98]:
Date Count Weird Date Weird Count
0 2016-12-01 11.0 Thursday, December 01, 2016 11.0
1 2016-12-02 NaN Friday, December 02, 2016 NaN
2 2016-12-03 49.0 Saturday, December 03, 2016 49.0
3 2016-12-04 48.0 NaN 48.0
4 2016-12-05 29.0 Monday, December 05, 2016 29.0
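
Note: in recent pandas versions (2.0+) the regex parameter of str.replace() defaults to False (plain substring replacement); '\t' behaves the same either way, but spelling it out makes the intent clear. A sketch:

df['Weird Date'].str.replace('\t', '', regex=False)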

In [99]:
# 7. Use .str.partition(',')[2] to chop the leading weekday and comma off Weird Date and make a new column.

df['Less Weird Date'] = df['Weird Date'].str.partition(',')[2]
df.head(5)


Out[99]:
Date Count Weird Date Weird Count Less Weird Date
0 2016-12-01 11.0 Thursday, December 01, 2016 11.0 December 01, 2016
1 2016-12-02 NaN Friday, December 02, 2016 NaN December 02, 2016
2 2016-12-03 49.0 Saturday, December 03, 2016 49.0 December 03, 2016
3 2016-12-04 48.0 NaN 48.0 None
4 2016-12-05 29.0 Monday, December 05, 2016 29.0 December 05, 2016
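
Note: str.split() with a split limit is an alternative to partition(); a sketch (the leading space it leaves behind is removed by the strip in the next step anyway):

df['Weird Date'].str.split(',', n=1).str[1]   # everything after the first comma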

In [100]:
# 8. Use .str.strip() to remove any whitespace in this new column.

df['Less Weird Date'] = df['Less Weird Date'].str.strip()
df.head(5)


Out[100]:
Date Count Weird Date Weird Count Less Weird Date
0 2016-12-01 11.0 Thursday, December 01, 2016 11.0 December 01, 2016
1 2016-12-02 NaN Friday, December 02, 2016 NaN December 02, 2016
2 2016-12-03 49.0 Saturday, December 03, 2016 49.0 December 03, 2016
3 2016-12-04 48.0 NaN 48.0 None
4 2016-12-05 29.0 Monday, December 05, 2016 29.0 December 05, 2016

In [101]:
# 9. Use pd.to_datetime() to convert the weirdly formatted dates in Less Weird Date to pandas datetimes.

df['Reconstituted Date'] = pd.to_datetime(df['Less Weird Date'])
df.head(5)


Out[101]:
Date Count Weird Date Weird Count Less Weird Date Reconstituted Date
0 2016-12-01 11.0 Thursday, December 01, 2016 11.0 December 01, 2016 2016-12-01
1 2016-12-02 NaN Friday, December 02, 2016 NaN December 02, 2016 2016-12-02
2 2016-12-03 49.0 Saturday, December 03, 2016 49.0 December 03, 2016 2016-12-03
3 2016-12-04 48.0 NaN 48.0 None NaT
4 2016-12-05 29.0 Monday, December 05, 2016 29.0 December 05, 2016 2016-12-05
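
Note: passing an explicit format is stricter (and usually faster) than letting to_datetime() guess; a sketch assuming the "Month day, Year" layout seen above:

pd.to_datetime(df['Less Weird Date'], format='%B %d, %Y')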

In [102]:
# 10. Convert the nice pandas dates to month-long period types using df[].dt.to_period().

df['Month'] = df['Reconstituted Date'].dt.to_period(freq='M')
df.head(5)


Out[102]:
Date Count Weird Date Weird Count Less Weird Date Reconstituted Date Month
0 2016-12-01 11.0 Thursday, December 01, 2016 11.0 December 01, 2016 2016-12-01 2016-12
1 2016-12-02 NaN Friday, December 02, 2016 NaN December 02, 2016 2016-12-02 2016-12
2 2016-12-03 49.0 Saturday, December 03, 2016 49.0 December 03, 2016 2016-12-03 2016-12
3 2016-12-04 48.0 NaN 48.0 None NaT NaT
4 2016-12-05 29.0 Monday, December 05, 2016 29.0 December 05, 2016 2016-12-05 2016-12
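
Note: a typical use of the month periods is aggregation; a usage sketch:

df.groupby('Month')['Count'].sum()   # total Count per month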

In [103]:
# 11. Convert Count to an int using the column's "astype()" method.

df['Count'] = df['Count'].fillna(0)

df['Count'] = df['Count'].astype(int)

df.head(5)


Out[103]:
Date Count Weird Date Weird Count Less Weird Date Reconstituted Date Month
0 2016-12-01 11 Thursday, December 01, 2016 11.0 December 01, 2016 2016-12-01 2016-12
1 2016-12-02 0 Friday, December 02, 2016 NaN December 02, 2016 2016-12-02 2016-12
2 2016-12-03 49 Saturday, December 03, 2016 49.0 December 03, 2016 2016-12-03 2016-12
3 2016-12-04 48 NaN 48.0 None NaT NaT
4 2016-12-05 29 Monday, December 05, 2016 29.0 December 05, 2016 2016-12-05 2016-12
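
Note: the fillna(0) is needed because NumPy integer columns cannot hold NaN. Pandas' nullable integer dtype is an alternative that keeps the gaps; a sketch (applied before the fillna, the NaN values would become <NA> rather than 0):

df['Count'].astype('Int64')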

In [104]:
# 12. Import numpy as np. Run pd.isnull(np.nan). Run None == np.nan. Run np.nan == np.nan. What does that tell you?
#     (np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0.)

import numpy as np

print(pd.isnull(np.nan))

print(None == np.nan)

print(np.nan == np.nan)


True
False
False
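
It tells you that NaN is not equal to anything, not even itself, so == cannot be used to find missing values; use pd.isnull()/pd.isna() (or Series.isna()) instead. A small sketch:

print(pd.isna([np.nan, None, 1.0]))    # [ True  True False] -- NaN and None both count as missing
print(df['Weird Count'].isna().sum())  # how many Weird Count values are missing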

In [105]:
# 13. Do a database style inner join of the df and a copy (df.copy()) of the dataframe on Date using pd.merge()

copy = df.copy()

df.merge(copy, how='inner', on='Date').head(3)


Out[105]:
Date Count_x Weird Date_x Weird Count_x Less Weird Date_x Reconstituted Date_x Month_x Count_y Weird Date_y Weird Count_y Less Weird Date_y Reconstituted Date_y Month_y
0 2016-12-01 11 Thursday, December 01, 2016 11.0 December 01, 2016 2016-12-01 2016-12 11 Thursday, December 01, 2016 11.0 December 01, 2016 2016-12-01 2016-12
1 2016-12-02 0 Friday, December 02, 2016 NaN December 02, 2016 2016-12-02 2016-12 0 Friday, December 02, 2016 NaN December 02, 2016 2016-12-02 2016-12
2 2016-12-03 49 Saturday, December 03, 2016 49.0 December 03, 2016 2016-12-03 2016-12 49 Saturday, December 03, 2016 49.0 December 03, 2016 2016-12-03 2016-12
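
Note: the _x/_y column names come from merge's default suffixes; they can be controlled with the suffixes keyword (shown here with the pd.merge() spelling the prompt mentions). A sketch:

pd.merge(df, copy, how='inner', on='Date', suffixes=('_orig', '_copy')).head(3)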

In [106]:
# 14. Combine the df with a copy of the df using concat, effectively stacking the df on top of itself.

pd.concat([df, copy]).head(5)


Out[106]:
Date Count Weird Date Weird Count Less Weird Date Reconstituted Date Month
0 2016-12-01 11 Thursday, December 01, 2016 11.0 December 01, 2016 2016-12-01 2016-12
1 2016-12-02 0 Friday, December 02, 2016 NaN December 02, 2016 2016-12-02 2016-12
2 2016-12-03 49 Saturday, December 03, 2016 49.0 December 03, 2016 2016-12-03 2016-12
3 2016-12-04 48 NaN 48.0 None NaT NaT
4 2016-12-05 29 Monday, December 05, 2016 29.0 December 05, 2016 2016-12-05 2016-12
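
Note: the stacked frame keeps the duplicated row labels (0, 1, 2, ... twice); ignore_index=True renumbers them. A sketch:

pd.concat([df, copy], ignore_index=True)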

In [107]:
# 15. Convert Count to string type using .astype(). If it fails, suppress the error
#     (the old raise_on_error=False keyword was replaced by errors='ignore' in later pandas).

df['Count'].astype(str).head(5)


Out[107]:
0    11
1     0
2    49
3    48
4    29
Name: Count, dtype: object

In [108]:
# 16. Bin the values in Count in groups of 10. 0-9, 10-19, 20-29, etc. using pd.cut.

df['Weird Count'] = pd.cut(df['Count'], bins=[-1, 9, 19, 29, 39, 49, 59])

df.head(5)


Out[108]:
Date Count Weird Date Weird Count Less Weird Date Reconstituted Date Month
0 2016-12-01 11 Thursday, December 01, 2016 (9, 19] December 01, 2016 2016-12-01 2016-12
1 2016-12-02 0 Friday, December 02, 2016 (-1, 9] December 02, 2016 2016-12-02 2016-12
2 2016-12-03 49 Saturday, December 03, 2016 (39, 49] December 03, 2016 2016-12-03 2016-12
3 2016-12-04 48 NaN (39, 49] None NaT NaT
4 2016-12-05 29 Monday, December 05, 2016 (19, 29] December 05, 2016 2016-12-05 2016-12
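
Note: an equivalent way to express the 0-9, 10-19, ... bins is left-closed intervals built from a range; a sketch:

pd.cut(df['Count'], bins=list(range(0, 70, 10)), right=False)   # [0, 10), [10, 20), ...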

In [109]:
# 17. Use pd.read_csv to read the CFPB complaints CSV into dataframe 'df2'.

df2 = pd.read_csv('data/cfpb_complaints_with_fictitious_data.csv')
df2.head(1)


Out[109]:
Date received Product Sub-product Issue Sub-issue Consumer complaint narrative Company public response Company State ZIP code ... Consumer consent provided? Submitted via Date sent to company Company response to consumer Timely response? Consumer disputed? Complaint ID Consumer Claim Company Offer Amount Received
0 3/24/2015 Credit card NaN Other NaN Received Capital One charge card offer XXXX. A... NaN Capital One OH 440XX ... Consent provided Web 3/30/2015 Closed with explanation Yes No 1297939 332.63 217.18 130.22

1 rows × 21 columns
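
Note: when only a few columns are needed (as in the next exercise), read_csv can skip the rest up front with usecols; a sketch:

pd.read_csv('data/cfpb_complaints_with_fictitious_data.csv',
            usecols=['Product', 'Sub-product', 'Complaint ID', 'Date received'])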


In [110]:
# 18. Filter df2 down to ['Product', 'Sub-product', 'Complaint ID', 'Date received']

df2 = df2[['Product', 'Sub-product', 'Complaint ID', 'Date received']]
df2.head(5)


Out[110]:
Product Sub-product Complaint ID Date received
0 Credit card NaN 1297939 3/24/2015
1 Debt collection Other (i.e. phone, health club, etc.) 1296593 3/23/2015
2 Credit card NaN 1296693 3/23/2015
3 Credit reporting NaN 1296955 3/23/2015
4 Debt collection Payday loan 1296727 3/23/2015

In [111]:
# 19. Use set_index() with ['Product', 'Sub-product'] and assign the result to df3.

df3 = df2.set_index(['Product', 'Sub-product'])
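
With the MultiIndex in place, rows can be selected by product and sub-product; a usage sketch (sorting the index first keeps label lookups efficient):

df3 = df3.sort_index()
df3.loc['Credit card'].head()                       # all Credit card complaints
df3.loc[('Debt collection', 'Payday loan')].head()  # one product/sub-product pair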