In this lab you will see examples of commonly used data wrangling tools in Python. In particular, we aim to give you some familiarity with:
In [ ]:
import pandas as pd
import numpy as np
# These lines load the tests.
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab02.ok')
The code below produces the data frames used in the examples.
In [ ]:
heroes = pd.DataFrame(
data={'color': ['red', 'green', 'black',
'blue', 'black', 'red'],
'first_seen_on': ['a', 'a', 'f', 'a', 'a', 'f'],
'first_season': [2, 1, 2, 3, 3, 1]},
index=['flash', 'arrow', 'vibe',
'atom', 'canary', 'firestorm']
)
identities = pd.DataFrame(
data={'ego': ['barry allen', 'oliver queen', 'cisco ramon',
'ray palmer', 'sara lance',
'martin stein', 'ronnie raymond'],
'alter-ego': ['flash', 'arrow', 'vibe', 'atom',
'canary', 'firestorm', 'firestorm']}
)
teams = pd.DataFrame(
data={'team': ['flash', 'arrow', 'flash', 'legends',
'flash', 'legends', 'arrow'],
'hero': ['flash', 'arrow', 'vibe', 'atom',
'killer frost', 'firestorm', 'speedy']})
For the examples that follow, we will be using a toy data set containing information about superheroes in the Arrowverse. In the first_seen_on column, a stands for Arrow and f for Flash.
In [ ]:
heroes
In [ ]:
identities
In [ ]:
teams
To select a column of a DataFrame by column label, the safest and fastest way is to use the .loc method. General usage looks like frame.loc[rowname, colname]. (Reminder that the colon : means "everything".) For example, if we want the color column of the heroes data frame, we would use:
In [ ]:
heroes.loc[:, 'color']
Selecting multiple columns is easy. You just need to supply a list of column names. Here we select the color and first_season columns:
In [ ]:
heroes.loc[:, ['color', 'first_season']]
While .loc is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is plain bracket indexing, which takes the form frame['colname'].
In [ ]:
heroes['first_seen_on']
Similarly, if we want to select a row by its label, we can use the same .loc method.
In [ ]:
heroes.loc[['flash', 'vibe'], :]
If we want all the columns returned, we can, for brevity, drop the colon without issue.
In [ ]:
heroes.loc[['flash', 'vibe']]
In [ ]:
heroes.loc['flash':'atom', :'first_seen_on']
In [ ]:
heroes.iloc[:4, :2]
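The two cells above slice by label and by position. One subtlety worth noting: label-based slices with .loc include both endpoints, while positional slices with .iloc exclude the stop position. A minimal sketch on a throwaway frame (the names here are illustrative, not from the lab data):

```python
import pandas as pd

# a small illustrative frame, not part of the lab data
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]},
                  index=['w', 'x', 'y', 'z'])

# .loc slices by label and INCLUDES the stop label 'y': rows w, x, y
by_label = df.loc['w':'y', 'a']

# .iloc slices by position and EXCLUDES the stop position 2: rows w, x
by_position = df.iloc[0:2, 0]
```

Here by_label has three entries but by_position has only two, even though both slices were written with what looks like a "two-step" range.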
Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, culling out fishy outliers, or analyzing subgroups of your data set. For example, we may be interested in characters that debuted in season 3 of Arrow. Note that compound expressions have to be grouped with parentheses.
In [ ]:
heroes[(heroes['first_season'] == 3) & (heroes['first_seen_on'] == 'a')]
We want to highlight the strategy used for filtering to answer the question above. The question involves two conditions, one on first_season and one on first_seen_on:

- first_season equal to 3 $\rightarrow$ heroes['first_season'] == 3
- first_seen_on equal to a $\rightarrow$ heroes['first_seen_on'] == 'a'
For your reference, some commonly used comparison operators are given below.
| Symbol | Usage | Meaning |
|---|---|---|
| == | a == b | Does a equal b? |
| <= | a <= b | Is a less than or equal to b? |
| >= | a >= b | Is a greater than or equal to b? |
| < | a < b | Is a less than b? |
| > | a > b | Is a greater than b? |
| ~ | ~p | Returns negation of p |
| | | p | q | p OR q |
| & | p & q | p AND q |
| ^ | p ^ q | p XOR q (exclusive or) |
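As a quick check of how these operators combine, here is a hedged sketch on a throwaway frame. The key gotcha is precedence: &, |, and ~ bind more tightly than comparisons, which is why each comparison needs its own parentheses:

```python
import pandas as pd

# a small illustrative frame, not part of the lab data
df = pd.DataFrame({'season': [1, 2, 3, 3],
                   'show': ['a', 'f', 'a', 'f']})

# parentheses are required: without them, & would bind before ==
mask = (df['season'] == 3) & (df['show'] == 'a')
matches = df[mask]

# ~ negates a boolean mask, selecting every row the mask excluded
complement = df[~mask]
```

Writing `df['season'] == 3 & df['show'] == 'a'` without parentheses would raise an error, because Python evaluates `3 & df['show']` first.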
An often-used operation missing from the above table is a test-of-membership. The Series.isin(values) method returns a boolean array denoting whether each element of the Series is in values. We can then use the array to subset our data frame. For example, if we wanted to see which rows of heroes had first_season values in $\{1,3\}$, we would use:
In [ ]:
heroes[heroes['first_season'].isin([1,3])]
Notice that in both examples above, the expression in the brackets evaluates to a boolean series. The general strategy for filtering data frames, then, is to write an expression of the form frame[logical statement].
To count the number of instances of a value in a Series, we can use the value_counts method. Below we count the number of instances of each color.
In [ ]:
heroes['color'].value_counts()
A more sophisticated analysis might involve counting the number of times a tuple appears. Here we count (color, first_season) tuples.
In [ ]:
heroes.groupby(['color', 'first_season']).size()
This returns a series that has been multi-indexed. We'll eschew this topic for now. To get a data frame back, we'll use the reset_index method, which also allows us to simultaneously name the new column.
In [ ]:
heroes.groupby(['color', 'first_season']).size().reset_index(name='count')
Suppose we have another table that classifies superheroes into their respective teams. Note that canary is not in this data set and that killer frost and speedy are additions that aren't in the original heroes set.
For simplicity of the example, we'll convert the index of the heroes data frame into an explicit column called hero. A careful examination of the documentation will reveal that joining on a mixture of the index and columns is possible.
In [ ]:
heroes['hero'] = heroes.index
heroes
The inner join below returns rows representing the heroes that appear in both data frames.
In [ ]:
pd.merge(heroes, teams, how='inner', on='hero')
The left join returns rows representing heroes in the heroes ("left") data frame, augmented by information found in the teams data frame. Its counterpart, the right join, would return heroes in the teams data frame. Note that the team for hero canary is a NaN value, representing missing data.
In [ ]:
pd.merge(heroes, teams, how='left', on='hero')
In [ ]:
pd.merge(heroes, teams, how='outer', on='hero')
In [ ]:
pd.merge(heroes, identities, how='inner',
left_on='hero', right_on='alter-ego')
As shown in lecture, there are a multitude of reasons why a data set might have missing values. The current implementation of Pandas uses the numpy NaN to represent these null values (older implementations even used -inf and inf). Future versions of Pandas might implement a true null value---keep your eyes peeled for this in updates! More information can be found [here](http://pandas.pydata.org/pandas-docs/stable/missing_data.html).
Because of the specialness of missing values, they merit their own set of tools. For this lab, we will focus on detection. For replacement, see the docs.
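For reference, the replacement side of the story is handled by the fillna and dropna methods. A minimal sketch on a throwaway series:

```python
import numpy as np
import pandas as pd

# a small illustrative series, not part of the lab data
s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0.0)   # replace each NaN with a fill value
dropped = s.dropna()     # or drop the missing entries entirely
```

Both return new objects; the original series s still contains its NaN.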
In [ ]:
x = np.nan
y = pd.merge(heroes, teams, how='outer', on='hero')['first_season']
y
To check if a value is null, we use the isnull() method for series and data frames. Alternatively, there is a pd.isnull() function as well.
In [ ]:
x.isnull() # won't work since x is neither a series nor a data frame
In [ ]:
pd.isnull(x)
In [ ]:
y.isnull()
In [ ]:
pd.isnull(y)
Since filtering out missing data is such a common operation, Pandas also has conveniently included the analogous notnull() methods and function for improved human readability.
In [ ]:
y.notnull()
In [ ]:
y[y.notnull()]
Consider the "complete" data set shown below. Note that the rows are indexed by the superheroes' names.
In [ ]:
heroes_complete = pd.merge(heroes, identities, left_on='hero', right_on='alter-ego')
heroes_complete = pd.merge(heroes_complete, teams, how='outer', on='hero')
heroes_complete.set_index('hero', inplace=True)
heroes_complete
What are the outputs of the following calls? State what is wrong with the ones that will produce errors and propose a fix. To challenge yourself, try to do this exercise without running any commands.
In [ ]:
heroes_complete.loc['flash'] # SOLUTION
In [ ]:
heroes_complete.iloc[1, 3:5] # SOLUTION
In [ ]:
heroes_complete.loc[:, 'first_seen_on':'team'] # SOLUTION
In [ ]:
heroes_complete.iloc[1:3] # SOLUTION
In [ ]:
heroes_complete.iloc[1, 1] # SOLUTION
In [ ]:
heroes_complete.loc[heroes_complete['color'].isin(['red', 'black'])] # SOLUTION
In [ ]:
heroes_complete.ix[1, 'color'] # SOLUTION
In [ ]:
heroes_complete[heroes_complete['first_season'] % 2 == 0] # SOLUTION
In [ ]:
heroes_complete[heroes_complete['color'].isnull()] # SOLUTION
Can you propose a fix to any of the broken ones above?
The practice problems below use the Department of Transportation's "On-Time" flight data for all flights originating from SFO or OAK in January 2016. Information about the variables can be found in the readme.html file. Information about the airports and airlines is contained in the comma-delimited files airports.dat and airlines.dat, respectively. Both were sourced from http://openflights.org/data.html.
Disclaimer: There is a more direct way of dealing with time data that is not presented in these problems. This activity is merely an academic exercise.
In [ ]:
flights = pd.read_csv("flights.dat", dtype={'sched_dep_time': 'f8', 'sched_arr_time': 'f8'})
flights.head()
In [ ]:
airports_cols = [
'openflights_id',
'name',
'city',
'country',
'iata',
'icao',
'latitude',
'longitude',
'altitude',
'tz',
'dst',
'tz_olson',
'type',
'airport_dsource'
]
airports = pd.read_csv("airports.dat", names=airports_cols)
airports.head()
In [ ]:
def extract_hour(time):
"""
Extracts hour information from military time.
Args:
time (float64): array of time given in military format.
Takes on values in 0.0-2359.0 due to float64 representation.
Returns:
array (float64): array of input dimension with hour information.
Should only take on integer values in 0-23
"""
return time // 100 # SOLUTION
In [ ]:
def extract_mins(time):
"""
Extracts minute information from military time
Args:
time (float64): array of time given in military format.
Takes on values in 0.0-2359.0 due to float64 representation.
Returns:
array (float64): array of input dimension with minute information.
Should only take on integer values in 0-59
"""
return time % 100 # SOLUTION
In [ ]:
def convert_to_minofday(time):
"""
Converts military time to minute of day
Args:
time (float64): array of time given in military format.
Takes on values in 0.0-2359.0 due to float64 representation.
Returns:
array (float64): array of input dimension with minute of day
Example: 1:03pm is converted to 783.0
>>> convert_to_minofday(1303.0)
783.0
"""
return extract_hour(time) * 60 + extract_mins(time) # SOLUTION
def calc_time_diff(x, y):
"""
Calculates delay times y - x
Args:
x (float64): array of scheduled time given in military format.
Takes on values in 0.0-2359.0 due to float64 representation.
y (float64): array of same dimensions giving actual time
Returns:
array (float64): array of input dimension with delay time
"""
scheduled = convert_to_minofday(x) # SOLUTION
actual = convert_to_minofday(y) # SOLUTION
return actual - scheduled # SOLUTION
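To convince yourself the arithmetic above is right, it helps to trace one value through by hand. This sketch repeats the same // 100 and % 100 logic from the solution on plain floats, using the 1:03 pm example from the docstring:

```python
# 1:03 pm is 1303.0 in military time
time = 1303.0

hour = time // 100               # integer division peels off the hour
mins = time % 100                # the remainder is the minutes
min_of_day = hour * 60 + mins    # minutes elapsed since midnight

# a flight scheduled at 1303.0 that departs at 1342.0 is late by the
# difference of the two times expressed as minutes of the day
actual = 1342.0
delay = (actual // 100) * 60 + (actual % 100) - min_of_day
```

Note that subtracting the raw military times (1342.0 - 1303.0 = 39.0) happens to agree here, but it would be wrong across an hour boundary, which is why the conversion to minute-of-day is needed.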
In [ ]:
delay = calc_time_diff(flights['sched_dep_time'], flights['actual_dep_time']) # SOLUTION
delayed15 = flights[delay > 15] # SOLUTION
In [ ]:
delayed_airports = airports['iata'].isin(delayed15['destination']) # SOLUTION
delayed_destinations = airports.loc[delayed_airports, 'city'].sort_values() # SOLUTION
In [ ]:
top10 = (
flights[['tailnum', 'destination']]
.drop_duplicates()
.groupby('tailnum')
.size()
.sort_values(ascending=False)
.head(10)
)
In [ ]:
flights['delay'] = calc_time_diff(flights['sched_dep_time'], flights['actual_dep_time'])
avg_delay = (
flights[flights['origin']=='SFO']
.groupby('destination')
.agg({'delay': np.mean})
)
airports = pd.merge(airports, avg_delay, how='inner', left_on='iata', right_index=True)
Let's take a look at our non-null results. Do any of the delay values catch your eye?
...
In [ ]:
airports.loc[pd.notnull(airports['delay']), ['name', 'city', 'delay']]\
.sort_values('delay', ascending=False)
In [ ]:
I_totally_did_everything = True
In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()
In [ ]:
_ = ok.submit()