Lab 2

Due 11:59pm 01/27/2017 (Completion-based)

In this lab you will see examples of some commonly used data wrangling tools in Python. In particular, we aim to give you some familiarity with:

  • Slicing data frames
  • Filtering data
  • Grouped counts
  • Joining two tables
  • NA/Null values

Setup


In [ ]:
import pandas as pd
import numpy as np

# These lines load the tests.
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab02.ok')

The code below produces the data frames used in the examples


In [ ]:
heroes = pd.DataFrame(
    data={'color': ['red', 'green', 'black', 
                    'blue', 'black', 'red'],
          'first_seen_on': ['a', 'a', 'f', 'a', 'a', 'f'],
          'first_season': [2, 1, 2, 3, 3, 1]},
    index=['flash', 'arrow', 'vibe', 
           'atom', 'canary', 'firestorm']
)

identities = pd.DataFrame(
    data={'ego': ['barry allen', 'oliver queen', 'cisco ramon',
                  'ray palmer', 'sara lance', 
                  'martin stein', 'ronnie raymond'],
          'alter-ego': ['flash', 'arrow', 'vibe', 'atom',
                        'canary', 'firestorm', 'firestorm']}
)

teams = pd.DataFrame(
    data={'team': ['flash', 'arrow', 'flash', 'legends', 
                   'flash', 'legends', 'arrow'],
          'hero': ['flash', 'arrow', 'vibe', 'atom', 
                   'killer frost', 'firestorm', 'speedy']})

Pandas and Wrangling

For the examples that follow, we will be using a toy data set containing information about superheroes in the Arrowverse. In the first_seen_on column, a stands for Arrow and f for Flash.


In [ ]:
heroes

In [ ]:
identities

In [ ]:
teams

Slice and Dice

Column selection by label

To select a column of a DataFrame by column label, the safest and fastest way is to use the .loc method. General usage looks like frame.loc[rowname, colname]. (Reminder that the colon : means "everything".) For example, if we want the color column of the heroes data frame, we would use:


In [ ]:
heroes.loc[:, 'color']

Selecting multiple columns is easy. You just need to supply a list of column names. Here we select the color and first_season columns:


In [ ]:
heroes.loc[:, ['color', 'first_season']]

While .loc is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the [] method, which takes on the form frame['colname'].


In [ ]:
heroes['first_seen_on']

Row Selection by Label

Similarly, if we want to select a row by its label, we can use the same .loc method.


In [ ]:
heroes.loc[['flash', 'vibe'], :]

If we want all the columns returned, we can, for brevity, drop the colon without issue.


In [ ]:
heroes.loc[['flash', 'vibe']]

General Selection by Label

More generally you can slice across both rows and columns at the same time. For example:


In [ ]:
heroes.loc['flash':'atom', :'first_seen_on']

Selection by Integer Index

If you want to select rows and columns by position, data frames have an analogous .iloc method for integer indexing. Remember that Python indexing starts at 0.


In [ ]:
heroes.iloc[:4,:2]

Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, culling out fishy outliers, or analyzing subgroups of your data set. For example, we may be interested in characters that debuted in season 3 of Arrow. Note that compound expressions have to be grouped with parentheses.


In [ ]:
heroes[(heroes['first_season']==3) & (heroes['first_seen_on']=='a')]

Problem Solving Strategy

We want to highlight the strategy for filtering to answer the question above:

  • Identify the variables of interest
    • Interested in the debut: first_season and first_seen_on
  • Translate the question into statements with True/False answers
    • Did the hero debut on Arrow? $\rightarrow$ The hero has first_seen_on equal to a
    • Did the hero debut in season 3? $\rightarrow$ The hero has first_season equal to 3
  • Translate the statements into boolean statements
    • The hero has first_seen_on equal to a $\rightarrow$ heroes['first_seen_on']=='a'
    • The hero has first_season equal to 3 $\rightarrow$ heroes['first_season']==3
  • Use the boolean array to filter the data


For your reference, some commonly used comparison operators are given below.

Symbol   Usage    Meaning
==       a == b   Does a equal b?
<=       a <= b   Is a less than or equal to b?
>=       a >= b   Is a greater than or equal to b?
<        a < b    Is a less than b?
>        a > b    Is a greater than b?
~        ~p       Returns negation of p
|        p | q    p OR q
&        p & q    p AND q
^        p ^ q    p XOR q (exclusive or)
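These operators can be combined into more elaborate filters. As a small sketch (rebuilding the heroes frame from above so the example is self-contained), note the parentheses required around each comparison:

```python
import pandas as pd

heroes = pd.DataFrame(
    data={'color': ['red', 'green', 'black',
                    'blue', 'black', 'red'],
          'first_seen_on': ['a', 'a', 'f', 'a', 'a', 'f'],
          'first_season': [2, 1, 2, 3, 3, 1]},
    index=['flash', 'arrow', 'vibe',
           'atom', 'canary', 'firestorm']
)

# OR: heroes that are red or debuted in season 1
red_or_s1 = heroes[(heroes['color'] == 'red') | (heroes['first_season'] == 1)]

# NOT: heroes that are not black
not_black = heroes[~(heroes['color'] == 'black')]
```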

An often-used operation missing from the above table is a test of membership. The Series.isin(values) method returns a boolean array denoting whether each element of the Series is in values. We can then use the array to subset our data frame. For example, if we wanted to see which rows of heroes had first_season values in $\{1,3\}$, we would use:


In [ ]:
heroes[heroes['first_season'].isin([1,3])]

Notice that in both examples above, the expression in the brackets evaluates to a boolean series. The general strategy for filtering data frames, then, is to write an expression of the form frame[logical statement].

Counting Rows

To count the number of instances of a value in a Series, we can use the value_counts method. Below we count the number of instances of each color.


In [ ]:
heroes['color'].value_counts()

A more sophisticated analysis might involve counting the number of times a tuple appears. Here we count (color, first_season) tuples.


In [ ]:
heroes.groupby(['color', 'first_season']).size()

This returns a series that has been multi-indexed. We'll eschew this topic for now. To get a data frame back, we'll use the reset_index method, which also allows us to simultaneously name the new column.


In [ ]:
heroes.groupby(['color', 'first_season']).size().reset_index(name='count')

Joining Tables on One Column

Suppose we have another table that classifies superheroes into their respective teams. Note that canary is not in this data set and that killer frost and speedy are additions that aren't in the original heroes set.

For simplicity of the example, we'll convert the index of the heroes data frame into an explicit column called hero. A careful examination of the documentation will reveal that joining on a mixture of the index and columns is possible.
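For instance, joining the heroes row labels directly against the hero column of teams might look like the following sketch (small stand-in frames are used here so the example is self-contained):

```python
import pandas as pd

# Stand-in frames mirroring heroes (label-indexed) and teams
heroes = pd.DataFrame(
    data={'color': ['red', 'green', 'black'],
          'first_season': [2, 1, 2]},
    index=['flash', 'arrow', 'vibe']
)
teams = pd.DataFrame(
    data={'team': ['flash', 'arrow', 'flash'],
          'hero': ['flash', 'arrow', 'vibe']})

# left_index=True matches the heroes index against teams' 'hero' column
merged = pd.merge(heroes, teams, how='inner',
                  left_index=True, right_on='hero')
```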


In [ ]:
heroes['hero'] = heroes.index
heroes

Inner Join

The inner join below returns rows representing the heroes that appear in both data frames.


In [ ]:
pd.merge(heroes, teams, how='inner', on='hero')

Left and right join

The left join returns rows representing heroes in the heroes ("left") data frame, augmented by information found in the teams data frame. Its counterpart, the right join, would return heroes in the teams data frame. Note that the team for hero canary is a NaN value, representing missing data.


In [ ]:
pd.merge(heroes, teams, how='left', on='hero')

Outer join

An outer join on hero will return all heroes found in both the left and right data frames. Any missing values are filled in with NaN.


In [ ]:
pd.merge(heroes, teams, how='outer', on='hero')

More than one match?

If the values in the columns to be matched don't uniquely identify a row, then a cartesian product is formed in the merge. For example, notice that firestorm has two different egos, so information from heroes had to be duplicated in the merge, once for each ego.


In [ ]:
pd.merge(heroes, identities, how='inner', 
         left_on='hero', right_on='alter-ego')

Missing Values

As shown in lecture, there are a multitude of reasons why a data set might have missing values. The current implementation of Pandas uses the numpy NaN to represent these null values (older implementations even used -inf and inf). Future versions of Pandas might implement a true null value---keep your eyes peeled for this in updates! More information can be found at http://pandas.pydata.org/pandas-docs/stable/missing_data.html

Because of the specialness of missing values, they merit their own set of tools. For this lab, we will focus on detection. For replacement, see the docs.
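For a quick taste of what replacement looks like (a sketch only; not needed for this lab), the fillna and dropna methods cover the two most common cases:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0)   # replace missing values with a constant
dropped = s.dropna()   # or remove them entirely
```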


In [ ]:
x = np.nan
y = pd.merge(heroes, teams, how='outer', on='hero')['first_season']
y

To check if a value is null, we use the isnull() method for series and data frames. Alternatively, there is a pd.isnull() function as well.


In [ ]:
x.isnull() # won't work since x is neither a series nor a data frame

In [ ]:
pd.isnull(x)

In [ ]:
y.isnull()

In [ ]:
pd.isnull(y)

Since filtering out missing data is such a common operation, Pandas also has conveniently included the analogous notnull() methods and function for improved human readability.


In [ ]:
y.notnull()

In [ ]:
y[y.notnull()]

Practice Set 1

Consider the "complete" data set shown below. Note that the rows are indexed by the superheroes' names.


In [ ]:
heroes_complete = pd.merge(heroes, identities, left_on='hero', right_on='alter-ego')
heroes_complete = pd.merge(heroes_complete, teams, how='outer', on='hero')
heroes_complete.set_index('hero', inplace=True)
heroes_complete

What are the outputs of the following calls? State what is wrong with the ones that will produce errors and propose a fix. To challenge yourself, try to do this exercise without running any commands.

heroes_complete[flash]


In [ ]:
heroes_complete.loc['flash'] # SOLUTION

heroes_complete[1, 3]


In [ ]:
heroes_complete.iloc[1, 3:5] # SOLUTION

heroes_complete.loc['first_seen_on':'team']


In [ ]:
heroes_complete.loc[:, 'first_seen_on':'team'] # SOLUTION

heroes_complete.iloc[1:3]


In [ ]:
heroes_complete.iloc[1:3] # SOLUTION

heroes_complete.iloc[1, 1]


In [ ]:
heroes_complete.iloc[1, 1] # SOLUTION

heroes_complete.loc[heroes_complete['color'].isin(['red', 'black'])]


In [ ]:
heroes_complete.loc[heroes_complete['color'].isin(['red', 'black'])] # SOLUTION

heroes_complete.loc[1, 'color']


In [ ]:
heroes_complete['color'].iloc[1] # SOLUTION

heroes_complete[heroes_complete['first_season'] % 2 == 0]


In [ ]:
heroes_complete[heroes_complete['first_season'] % 2 == 0] # SOLUTION

heroes_complete[heroes_complete['color'] == np.nan]


In [ ]:
heroes_complete[heroes_complete['color'].isnull()] # SOLUTION

Can you propose a fix to any of the broken ones above?

Practice Set 2

The practice problems below use the Department of Transportation's "On-Time" flight data for all flights originating from SFO or OAK in January 2016. Information about the variables can be found in the readme.html file. Information about the airports and airlines is contained in the comma-delimited files airports.dat and airlines.dat, respectively. Both were sourced from http://openflights.org/data.html

Disclaimer: There is a more direct way of dealing with time data that is not presented in these problems. This activity is merely an academic exercise.
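For the curious, that more direct route is pandas' datetime machinery. A sketch of one way it could look, assuming military-time floats like those in flights:

```python
import pandas as pd

# Military-time floats, e.g. 1303.0 means 1:03pm
times = pd.Series([5.0, 1303.0, 2359.0])

# Zero-pad to four digits ('0005', '1303', ...) and parse as hours+minutes
parsed = pd.to_datetime(times.astype(int).astype(str).str.zfill(4),
                        format='%H%M')

hours = parsed.dt.hour      # 0, 13, 23
minutes = parsed.dt.minute  # 5, 3, 59
```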


In [ ]:
flights = pd.read_csv("flights.dat", dtype={'sched_dep_time': 'f8', 'sched_arr_time': 'f8'})
flights.head()

In [ ]:
airports_cols = [
    'openflights_id',
    'name',
    'city',
    'country',
    'iata',
    'icao',
    'latitude',
    'longitude',
    'altitude',
    'tz',
    'dst',
    'tz_olson',
    'type',
    'airport_dsource'
]

airports = pd.read_csv("airports.dat", names=airports_cols)
airports.head()

Question 1

It looks like the departure and arrival times were read in as floating-point numbers. Write two functions, extract_hour and extract_mins, that extract the hour and minute information, respectively, from military time. Hint: You may want to use modular arithmetic and integer division.


In [ ]:
def extract_hour(time):
    """
    Extracts hour information from military time.
    
    Args: 
        time (float64): array of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): array of input dimension with hour information.  
          Should only take on integer values in 0-23
    """
    return time // 100 # SOLUTION

In [ ]:
def extract_mins(time):
    """
    Extracts minute information from military time
    
    Args: 
        time (float64): array of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): array of input dimension with minute information.  
          Should only take on integer values in 0-59
    """
    return time % 100 # SOLUTION

Question 2

Using your two functions above, filter the flights data for flights that departed 15 or more minutes later than scheduled. You need not worry about flights that were delayed to the next day for this question.


In [ ]:
def convert_to_minofday(time):
    """
    Converts military time to minute of day
    
    Args:
        time (float64): array of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): array of input dimension with minute of day
    
    Example: 1:03pm is converted to 783.0
    >>> convert_to_minofday(1303.0)
    783.0
    """
    return extract_hour(time) * 60 + extract_mins(time) # SOLUTION

def calc_time_diff(x, y):
    """
    Calculates delay times y - x
    
    Args:
        x (float64): array of scheduled time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
        y (float64): array of same dimensions giving actual time
    
    Returns:
        array (float64): array of input dimension with delay time
    """
    
    scheduled = convert_to_minofday(x) # SOLUTION
    actual = convert_to_minofday(y) # SOLUTION
    
    return actual - scheduled # SOLUTION

In [ ]:
delay = calc_time_diff(flights['sched_dep_time'], flights['actual_dep_time']) # SOLUTION
delayed15 = flights[delay >= 15] # SOLUTION

Question 3

Using your answer from question 2, find the full name of every destination city with a flight from SFO or OAK that was delayed by 15 or more minutes. The airport codes used in flights are IATA codes. Sort the cities alphabetically.


In [ ]:
delayed_airports = airports['iata'].isin(delayed15['destination']) # SOLUTION
delayed_destinations = airports.loc[delayed_airports, 'city'].sort_values() # SOLUTION

Question 4

Find the tail number of the top ten planes, measured by number of destinations the plane flew to in January. You may find drop_duplicates and sort_values helpful.


In [ ]:
top10 = (
    flights[['tailnum', 'destination']]
    .drop_duplicates()
    .groupby('tailnum')
    .size()
    .sort_values(ascending=False)
    .head(10)
)


Challenge

Add a new column to airports called sfo_dep_delay_avg that contains the average departure delay in January for flights originating from SFO.


In [ ]:
flights['delay'] = calc_time_diff(flights['sched_dep_time'], flights['actual_dep_time'])
avg_delay = (
    flights[flights['origin'] == 'SFO']
    .groupby('destination')
    .agg({'delay': np.mean})
    .rename(columns={'delay': 'sfo_dep_delay_avg'})
)
airports = pd.merge(airports, avg_delay, how='left', left_on='iata', right_index=True)

Let's take a look at our non-null results. Do any of the delay values catch your eye?

...


In [ ]:
airports.loc[pd.notnull(airports['sfo_dep_delay_avg']), ['name', 'city', 'sfo_dep_delay_avg']]\
    .sort_values('sfo_dep_delay_avg', ascending=False)

Submission

Run the cell below to submit the lab. You may resubmit as many times as you want. We will be grading you on effort/completion.


In [ ]:
I_totally_did_everything=True

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()