In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%load_ext autoreload
%autoreload 2
%matplotlib inline

import seaborn as sns
import pandas as pd
import requests

from tweet_parser.tweet import Tweet
from gapi import gnipapi
from gapi.gnipapi import *



This will lean heavily on Tom Augspurger's excellent series on Modern Pandas.

Quote:

Method chaining, where you call methods on an object one after another, is in vogue at the moment. It's always been a style of programming that's been possible with pandas, and over the past several releases, we've added methods that enable even more chaining.

- assign (0.16.0): For adding new columns to a DataFrame in a chain (inspired by dplyr's mutate)
- pipe (0.16.2): For including user-defined methods in method chains.
- rename (0.18.0): For altering axis names (in addition to changing the actual labels as before).
- Window methods (0.18): Took the top-level pd.rolling_* and pd.expanding_* functions and made them NDFrame methods with a groupby-like API.
- Resample (0.18.0): Added a new groupby-like API.
- .where/mask/Indexers accept Callables (0.18.1): In the next release you'll be able to pass a callable to the indexing methods, to be evaluated within the DataFrame's context (like .query, but with code instead of strings).

My scripts will typically start off with a large-ish chain at the start getting things into a manageable state. It's good to have the bulk of your munging done right away so you can start to do Science™:
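As a minimal sketch of what such an opening chain can look like (the file and column names here are made up for illustration):

.python
import pandas as pd

# Hypothetical opening "munging" chain: read, normalize names, derive a
# column, and filter, all in one expression.
df = (pd.read_csv("data/raw.csv", parse_dates=["event_date"])
        .rename(columns=str.lower)
        .dropna(subset=["event_date"])
        .assign(day=lambda d: d["event_date"].dt.dayofweek)
        .query("day < 5"))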


Part of the goal will be to develop different coding styles with pandas, moving from a script-ish, verbose approach to a piped style in which discrete cleaning operations are grouped into single functions. This style also carries over well to pyspark, whose DataFrame API requires that kind of chaining and overlaps heavily with pandas' DataFrame methods.

Method chaining is a popular style in programming these days, helped along by the rise of functional languages that make function composition more readable. Examples of this in various languages:

.scala
def fooNotIndent: List[Int] =
  (1 to 100).view.map { _ + 3 }.filter { _ > 10 }.flatMap { table.get }.take(3).toList

def fooIndent: List[Int] =
  (1 to 100)
    .view
    .map { _ + 3 }
    .filter { _ > 10 }
    .flatMap { table.get }
    .take(3)
    .toList

or comparing (from TA's post)

tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            jack),
        "crown"),
    "jill"
)

with (from TA's post)

jack_jill %>%
    went_up("hill") %>%
    fetch("water") %>%
    fell_down("jack") %>%
    broke("crown") %>%
    tumble_after("jill")

or

(jack_jill
    .pipe(went_up, "hill")
    .pipe(fetch, "water")
    .pipe(fell_down, "jack")
    .pipe(broke, "crown")
    .pipe(tumble_after, "jill")
)

There are several cases I'd like to address in this session -

  1. Effective pandas usage
  2. Interactive development strategies
  3. Balancing exploration and reproducibility
  4. Joining heterogeneous datatypes

This might be a lot for a single session, but hey.

Let's start off with a problem that we might have that we can try to answer:

What airports or flights have issues with delays?

I am purposefully choosing a dataset that will be difficult to join with Twitter data, partly to illustrate a point...

Let's follow Tom's use of the BTS airline delay dataset, which requires a bit of work to obtain and parse.

In data work, I have a loose set of semantics to describe stages in a workflow:

Research and understanding

  • understand the question of interest
  • understand the data sources available
  • evaluate requirements from stakeholders
  • think about available methods and timeframes for implementation
  • judge the final output (serialized data, a production model, figures / slides / a report, etc.)

data gathering / pull

  • a single stage of collecting data from some source, be it scraping a website, querying some form of database, reading from a CSV or other file, etc.

trivial cleaning

  • various data sources return data in formats that you may not want: weird variable names, CamelCase that you want lowercased, or other very early-stage operations that you define to ease the rest of the process, e.g., renaming columns with spaces in them to use a consistent delimiter such as `_`. This can often be integrated into the basic data pull step, but should be explicit for reproducibility purposes.
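A hedged sketch of what that might look like, folded into the read (the frame and column names here are made up for illustration):

.python
import pandas as pd

# Hypothetical example of trivial cleaning: normalize column names so the
# rest of the pipeline never deals with spaces or mixed case.
raw = pd.DataFrame({"Flight Date": ["2017-01-01"], "DepDelay": [5]})

clean = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
# clean.columns is now ['flight_date', 'depdelay']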

Non-trivial cleaning / preprocessing

  • There might be missing data, or multiple and non-standard representations of NULL values (99s, the string 'nan', etc.); finding and handling these is crucial
  • for JSON or similar formats, nested data structures might be present
  • datetime parsing and conversion, which is often important to the underlying analysis
  • reshaping data to facilitate analysis (wide to long, stacked to unstacked, etc.)
  • frequency normalization for time-series data
  • text cleaning or tokenization
  • joining additional data sources with the current data (which has already been gathered and pulled); a few of these steps are sketched after this list
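A minimal sketch of a couple of these steps, chained as dataframe-in / dataframe-out operations (the frame and column names are made up, not the flights data):

.python
import numpy as np
import pandas as pd

# Hypothetical non-trivial cleaning: sentinel NULL replacement, datetime
# parsing, and a wide-to-long reshape.
raw = pd.DataFrame({"id": [1, 2],
                    "ts": ["2017-01-01 0905", "nan"],
                    "delay": [99, 12],
                    "taxi": [7, 9]})

clean = (raw
         .replace({"ts": {"nan": np.nan}, "delay": {99: np.nan}})  # non-standard NULLs
         .assign(ts=lambda d: pd.to_datetime(d["ts"], format="%Y-%m-%d %H%M",
                                             errors="coerce"))     # datetime parsing
         .melt(id_vars=["id", "ts"], var_name="metric", value_name="value"))  # wide -> long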

exploratory analysis

  • with "clean" data, you can begin to poke at questions, from basic summarization and counting to faceted charts if it makes sense for the data
  • may include branching by reformatting your data into a different set (single operations to aggregated buckets, denormalized time-series stats, etc)
  • lots of plotting and descriptions
  • you'll often loop back to previous stages to gather more data or to stabilize the workflow once you find good features or ways of processing the data

modeling / prediction

  • potential input to ML functions, which includes sampling and so forth
  • output could be predicted values for a database or downstream operation, figures and text for a report, etc.

It's important to note that these stages are NEVER LINEAR, even though it almost always looks like it to the end consumer of posts like this. Each stage can be non-trivial for a host of reasons, and choices made in the early stages have strong effects in the rest of the process. Lots of iteration might be needed in each stage, and managing technical debt here can make each iteration faster.

Given these tasks, it seems logical to define our code in similar stages, though I have no precise guides for how to do this. In our example for today, we can start by grabbing some data from the web. We'll follow TA's flights data grab and later add some tweet data via the Gnip API.

Let's play

What type of questions do we want to answer? Let's say we have a project that will investigate customer sentiment around airports / airlines. Perhaps some questions of interest are:

  • Are customer moods affected heavily by flight delays?
  • Are customers more likely to tweet due to flight delays?
  • Do these effects vary by airport, airlines, or other factors?

What type of data will we need? Probably some detailed data about flight delays, preferably flight-level data that includes information about the carrier, airport, destination, etc. We'll also need tweet data that reasonably matches these criteria, of course. In this context, we'll probably be satisfied with simple exploration.

I chose these questions partially due to the great series of posts by Augspurger that illustrate working with complex data, such as flight delays. :)

Data Gathering

For most notebook purposes, I like to include all imports at the top of the notebook, even if the code they enable will not be introduced until a later point. I like to keep this cell apart from the others in the notebook for easy maintenance.


In [7]:
import os
import zipfile

import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In the original example from Tom, the code is written out like this:

.python
headers = {
    'Referer': 'https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time',
    'Origin': 'https://www.transtats.bts.gov',
    'Content-Type': 'application/x-www-form-urlencoded',
}

params = (
    ('Table_ID', '236'),
    ('Has_Group', '3'),
    ('Is_Zipped', '0'),
)

data = <TRUNCATED>

os.makedirs('data', exist_ok=True)
dest = "data/flights.csv.zip"

if not os.path.exists(dest):
    r = requests.post('https://www.transtats.bts.gov/DownLoad_Table.asp',
                      headers=headers, params=params, data=data, stream=True)

    with open("data/flights.csv.zip", 'wb') as f:
        for chunk in r.iter_content(chunk_size=102400): 
            if chunk:
                f.write(chunk)

Given our focus today, let's wrap all initial data pulling into a function for logical separation.


In [8]:
def maybe_pull_airport_data():
    """
    lightly modified from TA's post.
    
    """
    headers = {
        'Referer': 'https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time',
        'Origin': 'https://www.transtats.bts.gov',
        'Content-Type': 'application/x-www-form-urlencoded',
    }

    params = (
        ('Table_ID', '236'),
        ('Has_Group', '3'),
        ('Is_Zipped', '0'),
    )
    
    # query string to be sent. can modify the 'where' dates to change the size of data returned.

    data = "UserTableName=On_Time_Performance&DBShortName=On_Time&RawDataTable=T_ONTIME&sqlstr=+SELECT+FL_DATE%2CUNIQUE_CARRIER%2CAIRLINE_ID%2CTAIL_NUM%2CFL_NUM%2CORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CORIGIN%2CORIGIN_CITY_NAME%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID%2CDEST%2CDEST_CITY_NAME%2CCRS_DEP_TIME%2CDEP_TIME%2CDEP_DELAY%2CTAXI_OUT%2CWHEELS_OFF%2CWHEELS_ON%2CTAXI_IN%2CCRS_ARR_TIME%2CARR_TIME%2CARR_DELAY%2CCANCELLED%2CCANCELLATION_CODE%2CCARRIER_DELAY%2CWEATHER_DELAY%2CNAS_DELAY%2CSECURITY_DELAY%2CLATE_AIRCRAFT_DELAY+FROM++T_ONTIME+WHERE+YEAR%3D2017&varlist=FL_DATE%2CUNIQUE_CARRIER%2CAIRLINE_ID%2CTAIL_NUM%2CFL_NUM%2CORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CORIGIN%2CORIGIN_CITY_NAME%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID%2CDEST%2CDEST_CITY_NAME%2CCRS_DEP_TIME%2CDEP_TIME%2CDEP_DELAY%2CTAXI_OUT%2CWHEELS_OFF%2CWHEELS_ON%2CTAXI_IN%2CCRS_ARR_TIME%2CARR_TIME%2CARR_DELAY%2CCANCELLED%2CCANCELLATION_CODE%2CCARRIER_DELAY%2CWEATHER_DELAY%2CNAS_DELAY%2CSECURITY_DELAY%2CLATE_AIRCRAFT_DELAY&grouplist=&suml=&sumRegion=&filter1=title%3D&filter2=title%3D&geo=All%A0&time=January&timename=Month&GEOGRAPHY=All&XYEAR=2017&FREQUENCY=1&VarDesc=Year&VarType=Num&VarDesc=Quarter&VarType=Num&VarDesc=Month&VarType=Num&VarDesc=DayofMonth&VarType=Num&VarDesc=DayOfWeek&VarType=Num&VarName=FL_DATE&VarDesc=FlightDate&VarType=Char&VarName=UNIQUE_CARRIER&VarDesc=UniqueCarrier&VarType=Char&VarName=AIRLINE_ID&VarDesc=AirlineID&VarType=Num&VarDesc=Carrier&VarType=Char&VarName=TAIL_NUM&VarDesc=TailNum&VarType=Char&VarName=FL_NUM&VarDesc=FlightNum&VarType=Char&VarName=ORIGIN_AIRPORT_ID&VarDesc=OriginAirportID&VarType=Num&VarName=ORIGIN_AIRPORT_SEQ_ID&VarDesc=OriginAirportSeqID&VarType=Num&VarName=ORIGIN_CITY_MARKET_ID&VarDesc=OriginCityMarketID&VarType=Num&VarName=ORIGIN&VarDesc=Origin&VarType=Char&VarName=ORIGIN_CITY_NAME&VarDesc=OriginCityName&VarType=Char&VarDesc=OriginState&VarType=Char&VarDesc=OriginStateFips&VarType=Char&VarDesc=OriginStateName&VarType=Char&VarDesc=OriginWac&VarType=Num&VarName=DEST_AIRPORT_ID&VarDesc=DestAirportID&VarType=Num&VarName=DEST_AIRPORT_SEQ_ID&VarDesc=DestAirportSeqID&VarType=Num&VarName=DEST_CITY_MARKET_ID&VarDesc=DestCityMarketID&VarType=Num&VarName=DEST&VarDesc=Dest&VarType=Char&VarName=DEST_CITY_NAME&VarDesc=DestCityName&VarType=Char&VarDesc=DestState&VarType=Char&VarDesc=DestStateFips&VarType=Char&VarDesc=DestStateName&VarType=Char&VarDesc=DestWac&VarType=Num&VarName=CRS_DEP_TIME&VarDesc=CRSDepTime&VarType=Char&VarName=DEP_TIME&VarDesc=DepTime&VarType=Char&VarName=DEP_DELAY&VarDesc=DepDelay&VarType=Num&VarDesc=DepDelayMinutes&VarType=Num&VarDesc=DepDel15&VarType=Num&VarDesc=DepartureDelayGroups&VarType=Num&VarDesc=DepTimeBlk&VarType=Char&VarName=TAXI_OUT&VarDesc=TaxiOut&VarType=Num&VarName=WHEELS_OFF&VarDesc=WheelsOff&VarType=Char&VarName=WHEELS_ON&VarDesc=WheelsOn&VarType=Char&VarName=TAXI_IN&VarDesc=TaxiIn&VarType=Num&VarName=CRS_ARR_TIME&VarDesc=CRSArrTime&VarType=Char&VarName=ARR_TIME&VarDesc=ArrTime&VarType=Char&VarName=ARR_DELAY&VarDesc=ArrDelay&VarType=Num&VarDesc=ArrDelayMinutes&VarType=Num&VarDesc=ArrDel15&VarType=Num&VarDesc=ArrivalDelayGroups&VarType=Num&VarDesc=ArrTimeBlk&VarType=Char&VarName=CANCELLED&VarDesc=Cancelled&VarType=Num&VarName=CANCELLATION_CODE&VarDesc=CancellationCode&VarType=Char&VarDesc=Diverted&VarType=Num&VarDesc=CRSElapsedTime&VarType=Num&VarDesc=ActualElapsedTime&VarType=Num&VarDesc=AirTime&VarType=Num&VarDesc=Flights&VarType=Num&VarDesc=Distance&VarType=Nu
m&VarDesc=DistanceGroup&VarType=Num&VarName=CARRIER_DELAY&VarDesc=CarrierDelay&VarType=Num&VarName=WEATHER_DELAY&VarDesc=WeatherDelay&VarType=Num&VarName=NAS_DELAY&VarDesc=NASDelay&VarType=Num&VarName=SECURITY_DELAY&VarDesc=SecurityDelay&VarType=Num&VarName=LATE_AIRCRAFT_DELAY&VarDesc=LateAircraftDelay&VarType=Num&VarDesc=FirstDepTime&VarType=Char&VarDesc=TotalAddGTime&VarType=Num&VarDesc=LongestAddGTime&VarType=Num&VarDesc=DivAirportLandings&VarType=Num&VarDesc=DivReachedDest&VarType=Num&VarDesc=DivActualElapsedTime&VarType=Num&VarDesc=DivArrDelay&VarType=Num&VarDesc=DivDistance&VarType=Num&VarDesc=Div1Airport&VarType=Char&VarDesc=Div1AirportID&VarType=Num&VarDesc=Div1AirportSeqID&VarType=Num&VarDesc=Div1WheelsOn&VarType=Char&VarDesc=Div1TotalGTime&VarType=Num&VarDesc=Div1LongestGTime&VarType=Num&VarDesc=Div1WheelsOff&VarType=Char&VarDesc=Div1TailNum&VarType=Char&VarDesc=Div2Airport&VarType=Char&VarDesc=Div2AirportID&VarType=Num&VarDesc=Div2AirportSeqID&VarType=Num&VarDesc=Div2WheelsOn&VarType=Char&VarDesc=Div2TotalGTime&VarType=Num&VarDesc=Div2LongestGTime&VarType=Num&VarDesc=Div2WheelsOff&VarType=Char&VarDesc=Div2TailNum&VarType=Char&VarDesc=Div3Airport&VarType=Char&VarDesc=Div3AirportID&VarType=Num&VarDesc=Div3AirportSeqID&VarType=Num&VarDesc=Div3WheelsOn&VarType=Char&VarDesc=Div3TotalGTime&VarType=Num&VarDesc=Div3LongestGTime&VarType=Num&VarDesc=Div3WheelsOff&VarType=Char&VarDesc=Div3TailNum&VarType=Char&VarDesc=Div4Airport&VarType=Char&VarDesc=Div4AirportID&VarType=Num&VarDesc=Div4AirportSeqID&VarType=Num&VarDesc=Div4WheelsOn&VarType=Char&VarDesc=Div4TotalGTime&VarType=Num&VarDesc=Div4LongestGTime&VarType=Num&VarDesc=Div4WheelsOff&VarType=Char&VarDesc=Div4TailNum&VarType=Char&VarDesc=Div5Airport&VarType=Char&VarDesc=Div5AirportID&VarType=Num&VarDesc=Div5AirportSeqID&VarType=Num&VarDesc=Div5WheelsOn&VarType=Char&VarDesc=Div5TotalGTime&VarType=Num&VarDesc=Div5LongestGTime&VarType=Num&VarDesc=Div5WheelsOff&VarType=Char&VarDesc=Div5TailNum&VarType=Char"

    os.makedirs('data', exist_ok=True)
    dest = "data/flights.csv.zip"

    if not os.path.exists(dest):
        r = requests.post('https://www.transtats.bts.gov/DownLoad_Table.asp',
                          headers=headers, params=params, data=data, stream=True)

        with open("data/flights.csv.zip", 'wb') as f:
            for chunk in r.iter_content(chunk_size=102400): 
                if chunk:
                    f.write(chunk)
    

    zf = zipfile.ZipFile("data/flights.csv.zip")
    fp = zf.extract(zf.filelist[0].filename, path='data/')
    df = (pd
          .read_csv(fp, parse_dates=["FL_DATE"])
          .rename(columns=str.lower) #note this takes a callable
         )
    return df

Our function may be a bit sloppy from a DRY standpoint, but let's be serious: there is no need for arguments in this function, and no other piece of our analysis will ever touch the fields inside it. You could argue that it should take a flexible filename option; for the purposes of this demo that might be overkill, but refactoring the function to take a filename argument would take a minute or two now that the core logic is stable. This gives us a high-level entry point for our demo: a single call to the function.
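If we did want that flexibility, one option is to pull the "download if missing" piece out into a small generic helper and pass the destination in. A hedged sketch (the helper name and signature are my own, not part of the original function):

.python
import os
import requests

def maybe_download(url, dest, **request_kwargs):
    """POST to `url` and stream the response to `dest`, skipping the
    request entirely if `dest` already exists. Returns the destination path."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    if not os.path.exists(dest):
        r = requests.post(url, stream=True, **request_kwargs)
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=102400):
                if chunk:
                    f.write(chunk)
    return dest

maybe_pull_airport_data could then call maybe_download with the headers, params, and data payload above and a caller-supplied destination, keeping the unzip / read_csv logic unchanged.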

Imagine this was going to go into a much larger set of functions or a library of some sort -- the function can be moved to a python file and work out of the box, which can simplify your notebook or code at the risk of adding more dependencies for users and disrupting the flow of analysis for a technical consumer.

That was a lot of crap - let's get back to the data.


In [9]:
flights = maybe_pull_airport_data()

flights.head()
flights.shape
flights.info()


Out[9]:
fl_date unique_carrier airline_id tail_num fl_num origin_airport_id origin_airport_seq_id origin_city_market_id origin origin_city_name ... arr_time arr_delay cancelled cancellation_code carrier_delay weather_delay nas_delay security_delay late_aircraft_delay unnamed: 32
0 2017-01-01 AA 19805 N3CGAA 307 11292 1129202 30325 DEN Denver, CO ... 1328.0 -17.0 0.0 NaN NaN NaN NaN NaN NaN NaN
1 2017-01-01 AA 19805 N3CGAA 307 14107 1410702 30466 PHX Phoenix, AZ ... 1653.0 -8.0 0.0 NaN NaN NaN NaN NaN NaN NaN
2 2017-01-01 AA 19805 N3EVAA 309 11278 1127803 30852 DCA Washington, DC ... 930.0 -14.0 0.0 NaN NaN NaN NaN NaN NaN NaN
3 2017-01-01 AA 19805 N3NAAA 310 13303 1330303 32467 MIA Miami, FL ... 1646.0 -14.0 0.0 NaN NaN NaN NaN NaN NaN NaN
4 2017-01-01 AA 19805 N3FCAA 311 14107 1410702 30466 PHX Phoenix, AZ ... 107.0 28.0 0.0 NaN 0.0 0.0 12.0 0.0 16.0 NaN

5 rows × 33 columns

Out[9]:
(3307279, 33)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3307279 entries, 0 to 3307278
Data columns (total 33 columns):
fl_date                  datetime64[ns]
unique_carrier           object
airline_id               int64
tail_num                 object
fl_num                   int64
origin_airport_id        int64
origin_airport_seq_id    int64
origin_city_market_id    int64
origin                   object
origin_city_name         object
dest_airport_id          int64
dest_airport_seq_id      int64
dest_city_market_id      int64
dest                     object
dest_city_name           object
crs_dep_time             int64
dep_time                 float64
dep_delay                float64
taxi_out                 float64
wheels_off               float64
wheels_on                float64
taxi_in                  float64
crs_arr_time             int64
arr_time                 float64
arr_delay                float64
cancelled                float64
cancellation_code        object
carrier_delay            float64
weather_delay            float64
nas_delay                float64
security_delay           float64
late_aircraft_delay      float64
unnamed: 32              float64
dtypes: datetime64[ns](1), float64(15), int64(10), object(7)
memory usage: 832.7+ MB

Digression on pandas

Woo, we have a moderate-sized dataframe with a lot of columns, many of which are NaN or non-intuitive. Let's define a multi-level index on the data, which assigns metadata to each row and allows for fancy selection and operations along the way.


In [10]:
hdf = flights.set_index(["unique_carrier", "origin", "dest", "tail_num", "fl_date"]).sort_index()
hdf[hdf.columns[:4]].head()


Out[10]:
airline_id fl_num origin_airport_id origin_airport_seq_id
unique_carrier origin dest tail_num fl_date
AA ABQ DFW N3AAAA 2017-02-07 19805 1353 10140 1014003
N3ABAA 2017-01-15 19805 2611 10140 1014003
2017-01-29 19805 1282 10140 1014003
2017-02-10 19805 1353 10140 1014003
2017-07-04 19805 2214 10140 1014003

I think this clears up some thinking about the rows -- indexing by the carrier, the origin airport, the destination airport, the plane's tail number, and the date of the flight makes it clear what each row represents.

Selecting the data out in useful ways is somewhat straightforward, using the .loc semantics, which allow for label-oriented indexing in a dataframe.


In [11]:
hdf.loc[["AA"], ["dep_delay"]].head()


Out[11]:
dep_delay
unique_carrier origin dest tail_num fl_date
AA ABQ DFW N3AAAA 2017-02-07 -7.0
N3ABAA 2017-01-15 0.0
2017-01-29 -3.0
2017-02-10 -1.0
2017-07-04 -8.0

What if we wanted to get ANY flight from Denver to Albuquerque? Pandas' IndexSlice is a brilliant help here.

The semantics work as follows:

: is "include all labels from this level of the index

hdf.loc[pd.IndexSlice[:, ["DEN"], ["ABQ"]], ["dep_delay"]]

translates (in pseudo-code) to

hdf.loc[pd.IndexSlice[ALL_CARRIERS, origin=["DEN"], dest=["ABQ"], ALL_TAILS, ALL_DATES], ["dep_delay"]]


In [12]:
(hdf.loc[pd.IndexSlice[:, ["DEN"], ["ABQ"]],
         ["dep_delay"]]
 .sort_values("dep_delay", ascending=False)
 .head()
)


Out[12]:
dep_delay
unique_carrier origin dest tail_num fl_date
WN DEN ABQ N7721E 2017-07-30 203.0
UA DEN ABQ N24211 2017-02-16 194.0
WN DEN ABQ N422WN 2017-07-28 188.0
N7710A 2017-07-17 180.0
N756SA 2017-06-25 178.0

We can also use the powerful query method, which lets a limited expression vocabulary be executed against a dataframe and is wildly useful for writing slightly clearer operations.


In [13]:
(hdf
 .query("origin == 'DEN' and dest == 'ABQ'")
 .loc[:, "dep_delay"]
 .to_frame()
 .sort_values("dep_delay", ascending=False)
 .head()
)


Out[13]:
dep_delay
unique_carrier origin dest tail_num fl_date
WN DEN ABQ N7721E 2017-07-30 203.0
UA DEN ABQ N24211 2017-02-16 194.0
WN DEN ABQ N422WN 2017-07-28 188.0
N7710A 2017-07-17 180.0
N756SA 2017-06-25 178.0

These days, I prefer query most of the time, particularly for exploration, but .loc with explicit indices can be far faster in many cases.
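If you want to check that trade-off on your own data, a quick comparison in a notebook cell (using the hdf frame defined above) might be:

.python
# Rough timing comparison of the two equivalent selections shown above.
%timeit hdf.loc[pd.IndexSlice[:, ["DEN"], ["ABQ"]], ["dep_delay"]]
%timeit hdf.query("origin == 'DEN' and dest == 'ABQ'")[["dep_delay"]]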

back to gathering and inspection

So, at this stage, it seems reasonable to examine some of the data and see what might be problematic or needs further work.

We can see that several columns that should be datetimes are not ("dep_time", etc.), and that the city names are not really city names but City, State pairs. In Tom's series, the cleaning operations are done in a different post, but I will copy them here to fit our framework. I think these functions can safely be counted as advanced preprocessing and cleaning.


In [14]:
flights.head()
flights.dtypes


Out[14]:
fl_date unique_carrier airline_id tail_num fl_num origin_airport_id origin_airport_seq_id origin_city_market_id origin origin_city_name ... arr_time arr_delay cancelled cancellation_code carrier_delay weather_delay nas_delay security_delay late_aircraft_delay unnamed: 32
0 2017-01-01 AA 19805 N3CGAA 307 11292 1129202 30325 DEN Denver, CO ... 1328.0 -17.0 0.0 NaN NaN NaN NaN NaN NaN NaN
1 2017-01-01 AA 19805 N3CGAA 307 14107 1410702 30466 PHX Phoenix, AZ ... 1653.0 -8.0 0.0 NaN NaN NaN NaN NaN NaN NaN
2 2017-01-01 AA 19805 N3EVAA 309 11278 1127803 30852 DCA Washington, DC ... 930.0 -14.0 0.0 NaN NaN NaN NaN NaN NaN NaN
3 2017-01-01 AA 19805 N3NAAA 310 13303 1330303 32467 MIA Miami, FL ... 1646.0 -14.0 0.0 NaN NaN NaN NaN NaN NaN NaN
4 2017-01-01 AA 19805 N3FCAA 311 14107 1410702 30466 PHX Phoenix, AZ ... 107.0 28.0 0.0 NaN 0.0 0.0 12.0 0.0 16.0 NaN

5 rows × 33 columns

Out[14]:
fl_date                  datetime64[ns]
unique_carrier                   object
airline_id                        int64
tail_num                         object
fl_num                            int64
origin_airport_id                 int64
origin_airport_seq_id             int64
origin_city_market_id             int64
origin                           object
origin_city_name                 object
dest_airport_id                   int64
dest_airport_seq_id               int64
dest_city_market_id               int64
dest                             object
dest_city_name                   object
crs_dep_time                      int64
dep_time                        float64
dep_delay                       float64
taxi_out                        float64
wheels_off                      float64
wheels_on                       float64
taxi_in                         float64
crs_arr_time                      int64
arr_time                        float64
arr_delay                       float64
cancelled                       float64
cancellation_code                object
carrier_delay                   float64
weather_delay                   float64
nas_delay                       float64
security_delay                  float64
late_aircraft_delay             float64
unnamed: 32                     float64
dtype: object

In [15]:
def extract_city_name(df):
    '''
    Chicago, IL -> Chicago for origin_city_name and dest_city_name. From Augspurger.
    '''
    cols = ['origin_city_name', 'dest_city_name']
    city = df[cols].apply(lambda x: x.str.extract(r"(.*), \w{2}", expand=False))
    df = df.copy()
    df[['origin_city_name', 'dest_city_name']] = city
    return df


def time_to_datetime(df, columns):
    '''
    Combine all time items into datetimes. From Augspurger.

    2014-01-01,0914 -> 2014-01-01 09:14:00
    '''
    df = df.copy()
    def converter(col):
        timepart = (col.astype(str)
                       .str.replace(r'\.0$', '')  # NaNs force float dtype
                       .str.pad(4, fillchar='0'))
        return pd.to_datetime(df['fl_date'].astype("str") + ' ' +
                               timepart.str.slice(0, 2) + ':' +
                               timepart.str.slice(2, 4),
                               errors='coerce')
    df[columns] = df[columns].apply(converter)
    return df

Note that both functions accept a pandas.DataFrame and return a pandas.DataFrame. This is critical to our upcoming methodology, and for portability to Spark.

It seems obvious, but writing code that operates on immutable data structures is wildly useful for data processing. DataFrames are not immutable, but they can be treated as such: many operations implicitly return a copy, and our own functions can be written to do the same. With these functions, we can now create a new top-level function that handles our preprocessing.

It's not too often that your major performance bottleneck in pandas is copying dataframes.

Anyway, we can now integrate our simple gathering function with the cleaning functions into a new top-level entry point for our exploration.


In [16]:
def read_and_process_flights_data():
    drop_cols = ["unnamed: 32", "security_delay", "late_aircraft_delay",
                 "nas_delay", "origin_airport_id", "origin_city_market_id",
                 "taxi_out", "wheels_off", "wheels_on", "crs_arr_time", "crs_dep_time",
                 "carrier_delay"]
    df = (maybe_pull_airport_data()
          .rename(columns=str.lower)
          .drop(drop_cols, axis=1)
          .pipe(extract_city_name)
          .pipe(time_to_datetime, ['dep_time', 'arr_time'])
          .assign(fl_date=lambda x: pd.to_datetime(x['fl_date']),
                  dest=lambda x: pd.Categorical(x['dest']),
                  origin=lambda x: pd.Categorical(x['origin']),
                  tail_num=lambda x: pd.Categorical(x['tail_num']),
                  unique_carrier=lambda x: pd.Categorical(x['unique_carrier']),
                  cancellation_code=lambda x: pd.Categorical(x['cancellation_code'])))
    return df

In [17]:
## this will take a few minutes with the full 2017 data; far faster with a month's sample
flights = read_and_process_flights_data()

In [18]:
flights.tail()
flights.dtypes
flights.shape


Out[18]:
fl_date unique_carrier airline_id tail_num fl_num origin_airport_seq_id origin origin_city_name dest_airport_id dest_airport_seq_id ... dest dest_city_name dep_time dep_delay taxi_in arr_time arr_delay cancelled cancellation_code weather_delay
3307274 2017-07-31 F9 20436 N952FR 324 1129202 DEN Denver 12264 1226402 ... IAD Washington 2017-07-31 23:43:00 8.0 8.0 2017-07-31 05:00:00 4.0 0.0 NaN NaN
3307275 2017-07-31 F9 20436 N953FR 1877 1320402 MCO Orlando 11109 1110902 ... COS Colorado Springs 2017-07-31 09:35:00 40.0 7.0 2017-07-31 11:06:00 14.0 0.0 NaN NaN
3307276 2017-07-31 F9 20436 N953FR 1877 1110902 COS Colorado Springs 14107 1410702 ... PHX Phoenix 2017-07-31 11:51:00 11.0 4.0 2017-07-31 12:37:00 0.0 0.0 NaN NaN
3307277 2017-07-31 F9 20436 N953FR 1750 1410702 PHX Phoenix 11109 1110902 ... COS Colorado Springs 2017-07-31 13:14:00 -6.0 10.0 2017-07-31 16:11:00 4.0 0.0 NaN NaN
3307278 2017-07-31 F9 20436 N953FR 1750 1110902 COS Colorado Springs 13204 1320402 ... MCO Orlando 2017-07-31 16:55:00 0.0 9.0 2017-07-31 22:16:00 -16.0 0.0 NaN NaN

5 rows × 21 columns

Out[18]:
fl_date                  datetime64[ns]
unique_carrier                 category
airline_id                        int64
tail_num                       category
fl_num                            int64
origin_airport_seq_id             int64
origin                         category
origin_city_name                 object
dest_airport_id                   int64
dest_airport_seq_id               int64
dest_city_market_id               int64
dest                           category
dest_city_name                   object
dep_time                 datetime64[ns]
dep_delay                       float64
taxi_in                         float64
arr_time                 datetime64[ns]
arr_delay                       float64
cancelled                       float64
cancellation_code              category
weather_delay                   float64
dtype: object
Out[18]:
(3307279, 21)

Exploring data

From Tom's post, here's a long method chain that does an awful lot of work to generate a plot of flights per day for the top carriers. I'll break this down a bit after.


In [19]:
(flights
 .dropna(subset=['dep_time', 'unique_carrier'])
 .loc[flights['unique_carrier'].isin(flights['unique_carrier']
                                .value_counts()
                                .index[:5])]
 .set_index('dep_time')
 .groupby(['unique_carrier', pd.TimeGrouper("D")])
 ["fl_num"]
 .count()
 .unstack(0)
 .fillna(0)
 .rename_axis("Flights per Day", axis=1)
 .plot()
)


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1290f99b0>

If we broke this out like many people do, we might end up with code like this, where each step is broken into a variable.


In [20]:
# gets the carriers with the most traffic, hacking with the index. We use this for other ops. 
df_clean = flights.dropna(subset=["dep_time", "unique_carrier"])
top_carriers = flights["unique_carrier"].value_counts().index[:5]
df_clean = df_clean.query("unique_carrier in @top_carriers")
df_clean = df_clean.set_index("dep_time")

carriers_by_hour = (df_clean
                    .groupby(['unique_carrier',
                              pd.TimeGrouper("H")])["fl_num"]
                    .count())
carriers_df = carriers_by_hour.unstack(0)
carriers_df = carriers_df.fillna(0)
carriers_flights_per_day = (carriers_df
                            .rolling(24)
                            .sum()
                            .rename_axis("Flights per Day", axis=1))

carriers_flights_per_day.plot()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1226a07f0>

Naming things is hard. Given that pandas has extremely expressive semantics and nearly all analytic methods return a fresh dataframe or series, it is straightforward to chain many ops together. This style lends itself well to Spark and should be familiar to those of you who have worked with Scala or other functional languages.

If the chains get very verbose or hard to follow, break them up and put them in a function, where you can keep it all in one place. Try to be very specific about naming your functions (remember, naming things is hard, functions are no different).

In an exploratory context, you might keep adding methods onto your chain, expanding it until you get to your chart or end-stage goal. In some cases, saving some exploratory work to variables is great.
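As a sketch of that idea (the function name is mine, not from the original post), the flights-per-day chain from In [19] could be wrapped in a single, descriptively named, pipe-able function:

.python
# A sketch of wrapping the earlier chain in a named function that takes a
# dataframe and returns one, so it composes with .pipe.
def flights_per_day_for_top_carriers(df, n_carriers=5):
    top = df["unique_carrier"].value_counts().index[:n_carriers]
    return (df
            .dropna(subset=["dep_time", "unique_carrier"])
            .query("unique_carrier in @top")
            .set_index("dep_time")
            .groupby(["unique_carrier", pd.TimeGrouper("D")])["fl_num"]
            .count()
            .unstack(0)
            .fillna(0)
            .rename_axis("Flights per Day", axis=1))

flights.pipe(flights_per_day_for_top_carriers).plot()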

Let's briefly talk about the .assign operator. This operation returns a new dataframe with an added column, where the new column can be a constant, a like-indexed numpy array or series, a callable that references the dataframe in question, etc. It's very powerful in method chains and also very useful for keeping your namespace clean.

The semantics of

df.assign(NEW_COLUMN_NAME=lambda df: df["column"] + df["column2"])

can be read as: assign a column named "NEW_COLUMN_NAME" to the referenced dataframe that is the sum of "column" and "column2". In the flights example below, the lambda references the datetime accessor of the departure time column to extract the hour, which gives us a convenient categorical value for examination.

This is similar to R's mutate function in the dplyr world.
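For reference, here is a tiny made-up frame showing the three common inputs to assign (constant, like-indexed values, callable):

.python
import numpy as np
import pandas as pd

# Hypothetical frame illustrating assign's inputs; not part of the flights data.
df = pd.DataFrame({"column": [1, 2], "column2": [10, 20]})

out = df.assign(flag="a",                                     # constant
                noise=np.random.randn(len(df)),               # like-indexed array
                total=lambda d: d["column"] + d["column2"])   # callable on the frame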

Note -- the top_carriers variable above is a good example of something we might want to keep around, and I'll use it several times in the post.


In [21]:
# taken from Augspurger
(flights[['fl_date', 'unique_carrier', 'tail_num', 'dep_time', 'dep_delay']]
 .dropna()
 .query("unique_carrier in @top_carriers")
 .assign(hour=lambda x: x['dep_time'].dt.hour)
 #.query('5 <= dep_delay < 600')
 .pipe((sns.boxplot, 'data'), 'hour', 'dep_delay')
)


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x120392080>

This enables rapid exploration, and within the interactive context, allows you to copy a cell and change single lines to modify your results.

A heatmap might be a nice way to visualize categories in this data, and the assign syntax makes creating those categoricals seamless.


In [22]:
(flights[['fl_date', "unique_carrier", 'dep_time', 'dep_delay']]
 .dropna()
 .query("unique_carrier in @top_carriers")
 .assign(hour=lambda x: x.dep_time.dt.hour)
 .assign(day=lambda x: x.dep_time.dt.dayofweek)
 .query('-1 < dep_delay < 600')
 .groupby(["day", "hour"])["dep_delay"]
 .median()
 .unstack()
 .pipe((sns.heatmap, 'data'))
)


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x13d6e5f98>

In [23]:
(flights[['fl_date', 'unique_carrier', 'dep_time', 'dep_delay']]
 .query("unique_carrier in @top_carriers")
 .dropna()
 .assign(hour=lambda x: x.dep_time.dt.hour)
 .assign(day=lambda x: x.dep_time.dt.dayofweek)
 #.query('0 <= dep_delay < 600')
 .groupby(["unique_carrier", "day"])["dep_delay"]
 .mean()
 .unstack()
 .sort_values(by=0)
 .pipe((sns.heatmap, 'data'))
)


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1345cada0>

What about some other exploration? Pandas allows for some nifty ways of slicing up data to flexibly apply basic operations.

What if we want to "center" a carrier's delay time at an airport by the mean airport delay? This is a case where assigning variables might be useful. We'll limit our analysis to the top carriers / airports, and save some variables for further interactive use.


In [24]:
top_airport_codes = flights["origin"].value_counts().to_frame().head(5).index
top_airport_cities = flights["origin_city_name"].value_counts().head(5).index
top_airport_cities
top_airport_codes


Out[24]:
Index(['Atlanta', 'Chicago', 'Denver', 'Los Angeles', 'New York'], dtype='object')
Out[24]:
CategoricalIndex(['ATL', 'ORD', 'DEN', 'LAX', 'DFW'], categories=['ABE', 'ABI', 'ABQ', 'ABR', 'ABY', 'ACK', 'ACT', 'ACV', ...], ordered=False, dtype='category')

In [26]:
grand_airport_delay = (flights
 .query("unique_carrier in @top_carriers")
 .query("origin in @top_airport_codes")
 .groupby("origin")["dep_delay"] 
 .mean()
 .dropna()
 .to_frame()
)

airport_delay = (flights
.query("unique_carrier in @top_carriers")
.query("origin in @top_airport_codes")
 .set_index("fl_date")
 .groupby([pd.TimeGrouper("H"), "origin"])["dep_delay"] 
 .mean()
 .to_frame()
)

carrier_delay = (flights
.query("unique_carrier in @top_carriers")
.query("origin in @top_airport_codes")
 .set_index("fl_date")
 .groupby([pd.TimeGrouper("H"), "origin", "unique_carrier"])["dep_delay"] 
 .mean()
 .to_frame()
)

airport_delay.head()
carrier_delay.head()


Out[26]:
dep_delay
fl_date origin
2017-01-01 ATL 10.087558
DEN 7.821632
DFW 16.054201
LAX 20.769231
ORD 6.972477
Out[26]:
dep_delay
fl_date origin unique_carrier
2017-01-01 ATL AA 13.187500
DL 6.711900
OO 25.769231
UA 10.875000
WN 21.075630

In [27]:
grand_airport_delay
airport_delay.unstack().head()
carrier_delay.unstack(1).head()


Out[27]:
dep_delay
origin
ATL 12.177453
DEN 9.247587
DFW 12.203081
LAX 15.547959
ORD 13.054121
Out[27]:
dep_delay
origin ATL DEN DFW LAX ORD
fl_date
2017-01-01 10.087558 7.821632 16.054201 20.769231 6.972477
2017-01-02 30.086250 18.655417 22.660194 30.884615 16.823409
2017-01-03 7.546154 19.431942 13.342365 28.501035 15.668737
2017-01-04 4.622887 33.671674 9.635443 22.471239 9.388889
2017-01-05 7.346547 38.205240 16.155556 48.882353 16.287946
Out[27]:
dep_delay
origin ATL DEN DFW LAX ORD
fl_date unique_carrier
2017-01-01 AA 13.187500 35.961538 17.597633 40.432203 6.557576
DL 6.711900 6.600000 -3.125000 10.985714 8.750000
OO 25.769231 4.373016 5.166667 14.280899 6.135135
UA 10.875000 7.126437 -5.545455 11.950617 7.940789
WN 21.075630 6.983425 NaN 17.645455 NaN

Pandas handles alignment along axes, so we can do an operation along an axis against another dataframe that shares index labels.


In [28]:
(carrier_delay
 .unstack(1)
 .div(grand_airport_delay.unstack())
 .head()
 
)
(carrier_delay
 .unstack(1)
 .div(airport_delay.unstack())
 .head()
 
)


Out[28]:
dep_delay
origin ATL DEN DFW LAX ORD
fl_date unique_carrier
2017-01-01 AA 1.082944 3.888749 1.442065 2.600483 0.502338
DL 0.551174 0.713700 -0.256083 0.706570 0.670286
OO 2.116143 0.472882 0.423390 0.918506 0.469977
UA 0.893044 0.770627 -0.454431 0.768629 0.608298
WN 1.730709 0.755162 NaN 1.134905 NaN
Out[28]:
dep_delay
origin ATL DEN DFW LAX ORD
fl_date unique_carrier
2017-01-01 AA 1.307304 4.597703 1.096139 1.946736 0.940494
DL 0.665364 0.843814 -0.194653 0.528942 1.254934
OO 2.554556 0.559093 0.321826 0.687599 0.879908
UA 1.078061 0.911119 -0.345421 0.575400 1.138876
WN 2.089270 0.892835 NaN 0.849596 NaN

Putting that together, we can then get ratios of flight delays to the overall airport delay (grand mean or daily delays).


In [29]:
(carrier_delay
 .unstack(1)
 .div(airport_delay.unstack())
 .stack()
 .reset_index()
 .assign(day=lambda x: x["fl_date"].dt.dayofweek)
 .set_index("fl_date")
 .groupby(["unique_carrier", "day"])
 .mean()
 .dropna()
 .unstack()
 ["dep_delay"]
 .pipe((sns.heatmap, 'data'))
)


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1294b6d68>

In [30]:
(carrier_delay
 .unstack(1)
 .div(grand_airport_delay.unstack())
 .stack()
 .reset_index()
 .assign(day=lambda x: x["fl_date"].dt.dayofweek)
 .set_index("fl_date")
 .groupby(["unique_carrier", "day"])
 .mean()
 .dropna()
 .unstack()
 ["dep_delay"]
 .pipe((sns.heatmap, 'data'))
)


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x129e23ac8>

In [31]:
(carrier_delay
 .unstack(1)
 .subtract(airport_delay.unstack())
 .stack()
 .reset_index()
 .assign(day=lambda x: x["fl_date"].dt.dayofweek)
 .groupby(["unique_carrier", "day"])
 .mean()
 .dropna()
 .unstack()
 ["dep_delay"]
 .pipe((sns.heatmap, 'data'))
)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x12342deb8>

So, now that we have some working flight data, let's poke at getting some tweets.

Tweets

I've recently refactored the Python Gnip search API to be a bit more flexible, including making each search return a lazy stream. There are also some tools for programmatically generating rules.

The 'city name' column and the airport abbreviation are likely sources of help for finding tweets related to flights / airport data. We'll use those and define a small function to quickly generate our rules, which are somewhat simplistic but should serve as a reasonable start.


In [32]:
def generate_rules(codes, cities):
    base_rule = """
    ({code} OR "{city} airport") (flying OR flight OR plane OR jet)
    -(football OR
    basketball OR
    baseball OR
    party)
    -is:retweet
    """
    rules = []
    for code, city in zip(list(codes), list(cities)):
        _rule = base_rule.format(code=code, city=city.lower())
        rule = gen_rule_payload(_rule,
                                from_date="2017-01-01",
                                to_date="2017-07-31",
                                max_results=500)
        rules.append(rule)
    return rules

gnip_rules = generate_rules(top_airport_codes, top_airport_cities)
gnip_rules[0]


Out[32]:
'{"query":"(ATL OR \\"atlanta airport\\") (flying OR flight OR plane OR jet) -(football OR basketball OR baseball OR party) -is:retweet","maxResults":500,"toDate":"201707310000","fromDate":"201701010000"}'

The Gnip API has some functions to handle our connection information. Please ensure that the environment variable GNIP_PW is set with your password. If it isn't already set, you can set it here.


In [ ]:
# os.environ["GNIP_PW"] = ""

In [ ]:
username = "agonzales@twitter.com"
search_api = "fullarchive"
account_name = "shendrickson"
endpoint_label = "ogformat.json"
og_search_endpoint = gen_endpoint(search_api,
                                  account_name,
                                  endpoint_label,
                                  count_endpoint=False)
og_args = {"username": username,
           "password": os.environ["GNIP_PW"],
           "url": og_search_endpoint}

In our get_tweets function, we wrap some of the functionality of our result stream to collect specific data from tweets into a dataframe.


In [ ]:
def get_tweets(result_stream, label):
    fields = ["id", "created_at_datetime",
              "all_text", "hashtags", "user_id",
              "user_mentions", "screen_name"]
    
    tweet_extracts = []
    for tweet in result_stream.start_stream():
        attrs = [getattr(tweet, field) for field in fields]
        tweet_extracts.append(attrs)
        
    result_stream.end_stream()
    df = pd.DataFrame(tweet_extracts, columns=fields).assign(airport=label)
    return df

We can test this with a single rule.


In [ ]:
rs = ResultStream(**og_args, rule_payload=gnip_rules[0], max_results=1000)

In [ ]:
tweets = get_tweets(result_stream=rs, label=top_airport_codes[0])

In [ ]:
tweets.head()
tweets.shape

Now let's collect tweets for each airport. It might be a hair overkill, but I'll wrap the process up in a function so we have a similar high-level entry point for grabbing our initial data. It will take a minute to grab this data, and for the time being, I'm not going to save it to disk.


In [ ]:
def pull_tweet_data(gnip_rules, results_per_rule=25000):
    streams = [ResultStream(**og_args,
                            rule_payload=rp,
                            max_results=results_per_rule)
               for rp in gnip_rules]

    tweets = [get_tweets(rs, airport)
              for rs, airport
              in zip(streams, top_airport_codes)]
    
    return pd.concat(tweets)

In [ ]:
tweets = pull_tweet_data(gnip_rules)

Given our new data, let's do some quick exploration and cleaning.


In [ ]:
tweets.shape
tweets.head()

In [ ]:
(tweets
 .set_index("created_at_datetime")
 .groupby([pd.TimeGrouper("D")])
 .size()
 .sort_values()
 .tail()
)

In [ ]:
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
(tweets
 .set_index("created_at_datetime")
 .groupby([pd.TimeGrouper("D")])
 .size()
 .plot()
)
 
ax.annotate("united senselessly\nbeating a passenger",
            xytext=("2017-02-01", 1200),
            xy=("2017-04-04", 900),
            arrowprops=dict(facecolor="black", shrink=0.05))

The number of tweets per day by airport rule is a bit odd:


In [ ]:
(tweets
 .drop(["id", "all_text"], axis=1)
 .set_index("created_at_datetime")
 .groupby([pd.TimeGrouper("H"), "airport"])
 ["user_id"]
 .count()
 .unstack()
 .fillna(0)
 .rolling(24).sum()
 .plot()
)

So let's look at what is going on with LAX:


In [ ]:
tweets["airport"].value_counts()
tweets.groupby("airport")["created_at_datetime"].min().sort_values().tail(1)
min_lax_time = tweets.groupby("airport")["created_at_datetime"].min().sort_values().tail(1)[0]

Either far, far more people are tweeting about LAX than about the other airports, or the extra tweets are dominated by the spikes in the data. Given its size, this makes some sense, but I would question my rules a bit. Let's even out these samples a hair by selecting tweets only from the period where the LAX data exists.


In [ ]:
(tweets
.drop(["id", "all_text"], axis=1)
.query("created_at_datetime >= @min_lax_time")
 .set_index("created_at_datetime")
 .groupby([pd.TimeGrouper("D"), "airport"])
 ["user_id"]
 .count()
 .unstack()
 .fillna(0)
 #.rolling(7).mean()
 .plot()
)

Moving on, let's do some more things with our tweets, like parse out the mentions from the dict structure to something more useful.

We'll be making a function that takes a dataframe and returns one, so we can use it in the .pipe method.


In [ ]:
def parse_mentions(df):
    extract_mentions = lambda x: [d["name"] for d in x]
    mentions = (pd.DataFrame([x for x in df["user_mentions"]
                              .apply(extract_mentions)])
                .loc[:, [0, 1]]
                .rename(columns={0: "mention_1", 1: "mention_2"})
               )
    
    return (pd.merge(df,
                      mentions,
                      left_index=True,
                      right_index=True)
             .drop("user_mentions", axis=1)
            )

In [ ]:
airline_name_code_dict = {
    "Southwest Airlines": "WN",
    "Delta": "DL",
    "American Airlines": "AA",
    "United Airlines": "UA",
    "Sky": "Sk"
}

Now, what about labeling a row with strictly American Airlines mentions? We could do this a few ways...


In [ ]:
(tweets
 .pipe(parse_mentions)
 .assign(AA=lambda df: (df["mention_1"] == "American Airlines") |
         (df["mention_2"] == "American Airlines"))
 .query("AA == True")
 .head()
 

)

In [ ]:
(tweets
 .pipe(parse_mentions)
 .query("mention_1 == 'American Airlines' or mention_2 == 'American Airlines'")
 .shape
)

In [ ]:
(tweets
 .pipe(parse_mentions)
 .query("mention_1 == 'American Airlines' or mention_2 == 'American Airlines'")
 .query("created_at_datetime >= @min_lax_time")
 .set_index("created_at_datetime")
 .groupby([pd.TimeGrouper("D"), "airport"])
 ["user_id"]
 .count()
 .unstack()
 .fillna(0)
 .rolling(7).mean()
 .plot()
)

Moving on a bit, what about a simple sentiment model? We'll grab a word database that simply matches words to a value and use it as a simple baseline.


In [ ]:
from nltk.tokenize import TweetTokenizer

def get_affin_dict():
    url = "https://raw.githubusercontent.com/fnielsen/afinn/master/afinn/data/AFINN-111.txt"
    affin_words = (pd
                   .read_table(url,
                               sep='\t',
                               header=None)
                   .rename(columns={0: "word", 1: "score"})
                   .to_dict(orient="list")
                  )
    affin_words = {k: v for k, v in
                   zip(affin_words["word"],
                       affin_words["score"])}
    return affin_words



tknizer = TweetTokenizer()

def score_sentiment(words):
    words = set(words) 
    union = words & affin_words.keys()
    return sum([affin_words[w] for w in union])
    
def score_tweet(tweet_text):
    return score_sentiment(tknizer.tokenize(tweet_text))

affin_words = get_affin_dict()

In [ ]:
(tweets
 .assign(sentiment=lambda df: df["all_text"].apply(score_tweet))
 ["sentiment"]
 .plot.hist(bins=20))

In [ ]:
(tweets
 .assign(sentiment=lambda df: df["all_text"].apply(score_tweet))
 .pipe(lambda df: pd.concat([df.query("sentiment <= -5").head(),
                            df.query("sentiment >= 5").head()]))
)

Seems semi-reasonable to me!

Let's look at a timeseries of sentiment overall:


In [ ]:
(tweets
 .assign(sentiment=lambda df: df["all_text"].apply(score_tweet))
 .set_index("created_at_datetime")
 .groupby([pd.TimeGrouper("D")])
 ["sentiment"]
 .mean()
 .rolling(2).mean()
 .plot()
)

Since we have our reasonable sentiment and mentions data, let's assign it to a fresh dataframe and continue looking.

Note that a full data pull step at this point might look like

.python
tweets = (pull_tweet_data(gnip_rules)
          .pipe(parse_mentions)
          .assign(sentiment=lambda df: df["all_text"].apply(score_tweet))
         )

In [ ]:
tweets = (tweets
          .pipe(parse_mentions)
          .assign(sentiment=lambda df: df["all_text"].apply(score_tweet))
         )

And let's do some basic exploration of our tweet data.


In [ ]:
(tweets
 .groupby(["airport"])
 ["sentiment"]
 .mean()
 .sort_values()
 .plot.barh()
)

In [ ]:
(tweets
 .assign(day=lambda x: x.created_at_datetime.dt.dayofweek)
 .assign(hour=lambda x: x.created_at_datetime.dt.hour)
 .groupby(["day", "hour"])
 ["sentiment"]
 .mean()
 .unstack()
 .pipe((sns.heatmap, 'data') )
)

In [ ]:
(tweets
 .assign(hour=lambda x: x.created_at_datetime.dt.hour)
 .assign(day=lambda x: x.created_at_datetime.dt.dayofweek)
 .groupby(["airport", "day"])
 ["sentiment"]
 .mean()
 .unstack()
 .sort_values(by=0)
 .pipe((sns.heatmap, 'data') )
)

In [ ]:
(tweets
 .assign(hour=lambda x: x.created_at_datetime.dt.hour)
 .groupby(["airport", "hour"])
 ["sentiment"]
 .mean()
 .unstack()
 .sort_values(by=0)
 .pipe((sns.heatmap, 'data'))
)

In [ ]:
(tweets
 .assign(day=lambda x: x.created_at_datetime.dt.dayofweek)
 .query("airport == 'ATL' and day == 2")
 .sample(10)
 .all_text
       )

(tweets
 .assign(day=lambda x: x.created_at_datetime.dt.dayofweek)
 .query("airport == 'ATL' and day == 5")
 .sample(10)
 .all_text
       )

In [ ]:
tweet_sent_airport = (tweets
 .set_index("created_at_datetime")
 .groupby([pd.TimeGrouper("D"), "airport"])["sentiment"]
 .mean()
)

In [ ]:
delay_sent = (pd.concat([airport_delay, tweet_sent_airport],
                        axis=1,
                        names=("day", "airport"))
              .sort_index())

In [ ]:
for code in top_airport_codes:
    (delay_sent
     .loc[pd.IndexSlice[:, code], :]
     .plot(subplots=True, title="Sentiment and delay time at {}".format(code)))

In [ ]:
delay_sent.loc[pd.IndexSlice[:, "ATL"], :].corr()

In [ ]:
delay_sent.groupby(level=1).corr().T.loc["sentiment"].unstack()["dep_delay"].sort_values()