Data Tidying and Transformation in Python

by David DeTomaso, Diya Das, and Andrey Indukaev

The goal

Data tidying is a necessary first step for data analysis - it's the process of taking your messily formatted data (missing values, unwieldy coding/organization, etc.) and literally tidying it up so it can be easily used for downstream analyses. To quote Hadley Wickham, "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table."

These data are actually pretty tidy, so we're going to be focusing on cleaning and transformation, but these manipulations will give you some idea of how to tidy untidy data.

The datasets

We are going to be using the data from the R package nycflights13. There are five datasets corresponding to flights departing NYC in 2013. We will load directly into R from the library, but the repository also includes CSV files we created for the purposes of the Python demo.

Python requirements

For this tutorial we'll be using the following packages in Python

  • pandas (depends on numpy)
  • seaborn (depends on matplotlib)

You can install these with either pip or conda

pandas

Pandas is an extremely useful package for data-manipulation in python. It allows for a few things:

  • Mixed types in a data matrix
  • Non-numeric row/column indexes
  • Database-like join/merge/group-by operations
  • Data import/export to a variety of formats (text, Excel, JSON)

The core pandas object is a 'dataframe' - modeled after DataFrames in R


In [ ]:
from __future__ import print_function # For the python2 people
import pandas as pd # This is typically how pandas is loaded

Reading data from a file

Let's read data from a file

There are five tables we'll be using as part of the NYCFlights13 dataset

To view them, first extract the archive that comes with this repo

unzip nycflights13.zip

Now, to read them in as dataframes, we'll use the read_table function from pandas

This is a general purpose function for reading tabular data in a text file format. If you follow the link, you can see that there are many configurable options. We'll just use the defaults (assumes tab-delimited)


In [ ]:
airlines = pd.read_table("airlines.txt")
airports = pd.read_table("airports.txt")
flights = pd.read_table("flights.txt")
planes = pd.read_table("planes.txt")
weather = pd.read_table("weather.txt")

Inspecting a dataframe // What's in the flights dataset?

Let's run through an example using the flights dataset. This dataset includes...well what does it include? You could read the documentation, but let's take a look first.

Anatomy of a pandas DataFrame

There are a couple of concepts that are important to understand when working with dataframes

  • DataFrame class
  • Series
  • Index / Columns

To understand these, lets look at the 'planes' dataframe


In [ ]:
print(type(planes)) # Yup, it's a DataFrame

In [ ]:
# What does it look like?
planes # Jupyter Notebooks do some nifty formatting here

In [ ]:
# How big is it?
print(planes.shape) # Works like numpy

In [ ]:
print(planes.columns) # What are the column labels?

In [ ]:
print(planes.index) # What are the row labels?

In [ ]:
# Let's grab a column
planes['manufacturer']

In [ ]:
# Inspecting this column further
manufacturer = planes['manufacturer']
print(type(manufacturer)) # It's a Series

Series

one-dimensional and only have one set of labels (called index)

DataFrames

two-dimensional and have row-labels (called index) and column-labels (called columns) DataFrames are made up of Series (each column is a series)


In [ ]:
# Indexing into Series
print("Indexing into Series: ", manufacturer[3])

# Indexing into DataFrame
print("Indexing into DataFrame: ", planes.loc[3, 'manufacturer'])

DataFrame Indexing

We already showed that you can grab a column using

dataframe[column_name]

To grab a row, use:

dataframe.loc[row_name]

And to grab a specific element use:

dataframe.loc[row_name, column_name]

In [ ]:
third_row = planes.loc[3] # get the third row
third_row

In [ ]:
print(type(third_row))

Dataframe index

So far the row-index has been numeric (just 0 through ~3300). However, we might want to use labels here too.

To do this, we can select a column to be the dataframe's index Only do this if the column contains unique data


In [ ]:
planes = planes.set_index('tailnum')

# OR

planes = pd.read_table('planes.txt', index_col=0) #Set the first column as the index

planes.loc['N10156']

But now how do I get the 3rd row?

Here's where iloc comes into play.

Works like loc but uses integers


In [ ]:
print(planes.iloc[3]) # Get the third row
print(planes.iloc[:, 3]) # Get the third column

Exploring our dataset - let's look at the flights table


In [ ]:
print('What are the first 5 rows?')
flights.head()

In [ ]:
print('What are the last 5 rows?')
flights.tail()

In [ ]:
print('Sample random rows')
flights.sample(3, axis=0) # Axis 0 represents the rows, axis 1, the columns

Identifying and removing NAs in a dataset

We noticed some NAs above (hopefully). How do you find them and remove observations for which there are NAs?


In [ ]:
print('What are the dimensions of the flights dataframe?\n')
print(flights.shape)

print('Are there any NAs in the flights dataframe?\n')
print(flights.isnull().any())

In [ ]:
print('Selecting for flights where there is complete data, what are the dimensions?\n')
print("Original Matrix Shape:", flights.shape)
null_rows = flights.isnull().any(axis=1) # Rows where any value is null
flights_complete = flights.loc[~null_rows]
print("Complete-rows shape:", flights_complete.shape)

Aside: Why does this work with loc?

Earlier I showed .loc operating on row/column labels.

Well, it can also operate on boolean (true/false) lists (or numpy arrays, or pandas Series)

Above, what is null_rows?


In [ ]:
print(type(null_rows))
null_rows

The great thing about Pandas is that if you pass in a Series, the order of the elements in it doesn't matter anymore. It uses the index to align the Series to the row/column index of the dataframe.

This is very useful when creating a boolean index from one dataframe to be used to select rows in another!

Alternately, with removing NA values there is a dropna function that can be used.

Now...back to flights!


In [ ]:
print('How might I obtain a summary of the original dataset?')
flights.describe() # Similar to R's 'summary'
# use include='all' to include the non-numberic columns too

Performing a function along an axis

Pandas allows easy application of descriptive function along an axis.

any which we used earlier, is an example of that. If the data is boolean, any collapses a series of boolean values into True if any of the values are true (otherwise, False)

Can also use min, max, mean, var, std, count


In [ ]:
# An example
flights['air_time'].mean() # Returns a single value

In [ ]:
subset = flights[['air_time', 'dep_delay', 'arr_delay']]
subset.mean(axis=0) # Axis 0: collapse all rows, result has Index = to original Columns

If you want to apply an arbitrary function along an axis, look into the apply function

Performing column-wise operations while grouping by other columns // Departure delay by airport of origin

Sometimes you may want to perform some aggregate function on data by category, which is encoded in another column. Here we calculate the statistics for departure delay, grouping by origin of the flight - remember this is the greater NYC area, so there are only three origins!


In [ ]:
result = flights_complete.groupby('origin')['dep_delay'].mean()
result

In [ ]:
# What is this object?
print(type(result))

Other descriptive functions work here, like 'std', 'count', 'min', 'max'

Also: describe


In [ ]:
flights_complete.groupby('origin')['dep_delay'].describe()

Merging tables 'vertically' // Subsetting and re-combining flights from different airlines

You will likely need to combine datasets at some point. For simple acts of stitching two dataframes together, the pandas concat method is used.

Let's create a data frame with information on flights by United Airlines and American Airlines only, by creating two data frames via subsetting data about each airline one by one and then merging.

The main requirement is that the columns must have the same names (may be in different order).


In [ ]:
print('Subsetting the dataset to have 2 dataframes')
flightsUA = flights.loc[flights.carrier == 'UA',]
flightsAA = flights.loc[flights.carrier == 'AA',]
print('Checking the number of rows in two dataframes')
print(flightsUA.shape[0] + flightsAA.shape[0])

In [ ]:
print('Combining two dataframes than checking the number of rows in the resulting data frame')
flightsUAandAA = pd.concat([flightsUA,flightsAA], axis=0) # axis=1 would stitch them together horizontally
print(flightsUAandAA.shape[0])

Nothing special, just be sure the dataframes have the columns with the same names and types.


In [ ]:
print('Binding 3 data frames and checking the number of rows')
allthree = pd.concat([flightsUA,flightsAA,flightsUAandAA])
allthree.shape[0]

Merge two tables by a single column // What are the most common destination airports?

The flights dataset has destination airports coded, as three-letter airport codes. I'm pretty good at decoding them, but you don't have to be.


In [ ]:
airports.head()

The airports table gives us a key! Let's merge the flights data with the airports data, using dest in flights and faa in airports.


In [ ]:
print('Merging in pandas')
flights_readdest = flights_complete.merge(airports, left_on='dest', right_on = 'faa', how='left')
flights_readdest.head()

Why did we use how='left'?


In [ ]:
len(set(airports.faa) - set(flights.dest))

There are 1357 airports in the airports table that aren't in the flights table at all.

Here are the different arguments for how and what they'd do:

  • 'left': use all rows for flights_complete, and only rows from airports that match
  • 'right': use all rows for airports, and only rows from flights that match
  • 'inner': use only rows for airports and flights that match on the dest/faa columns
  • 'outer': use all rows from both airports and flights

Well this merged dataset is nice, but do we really need all of this information?


In [ ]:
flights_readdest.columns

In [ ]:
flights_sm = flights_readdest[['origin', 'name', 'year', 'month', 'day', 'air_time']]
flights_sm.head()

In [ ]:
# Renaming is not so simple in pandas
flights_sm = flights_sm.rename(columns = {'name': 'dest'})
flights_sm.head()

Since each operation gives us back a dataframe, they are easily chained:


In [ ]:
airtime = flights_complete.merge(airports, left_on='dest', right_on='faa', how='left') \
    .loc[:, ['origin', 'name', 'air_time']] \
    .groupby(['origin', 'name'])['air_time'] \
    .mean()

print(airtime.shape)
airtime.head()

Goal: What's the longest flight from each airport, on average?

Here, 'airtime' is a little abnormal because it's Index has two levels

- First level is the 'origin'
- Second level is the name of the destination

This is because we grouped by two variables.

Now we need to group by 'origin' and apply the 'max' function. Groupby can work for the levels of a multi-index too


In [ ]:
airtime.groupby(level='origin').max()

In [ ]:
# What if we want to know where the flight goes?

rows = airtime.groupby(level='origin').idxmax() # This returns the indices in airtime where the max was found

airtime[rows] # Index by it to get the max rows

Pivot Table // Average flight time from origin to destination

Let's put destinations in rows and origins in columns, and have air_time as values.


In [ ]:
pvt_airtime = airtime.unstack() # Since airtime has a hierarchical index, we can use unstack
pvt_airtime

However, often you want to pivot just a regular dataframe. I'll create one from airtime for an example:


In [ ]:
airtime_df = pd.DataFrame(airtime).reset_index()
airtime_df.head()

In [ ]:
airtime_pv = airtime_df.pivot(index='origin', 
                columns='name',
                values='air_time')
airtime_pv

Multi-column merge // What's the weather like for departing flights?

Flights...get delayed. What's the first step if you want to know if the departing airport's weather is at all responsible for the delay? Luckily, we have a weather dataset for that.

Let's take a look.


In [ ]:
weather.head()

In [ ]:
print(flights_complete.columns & weather.columns) # What columns do they share?

In [ ]:
flights_weather = flights_complete.merge(weather, 
                         on=["year", "month","day","hour", "origin"])

print(flights_complete.shape)
print(flights_weather.shape)

flights_weather has less rows. Default behavior of merge is 'inner' and so this means there are some flight year/month/day/hour/origin combos where we don't have a weather entry


In [ ]:
# Let's grab flights+weather where the delay was greater than 200 minutes

flights_weather_posdelays = flights_weather.loc[flights_weather.dep_delay > 200]
flights_weather_posdelays.shape

In [ ]:
# Anything unusual about these flights?
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure()
plt.hist(flights_weather.dropna().wind_gust, 30, range=(0, 50), normed=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().wind_gust, 30, range=(0,50), normed=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Wind Gust')

plt.figure()
plt.hist(flights_weather.dropna().pressure, 30,  normed=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().pressure, 30,  normed=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Pressure')

plt.figure()
plt.hist(flights_weather.dropna().hour, 30,  normed=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().hour, 30,  normed=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Hour')

Arranging a dataframe // What's the weather like for the most and least delayed flights?

Let's sort the flights_weather dataframe on dep_delay and get data for the top 10 and bottom 10 delays.


In [ ]:
flights_weather.sort_values('dep_delay').head(10)

In [ ]:
flights_weather.sort_values('dep_delay', ascending=False).head(10)

Some other tidying

Capitalization issues.


In [ ]:
flights_complete.dest.str.lower().head() # For string columns, use .str to access string methods

In [ ]:
flights_complete.dest.str.upper().head()

Wide to long formatted data


In [ ]:
flights_complete.head()

In [ ]:
day_delay = pd.melt(flights_complete, id_vars=['hour', 'time_hour'], value_vars=['dep_delay', 'arr_delay'], var_name='type_of_delay')
day_delay

In [ ]:
plt.figure()
sns.stripplot(x='hour', y='value', hue='type_of_delay', data=day_delay)

Well this is a bit hard to read. What about the first entry for each type of delay in each hour?

Removing duplicates


In [ ]:
day_delay_first = day_delay.drop_duplicates('time_hour', keep='first')
day_delay_first.head()

An incomplete investigation of NAs

Let's examine where there are NAs in the flights dataset.


In [ ]:
flights.isnull().sum(axis=0)

In [ ]:
flights_incomplete = flights.loc[flights.isnull().any(axis=1)]
flights_incomplete.shape

Do flights with NA departure time also have an NA departure delay?


In [ ]:
pd.crosstab(
    index=flights_incomplete.dep_time.isnull(),   # Series of bool values
    columns=flights_incomplete.dep_delay.isnull() # series of bool values
)

Yes