Data Tidying and Transformation in R

by Diya Das, David DeTomaso, and Andrey Indukaev

The goal

Data tidying is a necessary first step for data analysis - it's the process of taking your messily formatted data (missing values, unwieldy coding/organization, etc.) and literally tidying it up so it can be easily used for downstream analyses. To quote Hadley Wickham, "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table."

These data are actually pretty tidy, so we're going to be focusing on cleaning and manipulation, but these manipulations will give you some idea of how to tidy untidy data.

The datasets

We are going to be using the data from the R package nycflights13. There are five datasets corresponding to flights departing NYC in 2013. We will load directly into R from the library, but the repository also includes CSV files we created for the purposes of the Python demo and can also be used to load the data into our R session.

*** If you've never run Jupyter notebooks with R, please run conda install -c r r-essentials


In [ ]:
options(repos=structure(c(CRAN="http://cran.cnr.berkeley.edu/", 
BioCsoft="http://www.bioconductor.org/packages/release/bioc/")))
ipak <- function(pkg){
     new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
     if (length(new.pkg))
         install.packages(new.pkg, dependencies = TRUE)
     sapply(pkg, require, character.only = TRUE)
 } #https://gist.github.com/stevenworthington/3178163

pkgs <- c("nycflights13",
       "tidyr",
       "dplyr",
       "reshape",
       "ggplot2",
       "data.table")
ipak(pkgs)

In [ ]:
#invisible(sapply(pkgs, library, character.only=TRUE ))

Reading data from a file

Let's read data from a file, though we won't be using it for this exercise.


In [ ]:
unzip("nycflights13.zip")
list.files()
read.delim("airlines.txt")

Inspecting a data frame // What's in the flights dataset?

Let's run through an example using the flights dataset. This dataset includes...well what does it include? You could read the documentation, but let's take a look first.

What are the dimensions of the flights data frame?


In [ ]:
data(flights)
flights <- data.frame(flights) ## dplyr has introduced a new data format that I am ignoring
dim(flights)

What are the first 6 rows?


In [ ]:
head(flights)

What are the last 6 rows?


In [ ]:
tail(flights)

What does the sample function do?


In [ ]:
sample(1:6,2)

What happens when I use sample for indexing?


In [ ]:
flights[sample(1:nrow(flights),10),]

Identifying and removing NAs in a dataset

We noticed some NAs above (hopefully). How do you find them and remove observations for which there are NAs?

Are there any NAs in the flights data frame? Are all values NA?


In [ ]:
print(any(is.na(flights)))
print(all(is.na(flights)))

Selecting for flights where there is complete data, what are the dimensions?

complete.cases returns a logical vector indicating whether all observations in a row are not-NA.


In [ ]:
message('Using base R...')
flights_complete <- flights[complete.cases(flights),]
print(dim(flights_complete))
message('Using tidyR...')
flights_complete2 <- drop_na(flights)

Are they the same datasets?


In [ ]:
identical(flights_complete,flights_complete2)

How might I obtain a summary of the original dataset?


In [ ]:
summary(flights)

Performing a function along an axis // Calculating mean times

R allows easy application of descriptive function along an axis.

any and all, which we used earlier, is an example of that. If the data are boolean, any collapses a series of boolean values into True if any of the values are true. all collapses a series of boolean values into True if all of the values are true.

What's the mean air time?


In [ ]:
mean(flights_complete$air_time)

Can we calculate the mean for multiple subsets of our data at once?


In [ ]:
timedat <- select(flights_complete, air_time, dep_delay, arr_delay)

Find mean of each row...


In [ ]:
head(apply(timedat,1, mean))

Find mean of each column...


In [ ]:
apply(timedat,2, mean)

Find mean of each column...yet another way...


In [ ]:
lapply(timedat, mean)

Let's fix the formatting...


In [ ]:
sapply(timedat, mean)

Using dplyr...


In [ ]:
timedat %>%  summarise_all(mean) %>% print()

Performing column-wise operations while grouping by other columns // Departure delay by airport of origin

Sometimes you may want to perform some aggregate function on data by category, which is encoded in another column. Here we calculate the statistics for departure delay, grouping by origin of the flight - remember this is the greater NYC area, so there are only three origins!

First let's use tapply:


In [ ]:
tapply(flights_complete$dep_delay, flights_complete$origin, summary)

That code is a bit messy, so using the with command for indicating the parent data frame...


In [ ]:
with(flights_complete, 
           tapply(dep_delay, origin, summary)
          )

Using dplyr...


In [ ]:
flights_complete %>% group_by(origin) %>% summarise(avg_dep_delay=mean(dep_delay)) %>% print()

Pipes in R: making code readable

The last sections have used the operator %>%. This symbol is called a pipe. It was introduced in the magrittr package, but dplyr and tidyr also take advantage of this syntax.

Pipes %>% exist because they help tidy up commands when performing a chain of operations. When we want to provide to a function1 some data which is output of a function2, whose input is output from function3, we can end up with some very nested, difficult-to-read commands: function1(function2(function3(data,parameters3),parameters2),parameters1)

Sometimes, with may help simplify your commands, as above, but piping can be more direct.

data %>% function(parameters) is exactly the same as function(data,parameters)

But the code is read (and written) from the left to the right (and not inside-out) and is easier to understand.

A more complicated example of piping // Calculating average departure delay in one call

Let's remove rows with missing values (NA) from flights dataset, then calculate the average departure delay in one call, first using basic syntax then using pipes.

Using standard R syntax:


In [ ]:
with(flights[complete.cases(flights),], 
           tapply(dep_delay, origin, summary)
          )

Same operation using pipes and dplyr:


In [ ]:
flights %>% filter(complete.cases(.)) %>% 
    group_by(origin) %>% 
        summarise(avg_dep_delay = mean(dep_delay)) %>% 
            print()

flights %>% drop_na() %>%
    group_by(origin) %>% 
        summarise(avg_dep_delay = mean(dep_delay)) %>% 
            print()

Notes:

  1. . stands for the data frame in the call of type: data %>% function1(function2(data)).
  2. In case of multi-line call, %>% should be in the end of a line.
  3. There are different schools of thought on piping. Some prefer to make code readable by doing all operations step-by-step.

Merging tables 'vertically' // Subsetting and re-combining flights from different airlines

You will likely need to combine datasets at some point. R provides quite a few tools to do that, and as you've seen, it's possible to do something many different ways.

Here, we present a simple case of 'vertical' merging, using rbind. Let's create a data frame with information on flights by United Airlines and American Airlines only, by creating two data frames via subsetting data about each airline one by one and then merging.

The main requirement is that the columns must have the same names (may be in different order).

Subsetting the dataset to make 2 data frames:


In [ ]:
flightsUA <- flights[flights$carrier == 'UA',]
flightsAA <- flights[flights$carrier == 'AA',]

Checking the number of rows in two data frames:


In [ ]:
print(nrow(flightsUA) + nrow(flightsAA))

Combining two data frames then checking the number of rows in the resulting data frame:


In [ ]:
flightsUAandAA <- rbind(flightsUA,flightsAA)
print(nrow(flightsUAandAA))

Nothing special, just be sure the data frames have the columns with the same names and types.

A useful tip is to use do.call in order to merge more than two data frames. do.call is a function that applies a function to a list of elements.

rbinding 3 data frames and checking the number of rows:


In [ ]:
print(nrow(do.call(rbind, list(flightsUA,flightsAA,flightsUAandAA))))

A useful tip for populating data frames within a loop

'rbind' is really useful for populating a data frames, but it can be a bit slow within loops. Each time we append a row to a data frame within a loop a new copy of a data frame is stored in memory :(

The solution is to create a list of lists and then merge them with do.call rbind combo. But since rbind, as many native R functions, is slow and not memory-efficient, for large datasets one may want to use rbindlist function from data.table package, which does the same operation, but faster.

Let's compare these approaches using the system.time function to see the execution times.


In [ ]:
message('execution time for rbind')
system.time(do.call(rbind, list(flightsUA,flightsAA,flightsUAandAA)))

message('execution time for rbindlist, written in C')
system.time(rbindlist(list(flightsUA,flightsAA,flightsUAandAA)))

And now the example of using rbindlist for populating a data frame. Let's pretend we forgot how to use tapply or group_by (dplyr) and we want to calculate the average arrival and departure delays per airline.


In [ ]:
Start <- Sys.time()
carriers  <- unique(flights_complete$carrier)
resList <- list()
for (i in 1:length(carriers))
{
    meanDepDelay <- mean(flights_complete[flights_complete$carrier == carriers[i],]$dep_delay)
    meanArrDelay <- mean(flights_complete[flights_complete$carrier == carriers[i],]$arr_delay) 
    resList[[i]] <- list(carriers[i],meanDepDelay,meanArrDelay)
}
DelaysByAirline <- rbindlist(resList)
colnames(DelaysByAirline) <- c("carrier","meanDepDelay","meanArrDelay")
End <- Sys.time()
message('It took us')
print(End-Start)
message('and here is the result for Amercian Airlines')
print(DelaysByAirline[DelaysByAirline$carrier == 'AA',])

message('Same result without messing with loops')
Start <- Sys.time()
flights_complete %>% group_by(carrier)%>%
    summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay))%>%
        filter(carrier == 'AA') %>% print()
End <- Sys.time()
message('And it took us')
print(End-Start)

In most cases loops are possible to avoid, but if you have to write one, the "list of lists" + rbindlist approach may save you a lot of time.

Merge two tables by a single column // What are the most common destination airports?

The flights dataset has destination airports coded, as three-letter airport codes. I'm pretty good at decoding them, but you don't have to be.


In [ ]:
print(head(airports))

The airports table gives us a key! Let's merge the flights data with the airports data, using dest in flights and faa in airports.

This is pretty easy in base R...


In [ ]:
flights_readdest <- merge(flights_complete, airports, by.x='dest', by.y = 'faa', all.x=TRUE)
head(flights_readdest)

And you can do it in dplyr too...


In [ ]:
flights_readdest2 <- left_join(flights_complete, airports, by = c("dest" = "faa"))
head(flights_readdest2)

Why did we use all.x = TRUE and left_join?


In [ ]:
setdiff(flights$dest, airports$faa)

Well this merged dataset is nice, but do we really need all of this information?


In [ ]:
colnames(flights_readdest)

In [ ]:
flights_sm <- select(flights_readdest,origin, dest=name, year, month, day, air_time)
head(flights_sm)

Why would you want to ever use select? dplyr lets you chain operations using the pipes, as discussed above. Let's calculate the average air time for various flight paths, using origin and the readable version of destination airport.


In [ ]:
airtime <- left_join(flights_complete, airports, by = c("dest" = "faa")) %>% 
    select(origin, dest=name, air_time) %>% 
    group_by(origin, dest) %>% 
    summarize(avg_air_time = mean(air_time))

print(head(airtime))
print(dim(airtime))

What's the longest flight from each airport, on average?


In [ ]:
with(airtime, by(avg_air_time, origin,  max)) 
with(airtime, tapply(avg_air_time, origin,  max)) 
airtime %>% group_by(origin) %>% summarise(max(avg_air_time))

Let's find the flight info for the max airtimes from each airport.


In [ ]:
with(airtime, tapply(avg_air_time, origin, function(x) which.max(x)))
message('That\'s the index in the grouped version, but not very useful')
setDT(airtime)[,.SD[which.max(avg_air_time)],by=origin] #data.table package, http://stackoverflow.com/questions/30294088#comment48685310_30294186
airtime %>% group_by(origin) %>% filter(avg_air_time == max(avg_air_time))

Pivot Table // Average flight time from origin to destination

Let's put destinations in rows and origins in columns, and have air_time as values.


In [ ]:
spread(airtime, origin, avg_air_time)

Notice the NA in the last dest row. Can you guess what happened? (This was human input error; we skipped a step in our analysis.)

Multi-column merge // What's the weather like for departing flights?

Flights...get delayed. What's the first step if you want to know if the departing airport's weather is at all responsible for the delay? Luckily, we have a weather dataset for that.

Let's take a look.


In [ ]:
head(weather)
intersect(colnames(flights_complete), colnames(weather)) # What columns do they share?

In [ ]:
flights_weather <- merge(flights_complete, weather, 
                         by=c("year", "month","day","hour", "origin"))
print(dim(flights_complete))
print(dim(flights_weather))

What if you only want to look at long delays?


In [ ]:
flights_weather_posdelays <- filter(flights_weather, dep_delay>200)
print(dim(flights_weather_posdelays))

Arranging a data frame // What's the weather like for the most and least delayed flights?

Let's sort the flights_weather data frame on dep_delay and get data for the top 10 and bottom 10 delays.


In [ ]:
flights_weather %>% arrange(desc(dep_delay)) %>% head(10)
flights_weather %>% arrange(dep_delay) %>% tail(10)

In [ ]:
flights_weather %>% arrange(dep_delay) %>% head(10)

Some other tidying

Capitalization issues

Sometimes your data is messy, and capitalization is inconsistent. A lot of matching functions in R have flags to ignore case, but you can also bypass this issue altogether using tolower and toupper.


In [ ]:
print(head(tolower(flights_complete$dest)))
print(head(toupper(airports$name)))

Wide to long formatted data

Our current flights_complete data frame is wide-formatted data. Each variable (measurement) for an observation (flight) is a separate column in that row.

For some plotting functions, including ggplot in R, it's easier to plot data for multiple series on the same plot axes if the values are all in one column, and the series label is another. To do that, we need to format our data in the long format. In this case, we're going to plot both arrival and departure delays on the same plot, so we need to create a column for "delay time" (here called value because of the melt function) and another column type_of_delay to encode the series.


In [ ]:
print(head(flights_complete))

Using reshape...


In [ ]:
day_delay <- melt(flights_complete, id.vars=c("time_hour"), 
                measure.vars=c("dep_delay","arr_delay"), variable_name = "type_of_delay")
print(head(day_delay))

Using tidyr...


In [ ]:
day_delay <- gather(flights_complete, `dep_delay`,`arr_delay`, 
                    key = "type_of_delay", value="value") # note the backticks here!
print(head(day_delay))

In [ ]:
ggplot(day_delay,aes(x=time_hour,y=value,colour=type_of_delay, group=type_of_delay)) + geom_point()

Well this is a bit hard to read. What about the first entry for each type of delay in each hour?

Removing duplicates, sorting on multiple columns

We're going to use the distinct function from tidyR and first group by time_hour and type_of_delay before removing duplicates. We don't want to remove an entire series as would happen if you only grouped by time_hour on your melted data frame.


In [ ]:
day_delay_first <- distinct(day_delay, time_hour, type_of_delay, .keep_all = TRUE) 
 # keep all columns
print(head(day_delay_first))

In [ ]:
ggplot(day_delay_first,aes(x=time_hour,y=value,colour=type_of_delay, group=type_of_delay)) + geom_point()

An incomplete investigation of NAs

Let's examine where there are NAs in the flights dataset.

We can use which in R, with the arr.ind argument to get (row, col) positions. This is really handy if we just want to examine the columns.


In [ ]:
ind <- which(is.na(flights), arr.ind = TRUE)
print(head(ind))

message('Let\'s make a handy summary using table which gives frequencies of each value:')
ind2 <- table(ind[,2])
print(ind2)

But what do those numbers, 4-15, mean? They're column numbers. We can use colnames to get the original column names, and names to access and rename the ind2 vector. (It's not a matrix/data frame so it doesn't have column names.)


In [ ]:
names(ind2) <- colnames(flights)[as.numeric(names(ind2))]
  # This is kind of fancy-looking, but we are converting the names of ind2 to numeric because
  # they are strings, and using those numeric indices to get the right names from colnames
print(ind2)

Let's just grab our rows with NAs.


In [ ]:
flights_incomplete <- flights[!complete.cases(flights),]
print(dim(flights_incomplete))

Do flights with NA departure time also have an NA departure delay? One way to tell is to check if number of instances where both the departure time and the departure delay are NA is the same as the number of instances where at least one of them is NA.


In [ ]:
print(table(is.na(flights_incomplete$dep_time) & is.na(flights_incomplete$dep_delay)))
print(table(is.na(flights_incomplete$dep_time) | is.na(flights_incomplete$dep_delay)))

Yes. (My sneaking suspicion is that the flights were cancelled.)