Recasting data

Sometimes long data needs to be wide, and sometimes wide data needs to be long. I'll explain.

You are soon going to discover that long before you can visualize data, you need to have it in a form that the visualization library can deal with. One of the ways that isn't immediately obvious is how your data is cast. Most of the data you will encounter will be wide -- each row will represent a single entity with multiple measures for that entity. So think of states. Your dataset could have population, average life expectancy and other demographic data.

But what if your visualization library needs one row for each measure? That's where recasting your data comes in. We can use a library called reshape2 to melt or cast the data, depending on what we need.

So let's transform a dataset we've already used -- registered voters in Nebraska -- from wide data to long data and back again. First, we'll import the library and then open the data.


In [1]:
library(reshape2)

In [2]:
voters <- read.csv("../../Data/registeredvoters.csv")

In [3]:
head(voters)


CountyRepublican10Democrat10Libertarian10Nonpartisan10Total10Republican16Democrat16Nonpartisan16Libertarian16Total16
Adams 10018 5536 6 2972 18532 10746 5027 3591 163 19527
Antelope 3005 1147 0 538 4690 3088 863 594 12 4557
Arthur 284 52 0 10 346 286 37 15 3 341
Banner 424 53 0 53 530 427 38 73 7 545
Blaine 314 56 0 24 394 310 43 29 2 384
Boone 2390 1156 0 408 3954 2469 901 404 11 3785

Making data long, in most cases, is very, very easy. It's simple. We're going to create a new data frame called longvoters, and then melt our voters data into it. Then we'll run head and you'll see each measure gets it's own row -- so each county has 10 rows of data for it.


In [4]:
longvoters <- melt(voters)
head(longvoters)


Using County as id variables
Countyvariablevalue
Adams Republican1010018
Antelope Republican10 3005
Arthur Republican10 284
Banner Republican10 424
Blaine Republican10 314
Boone Republican10 2390

But one problem that isn't immediately clear is where in spreadsheet world, a header like Republican10 and Republican16 makes sense, it doesn't here. We need fields to be able to have a County, a Party, a Year and then a count of those voters. So let's so some more worth with mutate and create those columns we need. And let's start exploring programmatic text manipulations. To simplify, we're going to use a library called stringr to get predictable patterns in our data.

So if you look at the variable field, you see we have the party at the front and the year in the back and the year is always the last two characters of the field. The problem? We don't know WHICH characters those are. But stringr has a powerful tool called str_sub that can get us that, if we know one trick: That telling it to start at -2 means go to the end of the word and move backwards two characters.


In [5]:
library(stringr)
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


In [6]:
longvoters %>% mutate(
    Year = str_sub(variable, start= -2),
)


CountyvariablevalueYear
Adams Republican10 10018 10
Antelope Republican10 3005 10
Arthur Republican10 284 10
Banner Republican10 424 10
Blaine Republican10 314 10
Boone Republican10 2390 10
Box Butte Republican10 4115 10
Boyd Republican10 1036 10
Brown Republican10 1663 10
Buffalo Republican10 15768 10
Burt Republican10 2521 10
Butler Republican10 3044 10
Cass Republican10 8216 10
Cedar Republican10 3072 10
Chase Republican10 1827 10
Cherry Republican10 2945 10
Cheyenne Republican10 4146 10
Clay Republican10 2889 10
Colfax Republican10 2364 10
Cuming Republican10 3639 10
Custer Republican10 5450 10
Dakota Republican10 3577 10
Dawes Republican10 3505 10
Dawson Republican10 7666 10
Deuel Republican10 1027 10
Dixon Republican10 1887 10
Dodge Republican10 10683 10
Douglas Republican10120973 10
Dundy Republican10 1013 10
Fillmore Republican10 2317 10
Nemaha Total16 4563 16
Nuckolls Total16 3162 16
Otoe Total16 10447 16
Pawnee Total16 1858 16
Perkins Total16 1982 16
Phelps Total16 6235 16
Pierce Total16 4872 16
Platte Total16 20116 16
Polk Total16 3516 16
Red Willow Total16 7239 16
Richardson Total16 5669 16
Rock Total16 1024 16
Saline Total16 7830 16
Sarpy Total16 108473 16
Saunders Total16 14696 16
Scotts BluffTotal16 24139 16
Seward Total16 10926 16
Sheridan Total16 4066 16
Sherman Total16 2010 16
Sioux Total16 919 16
Stanton Total16 3782 16
Thayer Total16 3791 16
Thomas Total16 506 16
Thurston Total16 4529 16
Valley Total16 2995 16
Washington Total16 14322 16
Wayne Total16 5236 16
Webster Total16 2413 16
Wheeler Total16 607 16
York Total16 10013 16

If I really wanted to be correct, I could make those four digit years by adding 2000 to what the str_sub finds, but I also have to turn that into a number first.


In [7]:
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
)


CountyvariablevalueYear
Adams Republican10 10018 2010
Antelope Republican10 3005 2010
Arthur Republican10 284 2010
Banner Republican10 424 2010
Blaine Republican10 314 2010
Boone Republican10 2390 2010
Box Butte Republican10 4115 2010
Boyd Republican10 1036 2010
Brown Republican10 1663 2010
Buffalo Republican10 15768 2010
Burt Republican10 2521 2010
Butler Republican10 3044 2010
Cass Republican10 8216 2010
Cedar Republican10 3072 2010
Chase Republican10 1827 2010
Cherry Republican10 2945 2010
Cheyenne Republican10 4146 2010
Clay Republican10 2889 2010
Colfax Republican10 2364 2010
Cuming Republican10 3639 2010
Custer Republican10 5450 2010
Dakota Republican10 3577 2010
Dawes Republican10 3505 2010
Dawson Republican10 7666 2010
Deuel Republican10 1027 2010
Dixon Republican10 1887 2010
Dodge Republican10 10683 2010
Douglas Republican10120973 2010
Dundy Republican10 1013 2010
Fillmore Republican10 2317 2010
Nemaha Total16 4563 2016
Nuckolls Total16 3162 2016
Otoe Total16 10447 2016
Pawnee Total16 1858 2016
Perkins Total16 1982 2016
Phelps Total16 6235 2016
Pierce Total16 4872 2016
Platte Total16 20116 2016
Polk Total16 3516 2016
Red Willow Total16 7239 2016
Richardson Total16 5669 2016
Rock Total16 1024 2016
Saline Total16 7830 2016
Sarpy Total16 108473 2016
Saunders Total16 14696 2016
Scotts BluffTotal16 24139 2016
Seward Total16 10926 2016
Sheridan Total16 4066 2016
Sherman Total16 2010 2016
Sioux Total16 919 2016
Stanton Total16 3782 2016
Thayer Total16 3791 2016
Thomas Total16 506 2016
Thurston Total16 4529 2016
Valley Total16 2995 2016
Washington Total16 14322 2016
Wayne Total16 5236 2016
Webster Total16 2413 2016
Wheeler Total16 607 2016
York Total16 10013 2016

Okay, so I have one side of it. Now I need the other. Luckily, we can tell str_sub to start at a character and end at one. In this case, we want to start at 1 and end at -3.


In [8]:
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
)


CountyvariablevalueYearParty
Adams Republican10 10018 2010 Republican
Antelope Republican10 3005 2010 Republican
Arthur Republican10 284 2010 Republican
Banner Republican10 424 2010 Republican
Blaine Republican10 314 2010 Republican
Boone Republican10 2390 2010 Republican
Box Butte Republican10 4115 2010 Republican
Boyd Republican10 1036 2010 Republican
Brown Republican10 1663 2010 Republican
Buffalo Republican10 15768 2010 Republican
Burt Republican10 2521 2010 Republican
Butler Republican10 3044 2010 Republican
Cass Republican10 8216 2010 Republican
Cedar Republican10 3072 2010 Republican
Chase Republican10 1827 2010 Republican
Cherry Republican10 2945 2010 Republican
Cheyenne Republican10 4146 2010 Republican
Clay Republican10 2889 2010 Republican
Colfax Republican10 2364 2010 Republican
Cuming Republican10 3639 2010 Republican
Custer Republican10 5450 2010 Republican
Dakota Republican10 3577 2010 Republican
Dawes Republican10 3505 2010 Republican
Dawson Republican10 7666 2010 Republican
Deuel Republican10 1027 2010 Republican
Dixon Republican10 1887 2010 Republican
Dodge Republican10 10683 2010 Republican
Douglas Republican10120973 2010 Republican
Dundy Republican10 1013 2010 Republican
Fillmore Republican10 2317 2010 Republican
Nemaha Total16 4563 2016 Total
Nuckolls Total16 3162 2016 Total
Otoe Total16 10447 2016 Total
Pawnee Total16 1858 2016 Total
Perkins Total16 1982 2016 Total
Phelps Total16 6235 2016 Total
Pierce Total16 4872 2016 Total
Platte Total16 20116 2016 Total
Polk Total16 3516 2016 Total
Red Willow Total16 7239 2016 Total
Richardson Total16 5669 2016 Total
Rock Total16 1024 2016 Total
Saline Total16 7830 2016 Total
Sarpy Total16 108473 2016 Total
Saunders Total16 14696 2016 Total
Scotts BluffTotal16 24139 2016 Total
Seward Total16 10926 2016 Total
Sheridan Total16 4066 2016 Total
Sherman Total16 2010 2016 Total
Sioux Total16 919 2016 Total
Stanton Total16 3782 2016 Total
Thayer Total16 3791 2016 Total
Thomas Total16 506 2016 Total
Thurston Total16 4529 2016 Total
Valley Total16 2995 2016 Total
Washington Total16 14322 2016 Total
Wayne Total16 5236 2016 Total
Webster Total16 2413 2016 Total
Wheeler Total16 607 2016 Total
York Total16 10013 2016 Total

Two last things that are bothering me about our data? I don't like the number of voters being called value and we don't need variable. So we'll use dpylr to clean this up a bit.

First we're going to use dpylr's select and a little R trick where you can use a negative sign to mean not this.

Then, we'll use rename, which is backwards to me. If you read it, what it says is rename a column to Count and assign it the values of value. The reverse makes sense to me. Make values = Count, but dplyr does it the other way. If you get an error, you probably did it backwards (I did).


In [9]:
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)


CountyCountYearParty
Adams 10018 2010 Republican
Antelope 3005 2010 Republican
Arthur 284 2010 Republican
Banner 424 2010 Republican
Blaine 314 2010 Republican
Boone 2390 2010 Republican
Box Butte 4115 2010 Republican
Boyd 1036 2010 Republican
Brown 1663 2010 Republican
Buffalo 15768 2010 Republican
Burt 2521 2010 Republican
Butler 3044 2010 Republican
Cass 8216 2010 Republican
Cedar 3072 2010 Republican
Chase 1827 2010 Republican
Cherry 2945 2010 Republican
Cheyenne 4146 2010 Republican
Clay 2889 2010 Republican
Colfax 2364 2010 Republican
Cuming 3639 2010 Republican
Custer 5450 2010 Republican
Dakota 3577 2010 Republican
Dawes 3505 2010 Republican
Dawson 7666 2010 Republican
Deuel 1027 2010 Republican
Dixon 1887 2010 Republican
Dodge 10683 2010 Republican
Douglas 120973 2010 Republican
Dundy 1013 2010 Republican
Fillmore 2317 2010 Republican
Nemaha 4563 2016 Total
Nuckolls 3162 2016 Total
Otoe 10447 2016 Total
Pawnee 1858 2016 Total
Perkins 1982 2016 Total
Phelps 6235 2016 Total
Pierce 4872 2016 Total
Platte 20116 2016 Total
Polk 3516 2016 Total
Red Willow 7239 2016 Total
Richardson 5669 2016 Total
Rock 1024 2016 Total
Saline 7830 2016 Total
Sarpy 108473 2016 Total
Saunders 14696 2016 Total
Scotts Bluff 24139 2016 Total
Seward 10926 2016 Total
Sheridan 4066 2016 Total
Sherman 2010 2016 Total
Sioux 919 2016 Total
Stanton 3782 2016 Total
Thayer 3791 2016 Total
Thomas 506 2016 Total
Thurston 4529 2016 Total
Valley 2995 2016 Total
Washington 14322 2016 Total
Wayne 5236 2016 Total
Webster 2413 2016 Total
Wheeler 607 2016 Total
York 10013 2016 Total

In [10]:
newlongvoters <- longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)

Then, we can put it back together again by casting it using dcast. With dcast, we need to tell it which variable is our main identifier -- which is County -- and what the headers should be. We also have to tell it where the numbers should come from, since we blew it all apart. And you'll see, we've changed the data substantially, but it looks almost identical to the original dataset.


In [11]:
widevoters <- dcast(newlongvoters, County ~ Party+Year, value.var = "Count")
head(widevoters)


CountyDemocrat_2010Democrat_2016Libertarian_2010Libertarian_2016Nonpartisan_2010Nonpartisan_2016Republican_2010Republican_2016Total_2010Total_2016
Adams 5536 5027 6 163 2972 3591 10018 10746 18532 19527
Antelope1147 863 0 12 538 594 3005 3088 4690 4557
Arthur 52 37 0 3 10 15 284 286 346 341
Banner 53 38 0 7 53 73 424 427 530 545
Blaine 56 43 0 2 24 29 314 310 394 384
Boone 1156 901 0 11 408 404 2390 2469 3954 3785

Assignment

Melt the population estimates data from assignment 3. Create long data, where each row is a single year for a single county, with columns for the state, county, year and estimate.

Rubric

  1. Did you import the data correctly?
  2. Did you apply melt correctly?
  3. Did you mutate/rename columns correctly?
  4. Did you explain your steps using Markdown comments?

In [ ]: