One interesting thing you can do with line charts is use them to show how multiple elements of a thing change over the same period of time. This type of chart is called a Stacked Area chart, and there are two kinds -- ones that show the additive total, and ones that show the relative proportions within a population. Let's dive in, looking at a dataset of every prisoner in the Nebraska state corrections system.
In [1]:
library(dplyr)
library(tidyr)
library(ggplot2)
In [2]:
inmates <- read.csv("../../Data/inmateDB.csv", na.strings = "NA")
In [3]:
head(inmates)
So what we want is to take this data and group it by race and the year they started their sentence. We've carved out years using date formatting before. Let's try a new way -- using tidyr and separate.
Sometimes, you want to split a field into multiple fields based on a common separator -- like a slash or a comma or something like that. The separate function lets you split that data and name the new fields at the same time. In our case, we want to split a date on the slash, so we'll create Day, Month and Year. Then we'll group our data by Year and RACE.DESC and count it up. You'll also notice there's a filter in the code -- data before 1980 is pretty spotty, and there are a couple of typos where people won't start their sentences until well into the future.
In [4]:
raceyear <- inmates %>%
  separate(SENTENCE.BEGIN.DATE, c("Day", "Month", "Year"), "/", convert = TRUE) %>%
  filter(Year >= 1980 & Year <= 2017) %>%
  group_by(Year, RACE.DESC) %>%
  summarize(
    count=n()
  )
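To see what separate does on its own, here is a minimal sketch on a tiny made-up data frame. The `toy` data, its `id` and `start` columns, and the dates in it are assumptions for illustration only; they are not from the inmate data. The new columns are named in the order the pieces appear, so the names you pass need to match how the dates are actually written.

```r
library(tidyr)

# Made-up dates for illustration; not taken from the inmate data
toy <- data.frame(
  id = 1:3,
  start = c("15/3/1998", "2/11/2004", "27/7/2016")
)

# Split the start column on the slash and name the three pieces in order.
# convert = TRUE turns the resulting text into integers, which is what
# lets a later filter compare Year against numbers.
toy_split <- separate(toy, start, c("Day", "Month", "Year"), sep = "/", convert = TRUE)

print(toy_split)
```

Without `convert = TRUE` the new columns would stay as text, and a numeric filter on Year would not behave the way you expect.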
Now, if we use a ggplot setup similar to what we've used before and add geom_area() to our toolbox, we get the following:
In [5]:
ggplot(raceyear, aes(x=Year, y=count, fill=RACE.DESC)) +
  geom_area()
Let's add a new trick, and that's renaming the legend title. We'll also add some other labelling here for good practice. The ggplot documentation covers what else you can do with legends.
In [14]:
ggplot(raceyear, aes(x=Year, y=count, fill=RACE.DESC)) +
  geom_area() + scale_fill_discrete(name="Inmate Race") +
  labs(x="Year", y="Inmates", title="Inmates in Nebraska by race since 1980", subtitle="Nebraska's prisons, like the state, are predominantly white. \nHowever, several races are disproportionately represented.", caption="By Matt Waite")
That shows you the race of inmates by their start date, stacked onto each other, which shows you the breakdown of the total over time in a visual way. But what if we wanted to see the relative proportion instead, with each race shown as its share of that year's total?
So to get the percentage that a particular race represented among inmates for a given year, we need two pieces of information: the count of that race for that year, and the total number of inmates for that year. Since the race counts we have don't include the yearly total, we're going to merge two datasets together -- called a join in databases, and also in dplyr.
Similar to what we did before, here's the code needed to create a count of inmates by year. NOTE: I just edited my previous code, so the totals are built the same way the race counts were. I changed some variable names and grouped only by year.
In [7]:
yeartotals <- inmates %>%
  separate(SENTENCE.BEGIN.DATE, c("Day", "Month", "Year"), "/", convert = TRUE) %>%
  filter(Year >= 1980 & Year <= 2017) %>%
  group_by(Year) %>%
  summarize(
    count=n()
  )
Now we can join the two datasets we've created -- raceyear and yeartotals. Joins work by matching on a common element between the two datasets. If you look at ours, both have a Year element, so we'll use that. Both also have a column named count, so after the join dplyr renames them count.x (from raceyear) and count.y (from yeartotals) to keep them apart.
In [8]:
percents <- raceyear %>%
  inner_join(yeartotals, by="Year") %>%
  mutate(percentage = (count.x/count.y)*100)
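If the count.x and count.y names in that mutate look mysterious, here is a small sketch of the same join on made-up numbers. The `by_group` and `totals` data frames and their values are assumptions for illustration, not the inmate data. Because both inputs have a column called count, the join adds a .x suffix to the one from the left-hand data and a .y suffix to the one from the right.

```r
library(dplyr)

# Made-up counts for illustration: two groups across two years
by_group <- data.frame(
  Year = c(2000, 2000, 2001, 2001),
  group = c("A", "B", "A", "B"),
  count = c(30, 10, 45, 15)
)

# Made-up yearly totals (each is the sum of that year's group counts)
totals <- data.frame(
  Year = c(2000, 2001),
  count = c(40, 60)
)

# Match rows on Year; count becomes count.x (by_group) and count.y (totals)
joined <- by_group %>%
  inner_join(totals, by = "Year") %>%
  mutate(percentage = (count.x / count.y) * 100)

print(joined)
```

Each row of `by_group` picks up the total for its year, which is what makes the percentage calculation a simple division.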
In [16]:
ggplot(percents, aes(x=Year, y=percentage, fill=RACE.DESC)) +
  geom_area() + scale_fill_discrete(name="Inmate Race") +
  labs(x="Year", y="Percent of inmates", title="Inmates in Nebraska by race since 1980", subtitle="Nebraska's prisons, like the state, are predominantly white. \nHowever, several races are disproportionately represented.", caption="By Matt Waite")
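A quick sanity check on a proportional chart is that the percentages within each year should add up to 100. Here is a minimal sketch of that check on made-up numbers; the `toy_percents` data frame and its values are assumptions for illustration, and the same pattern should work on a real table of percentages by year.

```r
library(dplyr)

# Made-up percentages for illustration: two groups across two years
toy_percents <- data.frame(
  Year = c(2000, 2000, 2001, 2001),
  group = c("A", "B", "A", "B"),
  percentage = c(75, 25, 60, 40)
)

# Add up the percentages within each year
check <- toy_percents %>%
  group_by(Year) %>%
  summarize(total = sum(percentage))

print(check)
```

If a year comes back with a total well off 100, that usually points to a problem in the join or in how the totals were counted.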
Time to stretch your thinking a little. Using a stacked area chart, I want you to show me what proportion of the total call volume at UNL is made up of drug possession and minor in possession calls, by month. You'll use the same UNLPD data you've had, but you are going to have to figure out how to handle the dataset. There are multiple solutions here, and several questions for you to ponder -- like how do you deal with so many different crimes? Can you chart them all? Should you?
I expect this to challenge you, so the deadline is in one week. While it will challenge you, it is also a critical element of the class. You'll soon be obtaining data and having to make these decisions on how to best visualize the data without the benefit of a rubric or my guidance.
In [ ]: