Stacked bar charts

Let's visualize majors by race and sex at UNL in R.

First things first: let's load the two libraries we need.


In [1]:
library(dplyr)
library(ggplot2)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Now we grab some data.


In [2]:
enrollment <- read.csv("../../Data/collegeenrollment.csv")

In [3]:
head(enrollment)


College                                 Degree  MajorCode  MajorName               RaceGender            Race              Gender  Count  Total
College of Agri Sci and Natl Resources  B1BC    BIOC       Biochemistry            NonResidentAlienMale  NonResidentAlien  Male    3      97
College of Agri Sci and Natl Resources  B1AS    ASCI       Animal Science          NonResidentAlienMale  NonResidentAlien  Male    0      338
College of Agri Sci and Natl Resources  B1FW    FWL        Fisheries and Wildlife  NonResidentAlienMale  NonResidentAlien  Male    0      191
College of Agri Sci and Natl Resources  B1AP    APSC       Applied Science         NonResidentAlienMale  NonResidentAlien  Male    1      71
College of Agri Sci and Natl Resources  B1HO    HORT       Horticulture            NonResidentAlienMale  NonResidentAlien  Male    1      52
College of Agri Sci and Natl Resources  B1ED    AEDU       Agricultural Education  NonResidentAlienMale  NonResidentAlien  Male    0      103

So let's narrow our data down a bit. We'll filter it down to the College of Journalism and Mass Communications, or as the database calls it, "College of Journalism & Mass Comm".

We'll create a new data frame by assigning it the result of filter(name of source data frame, filter condition).


In [4]:
cojmc <- filter(enrollment, College == "College of Journalism & Mass Comm")
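To see what filter is doing without the full enrollment file, here is a minimal sketch on a made-up data frame. The toy rows, the "College of Engineering" value and the names toy, toy_cojmc and toy_big are all invented for illustration; only the filter call itself mirrors the cell above.

```r
library(dplyr)

# Made-up rows that only mimic the shape of the enrollment data
toy <- data.frame(
  College   = c("College of Journalism & Mass Comm",
                "College of Journalism & Mass Comm",
                "College of Engineering"),
  MajorName = c("Journalism", "Advertising", "Civil Engineering"),
  Count     = c(12, 30, 45),
  stringsAsFactors = FALSE
)

# Keep only the rows whose College matches the string exactly
toy_cojmc <- filter(toy, College == "College of Journalism & Mass Comm")

# Several conditions separated by commas are combined with AND
toy_big <- filter(toy, College == "College of Journalism & Mass Comm", Count > 20)

nrow(toy_cojmc)
toy_big$MajorName
```

The match is exact, so a stray space or a spelled-out "and" in the string would quietly return zero rows.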

Now we can visualize it. Let's use a stacked bar chart. The difference between what you have here and what you did in Python is, again, vanishingly small. The major difference -- MAJOR -- is that you no longer put field names in quotes. THAT'S IT! That's all.


In [5]:
ggplot(cojmc, aes(MajorName, weight=Count, fill=Race)) + geom_bar() + scale_fill_brewer(name = "Race")
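If the weight aesthetic is unfamiliar, this small sketch may help. It uses made-up counts (the group labels and numbers are invented, not real enrollment figures) to show that, with weight=Count, each bar adds up the Count values rather than counting rows.

```r
library(dplyr)
library(ggplot2)

# Made-up counts, not real enrollment figures
toy <- data.frame(
  MajorName = c("Journalism", "Journalism", "Advertising", "Advertising"),
  Race      = c("GroupA", "GroupB", "GroupA", "GroupB"),
  Count     = c(10, 5, 20, 8),
  stringsAsFactors = FALSE
)

# Without weight, geom_bar() would count rows (2 per major here).
# With weight = Count, each colored segment is as tall as the summed Count.
p <- ggplot(toy, aes(MajorName, weight = Count, fill = Race)) + geom_bar()

# The same bar totals computed by hand: one row per bar
bar_totals <- toy %>%
  group_by(MajorName) %>%
  summarise(Total = sum(Count))

bar_totals
```

Printing p in a notebook cell draws the chart; bar_totals is only there to check the heights you should expect to see.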


Let's do the same thing, except for the 10 largest majors on campus. I took a quick look at the data and figured out that the 10 largest majors on campus have more than 500 people in them total. So that's what our filter condition will be.


In [6]:
largest <- filter(enrollment, Total > 500)
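One way to take that "quick look" is sketched below. It assumes that Total is repeated on every row of the same major, so each major needs to be collapsed to a single row before ranking. The toy data and the names toy and top2 are made up, and the sketch takes the top 2 rather than the top 10 to stay short.

```r
library(dplyr)

# Made-up data: two rows per major, with Total repeated on each row
toy <- data.frame(
  MajorName = c("A", "A", "B", "B", "C", "C"),
  Count     = c(300, 400, 200, 100, 20, 30),
  Total     = c(700, 700, 300, 300, 50, 50),
  stringsAsFactors = FALSE
)

top2 <- toy %>%
  distinct(MajorName, Total) %>%  # one row per major
  arrange(desc(Total)) %>%        # largest first
  head(2)                         # use head(10) for the ten largest

# The smallest Total in this list suggests where to put the cutoff
min(top2$Total)
```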

And if we visualize that in the same way, except flipping the bars sideways, this is what we get:


In [7]:
ggplot(largest, aes(MajorName, weight=Count, fill=Race)) + geom_bar() + coord_flip()


If we wanted to chart the percentage of each race within each of the 10 largest majors, we could create that column from our existing data. We have the Count for each race and gender combination and the Total for the major, so we can calculate a new field from them. Whenever we want to add or change a column in our data frame, we use dplyr's mutate function. You call mutate with the data frame you are mutating, then the name of the new field you are creating, and then a formula to fill it in.


In [8]:
pctlargest <- mutate(largest, Percent = (Count / Total)*100)
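As a sanity check on the formula, here is a minimal sketch with made-up numbers (toy, toy_pct and check are invented names). If Total really is the sum of the Count values for a major, the new Percent column should add up to 100 within each major.

```r
library(dplyr)

# Made-up data where each major's Count values add up to its Total
toy <- data.frame(
  MajorName = c("A", "A", "B", "B"),
  Race      = c("GroupA", "GroupB", "GroupA", "GroupB"),
  Count     = c(25, 75, 5, 15),
  Total     = c(100, 100, 20, 20),
  stringsAsFactors = FALSE
)

# Same formula as the cell above, applied row by row
toy_pct <- mutate(toy, Percent = (Count / Total) * 100)

# Add the percentages back up within each major
check <- toy_pct %>%
  group_by(MajorName) %>%
  summarise(PercentSum = sum(Percent))

check
```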

In [9]:
ggplot(pctlargest, aes(MajorName, weight=Percent, fill=Race)) + geom_bar() + coord_flip()


But what if we wanted our bar chart of largest majors to be in order of size? We can use arrange for that. But there's a bit more magic that has to go on. To make the arrange stick, we have to mutate the data frame and make the order permanent by rebuilding the label column as a factor whose levels are in the sorted order.

Here is how to do this with the enrollment data. This will also introduce another neat trick in dplyr: the pipe operator, %>%. It allows you to chain commands together, so you don't have to run a new cell or create a new data frame every time. So I can create a new data frame, sort it and mutate it all in one go. I can even chart it in the same go if I want.
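A toy version may make both ideas easier to see before running them on real data. The data frame and the names toy, nested and piped are made up. The sketch also uses unique() when setting the factor levels so that each major appears only once in the level list, which is an assumption about the intended result.

```r
library(dplyr)

# Made-up data: two rows per major, smaller major listed first
toy <- data.frame(
  MajorName = c("Small", "Small", "Big", "Big"),
  Total     = c(50, 50, 900, 900),
  stringsAsFactors = FALSE
)

# Nested calls: you have to read from the inside out
nested <- mutate(arrange(toy, desc(Total)),
                 MajorName = factor(MajorName, levels = unique(MajorName)))

# Piped: the same steps, read top to bottom
piped <- toy %>%
  arrange(desc(Total)) %>%
  mutate(MajorName = factor(MajorName, levels = unique(MajorName)))

# The factor levels now follow the sorted order, not the alphabet
levels(piped$MajorName)
```

ggplot2 lays out a discrete axis in factor-level order, which is why sorting alone is not enough to change the chart.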


In [10]:
largestsorted <- arrange(largest, desc(Total)) %>% mutate(MajorName = factor(MajorName, levels = unique(MajorName)))

In [11]:
ggplot(largestsorted, aes(MajorName, weight=Count, fill=Race)) + geom_bar() + coord_flip()
