In [1]:
library(dplyr)
library(ggplot2)
Now we grab some data.
In [2]:
enrollment <- read.csv("../../Data/collegeenrollment.csv")
In [3]:
head(enrollment)
So let's narrow our data down a bit. We'll filter our data down to the College of Journalism and Mass Communications, or as the database calls it, "College of Journalism & Mass Comm".
We'll start by creating a new data frame, then use filter(name of source data frame, filter condition).
In [4]:
cojmc <- filter(enrollment, College == "College of Journalism & Mass Comm")
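If you want to see what filter is doing without loading the full file, here is a minimal sketch on a tiny made-up data frame. The toy rows, the second college name and the toy_ variable names are assumptions for illustration only; just the College, MajorName and Count field names come from the real data.

```r
library(dplyr)

# Made-up rows for illustration only; the real file has more rows and columns.
toy <- data.frame(
  College = c("College of Journalism & Mass Comm",
              "College of Engineering",
              "College of Journalism & Mass Comm"),
  MajorName = c("Journalism", "Civil Engineering", "Advertising"),
  Count = c(12, 30, 8),
  stringsAsFactors = FALSE
)

# filter(source data frame, condition) keeps only the rows where the condition is TRUE.
toy_cojmc <- filter(toy, College == "College of Journalism & Mass Comm")

# Several conditions separated by commas are combined with AND.
toy_big <- filter(toy, College == "College of Journalism & Mass Comm", Count > 10)
```

The original data frame is left alone. filter hands back a new one, which is why we assign the result to a new name.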
Now we can visualize it. Let's use a stacked bar chart. The difference between what you have here and what you did in Python is, again, vanishingly small. The major difference -- MAJOR -- is that you no longer put field names in quotes. THAT'S IT! That's all.
In [5]:
ggplot(cojmc, aes(MajorName, weight=Count, fill=Race)) + geom_bar()
Let's do the same thing, except for the 10 largest majors on campus. I took a quick look at the data and figured out that the 10 largest majors on campus have more than 500 people in them total. So that's what our filter condition will be.
In [6]:
largest <- filter(enrollment, Total > 500)
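That "quick look" could go something like the sketch below. It assumes Total holds the whole major's enrollment repeated on every row for that major, which is how the filter above treats it but which you should check against the real file. The toy rows and the toy_ and major_sizes names are made up for illustration.

```r
library(dplyr)

# Made-up rows; Total is assumed to repeat the major's overall enrollment.
toy <- data.frame(
  MajorName = c("Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"),
  Race = rep(c("Group 1", "Group 2"), 3),
  Count = c(400, 200, 100, 50, 300, 250),
  Total = c(600, 600, 150, 150, 550, 550),
  stringsAsFactors = FALSE
)

# One row per major, keeping just the name and its total.
toy_majors <- distinct(toy, MajorName, Total)

# Sort from largest to smallest so the biggest majors sit at the top.
major_sizes <- arrange(toy_majors, desc(Total))

# Look at the first few rows to pick a cutoff (head(major_sizes, 10) on real data).
head(major_sizes, 2)
```

Reading down the sorted Total column tells you where the 10th largest major falls, and that number becomes the filter condition.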
And if we visualize that in the same way, except flipping the bars sideways, this is what we get:
In [7]:
ggplot(largest, aes(MajorName, weight=Count, fill=Race)) + geom_bar() + coord_flip()
If we wanted to chart the percentages of the races within each of the 10 largest majors, we could create that column from our existing data. We have the Count for each race and gender combination and the Total for the major, so we can create a new field and calculate the percentage from those. Whenever we want to change our data frame, we use dplyr's mutate function. You call mutate, then give it the data frame you are mutating, the name of the new field you are creating, and a formula of some variety to fill it in.
In [8]:
pctlargest <- mutate(largest, Percent = (Count / Total)*100)
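To check the arithmetic by hand, here is a small sketch of the same mutate call on two made-up rows. The toy values and the toy_pct name are assumptions for illustration; the formula is the one used above.

```r
library(dplyr)

# Two made-up rows from one hypothetical major with 600 students in total.
toy <- data.frame(
  MajorName = c("Alpha", "Alpha"),
  Race = c("Group 1", "Group 2"),
  Count = c(150, 450),
  Total = c(600, 600),
  stringsAsFactors = FALSE
)

# mutate(data frame, new field = formula) adds the new column
# and leaves the existing columns untouched.
toy_pct <- mutate(toy, Percent = (Count / Total) * 100)

# 150 / 600 * 100 is 25 and 450 / 600 * 100 is 75.
toy_pct$Percent
```

Because the formula works row by row, the percentages within one major add up to 100 only if the Counts add up to the Total.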
In [11]:
ggplot(pctlargest, aes(MajorName, weight=Percent, fill=Race)) + geom_bar() + coord_flip()
But what if we wanted our bar chart of largest majors to be in order of size? We can use arrange for that. But there's a bit more magic that has to go on. To make the arrange stick, we have to mutate the data frame and make the order permanent by refactoring the label you'll use, meaning we rebuild it as a factor whose levels are in the sorted order.
This is also a good place to introduce another neat trick in dplyr: the %>%. That allows you to chain commands together, so you don't have to run a new cell every time, or create new data frames every time. So I can create a new data frame, sort it and mutate it all in one go. I can even chart it in the same go if I want. The two cells below take a shortcut instead: reorder() inside aes() reorders MajorName by Count on the fly (by its mean per major, by default), and the minus sign reverses the direction, so no new data frame is needed.
In [16]:
ggplot(pctlargest, aes(reorder(MajorName, -Count), weight=Percent, fill=Race)) + geom_bar() + coord_flip()
In [18]:
ggplot(largest, aes(reorder(MajorName, -Count), weight=Count, fill=Race)) + geom_bar() + coord_flip()
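For comparison, the arrange, mutate and %>% route described above might look like the sketch below. It is one way to do it, not necessarily the only one. The toy rows and the toy_sorted and sorted_chart names are made up, and sorting by Total assumes Total is the major's overall enrollment.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only.
toy <- data.frame(
  MajorName = c("Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"),
  Race = rep(c("Group 1", "Group 2"), 3),
  Count = c(400, 200, 300, 400, 100, 50),
  Total = c(600, 600, 700, 700, 150, 150),
  stringsAsFactors = FALSE
)

# Each %>% passes the result on the left in as the first argument on the right.
toy_sorted <- toy %>%
  arrange(Total) %>%
  # Rebuild MajorName as a factor whose levels follow the sorted row order,
  # so ggplot keeps this order instead of going alphabetical.
  mutate(MajorName = factor(MajorName, levels = unique(MajorName)))

# With coord_flip(), the first level is drawn at the bottom,
# so sorting ascending puts the largest major at the top.
sorted_chart <- ggplot(toy_sorted, aes(MajorName, weight = Count, fill = Race)) +
  geom_bar() +
  coord_flip()

sorted_chart
```

The trade-off is that reorder() is shorter for a one-off chart, while the factor approach stores the order in the data frame so every later chart picks it up.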
One of the first things you'll notice is that the colors are ... suboptimal. Let's take our first step into changing the look of the default charts. We'll use an installed library called RColorBrewer, which provides the ColorBrewer palettes. Some help with changing colors can be found here.
In [20]:
library("RColorBrewer")
display.brewer.all()
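If the full display is too much to take in, you can also inspect a single palette. This is a small sketch using two things RColorBrewer provides, brewer.pal and the brewer.pal.info table. The choice of "Spectral" and of five colors is just an example, and the spectral5 and spectral_info names are made up.

```r
library(RColorBrewer)

# Ask for five colors from the Spectral palette; the result is a vector of hex codes.
spectral5 <- brewer.pal(5, "Spectral")
spectral5

# brewer.pal.info lists, for each palette, the maximum number of colors,
# its category (sequential, diverging or qualitative) and a colorblind flag.
spectral_info <- brewer.pal.info["Spectral", ]
spectral_info
```

The category is worth a glance before you pick a palette, since it tells you what kind of data the palette was designed for.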
So knowing that information above, we can use one of those color ramps to change our colors. We do that by adding scale_fill_brewer and the name of the palette we want to use.
In [33]:
ggplot(largest, aes(reorder(MajorName, -Count), weight=Count, fill=Race)) + geom_bar() + coord_flip() + scale_fill_brewer(palette="Spectral")
But how effective is that? Does that help or hurt?