Line charts and dates

Line charts are excellent at showing time series data -- data that changes by date or time. And there are two ways of dealing with time series data -- the easy way and the right way.

First, the easy way.


In [1]:
library(ggplot2)

In [2]:
enrollment <- read.csv("../../Data/enrollment.csv")

In [3]:
head(enrollment)


Year  Date      Enrollment
1967  1967-1-1  18067
1968  1968-1-1  19150
1969  1969-1-1  19618
1970  1970-1-1  20810
1971  1971-1-1  21541
1972  1972-1-1  21581

What we have here is enrollment at UNL since 1967. And, as you can see, I've already pulled the year into a field. So plotting this on a line chart is simply a matter of changing the geometry from what we've previously done to geom_line.


In [4]:
ggplot(enrollment, aes(x=Year, y=Enrollment)) + geom_line()


But what's the problem with this? Where does the Y axis start? Does this chart tell the story of enrollments at UNL over time?

Short answer: No. It makes it look wildly erratic. It's anything but. So we need to change our Y axis scale. And to do that, we introduce scale_y_continuous and scale_x_continuous as commands that we can chain to this.

In this, we're going to create a vector -- a collection of elements -- that sets the lower and upper bounds of our variable. So let's start it at 0 and end it at 30,000, which is about 4,000 higher than our max value. Why that? It gives the top a little space. We could set it to 27,000 or 100,000. What you set it to is whatever you need to tell the truth about the data. So the code you are adding is + scale_y_continuous(limits = c(0, 30000)) and, for purposes of this example, you can do scale_x_continuous too, though it has no real effect: + scale_x_continuous(limits = c(1967, 2016)).
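
One way to pick those bounds is to look at the largest value and round up to leave some headroom. The sketch below uses only the six rows printed by head(enrollment), so its numbers are illustrative and will not match the full file. The names enrollment_sample, upper_limit and y_limits, and the choice to round to the next 5,000, are assumptions made for this example.

```r
# Only the six rows shown by head(enrollment); the real file covers more years
enrollment_sample <- data.frame(
    Year = 1967:1972,
    Enrollment = c(18067, 19150, 19618, 20810, 21541, 21581)
)

# Largest value in the sample
max_enrollment <- max(enrollment_sample$Enrollment)

# Round up to the next multiple of 5,000 to leave a little space at the top
upper_limit <- ceiling(max_enrollment / 5000) * 5000

# This vector is the kind of thing you pass to scale_y_continuous(limits = ...)
y_limits <- c(0, upper_limit)
print(y_limits)
```

Starting the vector at 0 is the part that matters for telling the truth about the data. The upper bound only needs to clear the maximum.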


In [5]:
ggplot(enrollment, aes(x=Year, y=Enrollment)) + geom_line() + scale_y_continuous(limits = c(0, 30000)) + scale_x_continuous(limits = c(1967, 2016))


Note how much flatter the line is? That's more like what enrollment looks like. Up and down, but relatively stable.

Introducing Labels

Labels are important to graphics. Clear labeling of your X and Y axes and a clear title are essential to graphics and your grade. Adding them uses labs (short for labels), a function that you can chain to your graphic. You can explicitly change the x, y and title labels pretty simply.


In [6]:
ggplot(enrollment, aes(x=Year, y=Enrollment)) + geom_line() + scale_y_continuous(limits = c(0, 30000)) + scale_x_continuous(limits = c(1967, 2016)) + labs(x="Academic year", y="Fall enrollment", title="Enrollment at UNL since 1967", subtitle="Enrollment at UNL grew during the Vietnam War and has remained stable since.", caption="Graphic by Matt Waite")


The right way

The right way to deal with dates in line charts is to convert them to dates during your processing. So let's look at the daily flow of parking tickets on campus. First we'll need to do some data work that you've seen before -- using dplyr to mutate a field with a date without the time.
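
As a minimal base R sketch of that conversion: as.POSIXct parses the text as a date-time, and as.Date drops the time of day so tickets from the same day share one value. The first two timestamps come from the tickets data and the third is made up to give a second day. The variable names and the fixed UTC time zone are assumptions for this example.

```r
# Two timestamps from the tickets data, plus one made-up timestamp for the next day
stamps <- c("2012-04-02 07:15:00", "2012-04-02 07:22:00", "2012-04-03 09:00:00")

# Parse the text as date-times; fixing the time zone keeps the example reproducible
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Drop the time of day, leaving a true Date
short_date <- as.Date(parsed, tz = "UTC")

class(stamps)      # "character"
class(short_date)  # "Date"

# Count tickets per day
table(short_date)
```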


In [7]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


In [8]:
tickets <- read.csv("../../Data/tickets.csv")

In [9]:
head(tickets)


Citation  Date                 Location         Violation
15078429  2012-04-02 07:15:00  North Stadium    Expired Meter
24048318  2012-04-02 07:22:00  Housing          No Valid Permit Displayed
24048320  2012-04-02 07:26:00  14th & W Street  No Valid Permit Displayed
15078430  2012-04-02 07:36:00  Champions Club   Parking in Unauthorized Area
18074937  2012-04-02 07:39:00  Sandoz           Expired Meter
18074938  2012-04-02 07:40:00  Sandoz           Expired Meter

In [10]:
library(zoo)


Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


In [11]:
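ticketsbymonth <- tickets %>% mutate(
    # Keep only the date part of the timestamp, stored as text
    shortdate = format(as.POSIXct(Date, format="%Y-%m-%d")),
    # Collapse each date to its month as a zoo yearmon
    yearmonth = as.yearmon(shortdate, format="%Y-%m")
) %>% group_by(yearmonth) %>% summarize(
    # One row per month with the number of tickets
    count = n()
)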

In [12]:
head(ticketsbymonth)


yearmonth  count
Apr 2012   3473
May 2012   2572
Jun 2012   2478
Jul 2012   2134
Aug 2012   3774
Sep 2012   4138

But look at this -- that yearmonth field isn't a standard date. It's a yearmon, a class that comes from zoo.


In [14]:
sapply(ticketsbymonth, class)


yearmonth  'yearmon'
count      'integer'

So what do you do? You can manipulate data in your ggplot steps to turn them into the data you need. So, in this case, we can turn our x variable into a date by using as.Date or as.POSIXct or lubridate if we want. In this case, as.Date would work just fine, but since yearmonth is already a yearmon, the chart below uses zoo's scale_x_yearmon instead.


In [17]:
ggplot(ticketsbymonth, aes(x=yearmonth, y=count)) + geom_line() + scale_x_yearmon(format = "%B %Y")


One last thing -- we can do better on labels. Let's say we want some notion of where we are in the year, so we'd rather have labels that give us a month and a year. We can do that using scale_x_date and within that, set date_labels and date_breaks. More on that here: http://ggplot2.tidyverse.org/reference/scale_date.html
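
The date_labels argument takes strftime-style format codes, the same ones base R's format() understands, so you can try out a label pattern on a single date before putting it in a chart. A small sketch follows. The variable d is just an example value, and the month names you get back depend on your locale.

```r
# A Date to experiment with (the first day in the tickets data)
d <- as.Date("2012-04-02")

format(d, "%B %Y")  # full month name, four-digit year
format(d, "%b %y")  # abbreviated month name, two-digit year
format(d, "%m/%d")  # month and day as numbers: "04/02"
```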


In [18]:
# ticketsbyday is not created anywhere above. Rebuilding it here as daily counts is an assumption based on the description of the data work.
ticketsbyday <- tickets %>% mutate(shortdate = format(as.POSIXct(Date, format="%Y-%m-%d"))) %>% group_by(shortdate) %>% summarize(count = n())
ggplot(ticketsbyday, aes(x=as.Date(shortdate), y=count)) + geom_line() + scale_x_date(date_labels="%B %y", date_breaks="6 months")


Let's now clean up our labels.


In [19]:
ggplot(ticketsbyday, aes(x=as.Date(shortdate), y=count)) + geom_line() + scale_x_date(date_labels="%b %y", date_breaks="6 months") + labs(x="Date", y="Tickets", title="Parking tickets by day at UNL")

Assignment

This chart of parking tickets by day is a little noisy. What would it look like if you:

  1. Grouped it by month?
  2. Or week? What does that graph look like?
  3. How would you label the parts?

Rubric

  1. Did you import the data correctly?
  2. Did you manipulate the data correctly?
  3. Did you chart the data?
  4. Did you change the labels correctly and produce the required charts?
  5. Did you explain your steps in Markdown comments?

In [20]:
library(lubridate)


Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date


In [21]:
ticketsbyweek <- tickets %>% mutate(
    week = floor_date(ymd_hms(Date), "week")
) %>% group_by(week) %>% summarize(
    count = n()
)

In [22]:
head(ticketsbyweek)


week        count
2012-04-01  951
2012-04-08  729
2012-04-15  777
2012-04-22  885
2012-04-29  416
2012-05-06  407
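
Every week value in that output is a Sunday, because floor_date rounds each timestamp down to the start of its week and, with lubridate's default settings, weeks start on Sunday. The same idea can be sketched in base R. This is not how lubridate implements it, just an illustration. The first date comes from the tickets data and the other two are made up.

```r
# One date from the tickets data, plus two made-up dates for comparison
days <- as.Date(c("2012-04-02", "2012-04-07", "2012-04-08"))

# wday is 0 for Sunday through 6 for Saturday
days_since_sunday <- as.POSIXlt(days)$wday

# Subtracting it moves each date back to the Sunday that starts its week
week_start <- days - days_since_sunday
print(week_start)
```

Monday, April 2 and Saturday, April 7 both land on April 1, while Sunday, April 8 starts a new week.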

In [23]:
ggplot(ticketsbyweek, aes(x=week, y=count)) + geom_line()


