Basic data analysis in R

R is a statistical programming language that is purpose-built for data analysis.

Base R does a lot, but there is a mountain of external libraries that make R better, easier to use and more fully featured. One of the best libraries, in your professor's opinion, is dplyr, a library for working with data. To use dplyr, you need to load it with library().


In [1]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
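The message above says that dplyr's filter and lag now take priority over the versions in the stats package that share those names. Here is a small sketch of what that means in practice. The sightings data frame is a made-up toy example, not the real mountain lion file.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
sightings <- data.frame(
  county = c("Dawes", "Sioux", "Dawes"),
  type = c("Track", "Mortality", "Mortality")
)

# After library(dplyr), a bare filter() means dplyr's filter():
# keep only the rows where the condition is TRUE
dawes_only <- filter(sightings, county == "Dawes")

# The package::function form says explicitly which one you want
dawes_explicit <- dplyr::filter(sightings, county == "Dawes")

# The masked stats version is still there under its full name.
# It does something quite different: it applies a linear filter
# (here a two-point moving average) to a series of numbers.
smoothed <- stats::filter(c(1, 2, 3, 4), rep(1 / 2, 2))

print(dawes_only)
```

Nothing is lost when a function is masked. You just need the full package::function name to reach the one that got covered up.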

The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we're going to read data from a CSV file -- a comma-separated values file.

The code looks like this:

mountainlions <- read.csv("../../Data/mountainlions.csv")

Let's unpack that.

The first part -- mountainlions -- is the name of your variable. A variable is just a name for a thing. In this case, our variable is a data frame, which is R's way of storing data. We can call this whatever we want. I always name data frames after what is in them. In this case, we're going to import a dataset of mountain lion sightings from the Nebraska Game and Parks Commission.

The <- bit is the variable assignment operator. It tells R to store whatever is on the right under the name on the left.
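A quick sketch of assignment on its own, before any data is involved. The variable names here are made up for illustration; the number 393 is the row count that shows up later in this walkthrough.

```r
# Store a number under a name
sightings_count <- 393

# Store a piece of text under a name
county_name <- "Dawes"

# Typing a variable's name on its own line prints what it holds
sightings_count

# Once something has a name, you can use it in later steps
doubled <- sightings_count * 2
doubled
```

A data frame works the same way. The only difference is that the thing on the right of the <- is a whole table instead of a single value.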

The read.csv part is the function that reads a CSV file. What goes in the quote marks is the path to the data. In there, I have to tell R where to find the data. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as your notebook. In my case, I've got a folder called Data that's two levels up from my work folder. The ../ means move up one level. So move up one level, move up one level, find Data, then in there is a file called mountainlions.csv.
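If the path gives you trouble, it can help to ask R where it is looking. This is a minimal sketch that assumes the same folder layout described above; your folders will probably be different, so treat the path pieces as placeholders.

```r
# The folder R treats as "here"; relative paths start from this spot
getwd()

# Build the relative path piece by piece
# (these folder names match my setup and are an assumption for yours)
data_path <- file.path("..", "..", "Data", "mountainlions.csv")
data_path

# Check that the file is really there before trying to read it
if (file.exists(data_path)) {
  mountainlions <- read.csv(data_path)
} else {
  message("No file found at: ", normalizePath(data_path, mustWork = FALSE))
}
```

If you see the "No file found" message, compare the printed location with where the file actually sits on your computer and adjust the path.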

What you put in there will be different from mine. So your first task is to import the data.


In [2]:
mountainlions <- read.csv("../../Data/mountainlions.csv")

Now we can inspect the data we imported. What does it look like? To do that, we use head(mountainlions) to show the headers and the first six rows of data. If we wanted to see them all, we could simply enter mountainlions and run it.

To get the number of records in our dataset, we run nrow(mountainlions).


In [10]:
head(mountainlions)
nrow(mountainlions)


ID  Cofirm.Type  COUNTY        Date
1   Track        Dawes         9/14/91
2   Mortality    Sioux         11/10/91
3   Mortality    Scotts Bluff  4/21/96
4   Mortality    Sioux         5/9/99
5   Mortality    Box Butte     9/29/99
6   Track        Scotts Bluff  11/12/99

393
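There are a few other ways to peek at a data frame. The sketch below rebuilds just the six rows shown above by hand so it runs without the CSV file. The sample_rows name is made up for this example.

```r
# The six rows from the head() output above, typed in by hand
sample_rows <- data.frame(
  ID = 1:6,
  Cofirm.Type = c("Track", "Mortality", "Mortality",
                  "Mortality", "Mortality", "Track"),
  COUNTY = c("Dawes", "Sioux", "Scotts Bluff",
             "Sioux", "Box Butte", "Scotts Bluff"),
  Date = c("9/14/91", "11/10/91", "4/21/96",
           "5/9/99", "9/29/99", "11/12/99")
)

# head() takes an optional number of rows; the default is six
first_three <- head(sample_rows, 3)
first_three

# tail() shows rows from the end instead
tail(sample_rows, 2)

# How many rows and columns?
nrow(sample_rows)
ncol(sample_rows)

# Just the column names
names(sample_rows)

# A compact summary of each column and its type
str(sample_rows)
```

Notice that the column names have to be typed exactly as they appear in the data, including the capitalization of COUNTY and the spelling of Cofirm.Type.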

So what if we wanted to know how many mountain lion sightings there were in each county? To do that by hand, we'd have to take each of the 393 records and sort them into piles, one pile per county. Then we'd count how many records are in each pile.
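To make the piles idea concrete, here is a rough sketch in base R using only the counties from the six rows we saw in head(). This is just an illustration of the by-hand process, not how we'll actually do it.

```r
# Counties from the first six records shown earlier
counties <- c("Dawes", "Sioux", "Scotts Bluff",
              "Sioux", "Box Butte", "Scotts Bluff")

# Sort the records into piles, one pile per county
piles <- split(counties, counties)

# Count how many records landed in each pile
pile_sizes <- lengths(piles)
pile_sizes
```

That's the whole idea behind grouping: make the piles, then do something to each pile.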

dplyr has a group_by function in it that does just this. A massive amount of data analysis involves grouping like things together at some point. So it's a good place to start.

So to do this, we'll take our dataset and we'll introduce a new operator: %>%. The best way to read that operator, in my opinion, is to interpret that as "and then do this." Here's the code:


In [11]:
mountainlions %>%
  group_by(COUNTY) %>%
  summarise(
    count = n(),
  )


COUNTY        count
Banner        6
Blaine        3
Box Butte     4
Brown         15
Buffalo       3
Cedar         1
Cherry        30
Custer        8
Dakota        3
Dawes         111
Dawson        5
Dixon         3
Douglas       2
Frontier      1
Hall          1
Holt          2
Hooker        1
Howard        3
Keith         1
Keya Paha     20
Kimball       1
Knox          8
Lincoln       10
Merrick       1
Morrill       2
Nance         1
Nemaha        5
Platte        1
Polk          1
Richardson    2
Rock          11
Sarpy         1
Saunders      2
Scotts Bluff  26
sheridan      2
Sheridan      35
Sherman       1
Sioux         52
Thomas        5
Thurston      1
Valley        1
Wheeler       1

So let's walk through that. We start with our dataset -- mountainlions -- and then we tell it to group the data by a given field in the data. In this case, we wanted to group together all the counties, signified by the field name COUNTY, which you could get from looking at head(mountainlions).

After we group the data, we need to count the records in each group. In dplyr, we use summarize, which can do more than just count things. (dplyr accepts both spellings, summarize and summarise; the code here uses summarise.) Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the records for each county. So count = n() says create a new field called count and set it equal to n(), which might look weird, but it's common in stats. The number of things in a dataset? Statisticians call it n. There are n incidents in this dataset. So n() is a function that counts the number of things in each group.
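Here is a small sketch of the same steps on a toy data frame built from the six rows we saw in head(), so it runs without the CSV file. It shows two things: what the %>% version looks like written without the pipe, and a second summary alongside the count. The sample_rows name and the types field are made up for this example.

```r
library(dplyr)

# Two columns from the six rows shown by head(), typed in by hand
sample_rows <- data.frame(
  Cofirm.Type = c("Track", "Mortality", "Mortality",
                  "Mortality", "Mortality", "Track"),
  COUNTY = c("Dawes", "Sioux", "Scotts Bluff",
             "Sioux", "Box Butte", "Scotts Bluff")
)

# With the pipe: take the data, and then group it, and then summarize it
piped <- sample_rows %>%
  group_by(COUNTY) %>%
  summarise(
    count = n(),                     # how many records in each county
    types = n_distinct(Cofirm.Type)  # how many different confirmation types
  )

# The same steps without the pipe: you have to read from the inside out
nested <- summarise(
  group_by(sample_rows, COUNTY),
  count = n(),
  types = n_distinct(Cofirm.Type)
)

print(piped)
```

Both versions produce the same table. The piped one reads top to bottom in the order the steps happen, which is why I prefer it.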

And when we run that, we get a list of counties with a count next to them. But it's ordered by county name, not by count. So we'll add another And Then Do This %>% and use arrange. Arrange does what you think it does -- it arranges data in order. By default, it's in ascending order -- smallest to largest. But if we want to know the county with the most mountain lion sightings, we need to sort it in descending order. That looks like this:


In [12]:
mountainlions %>%
  group_by(COUNTY) %>%
  summarise(
    count = n(),
  ) %>%
  arrange(desc(count))


COUNTY        count
Dawes         111
Sioux         52
Sheridan      35
Cherry        30
Scotts Bluff  26
Keya Paha     20
Brown         15
Rock          11
Lincoln       10
Custer        8
Knox          8
Banner        6
Dawson        5
Nemaha        5
Thomas        5
Box Butte     4
Blaine        3
Buffalo       3
Dakota        3
Dixon         3
Howard        3
Douglas       2
Holt          2
Morrill       2
Richardson    2
Saunders      2
sheridan      2
Cedar         1
Frontier      1
Hall          1
Hooker        1
Keith         1
Kimball       1
Merrick       1
Nance         1
Platte        1
Polk          1
Sarpy         1
Sherman       1
Thurston      1
Valley        1
Wheeler       1
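To see what arrange is doing on its own, here is a short sketch using five of the county counts from the table above, typed in by hand. The county_counts name is made up for this example.

```r
library(dplyr)

# Five rows copied from the results above
county_counts <- data.frame(
  COUNTY = c("Dawes", "Sioux", "Sheridan", "Custer", "Knox"),
  count = c(111, 52, 35, 8, 8)
)

# Default: ascending, smallest count first
ascending <- county_counts %>% arrange(count)

# Wrap the field in desc() for largest first
descending <- county_counts %>% arrange(desc(count))

# Custer and Knox are tied at 8. A second field breaks the tie;
# here ties are put in reverse alphabetical order by county.
tie_broken <- county_counts %>% arrange(desc(count), desc(COUNTY))

print(descending)
print(tie_broken)
```

You can list as many fields as you want inside arrange. It sorts by the first one, then uses each one after that only to settle ties.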

Assignment

Answer this question using what you have learned in this walkthrough.

What are the most common incidents UNL police reported from 2013 and 2016?

To do this, you'll need to download this data.

Rubric

  1. Did you read the data into a dataframe?
  2. Did you use group by syntax correctly?
  3. Did you use summarize syntax correctly?
  4. Did you use arrange syntax correctly?
  5. Did you use Markdown comments to explain your steps?
