Basic data analysis in R

R is a statistical programming language that is purpose-built for data analysis.

Base R does a lot, but there is a mountain of external libraries that make R better, easier and more fully featured. One of the best libraries, in your professor's opinion, is dplyr, a library for working with data. To use dplyr, you need to load it with library().



In [1]:

    
library(dplyr)









    



Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
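The "masked" messages above just mean that dplyr defines functions with the same names as ones R already had, and dplyr's versions now win when you type the bare name. Here is a minimal sketch of what that looks like in practice. The moving-average call is only an illustration of reaching the original function, not something this lesson needs, and moving_avg is a made-up name.

```r
library(dplyr)

# After loading dplyr, a bare filter() resolves to dplyr's version.
# This prints the name of the package the function comes from.
environmentName(environment(filter))

# The masked version from the stats package is still reachable
# by spelling out the package name with ::
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

If you ever need to be explicit, dplyr::filter() and stats::filter() always point at the version you name.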

The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we're going to read data from a csv file -- a comma-separated values file.

The code looks like this:

mountainlions <- read.csv("../../Data/mountainlions.csv")

Let's unpack that.

The first part -- mountainlions -- is the name of your variable. A variable is just a name for a thing. In this case, our variable is a data frame, which is R's way of storing tabular data. We can call this whatever we want. I always name data frames after what is in them. In this case, we're going to import a dataset of mountain lion sightings from the Nebraska Game and Parks Commission.

The <- bit is the variable assignment operator. It takes whatever is on the right side and stores it under the name on the left side.
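A tiny sketch of assignment on its own, before any data is involved. The names total_sightings and top_county are made up for illustration.

```r
# Store a number under the name total_sightings
total_sightings <- 393

# Typing the name by itself prints the stored value
total_sightings

# A name can hold text, too
top_county <- "Dawes"
top_county
```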

The read.csv bit reads a CSV file into a data frame. What goes in the quote marks is the path to the data. In there, I have to tell R where to find the data. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as your notebook. In my case, I've got a folder called Data that's two levels up from my work folder. The ../ means move up one level. So move up one level, move up one level again, find Data, and in there is a file called mountainlions.csv.
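If the path is giving you trouble, it can help to ask R where it is currently looking and whether it can see the file at all. This is a rough sketch that writes a tiny stand-in CSV to a temporary folder so it runs anywhere. The file example.csv and its two rows are invented for the demonstration and are not the real dataset.

```r
# Build a throwaway Data folder and a small CSV inside it
data_dir <- file.path(tempdir(), "Data")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "example.csv")
writeLines(c("ID,COUNTY", "1,Dawes", "2,Sioux"), csv_path)

# Where is R looking right now? Relative paths start from here.
getwd()

# Can R see the file at this path? TRUE means the path is right.
file.exists(csv_path)

# Read it into a data frame
example <- read.csv(csv_path)
example
```

When file.exists() returns FALSE for your own path, the path is the problem, not read.csv.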

What you put in there will be different from mine. So your first task is to import the data.



In [2]:

    
mountainlions <- read.csv("../../Data/mountainlions.csv")

Now we can inspect the data we imported. What does it look like? To do that, we use head(mountainlions) to show the column headers and the first six rows of data. If we wanted to see them all, we could simply enter mountainlions and run it.

To get the number of records in our dataset, we run nrow(mountainlions).



In [3]:

    
head(mountainlions)
nrow(mountainlions)









    





ID  Cofirm.Type  COUNTY        Date
1   Track        Dawes         9/14/91
2   Mortality    Sioux         11/10/91
3   Mortality    Scotts Bluff  4/21/96
4   Mortality    Sioux         5/9/99
5   Mortality    Box Butte     9/29/99
6   Track        Scotts Bluff  11/12/99









    




393
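A few other base R functions are handy for a first look at a data frame. This sketch rebuilds just the six rows shown above by hand (the name sample_lions is made up) so it runs without the CSV file.

```r
# The six rows printed by head(), typed in as a small data frame
sample_lions <- data.frame(
  ID = 1:6,
  Cofirm.Type = c("Track", "Mortality", "Mortality",
                  "Mortality", "Mortality", "Track"),
  COUNTY = c("Dawes", "Sioux", "Scotts Bluff",
             "Sioux", "Box Butte", "Scotts Bluff"),
  Date = c("9/14/91", "11/10/91", "4/21/96",
           "5/9/99", "9/29/99", "11/12/99")
)

# head() takes an optional n if six rows is too many or too few
first_three <- head(sample_lions, n = 3)
first_three

nrow(sample_lions)   # number of rows (records)
ncol(sample_lions)   # number of columns (fields)
names(sample_lions)  # just the column names
```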

So what if we wanted to know how many mountain lion sightings there were in each county? To do that by hand, we'd have to take each of the 393 records and sort them into piles, one pile per county, and then count each pile.

dplyr has a group_by function that does just this. A massive amount of data analysis involves grouping like things together at some point. So it's a good place to start.

So to do this, we'll take our dataset and we'll introduce a new operator: %>%. The best way to read that operator, in my opinion, is to interpret that as "and then do this." Here's the code:



In [4]:

    
mountainlions %>%
  group_by(COUNTY) %>%
  summarise(
    count = n()
  )









    





COUNTY        count
Banner        6
Blaine        3
Box Butte     4
Brown         15
Buffalo       3
Cedar         1
Cherry        30
Custer        8
Dakota        3
Dawes         111
Dawson        5
Dixon         3
Douglas       2
Frontier      1
Hall          1
Holt          2
Hooker        1
Howard        3
Keith         1
Keya Paha     20
Kimball       1
Knox          8
Lincoln       10
Merrick       1
Morrill       2
Nance         1
Nemaha        5
Platte        1
Polk          1
Richardson    2
Rock          11
Sarpy         1
Saunders      2
Scotts Bluff  26
sheridan      2
Sheridan      35
Sherman       1
Sioux         52
Thomas        5
Thurston      1
Valley        1
Wheeler       1

So let's walk through that. We start with our dataset -- mountainlions -- and then we tell it to group the data by a given field in the data. In this case, we wanted to group together all the records from the same county, signified by the field name COUNTY, which you could get from looking at head(mountainlions).

After we group the data, we need to count them up. In dplyr, we use summarize, which can do more than just count things. (dplyr accepts both spellings, summarize and summarise; the code here uses summarise.) Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the records in each county. So count = n() says create a new field called count and set it equal to n(), which might look weird, but it's common in stats. The number of things in a dataset? Statisticians call it n. There are n incidents in this dataset. So n() is a function that counts the number of things there are in each group.
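To see what %>% is actually doing, here is the same group-and-count written two ways on a hand-typed sample of six sightings (sample_lions, piped and nested are made-up names). The pipe simply hands the thing on its left to the function on its right as the first argument, so both versions should produce the same table.

```r
library(dplyr)

# Six sightings typed in by hand so the example runs without the CSV
sample_lions <- data.frame(
  ID = 1:6,
  COUNTY = c("Dawes", "Sioux", "Scotts Bluff",
             "Sioux", "Box Butte", "Scotts Bluff")
)

# Piped version: read top to bottom as "and then do this"
piped <- sample_lions %>%
  group_by(COUNTY) %>%
  summarise(count = n())

# The same steps as nested calls: read from the inside out
nested <- summarise(group_by(sample_lions, COUNTY), count = n())

piped

# TRUE if the two approaches gave the same result
identical(as.data.frame(piped), as.data.frame(nested))
```

The nested form works, but it gets hard to read fast as steps pile up, which is the main argument for the pipe.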

And when we run that, we get a list of counties with a count next to them. But it's not in any order. So we'll add another And Then Do This %>% and use arrange. Arrange does what you think it does -- it arranges data in order. By default, it's in ascending order -- smallest to largest. But if we want to know the county with the most mountain lion sightings, we need to sort it in descending order. That looks like this:



In [5]:

    
mountainlions %>%
  group_by(COUNTY) %>%
  summarise(
    count = n()
  ) %>%
  arrange(desc(count))









    





COUNTY        count
Dawes         111
Sioux         52
Sheridan      35
Cherry        30
Scotts Bluff  26
Keya Paha     20
Brown         15
Rock          11
Lincoln       10
Custer        8
Knox          8
Banner        6
Dawson        5
Nemaha        5
Thomas        5
Box Butte     4
Blaine        3
Buffalo       3
Dakota        3
Dixon         3
Howard        3
Douglas       2
Holt          2
Morrill       2
Richardson    2
Saunders      2
sheridan      2
Cedar         1
Frontier      1
Hall          1
Hooker        1
Keith         1
Kimball       1
Merrick       1
Nance         1
Platte        1
Polk          1
Sarpy         1
Sherman       1
Thurston      1
Valley        1
Wheeler       1
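Here is a small sketch of arrange by itself, using five of the county totals from the table above typed into a data frame (county_counts and the result names are made up). It shows the default ascending sort, the descending sort, and one way to break ties by adding a second field.

```r
library(dplyr)

# Five county totals copied from the results above
county_counts <- data.frame(
  COUNTY = c("Cherry", "Custer", "Dawes", "Knox", "Sioux"),
  count = c(30, 8, 111, 8, 52)
)

# Default: ascending, smallest to largest
smallest_first <- county_counts %>% arrange(count)

# desc() flips it: largest to smallest
largest_first <- county_counts %>% arrange(desc(count))

# A second field breaks ties, here alphabetically by county
largest_then_name <- county_counts %>% arrange(desc(count), COUNTY)

largest_then_name
```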

More basics

In the last example, we grouped some data together and counted it up, but there's so much more you can do. You can do multiple measures in a single step as well.

Let's look at some simple college data.



In [6]:

    
colleges <- read.csv("../../Data/colleges.csv")



In [7]:

    
head(colleges)









    





UnitID  Name                                       InState1213  OutOfState1213  GradRate
151351  Indiana University-Bloomington             23116        44566           75
171100  Michigan State University                  24028        43986           79
147767  Northwestern University                    60840        60840           93
204796  Ohio State University-Main Campus          24919        40327           82
214777  Pennsylvania State University-Main Campus  31854        44156           86
243780  Purdue University-Main Campus              23468        42270           69

In summarize, we can calculate any number of measures. Here, we'll use R's built-in mean and median functions to calculate ... well, you get the idea.



In [8]:

    
colleges %>%
  summarise(
    count = n(),
    instatemean = mean(InState1213),
    outstatemean = mean(OutOfState1213),
    instatemedian = median(InState1213),
    outstatemedian = median(OutOfState1213)
  )









    





count  instatemean  outstatemean  instatemedian  outstatemedian
14     27652.86     42821.5       24473.5        42194
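It's worth seeing why you'd want both a mean and a median. This sketch runs the same kind of summary on only the six schools printed by head(colleges), typed in by hand (sample_colleges and stats are made-up names), so the numbers will not match the 14-school result above.

```r
library(dplyr)

# The six rows shown by head(colleges), in-state cost only
sample_colleges <- data.frame(
  Name = c("Indiana University-Bloomington",
           "Michigan State University",
           "Northwestern University",
           "Ohio State University-Main Campus",
           "Pennsylvania State University-Main Campus",
           "Purdue University-Main Campus"),
  InState1213 = c(23116, 24028, 60840, 24919, 31854, 23468)
)

# Several measures in one summarise step
stats <- sample_colleges %>%
  summarise(
    count = n(),
    instatemean = mean(InState1213),
    instatemedian = median(InState1213)
  )

stats
```

In this small sample, one expensive school (Northwestern) pulls the mean well above the median. That gap is the kind of thing comparing the two measures is meant to surface.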

Now, what if we just wanted to see the University of Nebraska-Lincoln, so we can compare it to the mean and median? To do that, we use filter, which does what it says on the tin. You filter for the things you want (or don't want) so your results reflect only what you're looking at. In this case, we're going to get all the records where the Name equals "University of Nebraska-Lincoln".



In [14]:

    
colleges %>% filter(Name == "University of Nebraska-Lincoln")









    





UnitID  Name                            InState1213  OutOfState1213  GradRate
181464  University of Nebraska-Lincoln  21700        34450           65
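filter isn't limited to exact matches on a name. As a rough sketch, here are a few other kinds of conditions, run against the six schools printed by head(colleges) and typed in by hand (sample_colleges and the result names are made up). Note the double equals sign: == tests whether two things are equal, while a single = would be an assignment.

```r
library(dplyr)

# Three fields from the six rows shown by head(colleges)
sample_colleges <- data.frame(
  Name = c("Indiana University-Bloomington",
           "Michigan State University",
           "Northwestern University",
           "Ohio State University-Main Campus",
           "Pennsylvania State University-Main Campus",
           "Purdue University-Main Campus"),
  InState1213 = c(23116, 24028, 60840, 24919, 31854, 23468),
  GradRate = c(75, 79, 93, 82, 86, 69)
)

# Exact match on text
purdue <- sample_colleges %>%
  filter(Name == "Purdue University-Main Campus")

# Numeric comparison: graduation rate above 80
high_grad <- sample_colleges %>%
  filter(GradRate > 80)

# Two conditions separated by a comma must both be true
cheap_and_good <- sample_colleges %>%
  filter(InState1213 < 25000, GradRate >= 75)

# != keeps everything except the match
not_northwestern <- sample_colleges %>%
  filter(Name != "Northwestern University")

cheap_and_good
```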

Assignment

We're going to put it all together now. We're going to calculate the mean and median salaries of job titles at the University of Nebraska-Lincoln.

Answer this question:

What are the top median salaries by job title at UNL? And how does that compare to the average salary for that position?

To do this, you'll need to download this data.

Rubric

Did you read the data into a dataframe?
Did you use group by syntax correctly?
Did you use summarize syntax correctly?
Did you use filter syntax correctly?
Did you use Markdown comments to explain your steps?



In [ ]:
