R is a statistical programming language that is purpose-built for data analysis.
Base R does a lot, but there is a mountain of external libraries that make R better, easier, and more fully featured. One of the best libraries, in your professor's opinion, is dplyr, a library for working with data. To use dplyr, you need to load it with library(). If R complains that there is no package called dplyr, install it once with install.packages("dplyr") and try again.
In [1]:
library(dplyr)
The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we're going to read data from a csv file -- a comma-separated values file.
The code looks like this:
mountainlions <- read.csv("../../Data/mountainlions.csv")
Let's unpack that.
The first part -- mountainlions -- is the name of your variable. A variable is just a name for a thing. In this case, our variable is a data frame, which is R's way of storing data in rows and columns. We can call it whatever we want. I always name data frames after what is in them. In this case, we're going to import a dataset of mountain lion sightings from the Nebraska Game and Parks Commission.
The <- bit is the variable assignment operator. It's how we know we're assigning something to a name.
The read.csv bit is the function that reads a CSV file into a data frame. What goes inside the quote marks is the path to the data. That's where I tell R where to find the file. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as your notebook. In my case, I've got a folder called Data that's two levels up from my work folder. Each ../ means move up one level. So move up one level, move up one level, find Data, then in there is a file called mountainlions.csv.
What you put in there will be different from mine. So your first task is to import the data.
In [2]:
mountainlions <- read.csv("../../Data/mountainlions.csv")
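If you want to practice the import step before you track down your own file, here is a minimal sketch that builds a throwaway CSV and reads it back. The file name, column names, and rows are made up for illustration; they are not the real mountain lion data.

```r
# Hypothetical stand-in data: three made-up sightings, not the real dataset.
toy <- data.frame(
  ID = 1:3,
  COUNTY = c("Alpha", "Beta", "Alpha")
)

# Write the toy data to R's temporary folder so the example is self-contained.
toy_path <- file.path(tempdir(), "toy_sightings.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Check that the path points at a real file before reading it.
# If this is FALSE for your own path, the path is the problem, not the data.
file.exists(toy_path)

# Read the CSV into a data frame, the same way we read mountainlions.csv.
sightings <- read.csv(toy_path)
sightings
```

If file.exists() comes back FALSE for your own path, running getwd() shows the folder R is starting from, which is the folder that ../ moves up from.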
Now we can inspect the data we imported. What does it look like? To do that, we use head(mountainlions) to show the headers and the first six rows of data. If we wanted to see them all, we could simply enter mountainlions and run it.
To get the number of records in our dataset, we run nrow(mountainlions).
In [3]:
head(mountainlions)
nrow(mountainlions)
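Here is a small sketch of those inspection functions on a made-up data frame, so you can see what each one returns without needing the real file. The toy data and its column names are assumptions for illustration only.

```r
# Hypothetical data: ten made-up rows.
toy <- data.frame(
  ID = 1:10,
  COUNTY = rep(c("Alpha", "Beta"), times = 5)
)

head(toy)     # column names plus the first six rows
head(toy, 3)  # pass a number to see a different number of rows
nrow(toy)     # number of rows (records): 10
ncol(toy)     # number of columns (fields): 2
```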
So what if we wanted to know how many mountain lion sightings there were in each county? To do that by hand, we'd have to take each of the 393 records and sort it into a pile for its county. Then we'd count each pile.
dplyr has a group_by function that does just this. A massive amount of data analysis involves grouping like things together at some point. So it's a good place to start.
So to do this, we'll take our dataset and we'll introduce a new operator: %>%. The best way to read that operator, in my opinion, is as "and then do this." Here's the code:
In [4]:
mountainlions %>%
  group_by(COUNTY) %>%
  summarise(
    count = n()
  )
So let's walk through that. We start with our dataset -- mountainlions -- and then we tell it to group the data by a given field in the data. In this case, we wanted to group together all the counties, signified by the field name COUNTY, which you could get from looking at head(mountainlions). After we group the data, we need to count the records in each group. In dplyr, we use summarize (spelled summarise in the code; dplyr accepts both spellings), which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the sightings in each county. So count = n() says create a new field, called count, and set it equal to n(), which might look weird, but it's common in stats. The number of things in a dataset? Statisticians call it n. There are n incidents in this dataset. So n() is a function that counts the number of rows in each group.
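To make the grouping and counting concrete, here is a minimal sketch on five made-up rows. The data frame and county names are hypothetical; the point is only to show what group_by, summarise, and n() do together, and that the pipe is just another way of writing nested function calls.

```r
library(dplyr)

# Hypothetical data: five made-up sightings in three made-up counties.
toy <- data.frame(
  ID = 1:5,
  COUNTY = c("Alpha", "Beta", "Alpha", "Gamma", "Alpha")
)

# Read as: take toy, and then group by COUNTY, and then count each group.
counts <- toy %>%
  group_by(COUNTY) %>%
  summarise(count = n())

counts  # one row per county: Alpha 3, Beta 1, Gamma 1

# The same steps without the pipe, written inside-out.
counts_nested <- summarise(group_by(toy, COUNTY), count = n())
```

Both versions give the same counts. The piped one reads top to bottom in the order the steps happen, which is why I prefer it.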
When we run the group-and-count code, we get a list of counties with a count next to each one. But the list is ordered by county name, not by count. So we'll add another And Then Do This %>% and use arrange. Arrange does what you think it does -- it arranges data in order. By default, it's in ascending order -- smallest to largest. But if we want to know the county with the most mountain lion sightings, we need to sort it in descending order. That looks like this:
In [5]:
mountainlions %>%
  group_by(COUNTY) %>%
  summarise(
    count = n()
  ) %>%
  arrange(desc(count))
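To see the difference between the default order and descending order, here is a small sketch on a made-up table of counts. The county names and numbers are hypothetical.

```r
library(dplyr)

# Hypothetical summary table: made-up counties and counts.
counts <- data.frame(
  COUNTY = c("Alpha", "Beta", "Gamma"),
  count = c(3, 1, 2)
)

# Default: ascending, smallest count first (Beta, Gamma, Alpha).
ascending <- counts %>% arrange(count)

# Wrapping the field in desc() flips it: largest count first (Alpha, Gamma, Beta).
descending <- counts %>% arrange(desc(count))

ascending
descending
```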
In [6]:
colleges <- read.csv("../../Data/colleges.csv")
In [7]:
head(colleges)
In summarize, we can calculate any number of measures. Here, with the colleges data we just read in, we'll use R's built-in mean and median functions to calculate ... well, you get the idea.
In [8]:
colleges %>%
  summarise(
    count = n(),
    instatemean = mean(InState1213),
    outstatemean = mean(OutOfState1213),
    instatemedian = median(InState1213),
    outstatemedian = median(OutOfState1213)
  )
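One thing worth knowing when you try this on your own data: if a column has missing values (NA), mean and median return NA unless you pass na.rm = TRUE. Whether the colleges file has any missing values isn't shown here, so this sketch uses made-up numbers. The school names and the Price column are hypothetical.

```r
library(dplyr)

# Hypothetical data: made-up prices, one of them missing.
toy <- data.frame(
  Name = c("School A", "School B", "School C", "School D"),
  Price = c(1000, 2000, 9000, NA)
)

stats <- toy %>%
  summarise(
    count = n(),                                  # 4 rows, missing or not
    price_mean = mean(Price),                     # NA, because one value is missing
    price_mean_known = mean(Price, na.rm = TRUE), # 4000
    price_median_known = median(Price, na.rm = TRUE) # 2000
  )

stats
```

Notice the mean (4000) is double the median (2000) because one expensive school pulls the average up. That gap is the reason to look at both.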
Now, what if we just wanted to see the University of Nebraska-Lincoln, so we can compare it to the mean and median? To do that, we use filter, which does what it says on the tin. You can filter for the things you want (or don't want) so your numbers reflect just the things you are looking at. In this case, we're going to get all the records where the Name equals "University of Nebraska-Lincoln".
In [14]:
colleges %>% filter(Name == "University of Nebraska-Lincoln")
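Here is a short sketch of filter keeping rows you want, dropping rows you don't want, and filtering on a number instead of a name. The data frame, school names, and Price column are made up for illustration.

```r
library(dplyr)

# Hypothetical data: three made-up schools.
toy <- data.frame(
  Name = c("School A", "School B", "School C"),
  Price = c(1000, 2000, 9000)
)

# Keep only the rows where Name equals "School B". Note the double ==.
only_b <- toy %>% filter(Name == "School B")

# Keep everything except "School B". != means "not equal to".
not_b <- toy %>% filter(Name != "School B")

# Filters work on numbers too: keep the rows where Price is above 1500.
expensive <- toy %>% filter(Price > 1500)
```

A single = inside filter is a common slip. R needs == to test whether two things are equal.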
We're going to put it all together now. We're going to calculate the mean and median salaries of job titles at the University of Nebraska-Lincoln.
Answer this question:
What are the top median salaries by job title at UNL? And how does that compare to the average salary for that position?
To do this, you'll need to download this data.
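If you're not sure how the pieces fit together, here is a sketch of the general pattern on made-up data: group, summarize with more than one measure, then arrange. The column names Position and Salary are placeholders I invented, and the numbers are fake. Run head() on the real data to find the actual field names before adapting this.

```r
library(dplyr)

# Hypothetical data: made-up job titles and salaries.
toy <- data.frame(
  Position = c("Professor", "Professor", "Lecturer", "Lecturer", "Lecturer"),
  Salary = c(100, 140, 50, 60, 100)
)

by_title <- toy %>%
  group_by(Position) %>%                 # one group per job title
  summarise(
    count = n(),                         # how many people hold the title
    median_salary = median(Salary),      # the middle salary in the group
    mean_salary = mean(Salary)           # the average salary in the group
  ) %>%
  arrange(desc(median_salary))           # highest median first

by_title
```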
In [ ]: