In [1]:
###############################################################################
# Information
###############################################################################
# Created by Linwood Creekmore III
# Started Mar 2016
# For Johns Hopkins University Data Science Certification track; Reproducible Research
# Github = https://github.com/linwoodc3
options(jupyter.plot_mimetypes = 'image/png')
Reproducible analysis is all about equipping independent researchers and readers with the tools, data, and analytic code needed to reproduce the results of a study or scientific endeavor. This study uses R Markdown; read knitr in a knutshell for more information and tricks. In this context, the task explores personal activity monitoring data. The tools used for plotting and data transformation are imported below.
I also completed this assignment in a Jupyter Notebook kernel for the R language. The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text, which matches exactly what literate statistical programming aims to be (from our lectures). In this notebook, you can type and work with code interactively, and also download the finished product as Markdown, LaTeX (PDF), IPython notebook, reST, R, or HTML. Learn more about using Jupyter Notebook with R here.
In [ ]:
###############################################################################
# Check for required packages and load
###############################################################################
options(warn = -1) # suppress warning messages during package checks and loads
list.of.packages <- c("dplyr", "tidyr", "RColorBrewer","ggthemes","ggplot2")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(RColorBrewer)
library(dplyr)
library(tidyr)
library(ggthemes)
library(ggplot2)
RColorBrewer, ggplot2, and ggthemes are all used for visualization, while dplyr and tidyr support data cleaning and transformation.
In [3]:
###############################################################################
# Looking for locations of files
###############################################################################
if (!file.exists('data')) {
dir.create('data')}
if (!file.exists('./data/files')) {
dir.create('./data/files')}
This code checks whether the file exists in the local working directory or workspace; if not, it downloads the zip archive from the source and extracts it. The CSV is then loaded into a data frame called activity.
In [4]:
###############################################################################
# Download the zip file
###############################################################################
# downloading the raw zip file and saving as a temporary file or, just passing step if file exists
temp <- tempfile()
if (!file.exists('./data/files/activity.csv')) {
print(paste0("You did not have the file; downloading.... "))
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",temp)
unzip(temp,exdir = './data/files')
print(paste0("You did not have the file; download complete. Proceeding.... "))
} else {
print(paste0("You have the target zip file. Proceeding.... "))
}
In [5]:
dir("data/files")
Out[5]:
In [6]:
activity <- read.csv('./data/files/activity.csv', sep = ",", na.strings = "NA")
In [7]:
str(activity)
In [8]:
summary(activity$steps)
Out[8]:
We will use dplyr to calculate the mean total number of steps taken per day. dplyr is a great tool to group data and calculate values over these grouped sets. Below, we use the chaining method of dplyr functions to manipulate the activity data.
First, calculate the total number of steps taken per day.
In [9]:
per_day <- activity%>%
na.omit()%>% # removing the na data
group_by(date)%>%
summarise(total.steps=sum(steps),
average.steps = mean(steps))
head(per_day)
Out[9]:
Next, we create a histogram using the geom_histogram function from ggplot2. Note that we also use the Economist theme from ggthemes to create a polished look.
In [10]:
g <- ggplot(per_day,aes(total.steps))
g + geom_histogram(binwidth = 1000, aes(fill=..count..))+
theme_economist()+
stat_function(fun = dnorm, colour = "red")+
scale_y_continuous(breaks=c(0,4,8,12))+
scale_fill_continuous(guide_legend(title.position="right",title='Number of Occurrences'),
breaks = seq(0,10,by=2),
labels = seq(0,10,by=2))+
scale_colour_economist()+
labs(x="Steps taken in a single day",y="Count",
title="What is the most frequent number of steps taken in a day?")
With an understanding of frequently occurring values, let's look at the overall mean and median of the total number of steps taken per day (i.e., over the entire data set).
In [52]:
# The mean
print(paste0("The mean is ",mean(per_day$total.steps)))
# The median
print(paste0("The median is ",median(per_day$total.steps)))
In [11]:
interval_activity <- activity%>%
na.omit()%>% # removing the na data
group_by(interval)%>%
summarise(average.steps = mean(steps))
With the data selected, the next step is to make a time series plot and make the busiest interval stand out.
In [12]:
head(interval_activity)
Out[12]:
In [13]:
typeof(interval_activity$average.steps[2])
Out[13]:
In [53]:
print(paste0("The 5-minute interval with the maximum average steps is ",
interval_activity$interval[interval_activity$average.steps == max(interval_activity$average.steps)]))
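A tidier way to pull out the row with the maximum value is which.max, which returns the index of the first maximum. This is a sketch on a hypothetical toy data frame, not the activity data itself:

```r
# Toy stand-in for interval_activity; which.max gives the index of the
# first maximum, so no logical comparison against max() is needed.
df <- data.frame(interval = c(0, 5, 10), average.steps = c(1.2, 3.4, 2.0))
busiest <- df$interval[which.max(df$average.steps)]
busiest  # 5
```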
In [54]:
l <- ggplot(interval_activity,aes(x=interval,y=average.steps))
l + geom_line(aes(colour=average.steps))+
scale_x_continuous(breaks=seq(0,2500,by=250))+
theme_economist()+
geom_text(aes(label=ifelse(interval_activity$average.steps == max(interval_activity$average.steps),
as.numeric(interval_activity$interval),'')))+
labs(x="5 minute intervals",y="Average Steps",
title="What is the busiest 5-minute interval?")
In [16]:
# sum of NA
sum(is.na(activity))
Out[16]:
Just to be sure, let's see if the NAs are confined to one column or spread across several. If the missing data is spread over the columns, the per-column counts should still sum to 2304.
In [56]:
# Testing NAs across columns
sapply(activity,function(x) sum(is.na(x)))
Out[56]:
Good! The NAs are confined to the steps column. What percentage is missing?
In [57]:
# Percentage of NA columns in steps.
sum(is.na(activity$steps))/length(activity$steps)*100
Out[57]:
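As an aside, because mean() of a logical vector is the proportion of TRUE values, the same percentage can be written more compactly; a toy vector is used here rather than the real data:

```r
# mean(is.na(x)) is the fraction of missing values; multiply by 100 for percent.
x <- c(1, NA, 3, NA)
pct_missing <- mean(is.na(x)) * 100
pct_missing  # 50
```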
In [59]:
new_activity <- activity%>%
group_by(interval)%>%
mutate(steps = ifelse(is.na(steps),mean(steps,na.rm=TRUE),steps))
filled_activity <- new_activity%>%
group_by(date)%>%
summarise(total.steps=sum(steps),
average.steps = mean(steps))
head(new_activity); head(filled_activity)
Out[59]:
Out[59]:
Let's make a quick histogram to see the data.
In [60]:
g <- ggplot(filled_activity,aes(total.steps))
g + geom_histogram(binwidth = 1000, aes(fill=..count..))+
theme_economist()+
stat_function(fun = dnorm, colour = "red")+
scale_y_continuous(breaks=c(0,4,8,12))+
scale_fill_continuous(guide_legend(title.position="right",title='Number of Occurrences'),
breaks = seq(0,10,by=2),
labels = seq(0,10,by=2))+
scale_colour_economist()+
labs(x="Steps taken in a single day",y="Count",
title="Filled: What is the most frequent number of steps taken in a day?")
Here are the mean and median.
In [63]:
# The mean
print(paste0("The new mean is ",mean(filled_activity$total.steps)))
# The median
print(paste0("The new median is ",median(filled_activity$total.steps)))
The mean and median changed slightly but not much in terms of the total magnitude. The mean is essentially unchanged, which is expected since we imputed with interval means. The main impact is that the histogram counts are higher for the filled data set.
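A small toy example (not the activity data) shows why mean imputation behaves this way: the mean of the data is preserved exactly, while the median can be pulled toward the mean.

```r
# Mean imputation on a toy vector: the mean is unchanged, the median shifts.
x <- c(2, 4, 12, NA)
m <- mean(x, na.rm = TRUE)          # 6
x_filled <- ifelse(is.na(x), m, x)  # c(2, 4, 12, 6)
mean(x_filled)                      # still 6
median(x, na.rm = TRUE)             # 4 before filling
median(x_filled)                    # 5 after filling, pulled toward the mean
```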
Understanding the difference in weekday and weekend patterns means we must convert our dates to days of the week, and then classify them into weekend and weekday factors. A helpful walkthrough on Stack Overflow covers converting weekdays to weekday and weekend factors.
In [66]:
new_activity$date <- as.Date(new_activity$date)
weekdays1 <- c('Monday',"Tuesday","Wednesday","Thursday","Friday")
new_activity$wDay <- c('weekend','weekday')[(weekdays(new_activity$date) %in% weekdays1)+ 1L]
head(new_activity)
Out[66]:
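The indexing trick above works because TRUE and FALSE coerce to 1 and 0, so adding 1L yields index 2 for weekdays and 1 for weekends; a minimal sketch:

```r
# logical + 1L gives 1 (FALSE) or 2 (TRUE), which indexes the label vector.
labels <- c('weekend', 'weekday')
result <- labels[c(TRUE, FALSE) + 1L]
result  # "weekday" "weekend"
```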
Let's check the factors and build a data set with averages.
In [68]:
weekdays(new_activity$date[6])
# Building a new data set
filled_interval_activity <- new_activity%>%
group_by(interval,wDay)%>%
summarise(average.steps = mean(steps))
head(filled_interval_activity)
Out[68]:
Out[68]:
Now for the final step: a time series plot of 5-minute intervals comparing the average number of steps across all weekend days and weekdays. We simply add a facet_grid based on the newly created wDay column.
In [69]:
m <- ggplot(filled_interval_activity,aes(x=interval,y=average.steps, fill=wDay))
In [70]:
m + geom_line(aes(colour=wDay)) +
    facet_grid(.~wDay) +
    scale_fill_brewer(palette = "Paired") +
    theme_economist() +
    labs(x="5 minute intervals", y="Average Steps",
         title="Weekend vs. Weekday: What is the busiest 5-minute interval?")