In [1]:
###############################################################################
# Information
###############################################################################

# Created by Linwood Creekmore III
# Started Mar 2016
# For Johns Hopkins University Data Science Certification track; Reproducible Research
# Github = https://github.com/linwoodc3

options(jupyter.plot_mimetypes = 'image/png')

The Purpose of Reproducible Research

Reproducible analysis is all about equipping independent researchers and readers with the tools, data, and analytic code needed to reproduce the results of a study or scientific endeavor. This study uses RMarkdown; read knitr in a knutshell for more information and tricks. In this context, the task explores personal activity monitoring data. The tools used for plotting and data transformation are imported below.

Alternative Literate Statistical Programming

I also completed this assignment in a Jupyter Notebook kernel for the R language. The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text, which matches exactly what literate statistical programming aims to be (from our lectures). In this notebook, you can type and work with code interactively, and also download the finished product as Markdown, LaTeX (PDF), IPython notebook, reST, R, or HTML. Learn more about using Jupyter Notebook with R here.
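
If you want to run this notebook yourself, R is registered as a Jupyter kernel through the IRkernel package. A minimal sketch, assuming Jupyter itself is already installed (depending on your setup, IRkernel may need to be installed from GitHub rather than CRAN):

# Register R as a Jupyter kernel for the current user (assumes Jupyter is installed)
install.packages('IRkernel')
IRkernel::installspec()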


In [ ]:
###############################################################################
# Check for required packages and load 
###############################################################################
options(warn = -1) # suppress warning noise while checking and loading packages
list.of.packages <- c("dplyr", "tidyr", "RColorBrewer","ggthemes","ggplot2")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(RColorBrewer)
library(dplyr)
library(tidyr)
library(ggthemes)
library(ggplot2)

RColorBrewer, ggplot2, and ggthemes are all used for visualization, while dplyr and tidyr support data cleaning and transformation.


In [3]:
###############################################################################
# Looking for locations of files
###############################################################################

if (!file.exists('data')) {
        dir.create('data')}
if (!file.exists('./data/files')) {
        dir.create('./data/files')}

Loading and preprocessing the data

This code checks whether the data file already exists in the local working directory or workspace; if not, it downloads the zip archive from the source, extracts the csv, and loads it into a dataframe called activity.


In [4]:
###############################################################################
# Download the zip file
###############################################################################

# downloading the raw zip file and saving as a temporary file or, just passing step if file exists
temp <- tempfile()
if (!file.exists('./data/files/activity.csv')) {
        print(paste0("You did not have the file; downloading.... "))        
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",temp)
        unzip(temp,exdir = './data/files')
        print(paste0("You did not have the file; download complete. Proceeding.... "))
} else {
        print(paste0("You have the target zip file. Proceeding.... "))        
}


[1] "You have the target zip file. Proceeding.... "

In [5]:
dir("data/files")


Out[5]:
'activity.csv'

In [6]:
activity <- read.csv('./data/files/activity.csv', sep = ",", na.strings = "NA")

In [7]:
str(activity)


'data.frame':	17568 obs. of  3 variables:
 $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

In [8]:
summary(activity$steps)


Out[8]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00   37.38   12.00  806.00    2304 

What is mean total number of steps taken per day?

We will use dplyr to calculate the mean total number of steps taken per day. dplyr is a great tool to group data and calculate values over these grouped sets. Below, we use the chaining method of dplyr functions to manipulate the activity data.

First, calculate the total number of steps taken per day.


In [9]:
per_day <- activity%>%
    na.omit()%>% # removing the na data
    group_by(date)%>%
    summarise(total.steps=sum(steps),
              average.steps = mean(steps))

head(per_day)


Out[9]:
  date       total.steps average.steps
1 2012-10-02         126     0.4375
2 2012-10-03       11352    39.41667
3 2012-10-04       12116    42.06944
4 2012-10-05       13294    46.15972
5 2012-10-06       15420    53.54167
6 2012-10-07       11015    38.24653


Next we create a histogram using the geom_histogram function from ggplot2. Note that we also use the Economist theme from ggthemes for a polished look.


In [10]:
g <- ggplot(per_day, aes(total.steps))
g + geom_histogram(binwidth = 1000, aes(fill = ..count..)) +
        theme_economist() +
        scale_y_continuous(breaks = c(0, 4, 8, 12)) +
        scale_fill_continuous(name = 'Number of Occurrences',
                              guide = guide_legend(title.position = "right"),
                              breaks = seq(0, 10, by = 2),
                              labels = seq(0, 10, by = 2)) +
        scale_colour_economist() +
        labs(x = "Steps taken in a single day", y = "Count",
             title = "What is the most frequent number of steps taken in a day?")


With an understanding of frequently occurring values, let's look at the overall mean and median of the total number of steps taken per day (i.e., over the entire data set).


In [52]:
# The mean
print(paste0("The mean is ",mean(per_day$total.steps)))

# The median
print(paste0("The median is ",median(per_day$total.steps)))


[1] "The mean is 10766.1886792453"
[1] "The median is 10765"

What is the average daily activity pattern?

The next exercise is to create a time series plot of the average number of steps taken in each 5-minute interval, averaged across all days. Again, we use dplyr to index and select the correct data.


In [11]:
interval_activity <- activity%>%
    na.omit()%>% # removing the na data
    group_by(interval)%>%
    summarise(average.steps = mean(steps))

With the data selected, the next step is to make a time series plot and make the busiest interval stand out.


In [12]:
head(interval_activity)


Out[12]:
  interval average.steps
1        0     1.716981
2        5     0.3396226
3       10     0.1320755
4       15     0.1509434
5       20     0.0754717
6       25     2.09434

In [13]:
typeof(interval_activity$average.steps[2])


Out[13]:
'double'

In [53]:
print(paste0("The 5-minute interval with the maximum average steps is ",
interval_activity$interval[interval_activity$average.steps == max(interval_activity$average.steps)]))


[1] "The 5-minute interval with the maximum average steps is 835"

In [54]:
l <- ggplot(interval_activity, aes(x = interval, y = average.steps))
l + geom_line(aes(colour = average.steps)) +
        scale_x_continuous(breaks = seq(0, 2500, by = 250)) +
        theme_economist() +
        # label only the interval with the maximum average steps
        geom_text(aes(label = ifelse(average.steps == max(average.steps),
                                     interval, ''))) +
        labs(x = "5 minute intervals", y = "Average Steps",
             title = "What is the busiest 5-minute interval?")


Imputing missing values

The next step in the analysis is to calculate and report the total number of missing values in the data set.


In [16]:
# sum of NA

sum(is.na(activity))


Out[16]:
2304

Just to be sure, let's check whether the NAs are confined to one column or spread across several. The per-column counts should sum to 2304.


In [56]:
# Testing NAs across columns

sapply(activity,function(x) sum(is.na(x)))


Out[56]:
   steps     date interval 
    2304        0        0 

Good! The NAs are confined to the steps column. What percentage is missing?


In [57]:
# Percentage of NA values in the steps column

sum(is.na(activity$steps))/length(activity$steps)*100


Out[57]:
13.1147540983607

Strategy to fill missing values

About 13% missing. Not bad! Let's replace each missing value with the mean for its 5-minute interval and create a new filled_activity dataset. To be clear, this is a simple imputation that inserts the interval average in place of the NA values, implemented with dplyr chaining.


In [59]:
# Impute: replace each NA with the mean for that 5-minute interval
new_activity <- activity%>%
    group_by(interval)%>%
    mutate(steps = ifelse(is.na(steps), mean(steps, na.rm = TRUE), steps))

# Recompute daily totals and averages from the filled data
filled_activity <- new_activity%>%
    group_by(date)%>%
    summarise(total.steps = sum(steps),
              average.steps = mean(steps))

head(new_activity); head(filled_activity)


Out[59]:
      steps       date interval
1 1.716981  2012-10-01        0
2 0.3396226 2012-10-01        5
3 0.1320755 2012-10-01       10
4 0.1509434 2012-10-01       15
5 0.0754717 2012-10-01       20
6 2.09434   2012-10-01       25
Out[59]:
  date       total.steps average.steps
1 2012-10-01    10766.19    37.3826
2 2012-10-02         126     0.4375
3 2012-10-03       11352    39.41667
4 2012-10-04       12116    42.06944
5 2012-10-05       13294    46.15972
6 2012-10-06       15420    53.54167

Let's make a quick histogram to see the data.


In [60]:
g <- ggplot(filled_activity, aes(total.steps))
g + geom_histogram(binwidth = 1000, aes(fill = ..count..)) +
        theme_economist() +
        scale_y_continuous(breaks = c(0, 4, 8, 12)) +
        scale_fill_continuous(name = 'Number of Occurrences',
                              guide = guide_legend(title.position = "right"),
                              breaks = seq(0, 10, by = 2),
                              labels = seq(0, 10, by = 2)) +
        scale_colour_economist() +
        labs(x = "Steps taken in a single day", y = "Count",
             title = "Filled: What is the most frequent number of steps taken in a day?")


Here are the mean and median.


In [63]:
# The mean
print(paste0("The new mean is ",mean(filled_activity$total.steps)))

# The median
print(paste0("The new sum is ",median(filled_activity$total.steps)))


[1] "The new mean is 10766.1886792453"
[1] "The new sum is 10766.1886792453"

The mean is unchanged, and the median moved slightly to match the mean. The visible impact is that the histogram counts are higher for the filled data set.
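
A short check makes the reason clear: the 2304 NAs cover eight entire days (2304 / 288), and each of those days is imputed to the average daily profile, so its total equals the original mean total. The grand mean is therefore unchanged, and eight identical values at the center pull the median onto the mean. A minimal sketch:

# The filled-in days each sum to the original mean total, leaving the mean
# unchanged; the eight identical center values also drag the median there.
all.equal(mean(per_day$total.steps), mean(filled_activity$total.steps))  # TRUE
sum(!(filled_activity$date %in% per_day$date))                           # 8 previously all-NA days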

Are there differences in activity patterns between weekdays and weekends?

Understanding the difference between weekday and weekend patterns means we must convert our dates to days of the week and then classify them as weekend or weekday factors. A great walkthrough on Stack Overflow covers converting weekdays to weekday and weekend factors.


In [66]:
new_activity$date <- as.Date(new_activity$date)
# weekdays() is locale-dependent; these names assume an English locale
weekdays1 <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
# The logical test indexes into c('weekend','weekday'): FALSE + 1L = 1, TRUE + 1L = 2
new_activity$wDay <- c('weekend','weekday')[(weekdays(new_activity$date) %in% weekdays1) + 1L]
head(new_activity)


Out[66]:
      steps       date interval wDay
1 1.716981  2012-10-01        0 weekday
2 0.3396226 2012-10-01        5 weekday
3 0.1320755 2012-10-01       10 weekday
4 0.1509434 2012-10-01       15 weekday
5 0.0754717 2012-10-01       20 weekday
6 2.09434   2012-10-01       25 weekday

Let's check the factors and build a data set with averages.


In [68]:
weekdays(new_activity$date[6])

# Building a new data set
filled_interval_activity <- new_activity%>%
    group_by(interval,wDay)%>%
    summarise(average.steps = mean(steps))


head(filled_interval_activity)


Out[68]:
'Monday'
Out[68]:
  interval wDay    average.steps
1        0 weekday    2.251153
2        0 weekend    0.2146226
3        5 weekday    0.445283
4        5 weekend    0.04245283
5       10 weekday    0.1731656
6       10 weekend    0.01650943

Now to the final step: a time series plot of the 5-minute intervals comparing the average number of steps on weekdays versus weekends. We simply add a facet_grid based on the newly created wDay column.


In [69]:
m <- ggplot(filled_interval_activity,aes(x=interval,y=average.steps, fill=wDay))

In [70]:
m + geom_line(aes(colour=wDay)) +
        facet_grid(. ~ wDay) +
        scale_fill_brewer(palette = "Paired") +
        theme_economist() +
        labs(x = "5 minute intervals", y = "Average Steps",
             title = "Weekend vs. Weekday: What is the busiest 5-minute interval?")