Subway Data Analysis

Introduction

The NYC public transportantion system - Metro Transit Authority - provides data for download via csv files. Part of the information available are data from the subway turnstiles, containing weekly logs for cumulative entries and exits by turnstile and by subway station during a provided timeframe.

For this project, we will only use the information available at: http://web.mta.info/developers/turnstile.html.

About this project

For this project, you will apply the knowledge acquired in the first month of this course. We will practice basic data acquisition and data cleaning tasks to find out fundamental stuff about the data using what we learned in the Statistics course.

The goal of this project is to explore the relationship between data from the NYC Subway turnstiles and the city weather. For this, besides data from the subway, we will also need data from the weather in NYC.

Here are the main points that will be considered in this work:

  • Gathering data from the Internet
  • Using Statistics for Data Analysis
  • Data handling and simple graphics creation with Pandas

How to find help: We suggest that you try the following channels, in the following order:

Type of Question\Channels Google Forum Slack Email
Pandas and Python Programming 1 2 3
Projects Requiriments 1 2 3
Projects Specific Parts 1 2 3

Here is the address for each of these channels:

The student is expected to submit this report including:

  • All TODO's completed, as they are crucial for the code to run accordingly
  • The ipynb file, exported as html

To submit this project, go to the classroom, and submit your zipped .ipynb and html.

Reminders

Before we start, there are a few things you must have in mind while using iPython notebooks:

  • Remember you can see, in the left side of a code cell, when was the last time it ran, if there is a number inside the keys.
  • When starting a new session in the notebook, please make sure to run all cells up to the point where you last left it. Even if the output can still be viewed from the moment you ran your cells in the previews session, the kernel starts in a new state, so you will need to reload all data, etc. in a new session.
  • The previous point is useful to have in mind if your answers do not match what is expected from the quizzes in the classroom. Try reloading the data and running all processing steps, one by one, to make sure you're working with the same variables and data from each step of the quizz.

Session 1 - Data Gathering

Exercise 1.1

Let's do it!! Now it's your turn to gather data. Please write bellow a Python code to access the link http://web.mta.info/developers/turnstile.html and download all files from June 2017. The file must be named turnstile_100617.txt, where 10/06/17 is the file's date.

Please see below a few commands that might help you:

Use the urllib library to open and redeem a webpage. Use the command below, where url is the webpage path to the following file:

u = urllib.urlopen(url)
html = u.read()

Use the BeautifulSoup library to search for the link to the file you want to donwload in the page. Use the command below to create your soup object and search for all 'a' tags in the document:

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')

A tip to only download the files from June is to check data in the name of the file. For instance, to donwload the 17/06/2017 file, please see if the link ends with "turnstile_170610.txt". If you forget to do this, you will download all files from that page. In order to do this, you can use the following command:

if '1706' in link.get('href'):

Our final tip is to use the command bellow to download the txt file:

urllib.urlretrieve(link_do_arquivo, filename)

Please remember - you first have to load all packages and functions that will be used in your analysys.


In [1]:
import urllib
from bs4 import BeautifulSoup

#your code here

Exercise 1.2

Write down a function that takes the list of all names of the files you downloaded in Exercise 1.1 and compile them into one single file. There must be only one header line in the output file.

For example, if file_1 has: line 1... line 2...

and the other file, file_2, has: line 3... line 4... line 5...

We must combine file_1 and file_2 into one master file, as follows:

'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn' line 1... line 2... line 3... line 4... line 5...


In [ ]:
def create_master_turnstile_file(filenames, output_file):
    with open(output_file, 'w') as master_file:
        master_file.write('C/A,UNIT,SCP,STATION, LINENAME, DIVISION, DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
        for filename in filenames:
            # your code here

Exercise 1.3

For this exercise, you will write a function that reads the master_file created in the previous exercise and load it into a Pandas Dataframe. This function can be filtered, so that the Dataframe only has lines where column "DESCn" has the value "Regular".

For example, if the Pandas Dataframe looks like this:

,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231

The Dataframe must look like the following, after filtering only the lines where column DESCn has the value REGULAR:

0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177

In [ ]:
import pandas

def filter_by_regular(filename):
    
    turnstile_data = # your code here
    # more of your code here
    return turnstile_data

Exercise 1.4

The NYC Subway data has cumulative entry and exit data in each line. Let's assume you have a Dataframe called df, which contains only lines for one particular turnstile (unique SCP, C/A, and UNIT). The following function must change these cumulative entries for counting all entries since the last reading (entries from the last line of the Dataframe).

More specifically, there are two things you should do:

1 - Create a new column, called ENTRIESn_hourly 2 - Insert in this column the difference between ENTRIESn in the current and the previous column. If the line has any NAN, fill it out/replace by 1.

Tip: The funtions shift() and fillna() in Pandas might be usefull for this exercise.

Below you will find and example of how your Dataframe should look by the end of this exercise:

    C/A  UNIT       SCP     DATEn     TIMEn    DESCn  ENTRIESn    EXITSn  ENTRIESn_hourly
0     A002  R051  02-00-00  05-01-11  00:00:00  REGULAR   3144312   1088151                1
1     A002  R051  02-00-00  05-01-11  04:00:00  REGULAR   3144335   1088159               23
2     A002  R051  02-00-00  05-01-11  08:00:00  REGULAR   3144353   1088177               18
3     A002  R051  02-00-00  05-01-11  12:00:00  REGULAR   3144424   1088231               71
4     A002  R051  02-00-00  05-01-11  16:00:00  REGULAR   3144594   1088275              170
5     A002  R051  02-00-00  05-01-11  20:00:00  REGULAR   3144808   1088317              214
6     A002  R051  02-00-00  05-02-11  00:00:00  REGULAR   3144895   1088328               87
7     A002  R051  02-00-00  05-02-11  04:00:00  REGULAR   3144905   1088331               10
8     A002  R051  02-00-00  05-02-11  08:00:00  REGULAR   3144941   1088420               36
9     A002  R051  02-00-00  05-02-11  12:00:00  REGULAR   3145094   1088753              153
10    A002  R051  02-00-00  05-02-11  16:00:00  REGULAR   3145337   1088823              243

In [ ]:
import pandas

def get_hourly_entries(df):
    
    
    #your code here
    return df

Exercise 1.5

Do the same thing you did in the previous exercise, but taking into account the exits, column EXITSn. For this, you need to create a column called EXITSn_hourly and insert the difference between the column EXITSn in the current line vs he previous line. If there is any NaN, fill it out/replace by 0.


In [ ]:
import pandas

def get_hourly_exits(df):
    
    #your code here
    return df

Exercise 1.6

Given an entry variable that represents time, in the format:      "00:00:00" (hour: minutes: seconds)      Write a function to extract the hour part from the time in the entry variable And return it as an integer. For example:          1) if hour is 00, your code must return 0          2) if hour is 01, your code must return 1          3) if hour is 21, your code must return 21          Please return te hour as an integer.


In [ ]:
def time_to_hour(time):
    
    hour = # your code here
    return hour

Exercise 2 - Data Analysis

Exercise 2.1

To understand the relationship between the Subway activity and the weather, please complete the data from the file already downloaded with the weather data. We provided you with the file containing NYC weather data and made it available with the Support Material. You can access it through the link: https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv

Now that we have our data in a csv file, write Python code that reads this file and saves it into a Pandas Dataframe.

Tip:

Use the command below to read the file:

pd.read_csv('output_list.txt', sep=",")

In [2]:
import pandas as pd

filename = "turnstile_data_master_with_weather.csv"


#your code here

Exercise 2.2

Now, create a function that calculates the number of rainy days. For this, return the count of the number of days where the column "rain" is equal to 1.

Tip: You might think that interpreting numbers as integers or floats might not      work at first. To handle this issue, it might be useful to convert      these numbers into integers. You can do this by writting cast (column as integer).      So, for example, if we want to launch the column maxtempi as an integer, we have to      write something like cast (maxtempi as integer) = 76, instead of just      where maxtempi = 76.


In [6]:
def num_rainy_days(df):

    


    #your code here
    return

Exercise 2.3

Calculate if the day was cloudy or not (0 or 1) and the maximum temperature for fog (i.e. the maximum temperature      for cloudy days).


In [8]:
def max_temp_aggregate_by_fog(df):
    
    #your code here 
    return

*Exercise 2.4

Now, calculate the mean for 'meantempi' for the days that are Saturdays or Sundays (weekend):


In [ ]:
def avg_weekend_temperature(filename):
  
    mean_temp_weekends = 
    
    return mean_temp_weekends

*Exercise 2.5

Calculate the mean of the minimum temperature 'mintempi' for the days when the minimum temperature was greater that 55 degrees:


In [ ]:
def avg_min_temperature(filename):

    avg_min_temp_rainy = 
    
    return avg_min_temp_rainy

*Exercise 2.6

Before you make any analysis, it might be useful to look at the data we want to analyse. More specifically, we will evaluate the entries by hour in our data from the NYC Subway to determine the data distribution. This data is stored in the column ['ENTRIESn_hourly'].      Draw two histogramns in the same axis, to show the entries when it's raining vs when it's not. Below, you will find an example of how to draw histogramns with Pandas and Matplotlib:     

Turnstile_weather ['column_to_graph']. Hist ()

In [ ]:
import numpy as np
import pandas
import matplotlib.pyplot as plt

def entries_histogram(turnstile_weather):
    
    
    
    plt.figure()
    turnstile_weather['...'] # your code here to plot a historgram for hourly entries when it is raining
    turnstile_weather['...'] # your code here to plot a histogram for hourly entries when it is not raining
    return plt

*Exercise 2.7

The data you just plotted is in what kind of distribution? Is there a difference in distribution between rainy and non-rainy days?

Answer : Replace this text with your answer!

*Exercise 2.8

Build a function that returns:

  1. The mean of entries when it's raining
  2. The mean of entries when it's not raining

In [ ]:
import numpy as np

import pandas

def means(turnstile_weather):
    
    
    ### YOUR CODE HERE ###
    
    return with_rain_mean, without_rain_mean, p # leave this line for the grader

Answer to the following questions according to your functions' exits:

  1. What is the mean of entries when it's raining?
  2. What is the mean of entries when it's not raining?

Answer : Replace this text with your answer!

Exercise 3 - Map Reduce

Exercise 3.1

The entry for this exercise is the same file from the previous session (Exercise 2). You can download the file from this link:

https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv

Now, we will create a mapper. For each entry line, the mapper exit must PRINT (not return) UNIT as a key, and the number of ENTRIESn_hourly as the value. Separate the key and the value with a tab. For example: 'R002 \ t105105.0'

Export your mapper into a file named mapper_result.txt and send it with your submission. The code for exporting your mapper is already written in the code bellow.


In [ ]:
import sys

def mapper():
    

    for line in sys.stdin:
        # your code here


mapper()
sys.stdin = open('turnstile_data_master_with_weather.csv')
sys.stdout = open('mapper_result.txt', 'w')

Exercise 3.2

Now, create the reducer. Given the mapper result from the previous exercise, the reducer must print (not return) one line per UNIT, with the total number of ENTRIESn_hourly during May (which is our data duration), separated by a tab. An example of exit line from the reducer may look like this: 'R001 \ t500625.0'

You can assume that the entry for the reducer is ordered in a way that all lines corresponding to a particular unit are grouped. However, the reducer exit will have repetition, as there are stores that appear in different files' locations.

Export your reducer into a file named reducer_result.txt and send it with your submission.


In [ ]:
def reducer():
    

    for line in sys.stdin:
        # your code here

        
reducer()