The NYC public transportantion system - Metro Transit Authority - provides data for download via csv files. Part of the information available are data from the subway turnstiles, containing weekly logs for cumulative entries and exits by turnstile and by subway station during a provided timeframe.
For this project, we will only use the information available at: http://web.mta.info/developers/turnstile.html.
For this project, you will apply the knowledge acquired in the first month of this course. We will practice basic data acquisition and data cleaning tasks to find out fundamental stuff about the data using what we learned in the Statistics course.
The goal of this project is to explore the relationship between data from the NYC Subway turnstiles and the city weather. For this, besides data from the subway, we will also need data from the weather in NYC.
Here are the main points that will be considered in this work:
PandasHow to find help: We suggest that you try the following channels, in the following order:
| Type of Question\Channels | Forum | Slack | ||
|---|---|---|---|---|
| Pandas and Python Programming | 1 | 2 | 3 | |
| Projects Requiriments | 1 | 2 | 3 | |
| Projects Specific Parts | 1 | 2 | 3 |
Here is the address for each of these channels:
The student is expected to submit this report including:
To submit this project, go to the classroom, and submit your zipped .ipynb and html.
Before we start, there are a few things you must have in mind while using iPython notebooks:
Let's do it!! Now it's your turn to gather data. Please write bellow a Python code to access the link http://web.mta.info/developers/turnstile.html and download all files from June 2017. The file must be named turnstile_100617.txt, where 10/06/17 is the file's date.
Please see below a few commands that might help you:
Use the urllib library to open and redeem a webpage. Use the command below, where url is the webpage path to the following file:
u = urllib.urlopen(url)
html = u.read()
Use the BeautifulSoup library to search for the link to the file you want to donwload in the page. Use the command below to create your soup object and search for all 'a' tags in the document:
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')
A tip to only download the files from June is to check data in the name of the file. For instance, to donwload the 17/06/2017 file, please see if the link ends with "turnstile_170610.txt". If you forget to do this, you will download all files from that page. In order to do this, you can use the following command:
if '1706' in link.get('href'):
Our final tip is to use the command bellow to download the txt file:
urllib.urlretrieve(link_do_arquivo, filename)
Please remember - you first have to load all packages and functions that will be used in your analysys.
In [1]:
import urllib
from bs4 import BeautifulSoup
#your code here
Write down a function that takes the list of all names of the files you downloaded in Exercise 1.1 and compile them into one single file. There must be only one header line in the output file.
For example, if file_1 has: line 1... line 2...
and the other file, file_2, has: line 3... line 4... line 5...
We must combine file_1 and file_2 into one master file, as follows:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn' line 1... line 2... line 3... line 4... line 5...
In [ ]:
def create_master_turnstile_file(filenames, output_file):
with open(output_file, 'w') as master_file:
master_file.write('C/A,UNIT,SCP,STATION, LINENAME, DIVISION, DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
for filename in filenames:
# your code here
For this exercise, you will write a function that reads the master_file created in the previous exercise and load it into a Pandas Dataframe. This function can be filtered, so that the Dataframe only has lines where column "DESCn" has the value "Regular".
For example, if the Pandas Dataframe looks like this:
,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
The Dataframe must look like the following, after filtering only the lines where column DESCn has the value REGULAR:
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
In [ ]:
import pandas
def filter_by_regular(filename):
turnstile_data = # your code here
# more of your code here
return turnstile_data
The NYC Subway data has cumulative entry and exit data in each line. Let's assume you have a Dataframe called df, which contains only lines for one particular turnstile (unique SCP, C/A, and UNIT). The following function must change these cumulative entries for counting all entries since the last reading (entries from the last line of the Dataframe).
More specifically, there are two things you should do:
1 - Create a new column, called ENTRIESn_hourly 2 - Insert in this column the difference between ENTRIESn in the current and the previous column. If the line has any NAN, fill it out/replace by 1.
Tip: The funtions shift() and fillna() in Pandas might be usefull for this exercise.
Below you will find and example of how your Dataframe should look by the end of this exercise:
C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly
0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 1
1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23
2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18
3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71
4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170
5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214
6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87
7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10
8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36
9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153
10 A002 R051 02-00-00 05-02-11 16:00:00 REGULAR 3145337 1088823 243
In [ ]:
import pandas
def get_hourly_entries(df):
#your code here
return df
Do the same thing you did in the previous exercise, but taking into account the exits, column EXITSn. For this, you need to create a column called EXITSn_hourly and insert the difference between the column EXITSn in the current line vs he previous line. If there is any NaN, fill it out/replace by 0.
In [ ]:
import pandas
def get_hourly_exits(df):
#your code here
return df
Given an entry variable that represents time, in the format: "00:00:00" (hour: minutes: seconds) Write a function to extract the hour part from the time in the entry variable And return it as an integer. For example: 1) if hour is 00, your code must return 0 2) if hour is 01, your code must return 1 3) if hour is 21, your code must return 21 Please return te hour as an integer.
In [ ]:
def time_to_hour(time):
hour = # your code here
return hour
To understand the relationship between the Subway activity and the weather, please complete the data from the file already downloaded with the weather data. We provided you with the file containing NYC weather data and made it available with the Support Material. You can access it through the link: https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
Now that we have our data in a csv file, write Python code that reads this file and saves it into a Pandas Dataframe.
Tip:
Use the command below to read the file:
pd.read_csv('output_list.txt', sep=",")
In [2]:
import pandas as pd
filename = "turnstile_data_master_with_weather.csv"
#your code here
Now, create a function that calculates the number of rainy days. For this, return the count of the number of days where the column "rain" is equal to 1.
Tip: You might think that interpreting numbers as integers or floats might not work at first. To handle this issue, it might be useful to convert these numbers into integers. You can do this by writting cast (column as integer). So, for example, if we want to launch the column maxtempi as an integer, we have to write something like cast (maxtempi as integer) = 76, instead of just where maxtempi = 76.
In [6]:
def num_rainy_days(df):
#your code here
return
In [8]:
def max_temp_aggregate_by_fog(df):
#your code here
return
In [ ]:
def avg_weekend_temperature(filename):
mean_temp_weekends =
return mean_temp_weekends
In [ ]:
def avg_min_temperature(filename):
avg_min_temp_rainy =
return avg_min_temp_rainy
Before you make any analysis, it might be useful to look at the data we want to analyse. More specifically, we will evaluate the entries by hour in our data from the NYC Subway to determine the data distribution. This data is stored in the column ['ENTRIESn_hourly']. Draw two histogramns in the same axis, to show the entries when it's raining vs when it's not. Below, you will find an example of how to draw histogramns with Pandas and Matplotlib:
Turnstile_weather ['column_to_graph']. Hist ()
In [ ]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
plt.figure()
turnstile_weather['...'] # your code here to plot a historgram for hourly entries when it is raining
turnstile_weather['...'] # your code here to plot a histogram for hourly entries when it is not raining
return plt
Answer : Replace this text with your answer!
In [ ]:
import numpy as np
import pandas
def means(turnstile_weather):
### YOUR CODE HERE ###
return with_rain_mean, without_rain_mean, p # leave this line for the grader
Answer to the following questions according to your functions' exits:
Answer : Replace this text with your answer!
The entry for this exercise is the same file from the previous session (Exercise 2). You can download the file from this link:
Now, we will create a mapper. For each entry line, the mapper exit must PRINT (not return) UNIT as a key, and the number of ENTRIESn_hourly as the value. Separate the key and the value with a tab. For example: 'R002 \ t105105.0'
Export your mapper into a file named mapper_result.txt and send it with your submission. The code for exporting your mapper is already written in the code bellow.
In [ ]:
import sys
def mapper():
for line in sys.stdin:
# your code here
mapper()
sys.stdin = open('turnstile_data_master_with_weather.csv')
sys.stdout = open('mapper_result.txt', 'w')
Now, create the reducer. Given the mapper result from the previous exercise, the reducer must print (not return) one line per UNIT, with the total number of ENTRIESn_hourly during May (which is our data duration), separated by a tab. An example of exit line from the reducer may look like this: 'R001 \ t500625.0'
You can assume that the entry for the reducer is ordered in a way that all lines corresponding to a particular unit are grouped. However, the reducer exit will have repetition, as there are stores that appear in different files' locations.
Export your reducer into a file named reducer_result.txt and send it with your submission.
In [ ]:
def reducer():
for line in sys.stdin:
# your code here
reducer()