Problem Set 4: Visualizing Subway Data

Quiz 1: Visualization 1

If you are performing manipulations to the dataframe but are getting a SettingWithCopyWarning, try adding the following line to your code to suppress the warning:

pandas.options.mode.chained_assignment = None

or apply the following line to your dataframe (substituting df for your dataframe variable's name):

df.is_copy = False
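
Rather than silencing the warning, it can often be avoided by taking an explicit copy before modifying a filtered dataframe. The following is a minimal sketch of that idea, not part of the course material; the sample values and the names busy and entries_doubled are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature dataframe; the values are invented for illustration.
df = pd.DataFrame({
    'UNIT': ['R001', 'R002', 'R001'],
    'ENTRIESn_hourly': [10.0, 0.0, 30.0],
})

# .copy() makes it explicit that 'busy' is a new dataframe, so assigning a
# column to it does not look like a chained assignment back into df.
busy = df[df['ENTRIESn_hourly'] > 0].copy()
busy['entries_doubled'] = busy['ENTRIESn_hourly'] * 2
```

With the copy in place, df itself is left unchanged and only busy gains the new column.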

Due to a version conflict between the latest pandas and ggplot packages, geom_bar and geom_histogram plots currently fail on this grader. Please try another kind of plot, or work on a local installation. If you attempt to make these plots, you will see the error message:

TypeError: pivot_table() got an unexpected keyword argument 'rows'
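
The message suggests that pivot_table is being called with the older rows keyword, which current pandas versions no longer accept; the row keys are passed as index (and the column keys as columns) instead. A small sketch of the current call, using an invented sample rather than the real subway data:

```python
import pandas as pd

# Hypothetical sample; column names follow the subway data, values are made up.
sample = pd.DataFrame({
    'Hour': [0, 0, 4, 4],
    'rain': [0, 1, 0, 1],
    'ENTRIESn_hourly': [10.0, 20.0, 30.0, 40.0],
})

# 'index' names the row keys and 'columns' the column keys of the pivot.
table = pd.pivot_table(sample, values='ENTRIESn_hourly',
                       index='Hour', columns='rain', aggfunc='sum')
```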

Warning: this grader will only accept ggplot plots!


In [92]:
% ls


1-uIDS-courseNotes/ LICENSE             data/               material/
1-uIDS-quiz/        README.md           image/

In [166]:
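import pandas as pd
from pandas import *
import ggplot as gg
from ggplot import *
# Import datetime explicitly instead of relying on a wildcard import to provide it
from datetime import datetime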

def plot_weather_data(turnstile_weather):
    '''
    You are passed in a dataframe called turnstile_weather. 
    Use turnstile_weather along with ggplot to make a data visualization
    focused on the MTA and weather data we used in assignment #3.  
    You should feel free to implement something that we discussed in class 
    (e.g., scatterplots, line plots, or histograms) or attempt to implement
    something more advanced if you'd like.  

    Here are some suggestions for things to investigate and illustrate:
     * Ridership by time of day or day of week
     * How ridership varies by subway station (UNIT)
     * Which stations have more exits or entries at different times of day
       (You can use UNIT as a proxy for subway station.)

    If you'd like to learn more about ggplot and its capabilities, take
    a look at the documentation at:
    https://pypi.python.org/pypi/ggplot/
     
    You can check out the link
    https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
    to see all the columns and data points included in the turnstile_weather 
    dataframe. 
     
    However, due to the limitations of our Amazon EC2 server, we are giving you a random
    subset, about 1/3 of the actual data, in the turnstile_weather dataframe.
    '''

    # Suppress the SettingWithCopyWarning
    pd.options.mode.chained_assignment = None
    
    # 1. Create desired data
    
    # Create a lambda 'day_of_week' that returns the weekday in '%w' format ('0' for Sunday ... '6' for Saturday).
    # More detail: https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
    # Note: the month directive is lowercase '%m'; uppercase '%M' means minutes.
    day_of_week = lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w')
    
    # Apply it to the 'DATEn' column to get each row's weekday number
    turnstile_weather['day'] = turnstile_weather['DATEn'].apply(day_of_week)

    # Keep only the day and entries columns, group by day, and sum the entries per day.
    turnstile_weather = turnstile_weather[['day','ENTRIESn_hourly']].groupby('day', as_index=False).sum()
    
    # Replace the weekday numbers with names. The groups are sorted '0'..'6',
    # so this assumes all seven weekdays occur in the data.
    turnstile_weather['day'] = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
    # print(turnstile_weather)
    
    # Produce plot
    plot = gg.ggplot(turnstile_weather, gg.aes(x='day', y='ENTRIESn_hourly')) + \
            geom_bar(stat="identity", color='orange', fill='orange') + \
            xlab('Day of week') + ylab('Total entries') + ggtitle('Ridership by day of the week')
    
    return plot

turnstile_weather = pd.read_csv('data/turnstile_data_master_with_weather.csv')
# %matplotlib inline
plot_weather_data(turnstile_weather)


Out[166]:
<ggplot: (301323637)>
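
The data preparation in the cell above can also be checked on its own, without ggplot. The sketch below is one possible pandas-only version under stated assumptions: the sample rows are invented, and entries_by_weekday is a hypothetical helper name, not part of the assignment. It parses DATEn with the lowercase '%m' month directive and fills in weekdays that do not occur in the data with zero.

```python
import pandas as pd

# Hypothetical stand-in for turnstile_weather; the values are made up.
sample = pd.DataFrame({
    'DATEn': ['2011-05-01', '2011-05-02', '2011-05-08', '2011-05-09'],
    'ENTRIESn_hourly': [100.0, 250.0, 50.0, 150.0],
})

DAY_NAMES = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
             'Thursday', 'Friday', 'Saturday']

def entries_by_weekday(df):
    """Sum ENTRIESn_hourly per weekday, ordered Sunday through Saturday."""
    dates = pd.to_datetime(df['DATEn'], format='%Y-%m-%d')
    # '%w' gives '0' for Sunday through '6' for Saturday
    day_number = dates.dt.strftime('%w').astype(int)
    totals = df['ENTRIESn_hourly'].groupby(day_number).sum()
    # Weekdays absent from the data get a total of 0 instead of shifting the labels
    totals = totals.reindex(range(7), fill_value=0.0)
    return pd.DataFrame({'day': DAY_NAMES, 'ENTRIESn_hourly': totals.values})

result = entries_by_weekday(sample)
```

Reindexing over all seven weekday numbers keeps each name aligned with its total even when a weekday is missing from the subset.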

Quiz 2: Make Another Visualization

If you are performing manipulations to the dataframe but are getting a SettingWithCopyWarning, try adding the following line to your code to suppress the warning:

pandas.options.mode.chained_assignment = None

or apply the following line to your dataframe (substituting df for your dataframe variable's name):

df.is_copy = False

Due to a version conflict between the latest pandas and ggplot packages, geom_bar and geom_histogram plots currently fail on this grader. Please try another kind of plot, or work on a local installation. If you attempt to make these plots, you will see the error message:

TypeError: pivot_table() got an unexpected keyword argument 'rows'

Warning: this grader will only accept ggplot plots!


In [159]:
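# The function below uses the pd and gg aliases, so import them in this cell too
import pandas as pd
from pandas import *
import ggplot as gg
from ggplot import *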

def plot_weather_data(turnstile_weather):
    ''' 
    plot_weather_data is passed a dataframe called turnstile_weather. 
    Use turnstile_weather along with ggplot to make another data visualization
    focused on the MTA and weather data we used in Project 3.
    
    Make a type of visualization different than what you did in the previous exercise.
    Try to use the data in a different way (e.g., if you made a lineplot concerning 
    ridership and time of day in exercise #1, maybe look at weather and try to make a 
    histogram in this exercise). Or try to use multiple encodings in your graph if 
    you didn't in the previous exercise.
    
    You should feel free to implement something that we discussed in class 
    (e.g., scatterplots, line plots, or histograms) or attempt to implement
    something more advanced if you'd like.

    Here are some suggestions for things to investigate and illustrate:
     * Ridership by time-of-day or day-of-week
     * How ridership varies by subway station (UNIT)
     * Which stations have more exits or entries at different times of day
       (You can use UNIT as a proxy for subway station.)

    If you'd like to learn more about ggplot and its capabilities, take
    a look at the documentation at:
    https://pypi.python.org/pypi/ggplot/
     
    You can check out the link 
    https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
    to see all the columns and data points included in the turnstile_weather 
    dataframe.
     
    However, due to the limitations of our Amazon EC2 server, we are giving you a random
    subset, about 1/3 of the actual data in the turnstile_weather dataframe.
    '''

    # Suppress the SettingWithCopyWarning
    pd.options.mode.chained_assignment = None
    
    # 1. Create desired data
    
    # Keep only the hour and entries columns, group by hour of day, and sum the entries per hour.
    turnstile_weather = turnstile_weather[['Hour','ENTRIESn_hourly']].groupby('Hour', as_index=False).sum()

    
    # Produce plot
    # NOTE: (0.5) evaluates to the float 0.5, not a tuple; a (lower, upper)
    # pair is needed to set both ends of the y-axis range.
    plot = gg.ggplot(turnstile_weather, gg.aes(x='Hour', y='ENTRIESn_hourly')) + \
            scale_y_continuous(limits=(0.5)) + \
            geom_line(stat="identity", color='orange', fill='orange') + \
            xlab('Hour of day') + ylab('Total entries') + ggtitle('Ridership by hour of day')
    
    return plot

turnstile_weather = pd.read_csv('data/turnstile_data_master_with_weather.csv')
# 1. Data overview
# print(turnstile_weather.describe())
# print(turnstile_weather.head())

# %matplotlib inline
plot_weather_data(turnstile_weather)


Out[159]:
<ggplot: (295165845)>
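
One detail in the cell above is easy to miss: limits=(0.5) passes a single float, because parentheses around one value do not create a tuple. The short sketch below only illustrates that Python rule; the pair (0, 1000000) is an invented example, not a recommended axis range for this data.

```python
# Parentheses alone do not make a tuple; the comma does.
single = (0.5)          # just the float 0.5
pair = (0, 1000000)     # a two-element tuple (hypothetical lower and upper bounds)

def describe(limits):
    """Return the type name of a would-be limits argument."""
    return type(limits).__name__

print(describe(single))
print(describe(pair))
```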