LIS590DV Final Project: Task One.

Group: Whale.

Yingjun Guan, Xiaoliang Jiang, Xinyu Zhang, Jialu Wang.

The first task is based on data from the Champaign-Urbana Mass Transit District (CUMTD). The data and the corresponding documentation are available from the data source (http://developer.cumtd.com/). The data set covers the agency (agency.txt), the regular running schedule (calendar.txt), schedule exceptions (calendar_dates.txt), stops (stops.txt), stop times (stop_times.txt), routes of all traffic (routes.txt), the shapes of the routes, recorded as sequences of points along each path rather than as stops (shapes.txt), the daily trip schedule (trips.txt), and fare information (fare_rules.txt and fare_attributes.txt).

We approached task one in two parts: categorical and geographical. Based on a basic evaluation of each table, we created charts of traffic running days, stops per trip, number of trips per stop, and stops per location. We also plotted the physical path that every vehicle takes, made heat maps to analyze stop density, and tested accessibility and walkability with buffer areas of different sizes. These charts helped us answer twelve questions.

Strengths:

1. The first pie chart effectively represents four categories of bus running days. The 105-day category covers buses that run five days per week; most of them run Monday to Friday, and the others run Sunday to Thursday. This information could help students plan their schedules.

2. To show the upper end of the ‘distribution of trips per stop’, we added a bar chart of the 20 stops with the most trips passing through them.

3. The ‘physical path for every vehicle’ plot shows each route as a path, which is different from a plot of stops.

4. We drew walkable areas to describe the walkability of each stop. The walkable buffer areas range from one minute to ten minutes to reflect passengers’ different demands. The purpose of these areas is to test the accessibility of the CUMTD bus system, check how much area it covers, and identify any blind spots. Because we set an alpha value for the buffer areas, these graphs also show stop density.

5. Moreover, we overlaid the physical paths on the walkable areas to help users identify each route.

6. We made heat maps to represent stop density, which also corroborated the ‘walkable area & MTD routes’ graph.

7. Thanks to Xiaoliang, who added a map to the stop distribution graph, we can see stop names and coordinates by hovering, and road names by zooming.

8. We computed the velocity of each trip, which can also show how bus speed varies with time of day, region, and route.

Weaknesses:

1. We divided the bus running schedules into four categories and tried many color schemes for them. However, we still believe there is a better way to represent this information.

2. For ‘stops per location’, ‘stops per trip’, and ‘distribution of trips per stop’, each graph has too many elements on its x-axis to label every one by name, so we had to use numbers on the x-axis instead.

3. In ‘stops per trip’, there are many trips on each route. It would be better to group trips by route (trips on the same route in the same color).

4. In ‘distribution of trips per stop’, a logarithmic y-axis would give a better view.

5. There should be a better way to present the y-axis of ‘20 stops with most trips traveling by’.

6. The ‘physical path for every vehicle’ plot would be better if it were drawn on a map.

7. The walkable area graphs would be better if they were combined with maps.

Wish to do:

1. For ‘20 stops with most trips traveling by’, we think it would be better to show the stops on a map.

2. Draw the ‘physical path for every vehicle’ on a map.

3. Because buses run differently on weekdays and weekends, and in the daytime and at night, we want to add a drop-down menu to ‘physical path for every vehicle’ to show the buses at different times.

4. We want the plotting area of the heat map to be a perfect square.

5. We want to add a map to the walkable area graphs.

6. Put daytime trip speed and nighttime trip speed on separate graphs and compare them.

1. What is the CUMTD running schedule?

Exploring the data, the first thing we can check is the running schedule. All traffic runs from Dec. 18, 2016 to May 13, 2017, 147 days in all. Of the 289 different trips, 43 run for 105 days, 23 run for 89 days, 12 run for 80 days, 2 run for 77 days, 17 for 70 days, 12 for 64 days, 1 for 61 days, 15 for 20-50 days, 124 for 10-20 days, and 40 for fewer than 10 days.


In [1]:
#1. Yingjun Guan
from pylab import *

# make a square figure and axes
figure(1, figsize=(6,6))
ax = axes([0.1, 0.1, 0.8, 0.8])

# The slices will be ordered and plotted counter-clockwise.
labels = '105 days', '51-100 days', '10-50 days', '<10 days'
fracs = [43/289, 67/289, 139/289, 40/289]
explode=(0.05, 0, 0, 0)

pie(fracs, explode=explode, labels=labels,
                autopct='%1.1f%%', shadow=True, startangle=90)
                # The default startangle is 0, which would start
                # the first slice on the x-axis.  With startangle=90,
                # everything is rotated counter-clockwise by 90 degrees,
                # so the plotting starts on the positive y-axis.

title('Pie chart for traffic running days (out of 147)', bbox={'facecolor':'0.8', 'pad':5})

show()


From the figure, it can be seen that 14.9% of the buses run on 105 of the 147 days, that is, every weekday (Monday to Friday of each week), and 13.8% of the buses run on fewer than 10 days, only on special days or for special uses.
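The day counts and percentages quoted above can be cross-checked with a short standard-library calculation. This is a minimal sketch that assumes the schedule spans exactly the two dates given in the text; the variable names are illustrative and do not come from the notebook.

```python
from datetime import date, timedelta

# Schedule boundaries as stated in the text (both days inclusive).
start = date(2016, 12, 18)
end = date(2017, 5, 13)

total_days = (end - start).days + 1

# Monday..Friday have weekday() values 0..4.
weekdays = sum(
    1
    for offset in range(total_days)
    if (start + timedelta(days=offset)).weekday() < 5
)

# Shares shown in the pie chart (counts taken from the text, total 289).
share_105_days = round(100 * 43 / 289, 1)
share_under_10_days = round(100 * 40 / 289, 1)

print(total_days, weekdays, share_105_days, share_under_10_days)
```

The span is exactly 21 weeks, which is why the 105-day category matches a five-day-per-week service.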


In [2]:
# 2. Jialu Wang
#enable plotting
%matplotlib inline

#import packages
import matplotlib.pyplot as plt
import numpy as np
import csv
import collections
from collections import Counter

#set graph size
plt.rcParams["figure.figsize"] = (20,10)

In [6]:
#read the files
def read_columns(filename):
    """Read a CSV file into a dict that maps each column name to a list of string values."""
    with open(filename, "r") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = {column: [] for column in header}
        for row in reader:
            for column, value in zip(header, row):
                columns[column].append(value)
    return columns

data = read_columns("stop_times.txt")
data1 = read_columns("stops.txt")
data2 = read_columns("routes - routes.csv.csv")

i. agency.txt File

An Agency is an operator of a public transit network, often a public authority.

The file contains a single record, which gives the full name, URLs, phone number, and language indicator of CUMTD, the subject of our study.

ii. stops.txt File

A stop is a location where vehicles stop to pick up or drop off passengers. Stops are defined in the file stops.txt.

Stops can be grouped together, such as when there are multiple stops within a single station. There are 2,496 different stops at 1,353 different stations (Figure 1): 448 stations with a single stop, 746 stations with two stops, 97 with three stops, 46 with four stops, 15 with five stops, and one station with six stops. The mean number of stops per station is 1.84 and the median is 2.

The stations with the most stops, listed as (station, number of stops), are: ('MTD7311', 6), ('MTD2554', 5), ('MTD1333', 5), ('MTD2353', 5), ('MTD7267', 5), ('MTD5671', 5), ('MTD2643', 5), ('MTD3014', 5), ('MTD3451', 5), ('MTD7036', 5), ('MTD3250', 5), ('MTD3254', 5), ('MTD3747', 5), ('MTD6052', 5), ('MTD3562', 5), ('MTD4573', 5).
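As a sanity check, the totals, mean, and median can be recomputed from the stops-per-station breakdown given above. The sketch below uses only the counts quoted in the text; the dictionary name is illustrative.

```python
from statistics import median

# Number of stations that have k stops, as quoted in the text.
stations_by_stop_count = {1: 448, 2: 746, 3: 97, 4: 46, 5: 15, 6: 1}

n_stations = sum(stations_by_stop_count.values())
n_stops = sum(k * n for k, n in stations_by_stop_count.items())
mean_stops = n_stops / n_stations

# Expand the breakdown into one value per station to take the median.
per_station = [k for k, n in stations_by_stop_count.items() for _ in range(n)]
median_stops = median(per_station)

print(n_stations, n_stops, round(mean_stops, 2), median_stops)
```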

2. How many stops are at each station?


In [7]:
# count how many stops share each stop_code
location_count = Counter(data1['stop_code'])

sorted_e=sorted(location_count.items(), key=lambda item: item[1])
x_val = [x[0] for x in sorted_e]
y_val = [x[1] for x in sorted_e]
x_pos = np.arange(len(x_val)) 
plt.bar(x_pos,y_val,align='center', width=0.5, color='c')
plt.ylabel('Amount of stops',fontsize=15)
plt.title('Stops per location',fontsize=15)
plt.show()


Stops may also have zone identifiers, to group them together into zones. This can be used together with Fare Attributes and Fare Rules for zone-based ticketing. However, as there are only two fare categories, we skip this step.

iii. stop_times.txt File

A StopTime defines when a vehicle arrives at a location, how long it stays there, and when it departs. StopTimes define the path and schedule of Trips. Each stop-time record carries a trip_id that references trips.txt. There are 242,858 stop-time records. The attributes arrival_time and departure_time specify when a given trip on a route arrives at and departs from a given stop. However, the arrival and departure times are identical in all but 642 records (0.26%).
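The comparison of arrival and departure times can be sketched as follows. The sample rows are made up for illustration and are not taken from the CUMTD feed; only the column names follow the GTFS stop_times.txt layout.

```python
import csv
import io

# Hypothetical sample in the stop_times.txt layout (not real CUMTD data).
sample = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:00,A,1
T1,08:05:00,08:07:00,B,2
T1,08:12:00,08:12:00,C,3
T2,09:00:00,09:00:00,A,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Records where the vehicle dwells, i.e. departs later than it arrives.
dwell_rows = [r for r in rows if r["arrival_time"] != r["departure_time"]]
dwell_share = len(dwell_rows) / len(rows)

# The share reported in the text, recomputed from its two counts.
reported_share_percent = round(100 * 642 / 242858, 2)

print(len(dwell_rows), dwell_share, reported_share_percent)
```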

a. For the 5,498 different trips, we study the number of stops per trip (Figure 2). The distribution is strongly right-skewed and roughly exponential in shape, with a minimum of 2 (214 trips), a maximum of 147 (2 trips), a mean of 44.17, and a median of 30.

3. How many stops are on each trip?


In [8]:
#amount of stops per trip
# each row of stop_times.txt is one stop on one trip, so counting trip_id gives stops per trip
trip_count = Counter(data['trip_id'])

sorted_a=sorted(trip_count.items(), key=lambda item: item[1])
x_val = np.arange(len(sorted_a))
y_val = [x[1] for x in sorted_a]
plt.bar(x_val,y_val,align='center', width=0.6, color='r')
plt.ylabel('Amount of stops',fontsize=15)
plt.xlabel('Trips', fontsize=15)
plt.title('Distribution of amount of stops per trip',fontsize=15)
plt.show()



In [5]:
len(sorted_a)


Out[5]:
5498

b. We also study how many trip stops occur at a single stop in a single day (Figure 3). Across the 2,496 different stops (as in stops.txt), the minimum number of trip stops at a single stop per day is 1 (55 stops), the maximum is 1,732 (1 stop), the mean is 97.30, and the median is 42.
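Both means reported in this section follow directly from the record count, because every row of stop_times.txt belongs to exactly one trip and one stop. The check below is a sketch that uses only the totals quoted in the text and assumes every stop in stops.txt appears in stop_times.txt at least once.

```python
# Totals quoted in the text.
n_stop_time_records = 242_858
n_trips = 5_498
n_stops = 2_496

# Mean of a per-group count equals total rows divided by number of groups.
mean_stops_per_trip = n_stop_time_records / n_trips
mean_trips_per_stop = n_stop_time_records / n_stops

print(round(mean_stops_per_trip, 2), round(mean_trips_per_stop, 2))
```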

3. How many trips serve each stop?


In [9]:
#amount of trips per stop
# each row of stop_times.txt is one trip visiting one stop, so counting stop_id gives trips per stop
stop_count = Counter(data['stop_id'])

sorted_b=sorted(stop_count.items(), key=lambda item: item[1])
x_val = np.arange(len(sorted_b))
y_val = [x[1] for x in sorted_b]
plt.bar(x_val,y_val,align='center', width=0.6, color='g')
plt.ylabel('Amount of trips',fontsize=15)
plt.xlabel('Stops', fontsize=15)
plt.title('Distribution of amount of trips per stop',fontsize=15)
plt.show()



In [7]:
len(sorted_b)


Out[7]:
2496

For a better understanding of the statistics, we highlight the 20 stops with the most trip stops every day. Listed as (stop_id, number of trip stops), they are: ('ODSS:1', 877), ('GWNNV:4', 908), ('CHEMLS:1', 910), ('WLNTLGN:1', 911), ('WLNTUNI:2', 913), ('LNCLNKLRNY:1', 954), ('GDWNCB:1', 977), ('GDWNMRL:2', 985), ('PAMD:2', 1091), ('PLAZA:3', 1174), ('WRTCHAL:4', 1247), ('LSE:8', 1286), ('PLAZA:4', 1292), ('IT:5', 1317), ('ARYWRT:3', 1451), ('GRNMAT:1', 1617), ('IU:2', 1629), ('GRNMAT:3', 1660), ('IU:1', 1674), ('PAR:2', 1732) (Figure 4). These 20 stops are also highlighted in the map of stops.

4. What are the top 20 stops with the most trip stops?


In [106]:
c=Counter(stop_count).most_common(20)
c.sort(key=lambda x: x[1]) 
x_val = list(zip(*c))[0]
y_val = list(zip(*c))[1]
x_pos = np.arange(len(x_val)) 
plt.bar(x_pos, y_val,align='center',width=0.6, color='#9BD3F0')
plt.xticks(x_pos, x_val,fontsize=8) 
plt.ylabel('Amount of trips',fontsize=15)
plt.title('20 stops with most trips traveling by',fontsize=15)
plt.show()


iv. routes.txt File

GTFS Routes are equivalent to "Lines" in public transportation systems. Routes are defined in the file routes.txt, and are made up of one or more Trips. The difference between routes and trips is that a Trip occurs at a specific time while a Route is time-independent.

5. How many buses run on each route?


In [11]:
import plotly
import plotly.plotly as py
from plotly.graph_objs import *
import pandas as pd
import math
from IPython.display import Image
import time

plotly.tools.set_credentials_file(username='xjiang36', api_key='<YOUR_PLOTLY_API_KEY>')  # API key redacted; supply your own

In [14]:
dfroutes = pd.read_csv("routes.txt",encoding='iso-8859-1')
dftrips = pd.read_csv("trips.txt",encoding='iso-8859-1')
routeclean=dftrips["route_id"].value_counts().reset_index().rename(columns={'index': 'x'})
def Nameclean(dataset,a):
    # collapse every route_id that contains a color word (e.g. "TEAL SATURDAY") to that color word
    wordlist=["SILVER","ILLINI","TEAL","YELLOW","GREEN","BROWN","GREY","GOLD","LIME","BLUE","RED","BRONZE","ORANGE","LAVENDER","RUBY"]
    for j in range(len(wordlist)):
        for i in range(len(dataset)):
            if dataset[a][i].find(wordlist[j])>=0:
                # .loc assigns in place and avoids pandas' SettingWithCopyWarning
                dataset.loc[i,a]=wordlist[j]
Nameclean(routeclean,"x")
sumroute=routeclean[:18]
cleanedroute=routeclean["x"].value_counts().reset_index().rename(columns={'index': 'name'})
# sum the trip counts of all route_ids that share a color
for j in range(len(cleanedroute["name"])):
    rsum=0
    for i in range(len(routeclean)):
        if routeclean["x"][i]==cleanedroute["name"][j]:
            rsum+=routeclean["route_id"][i]
    cleanedroute.loc[j,"x"]=rsum
colorbar0=[]
Nameclean(dfroutes,"route_id")
for i in range(len(cleanedroute['name'])):
    for j in range(len(dfroutes['route_id'])):
        if cleanedroute['name'][i]==dfroutes['route_id'][j]:
            colorbar0.append("#%s"%dfroutes['route_color'][j])
            break
            
import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Bar(
    x=cleanedroute["name"],
    y=cleanedroute["x"],
    marker=dict(
        #color=['#66FF66','#FFFF66','#E0E0E0','','#666600','#A0A0A0','#FF6666','#B266FF','#CCCC00','#663300','#FFFF99','#FF9933','#FF0000','#66FFFF','#0000FF','#FF66B2','#000066','#330000']),
        color=colorbar0),
)

data = [trace0]
layout = go.Layout(
    title='Buses on each route',
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='color-bar')
#py.image.save_as(fig,'Whale-plot.png')


Out[14]:

6. How many routes does each color have?


In [15]:
dfroutes = pd.read_csv("routes.txt",encoding='iso-8859-1')
dftrips = pd.read_csv("trips.txt",encoding='iso-8859-1')
routeclean=dftrips["route_id"].value_counts().reset_index().rename(columns={'index': 'x'})
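def Nameclean(dataset,a):
    # collapse every route_id that contains a color word (e.g. "TEAL SATURDAY") to that color word
    wordlist=["SILVER","ILLINI","TEAL","YELLOW","GREEN","BROWN","GREY","GOLD","LIME","BLUE","RED","BRONZE","ORANGE","LAVENDER","RUBY"]
    for j in range(len(wordlist)):
        for i in range(len(dataset)):
            if dataset[a][i].find(wordlist[j])>=0:
                # .loc assigns in place and avoids pandas' SettingWithCopyWarning
                dataset.loc[i,a]=wordlist[j]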
Nameclean(routeclean,"x")
sumroute=routeclean[:18]
cleanedroute=routeclean["x"].value_counts().reset_index().rename(columns={'index': 'name'})
trace0 = go.Bar(
    x=cleanedroute["name"],
    y=cleanedroute["x"],
    marker=dict(
        #color=['#66FF66','#FFFF66','#E0E0E0','','#666600','#A0A0A0','#FF6666','#B266FF','#CCCC00','#663300','#FFFF99','#FF9933','#FF0000','#66FFFF','#0000FF','#FF66B2','#000066','#330000']),
        color=colorbar0),
)

data = [trace0]
layout = go.Layout(
    title='Distribution of each route by color',
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='color-bar')
#py.image.save_as(fig,'Whale-plot2.png')


Out[15]:

Here we find that a single route_color can cover several route_ids. For instance, the route TEAL has the route_ids TEAL, TEAL LATE NIGHT SUNDAY, TEAL SATURDAY, TEAL LATE NIGHT SATURDAY, TEAL LATE NIGHT, TEAL EVENING SATURDAY, TEAL EVENING, and TEAL SUNDAY. These are essentially the same route: TEAL is the daily service, and the others run on special dates or at special times. We study how many different routes each color has (Figure 5). There are 100 routes in 17 colors; the color with the fewest routes, ‘Navy’, has only one, while the color with the most, ‘Green’, has 18. The mean number of routes per color is 5.88 and the median is 4.
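The grouping of route_ids under a color can be illustrated without pandas. The sketch below is not the notebook's Nameclean function; it is a simplified stand-in with a hypothetical helper name (base_color) and a shortened word list. The TEAL identifiers come from the text, while GREEN and YELLOW are added only to make the example less trivial.

```python
from collections import defaultdict

# Shortened, illustrative list of color words.
COLOR_WORDS = ["TEAL", "GREEN", "YELLOW"]

def base_color(route_id, color_words=COLOR_WORDS):
    """Return the first color word contained in route_id, or route_id itself."""
    for word in color_words:
        if word in route_id:
            return word
    return route_id

route_ids = [
    "TEAL", "TEAL SATURDAY", "TEAL SUNDAY", "TEAL EVENING",
    "TEAL LATE NIGHT", "GREEN", "YELLOW",
]

routes_by_color = defaultdict(list)
for route_id in route_ids:
    routes_by_color[base_color(route_id)].append(route_id)

routes_per_color = {color: len(ids) for color, ids in routes_by_color.items()}
print(routes_per_color)
```

Substring matching is simple, but it would misclassify a route_id that happens to contain a color word inside another word, so a real feed may need stricter matching.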


In [11]:
len(x_val)


Out[11]:
1353

7. What does the physical path of every vehicle look like?

The data are the latitudes and longitudes of points from shapes.txt, grouped by shape_id. The paths cover Champaign and Urbana and are denser in the campus district.
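Grouping shape points into one polyline per shape_id can be sketched with the standard library alone. The rows below are invented for illustration; only the column names follow the GTFS shapes.txt layout. Unlike plotting rows in file order, this sketch sorts each shape by shape_pt_sequence first, which guards against files whose rows are not already ordered.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shapes.txt layout (not real CUMTD data).
sample = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
S1,40.10,-88.24,2
S1,40.09,-88.25,1
S2,40.11,-88.20,1
S1,40.12,-88.23,3
"""

points = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    points[row["shape_id"]].append(
        (int(row["shape_pt_sequence"]),
         float(row["shape_pt_lon"]),
         float(row["shape_pt_lat"]))
    )

# One ordered list of (lon, lat) pairs per shape, ready to be drawn as a line.
polylines = {
    shape_id: [(lon, lat) for _, lon, lat in sorted(pts)]
    for shape_id, pts in points.items()
}
print(polylines["S1"])
```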


In [68]:
# 3. Xinyu Zhang
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import os
import csv

In [69]:
df = pd.read_csv("/Users/celine/Desktop/5DataVisual/google_transit/shapes.txt")
df2=df.groupby('shape_id')

In [76]:
# draw every shape as a faint line, colored by its position in the jet colormap
plt.rcParams["figure.figsize"] = (20, 20)
n_shapes = df2.ngroups  # number of distinct shape_id values
for s, (name, group) in enumerate(df2, start=1):
    plt.plot(group['shape_pt_lon'],group['shape_pt_lat'], color=plt.cm.jet(s/n_shapes), alpha = 0.1, linewidth = 1.5)
plt.show()



In [77]:
# 4. Xiaoliang Jiang
fn = "stops.txt"
with open(fn, "r") as f:
    reader = csv.reader(f)
    header = next(reader)
    data = {}
    for column in header:
        data[column] = []
    for row in reader:
        for column, value in zip(header, row):
            data[column].append(value)

In [78]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
    
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def size(self):
        for key in self.data:
            return self.data[key].size

    def split(self, column):
        new_datasets = {}
        for split_value in np.unique(self.data[column]):
            new_datasets[split_value] = self.filter_eq(column, split_value)
        return new_datasets

    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    
    def compare(self, other):
        stats1 = self.stats()
        stats2 = other.stats()
        for column in self.columns():
            if column not in stats1: continue
            print("Column '{0:25s}'".format(column))
            for s1, s2 in zip(stats1[column], stats2[column]):
                print("    {0} vs {1}".format(s1, s2))
    
    def plot(self, x_column, y_column):
        plt.plot(self.data[x_column], self.data[y_column], '.')

In [79]:
header


Out[79]:
['stop_id',
 'stop_code',
 'stop_name',
 'stop_desc',
 'stop_lat',
 'stop_lon',
 'zone_id',
 'stop_url',
 'location_type',
 'parent_station']

8. How walkable are the CUMTD bus stops?


In [80]:
stopsdata= Dataset(data)
value_types = {'stop_id': 'str',
               'stop_code': 'str',
               'stop_name':'str',
               'stop_desc':'str',
               'stop_lat':'float',
               'stop_lon':'float',
               'zone_id':'float',
               'stop_url':'str',
               'location_type':'str',
               'parent_station':'str'}
for v in stopsdata.columns():
    stopsdata.convert(v, value_types.get(v, "str"))

In [81]:
plt.subplot(221)
plt.rcParams["figure.figsize"] = (20, 20)
plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 1 minute for each stop", fontsize=15)
plt.plot(data["stop_lon"],data["stop_lat"],c='#00ff80',marker='o',markersize=7,mec='none',ls='',alpha=0.05)
plt.subplot(222)
plt.rcParams["figure.figsize"] = (20, 20)
plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 2 minutes for each stop", fontsize=15)
plt.plot(data["stop_lon"],data["stop_lat"],c='#80ff00',marker='o',markersize=15,mec='none',ls='',alpha=0.05)
plt.subplot(223)
plt.rcParams["figure.figsize"] = (20, 20)
plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 5 minutes for each stop", fontsize=15)
plt.plot(data["stop_lon"],data["stop_lat"],c='#ffff00',marker='o',markersize=32,mec='none',ls='',alpha=0.05)
plt.subplot(224)
plt.rcParams["figure.figsize"] = (20, 20)
plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 10 minutes for each stop", fontsize=15)
plt.plot(data["stop_lon"],data["stop_lat"],c='#ff0000',marker='o',markersize=65,mec='none',ls='',alpha=0.05)


Out[81]:
[<matplotlib.lines.Line2D at 0x10e6ff2e8>]

Explanation

According to Google Maps, 0.02 degrees of longitude at latitude 40.06°N equals about 1.1 miles. Each circle here therefore represents an area with a radius of r = 0.275 miles, that is, an area that takes only about 5 minutes to walk.
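The conversion between degrees of longitude and walking time can be approximated with a spherical-Earth formula. This is a rough sketch, not the measurement used in the notebook: the Earth radius is a standard mean value, the function name is hypothetical, and the walking pace of 80 m/min is the figure that appears in the scale notes further down.

```python
import math

EARTH_RADIUS_KM = 6371.0       # mean Earth radius (spherical approximation)
KM_PER_MILE = 1.609344
WALK_SPEED_M_PER_MIN = 80.0    # assumed walking pace

def lon_degrees_to_miles(delta_lon_deg, lat_deg):
    """Length in miles of delta_lon_deg degrees of longitude at latitude lat_deg."""
    km_per_degree = math.radians(1.0) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))
    return delta_lon_deg * km_per_degree / KM_PER_MILE

width_miles = lon_degrees_to_miles(0.02, 40.06)

# Time needed to walk the buffer radius quoted in the text.
radius_miles = 0.275
walk_minutes = radius_miles * KM_PER_MILE * 1000 / WALK_SPEED_M_PER_MIN

print(round(width_miles, 2), round(walk_minutes, 1))
```

Under these assumptions the width comes out slightly below the 1.1 miles read off the map, and the radius corresponds to a walk of between five and six minutes.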


In [83]:
plt.rcParams["figure.figsize"] = (20, 15)
stats=stopsdata.stats()
lon_min=stats["stop_lon"][0]
lon_max=stats["stop_lon"][1]
lat_min=stats["stop_lat"][0]
lat_max=stats["stop_lat"][1]
num_bins=16
lon=np.mgrid[lon_min:lon_max:(num_bins+1)*1j]
lat=np.mgrid[lat_min:lat_max:(num_bins+1)*1j]
tree_count=np.zeros((num_bins,num_bins))
for i in range(num_bins):
    left_lat=lat[i]
    right_lat=lat[i+1]
    filter_lat_left=stopsdata.filter_gt("stop_lat",left_lat)
    filter_lat_right=filter_lat_left.filter_lt("stop_lat",right_lat)
    for j in range(num_bins):
        left_lon=lon[j]
        right_lon=lon[j+1]
        filter_lon_left=filter_lat_right.filter_gt("stop_lon",left_lon)
        filter_lon_right=filter_lon_left.filter_lt("stop_lon",right_lon)
        tree_count[i,j] +=filter_lon_right.size()
#plt.xlim(lon_min,lon_max)
#plt.ylim(lat_min,lat_max)
plt.subplot(221)
plt.imshow(tree_count, extent=(lon_min,lon_max,lat_min,lat_max),origin="lower",cmap =plt.cm.gray_r,interpolation='none')
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("The distribution of stops", fontsize=25)
color_bar=plt.colorbar()
color_bar.set_label("Count")
plt.subplot(222)
plt.imshow(tree_count, extent=(lon_min,lon_max,lat_min,lat_max),origin="lower",cmap =plt.cm.Blues,interpolation='none')
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("The distribution of stops", fontsize=25)
color_bar=plt.colorbar()
color_bar.set_label("Count")
plt.subplot(223)
plt.imshow(tree_count, extent=(lon_min,lon_max,lat_min,lat_max),origin="lower", cmap = plt.cm.afmhot,interpolation='none')
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("The distribution of stops", fontsize=25)
color_bar=plt.colorbar()
color_bar.set_label("Count")

plt.subplot(224)
plt.imshow(tree_count, extent=(lon_min,lon_max,lat_min,lat_max),origin="lower", cmap = plt.cm.BuGn,interpolation='none')
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("The distribution of stops", fontsize=25)
color_bar=plt.colorbar()
color_bar.set_label("Count")


9. What is the density of bus stops? (See the graph above.)
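The heat map rests on counting stops per grid cell. The cell above does this with strict greater-than and less-than filters, which appears to leave out any stop lying exactly on a bin edge, including the stops that define the minimum and maximum coordinates. A minimal alternative sketch in plain Python, with hypothetical function names and made-up coordinates, computes each stop's bin index directly and clamps the upper edge into the last bin:

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin containing value; the top edge goes in the last bin."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def grid_counts(lons, lats, n_bins):
    """Return an n_bins x n_bins grid of counts, indexed as counts[lat_bin][lon_bin]."""
    lon_lo, lon_hi = min(lons), max(lons)
    lat_lo, lat_hi = min(lats), max(lats)
    counts = [[0] * n_bins for _ in range(n_bins)]
    for lon, lat in zip(lons, lats):
        row = bin_index(lat, lat_lo, lat_hi, n_bins)
        col = bin_index(lon, lon_lo, lon_hi, n_bins)
        counts[row][col] += 1
    return counts

# Made-up coordinates, chosen so that two points sit on the outer edges.
lons = [-88.30, -88.28, -88.20, -88.20]
lats = [40.05, 40.06, 40.15, 40.15]

counts = grid_counts(lons, lats, n_bins=2)
total = sum(sum(row) for row in counts)
print(counts, total)
```

Every input point lands in exactly one cell, so the grid total always equals the number of stops.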


In [84]:
import plotly
import plotly.plotly as py
from plotly.graph_objs import *
import pandas as pd
import math
from IPython.display import Image
import time
plotly.tools.set_credentials_file(username='xjiang36', api_key='<YOUR_PLOTLY_API_KEY>')  # API key redacted; supply your own

In [87]:
dftrips = pd.read_csv("trips.txt",encoding='iso-8859-1')
dfshapes = pd.read_csv("/Users/celine/Desktop/5DataVisual/google_transit/shapes.txt",encoding='iso-8859-1')

In [88]:
dfroutes = pd.read_csv("routes.txt",encoding='iso-8859-1')
dftrips = pd.read_csv("trips.txt",encoding='iso-8859-1')
routeclean=dftrips["route_id"].value_counts().reset_index().rename(columns={'index': 'x'})
def Nameclean(dataset,a):
    # collapse every route_id that contains a color word (e.g. "TEAL SATURDAY") to that color word
    wordlist=["SILVER","ILLINI","TEAL","YELLOW","GREEN","BROWN","GREY","GOLD","LIME","BLUE","RED","BRONZE","ORANGE","LAVENDER","RUBY"]
    for j in range(len(wordlist)):
        for i in range(len(dataset)):
            if dataset[a][i].find(wordlist[j])>=0:
                # .loc assigns in place and avoids pandas' SettingWithCopyWarning
                dataset.loc[i,a]=wordlist[j]
Nameclean(routeclean,"x")
sumroute=routeclean[:18]
cleanedroute=routeclean["x"].value_counts().reset_index().rename(columns={'index': 'name'})
# sum the trip counts of all route_ids that share a color
for j in range(len(cleanedroute["name"])):
    rsum=0
    for i in range(len(routeclean)):
        if routeclean["x"][i]==cleanedroute["name"][j]:
            rsum+=routeclean["route_id"][i]
    cleanedroute.loc[j,"x"]=rsum
colorbar0=[]
Nameclean(dfroutes,"route_id")
for i in range(len(cleanedroute['name'])):
    for j in range(len(dfroutes['route_id'])):
        if cleanedroute['name'][i]==dfroutes['route_id'][j]:
            colorbar0.append("#%s"%dfroutes['route_color'][j])
            break


10. Does the number of buses on each route correlate with stop density?


In [105]:
#Xiaoliang Jiang
colors=['#008063', '#fcee1f', '#d1d3d4', '#5a1d5a', '#808285', '#006991', '#a78bc0', '#eb008b', '#b2d235', '#823822', '#c7994a', '#f99f2a', '#9e8966', '#ed1c24', '#355caa', '#2b3088', '#000000', '#ffbfff']
names=['GREEN','YELLOW','SILVER','ILLINI','GREY','TEAL','LAVENDER','RUBY','LIME','BROWN','GOLD','ORANGE','BRONZE','RED','BLUE','NAVY','RAVEN','PINK']
plt.rcParams["figure.figsize"] = (10, 5)
stats=stopsdata.stats()
lon_min=stats["stop_lon"][0]
lon_max=stats["stop_lon"][1]
lat_min=stats["stop_lat"][0]
lat_max=stats["stop_lat"][1]
num_bins=16
lon=np.mgrid[lon_min:lon_max:(num_bins+1)*1j]
lat=np.mgrid[lat_min:lat_max:(num_bins+1)*1j]
tree_count=np.zeros((num_bins,num_bins))
for i in range(num_bins):
    left_lat=lat[i]
    right_lat=lat[i+1]
    filter_lat_left=stopsdata.filter_gt("stop_lat",left_lat)
    filter_lat_right=filter_lat_left.filter_lt("stop_lat",right_lat)
    for j in range(num_bins):
        left_lon=lon[j]
        right_lon=lon[j+1]
        filter_lon_left=filter_lat_right.filter_gt("stop_lon",left_lon)
        filter_lon_right=filter_lon_left.filter_lt("stop_lon",right_lon)
        tree_count[i,j] +=filter_lon_right.size()
#plt.xlim(lon_min,lon_max)
#plt.ylim(lat_min,lat_max)

plt.imshow(tree_count, extent=(lon_min,lon_max,lat_min,lat_max),origin="lower",interpolation='none',cmap =plt.cm.gray_r)
plt.xlabel("Longitude",fontsize=10)
plt.ylabel("Latitude",fontsize=10)
plt.title("The distribution of stops & amount of bus on each route", fontsize=15)
for name in cleanedroute["name"]:
    tempshapeID=dftrips[dftrips["route_id"]==name]["shape_id"]
    count=0
    for i in tempshapeID:
        tempshapeIDvalue=i
        count+=1
        if count>20:
            break
        subrows=dfshapes[dfshapes["shape_id"]==tempshapeIDvalue]
        plt.plot(subrows["shape_pt_lon"],subrows["shape_pt_lat"],c=colors[names.index(name)],linewidth=cleanedroute['x'][cleanedroute['name']==name]/200,mec='none',ls='-')#,alpha=0.05)
        
plt.xlim([lon_min,lon_max])
plt.ylim([lat_min,lat_max])
color_bar=plt.colorbar()
color_bar.set_label("Count")
plt.plot()


Out[105]:
[]

This graph combines the distribution of stops (the heat map represents density) with the number of buses on each route (line width represents the number of buses). As we can see, the darkest block and the area with the most buses do not exactly overlap.


In [48]:
plt.rcParams["figure.figsize"] = (20, 20)

plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 10 minutes for each stop", fontsize=25)
plt.plot(data["stop_lon"],data["stop_lat"],c='#ffcccc',marker='o',markersize=169,mec='none',ls='') #15min

plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 5 minutes for each stop", fontsize=25)
plt.plot(data["stop_lon"],data["stop_lat"],c='#ffe5cc',marker='o',markersize=56,mec='none',ls='') #10min

plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 2 minutes for each stop", fontsize=25)
plt.plot(data["stop_lon"],data["stop_lat"],c='#ffffcc',marker='o',markersize=28,mec='none',ls='') #5min

plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas in 1/2/5/10 minutes for each stop", fontsize=25)
plt.plot(data["stop_lon"],data["stop_lat"],c='#e5ffcc',marker='o',markersize=11.2,mec='none',ls='')#2min

mycolor=plt.cm.jet
color_id=np.linspace(0,1,677)
s=0
for name, group in df2:
    s=s+1 
#     print(name)
    #group.plot('shape_pt_lat','shape_pt_lon')
    plt.plot(group['shape_pt_lon'],group['shape_pt_lat'], color=plt.cm.jet(s/677), alpha = 0.2, linewidth = 2.5)
plt.show()


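Marker size scale: a marker of size 300 spans 2.65 miles, so r = 1.325 miles = 2.1324 km. At a walking pace of 80 m/min, size 300 corresponds to 26.66 min, so size 11.2528 corresponds to 1 min.

Plot extent: longitude -88.35 to -88.15 spans 10.58 miles = 17.03 km; 1 degree of latitude = 111 km.

0.12 degrees of latitude = 13.32 km, and 17.03 km = 1.2785 × 13.32 km, so the latitude span is scaled to 0.12 × 1.2785 = 0.15342 degrees, giving a latitude range of 40.10 ± 0.07671, that is, 40.02329 to 40.17671.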
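The arithmetic in these notes can be reproduced step by step. The sketch below starts from the measured values in the notes (size 300 spanning 2.65 miles, 17.03 km of longitude extent, 13.32 km per 0.12 degrees of latitude) and the assumed walking pace of 80 m/min; the variable names are illustrative. Small differences from the notes come from rounding intermediate values.

```python
KM_PER_MILE = 1.609344
WALK_SPEED_M_PER_MIN = 80.0    # walking pace assumed in the notes

# Marker scale: size 300 spans 2.65 miles, so the radius is half of that.
radius_m = (2.65 / 2) * KM_PER_MILE * 1000
minutes_for_size_300 = radius_m / WALK_SPEED_M_PER_MIN
size_per_minute = 300 / minutes_for_size_300

# Axis limits: stretch the latitude span so both axes cover the same distance.
lon_extent_km = 17.03          # longitude -88.35 to -88.15
lat_extent_km = 13.32          # 0.12 degrees of latitude
lat_span_deg = 0.12 * lon_extent_km / lat_extent_km
lat_lo = 40.10 - lat_span_deg / 2
lat_hi = 40.10 + lat_span_deg / 2

print(round(minutes_for_size_300, 2), round(size_per_minute, 2))
print(round(lat_lo, 5), round(lat_hi, 5))
```

The resulting latitude limits match the values passed to plt.ylim in the cells that follow.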


In [46]:
plt.rcParams["figure.figsize"] = (20, 20)
plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas & MTD routes", fontsize=25)
plt.xlim(-88.339,-88.139)
plt.ylim(40.02329,40.17671)
r=11.2528
colorlist=("#ffcccc","#ffd5cc","#ffddcc","#ffe6cc","#ffeecc","#fff7cc","#ffffcc","#f7ffcc","#eeffcc","#e6ffcc","#ddffcc","#d5ffcc")

for i in range(12,0,-1):
    plt.plot(data["stop_lon"],data["stop_lat"],color=colorlist[12-i],marker='o',markersize=11.2528*i,mec='none',ls='')

mycolor=plt.cm.jet
color_id=np.linspace(0,1,677)
s=0
for name, group in df2:
    s=s+1 
#     print(name)
    #group.plot('shape_pt_lat','shape_pt_lon')
    plt.plot(group['shape_pt_lon'],group['shape_pt_lat'], color=plt.cm.binary(s/677), alpha = 0.2, linewidth = 2.5)
plt.show()



In [90]:
plt.rcParams["figure.figsize"] = (20, 20)
plt.grid()
plt.xlabel("Longitude",fontsize=15)
plt.ylabel("Latitude",fontsize=15)
plt.title("Walkable areas & MTD routes", fontsize=25)
plt.xlim(-88.339,-88.139)
plt.ylim(40.02329,40.17671)
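r=11.2528  # marker size equivalent to one minute of walking
colorlist=("#ffcccc","#ffd5cc","#ffddcc","#ffe6cc","#ffeecc","#fff7cc","#ffffcc","#f7ffcc","#eeffcc","#e6ffcc","#ddffcc","#d5ffcc")

# Draw the 12-minute buffers first and the 1-minute buffers last so the smaller circles stay visible
for i in range(12,0,-1):
    plt.plot(data["stop_lon"],data["stop_lat"],color=colorlist[12-i],marker='o',markersize=r*i,mec='none',ls='')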
for name in cleanedroute["name"]:
    tempshapeID=dftrips[dftrips["route_id"]==name]["shape_id"]
    count=0
    for i in tempshapeID:
        tempshapeIDvalue=i
        count+=1
        if count>20:
            break
        subrows=dfshapes[dfshapes["shape_id"]==tempshapeIDvalue]
        plt.plot(subrows["shape_pt_lon"],subrows["shape_pt_lat"],c=colors[names.index(name)],linewidth=2,mec='none',ls='-')#,alpha=0.05)



In [50]:
import pandas as pd
import numpy as np
import plotly
import plotly.plotly as py
from plotly.graph_objs import *

In [51]:
# Replace the placeholders with your own Plotly API key and Mapbox access token
plotly.tools.set_credentials_file(username='alexbear', api_key='YOUR_PLOTLY_API_KEY')
mapbox_access_token = 'YOUR_MAPBOX_ACCESS_TOKEN'

In [52]:
df = pd.read_csv('stops.txt',encoding='iso-8859-1')

11. How are the stops located on the map?


In [53]:
data = Data([
    Scattermapbox(
        lat=df['stop_lat'],
        lon=df['stop_lon'],
        mode='markers',
        marker=Marker(
            color='#FF9933',
            opacity=0.4,
            size=9
        ),
        text='Stopname: '+df['stop_name'],
    )
])
layout = Layout(
    title="Distribution of MTD stops in Champaign <br>(Hover for breakdown)",
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.11,
            lon=-88.23
        ),
        pitch=0,
        zoom=10
    ),
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Multiple Mapbox')


Out[53]:

In [92]:
import pandas as pd
df1 = pd.read_csv('/Users/celine/Desktop/5DataVisual/google_transit/shapes.txt')
#print(df1.head())

In [93]:
aa=df1['shape_id'][0]
good=(df1["shape_id"]==aa)
#df1[good]

In [95]:
import pandas as pd
df2 = pd.read_csv('trips.txt')
#df2.head()
good2=(df2['shape_id']==aa)
#df2[good2]
bb=df2["trip_id"][4704]
#bb  # target trip_id

In [96]:
import pandas as pd
df3 = pd.read_csv('stop_times.txt')
#df3.head()
good3=(df3['trip_id']==bb)
#df3[good3]

In [97]:
list_arrival_time=df3['arrival_time'][195862:195901].tolist()
list_stop_id=df3['stop_id'][195862:195901].tolist()

In [98]:
import pandas as pd
df4 = pd.read_csv('stops.txt')
#df4.head()
cc=list_stop_id[0]
#cc
good4=(df4['stop_id']==cc)
df4['stop_lat'][good4],df4['stop_lon'][good4]
df4['stop_lat'][1840],df4['stop_lon'][1840]


Out[98]:
(40.114158330000002, -88.173105000000007)

In [99]:
list_stop_lat=[]
list_stop_lon=[]
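for stops in list_stop_id:
    #print(stops)
    goodtemp=(df4['stop_id']==stops)
    # take the first matching value so the lists hold numbers rather than one-row Series
    list_stop_lat.append(df4['stop_lat'][goodtemp].iloc[0])
    list_stop_lon.append(df4['stop_lon'][goodtemp].iloc[0])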
    #list_stop_lat+=df4['stop_lat'][good4]
    #list_stop_lon+=df4['stop_lon'][good4]
    
#list_stop_lat

#list_row=[1840,393,986,618,2195,2407,223,704,442,2399,1111,1684,404,961,380,390,1266,1878,1598,150,1071,540,1131,180,2312,617,1803,2400,2262,1181,561,1894,1143,2000,790,1406,1035,2389]

In [100]:
list_row=[1840,393,986,618,2195,2407,223,704,442,2399,1111,1684,404,961,380,390,1266,1878,1598,150,1071,540,1131,180,2312,617,1803,2400,2262,1181,561,1894,1143,2000,790,1406,1035,2389,480]
list_stop_lat=[]
list_stop_lon=[]
for stops in list_row:
    #print(stops)
    #goodtemp=(df4['stop_id']==1840)
    list_stop_lat.append(df4['stop_lat'][stops])
    list_stop_lon.append(df4['stop_lon'][stops])
    #list_stop_lat+=df4['stop_lat'][good4]
    #list_stop_lon+=df4['stop_lon'][good4]
    
#list_stop_lat

In [101]:
list_distance=[]
for i in range(len(list_stop_lat)):
    good5=(df1["shape_id"]==aa)&(df1['shape_pt_lat']==list_stop_lat[i])&(df1['shape_pt_lon']==list_stop_lon[i])
    list_distance.append(df1['shape_dist_traveled'][good5])
    
#list_distance
row2=[0,116,152,200,275,323,371,430,461,561,697,849,962,1085,1110,1136,1170,1200,1259,1283,1298,1394,1457,1488,1542,1600,1700,1738,1767,1854,1932,1972,2031,2121,2191,2242,2291,2366,2481]
list_distance=[]
for i in row2:
    #print(i)
    #good5=(df1["shape_id"]==aa)&(df1['shape_pt_lat']==list_stop_lat[i])&(df1['shape_pt_lon']==list_stop_lon[i])
    list_distance.append(df1['shape_dist_traveled'][i])
    #print(df1['shape_dist_traveled'][i])
    
#list_distance

#good=(df1["shape_id"]==aa)
list_pt_lat=df1['shape_pt_lat'][good]
list_pt_lon=df1['shape_pt_lon'][good]

In [102]:
from datetime import datetime
time_format = '%H:%M:%S'  # named time_format to avoid shadowing the built-in format()
list_arrival=[]
for arrival in list_arrival_time:
    list_arrival.append(datetime.strptime(arrival, time_format))
    
#list_arrival
#print(datetime.strptime(list_arrival_time[1], format))
#- datetime.strptime(time1, format)

In [103]:
list_velocity=[]
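for i in range(38):  # 39 stops give 38 stop-to-stop segments
    ds=list_distance[i+1]-list_distance[i]
    dt=(list_arrival[i+1]-list_arrival[i]).total_seconds()
    if dt==0:
        # consecutive stops share the same scheduled time; assume 15 s to avoid dividing by zero
        dt=15
        
    list_velocity.append(ds/dt)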
    
#list_velocity
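
The cells above parse `arrival_time` with `datetime.strptime`, which works for this trip but would fail on times at or past 24:00:00, which the GTFS specification allows for trips that run after midnight. Below is a minimal sketch of an alternative, assuming times in `H:MM:SS` or `HH:MM:SS` form. The helper names `gtfs_seconds` and `segment_speeds` are hypothetical and not part of the original notebook; the 15-second fallback mirrors the cell above.

```python
def gtfs_seconds(hms):
    """Convert a GTFS 'HH:MM:SS' string (hours may exceed 23) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def segment_speeds(distances, times, zero_dt_fallback=15):
    """Return distance / elapsed seconds for each consecutive pair of stops."""
    speeds = []
    for i in range(len(distances) - 1):
        ds = distances[i + 1] - distances[i]
        dt = gtfs_seconds(times[i + 1]) - gtfs_seconds(times[i])
        if dt == 0:
            # Same scheduled time at both stops: fall back to an assumed interval
            dt = zero_dt_fallback
        speeds.append(ds / dt)
    return speeds
```

With the lists built above, `segment_speeds(list_distance, list_arrival_time)` should give the same values as `list_velocity`, without the hard-coded `range(38)`.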

12. How quickly do buses go?


In [104]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Colour-scale bounds for the velocity (vmin/vmax avoid shadowing the built-in min/max)
vmin, vmax = (2, 18)
step = 2
# Set up a colormap that is a simple blue-to-red transition
mymap = mpl.colors.LinearSegmentedColormap.from_list('mycolors',['blue','red'])
# Use contourf on dummy data to obtain the colorbar info, then clear the figure
Z = [[0,0],[0,0]]
levels = range(vmin,vmax+step,step)
CS3 = plt.contourf(Z, levels, cmap=mymap)
plt.clf()
figure3=plt.figure(figsize=(12,8))
plt.plot(list_stop_lon,list_stop_lat,'x')
plt.legend(["stop"])
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# Draw each stop-to-stop segment of the shape, coloured by its average velocity
for i in range(38):
    start=row2[i]
    end=row2[i+1]
    x=list_pt_lon[start:end]
    y=list_pt_lat[start:end]
    z=list_velocity[i]
    r = (float(z)-vmin)/(vmax-vmin)
    g = 0
    b = 1-r
    plt.plot(x,y,color=(r,g,b), linewidth=3)

plt.xlim([-88.25,-88.16])
plt.ylim(40.11,40.12)
plt.colorbar(CS3) # using the colorbar info obtained from contourf
plt.show()


<matplotlib.figure.Figure at 0x1048da828>

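One trip of the Orange line (trip ID @2.0.86175868@][1458585713139]/96__GN7_MF). The color represents the velocity of the bus, computed as the change in shape_dist_traveled divided by the elapsed seconds between consecutive stops, so it is expressed in shape_dist_traveled units per second.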
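
The plotting loop rescales each velocity linearly between the lower and upper colour bounds (2 and 18). Any segment whose velocity falls outside that range produces an RGB component outside [0, 1], which matplotlib rejects. Below is a minimal sketch of a clamped version of the same blue-to-red mapping; the function name `speed_to_rgb` and its parameter names are hypothetical.

```python
def speed_to_rgb(speed, vmin=2.0, vmax=18.0):
    """Map a speed to an (r, g, b) tuple: pure blue at vmin, pure red at vmax."""
    fraction = (float(speed) - vmin) / (vmax - vmin)
    fraction = max(0.0, min(1.0, fraction))  # clamp to the valid colour range
    return (fraction, 0.0, 1.0 - fraction)
```

In the loop above, `color=speed_to_rgb(list_velocity[i])` could stand in for the manual `(r,g,b)` tuple; speeds beyond the bounds are then drawn in the end colours instead of raising an error.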

