In [1]:
import pandas as pd
import pg8000
from datetime import datetime
import numpy as np
import matplotlib
%matplotlib inline
In [2]:
# Set up the database connection
conn = pg8000.connect(host='training.c1erymiua9dx.us-east-1.rds.amazonaws.com', port=5432, database='training', user='dot_student', password='qgis')
In [3]:
# Rollback database connection in case of problems
conn.rollback()
In [4]:
# Getting the data from the database
cursor = conn.cursor()
db = list()
cursor.execute("SELECT * FROM dot_311")
for item in cursor.fetchall():
    db.append(item)
In [5]:
# Getting the headers from the database
cursor.execute("SELECT column_name FROM information_schema.columns WHERE table_name='dot_311'")
header = list()
for item in cursor.fetchall():
    header.append(item[0])
In [6]:
# Creating the Pandas Dataframe
df = pd.DataFrame(data=db)
In [7]:
# Substituting the column names
df.columns = header
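As a possible shortcut, pandas can run the query and pick up the column names in one step. This is only a sketch: pd.read_sql officially documents SQLAlchemy connectables (and sqlite3), so handing it the raw pg8000 connection may work but can emit a warning depending on the pandas version.
# One-step load (assumption: read_sql accepts this DBAPI connection object)
df_alt = pd.read_sql("SELECT * FROM dot_311", conn)
df_alt.head()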
In [8]:
df.head()
Out[8]:
In [9]:
df[['created_date', 'closed_date',]].head(5)
Out[9]:
In [10]:
# Creating a new column with the time elapsed between the creation of a ticket and its closing.
df['time_elapsed'] = df['closed_date'] - df['created_date']
df[['created_date', 'closed_date', 'time_elapsed']].head(5)
Out[10]:
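The subtraction above only yields a timedelta if both columns already have a datetime dtype. As a hedge, in case the driver returned them as plain strings, a defensive conversion would look like this:
# Only needed if created_date / closed_date came back as strings rather than datetimes
df['created_date'] = pd.to_datetime(df['created_date'])
df['closed_date'] = pd.to_datetime(df['closed_date'])
df.dtypes[['created_date', 'closed_date']]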
In [13]:
# Processing a request takes time. Unfortunately, there are negative durations in the dataset. That's not possible in reality...
# Let's check how many of the durations are negative.
df[df['time_elapsed'] < '00:00:00']['unique_key'].count()
Out[13]:
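To get a feel for what these implausible rows look like, here is a quick peek at a few of them (pd.Timedelta(0) is equivalent to the '00:00:00' string comparison used above):
df[df['time_elapsed'] < pd.Timedelta(0)][['created_date', 'closed_date', 'time_elapsed']].head()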
In [12]:
# Let's check how many items are in the dataset in total.
df['unique_key'].count()
Out[12]:
In [219]:
# There are also a lot of tickets that have been opened and closed at the same time.
df[df['time_elapsed'] == '00:00:00']['unique_key'].count()
Out[219]:
In [220]:
# What percentage of the times are wrong?
100 * df[df['time_elapsed'] <= '00:00:00']['unique_key'].count() / df['unique_key'].count()
Out[220]:
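The same percentage can be computed with a boolean mean, which avoids the double count(); the two results only differ if some closed dates are missing, since count() skips NaN:
100 * (df['time_elapsed'] <= pd.Timedelta(0)).mean()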
In [53]:
# Every fifth entry seems to be strange. That is a lot. If the wrong ones are distributed randomly,
# we could go on. If most of them are from the same department or about the same topic,
# we would have to think about it.
print("This are all complaint categories:\n\n", df['complaint'].value_counts())
wrong = df[df['time_elapsed'] <= '00:00:00']
print("This are the categories affected by non plausible values:\n\n", wrong['complaint'].value_counts(), "\n\n")
print("In the category Street Light Condition", 100 * wrong[wrong['complaint'] == 'Street Light Condition']['complaint'].value_counts() / df[df['complaint'] == 'Street Light Condition']['complaint'].value_counts(), "percent are wrong.")
In [56]:
# About half of the entries about "Street Light Condition" are wrong. Let's exclude these from our analysis.
# We also exclude the other implausible data. We need to mention this in our article, though.
df_plausible = df[df['complaint'] != 'Street Light Condition']
df_plausible = df_plausible[df_plausible['time_elapsed'] > '00:00:00'].copy()
df_plausible.head(3)
Out[56]:
In [57]:
# To be able to plot it, we will add a new column with the time converted to minutes.
df_plausible['time_elapsed_minutes'] = df_plausible['time_elapsed'].apply(lambda x: x / np.timedelta64(1,'m'))
df_plausible[['time_elapsed', 'time_elapsed_minutes']].head()
Out[57]:
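An equivalent conversion uses the .dt accessor instead of dividing by a timedelta unit:
df_plausible['time_elapsed'].dt.total_seconds() / 60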
In [80]:
# Let's get an overview by plotting it.
ax = df_plausible['time_elapsed_minutes'].hist(bins=60)
ax.set_title("The city takes little time to work on complaints")
ax.set_xlabel("Time in Minutes")
ax.set_ylabel("Number of Complaints")
Out[80]:
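Because the distribution is so heavily skewed, the long tail is barely visible on a linear count axis. A sketch of the same histogram with a logarithmic y-axis (log=True is passed through to matplotlib's hist):
ax = df_plausible['time_elapsed_minutes'].hist(bins=60, log=True)
ax.set_xlabel("Time in Minutes")
ax.set_ylabel("Number of Complaints (log scale)")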
In [82]:
# How long does it take them to close a complaint? Let's check the mean, the quartiles, the min and the max.
df_plausible['time_elapsed'].describe()
Out[82]:
In [83]:
# Since the distribution is nowhere near a bell-shaped curve, we'd better use the median instead of the mean.
df_plausible['time_elapsed'].median()
Out[83]:
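For readability, the same median can be expressed in hours by dividing by a one-hour timedelta:
df_plausible['time_elapsed'].median() / pd.Timedelta(hours=1)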
In [84]:
# Let's just look at the complaints that were closed in the first hour after their opening.
df_plausible[df_plausible['time_elapsed_minutes'] < 60]['time_elapsed_minutes'].hist(bins=60)
Out[84]:
In [87]:
# Let's check what types of complaints take the longest and the shortest to get closed.
df_plausible.groupby(by='complaint')['time_elapsed_minutes'].median().sort_values()
Out[87]:
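To check whether the ranking is driven by tiny categories, the medians can be shown alongside the number of complaints per category:
df_plausible.groupby(by='complaint')['time_elapsed_minutes'].agg(['median', 'count']).sort_values('median')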
In [96]:
# Let's plot that.
df_plausible.groupby(by='complaint')['time_elapsed_minutes'].median().sort_values(ascending=False).plot(kind='barh', legend=False)
Out[96]:
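Following the labeling pattern used for the histogram above, the bar chart could be labeled like this (the axis texts are suggestions, not from the original):
ax = df_plausible.groupby(by='complaint')['time_elapsed_minutes'].median().sort_values(ascending=False).plot(kind='barh', legend=False)
ax.set_xlabel("Median time in minutes")
ax.set_ylabel("Complaint category")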
In [101]:
df_plausible[df_plausible['complaint'] == 'Traffic Signal Condition']['time_elapsed'].median()
Out[101]:
In [102]:
df_plausible[df_plausible['complaint'] == 'Public Toilet']['time_elapsed'].median()
Out[102]:
In [103]:
df_plausible[df_plausible['complaint'] == 'Broken Parking Meter']['time_elapsed'].median()
Out[103]:
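The three single-category medians above can also be computed in one pass; this is just a compact equivalent of the three cells above:
selected = ['Traffic Signal Condition', 'Public Toilet', 'Broken Parking Meter']
df_plausible[df_plausible['complaint'].isin(selected)].groupby('complaint')['time_elapsed'].median()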