Assignment 1
Perform a basic statistical analysis of the time DOT 311 (table is called dot_311) complaints are open (subtract closed date from created date)
Connect to the database to get the data and do the analysis. Submit the code through Github and type up your results in your PR
Host: training.c1erymiua9dx.us-east-1.rds.amazonaws.com
Port: 5432
Database: training
Username: dot_student
Password: qgis
In [146]:
import pandas as pd
import pg8000 as pg
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime, timedelta
In [82]:
engine = create_engine('postgresql+pg8000://dot_student:qgis@training.c1erymiua9dx.us-east-1.rds.amazonaws.com:5432/training')
In [85]:
con=engine.raw_connection()
In [86]:
df=pd.read_sql_query('select * from dot_311', con=con)
In [88]:
len(df)
Out[88]:
In [89]:
df.head(5)
Out[89]:
In [92]:
df.columns
# at first i thought this meant my data was messed up, but these are all just bytes literal (apparently). i had to stringfy them in place
Out[92]:
In [100]:
df.rename?
In [101]:
for bytes_name in df.columns:
df.rename(columns = {bytes_name:bytes_name.decode('utf-8')}, inplace = True)
#print(bytes_name.decode('utf-8'))
In [102]:
df.columns
Out[102]:
In [105]:
df.groupby(by='agency').describe()
Out[105]:
In [ ]:
# apparently every complaint is DOT? that doesn't seem right but i tried pulling the data again and got the same thing. i might be pulling wrong, but
# we will finish with what we have
In [108]:
df['closed_date'][0]-df['created_date'][0]
Out[108]:
In [110]:
df['open_duration']=df['closed_date']-df['created_date']
In [115]:
df['created_date'].head(5)
Out[115]:
In [114]:
df.plot(kind='scatter', x='created_date', y='open_duration')
In [ ]:
# apparently matplotlib doesnt like datetime64, so lets convert them all
In [129]:
df['created']=pd.to_datetime(df['created_date'])
In [130]:
df['closed']=pd.to_datetime(df['closed_date'])
In [132]:
df['duration']=df['closed']-df['created']
In [135]:
plt.plot_date(df['created'], df['duration'])
Out[135]:
In [ ]:
# that is just about the most worthless chart ive ever seen. what even is that?
In [140]:
df['duration'].describe()
Out[140]:
In [142]:
# oh that might be why. the minimum time is -19 days, neat
In [151]:
df_2=df[df['duration'] > timedelta(minutes=1)]
In [152]:
plt.plot_date(df_2['created'], df_2['duration'])
Out[152]:
In [153]:
# i am still not certain what that might mean, but its better than the first one
In [158]:
df_2['duration'].describe()
Out[158]:
In [161]:
df_2.index=df_2['created']
In [162]:
df_2.resample('H').count().plot(y='unique_key')
Out[162]:
In [163]:
# thats kind of a more meaningful chart? im getting somewhere, i think
In [167]:
df_2['duration'].groupby(df_2.index.hour).describe()
Out[167]:
In [168]:
# well, that was ticket duration by hour opened in the day. now to plot that somehow
In [177]:
df_2['duration'].groupby(df_2.index.hour).plot(y='duration')
In [ ]: