This notebook focuses on time analysis. The dataset covers crime records in San Francisco from 2003-01-06 to 2015-05-13.
Examine patterns of crimes over years
Examine seasonal/monthly patterns of crimes
Examine hourly patterns
Examine weekly patterns
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib.colors import rgb2hex
In [2]:
train = pd.read_csv('train.csv')
train.shape
Out[2]:
In [3]:
train['Dates'] = pd.to_datetime(train['Dates'])
train['Dates'].describe()
Out[3]:
In [4]:
train.head()
Out[4]:
In [5]:
train['year'] = train['Dates'].dt.year
train['month'] = train['Dates'].dt.month
train['day'] = train['Dates'].dt.day
train['dayofweek'] = train['Dates'].dt.dayofweek
train['hour'] = train['Dates'].dt.hour
There are 39 types of crimes in the dataset. The bottom categories have less than 1000 in total counts. For simplicity and clear visualization, subset the dataset to focus on top 10 crimes.
In [6]:
crime_counts = train['Category'].value_counts()  # value_counts() already returns a Series
crime_counts
Out[6]:
In [7]:
top10_crime = crime_counts.index[:10]
train_subset = train[train['Category'].isin(top10_crime)]
print(train_subset.shape)
train_subset['Category'].value_counts()
Out[7]:
To look at changes over the years for the top 10 crimes, use resample() to plot annual crime counts. Since the data from 2015 is incomplete (it ends in May), drop 2015 and focus on 2003 through 2014. Among the top 10 crimes, larceny/theft and non-criminal increased significantly after 2010, whereas vehicle theft and drug/narcotic decreased significantly after 2005. Assault is stable.
In [8]:
# use top 10 crimes only; index by date for time-based resampling
train_ts = train_subset.set_index('Dates', drop=True)[['Category']]
# 2015 covers only part of the year, so drop it; copy so new columns can be added
train_ts2 = train_ts[train_ts.index.year < 2015].copy()
In [11]:
# count crimes for each category by adding a 0/1 indicator column per category
for col in top10_crime:
    train_ts2[col] = (train_ts2['Category'] == col).astype(int)
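The loop above builds one 0/1 indicator column per category. The same result can be had in a single call with `pd.get_dummies`; a minimal sketch on a tiny synthetic frame standing in for `train_ts2`:

```python
import pandas as pd

# Tiny synthetic stand-in for train_ts2 (the real notebook frame is assumed)
df = pd.DataFrame({'Category': ['LARCENY/THEFT', 'ASSAULT', 'LARCENY/THEFT']})

# One indicator column per category in one step, equivalent to the
# explicit loop over top10_crime above
dummies = pd.get_dummies(df['Category']).astype(int)
df = pd.concat([df, dummies], axis=1)
```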
In [12]:
plt.figure(figsize=(12, 6))
colors = [rgb2hex(color) for color in sns.color_palette("Paired", 10)]
for i, col in enumerate(train_ts2['Category'].unique()):
    # resample(..., how=sum) is the old pandas API; .resample().sum() replaces it
    train_ts2[col].resample('AS-JAN').sum().plot(label=col, color=colors[i])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Out[12]:
To get a clean look at each crime category, plot them individually. Larceny/theft and non-criminal increased significantly after 2010, whereas vehicle theft and drug/narcotic decreased significantly after 2005. Assault is stable.
In [13]:
fig = plt.figure(figsize=(12, 10))
for index, col in enumerate(train_ts2['Category'].unique()):
    ax = fig.add_subplot(4, 3, index + 1)
    train_ts2[col].resample('AS-JAN').sum().plot(title=col, ax=ax)
    ax.set_xlabel('Year')
plt.tight_layout()
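The `how=sum` keyword used with `resample()` here has since been removed from pandas; in current versions the same annual aggregation is `.resample('AS-JAN').sum()`, or, avoiding frequency aliases entirely, a groupby on the index year. A minimal sketch with synthetic daily counts (not the real dataset):

```python
import pandas as pd

# Synthetic daily counts over three years, standing in for one crime column
idx = pd.date_range('2003-01-01', '2005-12-31', freq='D')
s = pd.Series(1, index=idx)

# Annual totals; equivalent to s.resample('AS-JAN').sum() but
# independent of pandas frequency-alias changes
yearly = s.groupby(s.index.year).sum()
```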
Resample by month to look for seasonal patterns. First plot the top 10 crimes together, then plot them individually. Most crimes have two peaks each year.
In [14]:
plt.figure(figsize=(12, 6))
colors = [rgb2hex(color) for color in sns.color_palette("Paired", 10)]
for i, col in enumerate(train_ts2['Category'].unique()):
    train_ts2[col].resample('M').sum().plot(label=col, color=colors[i])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Out[14]:
In [15]:
fig = plt.figure(figsize=(12, 10))
for index, col in enumerate(train_ts2['Category'].unique()):
    ax = fig.add_subplot(4, 3, index + 1)
    train_ts2[col].resample('M').sum().plot(title=col, ax=ax)
    ax.set_xlabel('Year')
plt.tight_layout()
To get a better look at seasonal patterns, aggregate all crimes for each month. October has the highest crime count, followed by April, whereas February and December have the lowest.
In [27]:
g1 = train.groupby(['month','year']).size().reset_index()
g1 = g1.rename(columns={0:'Crime Counts'})
sns.boxplot(x='month',y='Crime Counts',data=g1)
plt.title('Crime by Month',fontsize=20)
Out[27]:
Another way to look for a seasonal pattern in each type of crime is to aggregate and normalize the counts, to take into account the different baselines, and plot them over the months. As shown below, most of the top 10 crimes follow the same seasonal pattern as overall crime, peaking in October and April and reaching low points in December and February.
In [29]:
g1 = train_subset.groupby(['Category','month']).size().reset_index()
g1 = g1.rename(columns={0:'crime count'})
g2 = g1.groupby('Category')['crime count']
g1['normalized count'] = g2.apply(lambda x: (x - x.mean()) / x.std()) #add a column
plt.figure(figsize=(12,6))
sns.pointplot(x='month',y='normalized count',hue='Category',data=g1)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Normalized Counts')
plt.title('Normalized crime counts by month',fontsize=20)
Out[29]:
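The per-category z-score above is computed with `apply`; `groupby(...).transform` is the more idiomatic tool for this, since it guarantees the result is aligned to the original row order. A minimal sketch on synthetic counts standing in for `g1`:

```python
import pandas as pd

# Synthetic per-category counts (stand-in for the real g1 frame)
g1 = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'crime count': [10, 20, 30, 100, 200, 300],
})

# Per-group z-score; transform returns a series aligned to g1's index,
# so it can be assigned directly as a new column
g1['normalized count'] = (g1.groupby('Category')['crime count']
                            .transform(lambda x: (x - x.mean()) / x.std()))
```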
To find patterns of crime over the 24 hours of the day, aggregate all crimes and plot by hour. Then, for the top 10 crimes, aggregate, normalize, and plot by hour. Most crimes follow the overall pattern: 2-6 am has the fewest crimes; counts start to rise in the morning, peak at noon, and then stay more or less stable until midnight.
In [30]:
g3 = train.groupby(['hour','day','month','year']).size().reset_index()
sns.boxplot(x='hour',y=0,data=g3)
plt.ylabel('Crime Counts')
plt.title('Crime by Hour of the day',fontsize=20)
plt.ylim([0,100])
Out[30]:
In [31]:
g1 = train_subset.groupby(['Category','hour']).size().reset_index()
g1 = g1.rename(columns={0:'crime count'})
g2 = g1.groupby('Category')['crime count']
g1['normalized count'] = g2.apply(lambda x: (x - x.mean()) / x.std()) #add a column
plt.figure(figsize=(12,6))
sns.pointplot(x='hour',y='normalized count',hue='Category',data=g1)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Normalized counts')
plt.title('Normalized crime counts by hour',fontsize=20)
Out[31]:
In [20]:
fig = plt.figure(figsize=(12, 10))
for index, col in enumerate(g1['Category'].unique()):
    temp = g1[g1['Category'] == col]
    ax = fig.add_subplot(4, 3, index + 1)
    sns.pointplot(x='hour', y='normalized count', data=temp, ax=ax)
    ax.set_xlabel('Hour')
    ax.set_ylabel('')
    ax.set_title(col)
plt.tight_layout()
To find weekly patterns, aggregate all crimes and plot by day of week. Friday has the highest count and Sunday the lowest, but the difference is small. For the top 10 crimes, after aggregation and normalization, most peak on Friday and decrease over the weekend, with the exception of assault, which peaks on Saturday and Sunday. Larceny/theft and vandalism peak on Friday and Saturday, then decrease on Sunday.
In [32]:
g2 = train.groupby(['dayofweek','month','year']).size().reset_index()
sns.boxplot(x='dayofweek',y=0,data=g2)
plt.ylabel('Crime Counts')
plt.title('Crime by day of week',fontsize=20)
Out[32]:
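The x-axis above shows the raw `dt.dayofweek` codes, which pandas defines as Monday=0 through Sunday=6. A small sketch of mapping those codes to weekday names (using the standard-library `calendar` module) so plots read as weekdays rather than 0-6:

```python
import calendar
import pandas as pd

# pandas dt.dayofweek encodes Monday=0 ... Sunday=6; map codes to names
day_names = dict(enumerate(calendar.day_name))

codes = pd.Series([0, 4, 6])
labels = codes.map(day_names)
```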
In [33]:
g1 = train_subset.groupby(['Category','dayofweek']).size().reset_index()
g1 = g1.rename(columns={0:'crime count'})
g2 = g1.groupby('Category')['crime count']
g1['normalized count'] = g2.apply(lambda x: (x - x.mean()) / x.std()) #add a column
plt.figure(figsize=(12,6))
sns.pointplot(x='dayofweek',y='normalized count',hue='Category',data=g1)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Normalized counts')
plt.title('Normalized crime counts by day of week',fontsize=20)
Out[33]:
In [23]:
fig = plt.figure(figsize=(12, 10))
for index, col in enumerate(g1['Category'].unique()):
    temp = g1[g1['Category'] == col]
    ax = fig.add_subplot(4, 3, index + 1)
    sns.pointplot(x='dayofweek', y='normalized count', data=temp, ax=ax)
    ax.set_xlabel('Day of Week')
    ax.set_ylabel('')
    ax.set_title(col)
plt.tight_layout()