In this notebook I first do some data cleaning, then perform a basic analysis of the dataset (number of cases per year, correlation between population and number of cases, etc.).
Finally, I analyze the simplified municipality indicators. In particular, I use Lasso regression to find the features that are most correlated with the number of cases. I also analyze the features themselves, to see how they are correlated with each other and how they cluster together.
In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)
%matplotlib inline
The following are the dataset files containing the KPI data. They have to be loaded and concatenated into a single pandas dataframe.
In [2]:
data_dir = './'
In [3]:
kpi_files = [data_dir + kpi for kpi in ['kpis_1998_2003.csv',
                                        'kpis_2004_2008.csv',
                                        'kpis_2009_2011.csv',
                                        'kpis_2012_2013.csv']]
I want to be sure that headers are consistent for all KPI files. I raise an exception if that is not the case.
In [4]:
# Check that all headers are the same
curHeaders = None
for csv in kpi_files:
    with open(csv, 'r') as f:
        for line in f:
            if curHeaders is None:
                curHeaders = line.strip()
            elif curHeaders != line.strip():
                raise Exception('KPI headers mismatch')
            break
In [5]:
kpis = pd.concat([pd.read_csv(f) for f in kpi_files])
'Total number of KPIs: {}'.format(len(kpis.index))
Out[5]:
I load now the other CSV files as pandas dataframes.
In [6]:
municipality_indicators = pd.read_csv(data_dir + 'municipality_indicators.csv')
simplified_municipality_indicators = pd.read_csv(data_dir + 'simplified_municipality_indicators.csv')
school_fire_cases = pd.read_csv(data_dir + 'school_fire_cases_1998_2014.csv')
After some manual data wrangling, I've noticed that the KPI files contain the same header line multiple times. I guess each file is a concatenation of several smaller original files. I will remove these lines manually.
In [7]:
# Rows to be removed will have field 'kpi' equal to string 'kpi', 'period' equal to 'period'
# and so on. One single check on the first attribute should be enough.
kpis = kpis[kpis['kpi'] != 'kpi']
'Total number of KPIs after cleaning: {}'.format(len(kpis.index))
Out[7]:
The value field sometimes contains the string "None". For the purposes of this analysis it should be fine to set it to zero. Then I convert both period and value to numeric types; they will be used below during the analysis.
In [8]:
print(kpis.dtypes)
kpis['period'] = kpis['period'].astype(int)
kpis['value'] = kpis['value'].replace(['None'], [0.]).astype(float)
print(kpis.dtypes)
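A slightly more defensive alternative (just a sketch, not what the cell above does) would let pandas coerce anything non-numeric to NaN and then fill with zero:
# Alternative conversion sketch: coerce non-numeric values to NaN, then fill with 0
kpis['value'] = pd.to_numeric(kpis['value'], errors='coerce').fillna(0.)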
Check that municipality names of KPIs data match those of simplified municipality indicators (SMIs).
In [9]:
names_simplified_municipality_indicators = set(simplified_municipality_indicators['name'])
names_kpis = set(kpis['municipality_name'])
assert names_simplified_municipality_indicators ^ names_kpis == set()
In [10]:
municipality_types = set(simplified_municipality_indicators['municipalityType'])
'Number of municipality types = {}'.format(len(municipality_types))
Out[10]:
In [11]:
total_fire_cases = school_fire_cases['Cases'].sum()
cases_years = school_fire_cases['Year']
print('Number of unique years = {}'.format(len(cases_years.unique())))
period_desc = '{}-{}'.format(cases_years.min(), cases_years.max())
print('Number of total fire cases in period {} = {}'.format(period_desc, total_fire_cases))
print('Total number of municipalities = {}'.format(len(school_fire_cases['Municipality'].unique())))
In [12]:
total_cases_by_municipality = school_fire_cases.groupby('Municipality').sum()['Cases'].sort_values(ascending=False)
In [13]:
max_cases_per_year = school_fire_cases.sort_values(by='Cases', ascending=False) \
                                       .groupby('Year', as_index=False) \
                                       .first()
print('The following municipalities were the ones with the highest number of cases during the period {}:\n{}' \
      .format(period_desc, max_cases_per_year['Municipality'].unique()))
In [14]:
piechart_data = total_cases_by_municipality[:20].copy()
others = total_cases_by_municipality[20:]
# Lump all remaining municipalities together into a single 'Others' slice
piechart_data['Others'] = others.sum()
f, ax = plt.subplots(figsize=(11, 4))
plt.axis('equal');
plt.pie(piechart_data, labels=piechart_data.index);
In [15]:
cases_by_year = school_fire_cases.groupby('Year')
f, ax = plt.subplots(figsize=(11, 4))
plt.xlabel('Year')
plt.ylabel('No. of cases')
_ = plt.plot(cases_by_year.sum()['Cases'])
In [16]:
print('Average cases = {}, standard deviation = {}, median = {}, 75th percentile = {}'
      .format(total_cases_by_municipality.mean(),
              total_cases_by_municipality.std(),
              total_cases_by_municipality.quantile(.5),
              total_cases_by_municipality.quantile(.75)))
f, ax = plt.subplots(figsize=(11, 4))
plt.xlabel('No. of cases')
plt.ylabel('No. of municipalities')
_ = plt.hist(total_cases_by_municipality, bins=100)
I show some miscellaneous info about population and the number of cases, then compute and show their correlation.
In [17]:
population = school_fire_cases['Population']
print('Max population = {}, min population = {}'.format(population.max(), population.min()))
cases = school_fire_cases['Cases']
print('Max cases = {}, min cases = {}'.format(cases.max(), cases.min()))
In [18]:
reg = linear_model.LinearRegression()
features = population.values.reshape(-1, 1)
targets = cases.values.reshape(-1, 1)
reg.fit(features, targets)
print('Slope = {}, intercept = {}, score (R^2) = {}'.format(reg.coef_[0], reg.intercept_, reg.score(features, targets)))
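For a one-feature linear regression, R² is just the squared Pearson correlation, so the correlation itself can be read off directly. A minimal check (sketch):
# Pearson correlation between population and number of cases (sketch)
pearson_r = np.corrcoef(population, cases)[0, 1]
print('Pearson r = {:.3f}, r^2 = {:.3f}'.format(pearson_r, pearson_r ** 2))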
In [19]:
f, ax = plt.subplots(figsize=(11, 4))
plt.xlim([0, 1000000])
plt.ylim([0,60])
plt.scatter(population, cases)
_ = plt.plot(features, reg.predict(features), color='r')
I plot here the mean KPI value aggregated by year. I see some similarities with the "Cases by year" plot above. Are KPI values and the number of cases correlated? I would have expected the opposite.
In [20]:
kpis_by_municipality = kpis['value'].groupby(kpis['municipality_name'])
kpis_by_period = kpis['value'].groupby(kpis['period'])
In [21]:
f, ax = plt.subplots(figsize=(11, 4))
plt.xlabel('Year')
plt.ylabel('Mean KPI value')
_ = plt.plot(kpis_by_period.mean())
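To go beyond eyeballing the two plots, a quick sketch (assuming the cells above have been run) correlates the yearly mean KPI value with the yearly total of cases over the overlapping years:
# Sketch: correlation between yearly mean KPI value and yearly total of cases
yearly_cases = cases_by_year.sum()['Cases']
yearly_kpi_mean = kpis_by_period.mean()
common_years = yearly_cases.index.intersection(yearly_kpi_mean.index)
print(np.corrcoef(yearly_cases.loc[common_years], yearly_kpi_mean.loc[common_years])[0, 1])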
Let's choose a subset of SMI features that make sense.
In [22]:
mun_indicators_features_list = ['medianIncome',
                                'youthUnemployment2010',
                                'youthUnemployment2013',
                                'unemploymentChange',
                                'reportedCrime',
                                'populationChange',
                                'hasEducation',
                                'asylumCosts',
                                'urbanDegree',
                                'foreignBorn',
                                'reportedCrimeVandalism',
                                'youngUnskilled',
                                'latitude',
                                'longitude',
                                'population',
                                'populationShare65plus',
                                'refugees',
                                'rentalApartments',
                                'fokusRanking',
                                'foretagsklimatRanking',
                                'cars',
                                'motorcycles',
                                'tractors',
                                'snowmobiles']
In [23]:
mun_indicators_features = simplified_municipality_indicators.loc[:, mun_indicators_features_list].values
y_cases = [total_cases_by_municipality[m] for m in simplified_municipality_indicators['name']]
I use Lasso regression on the selected features of the simplified municipality indicators dataframe to estimate the importance of each individual feature.
In [24]:
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(mun_indicators_features, y_cases)
Out[24]:
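Note that the selected features live on very different scales (population vs. latitude, for example), so the raw Lasso coefficients below mix scale effects with actual importance. A sketch of the same fit on standardized features, not used for the plots that follow:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: standardize features before fitting, so coefficients are comparable
scaled_lasso = make_pipeline(StandardScaler(), linear_model.Lasso(alpha=0.1))
scaled_lasso.fit(mun_indicators_features, y_cases)
scaled_lasso.named_steps['lasso'].coef_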
In [25]:
lasso.coef_
Out[25]:
In [26]:
features_by_coef = sorted(zip(mun_indicators_features_list, lasso.coef_), key=lambda tup: tup[1], reverse=True)
chart_x = [t[0] for t in features_by_coef]
chart_y = [t[1] for t in features_by_coef]
The most positively correlated features are "reportedCrimeVandalism", "foreignBorn" and "hasEducation"; the most negatively correlated are "populationShare65plus", "longitude" (but why?), "populationChange" and "youngUnskilled" (they didn't go to school?). But what about "youthUnemployment" and "motorcycles"?
In [27]:
f, ax = plt.subplots(figsize=(11, 4))
plt.xticks(range(len(chart_x)), chart_x)
locs, labels = plt.xticks()
plt.setp(labels, rotation=90)
_ = ax.bar(range(len(chart_x)), chart_y, 0.3, color="blue")
I plot the correlation between the SMIs features as a heatmap.
In [28]:
indicators_and_cases = simplified_municipality_indicators.loc[:, mun_indicators_features_list]
cor_mat = indicators_and_cases.corr()
In [29]:
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(cor_mat,linewidths=.5, ax=ax);
I show in the following table the most important correlations, filtering out self-correlations. Obvious: latitude/snowmobiles, tractors/urbanDegree, youthUnemployment2010/youthUnemployment2013. Interesting: hasEducation/medianIncome, populationChange, populationShare65plus (negative).
In [30]:
threshold = 0.7
important_corrs = (cor_mat[abs(cor_mat) > threshold][cor_mat != 1.0]) \
    .unstack().dropna().to_dict()
unique_important_corrs = pd.DataFrame(
    list(set([(tuple(sorted(key)), important_corrs[key])
              for key in important_corrs])), columns=['attribute pair', 'correlation'])
# sorted by absolute value of the correlation
unique_important_corrs = unique_important_corrs.iloc[
    abs(unique_important_corrs['correlation']).argsort()[::-1]]
unique_important_corrs
Out[30]:
In [31]:
# See https://www.kaggle.com/cast42/santander-customer-satisfaction/exploring-features
import matplotlib.patches as patches
from scipy.cluster import hierarchy
from scipy.stats.mstats import mquantiles
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
In [32]:
# scale to mean 0, variance 1
train_std = pd.DataFrame(scale(indicators_and_cases))
train_std.columns = indicators_and_cases.columns
m = train_std.corr()
l = linkage(m, 'ward')
In [33]:
mclust = sns.clustermap(m,
                        linewidths=0,
                        cmap=plt.get_cmap('RdBu'),
                        vmax=1,
                        vmin=-1,
                        figsize=(14, 14),
                        row_linkage=l,
                        col_linkage=l)
# http://stackoverflow.com/a/34697479/297313
_ = plt.setp(mclust.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
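To make the feature grouping in the clustermap explicit, a small sketch (assuming the cells above have been run) cuts the same linkage into, say, five clusters:
from scipy.cluster.hierarchy import fcluster

# Sketch: assign each feature to one of 5 clusters using the linkage computed above
clusters = fcluster(l, t=5, criterion='maxclust')
for cluster_id, feature in sorted(zip(clusters, m.columns)):
    print(cluster_id, feature)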