EE-558 A Network Tour of Data Science, EPFL
Joël M. Fonseca, Nelson Antunes, Enguerrand Granoux and Hedi Driss
The aim of this project is to understand terrorist attacks from different perspectives. The data scientist's toolkit provides a panel of useful methods that help highlight the main characteristics of the data we are investigating. The project structure is defined as follows:
I. Data Acquisition
In this part, the goal is to understand the data. What information do we have? How is it represented? Can missing values be recovered without making wrong assumptions? Which variables are relevant for this project? These are the kinds of questions we try to answer.
II. Data Exploration:
Here, we need to bring the data alive, to let it express itself. We present a map that dynamically illustrates the evolution of all the terrorist attacks available in the database. We also plot some basic facts, like the most affected countries and the most used types of weapon. Finally, we propose alternative directions that can be developed further, like time series.
III. Data Exploitation:
In this section, we use the power of mathematics to represent the data at a more abstract level, in order to find the hidden structure of the attacks. In particular, Principal Component Analysis (PCA) is used on a specific terrorist group to highlight patterns of similarity. Then, a spectral embedding with the Laplacian Eigenmaps algorithm is applied and the two techniques are compared. In a second part, we remove the group labels and see if we can recover them with different clustering algorithms.
IV. Conclusion:
Finally, the conclusions drawn from the results found in the two previous sections are summarized.
In [2]:
# the basics
import numpy as np
import pandas as pd
import seaborn as sns
# useful libraries
import googlemaps
# for visualization
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
# To match name
import difflib as di
# Clustering part
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture  # GMM was removed from recent scikit-learn; GaussianMixture is the current API
# avoid warnings
import warnings
warnings.filterwarnings('ignore')
# automatically reloads the different .py's before running a cell
%load_ext autoreload
%autoreload 2
In this project, we will work on the Global Terrorism Database, an open-source database containing information about more than 170,000 terrorist attacks around the world from 1970 through 2016, provided by the START consortium. It can be downloaded here. Make sure you download it before running this notebook.
Let's first of all make the data usable for our analysis.
We will first analyze the database to understand its content and then restrict ourselves to the most significant features in order to carry out our task in the best possible way.
In [3]:
attacks = pd.read_csv('globalterrorismdb_0617dist.csv', encoding='ISO-8859-1')
attacks.head()
Out[3]:
In [5]:
attacks.shape
Out[5]:
We see that there are 135 different columns. We won't use all of them. A complete description of each column can be found here. After some consideration, we made a first selection of the most important features to keep, and we renamed all the columns of interest.
In [6]:
attacks.rename(columns={'iyear': 'Year', 'imonth': 'Month', 'iday': 'Day',
                        'country_txt': 'Country', 'region_txt': 'Region', 'city': 'City',
                        'latitude': 'Latitude', 'longitude': 'Longitude',
                        'attacktype1_txt': 'Attack_type', 'target1': 'Target',
                        'nkill': 'Killed', 'nwound': 'Wounded', 'summary': 'Summary',
                        'gname': 'Group', 'targtype1_txt': 'Target_type',
                        'weaptype1_txt': 'Weapon_type', 'motive': 'Motive'}, inplace=True)
In [7]:
attacks=attacks[['Year','Month','Day','Country','Region','City','Latitude','Longitude','Attack_type','Killed','Wounded','Target','Summary','Group','Target_type','Weapon_type','Motive']]
attacks.head()
Out[7]:
In [8]:
attacks.isnull().sum()
Out[8]:
Let's look at each of the columns with missing information and see what we can do to complete it, if possible:
- City: if we know the coordinates, we can retrieve the name of the city.
- Latitude and Longitude: if we know the city and the country, we can retrieve approximate geographic coordinates.
- Killed and Wounded: we cannot correctly retrieve this information without making wrong assumptions.
- Target, Summary, Motive: these fields are impossible to infer from the data we have, so we leave them as is.

The only fields worth completing for the rest of our analysis are the latitude and longitude coordinates. Indeed, as we will be plotting a world map with all the attacks, it makes sense to be as complete as possible. Knowing the city won't influence our further analysis, and for the last fields, as discussed, we can't reliably retrieve the information.
To retrieve the latitude and longitude coordinates, the Google Maps Geocoding API seems to be the most appropriate tool. In order to minimize the number of requests, we first create a mask to find which coordinates can be recovered.
In [7]:
mask = (attacks['City'].notnull()) & (attacks['City'] != 'Unknown') & (attacks['Longitude'].isnull())
attacks[mask].shape
Out[7]:
We see that 2,503 attacks are affected. Let's see how many we can recover.
In [12]:
gmaps = googlemaps.Client(key='YOUR_API_KEY')  # replace with your own Google Maps API key
In [76]:
def fill_lat_long(row):
    # query the Google Maps Geocoding API with "Country, City"
    location = row['Country'] + ', ' + row['City']
    info = gmaps.geocode(location)
    if info:
        # keep the first (best) match returned by the API
        lat_lon = info[0]['geometry']['location']
        row['Latitude'] = lat_lon['lat']
        row['Longitude'] = lat_lon['lng']
    return row
In [77]:
attacks[mask] = attacks[mask].apply(lambda row: fill_lat_long(row), axis = 1)
attacks.isnull().sum()
Out[77]:
We have been able to retrieve about 1,600 coordinates (64%) using this API. Although it is provided by Google, we see that some cities are still not correctly recognized. It was still worth the try.
In [10]:
# uncomment the line below if you don't have the attacks.csv file yet
#attacks.to_csv('attacks.csv')
In [6]:
attacks = pd.read_csv('attacks.csv', index_col=0)
We developed a Javascript website that allows an interactive experience. The code is available, but it should be run locally.
As we have the coordinates for all the attacks of each year, we can dynamically represent the activity of terrorism on a world map. To do so, we will create an animated plot using the Matplotlib Animation and Basemap libraries.
Essentially, the markers represent the attacks for a specific year. Besides the geographic coordinates, each marker $i$ has a corresponding size $s_i$ and color $c_i$ defined as follows:
$$s_i = 10\,\lvert k_i - k_{\text{median}} \rvert$$

$$c_i = \frac{k_i - k_{\text{min}}}{k_{\text{max}} - k_{\text{min}}}$$

where $k_i$ is the number of people killed in the attack represented by marker $i$, and $k_{\text{median}}$, $k_{\text{min}}$ and $k_{\text{max}}$ are respectively the median, minimum and maximum number of people killed in that year.
Combined, these two indicators give us a much more comprehensive picture of the severity of the attacks, based on the number of people killed.
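As a minimal sketch of how these marker attributes might be computed for one year (the names below, such as marker_attributes and yearly, are illustrative and not part of the project's visualization module):

import numpy as np

def marker_attributes(killed):
    # compute marker sizes and colors for one year's attacks, following the formulas above
    killed = np.nan_to_num(np.asarray(killed, dtype=float))
    k_median = np.median(killed)
    k_min, k_max = killed.min(), killed.max()
    sizes = 10 * np.abs(killed - k_median)
    # guard against a year in which every attack has the same death toll
    colors = (killed - k_min) / (k_max - k_min) if k_max > k_min else np.zeros_like(killed)
    return sizes, colors

# example usage on a single year of the dataset:
# yearly = attacks[attacks['Year'] == 2001]
# sizes, colors = marker_attributes(yearly['Killed'])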
In [4]:
from visualization import world_view
In [7]:
world_view(attacks)
This visualization makes it easy to grasp the geographic context of the terrorist attacks. One can appreciate the severity of the 9/11 attacks and perceive the intense activity in the Middle East over the last few years in particular.
We can also, of course, look at a particular group and see more precisely where it attacked, or the distribution of its attack types. The two following plots show the idea for the Taliban group.
In [15]:
from visualization import zoom_taliban_intensity
zoom_taliban_intensity(attacks)
In [16]:
from visualization import zoom_taliban_attack_types
zoom_taliban_attack_types(attacks)
In [34]:
from plots import *
plot_attack_ferq_per_year(attacks)
We see a clear increase in the number of attacks over the last 5 years. The following plots respectively give us information about: the frequency of attack types, the distribution of targets, the attack frequency by year and region, the distribution of attacks by region, the most affected countries, the top 15 most active terrorist groups, the joint distribution of the attack coordinates, and the relation between the number of people killed and the number of attacks.
In [20]:
plot_attack_types_freq(attacks)
In [21]:
plot_target_distribution(attacks)
In [22]:
plot_attack_freq_by_year_and_region(attacks)
In [25]:
plot_attack_distribution_by_region(attacks)
In [24]:
plot_most_affected_countries(attacks)
In [26]:
plot_top15_most_active_terrorist_groups(attacks)
In [37]:
joint_plot_coordinates(attacks)
In [39]:
killed_num_attacks_relation(attacks)
As they are very self-explanatory, we decided not to comment explicitly on each of these plots. Together they give a good overview of the situation and the context in which terrorist attacks have evolved over the last decades.
As we also have the date of each data point, one can try to do some forecasting, or merge the data with other time series to see if one affects the other. Let's see how we can construct such a time series for informational purposes.
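As a rough sketch, one could build such a series directly with pandas (the create_time_series helper used below lives in time_series.py and may proceed differently; the function name build_attack_time_series here is purely illustrative):

import pandas as pd

def build_attack_time_series(attacks, group=None):
    # count attacks per month, optionally restricted to one terrorist group
    df = attacks.copy()
    if group is not None:
        df = df[df['Group'] == group]
    # some rows have Month or Day equal to 0; clamp them to 1 so a valid date can be formed
    dates = pd.to_datetime({'year': df['Year'],
                            'month': df['Month'].clip(lower=1),
                            'day': df['Day'].clip(lower=1)},
                           errors='coerce')
    df = df.assign(Date=dates).dropna(subset=['Date'])
    return df.set_index('Date').resample('M').size()

# example: monthly number of attacks attributed to the Taliban
# taliban_ts = build_attack_time_series(attacks, group='Taliban')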
In [26]:
from time_series import create_time_series
ts = create_time_series(attacks)
ts.head()
Out[26]:
In [27]:
from time_series import show_subset
show_subset(ts, 'Taliban')
The plot above shows the activity of the Taliban from 2008 to 2011.
In [8]:
exploitation = pd.read_csv('attacks.csv', index_col=0)
exploitation.head()
Out[8]:
In [9]:
exploitation.drop(['Year', 'Latitude', 'Longitude', 'Month','Day', 'Country', 'City', 'Region', 'Target', 'Summary', 'Motive'], inplace = True, axis = 1)
exploitation.columns
Out[9]:
We will perform a Principal Component Analysis (PCA) in order to represent all the attack datapoints in a low dimension that our minds can comprehend.
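As a rough, self-contained sketch of the kind of reduction we have in mind (scikit-learn's PCA applied to a random standardized matrix standing in for the attack features; the project's own pca helper, imported further below from pca.py, may be implemented differently):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # stands in for the standardized attack features
X_2d = PCA(n_components=2).fit_transform(X)  # project onto the first two principal components
print(X_2d.shape)                            # (200, 2): one 2-D point per attack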
Due to the tremendous amount of data, we made a first drastic selection: we focused on the top 10 most active terrorist groups. We then tried with only one group to see if we obtained a better representation, which was indeed the case. Hence, in the following cells, we analyze the attacks of the Taliban group only.
In [10]:
# we will need that to correctly legend the later plots
list_attack_type_unique = exploitation['Attack_type'].unique().tolist()
list_target_type_unique = exploitation['Target_type'].unique().tolist()
list_weapon_type_unique = exploitation['Weapon_type'].unique().tolist()
list_unique = [list_attack_type_unique, list_target_type_unique, list_weapon_type_unique]
In [11]:
# select the Taliban group
features = exploitation[exploitation['Group'] == 'Taliban']
del features['Group']
# delete any row that contain a NaN
features = features.dropna()
# check amount of remaining data
features.shape
Out[11]:
In [12]:
# again we will need these lists for the legends
list_attack_type = features['Attack_type'].tolist()
list_target_type = features['Target_type'].tolist()
list_weapon_type = features['Weapon_type'].tolist()
list_ = [list_attack_type, list_target_type, list_weapon_type]
In [13]:
features.head()
Out[13]:
Another problem that needs to be handled is that we are dealing with both numerical and categorical variables. In order to convert the latter without making wrong assumptions, we use one-hot encoding, already provided by the pandas library through the .get_dummies function. In words, it amounts to representing a categorical variable as a vector of $n$ dimensions, where $n$ is the number of categories of that variable. The vectors representing the categories are then orthogonal to each other, which avoids hazardous mathematical assumptions.
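As a tiny illustration of what this produces (a toy column, not the real data):

import pandas as pd

toy = pd.DataFrame({'Weapon_type': ['Firearms', 'Explosives', 'Firearms']})
# one indicator column per category: Weapon_type_Explosives, Weapon_type_Firearms
print(pd.get_dummies(toy, columns=['Weapon_type']))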
In [14]:
features = pd.get_dummies(features, columns=['Attack_type', 'Target_type', 'Weapon_type'])
features.head()
Out[14]:
In [15]:
features = (features - features.mean()) / features.std()  # standardize each column (zero mean, unit variance)
In [16]:
from pca import pca, plot_killed_wounded_pca, plot_attack_weapon_target_type
features_pca = pca(features)
plot_killed_wounded_pca(features_pca)
plot_attack_weapon_target_type(features_pca, list_unique, list_)
We see that the number of people killed increases as the second component decreases. The number of wounded depends on both components, but the dependence is more difficult to explain. Hence, we can say that the second component sufficiently explains the number of people killed.
The PCA for the different types of data shows interesting facts. In the first two plots, we are able to distinguish the clusters representing each of the categories. We particularly notice the following results:
- Attack_type plot: Bombing/Explosion is very well clustered and is also set apart from the remaining datapoints. Armed Assault is also apart, although it shares some characteristics with Assassination. A portion of Unarmed Assault has its own distinctive location. The remaining categories are grouped in the same fashion, with some dense clustering in places.
- Weapon_type plot: Explosives/Bombs/Dynamite, Firearms and Chemical are the most significant clusters.
- Target_type plot: here we see that PCA fails to explain the target categories, except possibly that, above a specific threshold, the second component clearly identifies Educational Institution.

Coupled with the two previous plots, we can add to our analysis that firearms and explosives are the most murderous weapons. We can also say that there is no obvious type of weapon or attack that corresponds to a particular target, except for the Educational Institution type.
In this manner, we just demonstrated that PCA allows us to represent high-dimensional, inter-correlated information in a much lower dimension with patterns of similarity. But when we want to do some clustering, a more suitable approach is the Laplacian Eigenmaps algorithm. Indeed, where PCA is generally used for linear dimensionality reduction, Laplacian Eigenmaps is a spectral embedding for non-linear dimensionality reduction, well suited to representing clusters. Let's see what we get.
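For reference, a minimal sketch of such a spectral embedding with scikit-learn (the project's se helper, imported in the next cell from se.py, may build the underlying graph Laplacian differently):

import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # stands in for the standardized attack features
embedding = SpectralEmbedding(n_components=2,  # target dimension of the embedding
                              affinity='nearest_neighbors',
                              n_neighbors=10)
X_2d = embedding.fit_transform(X)
print(X_2d.shape)                              # (200, 2)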
In [17]:
from se import se, plot_killed_wounded_se, plot_attack_weapon_target_type
features_se = se(features)
plot_killed_wounded_se(features_se)
plot_attack_weapon_target_type(features_se, list_unique, list_)
First, notice that the numbers of killed and wounded people are treated the same way here, i.e., the high-value datapoints lie approximately in the center. Remember that with PCA the two were not treated in exactly the same way.
Then, with this method, we see a more organized representation. We can clearly distinguish three straight lines, each representing a category, with a condensed area where these three lines join. In the first two plots, we can clearly see a link, as the variables they represent are, in a way, related: if we look at the Explosives/Bombs/Dynamite weapon type, we see without surprise that it was used in Bombing/Explosion attacks, but also in Assassination cases and some Armed Assault raids. For the last plot, we saw with PCA that it was difficult to read any information except for the Educational Institution target type. However, with the spectral embedding method we see that the Police target type is heavily represented at the end of each straight line, which was not the case with PCA. The remaining target types are rather mixed in the center.
In this way, we saw that, due to the different conceptual approaches of the two techniques, they lead to distinct representations. In the context of this project, spectral embedding with the Laplacian Eigenmaps algorithm produces the better clustering representation. Nevertheless, both techniques allow us to draw interesting conclusions.
In this analysis we will try to cluster our data. In particular, we will remove the group label and see if clustering allows us to recover the groups. This will tell us whether there are group-specific characteristics and whether the groups naturally partition our dataset.
Again, we will drop features that might not give insightful information for the next analysis.
In [137]:
exploitation = pd.read_csv('attacks.csv', index_col=0)
In [138]:
exploitation['Killed'] = exploitation['Killed'].fillna(0).astype(int)
exploitation['Wounded'] = exploitation['Wounded'].fillna(0).astype(int)
In [139]:
exploitation.head()
Out[139]:
In [140]:
exploitation.drop(['Latitude', 'Longitude', 'Month','Day', 'Country', 'Target', 'Summary', 'Motive'], inplace = True, axis = 1)
exploitation.columns
Out[140]:
We will add a new feature isCapital, which is 1 if the city is a capital and 0 otherwise. Since there is no Python module for this, we manually record the capital of each country appearing in our list of countries.
In [141]:
# List of capitals based on the list of countries we have (note: some names use French spellings)
Capitals = ['bagdad', 'new delhi', 'mogadiscio', 'abuja', 'manille', 'ankara','islamabad', 'sanaa', 'damas', 'nairobi', 'yaoundé', 'niamey', 'beyrouth', 'ndjaména', 'riyad', 'paris', 'kaboul', 'tunis', 'amman', 'berlin', 'kampala', 'addis-abeba', 'le caire', 'bruxelles', 'manama', 'amsterdam', 'jerusalem', 'tripoli', 'katmandou', 'london', 'moscou', 'dodoma', 'abou dabi', 'ouagadougou', 'new york', 'tbilisi', 'djibouti', 'jakarta', 'kuala lumpur', 'vienne','bangkok']
In order to match the capital and city names, we homogenize both of them. First, let's convert the city names to lower case.
In [142]:
exploitation['City'] = exploitation['City'].apply(lambda x: str(x).lower())
Now, we will use .get_close_matches from the difflib library to see whether some capitals are not recognized because of spelling differences.
In [47]:
forgotten_capital = []
In [48]:
for city in exploitation.City:
    # does this city name closely resemble one of the capital names?
    matches = di.get_close_matches(city, Capitals, cutoff=0.7)
    if len(matches) > 0 and city not in Capitals and city not in forgotten_capital:
        # close to a capital but spelled differently: keep it for manual inspection
        forgotten_capital.append(city)
In [51]:
forgotten_capital
Out[51]:
We check manually which of these correspond to capitals and which correspond to other cities, and remove the latter. Then we concatenate the two lists.
In [70]:
for el in ['laman', 'baga', 'kalampur']:
forgotten_capital.remove(el)
In [143]:
Capitals = np.concatenate((Capitals,forgotten_capital), axis = 0)
Now, we apply a lambda function over the dataset to know if a city is a capital.
In [144]:
exploitation['isCapital'] = exploitation['City'].apply(lambda x: x in Capitals)
exploitation['isCapital'] = exploitation['isCapital'].astype(int)
Due to the tremendous amount of data, we made a first drastic selection: we focused on the top 5 most active terrorist groups since 2010.
In [145]:
exploitation = exploitation[exploitation['Year'] > 2009]
# find the top 5 groups (the first entry is 'Unknown', so we select groups 1 to 6)
top_5 = exploitation['Group'].value_counts()[1:6].index.values
exploitation['top_5'] = exploitation['Group'].apply(lambda x: (x in top_5))
exploitation = exploitation[exploitation['top_5'] == True]
As before, we use one-hot encoding for the categorical features Region, Attack_type, Target_type and Weapon_type.
In [146]:
exploitation = pd.get_dummies(exploitation, columns=['Region', 'Attack_type', 'Target_type', 'Weapon_type'])
exploitation.head()
Out[146]:
Now we can drop features that we don't need for our analysis. We just save the group labels to compare them with the results at the end of our analysis.
In [147]:
label = exploitation.Group
In [150]:
exploitation.drop(['Year', 'City', 'Group', 'top_5'], axis = 1, inplace = True)
Then we normalize our features as before.
In [151]:
exploitation = (exploitation-exploitation.mean())/exploitation.std()
Let's try the K-Means clustering algorithm. First, we will try to find the best k for our data. To do this, we try a range of values of k and then graphically analyse the result. We decide to try k between 1 and 20.
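A minimal sketch of what such a search could look like, plotting the K-Means inertia for each k (the project's best_k helper, imported below from clustering.py, may plot a different criterion):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(data, k_min=1, k_max=20):
    # fit K-Means for each k and plot the within-cluster sum of squares (inertia)
    ks = range(k_min, k_max + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_ for k in ks]
    plt.plot(list(ks), inertias, marker='o')
    plt.xlabel('k (number of clusters)')
    plt.ylabel('inertia')
    plt.show()

# example usage:
# elbow_plot(exploitation, 1, 20)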
In [123]:
from clustering import *
In [104]:
best_k(1, 20, exploitation)
We expected a discontinuity (an "elbow") in the shape of the curve, which would indicate a natural way to cluster our dataset with K-Means, but none appears. This result can be explained by the nature of the algorithm: K-Means assumes spherical clusters, with each cluster containing roughly the same number of observations.
We have no way to know whether our data forms spherical clusters, but let's see if all the groups have roughly the same number of observations.
In [172]:
sns.countplot(label)
plt.xticks(rotation=90)
plt.show()
We see that the groups do not have the same number of observations, which may explain our result.
One can think of Gaussian Mixture Models (GMM) as generalizing K-Means clustering to incorporate information about the covariance structure of the data. Let's see if we can reach a better result.
Here we use the BIC criterion in order to select the number of components in an efficient way.
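For reference, a possible sketch of BIC-based model selection with scikit-learn's GaussianMixture (the project's best_k_gmm helper may differ):

import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

def bic_plot(data, k_min=1, k_max=20):
    # fit a Gaussian mixture for each number of components and plot the BIC
    ks = range(k_min, k_max + 1)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(data).bic(data) for k in ks]
    plt.plot(list(ks), bics, marker='o')
    plt.xlabel('number of components')
    plt.ylabel('BIC (lower is better)')
    plt.show()

# example usage:
# bic_plot(exploitation.values, 1, 20)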
In [126]:
best_k_gmm(1, 20, exploitation)
Here we can clearly identify a discontinuity in the shape of the curve, which suggests that the best clustering is obtained with k around 5. This is interesting because we decided to focus only on the 5 most active groups. Let's focus our analysis on the GMM clustering with k = 5.
Let's now see whether the GMM clustering with k = 5 is close to the group labels.
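One simple way to quantify this is a contingency table of group labels against cluster assignments; a minimal sketch (compare_clusters is an illustrative helper, not part of the project's clustering module):

import numpy as np
import pandas as pd

def compare_clusters(labels, clusters):
    # contingency table of true group labels vs. cluster assignments
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    return pd.crosstab(pd.Series(labels, name='Group'), pd.Series(clusters, name='Cluster'))

# example usage once the prediction below has been computed:
# compare_clusters(label, prediction)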
In [136]:
prediction = prediction_kmm(5, exploitation)
In [189]:
plot_prediction(label, prediction)
As we can see, the clustering did not recover the original groups. Even if the number of clusters is around 5, it is not the group label that dominated the clustering. A combination of different factors, for example the number of victims or the attack types as well as the group, probably explains this clustering.
All along the Data Science pipeline, we found that there are multiple perspectives from which we can look at the data. Each one proposes a different approach and yields different results. In particular, we found that:
- The dimensionality-reduction analysis singled out the Educational Institution target type. Indeed, it showed that the main Educational Institution attacks were carried out using Chemical material. Hence, looking at the coordinates of these attacks, one can take the necessary measures to protect such facilities, if deemed necessary. While developing this notebook, we also noticed that integrating the coordinates into the PCA did not improve the clustering, meaning that no correlation can be drawn between the coordinates and the remaining variables.
- We restricted our attention to the Taliban group, mainly because it contains a large number of datapoints, but any other group which is sufficiently represented could be chosen as well. We could then compare groups to see whether the conclusions we draw are specific to one group or shared across groups.