Understanding terrorist attacks

EE-558 A Network Tour of Data Science, EPFL

Joël M. Fonseca, Nelson Antunes, Enguerrand Granoux and Hedi Driss

The aim of this project is to understand terrorist attacks from different perspectives. The data scientist's toolkit provides a range of methods that help highlight the main characteristics of the data under investigation. The project is structured as follows:

  • I. Data Acquisition

    In this part, the goal is to understand the data. What information do we have? How is it represented? Can missing values be recovered without making wrong assumptions? Which variables are relevant for this project? These are the kinds of questions we try to answer.

  • II. Data Exploration:

    Here, we bring the data to life and let it express itself. We present a map that dynamically illustrates the evolution of all the terrorist attacks available in the database. We also plot some basic facts, such as the most affected countries and the most frequently used weapon types. Finally, we propose alternative directions that could be developed further, such as time series.

  • III. Data Exploitation:

    In this section, we use mathematics to represent the data at a more abstract level, in order to find hidden structure in the attacks. In particular, Principal Component Analysis (PCA) is applied to a specific terrorist group to highlight patterns of similarity. Then, a spectral embedding with the Laplacian Eigenmaps algorithm is applied in order to compare the two techniques. In a second part, we remove the group labels and see whether we can recover them with different clustering algorithms.

  • IV. Conclusion:

    Finally, we summarize the conclusions drawn from the results found in the two previous sections.


In [2]:
# the basics
import numpy as np
import pandas as pd
import seaborn as sns

# useful libraries
import googlemaps

# for visualization
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

# To match name
import difflib as di

# Clustering part
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GMM  # note: newer scikit-learn versions use GaussianMixture instead

# avoid warnings
import warnings
warnings.filterwarnings('ignore')

# automatically reloads the different .py's before running a cell
%load_ext autoreload
%autoreload 2



I. Data Acquisition

In this project, we will work with the Global Terrorism Database, an open-source database provided by the START consortium that contains information about more than 170,000 terrorist attacks around the world from 1970 through 2016. It can be downloaded here. Make sure you download it before running this notebook.

Let's first of all make the data usable for our analysis.

Load and clean the data

We will first analyze the database to understand its content and then restrict ourselves to the most significant features in order to carry out our task in the best possible way.


In [3]:
attacks=pd.read_csv('globalterrorismdb_0617dist.csv',encoding='ISO-8859-1')
attacks.head()


Out[3]:
eventid iyear imonth iday approxdate extended resolution country country_txt region ... addnotes scite1 scite2 scite3 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY related
0 197000000001 1970 7 2 NaN 0 NaN 58 Dominican Republic 2 ... NaN NaN NaN NaN PGIS 0 0 0 0 NaN
1 197000000002 1970 0 0 NaN 0 NaN 130 Mexico 1 ... NaN NaN NaN NaN PGIS 0 1 1 1 NaN
2 197001000001 1970 1 0 NaN 0 NaN 160 Philippines 5 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
3 197001000002 1970 1 0 NaN 0 NaN 78 Greece 8 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
4 197001000003 1970 1 0 NaN 0 NaN 101 Japan 4 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN

5 rows × 135 columns


In [5]:
attacks.shape


Out[5]:
(170350, 135)

We see that there are 135 different columns. We won't be using all of them. A complete description of each column can be found here. After some thought, we made a first selection of the most important features to consider, and we renamed the columns of interest.


In [6]:
attacks.rename(columns={'iyear': 'Year', 'imonth': 'Month', 'iday': 'Day',
                        'country_txt': 'Country', 'region_txt': 'Region', 'city': 'City',
                        'latitude': 'Latitude', 'longitude': 'Longitude',
                        'attacktype1_txt': 'Attack_type', 'target1': 'Target',
                        'nkill': 'Killed', 'nwound': 'Wounded', 'summary': 'Summary',
                        'gname': 'Group', 'targtype1_txt': 'Target_type',
                        'weaptype1_txt': 'Weapon_type', 'motive': 'Motive'},
               inplace=True)

In [7]:
attacks = attacks[['Year', 'Month', 'Day', 'Country', 'Region', 'City', 'Latitude', 'Longitude',
                   'Attack_type', 'Killed', 'Wounded', 'Target', 'Summary', 'Group',
                   'Target_type', 'Weapon_type', 'Motive']]
attacks.head()


Out[7]:
Year Month Day Country Region City Latitude Longitude Attack_type Killed Wounded Target Summary Group Target_type Weapon_type Motive
0 1970 7 2 Dominican Republic Central America & Caribbean Santo Domingo 18.456792 -69.951164 Assassination 1.0 0.0 Julio Guzman NaN MANO-D Private Citizens & Property Unknown NaN
1 1970 0 0 Mexico North America Mexico city 19.432608 -99.133207 Hostage Taking (Kidnapping) 0.0 0.0 Nadine Chaval, daughter NaN 23rd of September Communist League Government (Diplomatic) Unknown NaN
2 1970 1 0 Philippines Southeast Asia Unknown 15.478598 120.599741 Assassination 1.0 0.0 Employee NaN Unknown Journalists & Media Unknown NaN
3 1970 1 0 Greece Western Europe Athens 37.983773 23.728157 Bombing/Explosion NaN NaN U.S. Embassy NaN Unknown Government (Diplomatic) Explosives/Bombs/Dynamite NaN
4 1970 1 0 Japan East Asia Fukouka 33.580412 130.396361 Facility/Infrastructure Attack NaN NaN U.S. Consulate NaN Unknown Government (Diplomatic) Incendiary NaN

In [8]:
attacks.isnull().sum()


Out[8]:
Year                0
Month               0
Day                 0
Country             0
Region              0
City              446
Latitude         4606
Longitude        4606
Attack_type         0
Killed           9682
Wounded         15325
Target            634
Summary         66138
Group               0
Target_type         0
Weapon_type         0
Motive         121764
dtype: int64

Let's look at each column with missing information and see what we can do to complete it, if possible:

  • City: if we know the coordinates we can retrieve the name of the city.
  • Latitude and Longitude: if we know the city and the country, we can retrieve approximate geographic coordinates.
  • Killed and Wounded: we cannot correctly retrieve this information without making wrong assumptions.
  • Target, Summary, Motive: these fields are impossible to infer from the data we have, so we leave them as is.

The only fields worth completing for the rest of our analysis are the latitude and longitude coordinates. Indeed, since we will be plotting a world map with all the attacks, it makes sense to be as complete as possible. Knowing the city won't influence our further analysis and, for the last fields, as discussed, we can't reliably retrieve the information.

Using Google Maps Geocoding API

To retrieve the latitude and longitude coordinates, the Google Maps Geocoding API seems to be the most appropriate tool. In order to minimize the number of requests, we first create a mask to find which coordinates can be recovered.


In [7]:
mask = (attacks['City'].notnull()) & (attacks['City'] != 'Unknown') & (attacks['Longitude'].isnull())
attacks[mask].shape


Out[7]:
(2503, 17)

We see that 2,503 attacks have a known city but missing coordinates. Let's see how many we can recover.


In [12]:
gmaps = googlemaps.Client(key='YOUR_API_KEY')  # replace with your own Google Maps API key

In [76]:
def fill_lat_long(row):
    # query the Geocoding API with "Country, City"; if a result is returned,
    # fill in the missing latitude and longitude for this row
    location = row['Country'] + ', ' + row['City']
    info = gmaps.geocode(location)
    if info:
        lat_lon = info[0]['geometry']['location']
        row['Latitude'] = lat_lon['lat']
        row['Longitude'] = lat_lon['lng']
    return row

In [77]:
attacks[mask] = attacks[mask].apply(lambda row: fill_lat_long(row), axis = 1)
attacks.isnull().sum()


Out[77]:
Year                0
Month               0
Day                 0
Country             0
Region              0
City              446
Latitude         3094
Longitude        3094
Attack_type         0
Killed           9682
Wounded         15325
Target            634
Summary         66138
Group               0
Target_type         0
Weapon_type         0
Motive         121764
dtype: int64

Based on the outputs above, we were able to retrieve about 1,500 coordinates (roughly 60% of the 2,503 candidates) using this API. Although it is provided by Google, some cities are still not correctly recognized. It was still worth the try.


In [10]:
# uncomment the line below if you don't have the attacks.csv file yet
#attacks.to_csv('attacks.csv')

In [6]:
attacks = pd.read_csv('attacks.csv', index_col=0)

II. Data Exploration

Javascript website

We developed a JavaScript website that provides an interactive experience. The code is available, but it should be run locally.

World view

As we have the coordinates for all the attacks of each year, we can dynamically represent the activity of terrorism on a world map. To do so, we will create an animated plot using the Matplotlib Animation and Basemap libraries.

Essentially, the markers represent the attacks for a specific year. Besides the geographic coordinates, each marker $i$ has a corresponding size $s_i$ and color $c_i$ defined as follows:

$$s_i = 10\,\lvert k_i - k_{\text{median}}\rvert$$

$$c_i = \frac{k_i - k_{\text{min}}}{k_{\text{max}} - k_{\text{min}}}$$

where $k_i$ corresponds to the number of killed people for the attack represented by the marker $i$ and $k_{\text{median}}$, $k_{\text{min}}$ and $k_{\text{max}}$ correspond respectively to the median, the min and the max number of killed people for the specific year.

These two indicators combined will give us a much more comprehensive situation of the gravity of the attacks based on the number of killed people.
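
The plotting itself is done in visualization.py, which is not reproduced in the notebook; a minimal sketch of how the per-year marker sizes and colors could be computed from the Killed column (the function name and usage below are illustrative assumptions, not the module's actual code) is:

import numpy as np

def marker_size_color(killed):
    # number of people killed for each attack of a given year (NaN treated as 0)
    k = np.nan_to_num(np.asarray(killed, dtype=float))
    # size grows with the distance to the median death toll of that year
    sizes = 10 * np.abs(k - np.median(k))
    k_min, k_max = k.min(), k.max()
    # min-max scaling for the colors; guard against a year with a constant death toll
    colors = (k - k_min) / (k_max - k_min) if k_max > k_min else np.zeros_like(k)
    return sizes, colors

# hypothetical usage for a single year:
# sizes, colors = marker_size_color(attacks.loc[attacks['Year'] == 2001, 'Killed'])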


In [4]:
from visualization import world_view

In [7]:
world_view(attacks)


This visualization makes the geographic context of the terrorist attacks easy to grasp. One can appreciate the severity of the 9/11 attacks and notice the particularly intense activity in the Middle East over the last few years.

We can of course also look at a particular group and see more precisely where it attacked, or even the distribution of its attack types. The two following plots illustrate the idea for the Taliban group.


In [15]:
from visualization import zoom_taliban_intensity
zoom_taliban_intensity(attacks)



In [16]:
from visualization import zoom_taliban_attack_types
zoom_taliban_attack_types(attacks)


Basic facts

We can further explore our data by analyzing the number of attacks per year.


In [34]:
from plots import *
plot_attack_ferq_per_year(attacks)


We see a clear increase in the number of attacks over the last five years. The following plots respectively give us information about:

  • the attacking methods used by the terrorists,
  • the attack frequency for each type of target,
  • the activity of terrorist attacks by year for each region,
  • the distribution of attack types for each region,
  • the most affected countries and, finally,
  • the most active terrorist groups.

In [20]:
plot_attack_types_freq(attacks)



In [21]:
plot_target_distribution(attacks)



In [22]:
plot_attack_freq_by_year_and_region(attacks)



In [25]:
plot_attack_distribution_by_region(attacks)



In [24]:
plot_most_affected_countries(attacks)



In [26]:
plot_top15_most_active_terrorist_groups(attacks)



In [37]:
joint_plot_coordinates(attacks)



In [39]:
killed_num_attacks_relation(attacks)


As they are largely self-explanatory, we decided not to comment on each of these plots explicitly. Together they give a good overview of the situation and of the context in which terrorist attacks have evolved over the last decades.

Time series

As we also have the date of each data point, one can try some forecasting or merge this series with other time series to see whether one affects the other. Let's see how we can construct such a time series for informational purposes.
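
create_time_series lives in time_series.py and is not shown here; as a rough idea of how such a series could be assembled (a sketch under the assumption that we simply combine Year/Month/Day into a timestamp and drop rows with an unknown month or day), consider:

import pandas as pd

def create_time_series_sketch(attacks):
    # months or days recorded as 0 are unknown; errors='coerce' turns them into NaT
    dates = pd.to_datetime(
        attacks[['Year', 'Month', 'Day']].rename(
            columns={'Year': 'year', 'Month': 'month', 'Day': 'day'}),
        errors='coerce')
    ts = attacks.assign(DateTime=dates).dropna(subset=['DateTime'])
    cols = ['DateTime', 'Country', 'City', 'Latitude', 'Longitude',
            'Attack_type', 'Killed', 'Wounded', 'Target', 'Group']
    return ts[cols]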


In [26]:
from time_series import create_time_series
ts = create_time_series(attacks)
ts.head()


Out[26]:
DateTime Country City Latitude Longitude Attack_type Killed Wounded Target Group
0 1970-07-02 Dominican Republic Santo Domingo 18.456792 -69.951164 Assassination 1.0 0.0 Julio Guzman MANO-D
5 1970-01-01 United States Cairo 37.005105 -89.176269 Armed Assault 0.0 0.0 Cairo Police Headquarters Black Nationalists
6 1970-01-02 Uruguay Montevideo -34.891151 -56.187214 Assassination 0.0 0.0 Juan Maria de Lucah/Chief of Directorate of in... Tupamaros (Uruguay)
7 1970-01-02 United States Oakland 37.805065 -122.273024 Bombing/Explosion 0.0 0.0 Edes Substation Unknown
8 1970-01-02 United States Madison 43.076592 -89.412488 Facility/Infrastructure Attack 0.0 0.0 R.O.T.C. offices at University of Wisconsin, M... New Year's Gang

In [27]:
from time_series import show_subset
show_subset(ts, 'Taliban')


The plot above shows the activity of the Taliban from 2008 to 2011.

III. Data Exploitation

First Analysis

Feature Selection

We will drop features that might not give insightful information for the next analysis.


In [8]:
exploitation = pd.read_csv('attacks.csv', index_col=0)
exploitation.head()


Out[8]:
Year Month Day Country Region City Latitude Longitude Attack_type Killed Wounded Target Summary Group Target_type Weapon_type Motive
0 1970 7 2 Dominican Republic Central America & Caribbean Santo Domingo 18.456792 -69.951164 Assassination 1.0 0.0 Julio Guzman NaN MANO-D Private Citizens & Property Unknown NaN
1 1970 0 0 Mexico North America Mexico city 19.432608 -99.133207 Hostage Taking (Kidnapping) 0.0 0.0 Nadine Chaval, daughter NaN 23rd of September Communist League Government (Diplomatic) Unknown NaN
2 1970 1 0 Philippines Southeast Asia Unknown 15.478598 120.599741 Assassination 1.0 0.0 Employee NaN Unknown Journalists & Media Unknown NaN
3 1970 1 0 Greece Western Europe Athens 37.983773 23.728157 Bombing/Explosion NaN NaN U.S. Embassy NaN Unknown Government (Diplomatic) Explosives/Bombs/Dynamite NaN
4 1970 1 0 Japan East Asia Fukouka 33.580412 130.396361 Facility/Infrastructure Attack NaN NaN U.S. Consulate NaN Unknown Government (Diplomatic) Incendiary NaN

In [9]:
exploitation.drop(['Year', 'Latitude', 'Longitude', 'Month','Day', 'Country', 'City', 'Region', 'Target', 'Summary', 'Motive'], inplace = True, axis = 1)
exploitation.columns


Out[9]:
Index(['Attack_type', 'Killed', 'Wounded', 'Group', 'Target_type',
       'Weapon_type'],
      dtype='object')

PCA

We will perform a Principal Component Analysis (PCA) in order to represent all the attack data points in a dimension low enough for us to comprehend.

Due to the tremendous amount of data, we made a first drastic selection of the data: we focused on the top 10 most active terrorist groups. We then tried a single group to see whether we obtained a better representation, which was indeed the case. Hence, in the following cells, we analyze the attacks of the Taliban group only.


In [10]:
# we will need that to correctly legend the later plots
list_attack_type_unique = exploitation['Attack_type'].unique().tolist()
list_target_type_unique = exploitation['Target_type'].unique().tolist()
list_weapon_type_unique = exploitation['Weapon_type'].unique().tolist()
list_unique = [list_attack_type_unique, list_target_type_unique, list_weapon_type_unique]

In [11]:
# select the Taliban group
features = exploitation[exploitation['Group'] == 'Taliban']
del features['Group']
# delete any row that contain a NaN
features = features.dropna()
# check amount of remaining data
features.shape


Out[11]:
(6081, 5)

In [12]:
# again we will need these lists for the legends
list_attack_type = features['Attack_type'].tolist()
list_target_type = features['Target_type'].tolist()
list_weapon_type = features['Weapon_type'].tolist()
list_ = [list_attack_type, list_target_type, list_weapon_type]

In [13]:
features.head()


Out[13]:
Attack_type Killed Wounded Target_type Weapon_type
58969 Hostage Taking (Kidnapping) 0.0 0.0 Police Unknown
59943 Hijacking 0.0 0.0 Airports & Aircraft Explosives/Bombs/Dynamite
60391 Hijacking 0.0 0.0 Airports & Aircraft Explosives/Bombs/Dynamite
60715 Bombing/Explosion 0.0 0.0 Airports & Aircraft Explosives/Bombs/Dynamite
73044 Bombing/Explosion 3.0 1.0 Government (General) Explosives/Bombs/Dynamite

One-hot encoding

Another problem that needs to be handled is that we are dealing with both numerical and categorical variables. In order to convert the latter without making wrong assumptions, we use one-hot encoding, which is already provided by the pandas library through the .get_dummies function. In words, it amounts to representing a categorical variable as a vector of $n$ dimensions, where $n$ is the number of categories available for that variable. All vectors representing a category are then orthogonal to each other, thereby avoiding hazardous mathematical assumptions.


In [14]:
features = pd.get_dummies(features, columns=['Attack_type', 'Target_type', 'Weapon_type'])
features.head()


Out[14]:
Killed Wounded Attack_type_Armed Assault Attack_type_Assassination Attack_type_Bombing/Explosion Attack_type_Facility/Infrastructure Attack Attack_type_Hijacking Attack_type_Hostage Taking (Barricade Incident) Attack_type_Hostage Taking (Kidnapping) Attack_type_Unarmed Assault ... Target_type_Violent Political Party Weapon_type_Chemical Weapon_type_Explosives/Bombs/Dynamite Weapon_type_Firearms Weapon_type_Incendiary Weapon_type_Melee Weapon_type_Other Weapon_type_Sabotage Equipment Weapon_type_Unknown Weapon_type_Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)
58969 0.0 0.0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0
59943 0.0 0.0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
60391 0.0 0.0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
60715 0.0 0.0 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
73044 3.0 1.0 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 40 columns

Feature normalization

It is also always better to normalize the data so that all features have the same importance.


In [15]:
features =(features-features.mean())/features.std()

In [16]:
from pca import pca, plot_killed_wounded_pca, plot_attack_weapon_target_type

features_pca = pca(features)
plot_killed_wounded_pca(features_pca)
plot_attack_weapon_target_type(features_pca, list_unique, list_)
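
The pca() helper comes from pca.py and is not shown in the notebook; a minimal sketch of what such a projection could look like, assuming a plain two-component PCA from scikit-learn applied to the one-hot encoded and normalized features, is:

from sklearn.decomposition import PCA

def pca_sketch(features, n_components=2):
    # project the normalized feature matrix onto its first principal components
    return PCA(n_components=n_components).fit_transform(features.values)

# hypothetical usage: the two columns of pca_sketch(features) would correspond
# to the horizontal and vertical axes of the plots below.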


We see that the number of people killed increases as the second component decreases. The number of people wounded depends on both components, but the dependence is harder to explain. Hence, we can say that the second component sufficiently explains the number of people killed.

The PCA for the different types of data reveals interesting facts. In the first two plots, we are able to distinguish the clusters representing each of the categories. We particularly notice the following results:

  • for the attack_type plot: Bombing/Explosion is very well clustered and clearly separated from the remaining data points. Armed Assault is also set apart, although it shares some characteristics with Assassination. A portion of Unarmed Assault has its own distinctive location. The remaining categories are grouped together, with some denser clustering in places.
  • for the weapon_type plot: Explosives/Bombs/Dynamite, Firearms and Chemical form the most significant clusters.
  • for the target_type plot: here we see that PCA fails to explain the target categories, except that above a specific threshold the second component clearly identifies Educational Institution.

Coupled with the two previous plots, we can add that firearms and explosives are the deadliest weapons. We can also say that no obvious weapon or attack type corresponds to a particular target, except for the Educational Institution type.

In this manner, we have shown that PCA can represent high-dimensional, inter-correlated information in a much lower dimension while preserving patterns of similarity. However, when we want to do some clustering, a more suitable approach is the Laplacian Eigenmaps algorithm. Indeed, where PCA is generally used for linear dimensionality reduction, Laplacian Eigenmaps is a spectral embedding for non-linear dimensionality reduction that is well suited to representing clusters. Let's see what we get.

Laplacian Eigenmaps


In [17]:
from se import se, plot_killed_wounded_se, plot_attack_weapon_target_type

features_se = se(features)
plot_killed_wounded_se(features_se)
plot_attack_weapon_target_type(features_se, list_unique, list_)
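
Similarly, se() is defined in se.py and not reproduced here; a minimal sketch of a Laplacian Eigenmaps embedding, under the assumption that it relies on scikit-learn's SpectralEmbedding, could be:

from sklearn.manifold import SpectralEmbedding

def se_sketch(features, n_components=2, n_neighbors=10):
    # spectral embedding (Laplacian Eigenmaps) on a nearest-neighbour graph
    model = SpectralEmbedding(n_components=n_components,
                              affinity='nearest_neighbors',
                              n_neighbors=n_neighbors)
    return model.fit_transform(features.values)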


First, notice that the numbers of killed and wounded people are treated the same way here, i.e., the high-value data points lie approximately in the center. Remember that with PCA the two were not treated in exactly the same way.

Then, with this method, we see a more organized representation. We can clearly distinguish three straight lines, each representing a category, with a denser area where the three lines join. In the first two plots we can clearly see a link, as the variables they represent are related: if we look at the Explosives/Bombs/Dynamite weapon type, we see without surprise that it was used in Bombing/Explosion attacks, but also in Assassination cases and some Armed Assault raids. For the last plot, we saw with PCA that it was difficult to read any information except for the Educational Institution target type. With the spectral embedding method, however, we see that the Police target type is heavily represented at the end of each straight line, which was not the case with PCA. The remaining target types are rather mixed in the center.

We thus saw that, due to their different conceptual approaches, the two methods lead to distinct representations. In the context of this project, spectral embedding with the Laplacian Eigenmaps algorithm gives a better clustering representation. Nevertheless, both techniques allow us to draw interesting conclusions.

Second analysis

In this analysis we try to cluster our data. In particular, we remove the group label and see whether the clustering allows us to recover the groups. This tells us whether groups have distinctive characteristics and whether they naturally cluster our dataset.

Feature selection

Again, we will drop features that might not give insightful information for the next analysis.


In [137]:
exploitation = pd.read_csv('attacks.csv', index_col=0)

In [138]:
exploitation['Killed'] = exploitation['Killed'].fillna(0).astype(int)
exploitation['Wounded'] = exploitation['Wounded'].fillna(0).astype(int)

In [139]:
exploitation.head()


Out[139]:
Year Month Day Country Region City Latitude Longitude Attack_type Killed Wounded Target Summary Group Target_type Weapon_type Motive
0 1970 7 2 Dominican Republic Central America & Caribbean Santo Domingo 18.456792 -69.951164 Assassination 1 0 Julio Guzman NaN MANO-D Private Citizens & Property Unknown NaN
1 1970 0 0 Mexico North America Mexico city 19.432608 -99.133207 Hostage Taking (Kidnapping) 0 0 Nadine Chaval, daughter NaN 23rd of September Communist League Government (Diplomatic) Unknown NaN
2 1970 1 0 Philippines Southeast Asia Unknown 15.478598 120.599741 Assassination 1 0 Employee NaN Unknown Journalists & Media Unknown NaN
3 1970 1 0 Greece Western Europe Athens 37.983773 23.728157 Bombing/Explosion 0 0 U.S. Embassy NaN Unknown Government (Diplomatic) Explosives/Bombs/Dynamite NaN
4 1970 1 0 Japan East Asia Fukouka 33.580412 130.396361 Facility/Infrastructure Attack 0 0 U.S. Consulate NaN Unknown Government (Diplomatic) Incendiary NaN

In [140]:
exploitation.drop(['Latitude', 'Longitude', 'Month','Day', 'Country', 'Target', 'Summary', 'Motive'], inplace = True, axis = 1)
exploitation.columns


Out[140]:
Index(['Year', 'Region', 'City', 'Attack_type', 'Killed', 'Wounded', 'Group',
       'Target_type', 'Weapon_type'],
      dtype='object')

Feature engineering

We will add a new feature isCapital, which is 1 if the city is a capital and 0 otherwise. As there is no Python module for this, we manually list the capital of each country appearing in our data.


In [141]:
# list of capitals based on the countries in our data (spellings as in the source list)
Capitals = ['bagdad', 'new delhi', 'mogadiscio', 'abuja', 'manille', 'ankara','islamabad', 'sanaa', 'damas', 'nairobi', 'yaoundé', 'niamey', 'beyrouth', 'ndjaména', 'riyad', 'paris', 'kaboul', 'tunis', 'amman', 'berlin', 'kampala', 'addis-abeba', 'le caire', 'bruxelles', 'manama', 'amsterdam', 'jerusalem', 'tripoli', 'katmandou', 'london', 'moscou', 'dodoma', 'abou dabi', 'ouagadougou', 'new york', 'tbilisi', 'djibouti', 'jakarta', 'kuala lumpur', 'vienne','bangkok']

In order to match the capital and city names, we homogenize both. First, let's convert the city names to lower case.


In [142]:
exploitation['City'] = exploitation['City'].apply(lambda x: str(x).lower())

Now, we use .get_close_matches from the difflib library to see whether some capitals are not recognized due to spelling differences.


In [47]:
forgotten_capital = []

In [48]:
for city in exploitation.City:
    # capitals from our list whose spelling is close to this city name
    tmp = di.get_close_matches(city, Capitals, cutoff=0.7)
    # the matches themselves are already in Capitals; what we want to collect is the
    # city spelling that almost matched but is not in the list yet
    if len(tmp) > 0 and city not in Capitals and city not in forgotten_capital:
        forgotten_capital.append(city)

In [51]:
forgotten_capital


Out[51]:
['laman',
 'baghdad',
 'kabul',
 'baga',
 'riyadh',
 'kalampur',
 "n'djamena",
 'moscow']

We manually check which of these are alternative spellings of capitals and which are other cities, remove the latter, and then concatenate the two lists.


In [70]:
for el in ['laman', 'baga', 'kalampur']:
    forgotten_capital.remove(el)

In [143]:
Capitals = np.concatenate((Capitals,forgotten_capital), axis = 0)

Now, we apply a lambda function over the dataset to know if a city is a capital.


In [144]:
exploitation['isCapital'] = exploitation['City'].apply(lambda x: x in Capitals)
exploitation['isCapital'] = exploitation['isCapital'].astype(int)

Data Selection

Due to the tremendous amount of data, we made a first drastic selection of the data: we focused on the top 5 most active terrorist groups since 2010.


In [145]:
exploitation = exploitation[exploitation['Year'] > 2009]
# find the top 5 groups (the most frequent label is Unknown, so we select groups 1 to 6)
top_5 = exploitation['Group'].value_counts()[1:6].index.values

exploitation['top_5'] = exploitation['Group'].apply(lambda x: (x in top_5))
exploitation = exploitation[exploitation['top_5'] == True]

One-hot encoding

As before, we use one-hot encoding for the categorical features Attack_type, Region, Target_type and Weapon_type.


In [146]:
exploitation = pd.get_dummies(exploitation, columns=['Region', 'Attack_type', 'Target_type', 'Weapon_type'])
exploitation.head()


Out[146]:
Year City Killed Wounded Group isCapital top_5 Region_Central Asia Region_Eastern Europe Region_Middle East & North Africa ... Target_type_Violent Political Party Weapon_type_Chemical Weapon_type_Explosives/Bombs/Dynamite Weapon_type_Firearms Weapon_type_Incendiary Weapon_type_Melee Weapon_type_Other Weapon_type_Sabotage Equipment Weapon_type_Unknown Weapon_type_Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)
94760 2010 dwah mande 4 0 Taliban 0 True 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
94789 2010 salpatra 1 0 Communist Party of India - Maoist (CPI-Maoist) 0 True 0 0 0 ... 1 0 0 1 0 0 0 0 0 0
94790 2010 murabani 1 0 Communist Party of India - Maoist (CPI-Maoist) 0 True 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
94801 2010 dawlatabad 4 2 Taliban 0 True 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
94836 2010 gumla 7 0 Communist Party of India - Maoist (CPI-Maoist) 0 True 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 53 columns

Now we can drop the features we don't need for our analysis. We first save the group labels in order to compare them with the results at the end of the analysis.


In [147]:
label = exploitation.Group

In [150]:
exploitation.drop(['Year', 'City', 'Group', 'top_5'], axis = 1, inplace = True)

Then we normalize our features as before.


In [151]:
exploitation = (exploitation-exploitation.mean())/exploitation.std()

K-Means Clustering

Let's try the K-Means clustering algorithm. First, we try to find the best k for our data. To do this, we run K-Means for a range of k and analyse the results graphically. We decide to try k between 1 and 20.
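
best_k is implemented in clustering.py and not shown here; a minimal sketch of such an elbow search, assuming it simply plots the K-Means inertia (within-cluster sum of squares) for every candidate k, would be:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def best_k_sketch(k_min, k_max, data):
    # fit K-Means for each k and record the inertia, then look for an elbow
    ks = range(k_min, k_max + 1)
    inertias = [KMeans(n_clusters=k, random_state=0).fit(data).inertia_ for k in ks]
    plt.plot(list(ks), inertias, marker='o')
    plt.xlabel('number of clusters k')
    plt.ylabel('inertia')
    plt.show()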


In [123]:
from clustering import *

In [104]:
best_k(1, 20, exploitation)


We expected a discontinuity (an elbow) in the shape of the curve, which would indicate a natural way to cluster our dataset with K-Means; there is none. This result can be explained by the nature of the algorithm: K-Means assumes spherical clusters and that each cluster contains roughly the same number of observations.

We have no way of knowing whether our data forms spherical clusters, but let's see if all the groups have roughly the same number of observations.


In [172]:
sns.countplot(label)
plt.xticks(rotation=90)
plt.show()


We see that the groups do not have the same number of observations, which may explain our result.

Gaussian Mixture Model

One can think of Gaussian Mixture Models as generalizing K-Means clustering to incorporate information about the covariance structure of the data. Let's see if we can reach a better result.

Here we use the BIC criterion in order to select the number of components in an efficient way.
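
best_k_gmm also lives in clustering.py; a minimal sketch, assuming it fits a Gaussian Mixture Model for each k and plots the corresponding BIC score, could be:

from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

def best_k_gmm_sketch(k_min, k_max, data):
    # fit a GMM for each number of components and record its BIC score
    ks = range(k_min, k_max + 1)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
            for k in ks]
    plt.plot(list(ks), bics, marker='o')
    plt.xlabel('number of components k')
    plt.ylabel('BIC')
    plt.show()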


In [126]:
best_k_gmm(1, 20, exploitation)


Here we can clearly identify a discontinuity in the shape of the curve, which suggests that the best clustering is obtained with k around 5. This is interesting because we decided to focus only on the 5 most active groups. Let's focus our analysis on the GMM clustering with k = 5.

GMM with k = 5

Let's now see if the GMM clustering with k = 5 is close to the group labels.
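
prediction_kmm and plot_prediction are defined in clustering.py; a minimal sketch of the comparison, assuming we fit a 5-component GMM and cross-tabulate its clusters against the saved group labels (an assumed approach, not necessarily the module's exact one), is:

from sklearn.mixture import GaussianMixture
import pandas as pd

def gmm_vs_labels_sketch(data, labels, k=5):
    # fit the mixture, assign each attack to a component, and compare with the labels
    gm = GaussianMixture(n_components=k, random_state=0).fit(data)
    clusters = gm.predict(data)
    # rows: true group labels, columns: predicted clusters
    return pd.crosstab(labels, clusters)

# hypothetical usage: gmm_vs_labels_sketch(exploitation, label, k=5)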


In [136]:
prediction = prediction_kmm(5, exploitation)

In [189]:
plot_prediction(label, prediction)


As we can see, the clustering did not recover the original groups. Even if the number of clusters is around 5, it is not the group label that drives the clustering. It is most likely a combination of factors, for example the number of victims or the attack types together with the group, that explains this clustering.

IV. Conclusion

All along the data science pipeline, we found that there are multiple perspectives from which we can look at the data. Each one offers a different approach with different results. We found in particular that:

  • Visualization is great when latitude/longitude coordinates are available; the possibilities are endless.
  • Basic facts that may sound uninteresting at first bring significant knowledge about the context of terrorism.
  • PCA allows us to uncover hidden correlations between variables. We found a correlation between the number of victims and the attack and weapon types. The target types have no particular link with the weapon or attack types used against them, except for Educational Institution. Indeed, the analysis showed that the main Educational Institution attacks were carried out using chemical material. Hence, looking at the coordinates of these attacks, one can take measures to protect such facilities, if deemed necessary. When developing this notebook we also noticed that integrating the coordinates into the PCA did not improve the clustering, meaning that no correlation can be drawn between the coordinates and the remaining variables.
  • Spectral embedding gives a better clustering representation than PCA (although this is not its main use).
  • Finally, the second clustering part was not very conclusive with the K-Means algorithm, but it worked better with GMM. Although we did not identify which variables drive the clustering, we see that there are roughly 5 main clusters in our analysis. We could continue the analysis in order to determine the clustering drivers for our dataset.

Further work

We restricted our attention to the Taliban group mainly because it contains a large number of data points, but any other sufficiently represented group could be chosen as well. We could then compare the groups to see whether the conclusions we draw are specific to one group or shared across groups.