EE-558 A Network Tour of Data Science, EPFL
Joël M. Fonseca, Nelson Antunes, Enguerrand Granoux and Hedi Driss
The aim of this project is to understand terrorist attacks from different perspectives. The data scientist's toolkit provides a panel of useful methods that help highlight the main characteristics of the data we are investigating. The project structure is defined as follows:
I. Data Acquisition
In this part, the goal is to understand the data. What information do we have? How is it represented? Can missing values be recovered without making wrong assumptions? Which variables are relevant for this project? These are the kinds of questions we try to answer.
II. Data Exploration:
Here, we need to bring the data alive, to let it express itself. We present a map that dynamically illustrates the evolution of all the terrorist attacks available in the database. We also plot some basic facts, like the most affected countries and the most used types of weapon. Finally, we propose alternative directions that can be developed further, like time series.
III. Data Exploitation:
In this section, we use the power of mathematics to represent the data at a more abstract level, in order to find the hidden structure of the attacks. In particular, Principal Component Analysis (PCA) is used on a specific terrorist group to highlight patterns of similarity. Then, a spectral embedding with the Laplacian Eigenmaps algorithm is applied and the two techniques are compared. In a second part, we remove the group labels and see if we can recover them with different clustering algorithms.
IV. Conclusion:
Finally, the conclusions drawn from the results found in the two previous sections are summarized.
In [2]:
# the basics
import numpy as np
import pandas as pd
import seaborn as sns
# useful libraries
import googlemaps
# for visualization
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
# To match name
import difflib as di
# Clustering part
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture  # GMM was removed from recent scikit-learn; GaussianMixture is the current API
# avoid warnings
import warnings
warnings.filterwarnings('ignore')
# automatically reloads the different .py's before running a cell
%load_ext autoreload
%autoreload 2
In this project, we will work on the Global Terrorism Database, an open-source database containing information about more than 170,000 terrorist attacks around the world from 1970 through 2016, provided by the START consortium. It can be downloaded here. Make sure you download it before running this notebook.
Let's first of all make the data usable for our analysis.
We will first analyze the database to understand its content and then restrict ourselves to the most significant features in order to carry out our task in the best possible way.
In [3]:
attacks = pd.read_csv('globalterrorismdb_0617dist.csv', encoding='ISO-8859-1')
attacks.head()
Out[3]:
In [5]:
attacks.shape
Out[5]:
We see that there are 135 different columns. We won't use all of them. A complete description of each column can be found here. After some consideration, we made a first selection of the most important features to keep, and we renamed all the columns of interest.
In [6]:
attacks.rename(columns={'iyear': 'Year', 'imonth': 'Month', 'iday': 'Day',
                        'country_txt': 'Country', 'region_txt': 'Region', 'city': 'City',
                        'latitude': 'Latitude', 'longitude': 'Longitude',
                        'attacktype1_txt': 'Attack_type', 'target1': 'Target',
                        'nkill': 'Killed', 'nwound': 'Wounded', 'summary': 'Summary',
                        'gname': 'Group', 'targtype1_txt': 'Target_type',
                        'weaptype1_txt': 'Weapon_type', 'motive': 'Motive'}, inplace=True)
In [7]:
attacks=attacks[['Year','Month','Day','Country','Region','City','Latitude','Longitude','Attack_type','Killed','Wounded','Target','Summary','Group','Target_type','Weapon_type','Motive']]
attacks.head()
Out[7]:
In [8]:
attacks.isnull().sum()
Out[8]:
Let's look at each of the columns with missing information and see what we can do to complete it, if possible:
- City: if we know the coordinates, we can retrieve the name of the city.
- Latitude and Longitude: if we know the city and the country, we can retrieve approximate geographic coordinates.
- Killed and Wounded: we cannot correctly retrieve this information without making wrong assumptions.
- Target, Summary, Motive: these fields are impossible to infer from the data we have, so we leave them as is.

The only fields worth completing for the rest of our analysis are the latitude and longitude coordinates. Indeed, as we will be plotting a world map with all the attacks, it makes sense to be as complete as possible. Knowing the city won't influence our further analysis, and for the last fields, as discussed, we can't reliably retrieve the information.
To retrieve the latitude and longitude coordinates, the Google Maps Geocoding API seems to be the most appropriate tool. In order to minimize the number of requests, we first create a mask to find which coordinates can be recovered.
In [7]:
mask = (attacks['City'].notnull()) & (attacks['City'] != 'Unknown') & (attacks['Longitude'].isnull())
attacks[mask].shape
Out[7]:
We see that 2,503 attacks are affected. Let's see how many we can recover.
In [12]:
gmaps = googlemaps.Client(key='YOUR_API_KEY')  # replace with your own Google Maps API key
In [76]:
def fill_lat_long(row):
    # query the Google Maps Geocoding API with "Country, City"
    location = row['Country'] + ', ' + row['City']
    info = gmaps.geocode(location)
    if info:
        # keep the first (best) match returned by the API
        lat_lon = info[0]['geometry']['location']
        row['Latitude'] = lat_lon['lat']
        row['Longitude'] = lat_lon['lng']
    return row
In [77]:
attacks[mask] = attacks[mask].apply(lambda row: fill_lat_long(row), axis = 1)
attacks.isnull().sum()
Out[77]:
We have been able to retrieve about 1,600 coordinates (64%) using this API. Although it is provided by Google, we see that some cities are still not correctly recognized. It was still worth the try.
In [10]:
# uncomment the line below if you don't have the attacks.csv file yet
#attacks.to_csv('attacks.csv')
In [6]:
attacks = pd.read_csv('attacks.csv', index_col=0)
We developed a Javascript website that allows an interactive experience. The code is available, but it should be run locally.
As we have the coordinates for all the attacks of each year, we can dynamically represent the activity of terrorism on a world map. To do so, we will create an animated plot using the Matplotlib Animation and Basemap libraries.
Essentially, the markers represent the attacks for a specific year. Besides the geographic coordinates, each marker $i$ has a corresponding size $s_i$ and color $c_i$ defined as follows:
$$s_i = 10\,\lvert k_i - k_{\text{median}} \rvert$$

$$c_i = \frac{k_i - k_{\text{min}}}{k_{\text{max}} - k_{\text{min}}}$$

where $k_i$ is the number of people killed in the attack represented by marker $i$, and $k_{\text{median}}$, $k_{\text{min}}$ and $k_{\text{max}}$ are respectively the median, minimum and maximum number of people killed in that year.
Combined, these two indicators give us a much more comprehensive picture of the severity of the attacks, based on the number of people killed.
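As a minimal sketch of how these marker attributes might be computed for one year (the names below, such as marker_attributes and yearly, are illustrative and not part of the project's visualization module):

import numpy as np

def marker_attributes(killed):
    # compute marker sizes and colors for one year's attacks, following the formulas above
    killed = np.nan_to_num(np.asarray(killed, dtype=float))
    k_median = np.median(killed)
    k_min, k_max = killed.min(), killed.max()
    sizes = 10 * np.abs(killed - k_median)
    # guard against a year in which every attack has the same death toll
    colors = (killed - k_min) / (k_max - k_min) if k_max > k_min else np.zeros_like(killed)
    return sizes, colors

# example usage on a single year of the dataset:
# yearly = attacks[attacks['Year'] == 2001]
# sizes, colors = marker_attributes(yearly['Killed'])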
In [4]:
from visualization import world_view
In [7]:
world_view(attacks)
This visualization makes it easy to grasp the geographic context of the terrorist attacks. One can appreciate the severity of the 9/11 attacks and perceive the intense activity in the Middle East over the last few years in particular.
We can also, of course, look at a particular group and see more precisely where it attacked, or the distribution of its attack types. The two following plots show the idea for the Taliban group.
In [15]:
from visualization import zoom_taliban_intensity
zoom_taliban_intensity(attacks)
In [16]:
from visualization import zoom_taliban_attack_types
zoom_taliban_attack_types(attacks)
In [34]:
from plots import *
plot_attack_ferq_per_year(attacks)
We see a clear increase in the number of attacks over the last 5 years. The following plots respectively give us information about: the frequency of attack types, the distribution of targets, the attack frequency by year and region, the distribution of attacks by region, the most affected countries, the top 15 most active terrorist groups, the joint distribution of the attack coordinates, and the relation between the number of people killed and the number of attacks.
In [20]:
plot_attack_types_freq(attacks)
In [21]:
plot_target_distribution(attacks)
In [22]:
plot_attack_freq_by_year_and_region(attacks)
In [25]:
plot_attack_distribution_by_region(attacks)
In [24]:
plot_most_affected_countries(attacks)
In [26]:
plot_top15_most_active_terrorist_groups(attacks)
In [37]:
joint_plot_coordinates(attacks)
In [39]:
killed_num_attacks_relation(attacks)
As they are very self-explanatory, we decided not to comment explicitly on each of these plots. Together they give a good overview of the situation and the context in which terrorist attacks have evolved over the last decades.
As we also have the date of each data point, one can try to do some forecasting, or merge the data with other time series to see if one affects the other. Let's see how we can construct such a time series for informational purposes.
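As a rough sketch, one could build such a series directly with pandas (the create_time_series helper used below lives in time_series.py and may proceed differently; the function name build_attack_time_series here is purely illustrative):

import pandas as pd

def build_attack_time_series(attacks, group=None):
    # count attacks per month, optionally restricted to one terrorist group
    df = attacks.copy()
    if group is not None:
        df = df[df['Group'] == group]
    # some rows have Month or Day equal to 0; clamp them to 1 so a valid date can be formed
    dates = pd.to_datetime({'year': df['Year'],
                            'month': df['Month'].clip(lower=1),
                            'day': df['Day'].clip(lower=1)},
                           errors='coerce')
    df = df.assign(Date=dates).dropna(subset=['Date'])
    return df.set_index('Date').resample('M').size()

# example: monthly number of attacks attributed to the Taliban
# taliban_ts = build_attack_time_series(attacks, group='Taliban')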
In [26]:
from time_series import create_time_series
ts = create_time_series(attacks)
ts.head()
Out[26]:
In [27]:
from time_series import show_subset
show_subset(ts, 'Taliban')
The plot above shows the activity of the Taliban from 2008 to 2011.
In [8]:
exploitation = pd.read_csv('attacks.csv', index_col=0)
exploitation.head()
Out[8]:
In [9]:
exploitation.drop(['Year', 'Latitude', 'Longitude', 'Month','Day', 'Country', 'City', 'Region', 'Target', 'Summary', 'Motive'], inplace = True, axis = 1)
exploitation.columns
Out[9]:
We will perform a Principal Component Analysis (PCA) in order to represent all the attack datapoints in a low dimension that our minds can comprehend.
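As a rough, self-contained sketch of the kind of reduction we have in mind (scikit-learn's PCA applied to a random standardized matrix standing in for the attack features; the project's own pca helper, imported further below from pca.py, may be implemented differently):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # stands in for the standardized attack features
X_2d = PCA(n_components=2).fit_transform(X)  # project onto the first two principal components
print(X_2d.shape)                            # (200, 2): one 2-D point per attack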
Due to the tremendous amount of data, we made a first drastic selection: we focused on the top 10 most active terrorist groups. We then tried with only one group to see if we obtained a better representation, which was indeed the case. Hence, in the following cells, we analyze the attacks of the Taliban group only.
In [10]:
# we will need that to correctly legend the later plots
list_attack_type_unique = exploitation['Attack_type'].unique().tolist()
list_target_type_unique = exploitation['Target_type'].unique().tolist()
list_weapon_type_unique = exploitation['Weapon_type'].unique().tolist()
list_unique = [list_attack_type_unique, list_target_type_unique, list_weapon_type_unique]
In [11]:
# select the Taliban group
features = exploitation[exploitation['Group'] == 'Taliban']
del features['Group']
# delete any row that contain a NaN
features = features.dropna()
# check amount of remaining data
features.shape
Out[11]:
In [12]:
# again we will need these lists for the legends
list_attack_type = features['Attack_type'].tolist()
list_target_type = features['Target_type'].tolist()
list_weapon_type = features['Weapon_type'].tolist()
list_ = [list_attack_type, list_target_type, list_weapon_type]
In [13]:
features.head()
Out[13]:
Another problem that needs to be handled is that we are dealing with both numerical and categorical variables. In order to convert the latter without making wrong assumptions, we use one-hot encoding, already provided by the pandas library through the .get_dummies function. In words, it amounts to representing a categorical variable as a vector of $n$ dimensions, where $n$ is the number of categories of that variable. The vectors representing the categories are then orthogonal to each other, which avoids hazardous mathematical assumptions.
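As a tiny illustration of what this produces (a toy column, not the real data):

import pandas as pd

toy = pd.DataFrame({'Weapon_type': ['Firearms', 'Explosives', 'Firearms']})
# one indicator column per category: Weapon_type_Explosives, Weapon_type_Firearms
print(pd.get_dummies(toy, columns=['Weapon_type']))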
In [14]:
features = pd.get_dummies(features, columns=['Attack_type', 'Target_type', 'Weapon_type'])
features.head()
Out[14]:
In [15]:
features = (features - features.mean()) / features.std()  # standardize each column (zero mean, unit variance)
In [16]:
from pca import pca, plot_killed_wounded_pca, plot_attack_weapon_target_type
features_pca = pca(features)
plot_killed_wounded_pca(features_pca)
plot_attack_weapon_target_type(features_pca, list_unique, list_)
We see that the number of people killed increases as the second component decreases. The number of wounded depends on both components, but the dependence is more difficult to explain. Hence, we can say that the second component sufficiently explains the number of people killed.
The PCA for the different types of data shows interesting facts. In the first two plots, we are able to distinguish the clusters representing each of the categories. We particularly notice the following results:
- Attack_type plot: Bombing/Explosion is very well clustered and is also set apart from the remaining datapoints. Armed Assault is also apart, although it shares some characteristics with Assassination. A portion of Unarmed Assault has its own distinctive location. The remaining categories are grouped in the same fashion, with some dense clustering in places.
- Weapon_type plot: Explosives/Bombs/Dynamite, Firearms and Chemical are the most significant clusters.
- Target_type plot: here we see that PCA fails to explain the target categories, except possibly that, above a specific threshold, the second component clearly identifies Educational Institution.

Coupled with the two previous plots, we can add to our analysis that firearms and explosives are the most murderous weapons. We can also say that there is no obvious type of weapon or attack that corresponds to a particular target, except for the Educational Institution type.
In this manner, we just demonstrated that PCA allows us to represent high-dimensional, inter-correlated information in a much lower dimension with patterns of similarity. But when we want to do some clustering, a more suitable approach is the Laplacian Eigenmaps algorithm. Indeed, where PCA is generally used for linear dimensionality reduction, Laplacian Eigenmaps is a spectral embedding for non-linear dimensionality reduction, well suited to representing clusters. Let's see what we get.
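For reference, a minimal sketch of such a spectral embedding with scikit-learn (the project's se helper, imported in the next cell from se.py, may build the underlying graph Laplacian differently):

import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # stands in for the standardized attack features
embedding = SpectralEmbedding(n_components=2,  # target dimension of the embedding
                              affinity='nearest_neighbors',
                              n_neighbors=10)
X_2d = embedding.fit_transform(X)
print(X_2d.shape)                              # (200, 2)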
In [17]:
from se import se, plot_killed_wounded_se, plot_attack_weapon_target_type
features_se = se(features)
plot_killed_wounded_se(features_se)
plot_attack_weapon_target_type(features_se, list_unique, list_)
First, notice that the numbers of killed and wounded people are treated the same way here, i.e., the high-value datapoints lie approximately in the center. Remember that with PCA the two were not treated in exactly the same way.
Then, with this method, we see a more organized representation. We can clearly distinguish three straight lines, each representing a category, with a condensed area where these three lines join. In the first two plots, we can clearly see a link, as the variables they represent are, in a way, related: if we look at the Explosives/Bombs/Dynamite weapon type, we see without surprise that it was used in Bombing/Explosion attacks, but also in Assassination cases and some Armed Assault raids. For the last plot, we saw with PCA that it was difficult to read any information except for the Educational Institution target type. However, with the spectral embedding method we see that the Police target type is heavily represented at the end of each straight line, which was not the case with PCA. The remaining target types are rather mixed in the center.
In this way, we saw that, due to the different conceptual approaches of the two techniques, they lead to distinct representations. In the context of this project, spectral embedding with the Laplacian Eigenmaps algorithm produces the better clustering representation. Nevertheless, both techniques allow us to draw interesting conclusions.
In this analysis we will try to cluster our data. In particular, we will remove the group label and see if clustering allows us to recover the groups. This will tell us whether there are group-specific characteristics and whether the groups naturally partition our dataset.
Again, we will drop features that might not give insightful information for the next analysis.
In [137]:
exploitation = pd.read_csv('attacks.csv', index_col=0)
In [138]:
exploitation['Killed'] = exploitation['Killed'].fillna(0).astype(int)
exploitation['Wounded'] = exploitation['Wounded'].fillna(0).astype(int)
In [139]:
exploitation.head()
Out[139]:
In [140]:
exploitation.drop(['Latitude', 'Longitude', 'Month','Day', 'Country', 'Target', 'Summary', 'Motive'], inplace = True, axis = 1)
exploitation.columns
Out[140]:
We will add a new feature isCapital, which is 1 if the city is a capital and 0 otherwise. Since there is no Python module for this, we manually record the capital of each country appearing in our list of countries.
In [141]:
# List of capitals based on the list of countries we have (note: some names use French spellings)
Capitals = ['bagdad', 'new delhi', 'mogadiscio', 'abuja', 'manille', 'ankara','islamabad', 'sanaa', 'damas', 'nairobi', 'yaoundé', 'niamey', 'beyrouth', 'ndjaména', 'riyad', 'paris', 'kaboul', 'tunis', 'amman', 'berlin', 'kampala', 'addis-abeba', 'le caire', 'bruxelles', 'manama', 'amsterdam', 'jerusalem', 'tripoli', 'katmandou', 'london', 'moscou', 'dodoma', 'abou dabi', 'ouagadougou', 'new york', 'tbilisi', 'djibouti', 'jakarta', 'kuala lumpur', 'vienne','bangkok']
In order to match the capital and city names, we homogenize both of them. First, let's convert the city names to lower case.
In [142]:
exploitation['City'] = exploitation['City'].apply(lambda x: str(x).lower())
Now, we will use .get_close_matches from the difflib library to see whether some capitals are not recognized because of spelling differences.
In [47]:
forgotten_capital = []
In [48]:
for city in exploitation.City:
    # does this city name closely resemble one of the capital names?
    matches = di.get_close_matches(city, Capitals, cutoff=0.7)
    if len(matches) > 0 and city not in Capitals and city not in forgotten_capital:
        # close to a capital but spelled differently: keep it for manual inspection
        forgotten_capital.append(city)
In [51]:
forgotten_capital
Out[51]:
We check manually which of these correspond to capitals and which correspond to other cities, and remove the latter. Then we concatenate the two lists.
In [70]:
for el in ['laman', 'baga', 'kalampur']:
forgotten_capital.remove(el)
In [143]:
Capitals = np.concatenate((Capitals,forgotten_capital), axis = 0)
Now, we apply a lambda function over the dataset to know if a city is a capital.
In [144]:
exploitation['isCapital'] = exploitation['City'].apply(lambda x: x in Capitals)
exploitation['isCapital'] = exploitation['isCapital'].astype(int)
Due to the tremendous amount of data, we made a first drastic selection: we focused on the top 5 most active terrorist groups since 2010.
In [145]:
exploitation = exploitation[exploitation['Year'] > 2009]
# find the top 5 groups (the first entry is 'Unknown', so we select groups 1 to 6)
top_5 = exploitation['Group'].value_counts()[1:6].index.values
exploitation['top_5'] = exploitation['Group'].apply(lambda x: (x in top_5))
exploitation = exploitation[exploitation['top_5'] == True]
As before, we use one-hot encoding for the categorical features Region, Attack_type, Target_type and Weapon_type.
In [146]:
exploitation = pd.get_dummies(exploitation, columns=['Region', 'Attack_type', 'Target_type', 'Weapon_type'])
exploitation.head()
Out[146]:
Now we can drop features that we don't need for our analysis. We just save the group labels to compare them with the results at the end of our analysis.
In [147]:
label = exploitation.Group
In [150]:
exploitation.drop(['Year', 'City', 'Group', 'top_5'], axis = 1, inplace = True)
Then we normalize our features as before.
In [151]:
exploitation = (exploitation-exploitation.mean())/exploitation.std()
Let's try the K-Means clustering algorithm. First, we will try to find the best k for our data. To do this, we try a range of values of k and then graphically analyse the result. We decide to try k between 1 and 20.
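A minimal sketch of what such a search could look like, plotting the K-Means inertia for each k (the project's best_k helper, imported below from clustering.py, may plot a different criterion):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(data, k_min=1, k_max=20):
    # fit K-Means for each k and plot the within-cluster sum of squares (inertia)
    ks = range(k_min, k_max + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_ for k in ks]
    plt.plot(list(ks), inertias, marker='o')
    plt.xlabel('k (number of clusters)')
    plt.ylabel('inertia')
    plt.show()

# example usage:
# elbow_plot(exploitation, 1, 20)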
In [123]:
from clustering import *
In [104]:
best_k(1, 20, exploitation)
We expected a discontinuity (an "elbow") in the shape of the curve, which would indicate a natural way to cluster our dataset with K-Means, but none appears. This result can be explained by the nature of the algorithm: K-Means assumes spherical clusters, with each cluster containing roughly the same number of observations.
We have no way to know whether our data forms spherical clusters, but let's see if all the groups have roughly the same number of observations.
In [172]:
sns.countplot(label)
plt.xticks(rotation=90)
plt.show()
We see that the groups do not have the same number of observations, which may explain our result.
One can think of Gaussian Mixture Models (GMM) as generalizing K-Means clustering to incorporate information about the covariance structure of the data. Let's see if we can reach a better result.
Here we use the BIC criterion in order to select the number of components in an efficient way.
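For reference, a possible sketch of BIC-based model selection with scikit-learn's GaussianMixture (the project's best_k_gmm helper may differ):

import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

def bic_plot(data, k_min=1, k_max=20):
    # fit a Gaussian mixture for each number of components and plot the BIC
    ks = range(k_min, k_max + 1)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(data).bic(data) for k in ks]
    plt.plot(list(ks), bics, marker='o')
    plt.xlabel('number of components')
    plt.ylabel('BIC (lower is better)')
    plt.show()

# example usage:
# bic_plot(exploitation.values, 1, 20)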
In [126]:
best_k_gmm(1, 20, exploitation)
Here we can clearly identify a discontinuity in the shape of the curve, which suggests that the best clustering is obtained with k around 5. This is interesting because we decided to focus only on the 5 most active groups. Let's focus our analysis on the GMM clustering with k = 5.
Let's now see whether the GMM clustering with k = 5 is close to the group labels.
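One simple way to quantify this is a contingency table of group labels against cluster assignments; a minimal sketch (compare_clusters is an illustrative helper, not part of the project's clustering module):

import numpy as np
import pandas as pd

def compare_clusters(labels, clusters):
    # contingency table of true group labels vs. cluster assignments
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    return pd.crosstab(pd.Series(labels, name='Group'), pd.Series(clusters, name='Cluster'))

# example usage once the prediction below has been computed:
# compare_clusters(label, prediction)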
In [136]:
prediction = prediction_kmm(5, exploitation)
In [189]:
plot_prediction(label, prediction)
As we can see, the clustering did not recover the original groups. Even if the number of clusters is around 5, it is not the group label that dominated the clustering. A combination of different factors, for example the number of victims or the attack types as well as the group, probably explains this clustering.
All along the Data Science pipeline, we found that there are multiple perspectives from which we can look at the data. Each one proposes a different approach and yields different results. In particular, we found that:
- The dimensionality-reduction analysis singled out the Educational Institution target type. Indeed, it showed that the main Educational Institution attacks were carried out using Chemical material. Hence, looking at the coordinates of these attacks, one can take the necessary measures to protect such facilities, if deemed necessary. While developing this notebook, we also noticed that integrating the coordinates into the PCA did not improve the clustering, meaning that no correlation can be drawn between the coordinates and the remaining variables.
- We restricted our attention to the Taliban group, mainly because it contains a large number of datapoints, but any other group which is sufficiently represented could be chosen as well. We could then compare groups to see whether the conclusions we draw are specific to one group or shared across groups.