Global Terrorist Attacks

The Global Terrorism Database (GTD) is an open-source database of terrorist events around the world from 1970 through 2017. A sizable portion of the attacks have not been attributed to a particular terrorist group.

Use attack type, weapons used, description of the attack, etc. to build a model that can predict what group may have been responsible for an incident.

We will start by updating and installing some of the libraries in this runtime.


In [1]:
!pip install -U seaborn
!pip install "xlrd>=0.9.0"
!pip install pdpbox
!pip install eli5


Requirement already up-to-date: seaborn in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages
Requirement already up-to-date: matplotlib>=1.4.3 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from seaborn)
Requirement already up-to-date: pandas>=0.15.2 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from seaborn)
Requirement already up-to-date: numpy>=1.9.3 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from seaborn)
Requirement already up-to-date: scipy>=0.14.0 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from seaborn)
Requirement already up-to-date: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from matplotlib>=1.4.3->seaborn)
Requirement already up-to-date: python-dateutil>=2.1 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from matplotlib>=1.4.3->seaborn)
Requirement already up-to-date: kiwisolver>=1.0.1 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from matplotlib>=1.4.3->seaborn)
Requirement already up-to-date: cycler>=0.10 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages/cycler-0.10.0-py3.5.egg (from matplotlib>=1.4.3->seaborn)
Requirement already up-to-date: pytz>=2011k in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pandas>=0.15.2->seaborn)
Requirement already up-to-date: six>=1.5 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from python-dateutil>=2.1->matplotlib>=1.4.3->seaborn)
Requirement already up-to-date: setuptools in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn)
You are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already satisfied: pdpbox in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages
Requirement already satisfied: psutil in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pdpbox)
Requirement already satisfied: matplotlib>=2.1.2 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pdpbox)
Requirement already satisfied: scipy in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pdpbox)
Requirement already satisfied: pandas in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pdpbox)
Requirement already satisfied: scikit-learn in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pdpbox)
Requirement already satisfied: numpy in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pdpbox)
Requirement already satisfied: joblib in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pdpbox)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from matplotlib>=2.1.2->pdpbox)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from matplotlib>=2.1.2->pdpbox)
Requirement already satisfied: cycler>=0.10 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages/cycler-0.10.0-py3.5.egg (from matplotlib>=2.1.2->pdpbox)
Requirement already satisfied: python-dateutil>=2.1 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from matplotlib>=2.1.2->pdpbox)
Requirement already satisfied: pytz>=2011k in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from pandas->pdpbox)
Requirement already satisfied: setuptools in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->pdpbox)
Requirement already satisfied: six in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from cycler>=0.10->matplotlib>=2.1.2->pdpbox)
Collecting eli5
  Downloading https://files.pythonhosted.org/packages/8d/c8/04bed18dcce1d927b0dd5fc3425777354b714d2e62d60ae301928b5a5bf8/eli5-0.8.1-py2.py3-none-any.whl (98kB)
    100% |################################| 102kB 526kB/s eta 0:00:01
Requirement already satisfied: numpy>=1.9.0 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from eli5)
Requirement already satisfied: jinja2 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from eli5)
Collecting typing (from eli5)
  Downloading https://files.pythonhosted.org/packages/4a/bd/eee1157fc2d8514970b345d69cb9975dcd1e42cd7e61146ed841f6e68309/typing-3.6.6-py3-none-any.whl
Requirement already satisfied: scipy in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from eli5)
Requirement already satisfied: scikit-learn>=0.18 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from eli5)
Requirement already satisfied: six in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from eli5)
Collecting tabulate>=0.7.7 (from eli5)
  Downloading https://files.pythonhosted.org/packages/12/c2/11d6845db5edf1295bc08b2f488cf5937806586afe42936c3f34c097ebdc/tabulate-0.8.2.tar.gz (45kB)
    100% |################################| 51kB 4.1MB/s eta 0:00:01
Collecting graphviz (from eli5)
  Downloading https://files.pythonhosted.org/packages/1f/e2/ef2581b5b86625657afd32030f90cf2717456c1d2b711ba074bf007c0f1a/graphviz-0.10.1-py2.py3-none-any.whl
Collecting attrs>16.0.0 (from eli5)
  Downloading https://files.pythonhosted.org/packages/3a/e1/5f9023cc983f1a628a8c2fd051ad19e76ff7b142a0faf329336f9a62a514/attrs-18.2.0-py2.py3-none-any.whl
Collecting singledispatch>=3.4.0.3; python_version < "3.5.6" (from eli5)
  Downloading https://files.pythonhosted.org/packages/c5/10/369f50bcd4621b263927b0a1519987a04383d4a98fb10438042ad410cf88/singledispatch-3.4.0.3-py2.py3-none-any.whl
Requirement already satisfied: MarkupSafe>=0.23 in /Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages (from jinja2->eli5)
Building wheels for collected packages: tabulate
  Running setup.py bdist_wheel for tabulate ... done
  Stored in directory: /Users/mostafagazar/Library/Caches/pip/wheels/2a/85/33/2f6da85d5f10614cbe5a625eab3b3aebfdf43e7b857f25f829
Successfully built tabulate
Installing collected packages: typing, tabulate, graphviz, attrs, singledispatch, eli5
Successfully installed attrs-18.2.0 eli5-0.8.1 graphviz-0.10.1 singledispatch-3.4.0.3 tabulate-0.8.2 typing-3.6.6

In [2]:
import os.path

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Download dataset

I will first explore a small subset (attacks from 2014 to 2017) and then re-run all the cells on the complete dataset.


In [3]:
# excel_file = "gtd_14to17_0718dist.xlsx"
excel_file = "globalterrorismdb_0718dist.xlsx"

if os.path.isfile(excel_file):
    print("Reading local", excel_file)
    df = pd.read_excel(excel_file)
else:
    print("Downloading and reading", excel_file)
    df = pd.read_excel('http://apps.start.umd.edu/gtd/downloads/dataset/' + excel_file)


Reading local globalterrorismdb_0718dist.xlsx

In [4]:
df.head()


Out[4]:
eventid iyear imonth iday approxdate extended resolution country country_txt region ... addnotes scite1 scite2 scite3 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY related
0 197000000001 1970 7 2 NaN 0 NaT 58 Dominican Republic 2 ... NaN NaN NaN NaN PGIS 0 0 0 0 NaN
1 197000000002 1970 0 0 NaN 0 NaT 130 Mexico 1 ... NaN NaN NaN NaN PGIS 0 1 1 1 NaN
2 197001000001 1970 1 0 NaN 0 NaT 160 Philippines 5 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
3 197001000002 1970 1 0 NaN 0 NaT 78 Greece 8 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
4 197001000003 1970 1 0 NaN 0 NaT 101 Japan 4 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN

5 rows × 135 columns


In [5]:
df.columns.tolist()


Out[5]:
['eventid',
 'iyear',
 'imonth',
 'iday',
 'approxdate',
 'extended',
 'resolution',
 'country',
 'country_txt',
 'region',
 'region_txt',
 'provstate',
 'city',
 'latitude',
 'longitude',
 'specificity',
 'vicinity',
 'location',
 'summary',
 'crit1',
 'crit2',
 'crit3',
 'doubtterr',
 'alternative',
 'alternative_txt',
 'multiple',
 'success',
 'suicide',
 'attacktype1',
 'attacktype1_txt',
 'attacktype2',
 'attacktype2_txt',
 'attacktype3',
 'attacktype3_txt',
 'targtype1',
 'targtype1_txt',
 'targsubtype1',
 'targsubtype1_txt',
 'corp1',
 'target1',
 'natlty1',
 'natlty1_txt',
 'targtype2',
 'targtype2_txt',
 'targsubtype2',
 'targsubtype2_txt',
 'corp2',
 'target2',
 'natlty2',
 'natlty2_txt',
 'targtype3',
 'targtype3_txt',
 'targsubtype3',
 'targsubtype3_txt',
 'corp3',
 'target3',
 'natlty3',
 'natlty3_txt',
 'gname',
 'gsubname',
 'gname2',
 'gsubname2',
 'gname3',
 'gsubname3',
 'motive',
 'guncertain1',
 'guncertain2',
 'guncertain3',
 'individual',
 'nperps',
 'nperpcap',
 'claimed',
 'claimmode',
 'claimmode_txt',
 'claim2',
 'claimmode2',
 'claimmode2_txt',
 'claim3',
 'claimmode3',
 'claimmode3_txt',
 'compclaim',
 'weaptype1',
 'weaptype1_txt',
 'weapsubtype1',
 'weapsubtype1_txt',
 'weaptype2',
 'weaptype2_txt',
 'weapsubtype2',
 'weapsubtype2_txt',
 'weaptype3',
 'weaptype3_txt',
 'weapsubtype3',
 'weapsubtype3_txt',
 'weaptype4',
 'weaptype4_txt',
 'weapsubtype4',
 'weapsubtype4_txt',
 'weapdetail',
 'nkill',
 'nkillus',
 'nkillter',
 'nwound',
 'nwoundus',
 'nwoundte',
 'property',
 'propextent',
 'propextent_txt',
 'propvalue',
 'propcomment',
 'ishostkid',
 'nhostkid',
 'nhostkidus',
 'nhours',
 'ndays',
 'divert',
 'kidhijcountry',
 'ransom',
 'ransomamt',
 'ransomamtus',
 'ransompaid',
 'ransompaidus',
 'ransomnote',
 'hostkidoutcome',
 'hostkidoutcome_txt',
 'nreleased',
 'addnotes',
 'scite1',
 'scite2',
 'scite3',
 'dbsource',
 'INT_LOG',
 'INT_IDEO',
 'INT_MISC',
 'INT_ANY',
 'related']

Looking at the above columns, I suspect that:

  • Region/country would be a great factor in predicting which group may be responsible for an attack.
  • Date would also be a good factor.
  • Weapon and target as well.

We will find out later whether these assumptions are correct by running permutation importance on the trained models.

Clean data

The next step is assessing our data and doing some cleanup.


In [6]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1284e30f0>

Drop almost empty columns

I will start by dropping all the columns that are almost empty (more than 70% null).


In [7]:
DROP_THRESHOLD = .70

columns_to_drop = []
for column in df.columns.tolist():
    null_ratio = df[column].isnull().sum() / len(df[column])
    if null_ratio > DROP_THRESHOLD:
        columns_to_drop.append(column)
        print(column, "with null ratio", null_ratio, "will be dropped")

df.drop(columns_to_drop, axis=1, inplace=True)


approxdate with null ratio 0.9491499303762982 will be dropped
resolution with null ratio 0.9877814531264619 will be dropped
alternative with null ratio 0.8403278093026072 will be dropped
alternative_txt with null ratio 0.8403278093026072 will be dropped
attacktype2 with null ratio 0.9652486914596761 will be dropped
attacktype2_txt with null ratio 0.9652486914596761 will be dropped
attacktype3 with null ratio 0.997644352224381 will be dropped
attacktype3_txt with null ratio 0.997644352224381 will be dropped
targtype2 with null ratio 0.9386650962348162 will be dropped
targtype2_txt with null ratio 0.9386650962348162 will be dropped
targsubtype2 with null ratio 0.9411913633586694 will be dropped
targsubtype2_txt with null ratio 0.9411913633586694 will be dropped
corp2 with null ratio 0.9443175501263134 will be dropped
target2 with null ratio 0.9393475736277526 will be dropped
natlty2 with null ratio 0.9404043128168154 will be dropped
natlty2_txt with null ratio 0.9404043128168154 will be dropped
targtype3 with null ratio 0.9935274724669907 will be dropped
targtype3_txt with null ratio 0.9935274724669907 will be dropped
targsubtype3 with null ratio 0.9939622766124905 will be dropped
targsubtype3_txt with null ratio 0.9939622766124905 will be dropped
corp3 with null ratio 0.994353049958446 will be dropped
target3 with null ratio 0.9935329763169337 will be dropped
natlty3 with null ratio 0.9936870841153387 will be dropped
natlty3_txt with null ratio 0.9936870841153387 will be dropped
gsubname with null ratio 0.9675823238355229 will be dropped
gname2 with null ratio 0.9889207500646703 will be dropped
gsubname2 with null ratio 0.9991193840091144 will be dropped
gname3 with null ratio 0.9982167526184567 will be dropped
gsubname3 with null ratio 0.9998899230011393 will be dropped
motive with null ratio 0.7217198430301996 will be dropped
guncertain2 with null ratio 0.9892399733613663 will be dropped
guncertain3 with null ratio 0.9982387680182288 will be dropped
claimmode with null ratio 0.8949700315370602 will be dropped
claimmode_txt with null ratio 0.8949700315370602 will be dropped
claim2 with null ratio 0.9895977236076635 will be dropped
claimmode2 with null ratio 0.9966096284350904 will be dropped
claimmode2_txt with null ratio 0.9966096284350904 will be dropped
claim3 with null ratio 0.9982497757181148 will be dropped
claimmode3 with null ratio 0.9992679879575763 will be dropped
claimmode3_txt with null ratio 0.9992679879575763 will be dropped
compclaim with null ratio 0.9733668701256529 will be dropped
weaptype2 with null ratio 0.9277509617977775 will be dropped
weaptype2_txt with null ratio 0.9277509617977775 will be dropped
weapsubtype2 with null ratio 0.9364745639574883 will be dropped
weapsubtype2_txt with null ratio 0.9364745639574883 will be dropped
weaptype3 with null ratio 0.9897463275561255 will be dropped
weaptype3_txt with null ratio 0.9897463275561255 will be dropped
weapsubtype3 with null ratio 0.9906819820464415 will be dropped
weapsubtype3_txt with null ratio 0.9906819820464415 will be dropped
weaptype4 with null ratio 0.9995982189541585 will be dropped
weaptype4_txt with null ratio 0.9995982189541585 will be dropped
weapsubtype4 with null ratio 0.9996147305039875 will be dropped
weapsubtype4_txt with null ratio 0.9996147305039875 will be dropped
propvalue with null ratio 0.7854103945710024 will be dropped
nhostkid with null ratio 0.9253017485731269 will be dropped
nhostkidus with null ratio 0.9256044603199939 will be dropped
nhours with null ratio 0.9776378576814482 will be dropped
ndays with null ratio 0.9552867230627824 will be dropped
divert with null ratio 0.9982167526184567 will be dropped
kidhijcountry with null ratio 0.9818097759382688 will be dropped
ransomamt with null ratio 0.9925698025769025 will be dropped
ransomamtus with null ratio 0.9969013324820712 will be dropped
ransompaid with null ratio 0.9957400201440908 will be dropped
ransompaidus with null ratio 0.9969618748314446 will be dropped
ransomnote with null ratio 0.9971710211292799 will be dropped
hostkidoutcome with null ratio 0.9395071852761007 will be dropped
hostkidoutcome_txt with null ratio 0.9395071852761007 will be dropped
nreleased with null ratio 0.9427599605924344 will be dropped
addnotes with null ratio 0.8443015889614786 will be dropped
scite3 with null ratio 0.7604944658788823 will be dropped
related with null ratio 0.8621946051262859 will be dropped
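The same filter can be expressed more compactly with `DataFrame.dropna` and its `thresh` parameter (the minimum count of non-null values a column must have to survive) — a sketch on a toy frame:

```python
import math

import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, None, None, None]})

# dropna(thresh=...) keeps columns with at least `thresh` non-null values;
# requiring 30% non-null mirrors dropping columns that are more than 70% null.
thresh = math.ceil(len(toy) * (1 - 0.70))
toy = toy.dropna(axis=1, thresh=thresh)
```

Here column `b` is 75% null, so it is dropped while `a` survives.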

Drop rows with unknown groups

Rows where the attack group is unknown will not be of much help; they would just confuse the model.


In [8]:
print("All attacks", len(df))

# Drop rows where gname is unknown
df = df[df['gname'] != 'Unknown']

print("Attacks where the attack group was known", len(df))


All attacks 181691
Attacks where the attack group was known 98909

In [9]:
df.head()


Out[9]:
eventid iyear imonth iday extended country country_txt region region_txt provstate ... propcomment ishostkid ransom scite1 scite2 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY
0 197000000001 1970 7 2 0 58 Dominican Republic 2 Central America & Caribbean NaN ... NaN 0.0 0.0 NaN NaN PGIS 0 0 0 0
1 197000000002 1970 0 0 0 130 Mexico 1 North America Federal ... NaN 1.0 1.0 NaN NaN PGIS 0 1 1 1
5 197001010002 1970 1 1 0 217 United States 1 North America Illinois ... NaN 0.0 0.0 "Police Chief Quits," Washington Post, January... "Cairo Police Chief Quits; Decries Local 'Mili... Hewitt Project -9 -9 0 -9
6 197001020001 1970 1 2 0 218 Uruguay 3 South America Montevideo ... NaN 0.0 0.0 NaN NaN PGIS 0 0 0 0
8 197001020003 1970 1 2 0 217 United States 1 North America Wisconsin ... Basketball courts, weight room, swimming pool,... 0.0 0.0 Tom Bates, "Rads: The 1970 Bombing of the Army... David Newman, Sandra Sutherland, and Jon Stewa... Hewitt Project 0 0 0 0

5 rows × 64 columns


In [10]:
df.columns.tolist()


Out[10]:
['eventid',
 'iyear',
 'imonth',
 'iday',
 'extended',
 'country',
 'country_txt',
 'region',
 'region_txt',
 'provstate',
 'city',
 'latitude',
 'longitude',
 'specificity',
 'vicinity',
 'location',
 'summary',
 'crit1',
 'crit2',
 'crit3',
 'doubtterr',
 'multiple',
 'success',
 'suicide',
 'attacktype1',
 'attacktype1_txt',
 'targtype1',
 'targtype1_txt',
 'targsubtype1',
 'targsubtype1_txt',
 'corp1',
 'target1',
 'natlty1',
 'natlty1_txt',
 'gname',
 'guncertain1',
 'individual',
 'nperps',
 'nperpcap',
 'claimed',
 'weaptype1',
 'weaptype1_txt',
 'weapsubtype1',
 'weapsubtype1_txt',
 'weapdetail',
 'nkill',
 'nkillus',
 'nkillter',
 'nwound',
 'nwoundus',
 'nwoundte',
 'property',
 'propextent',
 'propextent_txt',
 'propcomment',
 'ishostkid',
 'ransom',
 'scite1',
 'scite2',
 'dbsource',
 'INT_LOG',
 'INT_IDEO',
 'INT_MISC',
 'INT_ANY']

In [11]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x131808e80>

That's looking a little better; next we will fill in the remaining missing values.


In [12]:
df.fillna(0, inplace=True)

sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x119e406a0>
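Note that `fillna(0)` writes the integer 0 into text columns as well, which is why a nationality of `0` shows up in the groupbys below. A per-dtype fill is one way to avoid that — a sketch on a toy frame (the `"Unknown"` placeholder is just an illustration):

```python
import pandas as pd

sample = pd.DataFrame({"city": [None, "Kabul"], "nkill": [None, 3.0]})

# Fill text columns with a placeholder string and numeric columns with 0,
# rather than writing the integer 0 into every column.
for col in sample.columns:
    if sample[col].dtype == object:
        sample[col] = sample[col].fillna("Unknown")
    else:
        sample[col] = sample[col].fillna(0)
```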

Data exploration


In [13]:
sns.relplot(x="iyear", y="nkill", 
            col="region_txt", # Categorical variables that will determine the faceting of the grid.
            hue="success",  # Grouping variable that will produce elements with different colors.
            style="success", # Grouping variable that will produce elements with different styles.
            data=df)


Out[13]:
<seaborn.axisgrid.FacetGrid at 0x130515198>

In [14]:
sns.relplot(x="iyear", y="nkill", 
           col="weaptype1_txt", # Categorical variables that will determine the faceting of the grid.
           hue="success",  # Grouping variable that will produce elements with different colors.
           style="success", # Grouping variable that will produce elements with different styles.
           data=df)


Out[14]:
<seaborn.axisgrid.FacetGrid at 0x12f31eeb8>

Number of attacks by group


In [15]:
df.groupby("gname").size().sort_values(ascending=False).head()


Out[15]:
gname
Taliban                                             7478
Islamic State of Iraq and the Levant (ISIL)         5613
Shining Path (SL)                                   4555
Farabundo Marti National Liberation Front (FMLN)    3351
Al-Shabaab                                          3288
dtype: int64
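The same tally can be obtained with `Series.value_counts`, which sorts in descending order by default — a sketch:

```python
import pandas as pd

gname = pd.Series(["Taliban", "ISIL", "Taliban"])

# value_counts is equivalent to groupby(...).size().sort_values(ascending=False)
counts = gname.value_counts()
```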

Number of kills by group


In [16]:
df.groupby("gname")["nkill"].sum().sort_values(ascending=False).head()


Out[16]:
gname
Islamic State of Iraq and the Levant (ISIL)    38923.0
Taliban                                        29410.0
Boko Haram                                     20328.0
Shining Path (SL)                              11601.0
Liberation Tigers of Tamil Eelam (LTTE)        10989.0
Name: nkill, dtype: float64

Number of attacks by target


In [17]:
df.groupby("targtype1_txt").size().sort_values(ascending=False).head()


Out[17]:
targtype1_txt
Private Citizens & Property    22727
Military                       18969
Police                         14094
Business                       11518
Government (General)            9946
dtype: int64

Number of attacks by nationality


In [18]:
df.groupby("natlty1_txt").size().sort_values(ascending=False).head()


Out[18]:
natlty1_txt
India          7805
Afghanistan    6989
Iraq           6010
Colombia       5958
Peru           4994
dtype: int64

The numbers above might suggest that certain nationalities are more likely to be involved in terrorist attacks, but is that actually true?


In [19]:
df.groupby(['country_txt', 'natlty1_txt']).size()


Out[19]:
country_txt  natlty1_txt             
Afghanistan  0                            259
             Afghanistan                 6938
             Algeria                        1
             Asian                          3
             Australia                      2
             Bangladesh                     3
             Canada                         7
             China                          3
             Denmark                        1
             France                         8
             Germany                        7
             Great Britain                  8
             Iceland                        1
             India                         20
             International                522
             Iran                           4
             Iraq                           6
             Italy                          9
             Japan                          5
             Multinational                 20
             Nepal                          1
             Netherlands                    2
             Norway                         1
             Pakistan                      13
             Russia                         1
             Saudi Arabia                   1
             South Korea                    2
             Soviet Union                   1
             Spain                          1
             Sweden                         1
                                         ... 
Yemen        Tajikistan                     1
             Tuvalu                         1
             United Arab Emirates           4
             United Kingdom                 1
             United States                 29
             Uzbekistan                     1
             West Bank and Gaza Strip       1
             Yemen                       2170
Yugoslavia   Albania                        1
             Bosnia-Herzegovina             2
             Israel                         1
             Multinational                  1
             Serbia-Montenegro             44
             Turkey                         1
             United States                  2
             Yugoslavia                    22
Zaire        Great Britain                  1
             Israel                         3
             Rwanda                         2
             Spain                          1
             Sweden                         1
             United States                  1
             Zaire                          8
Zambia       Angola                         1
             Portugal                       1
             South Africa                   1
             Zambia                        29
Zimbabwe     Great Britain                  3
             United States                  1
             Zimbabwe                      40
Length: 1806, dtype: int64
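The "mostly domestic" reading can be checked directly by comparing the two columns row by row — a sketch on toy data using the same column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "country_txt": ["Afghanistan", "Afghanistan", "Yemen", "Yemen"],
    "natlty1_txt": ["Afghanistan", "International", "Yemen", "Yemen"],
})

# Fraction of incidents where the target nationality matches the country
# in which the attack took place.
domestic_ratio = (toy["country_txt"] == toy["natlty1_txt"]).mean()
```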

It is clear that the majority of attacks in most countries are committed by citizens of that country, so an over-represented foreign nationality most probably indicates a failed state or an unstable government.

What about the United States specifically?


In [20]:
df.loc[df['country_txt'] == 'United States', ['country_txt', 'natlty1_txt']].groupby(['country_txt', 'natlty1_txt']).size()


Out[20]:
country_txt    natlty1_txt                     
United States  0                                      4
               Angola                                 1
               Argentina                              2
               Bahamas                                2
               Bangladesh                             1
               Brazil                                 1
               Canada                                 1
               China                                  3
               Colombia                               1
               Costa Rica                             2
               Cuba                                  19
               Czechoslovakia                         1
               Democratic Republic of the Congo       1
               Dominican Republic                     3
               Egypt                                  8
               France                                 1
               Germany                                1
               Great Britain                          2
               Haiti                                  4
               India                                  4
               International                          8
               Iran                                   8
               Iraq                                   4
               Israel                                13
               Lebanon                                3
               Liberia                                1
               Libya                                  3
               Malawi                                 1
               Mexico                                10
               New Zealand                            1
               Nicaragua                              1
               Panama                                 1
               Poland                                 1
               Portugal                               2
               Puerto Rico                           58
               Rhodesia                               1
               Russia                                 2
               Saudi Arabia                           1
               South Africa                           6
               Soviet Union                          40
               Spain                                  5
               Switzerland                            5
               Tunisia                                1
               Turkey                                10
               United States                       1991
               Uruguay                                1
               Venezuela                              7
               Vietnam                                5
               West Bank and Gaza Strip               4
               West Germany (FRG)                     1
               Yugoslavia                             6
dtype: int64

That makes sense because most groups are regional, which also suggests that region and country will be good inputs to our model.

Model training

A tree-based model should deliver good results.


In [21]:
y = df['gname']
feature_names = ['iyear', 'country', 'region', 'multiple', 'success', 'suicide', 'attacktype1', 
                 'targtype1', 'targsubtype1', 'natlty1', 'claimed', 'weaptype1', 'nkill', 'nwound', 
                 'ransom']
X = df[feature_names]

Minimize our dataframe memory usage

We will achieve that by dropping the columns we will not use when training our models; if that is not enough, we will look into the columns' data types and convert them.


In [22]:
# https://www.dataquest.io/blog/pandas-big-data/
def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

In [23]:
print("Memory usage before", mem_usage(df))


Memory usage before 191.12 MB

In [24]:
columns_to_keep = ['gname'] + feature_names
columns_to_drop = []
for column in df.columns.tolist():
    if column not in columns_to_keep:
        columns_to_drop.append(column)
        
df.drop(columns_to_drop, axis=1, inplace=True)

In [25]:
print("Memory usage after", mem_usage(df))


Memory usage after 20.01 MB
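Had dropping columns not been enough, the other lever mentioned above is converting column dtypes; `pd.to_numeric` with `downcast` picks the smallest numeric type that holds the values — a sketch:

```python
import pandas as pd

toy = pd.DataFrame({"iyear": [1970, 1980, 1990, 2017]})

# int64 by default; downcast to the smallest unsigned integer type that fits
# (years up to 2017 fit comfortably in uint16).
toy["iyear"] = pd.to_numeric(toy["iyear"], downcast="unsigned")
```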

Split our data into train, validation and test


In [26]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, random_state=1)
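With scikit-learn's default `test_size=0.25`, chaining two splits like this leaves roughly 56% of the rows for training, 19% for validation, and 25% for testing — a sketch on toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# First split holds out 25% for test; second holds out 25% of the remainder
# for validation.
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, random_state=1)
```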

RandomForestClassifier


In [27]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=0).fit(train_X, train_y)


/Users/mostafagazar/anaconda3/envs/tf/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d

Calculate and show permutation importances with the eli5 library


In [28]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())


Out[28]:
Weight Feature
0.2887 ± 0.0069 iyear
0.2758 ± 0.0046 country
0.2598 ± 0.0032 region
0.1095 ± 0.0037 natlty1
0.0321 ± 0.0030 claimed
0.0256 ± 0.0029 weaptype1
0.0238 ± 0.0014 nkill
0.0238 ± 0.0040 targsubtype1
0.0214 ± 0.0026 multiple
0.0195 ± 0.0026 targtype1
0.0143 ± 0.0023 attacktype1
0.0073 ± 0.0020 nwound
0.0023 ± 0.0004 suicide
0.0014 ± 0.0015 success
0.0007 ± 0.0002 ransom

The year, country, and region of an attack are the most indicative features of which group might be responsible for it.

Including the country and region of the attack produced more accurate results:

  • Before including country and region

    • Accuracy with no scalers 0.6758790709744388
    • Accuracy after applying scalers 0.6754898144543922
  • After including country and region

    • Accuracy with no scalers 0.7468535097962891
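If eli5 is not available, scikit-learn 0.22+ ships the same diagnostic as `sklearn.inspection.permutation_importance` — a sketch on a toy classifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

# Shuffle each feature column on held-out data and measure how much the
# score drops; larger drops indicate more important features.
result = permutation_importance(model, val_X, val_y, n_repeats=5, random_state=1)
```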

In [29]:
from sklearn.metrics import accuracy_score

pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)


Out[29]:
0.6896230993206082
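An accuracy number is easier to judge against a majority-class baseline; `DummyClassifier` provides one — a sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, weights=[0.7, 0.3], random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

# Always predicts the most frequent class; any useful model should beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(train_X, train_y)
score = baseline.score(test_X, test_y)
```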

Decision Tree


In [30]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)

Calculate and show permutation importances


In [31]:
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())


Out[31]:
Weight Feature
0.5608 ± 0.0026 country
0.4710 ± 0.0050 region
0.3807 ± 0.0035 iyear
0.0862 ± 0.0033 natlty1
0.0780 ± 0.0032 targsubtype1
0.0624 ± 0.0025 targtype1
0.0444 ± 0.0027 attacktype1
0.0413 ± 0.0018 weaptype1
0.0332 ± 0.0019 nkill
0.0294 ± 0.0015 multiple
0.0200 ± 0.0021 claimed
0.0145 ± 0.0031 nwound
0.0030 ± 0.0010 success
0.0026 ± 0.0003 suicide
0.0014 ± 0.0004 ransom

In [32]:
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)


Out[32]:
0.6676237463604011

KNeighborsClassifier


In [33]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)

pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)


Out[33]:
0.6213199611776125
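k-NN is distance-based, so unscaled features such as large country codes or kill counts dominate the distance metric; standardizing first usually helps — a sketch with a pipeline on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

# Standardize features before the distance computation.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(train_X, train_y)
score = model.score(test_X, test_y)
```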

GaussianNB


In [ ]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB().fit(train_X, train_y)

pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)


Out[ ]:
0.09608540925266904
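The very low score here is not surprising: GaussianNB fits a normal distribution per feature, while most of these features are integer category codes with no numeric meaning. One-hot encoding is the usual fix — a sketch with `pd.get_dummies`:

```python
import pandas as pd

toy = pd.DataFrame({"region": [1, 2, 1, 3]})

# Expand an integer-coded categorical column into indicator columns so a
# model does not read an ordering into the codes.
dummies = pd.get_dummies(toy["region"], prefix="region")
```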

SVC


In [ ]:
from sklearn.svm import SVC

model = SVC().fit(train_X, train_y)

pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)

Conclusion

The majority of attacks in most countries are committed by citizens of that country, so an over-represented foreign nationality most probably indicates a failed state or an unstable government.


In [ ]: