The Global Terrorism Database (GTD) is an open-source database containing information on terrorist events around the world from 1970 through 2014. Some portion of these attacks has not been attributed to a particular terrorist group.
The goal is to use attack type, weapons used, description of the attack, and similar features to build a model that can predict which group may have been responsible for an incident.
We will start by updating and installing some of the libraries in this runtime.
In [1]:
!pip install -U seaborn
!pip install "xlrd>=0.9.0"
!pip install pdpbox
!pip install eli5
In [2]:
import os.path
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [3]:
# excel_file = "gtd_14to17_0718dist.xlsx"
excel_file = "globalterrorismdb_0718dist.xlsx"
if os.path.isfile(excel_file):
    print("Reading local", excel_file)
    df = pd.read_excel(excel_file)
else:
    print("Downloading and reading,", excel_file)
    df = pd.read_excel('http://apps.start.umd.edu/gtd/downloads/dataset/' + excel_file)
In [4]:
df.head()
Out[4]:
In [5]:
df.columns.tolist()
Out[5]:
Looking at the columns above, I have some initial assumptions about which features will matter most for identifying the responsible group.
We will find out later whether these assumptions are correct by computing permutation importance on the trained models.
In [6]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[6]:
In [7]:
DROP_THRESHOLD = .70
columns_to_drop = []
for column in df.columns.tolist():
    null_ratio = df[column].isnull().sum() / len(df[column])
    if null_ratio > DROP_THRESHOLD:
        columns_to_drop.append(column)
        print(column, "with null ratio", null_ratio, "will be dropped")
df.drop(columns_to_drop, axis=1, inplace=True)
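As a quick sanity check, the same null-ratio computation can be done in one vectorized step; after the drop above, no remaining column should exceed the threshold.
In [ ]:
# Vectorized null ratios for the remaining columns; all should now be <= DROP_THRESHOLD
remaining_ratios = df.isnull().mean()
print((remaining_ratios > DROP_THRESHOLD).any())  # expected: False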
In [8]:
print("All attacks", len(df))
# Also drop rows where gname is unknown
df = df[df['gname'] != 'Unknown']
print("Attacks where the attack group was known", len(df))
In [9]:
df.head()
Out[9]:
In [10]:
df.columns.tolist()
Out[10]:
In [11]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[11]:
That's looking a little better. Next, we will fill in the remaining missing values.
In [12]:
df.fillna(0, inplace=True)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[12]:
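Note that `fillna(0)` also puts a numeric 0 into text columns. A minimal type-aware alternative, shown only as a sketch of what could replace the simple fill above (it operates on a copy and is not used by the rest of the notebook):
In [ ]:
# Sketch: fill numeric columns with 0 and text columns with a placeholder, on a copy
df_alt = df.copy()
num_cols = df_alt.select_dtypes(include=[np.number]).columns
obj_cols = df_alt.select_dtypes(include=['object']).columns
df_alt[num_cols] = df_alt[num_cols].fillna(0)
df_alt[obj_cols] = df_alt[obj_cols].fillna('Unknown')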
In [13]:
sns.relplot(x="iyear", y="nkill",
col="region_txt", # Categorical variables that will determine the faceting of the grid.
hue="success", # Grouping variable that will produce elements with different colors.
style="success", # Grouping variable that will produce elements with different styles.
data=df)
Out[13]:
In [14]:
sns.relplot(x="iyear", y="nkill",
col="weaptype1_txt", # Categorical variables that will determine the faceting of the grid.
hue="success", # Grouping variable that will produce elements with different colors.
style="success", # Grouping variable that will produce elements with different styles.
data=df)
Out[14]:
Number of attacks by group
In [15]:
df.groupby("gname").size().sort_values(ascending=False).head()
Out[15]:
Number of kills by group
In [16]:
df.groupby("gname")["nkill"].sum().sort_values(ascending=False).head()
Out[16]:
Number of attacks by target
In [17]:
df.groupby("targtype1_txt").size().sort_values(ascending=False).head()
Out[17]:
Number of attacks by nationality
In [18]:
df.groupby("natlty1_txt").size().sort_values(ascending=False).head()
Out[18]:
The numbers above might suggest that people of certain nationalities are more likely to commit terrorist attacks, but is that actually true?
In [19]:
df.groupby(['country_txt', 'natlty1_txt']).size()
Out[19]:
It is clear that the majority of attacks in most countries are committed by citizens of that country. So an over-represented nationality most probably indicates a failed state or an unstable government.
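To put a number on that claim, we can check how often the recorded nationality matches the country where the attack took place; this is just an illustrative check using the `country_txt` and `natlty1_txt` columns already in `df`:
In [ ]:
# Fraction of attacks where the recorded nationality matches the attack's country
same_country = (df['country_txt'] == df['natlty1_txt'])
print("Share of attacks where nationality matches the attack's country:",
      round(same_country.mean(), 3))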
What about the United States specifically?
In [20]:
df.loc[df['country_txt'] == 'United States', ['country_txt', 'natlty1_txt']].groupby(['country_txt', 'natlty1_txt']).size()
Out[20]:
That makes sense because most groups are regional, which also means that region and country should be good inputs to our model.
In [21]:
y = df['gname']
feature_names = ['iyear', 'country', 'region', 'multiple', 'success', 'suicide', 'attacktype1',
'targtype1', 'targsubtype1', 'natlty1', 'claimed', 'weaptype1', 'nkill', 'nwound',
'ransom']
X = df[feature_names]
In [22]:
# https://www.dataquest.io/blog/pandas-big-data/
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
In [23]:
print("Memory usage before", mem_usage(df))
In [24]:
columns_to_keep = ['gname'] + feature_names
columns_to_drop = []
for column in df.columns.tolist():
    if column not in columns_to_keep:
        columns_to_drop.append(column)
df.drop(columns_to_drop, axis=1, inplace=True)
In [25]:
print("Memory usage after", mem_usage(df))
In [26]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, random_state=1)
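With the default `test_size`, each `train_test_split` call holds out 25% of the rows it is given, so a quick size check confirms the proportions:
In [ ]:
# Roughly 56% train, 19% validation, 25% test of the labelled attacks
print("train:", len(train_X), "validation:", len(val_X), "test:", len(test_X))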
In [27]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
Calculate and show permutation importances with the eli5 library
In [28]:
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())
Out[28]:
The region and country where the attack happened are more indicative of which group might be responsible for it than the other features.
Including the country and region of the attack resulted in more accurate predictions:
Before including country and region
After including country and region
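The "before" number presumably came from an earlier run without those two features. A minimal sketch of how such a comparison could be reproduced on the same split (an illustrative re-run with hypothetical variable names, not one of the original cells):
In [ ]:
# Sketch: accuracy without the country/region features, for comparison with the cell below
from sklearn.metrics import accuracy_score
reduced_features = [f for f in feature_names if f not in ('country', 'region')]
model_no_geo = RandomForestClassifier(random_state=0).fit(train_X[reduced_features], train_y)
print("Accuracy without country/region:",
      accuracy_score(test_y, model_no_geo.predict(test_X[reduced_features])))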
In [29]:
from sklearn.metrics import accuracy_score
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
Out[29]:
In [30]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)
Calculate and show permutation importances
In [31]:
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())
Out[31]:
In [32]:
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
Out[32]:
In [33]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
Out[33]:
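One caveat: k-nearest neighbours is distance-based, so unscaled features such as nkill and iyear dominate the distance metric. A hedged sketch of the same classifier with standardization in a scikit-learn Pipeline (an alternative worth trying, not part of the original run):
In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same 3-NN classifier, but with features standardized first
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(train_X, train_y)
print("Scaled 3-NN accuracy:", accuracy_score(test_y, scaled_knn.predict(test_X)))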
In [ ]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(train_X, train_y)
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
In [ ]:
from sklearn.svm import SVC
model = SVC().fit(train_X, train_y)
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
In [ ]: