The Global Terrorism Database (GTD) is an open-source database containing information on terrorist events around the world from 1970 through 2014. Some portion of these attacks has not been attributed to a particular terrorist group.
The goal is to use attack type, weapons used, description of the attack, and similar features to build a model that can predict which group may have been responsible for an incident.
We will start by updating and installing some of the libraries in this runtime.
In [1]:
!pip install -U seaborn
!pip install "xlrd>=0.9.0"
!pip install pdpbox
!pip install eli5
In [2]:
import os.path
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [3]:
# excel_file = "gtd_14to17_0718dist.xlsx"
excel_file = "globalterrorismdb_0718dist.xlsx"
if os.path.isfile(excel_file):
    print("Reading local", excel_file)
    df = pd.read_excel(excel_file)
else:
    print("Downloading and reading,", excel_file)
    df = pd.read_excel('http://apps.start.umd.edu/gtd/downloads/dataset/' + excel_file)
In [4]:
df.head()
Out[4]:
In [5]:
df.columns.tolist()
Out[5]:
Looking at the columns above, I have some initial assumptions about which features will matter most for identifying the responsible group.
We will find out later whether these assumptions are correct by computing permutation importance on the trained models.
In [6]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[6]:
In [7]:
DROP_THRESHOLD = .70
columns_to_drop = []
for column in df.columns.tolist():
    null_ratio = df[column].isnull().sum() / len(df[column])
    if null_ratio > DROP_THRESHOLD:
        columns_to_drop.append(column)
        print(column, "with null ratio", null_ratio, "will be dropped")
df.drop(columns_to_drop, axis=1, inplace=True)
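As a quick sanity check, the same null-ratio computation can be done in one vectorized step; after the drop above, no remaining column should exceed the threshold.
In [ ]:
# Vectorized null ratios for the remaining columns; all should now be <= DROP_THRESHOLD
remaining_ratios = df.isnull().mean()
print((remaining_ratios > DROP_THRESHOLD).any())  # expected: False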
In [8]:
print("All attacks", len(df))
# Also drop rows where gname is unknown
df = df[df['gname'] != 'Unknown']
print("Attacks where the attack group was known", len(df))
In [9]:
df.head()
Out[9]:
In [10]:
df.columns.tolist()
Out[10]:
In [11]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[11]:
That's looking a little better. Next, we will fill in the remaining missing values.
In [12]:
df.fillna(0, inplace=True)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[12]:
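Note that `fillna(0)` also puts a numeric 0 into text columns. A minimal type-aware alternative, shown only as a sketch of what could replace the simple fill above (it operates on a copy and is not used by the rest of the notebook):
In [ ]:
# Sketch: fill numeric columns with 0 and text columns with a placeholder, on a copy
df_alt = df.copy()
num_cols = df_alt.select_dtypes(include=[np.number]).columns
obj_cols = df_alt.select_dtypes(include=['object']).columns
df_alt[num_cols] = df_alt[num_cols].fillna(0)
df_alt[obj_cols] = df_alt[obj_cols].fillna('Unknown')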
In [13]:
sns.relplot(x="iyear", y="nkill",
col="region_txt", # Categorical variables that will determine the faceting of the grid.
hue="success", # Grouping variable that will produce elements with different colors.
style="success", # Grouping variable that will produce elements with different styles.
data=df)
Out[13]:
In [14]:
sns.relplot(x="iyear", y="nkill",
col="weaptype1_txt", # Categorical variables that will determine the faceting of the grid.
hue="success", # Grouping variable that will produce elements with different colors.
style="success", # Grouping variable that will produce elements with different styles.
data=df)
Out[14]:
Number of attacks by group
In [15]:
df.groupby("gname").size().sort_values(ascending=False).head()
Out[15]:
Number of kills by group
In [16]:
df.groupby("gname")["nkill"].sum().sort_values(ascending=False).head()
Out[16]:
Number of attacks by target
In [17]:
df.groupby("targtype1_txt").size().sort_values(ascending=False).head()
Out[17]:
Number of attacks by nationality
In [18]:
df.groupby("natlty1_txt").size().sort_values(ascending=False).head()
Out[18]:
The numbers above might suggest that people of certain nationalities are more likely to commit terrorist attacks, but is that actually true?
In [19]:
df.groupby(['country_txt', 'natlty1_txt']).size()
Out[19]:
It is clear that the majority of attacks in most countries are committed by citizens of that country. So an over-represented nationality most probably indicates a failed state or an unstable government.
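To put a number on that claim, we can check how often the recorded nationality matches the country where the attack took place; this is just an illustrative check using the `country_txt` and `natlty1_txt` columns already in `df`:
In [ ]:
# Fraction of attacks where the recorded nationality matches the attack's country
same_country = (df['country_txt'] == df['natlty1_txt'])
print("Share of attacks where nationality matches the attack's country:",
      round(same_country.mean(), 3))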
What about the United States specifically?
In [20]:
df.loc[df['country_txt'] == 'United States', ['country_txt', 'natlty1_txt']].groupby(['country_txt', 'natlty1_txt']).size()
Out[20]:
That makes sense because most groups are regional, which also means that region and country should be good inputs to our model.
In [21]:
y = df['gname']
feature_names = ['iyear', 'country', 'region', 'multiple', 'success', 'suicide', 'attacktype1',
'targtype1', 'targsubtype1', 'natlty1', 'claimed', 'weaptype1', 'nkill', 'nwound',
'ransom']
X = df[feature_names]
In [22]:
# https://www.dataquest.io/blog/pandas-big-data/
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
In [23]:
print("Memory usage before", mem_usage(df))
In [24]:
columns_to_keep = ['gname'] + feature_names
columns_to_drop = []
for column in df.columns.tolist():
    if column not in columns_to_keep:
        columns_to_drop.append(column)
df.drop(columns_to_drop, axis=1, inplace=True)
In [25]:
print("Memory usage after", mem_usage(df))
In [26]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, random_state=1)
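With the default `test_size`, each `train_test_split` call holds out 25% of the rows it is given, so a quick size check confirms the proportions:
In [ ]:
# Roughly 56% train, 19% validation, 25% test of the labelled attacks
print("train:", len(train_X), "validation:", len(val_X), "test:", len(test_X))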
In [27]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
Calculate and show permutation importances with the eli5 library
In [28]:
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())
Out[28]:
The region and country where the attack happened are more indicative of which group might be responsible for it than the other features.
Including the country and region of the attack resulted in more accurate predictions:
Before including country and region
After including country and region
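The "before" number presumably came from an earlier run without those two features. A minimal sketch of how such a comparison could be reproduced on the same split (an illustrative re-run with hypothetical variable names, not one of the original cells):
In [ ]:
# Sketch: accuracy without the country/region features, for comparison with the cell below
from sklearn.metrics import accuracy_score
reduced_features = [f for f in feature_names if f not in ('country', 'region')]
model_no_geo = RandomForestClassifier(random_state=0).fit(train_X[reduced_features], train_y)
print("Accuracy without country/region:",
      accuracy_score(test_y, model_no_geo.predict(test_X[reduced_features])))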
In [29]:
from sklearn.metrics import accuracy_score
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
Out[29]:
In [30]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)
Calculate and show permutation importances
In [31]:
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())
Out[31]:
In [32]:
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
Out[32]:
In [33]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
Out[33]:
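One caveat: k-nearest neighbours is distance-based, so unscaled features such as nkill and iyear dominate the distance metric. A hedged sketch of the same classifier with standardization in a scikit-learn Pipeline (an alternative worth trying, not part of the original run):
In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same 3-NN classifier, but with features standardized first
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(train_X, train_y)
print("Scaled 3-NN accuracy:", accuracy_score(test_y, scaled_knn.predict(test_X)))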
In [ ]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(train_X, train_y)
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
In [ ]:
from sklearn.svm import SVC
model = SVC().fit(train_X, train_y)
pred_y = model.predict(test_X)
accuracy_score(test_y, pred_y)
In [ ]: