The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2016 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 170,000 cases. Learn more.
For more precise information about important details like definitions and collection methodology, please see the GTD Codebook.
Format : CSV Size : 29 MB
The World Development Indicators from the World Bank contain over a thousand annual indicators of economic development from hundreds of countries around the world.
Here's a list of the available indicators along with a list of the available countries.
Format : SQLITE Size : 261 MB
The main purpose of our project is to relate country statistics from different development indicators to the periods in which terrorist attacks occurred. To do so, we have two datasets at our disposal: one with the various country indicators, conveniently provided in SQLITE format, and another containing terrorist attacks and related data (such as where each attack happened, its type, etc.).
The first important step is to clean the datasets and merge them, since we want to perform join operations between indicators and terrorist attacks. The main problem to overcome is the lack of an agreed convention to uniquely identify countries across the two datasets. In the first one, all indicators use alpha3 codes (3-letter codes) to denote the countries; there is also a Country table which gives the alpha2 code (2 letters) for each country. In the other dataset, the countries are identified using different codes, with no simple way to use them for merging. Moreover, the country names are not exactly the same in the two datasets.
The solution we found was to use the Google Maps API and send it requests with the country names coming from the second dataset. This API is robust to spelling variations in country names and can "normalize" them for us by returning the alpha2 code of a country given its name.
Here is the detail of our process:
1. First, we exported the Country table from the first dataset to CSV format, alongside the attacks dataset (which was already in CSV format), in order to process them easily with pandas. We only need the Country table from the first dataset to perform the cleaning and merging step.
2. Before normalizing the country names using the Google Maps API, we needed to resolve one name ambiguity by hand for the Republic of the Congo. Indeed, in the attacks dataset this country is denoted as People's Republic of the Congo, which the API could not understand, so we replaced occurrences of People's Republic of the Congo with Republic of the Congo. The issue was due to the similarity between the two countries Congo and Republic of the Congo, so the API could not figure out which one People's Republic of the Congo referred to.
3. After resolving this ambiguity, we collected each unique country name in the attacks dataset and built a mapping from country names to alpha2 codes using the Google Maps API. Then we added a field called iso_code to the dataset, filled with the alpha2 code obtained from the mapping.
4. After that, we had one last issue: the API could not find the alpha2 codes for Ireland and Namibia, so we added them by hand.
5. We deleted all attacks for which we had no country indicators: if the country of an attack was not present in the Country table, we deleted the row corresponding to this attack.
6. At this point we were able to join the two datasets using the Country table, but since all indicators denote the country by an alpha3 code, we built a mapping from alpha2 codes to alpha3 codes and transformed the iso_code field so that it contains the alpha3 code instead. Had we not performed this operation, every future join would have required an extra join through the Country table to get the alpha3 code of an attack. Now a single join suffices, directly matching the alpha3 code of the indicators against the iso_code of the attacks table.
7. Finally, we saved the cleaned attacks dataset and added it to the SQLITE database so we can easily use SQL to query our data.
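With the alpha3 iso_code in place, the single-join situation described in step 6 can be sketched as follows. This is a toy in-memory reproduction of the schema (the real data lives in database.sqlite; table and column names follow the ones used throughout this notebook):

```python
import sqlite3
import pandas as pd

# Toy in-memory reproduction of the two tables we join.
connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE Attacks (iyear INTEGER, iso_code TEXT);
CREATE TABLE Indicators (CountryCode TEXT, Year INTEGER,
                         IndicatorCode TEXT, Value REAL);
INSERT INTO Attacks VALUES (2001, 'USA'), (2001, 'USA'), (2002, 'FRA');
INSERT INTO Indicators VALUES ('USA', 2001, 'NY.GDP.PCAP.CD', 37133.2),
                              ('FRA', 2002, 'NY.GDP.PCAP.CD', 24177.3);
""")

# Thanks to the alpha3 iso_code stored on Attacks, a single join is enough:
# no intermediate hop through the Country table is needed anymore.
query = """
SELECT a.iyear, a.iso_code, i.Value
FROM Attacks a
JOIN Indicators i ON i.CountryCode = a.iso_code AND i.Year = a.iyear
"""
joined = pd.read_sql_query(query, connection)
```

Each attack row picks up the matching indicator value for its country and year in one pass.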
In [2]:
import json
from urllib.request import urlopen
from urllib.parse import quote_plus
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook
from multiprocessing import Pool, cpu_count
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import sqlite3
import warnings
warnings.filterwarnings('ignore')
In [24]:
# This function takes the name of a place on Earth (countries in our case) and makes a request
# to the Google Maps API in order to "normalize" the location and get the associated country code.
# This country code will help to join the terrorist attacks with the country indicators.
def getplace(place):
    # Google Maps API base URL
    url = "https://maps.googleapis.com/maps/api/geocode/json?"
    # Add the parameters to the base URL
    url += "address=%s&sensor=false&key=AIzaSyAk90LdWroCWnkWwOVEB_89kAzz1uPCwo0" % (quote_plus(place))
    v = urlopen(url).read()
    j = json.loads(v)
    try:
        components = j['results'][0]['address_components']
        long = short = None
        for c in components:
            if "country" in c['types']:
                long = c['long_name']
                short = c['short_name']
        return long, short
    except (IndexError, KeyError):
        # print('-------------', place)
        return None, None

# Function that normalizes a location name, used to build the mapping from
# location names to country codes.
def mapping(n):
    return n, getplace(n)[1]
In [ ]:
# Data folder path
data_path = '../data/'
# Get the number of cores available for parallelization
n_cores = cpu_count()
attacks = pd.read_csv(data_path + 'globalterrorismdb_0617dist.csv', encoding="ISO-8859-1")
countries = pd.read_csv(data_path + 'Country.csv')

# Treat Congo names ambiguity
attacks.loc[attacks.country_txt.str.contains('People\'s Republic of the Congo'),
            'country_txt'] = 'Republic of the Congo'

# Get all unique names
all_names = attacks.country_txt.unique()

# Create a mapping from country names to alpha2 country codes
pool = Pool(n_cores)
name_to_code = tqdm_notebook(pool.imap_unordered(mapping, all_names),
                             total=all_names.size, desc='API calls')
name_to_code = {k: v for k, v in name_to_code}

# Add field with the alpha2 iso_code
attacks['iso_code'] = attacks.country_txt.apply(lambda x: name_to_code[x])

# Treat special countries with no match for iso_code
attacks.loc[attacks.country_txt == 'Ireland', 'iso_code'] = 'IE'
attacks.loc[attacks.country_txt == 'Namibia', 'iso_code'] = 'NA'

# Align alpha2 codes in the Country table, then delete attacks
# for which no country indicators are available
countries.loc[countries.CountryCode == 'NAM', 'Alpha2Code'] = 'NA'
countries.loc[countries.CountryCode == 'KSV', 'Alpha2Code'] = 'XK'
attacks = attacks[attacks.iso_code.isin(countries.Alpha2Code)]

# Build a mapping from alpha2 country codes to alpha3 country codes
alpha2_to_alpha3 = dict(zip(countries.Alpha2Code, countries.CountryCode))

# Transform the iso_code field with the mapping to get the alpha3 code
attacks.iso_code = attacks.iso_code.apply(lambda iso: alpha2_to_alpha3[iso])
# Save dataframe
attacks.to_csv(data_path + 'attacks_cleaned.csv')
In [4]:
connection = sqlite3.connect("../api/data/database.sqlite")
attacks = pd.read_sql_query("SELECT * FROM Attacks", connection)
attacks.head()
Out[4]:
In [4]:
# List of all headers from the Attacks table
list(attacks)
Out[4]:
Analyzing this list along with the dataset documentation, we see that many fields are subcategories of, or depend on, another broader field. Here are a few examples:
Hence, unless we are interested in very specific features of an attack, we decided to ignore those subcategories, as they give too much detail. We therefore chose the features that give the most relevant information to the average user, keeping in mind that we cannot flood them with too much information.
Here is the list of the information the user will have access to concerning the attacks:
This information makes it possible to understand the situation of a given country in a clear and simple way (more details about how we present it will come later).
Let's dive into an example. What kind of results can one expect for the USA?
In [33]:
attack_types_USA = pd.read_sql_query('SELECT attacktype1_txt, num_attacks FROM (SELECT attacktype1_txt, COUNT(attacktype1_txt) num_attacks FROM Attacks WHERE iso_code="USA" GROUP BY attacktype1_txt) ORDER BY num_attacks DESC', connection)
ax = attack_types_USA.set_index('attacktype1_txt').plot(kind='barh', figsize=(10,5), title='Type of attacks distribution \n', legend=False)
ax.set(xlabel='\nNumber of attacks', ylabel='Type of attacks\n')
plt.show()
In [34]:
attack_targets_USA = pd.read_sql_query('SELECT targtype1_txt, num_attacks FROM (SELECT targtype1_txt, COUNT(targtype1_txt) num_attacks FROM Attacks WHERE iso_code="USA" GROUP BY targtype1_txt) ORDER BY num_attacks DESC LIMIT 10', connection)
ax = attack_targets_USA.set_index('targtype1_txt').plot(kind='barh', figsize=(10,5), title='Type of targets distribution \n', legend=False)
ax.set(xlabel='\nNumber of attacks', ylabel='Type of targets\n')
plt.show()
In [35]:
attack_perpetrators_USA = pd.read_sql_query('SELECT gname, num_attacks FROM (SELECT gname, COUNT(gname) num_attacks FROM Attacks WHERE iso_code="USA" GROUP BY gname) ORDER BY num_attacks DESC LIMIT 10', connection)
ax = attack_perpetrators_USA.set_index('gname').plot(kind='barh', figsize=(10,5), title='Perpetrators attack distribution \n', legend=False)
ax.set(xlabel='\nNumber of attacks', ylabel='Perpetrators\n')
plt.show()
In [36]:
num_victims_USA = pd.read_sql_query('SELECT iyear, SUM(nkill) FROM Attacks WHERE iso_code="USA" GROUP BY iyear', connection)
ax = num_victims_USA.set_index('iyear').plot(kind='bar', figsize=(10,5), title='Number of victims per year \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Number of victims\n')
plt.show()
One can immediately notice the September 11 attacks.
In [37]:
num_attacks_USA = pd.read_sql_query('SELECT iyear, COUNT(*) FROM Attacks WHERE iso_code="USA" GROUP BY iyear', connection)
ax = num_attacks_USA.set_index('iyear').plot(kind='bar', figsize=(10,5), title='Number of attacks per year \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Number of attacks\n')
plt.show()
We see that these few plots already contain a lot of information that can be very interesting to analyze. The attentive reader will notice that the Unknown type appears in each of the first three plots. Although it might seem worthless at first, we decided to keep it, as it testifies to the lack of information about some attacks.
Since we are developing a website to offer the best possible experience to the user/reader, we set up an API to make requests and fetch the data. Each request translates into a SQL query against the server's database, where all the needed information is stored.
Example of requests for the API (from ./api/api.py):
In [ ]:
@app.route('/attacks/perpetrators/<string:country>')
def attack_perpetrators_by_country(country):
    """Return the perpetrators of the given country with their number of attacks, in descending order."""
    cur = get_db().execute('SELECT gname, num_attacks FROM (SELECT gname, COUNT(gname) num_attacks FROM Attacks WHERE iso_code="{}" GROUP BY gname) ORDER BY num_attacks DESC'.format(country))
    attack_perpetrators = cur.fetchall()
    cur.close()
    return jsonify(attack_perpetrators)
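A caveat on the query above: formatting the country code directly into the SQL string works for our trusted inputs, but a bound parameter is the safer, idiomatic sqlite3 way to pass values. A minimal sketch against a toy in-memory table (the table contents here are made up for illustration):

```python
import sqlite3

# Toy table standing in for the real Attacks table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Attacks (iso_code TEXT, gname TEXT);
INSERT INTO Attacks VALUES ('USA', 'Group A'), ('USA', 'Group A'),
                           ('USA', 'Group B');
""")

# Same aggregation as the route, but the country code is passed as a
# bound parameter (?) instead of being formatted into the SQL string.
cur = conn.execute(
    "SELECT gname, COUNT(*) AS num_attacks FROM Attacks "
    "WHERE iso_code = ? GROUP BY gname ORDER BY num_attacks DESC",
    ("USA",))
rows = cur.fetchall()
```

The `?` placeholder lets the driver escape the value, which guards against SQL injection if the endpoint is ever exposed publicly.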
We made good progress on the website, especially the data visualization part. First, when the user lands on the page, a world map is displayed, with the number of terrorist attacks in each country represented by a grayscale level. Help is also available in the bottom-right corner, as can be seen in the interaction GIF hereafter.
From there, the user can interact with the map and click on countries to zoom in and plot the coordinates of the terrorist attacks. The density of the attacks is shown using the plasma colorscale:
Here is a GIF of the interaction with the map:
When zoomed in on a country, the user can further explore details about the number of attacks/victims per year, as well as the most active terrorist groups, most targeted victim types and most common attack types for that country:
Here is a static view of the detailed view for a country:
For that we used d3.js, JavaScript ES2016 (with Yarn) and Webpack. The interfaces/menus/icons are designed with http://www.materializecss.com/ and the plots with https://gionkunz.github.io/chartist-js/.
We created an API server with Flask to fetch and serve the data from the SQLite database. Everything will eventually be hosted on a website; for the moment, the code lives in the ./api/ folder for the API and in ./website/ for the website.
If public hosting turns out to be insufficient, we will provide a script that makes it easy to run the website locally.
In [38]:
dev_indicators = pd.read_sql_query("SELECT * FROM Indicators", connection)
dev_indicators.head()
Out[38]:
How many entries are there?
In [57]:
print('The dataframe contains {} rows.'.format(dev_indicators.shape[0]))
Again, we have a lot of indicators and we need to choose the most relevant ones. The first criterion is the number of years for which an indicator is available, and the second is the number of countries it covers. The information will be sufficiently and evenly represented only if both of these metrics are high.
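These two criteria translate directly into a filter on per-indicator counts. A toy sketch (the indicator codes and thresholds here are illustrative, not the ones we actually used):

```python
import pandas as pd

# Toy summary table in the same shape as the per-indicator counts
# computed by the SQL query below.
summary = pd.DataFrame({
    'IndicatorCode': ['A', 'B', 'C'],
    'numYears':      [55,   2,  48],
    'numCountries':  [200, 30, 120],
})

# Keep only indicators observed over many years and many countries.
well_covered = summary[(summary.numYears >= 40) & (summary.numCountries >= 100)]
```

Indicator B, available for only two years, is dropped; the well-covered ones survive.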
In [48]:
dev_indicators_summarize = pd.read_sql_query("SELECT IndicatorCode, IndicatorName, COUNT(DISTINCT Year) as numYears, COUNT(DISTINCT CountryName) as numCountries FROM Indicators GROUP BY IndicatorCode", connection)
dev_indicators_summarize[:15]
Out[48]:
Analyzing just the first few entries of the dataframe, we can already see which indicators to avoid. It is hard to draw strong conclusions from an indicator that is available for only two years, and the more countries over which we can compare values, the better. Finally, the chosen indicators have to be pertinent. After several careful readings, we decided to consider the following development indicators:
In [63]:
chosen_indicators = ['EG.USE.ELEC.KH.PC', 'EG.FEC.RNEW.ZS', 'EN.ATM.CO2E.KT',
                     'EN.POP.DNST', 'FI.RES.XGLD.CD', 'IS.AIR.PSGR', 'IT.NET.USER.P2',
                     'MS.MIL.MPRT.KD', 'MS.MIL.XPRT.KD', 'NE.EXP.GNFS.CD', 'NE.EXP.GNFS.KD.ZG',
                     'NE.IMP.GNFS.CD', 'NE.IMP.GNFS.KD.ZG', 'NY.GDP.MKTP.CD', 'NY.GDP.PCAP.CD',
                     'NY.GNP.MKTP.CD', 'NY.GNP.PCAP.CD', 'SH.DYN.MORT', 'SH.MED.BEDS.ZS',
                     'SP.DYN.CBRT.IN', 'SP.DYN.CDRT.IN', 'SP.POP.0014.TO.ZS', 'SP.POP.1564.TO.ZS',
                     'SP.POP.65UP.TO.ZS', 'SP.POP.GROW', 'SP.POP.TOTL']
dev_indicators_summarize[dev_indicators_summarize['IndicatorCode'].isin(chosen_indicators)]
Out[63]:
We decided to represent the development status of a country based on different categories:
These were the most relevant and reliable indicators we found. Again, let's look at the values of some indicators for the USA.
In [69]:
arms_imported_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="MS.MIL.MPRT.KD"', connection)
ax = arms_imported_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Arms imports \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Imported arms\n')
plt.show()
In [70]:
hospital_beds_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="SH.MED.BEDS.ZS"', connection)
ax = hospital_beds_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Hospital beds \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Number of hospital beds per 1,000\n')
plt.show()
In [73]:
pop_growth_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="SP.POP.GROW"', connection)
ax = pop_growth_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Population growth \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Population growth in %\n')
plt.show()
In [67]:
electric_consumption_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="EG.USE.ELEC.KH.PC"', connection)
ax = electric_consumption_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Electric consumption \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Electric consumption (kWh per capita)\n')
plt.show()
For milestone 1, we proposed some ideas to tackle the problem of correlation comparison and asked for recommendations from the TAs.
We will quickly recall the propositions:
Following this advice and these remarks, we will use a mathematically grounded technique (4), looking for correlations with a lag of up to 5 years (or maybe 3 years if "this is asking too much").
Pearson correlation coefficient
We might replace this Pearson correlation coefficient with something more robust or better suited to non-linear relationships:
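As a sketch of what such a replacement buys us, here is a toy comparison between the Pearson coefficient and the rank-based Spearman coefficient (computed by hand with numpy; the series values are made up):

```python
import numpy as np

# Toy yearly series: a development indicator and a GTI-like score that
# grows monotonically but non-linearly with it.
indicator = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gti_score = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

# Pearson: measures only the *linear* part of the relationship.
pearson = np.corrcoef(indicator, gti_score)[0, 1]

# Spearman: Pearson applied to the ranks, so any monotone relationship
# (linear or not) scores a perfect 1.
ranks = lambda x: np.argsort(np.argsort(x))
spearman = np.corrcoef(ranks(indicator), ranks(gti_score))[0, 1]
```

On this monotone but curved pair, Spearman reaches 1 while Pearson stays strictly below it, which illustrates why a rank-based coefficient can be the more robust choice.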
The delicate part is to see whether a link can exist between a pair of indicators, one chosen from each of these two databases. However, this way of proceeding has a severe weak point: comparing only two indicators amounts to reducing a causal link to a single variable when it actually involves several. To partially overcome this issue, we decided to aggregate a subset of the GTD fields using the Global Terrorism Index (GTI) scoring system. This score is used in the well-known report of the same name, published annually by the Institute for Economics and Peace (IEP), to account for the relative impact of terrorism in a given year. The score, determined by the GPI expert panel, is computed from four factors with different weights associated with them. A description of the formula can be found here.
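Our weighting of the four factors, as implemented in the SQL query below, can be stated compactly: for each year, raw score = 1 per incident + 3 per fatality + 0.5 per injury + 2 per attack with recorded property damage. A toy pandas equivalent (the attack records here are invented for illustration):

```python
import pandas as pd

# Toy attack records: one row per attack; a non-null propextent means
# some property damage was recorded for the attack.
attacks = pd.DataFrame({
    'iyear':      [2001, 2001, 2002],
    'nkill':      [3, 0, 1],
    'nwound':     [10, 2, 0],
    'propextent': [3.0, None, 4.0],
})

# Raw GTI-style score per year: 1 per incident, 3 per fatality,
# 0.5 per injury, 2 per attack with recorded property damage.
score = attacks.groupby('iyear').apply(
    lambda g: len(g) + 3 * g.nkill.sum() + 0.5 * g.nwound.sum()
              + 2 * g.propextent.notna().sum())
```

For 2001 this gives 2 incidents + 9 + 6 + 2 = 19, matching term by term what the SQL computes on the real data.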
Let's look at the raw score of the USA as an example.
In [10]:
raw_score_USA = pd.read_sql_query('SELECT iyear, (1*COUNT(*) + 3*SUM(nkill) + 0.5*SUM(nwound) + 2*SUM(case propextent when 1.0 then 1 else 0 end) + 2*SUM(case propextent when 2.0 then 1 else 0 end) + 2*SUM(case propextent when 3.0 then 1 else 0 end) + 2*SUM(case propextent when 4.0 then 1 else 0 end)) FROM Attacks WHERE iso_code="USA" GROUP BY iyear', connection)
ax = raw_score_USA.set_index('iyear').plot(kind='bar', figsize=(10,5), title='GTI raw score \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Raw score \n')
plt.show()
Of course, a similar index for the development indicators would be very desirable. Unfortunately, due to the very different nature of the indicators we selected, it was difficult from our position to define an index that encapsulates all these aspects with justifiable reasons. Hence, as a precaution, we decided to leave the development indicators as they are.
This doesn't mean that we cannot draw conclusions from the data we have. Inspecting some countries, we indeed found interesting results where we think terrorism is directly, though not exclusively, correlated. Let's go through an example with Switzerland.
As a reminder, on the first graph we have the GTI score in yellow, the number of victims in red and the number of attacks in purple. On the second graph, we have the number of tourist arrivals, as indicated.
Observing this plot, we can see that Switzerland's tourism suffered drastically from a particular terrorist attack. Indeed, the attack on the cantonal parliament of Zug on 27 September 2001, in which 15 people died, shook the whole country and forced it to take stronger security measures. Other similar analyses are available on our website as examples.
Hence, we see that even with a limited number of indicators and an uncomplicated analysis, we are able to extract valuable information about the context in which a particular country evolves.
Again, we would like to emphasize that these analyses should not be considered absolute truth; rather, they give insight into how things can be related. In this project, we tried to explain things through the lens of terrorism, but many other perspectives could be developed.
To explore our project with a lot more interaction and read our data story, visit our website!