Impact of Terrorism on World Development

To explore our project with much more interactivity and to read our data story, visit our website!

Datasets description

Global Terrorism Database

The Global Terrorism Database (GTD) is an open-source database with information on more than 170,000 terrorist events around the world from 1970 through 2016 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that occurred during this period. Learn more.

For precise information about important details such as definitions and collection methodology, please see the GTD Codebook.

Format : CSV      Size : 29 MB

World Development Indicators

The World Development Indicators from the World Bank contain over a thousand annual indicators of economic development from hundreds of countries around the world.

Here's a list of the available indicators along with a list of the available countries.

Format : SQLITE      Size : 261 MB

Data cleaning and merging

Main data cleaning and merging constraints

The main purpose of our project is to relate country statistics from different development indicators to the periods in which terrorist attacks occurred. To do so, we have two datasets at our disposal: one with the country indicators, which comes in a very convenient SQLITE format, and the other containing the terrorist attacks and the data related to them (where they happened, which type of attack they were, etc.).

The first important part is to clean the datasets and merge them. Indeed, we would like to perform join operations between indicators and terrorist attacks. The main problem to overcome is the lack of an agreed convention to uniquely identify countries across the two datasets. In the first one, all indicators use alpha3 codes (3-letter codes) to denote countries, and a Country table provides the alpha2 code (2 letters) for each country. In the other dataset, countries are identified with different codes that offer no simple way to merge. Moreover, the country names used are not exactly the same in the two datasets.

Our solution

The solution we found was to use the Google Maps API and send it requests with the country names coming from the second dataset. This API is robust to spelling variations in country names and can "normalize" them for us by returning the alpha2 code of a country given its name.
Here is the detail of our process:

1. First, we exported the Country table from the first dataset to CSV and took the attacks dataset (already in CSV format) in order to process them easily with pandas. We only need the Country table from the first dataset to perform the cleaning and merging step.

2. Before normalizing the country names using the Google Maps API, we needed to resolve a name ambiguity by hand for the Republic of the Congo. In the attacks dataset, this country is denoted as People's Republic of the Congo, and the API couldn't understand this name, so we replaced occurrences of People's Republic of the Congo with Republic of the Congo. The issue was due to the similarity between the two countries Congo and Republic of the Congo, so the API couldn't figure out which one People's Republic of the Congo referred to.

3. After resolving this ambiguity, we collected each unique country name in the attacks dataset and built a mapping from country names to alpha2 codes using the Google Maps API. Then, we added a field called iso_code to the dataset, filled with the alpha2 code obtained from the mapping.

4. After that, one last issue remained: the API found no alpha2 code for Ireland and Namibia, so we added them by hand.

5. We deleted all attacks for which we had no country indicators: if the country of an attack was not present in the Country table, we deleted the row corresponding to that attack.

6. At this point we were able to join the two datasets through the Country table, but since all indicators denote countries by alpha3 codes, we built a mapping from alpha2 to alpha3 codes and transformed the iso_code field so that it contains alpha3 codes instead. Had we not performed this operation, every future join would have needed an extra join through the Country table to get the alpha3 code of an attack. Now a single join suffices, directly matching the alpha3 code of the indicators against the iso_code of the attacks table, as shown in the sketch after this list.

7. To conclude, we saved the cleaned attacks dataset and added it to the SQLITE database so we can easily use SQL to query our data.
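
As a sanity check of this design, a single join now pairs each attack with the indicators of the country it occurred in. Here is a minimal sketch of such a join (the indicator code NY.GDP.PCAP.CD is just an example; table and column names are the ones used throughout this notebook):

In [ ]:
import sqlite3
import pandas as pd

connection = sqlite3.connect('../api/data/database.sqlite')
# Single join: Attacks.iso_code now holds alpha3 codes, matching
# Indicators.CountryCode directly, with no detour through the Country table.
attacks_with_gdp = pd.read_sql_query(
    'SELECT a.iyear, a.iso_code, COUNT(*) AS num_attacks, i.Value AS gdp_per_capita '
    'FROM Attacks a '
    'JOIN Indicators i ON i.CountryCode = a.iso_code AND i.Year = a.iyear '
    'WHERE i.IndicatorCode = "NY.GDP.PCAP.CD" '
    'GROUP BY a.iso_code, a.iyear', connection)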


In [2]:
import json
from urllib.request import urlopen
from urllib.parse import quote_plus
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook
from multiprocessing import Pool, cpu_count
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import sqlite3

import warnings
warnings.filterwarnings('ignore')

In [24]:
# This function takes the name of a place on Earth (countries in our case) and makes a request
# to the Google Maps API in order to "normalize" the location and get the associated country code.
# This country code will help to join the terrorist attacks with the country indicators.
def getplace(place):
    # Google Maps API base URL
    url = "https://maps.googleapis.com/maps/api/geocode/json?"
    # Add the parameters to the base URL
    url += "address=%s&sensor=false&key=AIzaSyAk90LdWroCWnkWwOVEB_89kAzz1uPCwo0" % (quote_plus(place))
    v = urlopen(url).read()
    j = json.loads(v)
    try:
        components = j['results'][0]['address_components']
        long = short = None
        for c in components:
            if "country" in c['types']:
                long = c['long_name']
                short = c['short_name']
        return long, short
    except (IndexError, KeyError):
        # No country component found for this place name
        return None, None


# Normalize a location name in order to build a mapping from
# location names to country codes.
def mapping(n):
    return n, getplace(n)[1]

In [ ]:
# Data folder path
data_path = '../data/'
# Get the number of cores available for parallelization
n_cores = cpu_count()

attacks = pd.read_csv(data_path + 'globalterrorismdb_0617dist.csv', encoding="ISO-8859-1")
countries = pd.read_csv(data_path + 'Country.csv')
# Treat the Congo names ambiguity
attacks.loc[attacks.country_txt.str.contains('People\'s Republic of the Congo'),
        'country_txt'] = 'Republic of the Congo'
# Get all unique names
all_names = attacks.country_txt.unique()
# Create a mapping from country names to alpha2 country codes
pool = Pool(n_cores)
name_to_code = tqdm_notebook(pool.imap_unordered(mapping, all_names),
        total=all_names.size, desc='API calls')
name_to_code = {k: v for k, v in name_to_code}
# Add a field with the alpha2 code
attacks['iso_code'] = attacks.country_txt.apply(lambda x: name_to_code[x])
# Treat the special countries with no match for iso_code
attacks.loc[attacks.country_txt == 'Ireland', 'iso_code'] = 'IE'
attacks.loc[attacks.country_txt == 'Namibia', 'iso_code'] = 'NA'
# Fix missing/ambiguous alpha2 codes in the Country table (Namibia, Kosovo)
countries.loc[countries.CountryCode == 'NAM', 'Alpha2Code'] = 'NA'
countries.loc[countries.CountryCode == 'KSV', 'Alpha2Code'] = 'XK'
# Delete attacks where no country indicators are available
attacks = attacks[attacks.iso_code.isin(countries.Alpha2Code)]
# Build a mapping from alpha2 country codes to alpha3 country codes
alpha2_to_alpha3 = countries[['Alpha2Code', 'CountryCode']]
alpha2_to_alpha3 = dict(alpha2_to_alpha3.apply(lambda x: (x.Alpha2Code, x.CountryCode),
    axis=1).values)
# Transform the iso_code field with the previously created mapping to get alpha3 codes
attacks.iso_code = attacks.iso_code.apply(lambda iso: alpha2_to_alpha3[iso])
# Save the dataframe
attacks.to_csv(data_path + 'attacks_cleaned.csv')






Data exploration

The first part of the database concerns the attacks. Let's see what it looks like.


In [4]:
connection = sqlite3.connect("../api/data/database.sqlite")
attacks = pd.read_sql_query("SELECT * FROM Attacks", connection)
attacks.head()


Out[4]:
eventid iyear imonth iday approxdate extended resolution country country_txt ... scite1 scite2 scite3 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY related iso_code
0 0 197000000001 1970 7 2 0 58 Dominican Republic ... PGIS 0 0 0 0 DOM
1 1 197000000002 1970 0 0 0 130 Mexico ... PGIS 0 1 1 1 MEX
2 2 197001000001 1970 1 0 0 160 Philippines ... PGIS -9 -9 1 1 PHL
3 3 197001000002 1970 1 0 0 78 Greece ... PGIS -9 -9 1 1 GRC
4 4 197001000003 1970 1 0 0 101 Japan ... PGIS -9 -9 1 1 JPN

5 rows × 137 columns


In [4]:
# List of all headers from the Attacks table
list(attacks)


Out[4]:
['',
 'eventid',
 'iyear',
 'imonth',
 'iday',
 'approxdate',
 'extended',
 'resolution',
 'country',
 'country_txt',
 'region',
 'region_txt',
 'provstate',
 'city',
 'latitude',
 'longitude',
 'specificity',
 'vicinity',
 'location',
 'summary',
 'crit1',
 'crit2',
 'crit3',
 'doubtterr',
 'alternative',
 'alternative_txt',
 'multiple',
 'success',
 'suicide',
 'attacktype1',
 'attacktype1_txt',
 'attacktype2',
 'attacktype2_txt',
 'attacktype3',
 'attacktype3_txt',
 'targtype1',
 'targtype1_txt',
 'targsubtype1',
 'targsubtype1_txt',
 'corp1',
 'target1',
 'natlty1',
 'natlty1_txt',
 'targtype2',
 'targtype2_txt',
 'targsubtype2',
 'targsubtype2_txt',
 'corp2',
 'target2',
 'natlty2',
 'natlty2_txt',
 'targtype3',
 'targtype3_txt',
 'targsubtype3',
 'targsubtype3_txt',
 'corp3',
 'target3',
 'natlty3',
 'natlty3_txt',
 'gname',
 'gsubname',
 'gname2',
 'gsubname2',
 'gname3',
 'gsubname3',
 'motive',
 'guncertain1',
 'guncertain2',
 'guncertain3',
 'individual',
 'nperps',
 'nperpcap',
 'claimed',
 'claimmode',
 'claimmode_txt',
 'claim2',
 'claimmode2',
 'claimmode2_txt',
 'claim3',
 'claimmode3',
 'claimmode3_txt',
 'compclaim',
 'weaptype1',
 'weaptype1_txt',
 'weapsubtype1',
 'weapsubtype1_txt',
 'weaptype2',
 'weaptype2_txt',
 'weapsubtype2',
 'weapsubtype2_txt',
 'weaptype3',
 'weaptype3_txt',
 'weapsubtype3',
 'weapsubtype3_txt',
 'weaptype4',
 'weaptype4_txt',
 'weapsubtype4',
 'weapsubtype4_txt',
 'weapdetail',
 'nkill',
 'nkillus',
 'nkillter',
 'nwound',
 'nwoundus',
 'nwoundte',
 'property',
 'propextent',
 'propextent_txt',
 'propvalue',
 'propcomment',
 'ishostkid',
 'nhostkid',
 'nhostkidus',
 'nhours',
 'ndays',
 'divert',
 'kidhijcountry',
 'ransom',
 'ransomamt',
 'ransomamtus',
 'ransompaid',
 'ransompaidus',
 'ransomnote',
 'hostkidoutcome',
 'hostkidoutcome_txt',
 'nreleased',
 'addnotes',
 'scite1',
 'scite2',
 'scite3',
 'dbsource',
 'INT_LOG',
 'INT_IDEO',
 'INT_MISC',
 'INT_ANY',
 'related',
 'iso_code']

Analyzing this list along with the documentation of the dataset, we see that a lot of fields are subclasses of other, similar fields, or depend on them. Here are a few examples:

  • targtype1_txt, targsubtype1_txt
  • targtype2_txt, targsubtype2_txt
  • attacktype1_txt, attacktype2_txt, attacktype3_txt
  • gname, gsubname

Hence, unless we are interested in specific features of an attack, we decided to ignore those subcategories as they give too much detail. We therefore chose the features that provide the most relevant information to the average user, keeping in mind that we can't flood them with information.
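
For instance, a first pass at restricting the attacks table to first-level fields could look like the following sketch (the exact column subset is illustrative, not our final choice):

In [ ]:
# Illustrative subset: keep only first-level categories and the fields
# needed for the views listed below.
relevant_cols = ['eventid', 'iyear', 'country_txt', 'iso_code',
                 'latitude', 'longitude', 'nkill', 'nwound',
                 'attacktype1_txt', 'targtype1_txt', 'weaptype1_txt', 'gname']
attacks_relevant = attacks[relevant_cols]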

Here is the list of the information the user will have access to concerning the attacks:

  • coordinates of the attacks by country
  • number of attacks by country and year
  • number of victims by country and year
  • top target types by country
  • top weapon types by country
  • top perpetrator groups by country

This information allows the user to understand the situation of a given country in a clear and simple way (more details about how we present it will come later).

Let's dive into an example. What kind of results can one expect for the USA?


In [33]:
attack_types_USA = pd.read_sql_query('SELECT attacktype1_txt, num_attacks FROM (SELECT attacktype1_txt, COUNT(attacktype1_txt) num_attacks FROM Attacks WHERE iso_code="USA" GROUP BY attacktype1_txt) ORDER BY num_attacks DESC', connection)
ax = attack_types_USA.set_index('attacktype1_txt').plot(kind='barh', figsize=(10,5), title='Type of attacks distribution \n', legend=False)
ax.set(xlabel='\nNumber of attacks', ylabel='Type of attacks\n')
plt.show()



In [34]:
attack_targets_USA = pd.read_sql_query('SELECT targtype1_txt, num_attacks FROM (SELECT targtype1_txt, COUNT(targtype1_txt) num_attacks FROM Attacks WHERE iso_code="USA" GROUP BY targtype1_txt) ORDER BY num_attacks DESC LIMIT 10', connection)
ax = attack_targets_USA.set_index('targtype1_txt').plot(kind='barh', figsize=(10,5), title='Type of targets distribution \n', legend=False)
ax.set(xlabel='\nNumber of attacks', ylabel='Type of targets\n')
plt.show()



In [35]:
attack_perpetrators_USA = pd.read_sql_query('SELECT gname, num_attacks FROM (SELECT gname, COUNT(gname) num_attacks FROM Attacks WHERE iso_code="USA" GROUP BY gname) ORDER BY num_attacks DESC LIMIT 10', connection)
ax = attack_perpetrators_USA.set_index('gname').plot(kind='barh', figsize=(10,5), title='Perpetrators attack distribution \n', legend=False)
ax.set(xlabel='\nNumber of attacks', ylabel='Perpetrators\n')
plt.show()



In [36]:
num_victims_USA = pd.read_sql_query('SELECT iyear, SUM(nkill) FROM Attacks WHERE iso_code="USA" GROUP BY iyear', connection)
ax = num_victims_USA.set_index('iyear').plot(kind='bar', figsize=(10,5), title='Number of victims per year \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Number of victims\n')
plt.show()


One can immediately notice the September 11 attacks.


In [37]:
num_attacks_USA = pd.read_sql_query('SELECT iyear, COUNT(*) FROM Attacks WHERE iso_code="USA" GROUP BY iyear', connection)
ax = num_attacks_USA.set_index('iyear').plot(kind='bar', figsize=(10,5), title='Number of attacks per year \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Number of attacks\n')
plt.show()


We see that these few plots already contain a lot of information that can be very interesting to analyze. The attentive reader will notice that the Unknown type appears in each of the first three plots. Although it might seem worthless at first, we decided to keep it as it testifies to the lack of information about some attacks.

As we are developing a website to offer the best possible experience to the user/reader, we set up an API to make requests and fetch the data. Each request runs an SQL query against the server database, where all the needed information resides.

Example of requests for the API (from ./api/api.py):


In [ ]:
@app.route('/attacks/perpetrators/<string:country>')
def attack_perpetrators_by_country(country):
    """Return the perpetrator groups of the given country with their number of attacks, in descending order."""
    # Bind the country code as a parameter instead of formatting it into the
    # query string, to avoid SQL injection.
    cur = get_db().execute('SELECT gname, num_attacks FROM (SELECT gname, COUNT(gname) num_attacks FROM Attacks WHERE iso_code=? GROUP BY gname) ORDER BY num_attacks DESC', (country,))
    attack_perpetrators = cur.fetchall()
    cur.close()
    return jsonify(attack_perpetrators)
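
For example, querying this endpoint from Python might look like the following sketch (assuming the Flask server runs locally on its default port):

In [ ]:
import requests

# Hypothetical local call to the endpoint defined above.
resp = requests.get('http://localhost:5000/attacks/perpetrators/USA')
print(resp.json()[:3])  # top three perpetrator groups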

Website

We made good progress on the website, especially on the data visualization part. When the user lands on the page, a world map is displayed, with the number of terrorist attacks in each country represented by a grayscale level. Help is also available in the bottom-right corner, as can be seen in the interaction GIF hereafter.

From there, the user can interact with the map and click on countries to zoom in and plot the coordinates of the terrorist attacks. The density of the attacks is shown using the plasma colorscale:

Here is a GIF of the interaction with the map:

When the user is zoomed in on a country, they can further explore details about the number of attacks/victims for each year, as well as the most active terrorist groups, most targeted victims and most common attack types for that country:

Here is a static view of the detailed view for a country:

Technologies

For that we used d3.js, JavaScript ES2016 (with Yarn) and Webpack. The design relies on Materialize (http://www.materializecss.com/) for the interfaces/menus/icons and on Chartist (https://gionkunz.github.io/chartist-js/) for the beautiful plots.

We created an API server with Flask to fetch and serve the data from the SQLite database. Everything will be hosted on a website, but at the moment the API code lives in the ./api/ folder and the website code in ./website/.

Later in the project, a script will be provided to make it easy to run the website locally, in case the public hosting is not enough.

What's next?

From there, we need to incorporate the correlation part between terrorist attacks and the second main part of the database: the development indicators. Let's have a look at it.


In [38]:
dev_indicators = pd.read_sql_query("SELECT * FROM Indicators", connection)
dev_indicators.head()


Out[38]:
CountryName CountryCode IndicatorName IndicatorCode Year Value
0 Arab World ARB Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 1960 1.335609e+02
1 Arab World ARB Age dependency ratio (% of working-age populat... SP.POP.DPND 1960 8.779760e+01
2 Arab World ARB Age dependency ratio, old (% of working-age po... SP.POP.DPND.OL 1960 6.634579e+00
3 Arab World ARB Age dependency ratio, young (% of working-age ... SP.POP.DPND.YG 1960 8.102333e+01
4 Arab World ARB Arms exports (SIPRI trend indicator values) MS.MIL.XPRT.KD 1960 3.000000e+06

How many entries are there?


In [57]:
print('The dataframe contains {} rows.'.format(dev_indicators.shape[0]))


The dataframe contains 5656458 rows.

Again, we have a lot of indicators and need to choose the most relevant ones. Our first criterion is the number of years for which an indicator is available, and the second is the number of countries for which it is available. The information is sufficiently and equally represented only if both metrics are high.


In [48]:
dev_indicators_summarize = pd.read_sql_query("SELECT IndicatorCode, IndicatorName, COUNT(DISTINCT Year) as numYears, COUNT(DISTINCT CountryName) as numCountries  FROM Indicators GROUP BY IndicatorCode", connection)
dev_indicators_summarize[:15]


Out[48]:
IndicatorCode IndicatorName numYears numCountries
0 AG.AGR.TRAC.NO Agricultural machinery, tractors 49 219
1 AG.CON.FERT.PT.ZS Fertilizer consumption (% of fertilizer produc... 12 118
2 AG.CON.FERT.ZS Fertilizer consumption (kilograms per hectare ... 12 188
3 AG.LND.AGRI.K2 Agricultural land (sq. km) 53 242
4 AG.LND.AGRI.ZS Agricultural land (% of land area) 53 241
5 AG.LND.ARBL.HA Arable land (hectares) 53 207
6 AG.LND.ARBL.HA.PC Arable land (hectares per person) 53 240
7 AG.LND.ARBL.ZS Arable land (% of land area) 53 240
8 AG.LND.CREL.HA Land under cereal production (hectares) 53 213
9 AG.LND.CROP.ZS Permanent cropland (% of land area) 53 228
10 AG.LND.EL5M.ZS Land area where elevation is below 5 meters (%... 2 240
11 AG.LND.FRST.K2 Forest area (sq. km) 23 241
12 AG.LND.FRST.ZS Forest area (% of land area) 24 240
13 AG.LND.IRIG.AG.ZS Agricultural irrigated land (% of total agricu... 13 118
14 AG.LND.PRCP.MM Average precipitation in depth (mm per year) 12 190

Analyzing just the first few entries of the dataframe, we already see which indicators we should avoid: it is hard to draw strong conclusions from an indicator that is available for only two years, and the more countries we can compare indicator values across, the better. Finally, the indicators we choose have to be pertinent. After several careful passes, we decided to consider the following development indicators:


In [63]:
chosen_indicators = ['EG.USE.ELEC.KH.PC', 'EG.FEC.RNEW.ZS', 'EN.ATM.CO2E.KT',
                     'EN.POP.DNST', 'FI.RES.XGLD.CD', 'IS.AIR.PSGR', 'IT.NET.USER.P2',
                     'MS.MIL.MPRT.KD', 'MS.MIL.XPRT.KD', 'NE.EXP.GNFS.CD', 'NE.EXP.GNFS.KD.ZG',
                     'NE.IMP.GNFS.CD', 'NE.IMP.GNFS.KD.ZG', 'NY.GDP.MKTP.CD', 'NY.GDP.PCAP.CD',
                     'NY.GNP.MKTP.CD', 'NY.GNP.PCAP.CD', 'SH.DYN.MORT', 'SH.MED.BEDS.ZS',
                     'SP.DYN.CBRT.IN', 'SP.DYN.CDRT.IN', 'SP.POP.0014.TO.ZS', 'SP.POP.1564.TO.ZS',
                     'SP.POP.65UP.TO.ZS', 'SP.POP.GROW', 'SP.POP.TOTL']
dev_indicators_summarize[dev_indicators_summarize['IndicatorCode'].isin(chosen_indicators)]


Out[63]:
IndicatorCode IndicatorName numYears numCountries
360 EG.FEC.RNEW.ZS Renewable energy consumption (% of total final... 23 221
371 EG.USE.ELEC.KH.PC Electric power consumption (kWh per capita) 53 164
377 EN.ATM.CO2E.KT CO2 emissions (kt) 52 232
414 EN.POP.DNST Population density (people per sq. km of land ... 54 246
446 FI.RES.XGLD.CD Total reserves minus gold (current US$) 55 183
610 IS.AIR.PSGR Air transport, passengers carried 45 215
624 IT.NET.USER.P2 Internet users (per 100 people) 43 240
634 MS.MIL.MPRT.KD Arms imports (SIPRI trend indicator values) 55 217
640 MS.MIL.XPRT.KD Arms exports (SIPRI trend indicator values) 55 151
678 NE.EXP.GNFS.CD Exports of goods and services (current US$) 55 226
681 NE.EXP.GNFS.KD.ZG Exports of goods and services (annual % growth) 54 200
701 NE.IMP.GNFS.CD Imports of goods and services (current US$) 55 226
704 NE.IMP.GNFS.KD.ZG Imports of goods and services (annual % growth) 54 201
781 NY.GDP.MKTP.CD GDP at market prices (current US$) 55 239
789 NY.GDP.PCAP.CD GDP per capita (current US$) 55 239
805 NY.GNP.MKTP.CD GNI (current US$) 55 238
812 NY.GNP.PCAP.CD GNI per capita, Atlas method (current US$) 53 235
964 SH.DYN.MORT Mortality rate, under-5 (per 1,000) 56 225
977 SH.MED.BEDS.ZS Hospital beds (per 1,000 people) 53 231
1200 SP.DYN.CBRT.IN Birth rate, crude (per 1,000 people) 54 243
1201 SP.DYN.CDRT.IN Death rate, crude (per 1,000 people) 54 243
1215 SP.POP.0014.TO.ZS Population, ages 0-14 (% of total) 55 228
1216 SP.POP.1564.TO.ZS Population, ages 15-64 (% of total) 55 228
1217 SP.POP.65UP.TO.ZS Population ages 65 and above (% of total) 55 228
1221 SP.POP.GROW Population growth (annual %) 55 247
1224 SP.POP.TOTL Population, total 55 247

We decided to represent the development status of a country based on different categories:

  • Economy (exports/imports of good and services, GDP, GNI, ...)
  • Social health (mortality rate, hospital beds, ...)
  • Population (age classes ratio, total, population growth, ...)
  • Wealth (internet users, air transport passengers, renewable energy consumption, ...)

These were the most relevant and reliable indicators we found. Again, let's look at the values of some indicators for the USA.


In [69]:
arms_exported_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="MS.MIL.XPRT.KD"', connection)
ax = arms_exported_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Arms exports \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Exported arms\n')
plt.show()



In [70]:
hospital_beds_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="SH.MED.BEDS.ZS"', connection)
ax = hospital_beds_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Hospital beds \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Number of hospital beds per 1,000\n')
plt.show()



In [73]:
pop_growth_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="SP.POP.GROW"', connection)
ax = pop_growth_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Population growth \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Population growth in %\n')
plt.show()



In [67]:
electric_consumption_USA = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="USA" AND IndicatorCode="EG.USE.ELEC.KH.PC"', connection)
ax = electric_consumption_USA.set_index('Year').plot(kind='bar', figsize=(10,5), title='Electric consumption \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Electric consumption (kWh per capita)\n')
plt.show()


Brainstorming about implementation of the correlation comparison

For milestone 1, we proposed some ideas to tackle the problem of correlation comparison and asked the TAs for recommendations.

We will quickly recall the propositions:

  1. Take a neighboring country (geographically) that did not get attacked, and compare it with the targeted country.
    • TA: This is sensitive to outliers.
  2. Pick multiple neighboring countries that were not attacked, and average over them to compare with the targeted country.
    • TA: What if a country only has attacked neighbors?
  3. Try to find a country that has a high correlation with the targeted country for one or multiple indicators during the 5 years prior to the attack. From there, compare the same indicator(s) for the year of the attack and the 5 years afterwards.
    • TA: It's hard to predict what the correlation will give, and how do you choose the indicators without bias?
  4. Use a mathematically grounded technique such as the Pearson correlation coefficient.
    • TA: Seems better indeed.

Following this advice and these remarks, we will use a mathematically grounded technique (4), looking for correlation over the 5 years prior to an attack (or maybe 3 years if "this is asking too much").

Pearson correlation coefficient

We might replace the Pearson correlation coefficient with something more robust or better suited to non-linear relationships:

[Figure from Wikipedia: several sets of (x, y) points, with the correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle row), nor many aspects of nonlinear relationships (bottom row). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.]
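
As a sketch of how such coefficients can be computed over a fixed window of years, here is a small helper that returns both the Pearson coefficient and the rank-based Spearman coefficient (a more robust alternative); the function name and its arguments are illustrative:

In [ ]:
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def windowed_correlation(series_a, series_b, start_year, n_years=5):
    """Correlate two yearly Series (indexed by year) over [start_year, start_year + n_years)."""
    years = range(start_year, start_year + n_years)
    a = series_a.reindex(years)
    b = series_b.reindex(years)
    mask = a.notnull() & b.notnull()
    if mask.sum() < 3:  # too few overlapping years for a meaningful coefficient
        return None, None
    return (pearsonr(a[mask], b[mask])[0],
            spearmanr(a[mask], b[mask])[0])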

Implementation of the correlation

The delicate part is to see whether a link can exist between a pair of indicators, one chosen from each of the two databases. However, this way of proceeding has a severe weak point: comparing only two indicators reduces a causal link to a single variable when it actually involves several. To overcome part of this issue, we decided to aggregate a subset of the GTD fields using the Global Terrorism Index (GTI) scoring system. This score is used in the well-known report of the same name, published annually by the Institute for Economics and Peace (IEP), to account for the relative impact of terrorism in a given year. The score, determined by the GPI expert panel, is computed from four factors with different weights associated with them. A description of the formula can be found here.
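
Concretely, following our reading of the formula (with the property-damage term simplified to a single weight of 2, an assumption on our side), the raw score for a country and year is:

raw_score = 1 × (number of attacks) + 3 × (number of fatalities) + 0.5 × (number of injuries) + 2 × (number of incidents with property damage)

This is exactly what the SQL query below computes.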

Let's look at the raw score of the USA as an example.


In [10]:
raw_score_usa = pd.read_sql_query('SELECT iyear, (1*COUNT(*) + 3*SUM(nkill) + 0.5*SUM(nwound) + 2*SUM(case propextent when 1.0 then 1 else 0 end) + 2*SUM(case propextent when 2.0 then 1 else 0 end) + 2*SUM(case propextent when 3.0 then 1 else 0 end) + 2*SUM(case propextent when 4.0 then 1 else 0 end)) AS raw_score FROM Attacks WHERE iso_code="USA" GROUP BY iyear', connection)
ax = raw_score_usa.set_index('iyear').plot(kind='bar', figsize=(10,5), title='GTI \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Raw score \n')
plt.show()


Of course, a similar index for the development indicators would be very desirable. Unfortunately, due to the very different nature of the indicators we selected, it was difficult from our position to define an index that encapsulates all these aspects for justifiable reasons. Hence, as a precaution, we decided to leave the development indicators as is.

This doesn't mean that we can't draw conclusions from the data we have. Inspecting some countries, we indeed found some interesting results where we think terrorism is directly, but not solely, correlated. Let's go through an example with Switzerland.

As a reminder, on the first graph we have the GTI score in yellow, the number of victims in red and the number of attacks in purple. On the second graph, we have the number of tourism arrivals, as indicated.

Observing this plot, we can see that Switzerland's tourism suffered drastically from one particular terrorist attack. Indeed, the attack on the cantonal parliament of Zug on 27 September 2001, in which 15 people died, shook the whole country and forced it to take stronger security measures. Other similar analyses are available on our website as examples.
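
For reference, the second graph can be reproduced with a query like the following sketch; ST.INT.ARVL ("International tourism, number of arrivals") is our assumption for the indicator used:

In [ ]:
# Hypothetical reproduction of the tourism plot for Switzerland.
# ST.INT.ARVL is our guess at the indicator; swap in the actual code if it differs.
tourism_che = pd.read_sql_query('SELECT Year, Value FROM Indicators WHERE CountryCode="CHE" AND IndicatorCode="ST.INT.ARVL"', connection)
ax = tourism_che.set_index('Year').plot(kind='bar', figsize=(10,5), title='International tourism arrivals \n', legend=False)
ax.set(xlabel='\nYear', ylabel='Number of arrivals\n')
plt.show()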

Hence, we see that even with a limited number of indicators and an uncomplicated analysis, we are able to extract valuable information about the context in which a particular country develops.

Again, we would like to emphasize that these analyses should not be considered absolute truth; rather, they give insight into how things can be related. In this project, we tried to explain things through the lens of terrorism, but many other perspectives can be developed.

To explore our project with much more interactivity and to read our data story, visit our website!

Plan for the presentation

  1. Data acquisition: explain where the data comes from, what it contains and how we merged the two datasets.
  2. Data exploration and analysis: explain how we selected among all the indicators, and how and why we aggregated them to give better insight.
  3. Visualization: explain how we used visualization to show our data and results.