RAND Dataset analysis

This is an open challenge to apply what you have learnt from analysing Pride and Prejudice with spaCy to a dataset of real events. We have preprocessed the RAND Terrorism Dataset for this task, reducing the data to 10,033 articles from 1968 to 2009.

Can you find out the following using the code you have written?

  • Who are the terrorist groups and other persons mentioned in each article?
  • What locations are mentioned in each article? Hint: a location just has a different label to a person
  • From all of your entities, can you find out which named entities are terrorists from the syntactic relationships?
  • With all of this information, can you plot a figure expressing the relationships between locations and terrorists?

There are no right answers to any of these questions, and there might not even be an answer at all.
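As one possible starting point for the third question, here is a minimal sketch that treats an entity as a candidate attacker when it is the grammatical subject of an attack-related verb. It relies only on spaCy's `Span.root`, `Token.dep_`, `Token.head` and `Token.lemma_` attributes. The verb list `ATTACK_VERBS`, the helper `find_attacker_entities` and the `nsubj` rule are assumptions made for illustration, not part of the original notebook. The stand-in objects at the bottom only mimic those attributes so that the sketch runs without a language model; in the notebook you would pass a real `Doc` instead.

```python
from types import SimpleNamespace

# Hypothetical list of verb lemmas suggesting that the subject carried out an attack
ATTACK_VERBS = {'attack', 'bomb', 'kill', 'kidnap', 'hijack'}

def find_attacker_entities(doc, attack_verbs=ATTACK_VERBS):
    """Return the text of PERSON/ORG entities that are the subject of an attack verb."""
    attackers = []
    for ent in doc.ents:
        if ent.label_ not in ('PERSON', 'ORG'):
            continue
        root = ent.root  # the token linking the entity to the rest of the sentence
        if root.dep_ == 'nsubj' and root.head.lemma_ in attack_verbs:
            attackers.append(ent.text)
    return attackers

# Stand-ins that mimic only the attributes used above (toy data, not real parses)
def fake_entity(text, label, dep, head_lemma):
    root = SimpleNamespace(dep_=dep, head=SimpleNamespace(lemma_=head_lemma))
    return SimpleNamespace(text=text, label_=label, root=root)

example_doc = SimpleNamespace(ents=[
    fake_entity('Hamas', 'ORG', 'nsubj', 'bomb'),       # subject of "bomb"
    fake_entity('Reuters', 'ORG', 'nsubj', 'report'),   # subject of "report"
    fake_entity('Gaza', 'GPE', 'pobj', 'in'),           # a location, not a group
])
attackers = find_attacker_entities(example_doc)
print(attackers)
```

A rule this simple will miss passive constructions and nominalisations ("the bombing by ..."), so treat its output as candidates to inspect rather than as a classification.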

Example result using the full ~40,000 article dataset


In [3]:
# To get you started, we import pandas and seaborn, which might help you
# build a graph or visualisation of the data
%matplotlib inline

from collections import defaultdict, Counter

import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import seaborn as sns
import spacy

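# Note: the 'en' shortcut only works in spaCy 1.x/2.x; spaCy 3 and later
# need the full model package name, for example 'en_core_web_sm'
nlp = spacy.load('en')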

def read_file_to_list(file_name):
    with open(file_name, 'r', encoding='utf8') as file:
        return file.readlines()

In [4]:
# The file has been re-encoded in UTF-8; the source encoding was Latin-1
terrorism_articles = read_file_to_list('data/rand-terrorism-dataset.txt')

In [5]:
# Create a list of spaCy Doc objects representing articles
terrorism_articles_nlp = [nlp(art) for art in terrorism_articles]

Example solution

Define some geographical areas and groups to inspect

These are commonly mentioned in the full dataset, which you can verify for yourself using the same approaches as in the previous tutorial for Pride and Prejudice. You can process the full dataset at once into a single spaCy Doc using the read_file() function.
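The frequency check mentioned above can be sketched with a `Counter`. The helper name `most_common_entities` and the toy pairs are assumptions made for illustration; in the notebook the pairs would come from the entities of the processed articles, as shown in the comment.

```python
from collections import Counter

def most_common_entities(entity_pairs, wanted_labels, n=10):
    """Count lemmas whose label is in wanted_labels and return the n most frequent."""
    counts = Counter(lemma for lemma, label in entity_pairs if label in wanted_labels)
    return counts.most_common(n)

# In the notebook the pairs could be produced with:
#   ((ent.lemma_, ent.label_) for doc in terrorism_articles_nlp for ent in doc.ents)
# The list below is made-up toy data so that the sketch runs on its own.
toy_pairs = [
    ('taliban', 'ORG'), ('kabul', 'GPE'), ('taliban', 'ORG'),
    ('hamas', 'ORG'), ('gaza', 'GPE'), ('kabul', 'GPE'), ('kabul', 'GPE'),
]
top_groups = most_common_entities(toy_pairs, {'PERSON', 'ORG'}, n=2)
top_locations = most_common_entities(toy_pairs, {'GPE'}, n=1)
print(top_groups)      # [('taliban', 2), ('hamas', 1)]
print(top_locations)   # [('kabul', 3)]
```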


In [6]:
common_terrorist_groups = [
    'taliban', 
    'al - qaeda', 
    'hamas',  
    'fatah', 
    'plo', 
    'bilad al - rafidayn'
]

common_locations = [
    'iraq',
    'baghdad', 
    'kirkuk', 
    'mosul', 
    'afghanistan', 
    'kabul',
    'basra', 
    'palestine', 
    'gaza', 
    'israel', 
    'istanbul', 
    'beirut', 
    'pakistan'
]

Inspect each article for mentions of groups and locations


In [7]:
location_entity_dict = defaultdict(Counter)

for article in terrorism_articles_nlp:
    # Get all the group and location entities in the article
    article_terrorist_cands = [ent.lemma_ for ent in article.ents if ent.label_ in ('PERSON', 'ORG')]
    article_location_cands = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']

    # Keep only the groups and locations we are interested in
    terrorist_candidates = [ent for ent in article_terrorist_cands if ent in common_terrorist_groups]
    location_candidates = [loc for loc in article_location_cands if loc in common_locations]

    for found_entity in terrorist_candidates:
        for found_location in location_candidates:
            location_entity_dict[found_entity][found_location] += 1

# Inspect one specific combination as a quick check that the loop worked
location_entity_dict['plo']['beirut']


Out[7]:
12
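Note that the nested loop above counts every pairing of mentions, so a single article that names a group three times and a location twice adds six to that cell. If you would rather count each group and location pair once per article, one option is to deduplicate with sets first. The sketch below assumes a hypothetical helper, `count_article_cooccurrences`, and uses toy data in place of the filtered candidate lists.

```python
from collections import Counter, defaultdict

def count_article_cooccurrences(articles):
    """Count each (group, location) pair at most once per article.

    articles: iterable of (groups, locations) pairs of lists, one pair per article.
    """
    cooccurrence = defaultdict(Counter)
    for groups, locations in articles:
        for group in set(groups):            # drop repeated mentions of a group
            for location in set(locations):  # drop repeated mentions of a location
                cooccurrence[group][location] += 1
    return cooccurrence

# Toy data: the first article alone would contribute 3 x 2 = 6 with the mention-level loop
toy_articles = [
    (['plo', 'plo', 'plo'], ['beirut', 'beirut']),
    (['plo'], ['beirut', 'israel']),
]
per_article = count_article_cooccurrences(toy_articles)
print(per_article['plo']['beirut'])   # 2
print(per_article['plo']['israel'])   # 1
```

Which count is more useful depends on the question: mention-level counts reward long articles, while per-article counts treat every article equally.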

Transform defaultdict to a Pandas DataFrame


In [8]:
# Transform the dictionary into a pandas DataFrame; missing combinations become NaN
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict))
# Fill the NaN values with zeroes and convert the counts back to integers
location_entity_full_df = location_entity_df.fillna(value=0).astype(int)
# Display the DataFrame
location_entity_full_df


Out[8]:
             al - qaeda  bilad al - rafidayn  fatah  hamas  plo  taliban
afghanistan           6                    0      0      0    0      248
baghdad              20                   33      0      0    0        0
basra                 0                    4      0      0    0        0
beirut                0                    0      1      1   12        0
gaza                  0                    0      9     70    0        0
iraq                 56                   23      1      0    8        0
israel                1                    0     17     19   21        0
istanbul              3                    0      0      0    0        0
kabul                 2                    0      0      0    0       48
kirkuk                5                    0      0      0    0        0
mosul                14                    4      0      0    0        0
pakistan              6                    0      0      0    0       17
palestine             3                    6      0      0    1        0

In [9]:
# Seaborn can transform a DataFrame directly into a figure

plt.figure()
hmap = sns.heatmap(location_entity_full_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False)

# Add a title and rotate the tick labels using the underlying matplotlib (plt) interface
plt.title('Global Incidents by Terrorist group')
plt.xticks(rotation=30)
plt.show()



In [10]:
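# You can also mask all the zero figures using features of the DataFrame:
# combinations that never occurred are NaN in the unfilled DataFrame
heat_mask = location_entity_df.isnull()

# sns.axes_style() on its own only returns a style dictionary; sns.set_style()
# applies the style, and it must be called before the figure is drawn
sns.set_style('white')
hmap = sns.heatmap(location_entity_full_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False, mask=heat_mask)

# Add a title and rotate the tick labels using the underlying matplotlib (plt) interface
plt.title('Global Incidents by Terrorist group')
plt.xticks(rotation=30)
plt.show()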


