This is an open challenge: apply what you learnt from analysing Pride and Prejudice with spaCy to a dataset of real events. We have preprocessed the RAND Terrorism Dataset for this task, reducing it to 10,033 articles dated from 1968 to 2009.
Can you find out the following using the code you have written?
There are no right answers to any of these questions, and there might not even be an answer at all.
In [3]:
# To get you started we can import Pandas and Seaborn which might help you
# build a graph or visualisation of the data
%matplotlib inline
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import seaborn as sns
import spacy
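# 'en' is the shortcut model name from spaCy 1.x/2.x; spaCy 3+ needs a full
# package name such as 'en_core_web_sm'
nlp = spacy.load('en')
def read_file_to_list(file_name):
    # Return the file as a list of lines, one article per line
    with open(file_name, 'r', encoding='utf8') as file:
        return file.readlines()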
In [4]:
# The file has been re-encoded in UTF-8, the source encoding is Latin-1
terrorism_articles = read_file_to_list('data/rand-terrorism-dataset.txt')
In [5]:
# Create a list of spaCy Doc objects representing articles
terrorism_articles_nlp = [nlp(art) for art in terrorism_articles]
The groups and locations listed below are commonly mentioned in the full dataset, which you can verify for yourself using the same approaches as in the previous tutorial on Pride and Prejudice. You can process the full dataset at once into a single spaCy Doc using the read_file() function.
In [6]:
common_terrorist_groups = [
'taliban',
'al - qaeda',
'hamas',
'fatah',
'plo',
'bilad al - rafidayn'
]
common_locations = [
'iraq',
'baghdad',
'kirkuk',
'mosul',
'afghanistan',
'kabul',
'basra',
'palestine',
'gaza',
'israel',
'istanbul',
'beirut',
'pakistan'
]
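To check that these names really are common, one option is to count entity lemmas by label across all articles. The sketch below is only an illustration: `count_entity_lemmas` and `fake_doc` are hypothetical names, and the stand-in objects mimic just the `ents`, `lemma_` and `label_` attributes that the notebook reads from spaCy, so the counting logic can be run without loading a model.

```python
from collections import Counter
from types import SimpleNamespace


def count_entity_lemmas(docs, labels):
    """Count entity lemmas across docs, keeping only entities whose label is in labels."""
    counts = Counter()
    for doc in docs:
        counts.update(ent.lemma_ for ent in doc.ents if ent.label_ in labels)
    return counts


def fake_doc(*ents):
    # Hypothetical stand-in for a spaCy Doc: only .ents, .lemma_ and .label_ exist
    return SimpleNamespace(
        ents=[SimpleNamespace(lemma_=lemma, label_=label) for lemma, label in ents]
    )


# Toy data, not taken from the RAND dataset
toy_docs = [
    fake_doc(('hamas', 'ORG'), ('gaza', 'GPE')),
    fake_doc(('hamas', 'ORG'), ('israel', 'GPE'), ('gaza', 'GPE')),
]

group_counts = count_entity_lemmas(toy_docs, {'PERSON', 'ORG'})
location_counts = count_entity_lemmas(toy_docs, {'GPE'})
print(group_counts.most_common(5))
print(location_counts.most_common(5))
```

With the real data, the same function could be called as `count_entity_lemmas(terrorism_articles_nlp, {'GPE'})` and the top of `most_common()` compared against the lists above.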
In [7]:
location_entity_dict = defaultdict(Counter)
for article in terrorism_articles_nlp:
    # Get all the candidate group and location entities in the article
    article_terrorist_cands = [ent.lemma_ for ent in article.ents if ent.label_ == 'PERSON' or ent.label_ == 'ORG']
    article_location_cands = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']
    # Keep only the groups and locations we are interested in
    terrorist_candidates = [ent for ent in article_terrorist_cands if ent in common_terrorist_groups]
    location_candidates = [loc for loc in article_location_cands if loc in common_locations]
    # Count every (group, location) pair of mentions found in the same article
    for found_entity in terrorist_candidates:
        for found_location in location_candidates:
            location_entity_dict[found_entity][found_location] += 1
# Let's inspect a specific combination as a cursory check that the loop is operating correctly
location_entity_dict['plo']['beirut']
Out[7]:
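One thing to be aware of when reading these counts: the nested loop adds one for every pair of mentions, so an article that names a group three times and a location twice contributes six to that cell. The sketch below reproduces the counting on toy data and shows a possible variation that counts each pair at most once per article. The function name `cooccurrence_counts`, its `once_per_article` flag and the toy input are assumptions made for illustration, not part of the original notebook.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(articles, once_per_article=False):
    """Count (group, location) co-occurrences.

    articles: iterable of (groups, locations) pairs, each a list of strings
    holding the already-filtered mentions found in one article.
    """
    table = defaultdict(Counter)
    for groups, locations in articles:
        if once_per_article:
            # Collapse repeated mentions so each pair counts once per article
            groups, locations = set(groups), set(locations)
        for group in groups:
            for location in locations:
                table[group][location] += 1
    return table


# Toy data, not taken from the RAND dataset
toy_articles = [
    (['plo', 'plo', 'plo'], ['beirut', 'beirut']),
    (['plo'], ['beirut', 'israel']),
]

per_mention = cooccurrence_counts(toy_articles)
per_article = cooccurrence_counts(toy_articles, once_per_article=True)
print(per_mention['plo']['beirut'])  # 3 * 2 + 1 * 1 = 7
print(per_article['plo']['beirut'])  # one per article = 2
```

Which variant is more appropriate depends on the question: per-mention counts reward long articles that repeat names, while per-article counts are closer to "how many reports link this group to this place".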
In [8]:
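# Transform the dictionary into a pandas DataFrame (groups as columns, locations as rows).
# Pairs that never co-occur are NaN, so the cast to int has to wait until they are filled with zeroes.
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict))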
location_entity_full_df = location_entity_df.fillna(value=0).astype(int)
# Show DF to console
location_entity_full_df
Out[8]:
In [9]:
# Seaborn can transform a DataFrame directly into a figure
plt.figure()
hmap = sns.heatmap(location_entity_full_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False)
# Add features using the underlying plt interface
plt.title('Global Incidents by Terrorist group')
plt.xticks(rotation=30)
plt.show()
In [10]:
# You can also mask all the zero cells using the NaN positions of the unfilled DataFrame
heat_mask = location_entity_df.isnull()
# set_style changes the style for later plots, so call it before drawing;
# axes_style on its own only returns the style parameters
sns.set_style('white')
hmap = sns.heatmap(location_entity_full_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False, mask=heat_mask)
# Add features using the underlying plt interface
plt.title('Global Incidents by Terrorist group')
plt.xticks(rotation=30)
plt.show()