In 2010 WikiLeaks released a file that gave access to previously classified military documents pertaining to the war in Afghanistan. News organizations in particular were interested, but they faced a big challenge: how, if at all, could they visually depict some of this information for their readers? [1]
The analyzed dataset contains every recorded incident between 2004 and 2009 during the war in Afghanistan. This so-called war diary comes as a 74 MB CSV file with 76,911 rows and 34 columns. The most interesting columns contain dates, locations, incident reports, category information, the parties involved, and counts of the killed and wounded. In addition, open data from the Afghan elections of 2009 and 2010 provided counts of violent incidents per voting district, which allowed us to see where civilians were in the greatest danger during the elections.
The true face of war is often hidden from ordinary people in industrialized countries. People in war-troubled Middle Eastern countries suffer every day, and it is thanks to brave investigative journalists that these truths are revealed. Our motivation is to contribute by taking a closer look at the death tolls and incident summaries and thereby gain more insight into the war. With the help of machine learning methods we explore how ISAF forces and the civilian population in Afghanistan could benefit from such an analysis. Different visualization methods were chosen to convey the mood of this tragic war.
In [1]:
# true division for Python 2; a __future__ import should come before the other statements in this cell
from __future__ import division
# data structures
import pickle
import cPickle as cp
import json
from pprint import pprint
# Pandas contains useful functions for data structures with "relational" or "labeled" data
import pandas
# math and arrays
import math
import numpy as np
# plotting
import matplotlib.pyplot as plt
import geoplotlib
from geoplotlib.utils import BoundingBox
# word processing
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import operator
import itertools
# remove words that appear only once
from collections import defaultdict, Counter
# Visualizations
# Wordclouds
from wordcloud import WordCloud
# LDA
from gensim import corpora, models, similarities
# glossary building
import lxml.html
import bs4, lxml, re, requests
# ML
from sklearn import cluster
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
# header as suggested
# by WikiLeaks: https://wikileaks.org/afg/
# by the Guardian: http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-data
header = [
'ReportKey', # find messages and also to reference them
'DateOccurred', 'EventType',
'Category', # describes what kind of event the message is about
'TrackingNumber', 'Title', # internal tracking number and title
'Summary', # actual description of the event
'Region', # broader region of the event.
'AttackOn', # who was attacked during an event
'ComplexAttack', # signifies that an attack was a larger operation that required more planning, coordination and preparation
'ReportingUnit', 'UnitName', 'TypeOfUnit', # information on the military unit that authored the report
'FriendlyWounded', 'FriendlyKilled', 'HostNationWounded', 'HostNationKilled', 'CivilianWounded', 'CivilianKilled',
'EnemyWounded', 'EnemyKilled', 'EnemyDetained', # who was killed/wounded/captured
'MilitaryGridReferenceSystem', 'Latitude', 'Longitude', # location
'OriginatorGroup', 'UpdatedByGroup', # message originated from or was updated by
'CommandersCriticalInformationRequirements',
'Significant', # are analyzed and evaluated by special group in command centre
'Affiliation', # event was of friendly, neutral or enemy nature
'DisplayColor', # enemy activity - RED, friendly activity - BLUE, friend on friend - GREEN
'ClassificationLevel' # classification level of the message, e.g.: Secret
]
data = pandas.read_csv('../data/csv/afg.csv', header=None, names=header)
# lower case some columns see problems: https://wardiaries.wikileaks.org/search/?sort=date
data['Category'] = data['Category'].str.lower()
data['Title'] = data['Title'].str.lower()
data.head()
Out[1]:
The pandas library for Python provides data structures such as the DataFrame that make it easy to manipulate and extract the "labeled" data. It also brings many useful methods for data cleaning and preprocessing.
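As a small, hypothetical illustration of the kind of one-liner this enables on the loaded data (a sketch, not part of the original analysis), the recorded civilian deaths can be aggregated per region directly on the DataFrame:
# sketch only: total recorded civilian deaths per region
civ_per_region = data.groupby('Region')['CivilianKilled'].sum()
print civ_per_region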
In the following cells a small exploratory analysis of the data is provided: the number of rows and columns, the years in which the incidents took place, the number of categories, and the most and least commonly occurring categories are extracted and displayed. Lowercasing all columns that we wanted to work with was necessary to unify the data.
In [2]:
data['DateOccurred'] = pandas.to_datetime(data['DateOccurred'])
data['Year'] = [date.year for date in data['DateOccurred']]
data['Hour'] = [date.hour for date in data['DateOccurred']]
# number of rows/columns
print "Number of rows: %d" % data.shape[0]
print "Number of columns: %d" % data.shape[1]
date_range = set()
for date in data['DateOccurred']:
date_range.add(date.year)
print "\nYears:\n"
print list(date_range)
# occurrences of categories
print "\nNumber of unique categories: %d" % len(set(data['Category']))
# distribution of categories
n_occurrences = data['Category'].value_counts()
print "\nMost commonly occurring categories:\n"
print n_occurrences.head()
print "\nMost commonly occurring category is %s with %d" % (n_occurrences.argmax(), n_occurrences.max())
print "\nLeast commonly occurring category is %s with %d" % (n_occurrences.argmin(), n_occurrences.min())
In [3]:
# plot distribution of categories (TOP 50)
plt.style.use('ggplot')
%matplotlib inline
n_occurrences_top = n_occurrences[0:50]
# plot histogram
def barplot(series, title, figsize, ylabel, flag, rotation):
ax = series.plot(kind='bar',
title = title,
figsize = figsize,
fontsize = 13)
# set ylabel
ax.set_ylabel(ylabel)
# set xlabel (depending on the flag that comes as a function parameter)
ax.get_xaxis().set_visible(flag)
# set series index as xlabels and rotate them
ax.set_xticklabels(series.index, rotation= rotation)
barplot(n_occurrences_top,'Category occurrences', figsize=(14,6), ylabel = 'category count',flag = True, rotation = 90)
In [4]:
focus_categories = n_occurrences.index[0:8]
print focus_categories
In [5]:
def yearly_category_distribution(data, focus_categories):
index = 1
for category in focus_categories:
# filter table by type of category
db = data[data['Category'] == category]
# get year counts of that category
year_counts = db['Year'].value_counts()
# sort it (from 2004 to 2009)
year_counts = year_counts.sort_index()
# plot it
plt.subplot(7,2,index)
barplot(year_counts, category, figsize=(20,35), ylabel = 'category count', flag = True, rotation = 0)
index += 1
yearly_category_distribution(data, focus_categories)
All the military jargon, abbreviations and acronyms in the summaries need to be replaced by their meanings in order to obtain extended, more understandable descriptions. Luckily, the Guardian publicly released its list of acronyms online as a glossary that can be used here. The file can be downloaded and the information extracted with the HTML parsing library BeautifulSoup.
The additional column ExtendedSummary is appended to the data using the built glossary. Furthermore, we normalize the summaries to remove noise and avoid potential bias in the in-depth text analysis behind several of the visualizations.
The approaches used build on the natural language toolkit nltk.
In [6]:
# read the URL with the Guardian glossary of acronyms
link = 'http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-war-logs-glossary'
glossary = dict()
try:
    response = requests.get(link)
    if not response.ok:
        print 'HTTP error {} trying to fetch Guardian glossary: {}'.format(response.status_code, link)
    else:
        soup = bs4.BeautifulSoup(response.content, 'lxml')
        glossary_table = soup.find('table')
        for row in glossary_table.find_all('tr'):
            cells = row.find_all("td")
            # rows with exactly two cells hold an acronym and its meaning
            if len(cells) == 2:
                if cells[0].string:
                    key = str(cells[0].string.strip().lower())
                    content = str(cells[1].text.lower())
                    glossary[key] = content
except requests.exceptions.ConnectionError as e:
    print 'Connection error {} on {}'.format(e, link)
In [7]:
# print 10 first keys and values of the dictionary
sorted(glossary.items(), key=operator.itemgetter(1))[:10]
Out[7]:
In [8]:
# replace each acronym in a summary by its glossary meaning to obtain an extended description
def matchwords(words):
    text = list()
    for word in words:
        if word in glossary:
            text.append(glossary[word])
        else:
            text.append(word)
    return ' '.join(text)
# lowercase summaries
data['Summary'] = data['Summary'].str.lower()
# remove rows with NaNs in summary
data = data[data['Summary'].notnull()]
In [9]:
tokenizer = RegexpTokenizer(r'\w+')
# get indices of the filtered dataframe
indices = [index for index in data.index]
long_summary = []
for row in indices:
# tokenize words in the summary
words = tokenizer.tokenize(data['Summary'][row])
# match acronyms to meanings and generate an extended summary
long_summary.append(matchwords(words))
# create column with extended summary
data.insert(7, 'ExtendedSummary', long_summary)
data.head()
Out[9]:
The number of people wounded and killed in each group is counted and then plotted in a grouped bar chart. The two criteria, wounded and killed, can be compared by toggling between two buttons. This visualization was chosen to give the reader an overview of the groups involved.
In [10]:
# create new dataframe with the columns of interest
db = pandas.concat([data['Year'],data['HostNationWounded'], data['HostNationKilled'],
data['EnemyWounded'], data['EnemyKilled'],
data['CivilianWounded'], data['CivilianKilled'],
data['FriendlyWounded'], data['FriendlyKilled']]
, axis=1)
db.head()
Out[10]:
In [11]:
casualties = {}
for date in date_range:
# filter dataframe by year
db_year = db[db['Year'] == date]
casualties[str(date)]= {}
for column in db.columns:
if column != 'Year':
# sum every column except 'year' and store it in a dictionary with 'years' as keys
casualties[str(date)][column] = db_year[column].sum()
In [12]:
for key in sorted(casualties.iterkeys()):
print "%s: %s" % (key, casualties[key])
In [13]:
# extract wounded data and rename columns
Wounded = {}
for date in date_range:
Wounded[str(date)]= {}
Wounded[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationWounded']
Wounded[str(date)]['Taliban'] = casualties[str(date)]['EnemyWounded']
Wounded[str(date)]['Civilians'] = casualties[str(date)]['CivilianWounded']
Wounded[str(date)]['ISAF/NATO forces'] = casualties[str(date)]['FriendlyWounded']
In [14]:
for key in sorted(Wounded.iterkeys()):
print "%s: %s" % (key, Wounded[key])
In [15]:
# writing wounded data to a dataframe for visualizing it with D3
data_wounded = pandas.DataFrame()
data_wounded['Year'] = [key for key in sorted(Wounded.iterkeys())]
data_wounded['Afghan forces'] = [Wounded[key]['Afghan forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['ISAF/NATO forces'] = [Wounded[key]['ISAF/NATO forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['Taliban'] = [Wounded[key]['Taliban'] for key in sorted(Wounded.iterkeys())]
data_wounded['Civilians'] = [Wounded[key]['Civilians'] for key in sorted(Wounded.iterkeys())]
data_wounded
Out[15]:
In [16]:
# save wounded data to csv
data_wounded.to_csv('../data/nb/wounded.csv', sep=',', index=False)
In [17]:
# multi bar plot of killed people
plt.style.use('ggplot')
%matplotlib inline
N = 6
ind = np.arange(N) # the x locations for the groups
width = 0.2 # the width of the bars
Afghan_forces = list(data_wounded['Afghan forces'])
Nato_forces = list(data_wounded['ISAF/NATO forces'])
Taliban = list(data_wounded['Taliban'])
Civilians = list(data_wounded['Civilians'])
fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')
# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Wounded counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))
ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'ISAF/NATO forces', 'Taliban', 'Civilians'))
def autolabel(rects):
# attach some text labels
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
'%d' % int(height),
ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)
plt.show()
In [18]:
# Extract "killed" data and rename columns
Killed = {}
for date in date_range:
Killed[str(date)]= {}
Killed[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationKilled']
Killed[str(date)]['Taliban'] = casualties[str(date)]['EnemyKilled']
Killed[str(date)]['Civilians'] = casualties[str(date)]['CivilianKilled']
Killed[str(date)]['ISAF/NATO forces'] = casualties[str(date)]['FriendlyKilled']
In [19]:
for key in sorted(Killed.iterkeys()):
print "%s: %s" % (key, Killed[key])
In [20]:
# writing "killed" data to a dataframe for visualizing it with D3
data_killed = pandas.DataFrame()
data_killed['Year'] = [key for key in sorted(Killed.iterkeys())]
data_killed['Afghan forces'] = [Killed[key]['Afghan forces'] for key in sorted(Killed.iterkeys())]
data_killed['ISAF/NATO forces'] = [Killed[key]['ISAF/NATO forces'] for key in sorted(Killed.iterkeys())]
data_killed['Taliban'] = [Killed[key]['Taliban'] for key in sorted(Killed.iterkeys())]
data_killed['Civilians'] = [Killed[key]['Civilians'] for key in sorted(Killed.iterkeys())]
data_killed
Out[20]:
In [21]:
# save "killed" data to csv
data_killed.to_csv('../data/nb/killed.csv', sep=',', index=False)
In [22]:
Afghan_forces = list(data_killed['Afghan forces'])
Nato_forces = list(data_killed['ISAF/NATO forces'])
Taliban = list(data_killed['Taliban'])
Civilians = list(data_killed['Civilians'])
fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')
# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Killed counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))
ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'ISAF/NATO forces', 'Taliban', 'Civilians'))
def autolabel(rects):
# attach some text labels
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
'%d' % int(height),
ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)
plt.show()
(Since bar charts did not count as a D3 visualisation, this plot is placed in the same group.)
This map builds on top of the bar chart and makes clear that this was a war that heavily impacted the local population. Even when showing "just" the incidents in which civilians were killed, the reader is at first quite overwhelmed. It was therefore necessary to make the map zoomable and interactive so that the reader can form a more distinct picture. Having the map in the introduction section is also a good way of showing how the regions are distributed between the ISAF forces and which regions exist in Afghanistan at all. The reader can quickly see that the size of the circles grows with the number of people killed and can also read more about why so many people died, getting a feel for the war and the stories behind it.
In [23]:
# create new dataframe with the columns of interest
db_loc = pandas.concat([data['Latitude'], data['Longitude'], data['CivilianKilled'], data['ExtendedSummary']], axis=1)
# delete no casualties
db_loc = db_loc.drop(db_loc[db_loc['CivilianKilled'] == 0].index)
# exclude not set column
db_loc = db_loc[db_loc.CivilianKilled.notnull() & db_loc.Longitude.notnull() & db_loc.Latitude.notnull()]
db_loc.head()
Out[23]:
In [24]:
# write df as JSON, bring into appropriate format
obj = {'objects': list()}
for index, row in db_loc.iterrows():
obj['objects'].append({'circle': {
'coordinates': [row.Latitude, row.Longitude],
'death': int(row.CivilianKilled),
'desc': row.ExtendedSummary
}})
with open('../data/nb/civ_deaths.json', 'w') as data_file:
json.dump(obj, data_file)
pprint(obj['objects'][:5])
In this section a simple weighting scheme called tf-idf (term frequency–inverse document frequency) is used to find important words within each category. Once found, they are visualized in a word cloud. Since there are 153 categories and computing a word cloud for each of them is not feasible, only a selection of categories is used. Careful data cleaning is again crucial for an acceptable result; the methods are explained in the introduction and in the respective code cells.
It seemed worthwhile to go a bit deeper into specific topics and, instead of merely summarizing the data, gain something from it through further analysis. Word (or tag) clouds are in general a good way of presenting a lot of text as a weighted list of important words. Readers who want to inform themselves about more specific subjects can use these words when looking for further insights. While doing the analysis, our group could also investigate subjects we had been unaware of before.
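Concretely, the scheme implemented by hand below weights a term t in document d as tf-idf(t, d) = (count(t, d) / |d|) * log(N / (1 + df(t))), where N is the number of documents and df(t) the number of documents containing t. As a self-contained cross-check on toy documents (not the war logs), the same idea is also available in scikit-learn; note that TfidfVectorizer uses a smoothed idf and L2 normalization by default, so the exact numbers differ from the hand-rolled version:
from sklearn.feature_extraction.text import TfidfVectorizer
# toy documents for illustration only
toy_docs = ['patrol reported small arms fire near the village',
            'patrol detained one suspect at the checkpoint']
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(toy_docs)
print vectorizer.get_feature_names()
print tfidf.toarray()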
In [25]:
# indices of categories that wordclouds are computed of
#for i, x in enumerate(n_occurrences.index):
# print i, x
indices = [35,50,58,63,66,68,74,93,95]
# get categories names
categ = [n_occurrences.index[idx] for idx in indices]
# filter the dataset. Only keep those rows with the categories selected
db_wc_filt = data[data['Category'].isin(categ)]
In [26]:
db_wc_filt.head()
Out[26]:
Computing the tf-idf weights for each document in the corpus requires a series of preprocessing steps, shown in the cells below: tokenization, lowercasing, removal of non-alphabetic tokens, and removal of stopwords and very short words.
In [27]:
# uncomment this to download the stopwords locally
#nltk.download()
We start by building the corpus: one document per category, obtained by joining all extended summaries of that category.
In [28]:
branch_corpus = {}
for cat in categ:
d = db_wc_filt[db_wc_filt['Category'] == cat]
#for each category, join all text from the extended summary and save it in a dictionary
branch_corpus[cat] = ' '.join(d['ExtendedSummary'])
In [29]:
# load english stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
# create a WordNetLemmatizer for stemming the tokens
wordnet_lemmatizer = WordNetLemmatizer()
# compute number of times a word appears in a document
def freq(word, doc):
return doc.count(word)
# get number of words in the document
def word_count(doc):
return len(doc)
# compute TF
def tf(word, doc):
return (freq(word, doc) / float(word_count(doc)))
# compute number of documents containing a particular word
def num_docs_containing(word, list_of_docs):
count = 0
for document in list_of_docs:
if freq(word, document) > 0:
count += 1
return 1 + count
# compute IDF
def idf(word, list_of_docs):
return math.log(len(list_of_docs) /
float(num_docs_containing(word, list_of_docs)))
# compute TF-IDF
def tf_idf(word, doc, list_of_docs):
return (tf(word, doc) * idf(word, list_of_docs))
vocabulary = list()
docs = dict()
for key,text in branch_corpus.iteritems():
# tokenize text for a particular category
tokens = nltk.word_tokenize(text)
# lower case the words
tokens = [token.lower() for token in tokens]
# only keep words, disregard digits
tokens = [token for token in tokens if token.isalpha()]
# disregard stopwords and words that are less than 2 letter in length
tokens = [token for token in tokens if token not in stopwords and len(token) > 2]
final_tokens = []
final_tokens.extend(tokens)
docs[key] = {'tf': {}, 'idf': {},
'tf-idf': {}, 'tokens': []}
# compute TF
for token in final_tokens:
# The term-frequency (Normalized Frequency)
docs[key]['tf'][token] = tf(token, final_tokens)
docs[key]['tokens'] = final_tokens
vocabulary.append(final_tokens)
# compute IDF and TF-IDF
for doc in docs:
for token in docs[doc]['tf']:
# the Inverse-Document-Frequency
docs[doc]['idf'][token] = idf(token, vocabulary)
# the tf-idf
docs[doc]['tf-idf'][token] = tf_idf(token, docs[doc]['tokens'], vocabulary)
In [30]:
# get the 10 words with the highest TF-IDF score for the category 'arrest'
sorted_dic = sorted(docs['arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]
Out[30]:
In order to create the word clouds, the tf-idf scores are rounded to the nearest integer. All words are then combined into one long string, separated by spaces, with each word repeated according to its rounded tf-idf score.
Note: since the raw tf-idf values are very small, we not only round them but also scale them by a factor of 1000 first.
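A sketch of a possible alternative to this repeat-the-word trick (not what is done below): depending on the installed version, the wordcloud package can also build a cloud directly from a word-to-weight mapping via generate_from_frequencies, which would make the scaling and string repetition unnecessary.
# hypothetical alternative, assuming a wordcloud version whose
# generate_from_frequencies accepts a {word: weight} dictionary
freqs = dict((word, score) for word, score in docs['arrest']['tf-idf'].items() if score > 0)
wc = WordCloud().generate_from_frequencies(freqs)
plt.imshow(wc)
plt.axis("off")
plt.show()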
In [31]:
for cat in categ:
for tup in docs[cat]['tf-idf'].items():
#scale each tf-idf value by a factor of 1000 and round it
docs[cat]['tf-idf'][tup[0]] = int(round(tup[1]*1000))
In [32]:
#see how they have been scaled and rounded
sorted_dic = sorted(docs['arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]
Out[32]:
In [33]:
# generate text for wordclouds
doc_wordcloud = {}
for cat in categ:
string = []
for tup in docs[cat]['tf-idf'].items():
if tup[1] > 0:
#repeat each word to its scaled and rounded tf-idf score
string.extend(np.repeat(tup[0],tup[1]))
doc_wordcloud[cat] = ' '.join(string)
In [34]:
# generate wordclouds
for cat in categ:
print "#### %s ####" % cat
wordcloud = WordCloud().generate(doc_wordcloud[cat])
img=plt.imshow(wordcloud)
plt.axis("off")
plt.show()
In [35]:
# save wordclouds to files so that they are readable in JS for visualizing with D3
def savefile(category):
filename = '../data/nb/' + category
filename = filename.replace(" ", "_")
with open(filename, "w") as text_file:
text_file.write(doc_wordcloud[category])
#print '* [download {0} data](../data/csv/v2/{0})'.format(category)
for cat in categ:
savefile(cat)
In this section Latent Dirichlet Allocation (LDA) is used, a topic model that generates topics based on word frequencies in a set of documents.
Each extended summary in our data is considered a document. Data cleaning has to be done on all documents, since it is crucial for generating a useful topic model; the methods are similar to the processing for the word clouds. After the cleaning, each term's frequency in each document is needed, so a document-term matrix (bag of words) is constructed with the gensim module. Once the bag of words has been created, the LDA model can be trained.
LDA is another way of analyzing the text that assigns meaningful weights to the words of each topic. Since the weighting scheme differs from tf-idf, the extracted result could be used for another visualization approach, conveying the importance of a word to the reader through circle diameters. It was a good way of showing, with just a few words, what typical war topics were mainly about.
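To make the bag-of-words step concrete, here is a minimal, self-contained sketch (toy sentences, not the war logs) of what gensim's Dictionary and doc2bow produce: each document becomes a list of (token id, count) pairs, which is exactly the corpus format fed to the LdaModel below.
# toy illustration of the document-term (bag-of-words) representation used below
toy_texts = [['patrol', 'reported', 'small', 'arms', 'fire'],
             ['patrol', 'detained', 'one', 'suspect']]
toy_dictionary = corpora.Dictionary(toy_texts)
toy_corpus = [toy_dictionary.doc2bow(text) for text in toy_texts]
print toy_dictionary.token2id
print toy_corpus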
In [36]:
# will be used for text cleaning
stopwords = set(nltk.corpus.stopwords.words('english')) # load english stopwords
stoplist = set('for a of the and to in'.split())
manual_list=set('none nan get http https'.split())
wordnet_lemmatizer = WordNetLemmatizer() # create a WordNetLemmatizer to user for stemming our categories
In [37]:
# take some interesting categories
documents = [summary for summary in db_wc_filt['ExtendedSummary']]
The text is cleaned in the following cell: HTML markup is stripped, the text is normalized to ASCII and tokenized, the tokens are lemmatized, and stopwords and very short words are removed.
In [38]:
clean_dict = {}
i = 0
for document in documents:
    # strip any HTML markup from the summary
    try:
        cleaned_html = lxml.html.fromstring(str(document)).text_content()
    except Exception:
        cleaned_html = ""
    # normalize to ASCII, tokenize, lemmatize and filter tokens
    temp_token_list = nltk.word_tokenize(str(cleaned_html.encode('ascii', 'ignore')))
    content = [wordnet_lemmatizer.lemmatize(w.lower()) for w in temp_token_list if
               (w.isalpha() and w.lower() not in stopwords and w.lower() not in stoplist
                and w.lower() not in manual_list and len(w) > 2)]
    # save the cleaned tokens
    clean_dict[i] = content
    i += 1
In [39]:
# words that appear only once are not needed
frequency = defaultdict(int)
for text in clean_dict.values():
    for token in text:
        frequency[token] += 1
# keep only tokens that appear more than once, per document
for i, text in clean_dict.items():
    clean_dict[i] = [token for token in text if frequency[token] > 1]
In [40]:
dictionary = corpora.Dictionary(clean_dict.values())
corpus = [dictionary.doc2bow(text) for text in clean_dict.values()]
In [41]:
# compute LDA with 7 topics, making 20 passes over the corpus
lda = models.ldamodel.LdaModel(corpus, num_topics=7, id2word=dictionary, passes=20)
In [42]:
lda.print_topics(num_topics=7, num_words=20)
Out[42]:
In [43]:
topics = [lda.show_topic(i) for i in range(7)]
In [44]:
# write to json formatted file for D3
LDA_dicts = {'name': 'LDA',
'children': [{
'name': 'Topics',
'children': []
}]};
for index, topic in enumerate(topics):
    name = 'Topic {}'.format(index + 1)
    dic_out = {'name': name, 'children': []}
    children = []
    # each entry of a topic is a (word, weight) pair
    for term in topics[index]:
        children.append({'name': term[0], 'size': term[1]})
    dic_out['children'] = children
    LDA_dicts['children'][0]['children'].append(dic_out)
In [45]:
# dump dict to JSON file
with open('../data/nb/LDA_topics.json', 'w') as fp:
json.dump(LDA_dicts, fp)
In [46]:
LDA_dicts
Out[46]:
For this visualization the areas with the most security incidents during the 2009 election are plotted, with grid points of the four most common war incident categories as an overlay. To make a proper and meaningful prediction we decided to keep the dataset unbalanced, keeping in mind that accuracy will drop for the smaller classes; on the other hand, an unbalanced set stays closer to the ground truth, and the map shows which category of incident is most likely to be experienced. From this, the ISAF forces could derive preventive measures for specific regions, guided by the density of incidents. Still, the distributions of the top incidents in 2009 are not spread too far apart.
Since the reader is by now more familiar with the overall data, this map carries more information: it shows which type of attack is most probable in a given region during the election period and where the hotspots of incidents are. It fits the storyline that even fundamental things like elections are a great problem and far from a normal everyday situation in a war-troubled country.
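The classifiers imported at the top of the notebook (RandomForestClassifier, GaussianNB, KNeighborsClassifier) hint at how such a prediction can be set up. The following is only a sketch under illustrative assumptions, namely that latitude, longitude and hour of day are the features and the incident category is the label; it is not necessarily the feature set or model behind the final map.
# sketch only: predict the incident category for 2009 from location and hour of day
pred = data[(data['Year'] == 2009) & data.Latitude.notnull() & data.Longitude.notnull()]
top4 = pred['Category'].value_counts().index[0:4]
pred = pred[pred['Category'].isin(top4)]
X = pred[['Latitude', 'Longitude', 'Hour']].values
y = pred['Category'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print "held-out accuracy: %.3f" % accuracy_score(y_test, clf.predict(X_test))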
In [47]:
data_el = pandas.read_csv('../data/csv/elections_security_incidents.csv')
# clean rows without useful numbers
data_el = data_el.dropna()
data_el.head()
Out[47]:
In [48]:
print "election provinces total: ", len(data_el)
print "total incidents: ", int(data_el['2009 and 2010'].sum())
In [49]:
# incidents year 2009
data09 = data[data['Year'] == 2009]
# incidents injured or killed civilians
data09 = data09.drop(data09[(data09['CivilianWounded'] == 0) & (data09['CivilianKilled'] == 0)].index)
# exclude unset columns
data09 = data09[data09.CivilianKilled.notnull() & data09.CivilianWounded.notnull()]
# Distribution of categories top 4
c_occurrences = data09['Category'].value_counts()
c_occurrences[0:4]
Out[49]:
In [50]:
# keep only the rows that belong to the top 4 categories
data09 = data09[data09['Category'].isin(c_occurrences.index[0:4])]
In [51]:
# split data to individual sets
def get_subset(dataset, category):
df = dataset[dataset['Category'] == category]
return df
# subset for categories
df_IED = get_subset(data09, 'ied explosion')
df_DIRECT = get_subset(data09, 'direct fire')
df_FORCE = get_subset(data09, 'escalation of force')
df_INDIRECT = get_subset(data09, 'indirect fire')
In [52]:
# see balance of data
ied = df_IED.shape[0]
direct = df_DIRECT.shape[0]
force = df_FORCE.shape[0]
indirect = df_INDIRECT.shape[0]
total_amount = len(data09)
print "total top 4 incidents: ", total_amount
print "Percentage of ied explosions: {}%".format(ied/(total_amount/100))
print "Percentage of direct fire: {}%".format(direct/(total_amount/100))
print "Percentage of escalation of force: {}%".format(force/(total_amount/100))
print "Percentage of indirect fire: {}%".format(indirect/(total_amount/100))
Some geodata is plotted to inspect the spatial distribution. First, a function is needed that extracts the geoinformation for each row of the datasets, as shown in the cell below.
In [53]:
def get_all_geodata(dataset):
# filter bad rows
dataset = dataset[dataset.Longitude.notnull() & dataset.Latitude.notnull()]
# only activity in Afghanistan
include = (dataset.Latitude < 39) & (dataset.Latitude > 28) & (dataset.Longitude > 59) & (dataset.Longitude < 76)
# get data in the format geoplotlib requires. We put the geodata in a dictionary structured as follows
geo_data = {
"lat": dataset.loc[include].Latitude.tolist(),
"lon": dataset.loc[include].Longitude.tolist()
}
return geo_data
# create the dictionary with lat and lon
geo_data_IED = get_all_geodata(df_IED)
geo_data_DIRECT = get_all_geodata(df_DIRECT)
geo_data_FORCE = get_all_geodata(df_FORCE)
geo_data_INDIRECT = get_all_geodata(df_INDIRECT)
geo_data_ELECTION = get_all_geodata(data_el)
All latitude and longitude information is passed to geoplotlib's kernel density estimation function kde(), which renders what is commonly called a heatmap. A BoundingBox is needed to fit the projection as closely as possible to Afghanistan; its approximate boundaries are obtained by removing some outlier coordinates from the data points. The maps are rendered with inline() to make them visible in the notebook.
In [54]:
# plot given coordinate input
def geo_plot(geodata):
# bounding box on the minima and maxima of the data
geoplotlib.set_bbox(
BoundingBox(
max(geodata['lat']),
max(geodata['lon']),
min(geodata['lat']),
min(geodata['lon'])
));
# kernel density estimation visualization
geoplotlib.kde(geodata, bw=5, cut_below=1e-3, cmap='hot', alpha=170)
# google tiles with lyrs=y ... hybrid
geoplotlib.tiles_provider({
'url': lambda zoom, xtile, ytile: 'https://mt1.google.com/vt/lyrs=y&hl=en&x=%d&y=%d&z=%d' % (xtile, ytile, zoom ),
'tiles_dir': 'DTU-social_data',
'attribution': 'DTU 02806 Social Data Analysis and Visualization'
})
geoplotlib.inline();
In [55]:
print 'ied explosion'
geo_plot(geo_data_IED)
print 'direct fire'
geo_plot(geo_data_DIRECT)
print 'escalation of force'
geo_plot(geo_data_FORCE)
print 'indirect fire'
geo_plot(geo_data_INDIRECT)
print 'election'
geo_plot(geo_data_ELECTION)