Afghanistan War Insights

Project Assignment B

Motivation

In 2010 WikiLeaks released a file that gave access to previously classified military documents pertaining to the Afghanistan war. News organizations in particular were interested but faced a big challenge: how, if at all, could they visually depict some of this information for their readers? [1]

The analyzed dataset contains every recorded incident between 2004 and 2009 during the war in Afghanistan. This so-called war diary is a 74 MB CSV file with 76,911 rows and 32 columns. The most interesting columns contain dates, locations, incident reports, category information, the parties involved, and counts of the killed and wounded. In addition, open data from the Afghan elections of 2009 and 2010 gave us the number of violent incidents per voting district, showing where the most dangerous areas for civilians were during the elections.

The true face of war is often hidden from ordinary people in industrialized countries. People in war-troubled Middle Eastern countries suffer every day, but thanks to brave investigative journalists the truth is revealed. Our motivation is to contribute by taking a closer look at the death tolls and the summaries, thus gaining more insight into the war. With the help of machine learning methods we explore how ISAF forces and the civilian population of Afghanistan could benefit from our analysis. Different visualization methods were chosen to convey the mood of this tragic war.

Loading the Data

Load WikiLeaks Afghan War Diary from 2004-2009.


In [1]:
# data structures
import pickle
import cPickle as cp
import json
from pprint import pprint
# Pandas contains useful functions for data structures with "relational" or "labeled" data
import pandas

# math and arrays
from __future__ import division  # __future__ imports belong before the other statements
import math
import numpy as np

# plotting
import matplotlib.pyplot as plt
import geoplotlib
from geoplotlib.utils import BoundingBox

# word processing
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import operator
import itertools
# remove words that appear only once
from collections import defaultdict, Counter

# Visualizations
# Wordclouds
from wordcloud import WordCloud
# LDA
from gensim import corpora, models, similarities

# glossary building
import lxml.html
import bs4, lxml, re, requests

# ML
from sklearn import cluster
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

# header as suggested
# by WikiLeaks: https://wikileaks.org/afg/
# by the Guardian: http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-data
header = [
    'ReportKey', # find messages and also to reference them
    'DateOccurred', 'EventType', 
    'Category', # describes what kind of event the message is about
    'TrackingNumber', 'Title', # internal tracking number and title
    'Summary', # actual description of the event
    'Region', # broader region of the event.
    'AttackOn', # who was attacked during an event
    'ComplexAttack', # signifies that an attack was a larger operation that required more planning, coordination and preparation
    'ReportingUnit', 'UnitName', 'TypeOfUnit', # information on the military unit that authored the report
    'FriendlyWounded', 'FriendlyKilled', 'HostNationWounded', 'HostNationKilled', 'CivilianWounded', 'CivilianKilled', 
    'EnemyWounded', 'EnemyKilled', 'EnemyDetained', # who was killed/wounded/captured
    'MilitaryGridReferenceSystem', 'Latitude', 'Longitude', # location
    'OriginatorGroup', 'UpdatedByGroup', # message originated from or was updated by
    'CommandersCriticalInformationRequirements', 
    'Significant', # are analyzed and evaluated by special group in command centre
    'Affiliation', # event was of friendly, neutral or enemy nature
    'DisplayColor', # enemy activity - RED, friendly activity - BLUE, friend on friend - GREEN
    'ClassificationLevel' # classification level of the message, e.g.: Secret
]

data = pandas.read_csv('../data/csv/afg.csv', header=None, names=header)
# lower-case some columns; see the inconsistencies at: https://wardiaries.wikileaks.org/search/?sort=date
data['Category'] = data['Category'].str.lower()
data['Title'] = data['Title'].str.lower()
data.head()


Out[1]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary Region AttackOn ComplexAttack ... MilitaryGridReferenceSystem Latitude Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel
0 D92871CA-D217-4124-B8FB-89B9A2CFFCB4 2004-01-01 00:00:00 Enemy Action direct fire 2007-033-004042-0756 direct fire other KAF-1BDE -S3 REPORTS: SUMMIT 09 B CO ELEMENT S... RC EAST ENEMY False ... 42SWB3900916257 32.683319 69.416107 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET
1 C592135C-1BFF-4AEC-B469-0A495FDA78D9 2004-01-01 00:00:00 Friendly Action cache found/cleared 2007-033-004738-0185 cache found/cleared other USSF FINDS CACHE IN VILLAGE OF WALU TANGAY: US... RC EAST FRIEND False ... 42SXD7520076792 35.018608 70.920273 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET
2 D50F59F0-6F32-4E63-BC02-DB2B8422DE6E 2004-01-01 00:00:00 Non-Combat Event propaganda 2007-033-010818-0798 propaganda other (M) NIGHT LETTERS DISTRIBUTED AROUND HAZARJUFT... RC SOUTH NEUTRAL False ... 41RPQ1439743120 31.116390 64.199707 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET
3 E3F22EFB-F0CA-4821-9322-CC2250C05C8A 2004-01-01 00:00:00 Enemy Action direct fire 2007-033-004042-0850 direct fire other KAF-1BDE -S3: SUMMIT 6 REPORTS TIC SALUTE TO F... RC EAST ENEMY False ... 42SWB3399911991 32.645000 69.362511 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET
4 4D0E1E60-9535-4D58-A374-74367F058788 2004-01-01 00:00:00 Friendly Action cache found/cleared 2007-033-004738-0279 cache found/cleared other KAF-1BDE -S3 REPORTS: GERONIMO 11 SALUTE AS FO... RC EAST FRIEND False ... 42SWB7580277789 33.236389 69.813606 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET

5 rows × 32 columns

Basic stats

The Pandas library for Python provides data structures such as the DataFrame that are very useful for manipulating and extracting "labeled" data, along with many methods for data cleaning and preprocessing.

In the following cells a small exploratory analysis of the data is provided: the number of rows and columns, the years in which the incidents took place, the number of categories, and the most and least commonly occurring categories are extracted and displayed. Lowercasing all columns we wanted to work with was necessary to unify the data.
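
As a minimal sketch of the kind of label-based selection and cleaning used below (the tiny frame and the threshold are invented; only the column names mirror the real data):

import pandas

# hypothetical mini-frame reusing two of the war diary's column names
df = pandas.DataFrame({'Category': ['Direct Fire', 'IED Explosion', None],
                       'CivilianKilled': [0, 3, 1]})

df['Category'] = df['Category'].str.lower()    # unify casing, as done for the real data
df = df[df['Category'].notnull()]              # drop rows with a missing category
print df['Category'].value_counts()            # frequency of each category
print df[df['CivilianKilled'] > 0].shape[0]    # incidents with at least one civilian killed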


In [2]:
data['DateOccurred'] = pandas.to_datetime(data['DateOccurred'])
data['Year'] = [date.year for date in data['DateOccurred']]
data['Hour'] = [date.hour for date in data['DateOccurred']]

# number of rows/columns
print "Number of rows: %d" % data.shape[0]
print "Number of columns: %d" % data.shape[1]

date_range = set()
for date in data['DateOccurred']:
    date_range.add(date.year)

print "\nYears:\n"
print list(date_range)

# ocurrences of categories
print "\nNumber of unique categories: %d" %len(set(data['Category']))

# distribution of categories
n_occurrences = data['Category'].value_counts()

print "\nMost commonly occurring categories:\n"
print n_occurrences.head()

print "\nMost commonly occurring category is %s with %d" % (n_occurrences.argmax(), n_occurrences.max())
print "\nLeast commonly occurring category is %s with %d" % (n_occurrences.argmin(), n_occurrences.min())


Number of rows: 76911
Number of columns: 34

Years:

[2004, 2005, 2006, 2007, 2008, 2009]

Number of unique categories: 153

Most commonly occurring categories:

direct fire          16293
ied found/cleared     8581
indirect fire         7237
ied explosion         7202
other                 4693
Name: Category, dtype: int64

Most commonly occurring category is direct fire with 16293

Least commonly occurring category is graffiti with 1

In [3]:
# plot distribution of categories (TOP 50)
plt.style.use('ggplot')
%matplotlib inline

n_occurrences_top = n_occurrences[0:50]

# plot histogram
def barplot(series, title, figsize, ylabel, flag, rotation):
    ax = series.plot(kind='bar', 
                title = title,
                figsize = figsize,
                fontsize = 13)
    
    # set ylabel
    ax.set_ylabel(ylabel)
    # set xlabel (depending on the flag that comes as a function parameter)
    ax.get_xaxis().set_visible(flag)
    # set series index as xlabels and rotate them
    ax.set_xticklabels(series.index, rotation= rotation)
    
barplot(n_occurrences_top,'Category occurrences', figsize=(14,6), ylabel = 'category count',flag = True, rotation = 90)



In [4]:
focus_categories = n_occurrences.index[0:8]
print focus_categories


Index([u'direct fire', u'ied found/cleared', u'indirect fire',
       u'ied explosion', u'other', u'medevac', u'unexploded ordnance',
       u'cache found/cleared'],
      dtype='object')

In [5]:
def yearly_category_distribution(data, focus_categories):
    
    index = 1
    for category in focus_categories:
        # filter table by type of category
        db = data[data['Category'] == category]
        # get year counts of that category
        year_counts = db['Year'].value_counts()
        # sort it (from 2004 to 2009)
        year_counts = year_counts.sort_index()
        # plot it
        plt.subplot(7,2,index)
        barplot(year_counts, category, figsize=(20,35), ylabel = 'category count', flag = True, rotation = 0)
        index += 1
        
yearly_category_distribution(data, focus_categories)


Cleaning the Data

All the military language, abbreviations and acronyms in the summaries need to be replaced by their meaning in order to obtain extended, more understandable descriptions. Luckily, the Guardian publicly released their list of acronyms online as a glossary. The page is downloaded and the information extracted using the HTML parsing library BeautifulSoup.

The additional column ExtendedSummary is appended to the data using the built glossary. Furthermore, we normalize the text to avoid potential bias and to remove noise before the in-depth analysis of the summaries used in several visualizations. The approaches used are:

  • Tokenizing: converting a document into its atomic elements; with the regular-expression tokenizer used here, punctuation is dropped, so "don't" becomes the two tokens "don" and "t".
  • Stopping: removing meaningless words; very common words such as "a" and "to" are uninteresting.
  • Stemming: merging words that are equivalent in meaning (e.g. walking, walker, walked).

These approaches use the natural language toolkit nltk.
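
A toy sketch of the three steps on an invented sentence (the WordNetLemmatizer is used as a lightweight stand-in for stemming, as in the rest of the notebook; the sentence is made up):

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

example = "patrols reported walking to the bridges and clearing two caches"

# 1) tokenizing: split the text into word tokens
tokens = RegexpTokenizer(r'\w+').tokenize(example)
# 2) stopping: drop very common English words
stop = set(nltk.corpus.stopwords.words('english'))
tokens = [t for t in tokens if t not in stop]
# 3) stemming/lemmatizing: merge word forms ("bridges" -> "bridge", "caches" -> "cache")
lemmatizer = WordNetLemmatizer()
print [lemmatizer.lemmatize(t) for t in tokens]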


In [6]:
# read url with the glossary of the acronyms
link = 'http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-war-logs-glossary'
glossary = dict()
try:
    response = requests.get(link)
    if not response.ok:
        print 'HTTP error {} trying to fetch Guardian glossary: {}'.format(response.status_code, link)
    else:
        soup = bs4.BeautifulSoup(response.content, 'lxml')
        glossary_table = soup.find('table')
        for row in glossary_table.find_all('tr'):
            cells = row.find_all("td")
            if len(cells) == 2:
                if cells[0].string:
                    key = str(cells[0].string.strip().lower())
                    content = str(cells[1].text.lower())
                    glossary[key] = content
except requests.exceptions.ConnectionError as e:
    print 'Connection error {} on {}'.format(e, link)

In [7]:
# print 10 first keys and values of the dictionary
sorted(glossary.items(), key=operator.itemgetter(1))[:10]


Out[7]:
[('bsn', ' (camp) bastion '),
 ('cgbg 1 coy', ' 1 company, coldstream guards battle group '),
 ('gbu-31', ' 2,000lb "smart bomb\' '),
 ('42 cdo rm', ' 42 commando royal marines '),
 ('gbu-12', ' 500lb laser-guided "smart bomb" '),
 ('508 stb', ' 508th special troops battalion '),
 ('81', ' 81mm mortar round '),
 ('9-liner', ' 9 line medevac request '),
 ('ctc', ' ? counterterrorist center '),
 ('ack', ' acknowledge ')]

In [8]:
# replace each acronym in a summary by its meaning from the glossary, producing an extended description
def matchwords(words):

    text = list()

    for word in words:
        if word in glossary:
            text.append(glossary[word])
        else:
            text.append(word)
            
    return ' '.join(text)

# lowercase summaries
data['Summary'] = data['Summary'].str.lower()
# remove rows with NaNs in summary
data = data[data['Summary'].notnull()]

In [9]:
tokenizer = RegexpTokenizer(r'\w+')

# get indices of the filtered dataframe
indices = [index for index in data.index]

long_summary = []
for row in indices:
    # tokenize words in the summary
    words = tokenizer.tokenize(data['Summary'][row])
    # match acronyms to meanings and generate an extended summary
    long_summary.append(matchwords(words))
    
# create column with extended summary
data.insert(7, 'ExtendedSummary', long_summary)
data.head()


Out[9]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary ExtendedSummary Region AttackOn ... Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel Year Hour
0 D92871CA-D217-4124-B8FB-89B9A2CFFCB4 2004-01-01 Enemy Action direct fire 2007-033-004042-0756 direct fire other kaf-1bde -s3 reports: summit 09 b co element s... kandahar air field 1bde s3 reports summit 09... RC EAST ENEMY ... 69.416107 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2004 0
1 C592135C-1BFF-4AEC-B469-0A495FDA78D9 2004-01-01 Friendly Action cache found/cleared 2007-033-004738-0185 cache found/cleared other ussf finds cache in village of walu tangay: us... ussf finds cache in village of walu tangay uss... RC EAST FRIEND ... 70.920273 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET 2004 0
2 D50F59F0-6F32-4E63-BC02-DB2B8422DE6E 2004-01-01 Non-Combat Event propaganda 2007-033-010818-0798 propaganda other (m) night letters distributed around hazarjuft... m night letters distributed around hazarjuft t... RC SOUTH NEUTRAL ... 64.199707 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 0
3 E3F22EFB-F0CA-4821-9322-CC2250C05C8A 2004-01-01 Enemy Action direct fire 2007-033-004042-0850 direct fire other kaf-1bde -s3: summit 6 reports tic salute to f... kandahar air field 1bde s3 summit 6 reports ... RC EAST ENEMY ... 69.362511 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2004 0
4 4D0E1E60-9535-4D58-A374-74367F058788 2004-01-01 Friendly Action cache found/cleared 2007-033-004738-0279 cache found/cleared other kaf-1bde -s3 reports: geronimo 11 salute as fo... kandahar air field 1bde s3 reports geronimo ... RC EAST FRIEND ... 69.813606 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET 2004 0

5 rows × 35 columns

Theory and Visualizations

Recorded Casualties

Visualization 1 - bar chart

The numbers of people wounded and killed in each group are counted and then plotted in a grouped bar chart. The two criteria, wounded and killed, can be compared by toggling between two buttons. This visualization was chosen to give the reader an overview of the involved groups:

  • Afghan forces
  • ISAF/NATO forces
  • Taliban
  • Civilians

The chart also shows the evolution of the numbers of wounded and dead over the course of the war. Grouping the bars by party and year allows an approximate comparison at a glance. It was also a good way to introduce a delicate subject by showing the shocking fact of people actually dying.

In [10]:
# create new dataframe with the columns of interest
db = pandas.concat([data['Year'],data['HostNationWounded'], data['HostNationKilled'],
              data['EnemyWounded'], data['EnemyKilled'],
              data['CivilianWounded'], data['CivilianKilled'],
              data['FriendlyWounded'], data['FriendlyKilled']]
              , axis=1)
db.head()


Out[10]:
Year HostNationWounded HostNationKilled EnemyWounded EnemyKilled CivilianWounded CivilianKilled FriendlyWounded FriendlyKilled
0 2004 0 0 0 3 0 0 0 0
1 2004 0 0 0 0 0 0 0 0
2 2004 0 0 0 0 0 0 0 0
3 2004 0 0 0 8 0 0 0 0
4 2004 0 0 0 0 0 0 0 0

In [11]:
casualties = {}

for date in date_range:
    # filter dataframe by year
    db_year = db[db['Year'] == date]
    casualties[str(date)]= {}
    for column in db.columns:
        if column != 'Year':
            # sum every column except 'year' and store it in a dictionary with 'years' as keys
            casualties[str(date)][column] = db_year[column].sum()

In [12]:
for key in sorted(casualties.iterkeys()):
    print "%s: %s" % (key, casualties[key])


2004: {'HostNationKilled': 218.0, 'FriendlyWounded': 133.0, 'HostNationWounded': 316.0, 'EnemyWounded': 93.0, 'FriendlyKilled': 22.0, 'EnemyKilled': 343.0, 'CivilianKilled': 219.0, 'CivilianWounded': 208.0}
2005: {'HostNationKilled': 180.0, 'FriendlyWounded': 480.0, 'HostNationWounded': 504.0, 'EnemyWounded': 200.0, 'FriendlyKilled': 71.0, 'EnemyKilled': 890.0, 'CivilianKilled': 178.0, 'CivilianWounded': 454.0}
2006: {'HostNationKilled': 605.0, 'FriendlyWounded': 1083.0, 'HostNationWounded': 1397.0, 'EnemyWounded': 249.0, 'FriendlyKilled': 142.0, 'EnemyKilled': 2687.0, 'CivilianKilled': 800.0, 'CivilianWounded': 1790.0}
2007: {'HostNationKilled': 951.0, 'FriendlyWounded': 1652.0, 'HostNationWounded': 2093.0, 'EnemyWounded': 377.0, 'FriendlyKilled': 186.0, 'EnemyKilled': 4044.0, 'CivilianKilled': 758.0, 'CivilianWounded': 1962.0}
2008: {'HostNationKilled': 699.0, 'FriendlyWounded': 1255.0, 'HostNationWounded': 1749.0, 'EnemyWounded': 285.0, 'FriendlyKilled': 244.0, 'EnemyKilled': 2816.0, 'CivilianKilled': 798.0, 'CivilianWounded': 1980.0}
2009: {'HostNationKilled': 1143.0, 'FriendlyWounded': 2693.0, 'HostNationWounded': 2444.0, 'EnemyWounded': 617.0, 'FriendlyKilled': 481.0, 'EnemyKilled': 4437.0, 'CivilianKilled': 1241.0, 'CivilianWounded': 2646.0}

In [13]:
# extract wounded data and rename columns
Wounded = {}
for date in date_range:
    
    Wounded[str(date)]= {}
    
    Wounded[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationWounded']
    Wounded[str(date)]['Taliban'] = casualties[str(date)]['EnemyWounded']
    Wounded[str(date)]['Civilians'] = casualties[str(date)]['CivilianWounded']
    Wounded[str(date)]['ISAF/NATO forces'] = casualties[str(date)]['FriendlyWounded']

In [14]:
for key in sorted(Wounded.iterkeys()):
    print "%s: %s" % (key, Wounded[key])


2004: {'ISAF/NATO forces': 133.0, 'Afghan forces': 316.0, 'Taliban': 93.0, 'Civilians': 208.0}
2005: {'ISAF/NATO forces': 480.0, 'Afghan forces': 504.0, 'Taliban': 200.0, 'Civilians': 454.0}
2006: {'ISAF/NATO forces': 1083.0, 'Afghan forces': 1397.0, 'Taliban': 249.0, 'Civilians': 1790.0}
2007: {'ISAF/NATO forces': 1652.0, 'Afghan forces': 2093.0, 'Taliban': 377.0, 'Civilians': 1962.0}
2008: {'ISAF/NATO forces': 1255.0, 'Afghan forces': 1749.0, 'Taliban': 285.0, 'Civilians': 1980.0}
2009: {'ISAF/NATO forces': 2693.0, 'Afghan forces': 2444.0, 'Taliban': 617.0, 'Civilians': 2646.0}

In [15]:
# writing wounded data to a dataframe for visualizing it with D3
data_wounded = pandas.DataFrame()

data_wounded['Year'] = [key for key in sorted(Wounded.iterkeys())]
data_wounded['Afghan forces'] = [Wounded[key]['Afghan forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['ISAF/NATO forces'] = [Wounded[key]['ISAF/NATO forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['Taliban'] = [Wounded[key]['Taliban'] for key in sorted(Wounded.iterkeys())]
data_wounded['Civilians'] = [Wounded[key]['Civilians'] for key in sorted(Wounded.iterkeys())]

data_wounded


Out[15]:
Year Afghan forces ISAF/NATO forces Taliban Civilians
0 2004 316 133 93 208
1 2005 504 480 200 454
2 2006 1397 1083 249 1790
3 2007 2093 1652 377 1962
4 2008 1749 1255 285 1980
5 2009 2444 2693 617 2646
"Wounded" data download

download this data


In [16]:
# save wounded data to csv
data_wounded.to_csv('../data/nb/wounded.csv', sep=',', index=False)

In [17]:
# multi bar plot of killed people
plt.style.use('ggplot')
%matplotlib inline

N = 6
ind = np.arange(N)  # the x locations for the groups
width = 0.2     # the width of the bars

Afghan_forces = list(data_wounded['Afghan forces'])
Nato_forces = list(data_wounded['ISAF/NATO forces'])
Taliban = list(data_wounded['Taliban'])
Civilians = list(data_wounded['Civilians'])

fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')

# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Wounded counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))

ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'ISAF/NATO forces', 'Taliban', 'Civilians'))

def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)

plt.show()



In [18]:
# Extract "killed" data and rename columns
Killed = {}
for date in date_range:
    
    Killed[str(date)]= {}
    
    Killed[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationKilled']
    Killed[str(date)]['Taliban'] = casualties[str(date)]['EnemyKilled']
    Killed[str(date)]['Civilians'] = casualties[str(date)]['CivilianKilled']
    Killed[str(date)]['ISAF/NATO forces'] = casualties[str(date)]['FriendlyKilled']

In [19]:
for key in sorted(Killed.iterkeys()):
    print "%s: %s" % (key, Killed[key])


2004: {'ISAF/NATO forces': 22.0, 'Afghan forces': 218.0, 'Taliban': 343.0, 'Civilians': 219.0}
2005: {'ISAF/NATO forces': 71.0, 'Afghan forces': 180.0, 'Taliban': 890.0, 'Civilians': 178.0}
2006: {'ISAF/NATO forces': 142.0, 'Afghan forces': 605.0, 'Taliban': 2687.0, 'Civilians': 800.0}
2007: {'ISAF/NATO forces': 186.0, 'Afghan forces': 951.0, 'Taliban': 4044.0, 'Civilians': 758.0}
2008: {'ISAF/NATO forces': 244.0, 'Afghan forces': 699.0, 'Taliban': 2816.0, 'Civilians': 798.0}
2009: {'ISAF/NATO forces': 481.0, 'Afghan forces': 1143.0, 'Taliban': 4437.0, 'Civilians': 1241.0}

In [20]:
# writing "killed" data to a dataframe for visualizing it with D3
data_killed = pandas.DataFrame()

data_killed['Year'] = [key for key in sorted(Killed.iterkeys())]
data_killed['Afghan forces'] = [Killed[key]['Afghan forces'] for key in sorted(Killed.iterkeys())]
data_killed['ISAF/NATO forces'] = [Killed[key]['ISAF/NATO forces'] for key in sorted(Killed.iterkeys())]
data_killed['Taliban'] = [Killed[key]['Taliban'] for key in sorted(Killed.iterkeys())]
data_killed['Civilians'] = [Killed[key]['Civilians'] for key in sorted(Killed.iterkeys())]

data_killed


Out[20]:
Year Afghan forces ISAF/NATO forces Taliban Civilians
0 2004 218 22 343 219
1 2005 180 71 890 178
2 2006 605 142 2687 800
3 2007 951 186 4044 758
4 2008 699 244 2816 798
5 2009 1143 481 4437 1241
"Killed" data download

download this data


In [21]:
# save "killed" data to csv
data_killed.to_csv('../data/nb/killed.csv', sep=',', index=False)

In [22]:
Afghan_forces = list(data_killed['Afghan forces'])
Nato_forces = list(data_killed['ISAF/NATO forces'])
Taliban = list(data_killed['Taliban'])
Civilians = list(data_killed['Civilians'])

fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')

# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Killed counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))

ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'ISAF/NATO forces', 'Taliban', 'Civilians'))

def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)

plt.show()


Killed Civilians by Location

Visualization 1 - Map

(since bar charts did not count as a D3 visualisation, this is in the same group)

This map builds on top of the bar chart, making clear that this was a war that heavily impacted the local population. Even when showing "just" the incidents in which civilians were killed, the reader is quite overwhelmed at first. It was therefore necessary to make the map zoomable and interactive so that the reader can get a more distinct picture. Having the map in the introduction section is also a good way of showing how the regions are divided between the ISAF forces and which regions exist in Afghanistan at all. The reader can quickly identify that the size of the circles increases with the number of people killed and can also read more about why so many people died, getting a feel for the war and the stories behind it.


In [23]:
# create new dataframe with the columns of interest
db_loc = pandas.concat([data['Latitude'], data['Longitude'], data['CivilianKilled'], data['ExtendedSummary']], axis=1)
# delete no casualties
db_loc = db_loc.drop(db_loc[db_loc['CivilianKilled'] == 0].index)
# exclude not set column
db_loc = db_loc[db_loc.CivilianKilled.notnull() & db_loc.Longitude.notnull() & db_loc.Latitude.notnull()]

db_loc.head()


Out[23]:
Latitude Longitude CivilianKilled ExtendedSummary
16 31.000000 66.400002 50 explosion kills 50 near qaba mosque in spin bo...
111 34.158329 61.851940 1 s rel global counter terrorism forces ngo ro...
192 33.552219 70.045280 2 15 feb two persons were killed and six injured...
221 34.983330 69.667511 5 s rel global counter terrorism forces five a...
256 32.450001 67.416656 1 s rel global counter terrorism forces 1 x ng...
"Civilians Killed" data download

download this data


In [24]:
# write df as JSON, bring into appropriate format
obj = {'objects': list()}
for index, row in db_loc.iterrows():
    obj['objects'].append({'circle': {
                'coordinates': [row.Latitude, row.Longitude], 
                'death': int(row.CivilianKilled),
                'desc': row.ExtendedSummary
            }})

with open('../data/nb/civ_deaths.json', 'w') as data_file:    
    json.dump(obj, data_file)
    
pprint(obj['objects'][:5])


[{'circle': {'coordinates': [31.0, 66.40000153],
             'death': 50,
             'desc': 'explosion kills 50 near qaba mosque in spin boldak'}},
 {'circle': {'coordinates': [34.15832901, 61.85194016],
             'death': 1,
             'desc': 's rel  global counter terrorism forces  ngo robbed and killed'}},
 {'circle': {'coordinates': [33.55221939, 70.04528046],
             'death': 2,
             'desc': '15 feb two persons were killed and six injured in a powerful bomb blast near a police station in ali sher district in khost province southeastern afghanistan the blast occurred hours after unidentified gunmen fired dozens of rockets at a base of us led coalition forces in the province'}},
 {'circle': {'coordinates': [34.98332977, 69.66751099],
             'death': 5,
             'desc': 's rel  global counter terrorism forces  five afghan nationals working for the ngo swedish development foundation were killed 25 feb during an ambush of thier vehicle 31 miles north of kabul in kapisa province'}},
 {'circle': {'coordinates': [32.45000076, 67.41665649],
             'death': 1,
             'desc': 's rel  global counter terrorism forces  1 x ngo contractor killed 1 x taken hostage'}}]

WordClouds

Visualization 2

In this section a simple weighting scheme called tf-idf (term frequency-inverse document frequency) is used to find important words within each category. Once found, they are visualized in a word cloud. Since there are 153 categories, computing a word cloud for each of them is not feasible, so only some categories are picked. Data cleaning is also a crucial part of obtaining an acceptable result. The methods are explained in the introduction and in the respective code cells.

It seemed worthwhile to go a bit deeper into specific topics and, instead of only summarizing the data, gain something from it through further analysis. Word (or tag) clouds are in general a good way of presenting a lot of text as a weighted list of important words. Readers who want to inform themselves about more specific subjects can use these words when looking for more insights. While doing the analysis, our group could also investigate certain subjects we had been unaware of before.


In [25]:
# indices of categories that wordclouds are computed of
#for i, x in enumerate(n_occurrences.index):
#    print i, x
indices = [35,50,58,63,66,68,74,93,95]

# get categories names
categ = [n_occurrences.index[idx] for idx in indices]

# filter the dataset. Only keep those rows with the categories selected
db_wc_filt = data[data['Category'].isin(categ)]

In [26]:
db_wc_filt.head()


Out[26]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary ExtendedSummary Region AttackOn ... Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel Year Hour
184 4A9E37A8-367A-4B73-84C4-CB4C1F5F2C47 2004-02-14 00:00:00 Enemy Action assassination 2007-033-004004-0670 direct fire ana other 1 ana kia (s//rel gctf) tb attack home and assasinate an... s rel global counter terrorism forces talib... RC SOUTH ENEMY ... 67.416656 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2004 0
449 DE3A0314-FE7D-469D-B105-8592E91FEB86 2004-05-03 00:00:00 Non-Combat Event demonstration 2007-033-011133-0745 demonstration other (s//rel gctf) an estimated 100 or more support... s rel global counter terrorism forces an est... RC EAST NEUTRAL ... 68.440002 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 0
692 846645EA-37EB-44DE-BD21-61A6C0D6BB26 2004-06-11 12:43:00 Non-Combat Event demonstration 2007-033-011133-0635 demonstration anp other 15anp wia (s//rel gctf) (delayed report) tf victory repo... s rel global counter terrorism forces delaye... RC NORTH NEUTRAL ... 65.853333 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 12
1115 7A367320-2666-4379-86F1-CB455637E603 2004-08-15 10:00:00 Non-Combat Event demonstration 2007-033-011133-0854 demonstration other (fouo) at 1000hrs there was a demonstration, o... fouo at 1000hrs there was a demonstration of a... RC NORTH NEUTRAL ... 68.639160 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 10
1654 29DFB7ED-6A00-4740-8806-5E91950D39E3 2004-11-04 05:00:00 Non-Combat Event demonstration 2007-033-011133-0932 demonstration other cjsotf reports pak civ protest 4k ne lwara. at... combined joint special operations task force ... RC EAST NEUTRAL ... 69.494164 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 5

5 rows × 35 columns

Computing the tf-idf weights for each document in the corpus requires a series of steps (a small toy example follows the list):

  • Tokenize the corpus
  • Model the vector space
  • Compute the tf-idf weight for each document in the corpus
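
With the definitions used in the cells below, the weight of a word w in document d is tf(w, d) * idf(w), where tf(w, d) is the count of w in d divided by the number of tokens in d, and idf(w) = log(N / (1 + n_w)) with N the number of documents and n_w the number of documents containing w. A hand-computed toy example (the three two-word documents are invented):

import math

# three tiny invented "documents" (token lists)
docs = [['attack', 'kabul'], ['attack', 'patrol'], ['patrol', 'bridge']]

def tf(word, doc):
    return doc.count(word) / float(len(doc))

def idf(word, docs):
    n_containing = 1 + sum(1 for d in docs if word in d)   # the same +1 smoothing as below
    return math.log(len(docs) / float(n_containing))

# 'attack' occurs in 2 of 3 documents, so its idf is 0; 'bridge' occurs once and gets a positive weight
for word in ['attack', 'bridge']:
    print word, [round(tf(word, d) * idf(word, docs), 3) for d in docs]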

In [27]:
# uncomment this to download stopwords locally
#nltk.download()

We start by building one corpus document per category, joining all of its extended summaries.


In [28]:
branch_corpus = {}
for cat in categ:
    d = db_wc_filt[db_wc_filt['Category'] == cat]
    #for each category, join all text from the extended summary and save it in a dictionary
    branch_corpus[cat] = ' '.join(d['ExtendedSummary'])

In [29]:
# load english stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
# create a WordNetLemmatizer for stemming the tokens
wordnet_lemmatizer = WordNetLemmatizer() 

# compute number of times a word appears in a document
def freq(word, doc):
    return doc.count(word)

# get number of words in the document
def word_count(doc):
    return len(doc)

# compute TF
def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

# compute number of documents containing a particular word
def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count


# compute IDF
def idf(word, list_of_docs):
    return math.log(len(list_of_docs) /
            float(num_docs_containing(word, list_of_docs)))

# compute TF-IDF
def tf_idf(word, doc, list_of_docs):
    return (tf(word, doc) * idf(word, list_of_docs))


vocabulary = list()
docs = dict()

for key,text in branch_corpus.iteritems():
        # tokenize text for a particular category
        tokens = nltk.word_tokenize(text)
        # lower case the words
        tokens = [token.lower() for token in tokens]
        # only keep words, disregard digits
        tokens = [token for token in tokens if token.isalpha()]
        # disregard stopwords and words that are less than 2 letter in length
        tokens = [token for token in tokens if token not in stopwords and len(token) > 2]
        final_tokens = []
        final_tokens.extend(tokens)
        
        docs[key] = {'tf': {}, 'idf': {},
                        'tf-idf': {}, 'tokens': []}
        # compute TF 
        for token in final_tokens:
            # The term-frequency (Normalized Frequency)
            docs[key]['tf'][token] = tf(token, final_tokens)
            docs[key]['tokens'] = final_tokens
        vocabulary.append(final_tokens)

# compute IDF and TF-IDF
for doc in docs:
    for token in docs[doc]['tf']:
        # the Inverse-Document-Frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
        # the tf-idf
        docs[doc]['tf-idf'][token] = tf_idf(token, docs[doc]['tokens'], vocabulary)

In [30]:
# get the 10 words with highest TF-IDF score for category 'arrest'
sorted_dic = sorted(docs['arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]


Out[30]:
[('arrested', 0.0052880539302087),
 ('arrest', 0.004259688112565812),
 ('seized', 0.004001770541779557),
 ('detained', 0.0035221618975453026),
 ('suspicious', 0.0031809885021704776),
 ('detention', 0.0026508237518087312),
 ('sayeed', 0.0026508237518087312),
 ('subjects', 0.0026508237518087312),
 ('prosecution', 0.0026508237518087312),
 ('mines', 0.0023234662432177152)]

In order to create the word clouds, the tf-idf scores are rounded to the nearest integer. All words are then combined into one long string, separated by spaces, with each word repeated according to its rounded tf-idf score.

Note: we not only round the scores but also scale them by a factor of 1000 beforehand, since the raw tf-idf values are very small.
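
As a side note, repeating words is only necessary because generate() expects raw text; the scaled scores could alternatively be passed to the word cloud directly via generate_from_frequencies(), roughly as sketched below (the exact input type it accepts, a dict or a list of (word, weight) tuples, depends on the installed wordcloud version):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# hypothetical scores; in the notebook these would come from docs[cat]['tf-idf']
scores = {'arrested': 5, 'seized': 4, 'arrest': 4, 'detained': 4, 'detention': 3}

wordcloud = WordCloud().generate_from_frequencies(scores)  # older wordcloud versions expect scores.items()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()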


In [31]:
for cat in categ:
    for tup in docs[cat]['tf-idf'].items():
        #scale each tf-idf value by a factor of 1000 and round it
        docs[cat]['tf-idf'][tup[0]] = int(round(tup[1]*1000))

In [32]:
#see how they have been scaled and rounded
sorted_dic = sorted(docs['arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]


Out[32]:
[('arrested', 5),
 ('seized', 4),
 ('arrest', 4),
 ('detained', 4),
 ('detention', 3),
 ('sayeed', 3),
 ('suspicious', 3),
 ('subjects', 3),
 ('prosecution', 3),
 ('juma', 2)]

In [33]:
# generate text for wordclouds
doc_wordcloud = {}

for cat in categ:
    string = []
    for tup in docs[cat]['tf-idf'].items():
        if tup[1] > 0:
            #repeat each word to its scaled and rounded tf-idf score
            string.extend(np.repeat(tup[0],tup[1]))
            
    doc_wordcloud[cat] = ' '.join(string)

In [34]:
# generate wordclouds
for cat in categ:
    print "#### %s ####" % cat
    wordcloud = WordCloud().generate(doc_wordcloud[cat])
    img=plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


#### demonstration ####
#### murder ####
#### green-green ####
#### natural disaster ####
#### arrest ####
#### assassination ####
#### carjacking ####
#### downed aircraft ####
#### refugees ####

In [35]:
# save wordclouds to files so that they are readable in JS for visualizing with D3
def savefile(category):
    filename = '../data/nb/' + category
    filename = filename.replace(" ", "_")
    with open(filename, "w") as text_file:
        text_file.write(doc_wordcloud[category])
    #print '* [download {0} data](../data/csv/v2/{0})'.format(category)

for cat in categ:
    savefile(cat)

LDA

Visualization 3

In this section Latent Dirichlet Allocation (LDA) is used, a topic model which generates topics based on word frequency from a set of documents.

Each extended summary in our data is considered a document. Data cleaning needs to be done on all documents, since it is crucial for generating a useful topic model; the methods are similar to the processing for the word clouds. After cleaning, the LDA model is generated. To do so, the frequency of each term in each document is needed, so a document-term matrix (bag of words) is constructed with the gensim module. Once the bag of words is created, the LDA model can be trained.
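
To make the bag-of-words step concrete, here is a toy sketch with two invented, already-cleaned token lists; doc2bow maps every document to (token id, count) pairs, which is the corpus format that LdaModel consumes below:

from gensim import corpora

texts = [['police', 'checkpoint', 'attack'], ['police', 'protest', 'protest']]

dictionary = corpora.Dictionary(texts)                 # token -> integer id mapping
corpus = [dictionary.doc2bow(text) for text in texts]
print corpus   # one list of (token id, count) pairs per document; 'protest' gets count 2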

LDA is another method of analyzing the text that assigns meaningful weights to the words of each topic. Since the weighting scheme differs from tf-idf, the extracted result can be used for another visualization approach, giving the reader an impression of importance through circle diameters. It is a good way of showing what typical war topics were mainly about with just a few words.


In [36]:
# will be used for text cleaning
stopwords = set(nltk.corpus.stopwords.words('english')) # load english stopwords
stoplist = set('for a of the and to in'.split())
manual_list=set('none nan get http https'.split())
wordnet_lemmatizer = WordNetLemmatizer() # create a WordNetLemmatizer to use for stemming the summaries

In [37]:
# take some interesting categories
documents = [summary for summary in db_wc_filt['ExtendedSummary']]

For cleaning the text:

  • Lemmatize words (reduce them to their root form)
  • Only keep words and disregard digits
  • Make them lowercase so that there is no difference between "Then" and "then", for instance
  • Remove stopwords, i.e. common English words that carry no meaning
  • Remove words of two letters or fewer, since they usually do not provide any meaning

In [38]:
clean_dict={}
i=0;
for document in documents:
    try:
        cleaned_html = lxml.html.fromstring(str(document)).text_content()
    except:
        cleaned_html=""
    cleaned_html = cleaned_html.encode('ascii', 'ignore')   # drop non-ascii characters
    temp_token_list = nltk.word_tokenize(str(cleaned_html))
    content = [wordnet_lemmatizer.lemmatize(w.lower()) for w in temp_token_list if 
               ( w.isalpha() and w.lower() not in stopwords and w.lower() not in stoplist \
                and w.lower() not in manual_list and len(w)>2)]
    
    # save the cleaned tokens
    clean_dict[i]=content
    i+=1

In [39]:
# words that appear once are not needed
frequency = defaultdict(int)
for text in clean_dict.values():
    for token in text:
        frequency[token] += 1

for key in clean_dict:
    clean_dict[key] = [token for token in clean_dict[key] if frequency[token] > 1]

In [40]:
dictionary = corpora.Dictionary(clean_dict.values())
corpus = [dictionary.doc2bow(text) for text in clean_dict.values()]

In [41]:
# Compute LDA. Set the number of topics to 7 and train with 20 passes over the corpus.
lda = models.ldamodel.LdaModel(corpus, num_topics=7, id2word=dictionary, passes=20)

In [42]:
lda.print_topics(num_topics=7, num_words=20)


Out[42]:
[(0,
  u'0.070*national + 0.064*afghan + 0.042*police + 0.027*force + 0.018*army + 0.014*local + 0.014*reported + 0.013*report + 0.013*security + 0.011*task + 0.010*killed + 0.007*shot + 0.007*action + 0.007*wounded + 0.006*assistance + 0.006*event + 0.006*officer + 0.006*base + 0.006*district + 0.006*soldier'),
 (1,
  u'0.019*road + 0.011*village + 0.010*flood + 0.009*local + 0.008*amp + 0.007*damage + 0.007*team + 0.007*force + 0.007*river + 0.007*bridge + 0.006*provincial + 0.006*elder + 0.006*reconstruction + 0.005*valley + 0.005*vehicle + 0.005*work + 0.005*destroyed + 0.005*area + 0.005*also + 0.005*one'),
 (2,
  u'0.018*driver + 0.013*truck + 0.012*district + 0.011*two + 0.009*base + 0.009*information + 0.007*afghan + 0.007*afghanistan + 0.007*taliban + 0.007*hijacked + 0.007*cargo + 0.007*khan + 0.007*incident + 0.007*three + 0.007*time + 0.006*one + 0.006*air + 0.006*operation + 0.006*operating + 0.006*province'),
 (3,
  u'0.026*force + 0.020*provincial + 0.018*reconstruction + 0.017*team + 0.015*amp + 0.012*reported + 0.011*task + 0.010*assistance + 0.008*afghanistan + 0.008*report + 0.008*security + 0.007*area + 0.006*international + 0.006*demonstration + 0.006*kabul + 0.006*air + 0.006*apos + 0.006*support + 0.005*route + 0.005*response'),
 (4,
  u'0.031*demonstration + 0.017*reported + 0.017*update + 0.013*report + 0.012*force + 0.012*event + 0.011*officer + 0.010*local + 0.009*afghan + 0.009*closed + 0.008*kcp + 0.008*national + 0.008*peaceful + 0.007*people + 0.006*security + 0.006*protest + 0.005*time + 0.005*null + 0.005*anp + 0.005*police'),
 (5,
  u'0.021*afghan + 0.018*police + 0.016*national + 0.013*protest + 0.013*force + 0.009*group + 0.009*area + 0.008*local + 0.008*demonstration + 0.007*land + 0.006*report + 0.006*reported + 0.006*task + 0.006*border + 0.006*point + 0.005*two + 0.005*elder + 0.005*one + 0.005*district + 0.005*coalition'),
 (6,
  u'0.026*amp + 0.020*force + 0.011*quot + 0.011*apos + 0.009*valley + 0.008*conducted + 0.008*site + 0.008*said + 0.007*resident + 0.007*operation + 0.006*element + 0.006*base + 0.005*team + 0.005*weapon + 0.005*police + 0.005*crash + 0.005*government + 0.005*abdul + 0.005*task + 0.005*mosque')]

In [43]:
topics = [lda.show_topic(i) for i in range(7)]

In [44]:
# write to json formatted file for D3
LDA_dicts = {'name': 'LDA',
            'children': [{
            'name': 'Topics',
            'children': []
        }]};

for index, topic in enumerate(topics):
    name = 'Topic {}'.format(index+1)
    dic_out={'name': name , 'children':[]}
    children = []
    for topic in topics[index]:
        dic = {'name':[], 'size':[]}
        dic['name'] = topic[0]
        dic['size'] = topic[1]
        children.append(dic)
    dic_out['children'] = children
    LDA_dicts['children'][0]['children'].append(dic_out)
"LDA circle packing" data download

download this data


In [45]:
# dump dict to JSON file
with open('../data/nb/LDA_topics.json', 'w') as fp:
    json.dump(LDA_dicts, fp)

In [46]:
LDA_dicts


Out[46]:
{'children': [{'children': [{'children': [{'name': u'national',
       'size': 0.069868870639559294},
      {'name': u'afghan', 'size': 0.063605654662812441},
      {'name': u'police', 'size': 0.041889127640513675},
      {'name': u'force', 'size': 0.027268690439316807},
      {'name': u'army', 'size': 0.017952194480238451},
      {'name': u'local', 'size': 0.014141723900489964},
      {'name': u'reported', 'size': 0.013789218636012908},
      {'name': u'report', 'size': 0.013473817777303871},
      {'name': u'security', 'size': 0.012525788957855492},
      {'name': u'task', 'size': 0.011019235892127307}],
     'name': 'Topic 1'},
    {'children': [{'name': u'road', 'size': 0.01911937397857745},
      {'name': u'village', 'size': 0.011285768647962036},
      {'name': u'flood', 'size': 0.010340603144697334},
      {'name': u'local', 'size': 0.0087371411364523185},
      {'name': u'amp', 'size': 0.0076720521153163411},
      {'name': u'damage', 'size': 0.0072668211177573461},
      {'name': u'team', 'size': 0.0069400595175719906},
      {'name': u'force', 'size': 0.0069070461539689548},
      {'name': u'river', 'size': 0.0068917178572562993},
      {'name': u'bridge', 'size': 0.0066828835641202754}],
     'name': 'Topic 2'},
    {'children': [{'name': u'driver', 'size': 0.018028862455535682},
      {'name': u'truck', 'size': 0.013344417713139527},
      {'name': u'district', 'size': 0.012191255905178288},
      {'name': u'two', 'size': 0.01098479914340657},
      {'name': u'base', 'size': 0.0093694238818415911},
      {'name': u'information', 'size': 0.0086251028073260783},
      {'name': u'afghan', 'size': 0.007456443965182225},
      {'name': u'afghanistan', 'size': 0.007356108090654772},
      {'name': u'taliban', 'size': 0.0073113164244047431},
      {'name': u'hijacked', 'size': 0.0072857216307342992}],
     'name': 'Topic 3'},
    {'children': [{'name': u'force', 'size': 0.025827890342217719},
      {'name': u'provincial', 'size': 0.019963778585861253},
      {'name': u'reconstruction', 'size': 0.017668355470018268},
      {'name': u'team', 'size': 0.01745542437059127},
      {'name': u'amp', 'size': 0.014561566303860126},
      {'name': u'reported', 'size': 0.012305291359687715},
      {'name': u'task', 'size': 0.010623446252894462},
      {'name': u'assistance', 'size': 0.010470182838195006},
      {'name': u'afghanistan', 'size': 0.0083447359449635803},
      {'name': u'report', 'size': 0.0082570203788887506}],
     'name': 'Topic 4'},
    {'children': [{'name': u'demonstration', 'size': 0.030826337833415141},
      {'name': u'reported', 'size': 0.017211479695076255},
      {'name': u'update', 'size': 0.017206349205865897},
      {'name': u'report', 'size': 0.013011402610355294},
      {'name': u'force', 'size': 0.012360031814064377},
      {'name': u'event', 'size': 0.011752149042644398},
      {'name': u'officer', 'size': 0.011293210230836494},
      {'name': u'local', 'size': 0.0098379900126823894},
      {'name': u'afghan', 'size': 0.0093851963989767789},
      {'name': u'closed', 'size': 0.0093230420307748255}],
     'name': 'Topic 5'},
    {'children': [{'name': u'afghan', 'size': 0.021177817945350157},
      {'name': u'police', 'size': 0.017703863957756721},
      {'name': u'national', 'size': 0.016220688916579712},
      {'name': u'protest', 'size': 0.012860815867027213},
      {'name': u'force', 'size': 0.012542270611238845},
      {'name': u'group', 'size': 0.009375250976456511},
      {'name': u'area', 'size': 0.0085342180510033418},
      {'name': u'local', 'size': 0.0083285940465644154},
      {'name': u'demonstration', 'size': 0.0082427413272877833},
      {'name': u'land', 'size': 0.0071096160800931405}],
     'name': 'Topic 6'},
    {'children': [{'name': u'amp', 'size': 0.026443711738154279},
      {'name': u'force', 'size': 0.020242095335857697},
      {'name': u'quot', 'size': 0.011098598318549492},
      {'name': u'apos', 'size': 0.010607676577690895},
      {'name': u'valley', 'size': 0.0091425060413576759},
      {'name': u'conducted', 'size': 0.0081088822691027154},
      {'name': u'site', 'size': 0.0078841978241618073},
      {'name': u'said', 'size': 0.0077528701470336433},
      {'name': u'resident', 'size': 0.0071489449803598913},
      {'name': u'operation', 'size': 0.0066612433166570537}],
     'name': 'Topic 7'}],
   'name': 'Topics'}],
 'name': 'LDA'}

KNN - Election and War incidents

Visualization 4

For this visualization, the areas with the most security incidents during the 2009 election are plotted, with grid points for four very common war incident types laid over them. To make a proper and meaningful prediction we decided to use the unbalanced dataset, keeping in mind that accuracy will drop for the smaller classes. On the other hand, with an unbalanced set we stay closer to the ground truth: the map shows which category types are most likely to be experienced, and the ISAF forces can derive preventive measures for specific regions that take the density of incidents into account. The distributions of the top incidents in 2009 are, in any case, not spread too far:

  • ied explosions: 42%
  • direct fire: 21%
  • escalation of force: 8%
  • indirect fire: 6%

Since the reader is by now more familiar with the overall data, this map contains more information, showing the most probable type of attack in each region during the election period and where the hotspots of incidents are. It fits the storyline that even fundamental things like elections are a serious problem and far from a normal everyday situation in a war-troubled country.
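
Before trusting such a map, a quick sanity check is useful: compare the k-NN accuracy on a held-out split against the trivial majority-class baseline, which an unbalanced set makes easy to confuse with real skill. The sketch below uses random, made-up coordinates and labels instead of the real incidents, so only the procedure carries over:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer scikit-learn
from sklearn.metrics import accuracy_score

np.random.seed(0)
# made-up lat/lon pairs roughly inside Afghanistan and an unbalanced set of labels
X = np.random.rand(200, 2) * [11, 17] + [28, 59]
y = np.random.choice(['ied explosion'] * 4 + ['direct fire'] * 2 + ['indirect fire'], 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print "k-NN accuracy:           ", accuracy_score(y_test, knn.predict(X_test))
print "majority-class baseline: ", np.mean(y_test == 'ied explosion')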


In [47]:
data_el = pandas.read_csv('../data/csv/elections_security_incidents.csv')
# clean rows without useful numbers
data_el = data_el.dropna()
data_el.head()


Out[47]:
Longitude Latitude Province Name District Name August 2010 May - August 2010 2010 2009 and 2010 Election Day 2009 Election Day 2010
0 70.82 37.27 Badakhshan Arghanj Khwa 0 1 2 5 0 0
1 70.43 37.06 Badakhshan Argo 2 5 8 19 0 0
2 70.90 37.03 Badakhshan Baharak 0 6 7 8 0 0
3 70.47 36.87 Badakhshan Darayem 1 7 7 8 0 2
5 71.15 38.13 Badakhshan Darwaz-e-Balla 0 0 0 2 0 0

In [48]:
print "election provinces total: ", len(data_el)
print "total incidents: ", int(data_el['2009 and 2010'].sum())


election provinces total:  389
total incidents:  21849

In [49]:
# incidents year 2009
data09 = data[data['Year'] == 2009]
# incidents injured or killed civilians
data09 = data09.drop(data09[(data09['CivilianWounded'] == 0) & (data09['CivilianKilled'] == 0)].index)
# exclude unset columns
data09 = data09[data09.CivilianKilled.notnull() & data09.CivilianWounded.notnull()]

# Distribution of categories top 4
c_occurrences = data09['Category'].value_counts()
c_occurrences[0:4]


Out[49]:
ied explosion          433
direct fire            214
escalation of force     80
indirect fire           63
Name: Category, dtype: int64

In [50]:
# note: DataFrame.drop(idx) returns a new frame, so this loop does not actually remove
# anything from data09; the four focus categories are extracted as separate subsets in
# the next cell, and the percentages below are therefore relative to all 2009 incidents
# with civilian casualties
for idx, row in data09.iterrows():
    if row['Category'] not in c_occurrences.index[0:4]:
        data09.drop(idx)

In [51]:
# split data to individual sets
def get_subset(dataset, category):
    df = dataset[dataset['Category'] == category]

    return df

# subset for categories
df_IED = get_subset(data09, 'ied explosion')
df_DIRECT = get_subset(data09, 'direct fire')
df_FORCE = get_subset(data09, 'escalation of force')
df_INDIRECT = get_subset(data09, 'indirect fire')

In [52]:
# see balance of data
ied = df_IED.shape[0]
direct = df_DIRECT.shape[0]
force = df_FORCE.shape[0]
indirect = df_INDIRECT.shape[0]

total_amount = len(data09)
print "total incidents with civilian casualties: ", total_amount

print "Percentage of ied explosions: {}%".format(ied/(total_amount/100))
print "Percentage of direct fire: {}%".format(direct/(total_amount/100))
print "Percentage of escalation of force: {}%".format(force/(total_amount/100))
print "Percentage of indirect fire: {}%".format(indirect/(total_amount/100))


total incidents with civilian casualties:  1041
Percentage of ied explosions: 41.5946205572%
Percentage of direct fire: 20.5571565802%
Percentage of escalation of force: 7.68491834774%
Percentage of indirect fire: 6.05187319885%

Some geodata is plotted to see the distribution. First, a function is needed that extracts the geo-information from each dataset, as shown in the cell below.


In [53]:
def get_all_geodata(dataset):

    # filter bad rows
    dataset = dataset[dataset.Longitude.notnull() & dataset.Latitude.notnull()]
    
    # only activity in Afghanistan
    include = (dataset.Latitude < 39) & (dataset.Latitude > 28) & (dataset.Longitude > 59) & (dataset.Longitude < 76)
    # get data in the format geoplotlib requires. We put the geodata in a dictionary structured as follows
    geo_data = {
        "lat": dataset.loc[include].Latitude.tolist(), 
        "lon": dataset.loc[include].Longitude.tolist()
    }
    return geo_data

# create the dictionary with lat and lon
geo_data_IED = get_all_geodata(df_IED)
geo_data_DIRECT = get_all_geodata(df_DIRECT)
geo_data_FORCE = get_all_geodata(df_FORCE)
geo_data_INDIRECT = get_all_geodata(df_INDIRECT)
geo_data_ELECTION = get_all_geodata(data_el)

All latitude and longitude information is passed to geoplotlib's kde() function, which computes a kernel density estimation, also known as a heatmap. The BoundingBox is needed to fit the projection as closely as possible to Afghanistan; its approximate boundaries come from the minima and maxima of the data points after the outliers have been removed in get_all_geodata(). The maps are rendered with inline() to make them visible in the notebook.
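
Conceptually, the heatmap is a smoothed two-dimensional density of the incident coordinates; a rough, unsmoothed analogue can be computed with numpy alone (the handful of coordinates below is made up, standing in for the geo_data dictionaries):

import numpy as np

# invented incident coordinates
lat = [34.5, 34.6, 31.6, 31.5, 34.5]
lon = [69.2, 69.1, 65.7, 65.8, 69.2]

# a coarse 2D histogram is the unsmoothed counterpart of geoplotlib's kde() heatmap
counts, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=5)
print counts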


In [54]:
# plot given coordinate input
def geo_plot(geodata):

    # bounding box on the minima and maxima of the data
    geoplotlib.set_bbox(
        BoundingBox(
            max(geodata['lat']), 
            max(geodata['lon']), 
            min(geodata['lat']), 
            min(geodata['lon'])
        ));
    
    # kernel density estimation visualization
    geoplotlib.kde(geodata, bw=5, cut_below=1e-3, cmap='hot', alpha=170)
    # google tiles with lyrs=y ... hybrid
    geoplotlib.tiles_provider({
        'url': lambda zoom, xtile, ytile: 'https://mt1.google.com/vt/lyrs=y&hl=en&x=%d&y=%d&z=%d' % (xtile, ytile, zoom ),
        'tiles_dir': 'DTU-social_data',
        'attribution': 'DTU 02806 Social Data Analysis and Visualization'
    })
    
    geoplotlib.inline();

In [55]:
print 'ied explosion'
geo_plot(geo_data_IED)
print 'escalation of force'
geo_plot(geo_data_FORCE)
print 'indirect fire'
geo_plot(geo_data_INDIRECT)
print 'direct fire'
geo_plot(geo_data_DIRECT)
print 'election'
geo_plot(geo_data_ELECTION)


ied explosion
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.13252371406381036)
escalation of force
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.034221570399321423)
indirect fire
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.029033994220663158)
direct fire
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.087317618275628936)
election
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.013160187040692951)

A $100 \times 100$ grid is laid over the span of geo coordinates of the voting areas. To build the grid, we first set the boundaries at the minimum and maximum latitude/longitude and then use numpy's arange() function to create evenly spaced values within this interval (the points of the grid). geoplotlib.dot() expects a dictionary containing the latitude and longitude of the grid points. The grid and geographical data can later be reused for other plots.
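
An equivalent way of building the same grid with numpy's meshgrid instead of itertools.product (the bounds below are illustrative placeholders; in the cell that follows they come from geo_data_ELECTION). Note that linspace includes both end points, whereas the arange call below stops just short of the maximum, a negligible difference for a plotting grid:

import numpy as np
import pandas

lat_min, lat_max = 29.0, 38.5   # illustrative bounds
lon_min, lon_max = 60.5, 74.9

lat_vals = np.linspace(lat_min, lat_max, 100)
lon_vals = np.linspace(lon_min, lon_max, 100)
lat_grid, lon_grid = np.meshgrid(lat_vals, lon_vals)

grid = pandas.DataFrame({'lat': lat_grid.ravel(), 'lon': lon_grid.ravel()})
print grid.shape   # (10000, 2)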


In [56]:
sampledData = pandas.concat([df_IED,df_FORCE,df_INDIRECT,df_DIRECT])

def plot_knn(K):
    # create lists with range from min to max of sampled data. Makes 100 steps by dividing difference by 100
    gridXVals = pandas.Series(
        np.arange(
            min(geo_data_ELECTION['lat']),
            max(geo_data_ELECTION['lat']), 
            abs((min(geo_data_ELECTION['lat']) - max(geo_data_ELECTION['lat'])) / 100.0))
    )
    gridYVals = pandas.Series(
        np.arange(
            min(geo_data_ELECTION['lon']), 
            max(geo_data_ELECTION['lon']), 
            abs((max(geo_data_ELECTION['lon']) - min(geo_data_ELECTION['lon'])) / 100.0))
    )
    gridYVals.sort_values(ascending=False, inplace=True)

    # convert product of the two lists to grid values
    product = list(itertools.product(gridXVals, gridYVals))
    # convert product tuples into dataframe
    grid = pandas.DataFrame({'lat': map(lambda coord: coord[0], product), 'lon': map(lambda coord: coord[1], product)})

    # X maxtrix from coordinates
    X = sampledData[['Latitude', 'Longitude']]
    # Y is the category of these - The label to train from
    Y = sampledData["Category"]

    neigh = KNeighborsClassifier()
    neigh.n_neighbors = K
    #Train
    neigh.fit(X,Y)
    
    # Predict
    p_string = 'Prediction_k{}'.format(K)
    grid[p_string] = pandas.Series(neigh.predict(grid))
    geoplotlib.set_bbox(BoundingBox(max(grid['lat']), max(grid['lon']), min(grid['lat']),min(grid['lon'])))
    geoplotlib.tiles_provider('toner-lite')

    # Plot dots
    print p_string
    geoplotlib.dot(grid[grid[p_string] == 'indirect fire'], color=[0,255,0,255], point_size=2)
    geoplotlib.dot(grid[grid[p_string] == 'escalation of force'], color=[255,0,0,255], point_size=2)
    geoplotlib.dot(grid[grid[p_string] == 'ied explosion'], color=[0,0,255,255], point_size=2)
    geoplotlib.dot(grid[grid[p_string] == 'direct fire'], color=[255,255,0,255], point_size=2)
    geoplotlib.inline()
    return grid

In [57]:
grid = plot_knn(5)
grid = grid.append(plot_knn(10), ignore_index=True)
grid = grid.append(plot_knn(20), ignore_index=True)


Prediction_k5
Prediction_k10
Prediction_k20

Since an unbalanced data set is used to reflect the ground truth of the most common attacks, the predictions are skewed towards IED explosions and direct fire. Especially when K is set to 20 nearest neighbours, these two categories dominate.
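
To make the imbalance explicit, the class counts of the sampled training data can be inspected directly; a minimal sketch using the sampledData frame defined above:

# minimal sketch: inspect how unbalanced the k-NN training data is
# (sampledData is the concatenation of df_IED, df_FORCE, df_INDIRECT and df_DIRECT above)
print(sampledData['Category'].value_counts())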


In [58]:
# to json file
with open('../data/nb/knn.json', 'w') as outfile:
    json.dump(json.loads(grid.to_json(orient='records')), outfile)
grid.head()


Out[58]:
Prediction_k10 Prediction_k20 Prediction_k5 lat lon
0 NaN NaN ied explosion 29.88 73.228
1 NaN NaN ied explosion 29.88 73.106
2 NaN NaN ied explosion 29.88 72.984
3 NaN NaN ied explosion 29.88 72.862
4 NaN NaN ied explosion 29.88 72.740
"k-NN prediction" data download

download this data


In [59]:
# extract election data
df_el = pandas.concat([data_el['Longitude'],data_el['Latitude'], data_el['Province Name'],
              data_el['District Name'], data_el['2009 and 2010']], axis=1)
df_el = df_el.rename(columns = {
        '2009 and 2010':'total', 
        'District Name':'district', 
        'Province Name': 'province', 
        'Latitude': 'lat', 
        'Longitude': 'lon'
    }                
)
df_el.head()


Out[59]:
lon lat province district total
0 70.82 37.27 Badakhshan Arghanj Khwa 5
1 70.43 37.06 Badakhshan Argo 19
2 70.90 37.03 Badakhshan Baharak 8
3 70.47 36.87 Badakhshan Darayem 8
5 71.15 38.13 Badakhshan Darwaz-e-Balla 2
"election summary" data download

download this data


In [60]:
df_el.to_csv('../data/nb/election.csv', sep=',', encoding='utf-8', index=False)

Classifier Analysis

Visualization 5 - bar chart

We think it would be interesting to have a proper classifier capable of predicting a war category just by analyzing the content of its summary. This is done for all incidents in Afghanistan's capital, Kabul. To do so, useless data again has to be removed from the summaries and the acronyms substituted with the corresponding words. Afterwards a matrix is created with the incidents in its rows and every unique word in its columns; with this matrix it is simple to represent the words that every summary contains. Four classifiers were used in this project in order to see the differences between them.
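
The incident-by-word matrix is assembled by hand in the cells below; for comparison, scikit-learn's CountVectorizer would build an equivalent bag-of-words matrix in a few lines. A minimal sketch, assuming the cleaned token lists (tokens_per_summary_kabul, created below) are joined back into strings:

# minimal sketch: an alternative bag-of-words matrix built with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# join the cleaned tokens of every Kabul summary back into one string per incident
docs = [' '.join(tokens_per_summary_kabul[k]) for k in sorted(tokens_per_summary_kabul)]
vectorizer = CountVectorizer(min_df=4)          # keep words that appear in at least 4 summaries
bow_matrix = vectorizer.fit_transform(docs)     # sparse matrix: incidents x unique words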

To train the models, all words from the "ExtendedSummary" were used as the only feature; more features might be added to increase the score for predicting the "Category", but the necessary normalization would have been time consuming, and we tend to believe that the score is quite accurate given the many different categories. To keep the bar chart readable, only the 11 categories with the most incidents were picked; with more categories it would make sense to flip the x- and y-axis of the bar chart.
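
Should more summary-based features be added, the word counts could be normalized cheaply with scikit-learn's TF-IDF transformer instead of by hand; a minimal sketch, reusing the hypothetical bow_matrix from the CountVectorizer sketch above:

# minimal sketch: TF-IDF normalization of the word-count features
# (bow_matrix is the sparse incidents x words matrix from the sketch above)
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
normalized_features = tfidf.fit_transform(bow_matrix)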

This problem description might only attract scientific readers; nevertheless we created an animated map that uses clustering to show the potential benefit of having models that are able to fit the training data. By evaluating the knee point we made sure to use the optimal number of clusters for our visualization, since there is no advantage in adding more means when the visualization does not become clearer with more information.
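
A programmatic variant of this knee-point evaluation is sketched below: compute the k-means inertia over a range of K and pick the K at the sharpest bend of the curve (largest second difference). This is only a heuristic sketch; the notebook itself reads the knee off the inertia plots further down.

# minimal sketch: pick the knee of the k-means inertia curve automatically
# (X is a DataFrame or array of lat/lon points, e.g. the 'enemy' incidents used further down)
def knee_point(X, k_range=range(2, 10)):
    inertias = [cluster.KMeans(n_clusters=k).fit(X).inertia_ for k in k_range]
    # second differences are largest where the curve bends most sharply
    bends = np.diff(inertias, n=2)
    return list(k_range)[int(np.argmax(bends)) + 1]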

The best classification result is achieved by the random forest method, an extension of the decision tree, which itself also performs well. The random forest is more complex, but the added complexity helps to prevent the individual trees from overfitting: whereas a single data point runs through the branches of one tree, a forest fits many decision trees together, each built from a different subset of the training data. Covariates are also chosen randomly between the different trees, which adds diversity. In our case overfitting is less of a concern anyway, because the given training data has a low entropy (not too many different outcomes). The classification with k-NN also achieves a good score, since it is, along with decision trees, suitable for categorical (and numerical) data. Decision trees have the advantage of handling heterogeneous data better than k-NN, which relies on a distance function to classify elements. k-NN also needs a considerable amount of time to make a prediction, which might become a problem on bigger datasets. Finally, the Gaussian Naive Bayes classifier would only be worth considering if execution time were a major concern, since it scales well with more features and predictors. Choosing the right classifier depends on where the priorities lie, but given the problem description the most accurate score is probably the most important criterion.
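
The comparison can be reproduced compactly by fitting all four classifiers on the same split and printing their test accuracy with accuracy_score (already imported above); a minimal sketch, assuming the X_train/X_test/Y_train/Y_test split created further down:

# minimal sketch: one loop to compare the four classifiers on the same train/test split
classifiers = {
    'random forest': RandomForestClassifier(n_estimators=50),
    'decision tree': tree.DecisionTreeClassifier(),
    'k-NN (k=3)': KNeighborsClassifier(n_neighbors=3),
    'Gaussian Naive Bayes': GaussianNB(),
}
for name, model in classifiers.items():
    model.fit(X_train, Y_train)
    print('{}: {:.3f}'.format(name, accuracy_score(Y_test, model.predict(X_test))))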


In [61]:
# get the rows that contain incidents in the capital, Kabul
crimes_Kabul = data[data['Region'] == 'RC CAPITAL']
crimes_Kabul.index = range(3191)
summaries = crimes_Kabul['Summary']
categories = crimes_Kabul['Category']

In [62]:
def hasNumbers(word):
    return any(i.isdigit() for i in word)

In [63]:
# remove punctuation
tokenizer = RegexpTokenizer(r'\w+')

incidence = 0

# Dictionary that will contain all the tokens per summary.
tokens_per_summary_kabul = {}


for s in summaries:
    # get all the tokens of the summary S, which still contains the acronyms
    tokens_with_acronyms = tokenizer.tokenize(str(s))
    
    # replace acronyms
    tokens_no_acronyms = []
    for t in tokens_with_acronyms:
        t = t.lower()
        if t in glossary:
            tokens_no_acronyms.append(glossary[t].lower())
        else:
            tokens_no_acronyms.append(t)
            
    # now remove the stopwords, single characters and tokens containing digits
    tokens = []
    for w in tokens_no_acronyms:
        if w not in nltk.corpus.stopwords.words('english') and len(w) > 1 and (hasNumbers(w) == False):
            tokens.append(w)
            
    # store cleaned data in dictionary
    tokens_per_summary_kabul[incidence] = tokens
    incidence = incidence+1

In [64]:
# use pickle to store the dictionaries of the summaries, in order to avoid running the cleaning each time.
pickle.dump(tokens_per_summary_kabul, open("../data/nb/tokens_per_summary_kabul.p","wb"))

In [65]:
# load the data stored using pickle
tokens_per_summary_kabul = pickle.load(open( "../data/nb/tokens_per_summary_kabul.p","rb"))

Create a list of unique words


In [66]:
tokens_per_summary_sets = {} # each key contains a list with unique words of the summary
big_list = [] # all the unique lists concatenated; a word thus appears once for every summary it occurs in

for key in tokens_per_summary_kabul:
    tokens_per_summary_sets[key] = list(set(tokens_per_summary_kabul[key]))
    big_list += tokens_per_summary_sets[key]

counter = Counter(big_list)

# create a set with the unique words that appear in more than 3 summaries.
unique_words = set()
for c in counter:
    if counter[c] > 3:
        unique_words.add(c)

In [67]:
# create the dictionary that maps each unique word with an index
unique_words2 = list(unique_words)
d = {unique_words2[i].lower() : i for i in xrange(len(unique_words2))}

Create a matrix with incident summaries in rows and unique words in columns, and set each value as the number of occurrences of a specific word in a specific incident summary.


In [68]:
# remove punctuation
tokenizer = RegexpTokenizer(r'\w+')

feature_vectors_kabul = [[0 for x in range(len(list(unique_words)))] for x in range(3191)]

# create the feature vectors
row = 0
for s in summaries:
    # get all the tokens of the summary S, which still contains the acronyms
    tokens_with_acronyms = tokenizer.tokenize(str(s))
    
    # replace acronyms
    tokens_no_acronyms = []
    for t in tokens_with_acronyms:
        t = t.lower()
        if t in glossary:
            tokens_no_acronyms.append(glossary[t].lower())
        else:
            tokens_no_acronyms.append(t)
            
    # remove stopwords
    for w in tokens_no_acronyms:
        if (w in d):
            col = d[w]
            feature_vectors_kabul[row][col] += 1
    row += 1

In [69]:
cp.dump(feature_vectors_kabul, open("../data/nb/feature_vectors_kabul.p","wb"))
feature_vectors_kabul = cp.load(open("../data/nb/feature_vectors_kabul.p","rb"))

Assign an index to each category.


In [70]:
unique_categories = list(set(crimes_Kabul['Category']))

In [71]:
cat2index = {}
index2cat = {}

for i in range (len(unique_categories)):
    cat2index[unique_categories[i]] = i
    
for i in range(len(unique_categories)):
    index2cat[i] = unique_categories[i]

In [72]:
dic_categories = {}

# for each incident, look up its category and store the corresponding index
for i in range(len(categories)):
    dic_categories[i] = cat2index[categories[i]]

In [73]:
top11_categories = ['escalation of force','ied explosion','demonstration','indirect fire','meeting','accident','direct fire',
                    'medevac','ied found/cleared','cache found/cleared','meeting - security']

RANDOM FOREST


In [74]:
X_train, X_test, Y_train, Y_test = train_test_split(feature_vectors_kabul, dic_categories.values(), test_size=0.2)

In [75]:
forest = RandomForestClassifier(n_estimators = 50)

In [76]:
%timeit -n1 -r3 forest.fit(X_train,Y_train)


1 loops, best of 3: 2.31 s per loop

In [77]:
%timeit -n1 -r3 rf_predicted = forest.predict(X_test)
rf_predicted = forest.predict(X_test)


1 loops, best of 3: 144 ms per loop

In [78]:
forest.score(X_test,Y_test)


Out[78]:
0.75743348982785608

In [79]:
# count correct and incorrect predictions for each of the top 11 categories
results_rf = {}

for i in top11_categories:
    results_rf[i] = [0,0]
    
for i in range(len(Y_test)):
    key = index2cat[Y_test[i]]
    if key in results_rf.keys():                
        if Y_test[i] == rf_predicted[i]: # if correct prediction
            results_rf[index2cat[Y_test[i]]][0] += 1
        else:
            results_rf[index2cat[Y_test[i]]][1] += 1
            
for i in range(len(top11_categories)):
    c = results_rf[top11_categories[i]][0]
    b = results_rf[top11_categories[i]][1]
    results_rf[top11_categories[i]][0] = (c/(c+b))*100
    results_rf[top11_categories[i]][1] = (b/(c+b))*100
"random forest" data download

download this data


In [80]:
# save the data
data_rf = pandas.DataFrame()

data_rf['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_rf[i][0])
    bad.append(results_rf[i][1])
data_rf['Correct'] = correct
data_rf['Incorrect'] = bad
# save random forest results to csv
data_rf.to_csv('../data/nb/rf_top11_p.csv', sep=',', index=False)

K-NEIGHBORS


In [81]:
neigh = KNeighborsClassifier(n_neighbors=3)

In [82]:
%timeit -n1 -r3 neigh.fit(X_train, Y_train)


1 loops, best of 3: 841 ms per loop

In [83]:
%timeit -n1 -r3 nc_predicted = neigh.predict(X_test)
nc_predicted = neigh.predict(X_test)


1 loops, best of 3: 11.6 s per loop

In [84]:
neigh.score(X_test,Y_test)


Out[84]:
0.59624413145539901

In [85]:
results_neigh = {}

for i in top11_categories:
    results_neigh[i] = [0,0]
    
for i in range(len(Y_test)):
    key = index2cat[Y_test[i]]
    if key in results_neigh.keys():                
        if Y_test[i] == nc_predicted[i]: #if correct prediction
            results_neigh[index2cat[Y_test[i]]][0] += 1
        else:
            results_neigh[index2cat[Y_test[i]]][1] += 1
            
for i in range(len(top11_categories)):
    c = results_neigh[top11_categories[i]][0]
    b = results_neigh[top11_categories[i]][1]
    results_neigh[top11_categories[i]][0] = (c/(c+b))*100
    results_neigh[top11_categories[i]][1] = (b/(c+b))*100
"k-NN" data download

download this data


In [86]:
data_nc = pandas.DataFrame()

data_nc['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_neigh[i][0])
    bad.append(results_neigh[i][1])
data_nc['Correct'] = correct
data_nc['Incorrect'] = bad
# save k-NN results to csv
data_nc.to_csv('../data/nb/nc_top11_p.csv', sep=',', index=False)

GAUSSIAN Naive Bayes


In [87]:
clf = GaussianNB()

In [88]:
%timeit -n1 -r3 clf.fit(X_train,Y_train)


1 loops, best of 3: 617 ms per loop

In [89]:
%timeit -n1 -r3 clf_predicted = clf.predict(X_test)
clf_predicted = clf.predict(X_test)


1 loops, best of 3: 1.57 s per loop

In [90]:
clf.score(X_test,Y_test)


Out[90]:
0.36463223787167448

In [91]:
results_clf = {}

for i in list(set(categories)):
    results_clf[i] = [0,0]
    
for i in range(len(Y_test)):
    if Y_test[i] == clf_predicted[i]: #if correct prediction
        results_clf[index2cat[Y_test[i]]][0] += 1
    else:
        results_clf[index2cat[Y_test[i]]][1] += 1
        
for i in range(len(top11_categories)):
    c = results_clf[top11_categories[i]][0]
    b = results_clf[top11_categories[i]][1]
    results_clf[top11_categories[i]][0] = (c/(c+b))*100
    results_clf[top11_categories[i]][1] = (b/(c+b))*100
"Gaussian Naive Bayes" data download

download this data


In [92]:
data_clf = pandas.DataFrame()

data_clf['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_clf[i][0])
    bad.append(results_clf[i][1])
data_clf['Correct'] = correct
data_clf['Incorrect'] = bad
# save Gaussian Naive Bayes results to csv
data_clf.to_csv('../data/nb/clf_top11_p.csv', sep=',', index=False)

DECISION TREE


In [93]:
tr = tree.DecisionTreeClassifier()

In [94]:
%timeit -n1 -r3 tr.fit(X_train,Y_train)


1 loops, best of 3: 1.62 s per loop

In [95]:
%timeit -n1 -r3 tree_predicted = tr.predict(X_test)
tree_predicted = tr.predict(X_test)


1 loops, best of 3: 73.1 ms per loop

In [96]:
tr.score(X_test,Y_test)


Out[96]:
0.66666666666666663

In [97]:
results_tree = {}

for i in list(set(categories)):
    results_tree[i] = [0,0]
    
for i in range(len(Y_test)):
    if Y_test[i] == tree_predicted[i]: #if correct prediction
        results_tree[index2cat[Y_test[i]]][0] += 1
    else:
        results_tree[index2cat[Y_test[i]]][1] += 1
        
for i in range(len(top11_categories)):
    c = results_tree[top11_categories[i]][0]
    b = results_tree[top11_categories[i]][1]
    results_tree[top11_categories[i]][0] = (c/(c+b))*100
    results_tree[top11_categories[i]][1] = (b/(c+b))*100
"decission tree" data download

download this data


In [98]:
data_tr = pandas.DataFrame()

data_tr['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_tree[i][0])
    bad.append(results_tree[i][1])
data_tr['Correct'] = correct
data_tr['Incorrect'] = bad
# save decision tree results to csv
data_tr.to_csv('../data/nb/tr_top11_p.csv', sep=',', index=False)

Cluster Map

Visualization 5 - map

The map of Kabul with all the incidents was used to show the potential benefit of having models that are able to fit training data. That way the reader gets to see machine learning tools in action.


In [99]:
# get the rows that contain incidents in the capital
incidents_kabul = data[data['Region'] == 'RC CAPITAL']

In [100]:
def create_geodata(dataset):
    #get latitudes, longitudes and display colors from dataframe
    latitudes = []
    longitudes = []
    colors = []
    for item in dataset['Longitude'].iteritems():
        if ((math.isnan(item[1]) == False) and (item[1] != 70.20531464)): # to remove one outlier
            longitudes.append(float(item[1]))
            latitudes.append(dataset['Latitude'][item[0]])
            colors.append(dataset['DisplayColor'][item[0]])

    # get data in the format geoplotlib requires. We put the geodata in a dictionary structured as follows
    geo_data = {"lat": latitudes,
                "lon": longitudes}
    
    return geo_data, colors

In [101]:
geodata, colors = create_geodata(incidents_kabul)

In [102]:
def plot_geodata_kabul(geodata, enemy, friendly, neutral):

    for i in range(len(colors)):
        if colors[i] == 'RED':
            enemy['lat'].append(geodata['lat'][i])
            enemy['lon'].append(geodata['lon'][i])
        if colors[i] == 'BLUE':
            friendly['lat'].append(geodata['lat'][i])
            friendly['lon'].append(geodata['lon'][i])
        if colors[i] == 'GREEN':
            neutral['lat'].append(geodata['lat'][i])
            neutral['lon'].append(geodata['lon'][i])
    
    geoplotlib.dot(enemy, color='red', point_size = 2.5)
    geoplotlib.dot(friendly, color='blue', point_size = 2.5)
    geoplotlib.dot(neutral, color='green', point_size = 2.5)
    
    # bounding box on the minima and maxima of the data
    geoplotlib.set_bbox(
        BoundingBox(
            max(geodata['lat']), 
            max(geodata['lon']), 
            min(geodata['lat']), 
            min(geodata['lon'])
        ));

    geoplotlib.inline()

In [103]:
enemy = {}
enemy['lat'] = []
enemy['lon'] = []
friendly = {}
friendly['lat'] = []
friendly['lon'] = []
neutral = {}
neutral['lat'] = []
neutral['lon'] = []

plot_geodata_kabul(geodata, enemy, friendly, neutral)



In [104]:
def plot_clusters(n_clusters_in, d):
    print "K={}".format(n_clusters_in)
    X_train = pandas.DataFrame.from_dict(d)
    # create classifier for estimation
    cluster_mdl = cluster.KMeans(n_clusters=n_clusters_in)
    k_means = cluster_mdl.fit(X_train)

    centroids = {"lat": k_means.cluster_centers_[:,0].tolist(),
                 "lon": k_means.cluster_centers_[:,1].tolist()}
    
    # get cluster labels
    labels = k_means.labels_
    
    # plot the cluster centers of the respective group (enemy/friendly/neutral by DisplayColor)
    geoplotlib.dot(centroids, color ='blue', point_size=10);
    geoplotlib.inline();
    
    # recombine lat+lon in tuple
    centroids = zip(centroids['lat'], centroids['lon'])
    return centroids, labels, d

cen_e, l_e, d_e = plot_clusters(5, enemy)
cen_f, l_f, d_f = plot_clusters(4, friendly)
cen_n, l_n, d_n = plot_clusters(4, neutral)


K=5
K=4
K=4

In [105]:
# try from K = 2, ..., 10
def check_kneepoint(d):
    X_train = pandas.DataFrame.from_dict(d)
    kmean = [cluster.KMeans(n_clusters=k).fit(X_train).inertia_ for k in range(2,10)]
    plt.plot(range(2,10), kmean)
    plt.xlabel('# centroids');
    # sum of squared errors where the graph bends
    plt.ylabel('inertia (minimizing criterion)');

# find the optimal number of clusters (knee point) for each group
check_kneepoint(neutral) # 4
check_kneepoint(friendly) # 4
check_kneepoint(enemy) # 5


"cluster" data download

In [106]:
# concatenate data
df_geo = pandas.DataFrame()
df_geo['lat'] = d_e['lat'] + d_f['lat'] + d_n['lat']
df_geo['lon'] = d_e['lon'] + d_f['lon'] + d_n['lon']
df_geo['label'] = ['enemy']*len(enemy['lat']) + ['friendly']*len(friendly['lat']) + ['neutral']*len(neutral['lat'])
con = np.concatenate((l_e, l_f), axis=0)
df_geo['cluster'] = np.concatenate((con, l_n), axis=0)

centr = {
    'friendly': cen_f,
    'enemy': cen_e,
    'neutral': cen_n
}

# write files
with open('../data/nb/centr.json', 'w') as data_file:    
    json.dump(centr, data_file)
df_geo.to_csv('../data/nb/geodata_kabul.csv', sep=',', index=False)

pprint(centr)
df_geo.tail()


{'enemy': [(34.77685700414634, 69.2127621352439),
           (34.53447809092872, 69.14913472462203),
           (34.62446718365079, 69.76936146579365),
           (34.53366977011321, 69.32291406626415),
           (34.36171674742647, 69.18763418764706)],
 'friendly': [(34.54866221994118, 69.27396343588235),
              (34.57109507930693, 69.70428376138614),
              (34.78434195146341, 69.17174325341463),
              (34.50980557267081, 69.14575593354037)],
 'neutral': [(34.54672341751938, 69.29199337100775),
             (34.762533862955976, 69.13903868937106),
             (34.59424866611111, 69.67290836),
             (34.53745502713396, 69.11695638361371)]}
Out[106]:
lat lon label cluster
3138 34.512527 69.174309 neutral 3
3139 34.729282 69.861694 neutral 2
3140 34.544861 69.249039 neutral 0
3141 34.617100 69.078529 neutral 3
3142 34.516678 69.171379 neutral 3

Discussion

Generally there was no trouble finding enough sources for additional background information about the data, probably mainly because of the extensive press coverage that came along with the classified content. We hope that the insights into the war that we gained can also be conveyed to the reader, which would not be possible with just the raw data.

The uniqueness of the dataset does not leave many options to incorporate other data. We had high ambitions to detect something meaningful, inspired by the article from the lecture about terrorist classification and by an article about a decision tree that classifies whether or not to shoot at approaching vehicles in order to reduce the number of innocent civilians getting killed. In that article the German scientists describe how they extract threat cues such as how many people are in the car and whether it slows down; it is in such high-stress situations that soldiers, trying to protect themselves, make the most mistakes.

However, these approaches would need much more work on extracting proper features from the data before predicting something meaningful, which might be out of scope for the course; the scientists could simply devote more time to this issue. We therefore went with a rather descriptive use of the machine learning models. We also wanted to avoid summarizing too many facts from the datasets and instead tried to experiment more with some JavaScript libraries.

What else could be improved is the LDA visualization, where we were not able to trace the original topic back, which might be useful when it cannot be inferred from the words. Furthermore, the two map visualizations that use leaflet.js have a minor bug: initially the d3 layer is not detected (and shown as enabled in the leaflet control menu) because the DOM is created dynamically after the initialization. Although a few things were tried, we did not get rid of that behavior, so the reader unfortunately needs to click twice.