Afghanistan War Insights

Project Assignment B

Motivation

In 2010 WikiLeaks released a file that gave access to previously classified military documents pertaining to the Afghanistan war. News organizations in particular were interested but faced a big challenge: how, if at all, could they visually depict some of this information for their readers? [1]

The analyzed dataset contains every recorded incident between 2004 and 2009 during the war in Afghanistan. This so-called war diary is a 74 MB CSV file with 76,911 rows and 32 columns. The most interesting columns contain dates, locations, incident reports, category information, the parties involved, and counts of the killed and wounded. In addition, open data from the Afghan elections of 2009 and 2010 gave us the number of violent incidents per voting district, showing where the most dangerous areas for civilians were during the elections.

The true face of war is often hidden from ordinary people in industrialized countries. People in war-troubled Middle Eastern countries suffer every day, but thanks to brave investigative journalists the truth is revealed. Our motivation is to contribute by taking a closer look at the death tolls and the summaries, thus gaining more insight into the war. With the help of machine learning methods we explore how ISAF forces and the civilian population of Afghanistan could benefit from our analysis. Different visualization methods were chosen to convey the mood of this tragic war.

Loading the Data

Load WikiLeaks Afghan War Diary from 2004-2009.


In [1]:
# data structures
import pickle
import cPickle as cp
import json
from pprint import pprint
# Pandas contains useful functions for data structures with "relational" or "labeled" data
import pandas

# math and arrays
from __future__ import division  # __future__ imports belong before the other statements
import math
import numpy as np

# plotting
import matplotlib.pyplot as plt
import geoplotlib
from geoplotlib.utils import BoundingBox

# word processing
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import operator
import itertools
# remove words that appear only once
from collections import defaultdict, Counter

# Visualizations
# Wordclouds
from wordcloud import WordCloud
# LDA
from gensim import corpora, models, similarities

# glossary building
import lxml.html
import bs4, lxml, re, requests

# ML
from sklearn import cluster
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

# header as suggested
# by WikiLeaks: https://wikileaks.org/afg/
# by the Guardian: http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-data
header = [
    'ReportKey', # find messages and also to reference them
    'DateOccurred', 'EventType', 
    'Category', # describes what kind of event the message is about
    'TrackingNumber', 'Title', # internal tracking number and title
    'Summary', # actual description of the event
    'Region', # broader region of the event.
    'AttackOn', # who was attacked during an event
    'ComplexAttack', # signifies that an attack was a larger operation that required more planning, coordination and preparation
    'ReportingUnit', 'UnitName', 'TypeOfUnit', # information on the military unit that authored the report
    'FriendlyWounded', 'FriendlyKilled', 'HostNationWounded', 'HostNationKilled', 'CivilianWounded', 'CivilianKilled', 
    'EnemyWounded', 'EnemyKilled', 'EnemyDetained', # who was killed/wounded/captured
    'MilitaryGridReferenceSystem', 'Latitude', 'Longitude', # location
    'OriginatorGroup', 'UpdatedByGroup', # message originated from or was updated by
    'CommandersCriticalInformationRequirements', 
    'Significant', # are analyzed and evaluated by special group in command centre
    'Affiliation', # event was of friendly, neutral or enemy nature
    'DisplayColor', # enemy activity - RED, friendly activity - BLUE, friend on friend - GREEN
    'ClassificationLevel' # classification level of the message, e.g.: Secret
]

data = pandas.read_csv('../data/csv/afg.csv', header=None, names=header)
# lower-case some columns; see the inconsistencies at: https://wardiaries.wikileaks.org/search/?sort=date
data['Category'] = data['Category'].str.lower()
data['Title'] = data['Title'].str.lower()
data.head()


Out[1]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary Region AttackOn ComplexAttack ... MilitaryGridReferenceSystem Latitude Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel
0 D92871CA-D217-4124-B8FB-89B9A2CFFCB4 2004-01-01 00:00:00 Enemy Action direct fire 2007-033-004042-0756 direct fire other KAF-1BDE -S3 REPORTS: SUMMIT 09 B CO ELEMENT S... RC EAST ENEMY False ... 42SWB3900916257 32.683319 69.416107 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET
1 C592135C-1BFF-4AEC-B469-0A495FDA78D9 2004-01-01 00:00:00 Friendly Action cache found/cleared 2007-033-004738-0185 cache found/cleared other USSF FINDS CACHE IN VILLAGE OF WALU TANGAY: US... RC EAST FRIEND False ... 42SXD7520076792 35.018608 70.920273 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET
2 D50F59F0-6F32-4E63-BC02-DB2B8422DE6E 2004-01-01 00:00:00 Non-Combat Event propaganda 2007-033-010818-0798 propaganda other (M) NIGHT LETTERS DISTRIBUTED AROUND HAZARJUFT... RC SOUTH NEUTRAL False ... 41RPQ1439743120 31.116390 64.199707 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET
3 E3F22EFB-F0CA-4821-9322-CC2250C05C8A 2004-01-01 00:00:00 Enemy Action direct fire 2007-033-004042-0850 direct fire other KAF-1BDE -S3: SUMMIT 6 REPORTS TIC SALUTE TO F... RC EAST ENEMY False ... 42SWB3399911991 32.645000 69.362511 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET
4 4D0E1E60-9535-4D58-A374-74367F058788 2004-01-01 00:00:00 Friendly Action cache found/cleared 2007-033-004738-0279 cache found/cleared other KAF-1BDE -S3 REPORTS: GERONIMO 11 SALUTE AS FO... RC EAST FRIEND False ... 42SWB7580277789 33.236389 69.813606 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET

5 rows × 32 columns

Basic stats

The Pandas library for Python provides data structures such as the DataFrame that are very useful for manipulating and extracting "labeled" data, along with many methods for data cleaning and preprocessing.

In the following cells a small exploratory analysis of the data is provided: the number of rows and columns, the years in which the incidents took place, the number of categories, and the most and least commonly occurring categories are extracted and displayed. Lowercasing all columns we wanted to work with was necessary to unify the data.
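
As a minimal sketch of the kind of label-based selection and cleaning used below (the tiny frame and the threshold are invented; only the column names mirror the real data):

import pandas

# hypothetical mini-frame reusing two of the war diary's column names
df = pandas.DataFrame({'Category': ['Direct Fire', 'IED Explosion', None],
                       'CivilianKilled': [0, 3, 1]})

df['Category'] = df['Category'].str.lower()    # unify casing, as done for the real data
df = df[df['Category'].notnull()]              # drop rows with a missing category
print df['Category'].value_counts()            # frequency of each category
print df[df['CivilianKilled'] > 0].shape[0]    # incidents with at least one civilian killed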


In [2]:
data['DateOccurred'] = pandas.to_datetime(data['DateOccurred'])
data['Year'] = [date.year for date in data['DateOccurred']]
data['Hour'] = [date.hour for date in data['DateOccurred']]

# number of rows/columns
print "Number of rows: %d" % data.shape[0]
print "Number of columns: %d" % data.shape[1]

date_range = set()
for date in data['DateOccurred']:
    date_range.add(date.year)

print "\nYears:\n"
print list(date_range)

# ocurrences of categories
print "\nNumber of unique categories: %d" %len(set(data['Category']))

# distribution of categories
n_occurrences = data['Category'].value_counts()

print "\nMost commonly occurring categories:\n"
print n_occurrences.head()

print "\nMost commonly occurring category is %s with %d" % (n_occurrences.argmax(), n_occurrences.max())
print "\nLeast commonly occurring category is %s with %d" % (n_occurrences.argmin(), n_occurrences.min())


Number of rows: 76911
Number of columns: 34

Years:

[2004, 2005, 2006, 2007, 2008, 2009]

Number of unique categories: 153

Most commonly occurring categories:

direct fire          16293
ied found/cleared     8581
indirect fire         7237
ied explosion         7202
other                 4693
Name: Category, dtype: int64

Most commonly occurring category is direct fire with 16293

Least commonly occurring category is graffiti with 1

In [3]:
# plot distribution of categories (TOP 50)
plt.style.use('ggplot')
%matplotlib inline

n_occurrences_top = n_occurrences[0:50]

# plot histogram
def barplot(series, title, figsize, ylabel, flag, rotation):
    ax = series.plot(kind='bar', 
                title = title,
                figsize = figsize,
                fontsize = 13)
    
    # set ylabel
    ax.set_ylabel(ylabel)
    # set xlabel (depending on the flag that comes as a function parameter)
    ax.get_xaxis().set_visible(flag)
    # set series index as xlabels and rotate them
    ax.set_xticklabels(series.index, rotation= rotation)
    
barplot(n_occurrences_top,'Category occurrences', figsize=(14,6), ylabel = 'category count',flag = True, rotation = 90)



In [4]:
focus_categories = n_occurrences.index[0:8]
print focus_categories


Index([u'direct fire', u'ied found/cleared', u'indirect fire',
       u'ied explosion', u'other', u'medevac', u'unexploded ordnance',
       u'cache found/cleared'],
      dtype='object')

In [5]:
def yearly_category_distribution(data, focus_categories):
    
    index = 1
    for category in focus_categories:
        # filter table by type of category
        db = data[data['Category'] == category]
        # get year counts of that category
        year_counts = db['Year'].value_counts()
        # sort it (from 2004 to 2009)
        year_counts = year_counts.sort_index()
        # plot it
        plt.subplot(7,2,index)
        barplot(year_counts, category, figsize=(20,35), ylabel = 'category count', flag = True, rotation = 0)
        index += 1
        
yearly_category_distribution(data, focus_categories)


Cleaning the Data

All the military language, abbreviations and acronyms in the summaries need to be replaced by their meaning in order to obtain extended, more understandable descriptions. Luckily, the Guardian publicly released their list of acronyms online as a glossary. The page is downloaded and the information extracted using the HTML parsing library BeautifulSoup.

The additional column ExtendedSummary is appended to the data using the built glossary. Furthermore, we normalize the text to avoid potential bias and to remove noise before the in-depth analysis of the summaries used in several visualizations. The approaches used are:

  • Tokenizing: converting a document into its atomic elements; with the regular-expression tokenizer used here, punctuation is dropped, so "don't" becomes the two tokens "don" and "t".
  • Stopping: removing meaningless words; very common words such as "a" and "to" are uninteresting.
  • Stemming: merging words that are equivalent in meaning (e.g. walking, walker, walked).

These approaches use the natural language toolkit nltk.
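
A toy sketch of the three steps on an invented sentence (the WordNetLemmatizer is used as a lightweight stand-in for stemming, as in the rest of the notebook; the sentence is made up):

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

example = "patrols reported walking to the bridges and clearing two caches"

# 1) tokenizing: split the text into word tokens
tokens = RegexpTokenizer(r'\w+').tokenize(example)
# 2) stopping: drop very common English words
stop = set(nltk.corpus.stopwords.words('english'))
tokens = [t for t in tokens if t not in stop]
# 3) stemming/lemmatizing: merge word forms ("bridges" -> "bridge", "caches" -> "cache")
lemmatizer = WordNetLemmatizer()
print [lemmatizer.lemmatize(t) for t in tokens]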


In [6]:
# read url with the glossary of the acronyms
link = 'http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-war-logs-glossary'
glossary = dict()
try:
    response = requests.get(link)
    if not response.ok:
        print 'HTTP error {} trying to fetch Guardian glossary: {}'.format(response.status_code, link)
    else:
        soup = bs4.BeautifulSoup(response.content, 'lxml')
        glossary_table = soup.find('table')
        for row in glossary_table.find_all('tr'):
            cells = row.find_all("td")
            if len(cells) == 2:
                if cells[0].string:
                    key = str(cells[0].string.strip().lower())
                    content = str(cells[1].text.lower())
                    glossary[key] = content
except requests.exceptions.ConnectionError as e:
    print 'Connection error {} on {}'.format(e, link)

In [7]:
# print 10 first keys and values of the dictionary
sorted(glossary.items(), key=operator.itemgetter(1))[:10]


Out[7]:
[('bsn', ' (camp) bastion '),
 ('cgbg 1 coy', ' 1 company, coldstream guards battle group '),
 ('gbu-31', ' 2,000lb "smart bomb\' '),
 ('42 cdo rm', ' 42 commando royal marines '),
 ('gbu-12', ' 500lb laser-guided "smart bomb" '),
 ('508 stb', ' 508th special troops battalion '),
 ('81', ' 81mm mortar round '),
 ('9-liner', ' 9 line medevac request '),
 ('ctc', ' ? counterterrorist center '),
 ('ack', ' acknowledge ')]

In [8]:
# replace each acronym in a summary by its meaning from the glossary, producing an extended description
def matchwords(words):

    text = list()

    for word in words:
        if word in glossary:
            text.append(glossary[word])
        else:
            text.append(word)
            
    return ' '.join(text)

# lowercase summaries
data['Summary'] = data['Summary'].str.lower()
# remove rows with NaNs in summary
data = data[data['Summary'].notnull()]

In [9]:
tokenizer = RegexpTokenizer(r'\w+')

# get indices of the filtered dataframe
indices = [index for index in data.index]

long_summary = []
for row in indices:
    # tokenize words in the summary
    words = tokenizer.tokenize(data['Summary'][row])
    # match acronyms to meanings and generate an extended summary
    long_summary.append(matchwords(words))
    
# create column with extended summary
data.insert(7, 'ExtendedSummary', long_summary)
data.head()


Out[9]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary ExtendedSummary Region AttackOn ... Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel Year Hour
0 D92871CA-D217-4124-B8FB-89B9A2CFFCB4 2004-01-01 Enemy Action direct fire 2007-033-004042-0756 direct fire other kaf-1bde -s3 reports: summit 09 b co element s... kandahar air field 1bde s3 reports summit 09... RC EAST ENEMY ... 69.416107 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2004 0
1 C592135C-1BFF-4AEC-B469-0A495FDA78D9 2004-01-01 Friendly Action cache found/cleared 2007-033-004738-0185 cache found/cleared other ussf finds cache in village of walu tangay: us... ussf finds cache in village of walu tangay uss... RC EAST FRIEND ... 70.920273 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET 2004 0
2 D50F59F0-6F32-4E63-BC02-DB2B8422DE6E 2004-01-01 Non-Combat Event propaganda 2007-033-010818-0798 propaganda other (m) night letters distributed around hazarjuft... m night letters distributed around hazarjuft t... RC SOUTH NEUTRAL ... 64.199707 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 0
3 E3F22EFB-F0CA-4821-9322-CC2250C05C8A 2004-01-01 Enemy Action direct fire 2007-033-004042-0850 direct fire other kaf-1bde -s3: summit 6 reports tic salute to f... kandahar air field 1bde s3 summit 6 reports ... RC EAST ENEMY ... 69.362511 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2004 0
4 4D0E1E60-9535-4D58-A374-74367F058788 2004-01-01 Friendly Action cache found/cleared 2007-033-004738-0279 cache found/cleared other kaf-1bde -s3 reports: geronimo 11 salute as fo... kandahar air field 1bde s3 reports geronimo ... RC EAST FRIEND ... 69.813606 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET 2004 0

5 rows × 35 columns

Theory and Visualizations

Recorded Casualties

Visualization 1 - bar chart

The numbers of people wounded and killed in each group are counted and then plotted in a grouped bar chart. The two criteria, wounded and killed, can be compared by toggling between two buttons. This visualization was chosen to give the reader an overview of the involved groups:

  • Afghan forces
  • ISAF/NATO forces
  • Taliban
  • Civilians

The chart also shows the evolution of the numbers of wounded and dead over the course of the war. Grouping the bars by party and year allows an approximate comparison at a glance. It was also a good way to introduce a delicate subject by showing the shocking fact of people actually dying.

In [10]:
# create new dataframe with the columns of interest
db = pandas.concat([data['Year'],data['HostNationWounded'], data['HostNationKilled'],
              data['EnemyWounded'], data['EnemyKilled'],
              data['CivilianWounded'], data['CivilianKilled'],
              data['FriendlyWounded'], data['FriendlyKilled']]
              , axis=1)
db.head()


Out[10]:
Year HostNationWounded HostNationKilled EnemyWounded EnemyKilled CivilianWounded CivilianKilled FriendlyWounded FriendlyKilled
0 2004 0 0 0 3 0 0 0 0
1 2004 0 0 0 0 0 0 0 0
2 2004 0 0 0 0 0 0 0 0
3 2004 0 0 0 8 0 0 0 0
4 2004 0 0 0 0 0 0 0 0

In [11]:
casualties = {}

for date in date_range:
    # filter dataframe by year
    db_year = db[db['Year'] == date]
    casualties[str(date)]= {}
    for column in db.columns:
        if column != 'Year':
            # sum every column except 'year' and store it in a dictionary with 'years' as keys
            casualties[str(date)][column] = db_year[column].sum()

In [12]:
for key in sorted(casualties.iterkeys()):
    print "%s: %s" % (key, casualties[key])


2004: {'HostNationKilled': 218.0, 'FriendlyWounded': 133.0, 'HostNationWounded': 316.0, 'EnemyWounded': 93.0, 'FriendlyKilled': 22.0, 'EnemyKilled': 343.0, 'CivilianKilled': 219.0, 'CivilianWounded': 208.0}
2005: {'HostNationKilled': 180.0, 'FriendlyWounded': 480.0, 'HostNationWounded': 504.0, 'EnemyWounded': 200.0, 'FriendlyKilled': 71.0, 'EnemyKilled': 890.0, 'CivilianKilled': 178.0, 'CivilianWounded': 454.0}
2006: {'HostNationKilled': 605.0, 'FriendlyWounded': 1083.0, 'HostNationWounded': 1397.0, 'EnemyWounded': 249.0, 'FriendlyKilled': 142.0, 'EnemyKilled': 2687.0, 'CivilianKilled': 800.0, 'CivilianWounded': 1790.0}
2007: {'HostNationKilled': 951.0, 'FriendlyWounded': 1652.0, 'HostNationWounded': 2093.0, 'EnemyWounded': 377.0, 'FriendlyKilled': 186.0, 'EnemyKilled': 4044.0, 'CivilianKilled': 758.0, 'CivilianWounded': 1962.0}
2008: {'HostNationKilled': 699.0, 'FriendlyWounded': 1255.0, 'HostNationWounded': 1749.0, 'EnemyWounded': 285.0, 'FriendlyKilled': 244.0, 'EnemyKilled': 2816.0, 'CivilianKilled': 798.0, 'CivilianWounded': 1980.0}
2009: {'HostNationKilled': 1143.0, 'FriendlyWounded': 2693.0, 'HostNationWounded': 2444.0, 'EnemyWounded': 617.0, 'FriendlyKilled': 481.0, 'EnemyKilled': 4437.0, 'CivilianKilled': 1241.0, 'CivilianWounded': 2646.0}

In [13]:
# extract wounded data and rename columns
Wounded = {}
for date in date_range:
    
    Wounded[str(date)]= {}
    
    Wounded[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationWounded']
    Wounded[str(date)]['Taliban'] = casualties[str(date)]['EnemyWounded']
    Wounded[str(date)]['Civilians'] = casualties[str(date)]['CivilianWounded']
    Wounded[str(date)]['ISAF/NATO forces'] = casualties[str(date)]['FriendlyWounded']

In [14]:
for key in sorted(Wounded.iterkeys()):
    print "%s: %s" % (key, Wounded[key])


2004: {'ISAF/NATO forces': 133.0, 'Afghan forces': 316.0, 'Taliban': 93.0, 'Civilians': 208.0}
2005: {'ISAF/NATO forces': 480.0, 'Afghan forces': 504.0, 'Taliban': 200.0, 'Civilians': 454.0}
2006: {'ISAF/NATO forces': 1083.0, 'Afghan forces': 1397.0, 'Taliban': 249.0, 'Civilians': 1790.0}
2007: {'ISAF/NATO forces': 1652.0, 'Afghan forces': 2093.0, 'Taliban': 377.0, 'Civilians': 1962.0}
2008: {'ISAF/NATO forces': 1255.0, 'Afghan forces': 1749.0, 'Taliban': 285.0, 'Civilians': 1980.0}
2009: {'ISAF/NATO forces': 2693.0, 'Afghan forces': 2444.0, 'Taliban': 617.0, 'Civilians': 2646.0}

In [15]:
# writing wounded data to a dataframe for visualizing it with D3
data_wounded = pandas.DataFrame()

data_wounded['Year'] = [key for key in sorted(Wounded.iterkeys())]
data_wounded['Afghan forces'] = [Wounded[key]['Afghan forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['ISAF/NATO forces'] = [Wounded[key]['ISAF/NATO forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['Taliban'] = [Wounded[key]['Taliban'] for key in sorted(Wounded.iterkeys())]
data_wounded['Civilians'] = [Wounded[key]['Civilians'] for key in sorted(Wounded.iterkeys())]

data_wounded


Out[15]:
Year Afghan forces ISAF/NATO forces Taliban Civilians
0 2004 316 133 93 208
1 2005 504 480 200 454
2 2006 1397 1083 249 1790
3 2007 2093 1652 377 1962
4 2008 1749 1255 285 1980
5 2009 2444 2693 617 2646
"Wounded" data download

download this data


In [16]:
# save wounded data to csv
data_wounded.to_csv('../data/nb/wounded.csv', sep=',', index=False)

In [17]:
# multi bar plot of killed people
plt.style.use('ggplot')
%matplotlib inline

N = 6
ind = np.arange(N)  # the x locations for the groups
width = 0.2     # the width of the bars

Afghan_forces = list(data_wounded['Afghan forces'])
Nato_forces = list(data_wounded['ISAF/NATO forces'])
Taliban = list(data_wounded['Taliban'])
Civilians = list(data_wounded['Civilians'])

fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')

# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Wounded counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))

ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'ISAF/NATO forces', 'Taliban', 'Civilians'))

def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)

plt.show()



In [18]:
# Extract "killed" data and rename columns
Killed = {}
for date in date_range:
    
    Killed[str(date)]= {}
    
    Killed[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationKilled']
    Killed[str(date)]['Taliban'] = casualties[str(date)]['EnemyKilled']
    Killed[str(date)]['Civilians'] = casualties[str(date)]['CivilianKilled']
    Killed[str(date)]['ISAF/NATO forces'] = casualties[str(date)]['FriendlyKilled']

In [19]:
for key in sorted(Killed.iterkeys()):
    print "%s: %s" % (key, Killed[key])


2004: {'ISAF/NATO forces': 22.0, 'Afghan forces': 218.0, 'Taliban': 343.0, 'Civilians': 219.0}
2005: {'ISAF/NATO forces': 71.0, 'Afghan forces': 180.0, 'Taliban': 890.0, 'Civilians': 178.0}
2006: {'ISAF/NATO forces': 142.0, 'Afghan forces': 605.0, 'Taliban': 2687.0, 'Civilians': 800.0}
2007: {'ISAF/NATO forces': 186.0, 'Afghan forces': 951.0, 'Taliban': 4044.0, 'Civilians': 758.0}
2008: {'ISAF/NATO forces': 244.0, 'Afghan forces': 699.0, 'Taliban': 2816.0, 'Civilians': 798.0}
2009: {'ISAF/NATO forces': 481.0, 'Afghan forces': 1143.0, 'Taliban': 4437.0, 'Civilians': 1241.0}

In [20]:
# writing "killed" data to a dataframe for visualizing it with D3
data_killed = pandas.DataFrame()

data_killed['Year'] = [key for key in sorted(Killed.iterkeys())]
data_killed['Afghan forces'] = [Killed[key]['Afghan forces'] for key in sorted(Killed.iterkeys())]
data_killed['ISAF/NATO forces'] = [Killed[key]['ISAF/NATO forces'] for key in sorted(Killed.iterkeys())]
data_killed['Taliban'] = [Killed[key]['Taliban'] for key in sorted(Killed.iterkeys())]
data_killed['Civilians'] = [Killed[key]['Civilians'] for key in sorted(Killed.iterkeys())]

data_killed


Out[20]:
Year Afghan forces ISAF/NATO forces Taliban Civilians
0 2004 218 22 343 219
1 2005 180 71 890 178
2 2006 605 142 2687 800
3 2007 951 186 4044 758
4 2008 699 244 2816 798
5 2009 1143 481 4437 1241
"Killed" data download

download this data


In [21]:
# save "killed" data to csv
data_killed.to_csv('../data/nb/killed.csv', sep=',', index=False)

In [22]:
Afghan_forces = list(data_killed['Afghan forces'])
Nato_forces = list(data_killed['ISAF/NATO forces'])
Taliban = list(data_killed['Taliban'])
Civilians = list(data_killed['Civilians'])

fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')

# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Killed counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))

ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'ISAF/NATO forces', 'Taliban', 'Civilians'))

def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)

plt.show()


Killed Civilians by Location

Visualization 1 - Map

(since bar charts did not count as a D3 visualisation, this is in the same group)

This map builds on top of the bar chart, making clear that this was a war that heavily impacted the local population. Even when showing "just" the incidents in which civilians were killed, the reader is quite overwhelmed at first. It was therefore necessary to make the map zoomable and interactive so that the reader can get a more distinct picture. Having the map in the introduction section is also a good way of showing how the regions are divided between the ISAF forces and which regions exist in Afghanistan at all. The reader can quickly identify that the size of the circles increases with the number of people killed and can also read more about why so many people died, getting a feel for the war and the stories behind it.


In [23]:
# create new dataframe with the columns of interest
db_loc = pandas.concat([data['Latitude'], data['Longitude'], data['CivilianKilled'], data['ExtendedSummary']], axis=1)
# delete no casualties
db_loc = db_loc.drop(db_loc[db_loc['CivilianKilled'] == 0].index)
# exclude not set column
db_loc = db_loc[db_loc.CivilianKilled.notnull() & db_loc.Longitude.notnull() & db_loc.Latitude.notnull()]

db_loc.head()


Out[23]:
Latitude Longitude CivilianKilled ExtendedSummary
16 31.000000 66.400002 50 explosion kills 50 near qaba mosque in spin bo...
111 34.158329 61.851940 1 s rel global counter terrorism forces ngo ro...
192 33.552219 70.045280 2 15 feb two persons were killed and six injured...
221 34.983330 69.667511 5 s rel global counter terrorism forces five a...
256 32.450001 67.416656 1 s rel global counter terrorism forces 1 x ng...
"Civilians Killed" data download

download this data


In [24]:
# write df as JSON, bring into appropriate format
obj = {'objects': list()}
for index, row in db_loc.iterrows():
    obj['objects'].append({'circle': {
                'coordinates': [row.Latitude, row.Longitude], 
                'death': int(row.CivilianKilled),
                'desc': row.ExtendedSummary
            }})

with open('../data/nb/civ_deaths.json', 'w') as data_file:    
    json.dump(obj, data_file)
    
pprint(obj['objects'][:5])


[{'circle': {'coordinates': [31.0, 66.40000153],
             'death': 50,
             'desc': 'explosion kills 50 near qaba mosque in spin boldak'}},
 {'circle': {'coordinates': [34.15832901, 61.85194016],
             'death': 1,
             'desc': 's rel  global counter terrorism forces  ngo robbed and killed'}},
 {'circle': {'coordinates': [33.55221939, 70.04528046],
             'death': 2,
             'desc': '15 feb two persons were killed and six injured in a powerful bomb blast near a police station in ali sher district in khost province southeastern afghanistan the blast occurred hours after unidentified gunmen fired dozens of rockets at a base of us led coalition forces in the province'}},
 {'circle': {'coordinates': [34.98332977, 69.66751099],
             'death': 5,
             'desc': 's rel  global counter terrorism forces  five afghan nationals working for the ngo swedish development foundation were killed 25 feb during an ambush of thier vehicle 31 miles north of kabul in kapisa province'}},
 {'circle': {'coordinates': [32.45000076, 67.41665649],
             'death': 1,
             'desc': 's rel  global counter terrorism forces  1 x ngo contractor killed 1 x taken hostage'}}]

WordClouds

Visualization 2

In this section a simple weighting scheme called tf-idf (term frequency-inverse document frequency) is used to find important words within each category. Once found, they are visualized in a word cloud. Since there are 153 categories, computing a word cloud for each of them is not feasible, so only some categories are picked. Data cleaning is also a crucial part of obtaining an acceptable result. The methods are explained in the introduction and in the respective code cells.

It seemed worthwhile to go a bit deeper into specific topics and, instead of only summarizing the data, gain something from it through further analysis. Word (or tag) clouds are in general a good way of presenting a lot of text as a weighted list of important words. Readers who want to inform themselves about more specific subjects can use these words when looking for more insights. While doing the analysis, our group could also investigate certain subjects we had been unaware of before.


In [25]:
# indices of categories that wordclouds are computed of
#for i, x in enumerate(n_occurrences.index):
#    print i, x
indices = [35,50,58,63,66,68,74,93,95]

# get categories names
categ = [n_occurrences.index[idx] for idx in indices]

# filter the dataset. Only keep those rows with the categories selected
db_wc_filt = data[data['Category'].isin(categ)]

In [26]:
db_wc_filt.head()


Out[26]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary ExtendedSummary Region AttackOn ... Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel Year Hour
184 4A9E37A8-367A-4B73-84C4-CB4C1F5F2C47 2004-02-14 00:00:00 Enemy Action assassination 2007-033-004004-0670 direct fire ana other 1 ana kia (s//rel gctf) tb attack home and assasinate an... s rel global counter terrorism forces talib... RC SOUTH ENEMY ... 67.416656 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2004 0
449 DE3A0314-FE7D-469D-B105-8592E91FEB86 2004-05-03 00:00:00 Non-Combat Event demonstration 2007-033-011133-0745 demonstration other (s//rel gctf) an estimated 100 or more support... s rel global counter terrorism forces an est... RC EAST NEUTRAL ... 68.440002 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 0
692 846645EA-37EB-44DE-BD21-61A6C0D6BB26 2004-06-11 12:43:00 Non-Combat Event demonstration 2007-033-011133-0635 demonstration anp other 15anp wia (s//rel gctf) (delayed report) tf victory repo... s rel global counter terrorism forces delaye... RC NORTH NEUTRAL ... 65.853333 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 12
1115 7A367320-2666-4379-86F1-CB455637E603 2004-08-15 10:00:00 Non-Combat Event demonstration 2007-033-011133-0854 demonstration other (fouo) at 1000hrs there was a demonstration, o... fouo at 1000hrs there was a demonstration of a... RC NORTH NEUTRAL ... 68.639160 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 10
1654 29DFB7ED-6A00-4740-8806-5E91950D39E3 2004-11-04 05:00:00 Non-Combat Event demonstration 2007-033-011133-0932 demonstration other cjsotf reports pak civ protest 4k ne lwara. at... combined joint special operations task force ... RC EAST NEUTRAL ... 69.494164 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2004 5

5 rows × 35 columns

Computing the tf-idf weights for each document in the corpus requires a series of steps (a small toy example follows the list):

  • Tokenize the corpus
  • Model the vector space
  • Compute the tf-idf weight for each document in the corpus
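
With the definitions used in the cells below, the weight of a word w in document d is tf(w, d) * idf(w), where tf(w, d) is the count of w in d divided by the number of tokens in d, and idf(w) = log(N / (1 + n_w)) with N the number of documents and n_w the number of documents containing w. A hand-computed toy example (the three two-word documents are invented):

import math

# three tiny invented "documents" (token lists)
docs = [['attack', 'kabul'], ['attack', 'patrol'], ['patrol', 'bridge']]

def tf(word, doc):
    return doc.count(word) / float(len(doc))

def idf(word, docs):
    n_containing = 1 + sum(1 for d in docs if word in d)   # the same +1 smoothing as below
    return math.log(len(docs) / float(n_containing))

# 'attack' occurs in 2 of 3 documents, so its idf is 0; 'bridge' occurs once and gets a positive weight
for word in ['attack', 'bridge']:
    print word, [round(tf(word, d) * idf(word, docs), 3) for d in docs]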

In [27]:
# uncomment this to download stopwords locally
#nltk.download()

We start by building one corpus document per category, joining all of its extended summaries.


In [28]:
branch_corpus = {}
for cat in categ:
    d = db_wc_filt[db_wc_filt['Category'] == cat]
    #for each category, join all text from the extended summary and save it in a dictionary
    branch_corpus[cat] = ' '.join(d['ExtendedSummary'])

In [29]:
# load english stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
# create a WordNetLemmatizer for stemming the tokens
wordnet_lemmatizer = WordNetLemmatizer() 

# compute number of times a word appears in a document
def freq(word, doc):
    return doc.count(word)

# get number of words in the document
def word_count(doc):
    return len(doc)

# compute TF
def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

# compute number of documents containing a particular word
def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count


# compute IDF
def idf(word, list_of_docs):
    return math.log(len(list_of_docs) /
            float(num_docs_containing(word, list_of_docs)))

# compute TF-IDF
def tf_idf(word, doc, list_of_docs):
    return (tf(word, doc) * idf(word, list_of_docs))


vocabulary = list()
docs = dict()

for key,text in branch_corpus.iteritems():
        # tokenize text for a particular category
        tokens = nltk.word_tokenize(text)
        # lower case the words
        tokens = [token.lower() for token in tokens]
        # only keep words, disregard digits
        tokens = [token for token in tokens if token.isalpha()]
        # disregard stopwords and words that are less than 2 letter in length
        tokens = [token for token in tokens if token not in stopwords and len(token) > 2]
        final_tokens = []
        final_tokens.extend(tokens)
        
        docs[key] = {'tf': {}, 'idf': {},
                        'tf-idf': {}, 'tokens': []}
        # compute TF 
        for token in final_tokens:
            # The term-frequency (Normalized Frequency)
            docs[key]['tf'][token] = tf(token, final_tokens)
            docs[key]['tokens'] = final_tokens
        vocabulary.append(final_tokens)

# compute IDF and TF-IDF
for doc in docs:
    for token in docs[doc]['tf']:
        # the Inverse-Document-Frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
        # the tf-idf
        docs[doc]['tf-idf'][token] = tf_idf(token, docs[doc]['tokens'], vocabulary)

In [30]:
# get the 10 words with highest TF-IDF score for category 'arrest'
sorted_dic = sorted(docs['arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]


Out[30]:
[('arrested', 0.0052880539302087),
 ('arrest', 0.004259688112565812),
 ('seized', 0.004001770541779557),
 ('detained', 0.0035221618975453026),
 ('suspicious', 0.0031809885021704776),
 ('detention', 0.0026508237518087312),
 ('sayeed', 0.0026508237518087312),
 ('subjects', 0.0026508237518087312),
 ('prosecution', 0.0026508237518087312),
 ('mines', 0.0023234662432177152)]

In order to create the word clouds, the tf-idf scores are rounded to the nearest integer. All words are then combined into one long string, separated by spaces, with each word repeated according to its rounded tf-idf score.

Note: we not only round the scores but also scale them by a factor of 1000 beforehand, since the raw tf-idf values are very small.
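
As a side note, repeating words is only necessary because generate() expects raw text; the scaled scores could alternatively be passed to the word cloud directly via generate_from_frequencies(), roughly as sketched below (the exact input type it accepts, a dict or a list of (word, weight) tuples, depends on the installed wordcloud version):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# hypothetical scores; in the notebook these would come from docs[cat]['tf-idf']
scores = {'arrested': 5, 'seized': 4, 'arrest': 4, 'detained': 4, 'detention': 3}

wordcloud = WordCloud().generate_from_frequencies(scores)  # older wordcloud versions expect scores.items()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()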


In [31]:
for cat in categ:
    for tup in docs[cat]['tf-idf'].items():
        #scale each tf-idf value by a factor of 1000 and round it
        docs[cat]['tf-idf'][tup[0]] = int(round(tup[1]*1000))

In [32]:
#see how they have been scaled and rounded
sorted_dic = sorted(docs['arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]


Out[32]:
[('arrested', 5),
 ('seized', 4),
 ('arrest', 4),
 ('detained', 4),
 ('detention', 3),
 ('sayeed', 3),
 ('suspicious', 3),
 ('subjects', 3),
 ('prosecution', 3),
 ('juma', 2)]

In [33]:
# generate text for wordclouds
doc_wordcloud = {}

for cat in categ:
    string = []
    for tup in docs[cat]['tf-idf'].items():
        if tup[1] > 0:
            #repeat each word to its scaled and rounded tf-idf score
            string.extend(np.repeat(tup[0],tup[1]))
            
    doc_wordcloud[cat] = ' '.join(string)

In [34]:
# generate wordclouds
for cat in categ:
    print "#### %s ####" % cat
    wordcloud = WordCloud().generate(doc_wordcloud[cat])
    img=plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


#### demonstration ####
#### murder ####
#### green-green ####
#### natural disaster ####
#### arrest ####
#### assassination ####
#### carjacking ####
#### downed aircraft ####
#### refugees ####

In [35]:
# save wordclouds to files so that they are readable in JS for visualizing with D3
def savefile(category):
    filename = '../data/nb/' + category
    filename = filename.replace(" ", "_")
    with open(filename, "w") as text_file:
        text_file.write(doc_wordcloud[category])
    #print '* [download {0} data](../data/csv/v2/{0})'.format(category)

for cat in categ:
    savefile(cat)

LDA

Visualization 3

In this section Latent Dirichlet Allocation (LDA) is used, a topic model which generates topics based on word frequency from a set of documents.

Each extended summary in our data is considered a document. Data cleaning needs to be done on all documents, since it is crucial for generating a useful topic model; the methods are similar to the processing for the word clouds. After cleaning, the LDA model is generated. To do so, the frequency of each term in each document is needed, so a document-term matrix (bag of words) is constructed with the gensim module. Once the bag of words is created, the LDA model can be trained.
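
To make the bag-of-words step concrete, here is a toy sketch with two invented, already-cleaned token lists; doc2bow maps every document to (token id, count) pairs, which is the corpus format that LdaModel consumes below:

from gensim import corpora

texts = [['police', 'checkpoint', 'attack'], ['police', 'protest', 'protest']]

dictionary = corpora.Dictionary(texts)                 # token -> integer id mapping
corpus = [dictionary.doc2bow(text) for text in texts]
print corpus   # one list of (token id, count) pairs per document; 'protest' gets count 2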

LDA is another method of analyzing the text that assigns meaningful weights to the words of each topic. Since the weighting scheme differs from tf-idf, the extracted result can be used for another visualization approach, giving the reader an impression of importance through circle diameters. It is a good way of showing what typical war topics were mainly about with just a few words.


In [36]:
# will be used for text cleaning
stopwords = set(nltk.corpus.stopwords.words('english')) # load english stopwords
stoplist = set('for a of the and to in'.split())
manual_list=set('none nan get http https'.split())
wordnet_lemmatizer = WordNetLemmatizer() # create a WordNetLemmatizer to use for stemming the summaries

In [37]:
# take some interesting categories
documents = [summary for summary in db_wc_filt['ExtendedSummary']]

For cleaning the text:

  • Lemmatize words (reduce them to their root form)
  • Only keep words and disregard digits
  • Make them lowercase so that there is no difference between "Then" and "then", for instance
  • Remove stopwords, i.e. common English words that carry no meaning
  • Remove words of two letters or fewer, since they usually do not provide any meaning

In [38]:
clean_dict={}
i=0;
for document in documents:
    try:
        cleaned_html = lxml.html.fromstring(str(document)).text_content()
    except:
        cleaned_html=""
    cleaned_html = cleaned_html.encode('ascii', 'ignore')   # drop non-ascii characters
    temp_token_list = nltk.word_tokenize(str(cleaned_html))
    content = [wordnet_lemmatizer.lemmatize(w.lower()) for w in temp_token_list if 
               ( w.isalpha() and w.lower() not in stopwords and w.lower() not in stoplist \
                and w.lower() not in manual_list and len(w)>2)]
    
    # save the cleaned tokens
    clean_dict[i]=content
    i+=1

In [39]:
# words that appear once are not needed
frequency = defaultdict(int)
for text in clean_dict.values():
    for token in text:
        frequency[token] += 1

for key in clean_dict:
    clean_dict[key] = [token for token in clean_dict[key] if frequency[token] > 1]

In [40]:
dictionary = corpora.Dictionary(clean_dict.values())
corpus = [dictionary.doc2bow(text) for text in clean_dict.values()]

In [41]:
# Compute LDA. Set the number of topics to 7 and train with 20 passes over the corpus.
lda = models.ldamodel.LdaModel(corpus, num_topics=7, id2word=dictionary, passes=20)

In [42]:
lda.print_topics(num_topics=7, num_words=20)


Out[42]:
[(0,
  u'0.070*national + 0.064*afghan + 0.042*police + 0.027*force + 0.018*army + 0.014*local + 0.014*reported + 0.013*report + 0.013*security + 0.011*task + 0.010*killed + 0.007*shot + 0.007*action + 0.007*wounded + 0.006*assistance + 0.006*event + 0.006*officer + 0.006*base + 0.006*district + 0.006*soldier'),
 (1,
  u'0.019*road + 0.011*village + 0.010*flood + 0.009*local + 0.008*amp + 0.007*damage + 0.007*team + 0.007*force + 0.007*river + 0.007*bridge + 0.006*provincial + 0.006*elder + 0.006*reconstruction + 0.005*valley + 0.005*vehicle + 0.005*work + 0.005*destroyed + 0.005*area + 0.005*also + 0.005*one'),
 (2,
  u'0.018*driver + 0.013*truck + 0.012*district + 0.011*two + 0.009*base + 0.009*information + 0.007*afghan + 0.007*afghanistan + 0.007*taliban + 0.007*hijacked + 0.007*cargo + 0.007*khan + 0.007*incident + 0.007*three + 0.007*time + 0.006*one + 0.006*air + 0.006*operation + 0.006*operating + 0.006*province'),
 (3,
  u'0.026*force + 0.020*provincial + 0.018*reconstruction + 0.017*team + 0.015*amp + 0.012*reported + 0.011*task + 0.010*assistance + 0.008*afghanistan + 0.008*report + 0.008*security + 0.007*area + 0.006*international + 0.006*demonstration + 0.006*kabul + 0.006*air + 0.006*apos + 0.006*support + 0.005*route + 0.005*response'),
 (4,
  u'0.031*demonstration + 0.017*reported + 0.017*update + 0.013*report + 0.012*force + 0.012*event + 0.011*officer + 0.010*local + 0.009*afghan + 0.009*closed + 0.008*kcp + 0.008*national + 0.008*peaceful + 0.007*people + 0.006*security + 0.006*protest + 0.005*time + 0.005*null + 0.005*anp + 0.005*police'),
 (5,
  u'0.021*afghan + 0.018*police + 0.016*national + 0.013*protest + 0.013*force + 0.009*group + 0.009*area + 0.008*local + 0.008*demonstration + 0.007*land + 0.006*report + 0.006*reported + 0.006*task + 0.006*border + 0.006*point + 0.005*two + 0.005*elder + 0.005*one + 0.005*district + 0.005*coalition'),
 (6,
  u'0.026*amp + 0.020*force + 0.011*quot + 0.011*apos + 0.009*valley + 0.008*conducted + 0.008*site + 0.008*said + 0.007*resident + 0.007*operation + 0.006*element + 0.006*base + 0.005*team + 0.005*weapon + 0.005*police + 0.005*crash + 0.005*government + 0.005*abdul + 0.005*task + 0.005*mosque')]

In [43]:
topics = [lda.show_topic(i) for i in range(7)]

In [44]:
# write to json formatted file for D3
LDA_dicts = {'name': 'LDA',
            'children': [{
            'name': 'Topics',
            'children': []
        }]};

for index, topic in enumerate(topics):
    name = 'Topic {}'.format(index+1)
    dic_out={'name': name , 'children':[]}
    children = []
    for topic in topics[index]:
        dic = {'name':[], 'size':[]}
        dic['name'] = topic[0]
        dic['size'] = topic[1]
        children.append(dic)
    dic_out['children'] = children
    LDA_dicts['children'][0]['children'].append(dic_out)
"LDA circle packing" data download

download this data


In [45]:
# dump dict to JSON file
with open('../data/nb/LDA_topics.json', 'w') as fp:
    json.dump(LDA_dicts, fp)

In [46]:
LDA_dicts


Out[46]:
{'children': [{'children': [{'children': [{'name': u'national',
       'size': 0.069868870639559294},
      {'name': u'afghan', 'size': 0.063605654662812441},
      {'name': u'police', 'size': 0.041889127640513675},
      {'name': u'force', 'size': 0.027268690439316807},
      {'name': u'army', 'size': 0.017952194480238451},
      {'name': u'local', 'size': 0.014141723900489964},
      {'name': u'reported', 'size': 0.013789218636012908},
      {'name': u'report', 'size': 0.013473817777303871},
      {'name': u'security', 'size': 0.012525788957855492},
      {'name': u'task', 'size': 0.011019235892127307}],
     'name': 'Topic 1'},
    {'children': [{'name': u'road', 'size': 0.01911937397857745},
      {'name': u'village', 'size': 0.011285768647962036},
      {'name': u'flood', 'size': 0.010340603144697334},
      {'name': u'local', 'size': 0.0087371411364523185},
      {'name': u'amp', 'size': 0.0076720521153163411},
      {'name': u'damage', 'size': 0.0072668211177573461},
      {'name': u'team', 'size': 0.0069400595175719906},
      {'name': u'force', 'size': 0.0069070461539689548},
      {'name': u'river', 'size': 0.0068917178572562993},
      {'name': u'bridge', 'size': 0.0066828835641202754}],
     'name': 'Topic 2'},
    {'children': [{'name': u'driver', 'size': 0.018028862455535682},
      {'name': u'truck', 'size': 0.013344417713139527},
      {'name': u'district', 'size': 0.012191255905178288},
      {'name': u'two', 'size': 0.01098479914340657},
      {'name': u'base', 'size': 0.0093694238818415911},
      {'name': u'information', 'size': 0.0086251028073260783},
      {'name': u'afghan', 'size': 0.007456443965182225},
      {'name': u'afghanistan', 'size': 0.007356108090654772},
      {'name': u'taliban', 'size': 0.0073113164244047431},
      {'name': u'hijacked', 'size': 0.0072857216307342992}],
     'name': 'Topic 3'},
    {'children': [{'name': u'force', 'size': 0.025827890342217719},
      {'name': u'provincial', 'size': 0.019963778585861253},
      {'name': u'reconstruction', 'size': 0.017668355470018268},
      {'name': u'team', 'size': 0.01745542437059127},
      {'name': u'amp', 'size': 0.014561566303860126},
      {'name': u'reported', 'size': 0.012305291359687715},
      {'name': u'task', 'size': 0.010623446252894462},
      {'name': u'assistance', 'size': 0.010470182838195006},
      {'name': u'afghanistan', 'size': 0.0083447359449635803},
      {'name': u'report', 'size': 0.0082570203788887506}],
     'name': 'Topic 4'},
    {'children': [{'name': u'demonstration', 'size': 0.030826337833415141},
      {'name': u'reported', 'size': 0.017211479695076255},
      {'name': u'update', 'size': 0.017206349205865897},
      {'name': u'report', 'size': 0.013011402610355294},
      {'name': u'force', 'size': 0.012360031814064377},
      {'name': u'event', 'size': 0.011752149042644398},
      {'name': u'officer', 'size': 0.011293210230836494},
      {'name': u'local', 'size': 0.0098379900126823894},
      {'name': u'afghan', 'size': 0.0093851963989767789},
      {'name': u'closed', 'size': 0.0093230420307748255}],
     'name': 'Topic 5'},
    {'children': [{'name': u'afghan', 'size': 0.021177817945350157},
      {'name': u'police', 'size': 0.017703863957756721},
      {'name': u'national', 'size': 0.016220688916579712},
      {'name': u'protest', 'size': 0.012860815867027213},
      {'name': u'force', 'size': 0.012542270611238845},
      {'name': u'group', 'size': 0.009375250976456511},
      {'name': u'area', 'size': 0.0085342180510033418},
      {'name': u'local', 'size': 0.0083285940465644154},
      {'name': u'demonstration', 'size': 0.0082427413272877833},
      {'name': u'land', 'size': 0.0071096160800931405}],
     'name': 'Topic 6'},
    {'children': [{'name': u'amp', 'size': 0.026443711738154279},
      {'name': u'force', 'size': 0.020242095335857697},
      {'name': u'quot', 'size': 0.011098598318549492},
      {'name': u'apos', 'size': 0.010607676577690895},
      {'name': u'valley', 'size': 0.0091425060413576759},
      {'name': u'conducted', 'size': 0.0081088822691027154},
      {'name': u'site', 'size': 0.0078841978241618073},
      {'name': u'said', 'size': 0.0077528701470336433},
      {'name': u'resident', 'size': 0.0071489449803598913},
      {'name': u'operation', 'size': 0.0066612433166570537}],
     'name': 'Topic 7'}],
   'name': 'Topics'}],
 'name': 'LDA'}

KNN - Election and War incidents

Visualization 4

For this visualization, the areas with the most security incidents during the 2009 election are plotted, with grid points for four very common war incident types laid over them. To make a proper and meaningful prediction we decided to use the unbalanced dataset, keeping in mind that accuracy will drop for the smaller classes. On the other hand, with an unbalanced set we stay closer to the ground truth: the map shows which category types are most likely to be experienced, and the ISAF forces can derive preventive measures for specific regions that take the density of incidents into account. The distributions of the top incidents in 2009 are, in any case, not spread too far:

  • ied explosions: 42%
  • direct fire: 21%
  • escalation of force: 8%
  • indirect fire: 6%

Since the reader is by now more familiar with the overall data, this map contains more information, showing the most probable type of attack in each region during the election period and where the hotspots of incidents are. It fits the storyline that even fundamental things like elections are a serious problem and far from a normal everyday situation in a war-troubled country.
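
Before trusting such a map, a quick sanity check is useful: compare the k-NN accuracy on a held-out split against the trivial majority-class baseline, which an unbalanced set makes easy to confuse with real skill. The sketch below uses random, made-up coordinates and labels instead of the real incidents, so only the procedure carries over:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer scikit-learn
from sklearn.metrics import accuracy_score

np.random.seed(0)
# made-up lat/lon pairs roughly inside Afghanistan and an unbalanced set of labels
X = np.random.rand(200, 2) * [11, 17] + [28, 59]
y = np.random.choice(['ied explosion'] * 4 + ['direct fire'] * 2 + ['indirect fire'], 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print "k-NN accuracy:           ", accuracy_score(y_test, knn.predict(X_test))
print "majority-class baseline: ", np.mean(y_test == 'ied explosion')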


In [47]:
data_el = pandas.read_csv('../data/csv/elections_security_incidents.csv')
# clean rows without useful numbers
data_el = data_el.dropna()
data_el.head()


Out[47]:
Longitude Latitude Province Name District Name August 2010 May - August 2010 2010 2009 and 2010 Election Day 2009 Election Day 2010
0 70.82 37.27 Badakhshan Arghanj Khwa 0 1 2 5 0 0
1 70.43 37.06 Badakhshan Argo 2 5 8 19 0 0
2 70.90 37.03 Badakhshan Baharak 0 6 7 8 0 0
3 70.47 36.87 Badakhshan Darayem 1 7 7 8 0 2
5 71.15 38.13 Badakhshan Darwaz-e-Balla 0 0 0 2 0 0

In [48]:
print "election provinces total: ", len(data_el)
print "total incidents: ", int(data_el['2009 and 2010'].sum())


election provinces total:  389
total incidents:  21849

In [49]:
# incidents year 2009
data09 = data[data['Year'] == 2009]
# incidents injured or killed civilians
data09 = data09.drop(data09[(data09['CivilianWounded'] == 0) & (data09['CivilianKilled'] == 0)].index)
# exclude unset columns
data09 = data09[data09.CivilianKilled.notnull() & data09.CivilianWounded.notnull()]

# Distribution of categories top 4
c_occurrences = data09['Category'].value_counts()
c_occurrences[0:4]


Out[49]:
ied explosion          433
direct fire            214
escalation of force     80
indirect fire           63
Name: Category, dtype: int64

In [50]:
# note: DataFrame.drop(idx) returns a new frame, so this loop does not actually remove
# anything from data09; the four focus categories are extracted as separate subsets in
# the next cell, and the percentages below are therefore relative to all 2009 incidents
# with civilian casualties
for idx, row in data09.iterrows():
    if row['Category'] not in c_occurrences.index[0:4]:
        data09.drop(idx)

In [51]:
# split data to individual sets
def get_subset(dataset, category):
    df = dataset[dataset['Category'] == category]

    return df

# subset for categories
df_IED = get_subset(data09, 'ied explosion')
df_DIRECT = get_subset(data09, 'direct fire')
df_FORCE = get_subset(data09, 'escalation of force')
df_INDIRECT = get_subset(data09, 'indirect fire')

In [52]:
# see balance of data
ied = df_IED.shape[0]
direct = df_DIRECT.shape[0]
force = df_FORCE.shape[0]
indirect = df_INDIRECT.shape[0]

total_amount = len(data09)
print "total incidents with civilian casualties: ", total_amount

print "Percentage of ied explosions: {}%".format(ied/(total_amount/100))
print "Percentage of direct fire: {}%".format(direct/(total_amount/100))
print "Percentage of escalation of force: {}%".format(force/(total_amount/100))
print "Percentage of indirect fire: {}%".format(indirect/(total_amount/100))


total incidents with civilian casualties:  1041
Percentage of ied explosions: 41.5946205572%
Percentage of direct fire: 20.5571565802%
Percentage of escalation of force: 7.68491834774%
Percentage of indirect fire: 6.05187319885%

Some geodata is plotted to see the distribution. First, a function is needed that extracts the geo-information from each dataset, as shown in the cell below.


In [53]:
def get_all_geodata(dataset):

    # filter bad rows
    dataset = dataset[dataset.Longitude.notnull() & dataset.Latitude.notnull()]
    
    # only activity in Afghanistan
    include = (dataset.Latitude < 39) & (dataset.Latitude > 28) & (dataset.Longitude > 59) & (dataset.Longitude < 76)
    # get data in the format geoplotlib requires. We put the geodata in a dictionary structured as follows
    geo_data = {
        "lat": dataset.loc[include].Latitude.tolist(), 
        "lon": dataset.loc[include].Longitude.tolist()
    }
    return geo_data

# create the dictionary with lat and lon
geo_data_IED = get_all_geodata(df_IED)
geo_data_DIRECT = get_all_geodata(df_DIRECT)
geo_data_FORCE = get_all_geodata(df_FORCE)
geo_data_INDIRECT = get_all_geodata(df_INDIRECT)
geo_data_ELECTION = get_all_geodata(data_el)

All latitude and longitude information is passed to geoplotlib's kde() function, which computes a kernel density estimation, also known as a heatmap. The BoundingBox is needed to fit the projection as closely as possible to Afghanistan; its approximate boundaries come from the minima and maxima of the data points after the outliers have been removed in get_all_geodata(). The maps are rendered with inline() to make them visible in the notebook.
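
Conceptually, the heatmap is a smoothed two-dimensional density of the incident coordinates; a rough, unsmoothed analogue can be computed with numpy alone (the handful of coordinates below is made up, standing in for the geo_data dictionaries):

import numpy as np

# invented incident coordinates
lat = [34.5, 34.6, 31.6, 31.5, 34.5]
lon = [69.2, 69.1, 65.7, 65.8, 69.2]

# a coarse 2D histogram is the unsmoothed counterpart of geoplotlib's kde() heatmap
counts, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=5)
print counts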


In [54]:
# plot given coordinate input
def geo_plot(geodata):

    # bounding box on the minima and maxima of the data
    geoplotlib.set_bbox(
        BoundingBox(
            max(geodata['lat']), 
            max(geodata['lon']), 
            min(geodata['lat']), 
            min(geodata['lon'])
        ));
    
    # kernel density estimation visualization
    geoplotlib.kde(geodata, bw=5, cut_below=1e-3, cmap='hot', alpha=170)
    # google tiles with lyrs=y ... hybrid
    geoplotlib.tiles_provider({
        'url': lambda zoom, xtile, ytile: 'https://mt1.google.com/vt/lyrs=y&hl=en&x=%d&y=%d&z=%d' % (xtile, ytile, zoom ),
        'tiles_dir': 'DTU-social_data',
        'attribution': 'DTU 02806 Social Data Analysis and Visualization'
    })
    
    geoplotlib.inline();

In [55]:
print 'ied explosion'
geo_plot(geo_data_IED)
print 'escalation of force'
geo_plot(geo_data_FORCE)
print 'indirect fire'
geo_plot(geo_data_INDIRECT)
print 'direct fire'
geo_plot(geo_data_DIRECT)
print 'election'
geo_plot(geo_data_ELECTION)


ied explosion
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.13252371406381036)
escalation of force
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.034221570399321423)
indirect fire
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.029033994220663158)
direct fire
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.087317618275628936)
election
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.013160187040692951)

A $100 \times 100$ grid is laid over the span of geo coordinates of the voting areas. To build the grid, we first set the boundaries at the minimum and maximum latitude/longitude and then use numpy's arange() function to create evenly spaced values within this interval (the points of the grid). geoplotlib.dot() expects a dictionary containing the latitude and longitude of the grid points. The grid and geographical data can later be reused for other plots.
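
An equivalent way of building the same grid with numpy's meshgrid instead of itertools.product (the bounds below are illustrative placeholders; in the cell that follows they come from geo_data_ELECTION). Note that linspace includes both end points, whereas the arange call below stops just short of the maximum, a negligible difference for a plotting grid:

import numpy as np
import pandas

lat_min, lat_max = 29.0, 38.5   # illustrative bounds
lon_min, lon_max = 60.5, 74.9

lat_vals = np.linspace(lat_min, lat_max, 100)
lon_vals = np.linspace(lon_min, lon_max, 100)
lat_grid, lon_grid = np.meshgrid(lat_vals, lon_vals)

grid = pandas.DataFrame({'lat': lat_grid.ravel(), 'lon': lon_grid.ravel()})
print grid.shape   # (10000, 2)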


In [56]:
sampledData = pandas.concat([df_IED,df_FORCE,df_INDIRECT,df_DIRECT])

def plot_knn(K):
    # create lists with range from min to max of sampled data. Makes 100 steps by dividing difference by 100
    gridXVals = pandas.Series(
        np.arange(
            min(geo_data_ELECTION['lat']),
            max(geo_data_ELECTION['lat']), 
            abs((min(geo_data_ELECTION['lat']) - max(geo_data_ELECTION['lat'])) / 100.0))
    )
    gridYVals = pandas.Series(
        np.arange(
            min(geo_data_ELECTION['lon']), 
            max(geo_data_ELECTION['lon']), 
            abs((max(geo_data_ELECTION['lon']) - min(geo_data_ELECTION['lon'])) / 100.0))
    )
    gridYVals.sort_values(ascending=False, inplace=True)

    # convert product of the two lists to grid values
    product = list(itertools.product(gridXVals, gridYVals))
    # convert product tuples into dataframe
    grid = pandas.DataFrame({'lat': map(lambda coord: coord[0], product), 'lon': map(lambda coord: coord[1], product)})

    # X maxtrix from coordinates
    X = sampledData[['Latitude', 'Longitude']]
    # Y is the category of these - The label to train from
    Y = sampledData["Category"]

    neigh = KNeighborsClassifier()
    neigh.n_neighbors = K
    #Train
    neigh.fit(X,Y)
    
    # Predict
    p_string = 'Prediction_k{}'.format(K)
    grid[p_string] = pandas.Series(neigh.predict(grid))
    geoplotlib.set_bbox(BoundingBox(max(grid['lat']), max(grid['lon']), min(grid['lat']),min(grid['lon'])))
    geoplotlib.tiles_provider('toner-lite')

    # Plot dots
    print p_string
    geoplotlib.dot(grid[grid[p_string] == 'indirect fire'], color=[0,255,0,255], point_size=2)
    geoplotlib.dot(grid[grid[p_string] == 'escalation of force'], color=[255,0,0,255], point_size=2)
    geoplotlib.dot(grid[grid[p_string] == 'ied explosion'], color=[0,0,255,255], point_size=2)
    geoplotlib.dot(grid[grid[p_string] == 'direct fire'], color=[255,255,0,255], point_size=2)
    geoplotlib.inline()
    return grid

In [57]:
grid = plot_knn(5)
grid = grid.append(plot_knn(10), ignore_index=True)
grid = grid.append(plot_knn(20), ignore_index=True)


Prediction_k5
Prediction_k10
Prediction_k20

Since an unbalanced data set is used to reflect the ground truth of the most common attacks, the predictions are skewed towards IED explosions and direct fire. Especially when K is set to 20 nearest neighbours, these two categories dominate.
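
To make the imbalance explicit, the class counts of the sampled training data can be inspected directly; a minimal sketch using the sampledData frame defined above:

# minimal sketch: inspect how unbalanced the k-NN training data is
# (sampledData is the concatenation of df_IED, df_FORCE, df_INDIRECT and df_DIRECT above)
print(sampledData['Category'].value_counts())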


In [58]:
# to json file
with open('../data/nb/knn.json', 'w') as outfile:
    json.dump(json.loads(grid.to_json(orient='records')), outfile)
grid.head()


Out[58]:
Prediction_k10 Prediction_k20 Prediction_k5 lat lon
0 NaN NaN ied explosion 29.88 73.228
1 NaN NaN ied explosion 29.88 73.106
2 NaN NaN ied explosion 29.88 72.984
3 NaN NaN ied explosion 29.88 72.862
4 NaN NaN ied explosion 29.88 72.740
"k-NN prediction" data download

download this data


In [59]:
# extract election data
df_el = pandas.concat([data_el['Longitude'],data_el['Latitude'], data_el['Province Name'],
              data_el['District Name'], data_el['2009 and 2010']], axis=1)
df_el = df_el.rename(columns = {
        '2009 and 2010':'total', 
        'District Name':'district', 
        'Province Name': 'province', 
        'Latitude': 'lat', 
        'Longitude': 'lon'
    }                
)
df_el.head()


Out[59]:
lon lat province district total
0 70.82 37.27 Badakhshan Arghanj Khwa 5
1 70.43 37.06 Badakhshan Argo 19
2 70.90 37.03 Badakhshan Baharak 8
3 70.47 36.87 Badakhshan Darayem 8
5 71.15 38.13 Badakhshan Darwaz-e-Balla 2
"election summary" data download

download this data


In [60]:
df_el.to_csv('../data/nb/election.csv', sep=',', encoding='utf-8', index=False)

Classifier Analysis

Visualization 5 - bar chart

We think it would be interesting to have a proper classifier capable of predicting a war category just by analyzing the content of its summary. This is done for all incidents in Afghanistan's capital, Kabul. To do so, useless data again has to be removed from the summaries and the acronyms substituted with the corresponding words. Afterwards a matrix is created with the incidents in its rows and every unique word in its columns; with this matrix it is simple to represent the words that every summary contains. Four classifiers were used in this project in order to see the differences between them.
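
The incident-by-word matrix is assembled by hand in the cells below; for comparison, scikit-learn's CountVectorizer would build an equivalent bag-of-words matrix in a few lines. A minimal sketch, assuming the cleaned token lists (tokens_per_summary_kabul, created below) are joined back into strings:

# minimal sketch: an alternative bag-of-words matrix built with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# join the cleaned tokens of every Kabul summary back into one string per incident
docs = [' '.join(tokens_per_summary_kabul[k]) for k in sorted(tokens_per_summary_kabul)]
vectorizer = CountVectorizer(min_df=4)          # keep words that appear in at least 4 summaries
bow_matrix = vectorizer.fit_transform(docs)     # sparse matrix: incidents x unique words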

To train the models, all words from the "ExtendedSummary" were used as the only feature; more features might be added to increase the score for predicting the "Category", but the necessary normalization would have been time consuming, and we tend to believe that the score is quite accurate given the many different categories. To keep the bar chart readable, only the 11 categories with the most incidents were picked; with more categories it would make sense to flip the x- and y-axis of the bar chart.
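
Should more summary-based features be added, the word counts could be normalized cheaply with scikit-learn's TF-IDF transformer instead of by hand; a minimal sketch, reusing the hypothetical bow_matrix from the CountVectorizer sketch above:

# minimal sketch: TF-IDF normalization of the word-count features
# (bow_matrix is the sparse incidents x words matrix from the sketch above)
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
normalized_features = tfidf.fit_transform(bow_matrix)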

This problem description might only attract scientific readers; nevertheless we created an animated map that uses clustering to show the potential benefit of having models that are able to fit the training data. By evaluating the knee point we made sure to use the optimal number of clusters for our visualization, since there is no advantage in adding more means when the visualization does not become clearer with more information.
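
A programmatic variant of this knee-point evaluation is sketched below: compute the k-means inertia over a range of K and pick the K at the sharpest bend of the curve (largest second difference). This is only a heuristic sketch; the notebook itself reads the knee off the inertia plots further down.

# minimal sketch: pick the knee of the k-means inertia curve automatically
# (X is a DataFrame or array of lat/lon points, e.g. the 'enemy' incidents used further down)
def knee_point(X, k_range=range(2, 10)):
    inertias = [cluster.KMeans(n_clusters=k).fit(X).inertia_ for k in k_range]
    # second differences are largest where the curve bends most sharply
    bends = np.diff(inertias, n=2)
    return list(k_range)[int(np.argmax(bends)) + 1]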

The best classification result is achieved by the random forest method, an extension of the decision tree, which itself also performs well. The random forest is more complex, but the added complexity helps to prevent the individual trees from overfitting: whereas a single data point runs through the branches of one tree, a forest fits many decision trees together, each built from a different subset of the training data. Covariates are also chosen randomly between the different trees, which adds diversity. In our case overfitting is less of a concern anyway, because the given training data has a low entropy (not too many different outcomes). The classification with k-NN also achieves a good score, since it is, along with decision trees, suitable for categorical (and numerical) data. Decision trees have the advantage of handling heterogeneous data better than k-NN, which relies on a distance function to classify elements. k-NN also needs a considerable amount of time to make a prediction, which might become a problem on bigger datasets. Finally, the Gaussian Naive Bayes classifier would only be worth considering if execution time were a major concern, since it scales well with more features and predictors. Choosing the right classifier depends on where the priorities lie, but given the problem description the most accurate score is probably the most important criterion.
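
The comparison can be reproduced compactly by fitting all four classifiers on the same split and printing their test accuracy with accuracy_score (already imported above); a minimal sketch, assuming the X_train/X_test/Y_train/Y_test split created further down:

# minimal sketch: one loop to compare the four classifiers on the same train/test split
classifiers = {
    'random forest': RandomForestClassifier(n_estimators=50),
    'decision tree': tree.DecisionTreeClassifier(),
    'k-NN (k=3)': KNeighborsClassifier(n_neighbors=3),
    'Gaussian Naive Bayes': GaussianNB(),
}
for name, model in classifiers.items():
    model.fit(X_train, Y_train)
    print('{}: {:.3f}'.format(name, accuracy_score(Y_test, model.predict(X_test))))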


In [61]:
# get the rows that contain incidents in the capital, Kabul
crimes_Kabul = data[data['Region'] == 'RC CAPITAL']
crimes_Kabul.index = range(3191)
summaries = crimes_Kabul['Summary']
categories = crimes_Kabul['Category']

In [62]:
def hasNumbers(word):
    return any(i.isdigit() for i in word)

In [63]:
# remove punctuation
tokenizer = RegexpTokenizer(r'\w+')

incidence = 0

# Dictionary that will contain all the tokens per summary.
tokens_per_summary_kabul = {}


for s in summaries:
    # get all the tokens of the summary S, which still contains the acronyms
    tokens_with_acronyms = tokenizer.tokenize(str(s))
    
    # replace acronyms
    tokens_no_acronyms = []
    for t in tokens_with_acronyms:
        t = t.lower()
        if t in glossary:
            tokens_no_acronyms.append(glossary[t].lower())
        else:
            tokens_no_acronyms.append(t)
            
    # now remove the stopwords, single characters and tokens containing digits
    tokens = []
    for w in tokens_no_acronyms:
        if w not in nltk.corpus.stopwords.words('english') and len(w) > 1 and (hasNumbers(w) == False):
            tokens.append(w)
            
    # store cleaned data in dictionary
    tokens_per_summary_kabul[incidence] = tokens
    incidence = incidence+1

In [64]:
# use pickle to store the dictionaries of the summaries, in order to avoid running the cleaning each time.
pickle.dump(tokens_per_summary_kabul, open("../data/nb/tokens_per_summary_kabul.p","wb"))

In [65]:
# load the data stored using pickle
tokens_per_summary_kabul = pickle.load(open( "../data/nb/tokens_per_summary_kabul.p","rb"))

Create a list of unique words


In [66]:
tokens_per_summary_sets = {} # each key contains a list with unique words of the summary
big_list = [] # all the unique lists concatenated; a word thus appears once for every summary it occurs in

for key in tokens_per_summary_kabul:
    tokens_per_summary_sets[key] = list(set(tokens_per_summary_kabul[key]))
    big_list += tokens_per_summary_sets[key]

counter = Counter(big_list)

# create a set with the unique words that appear in more than 3 summaries.
unique_words = set()
for c in counter:
    if counter[c] > 3:
        unique_words.add(c)

In [67]:
# create the dictionary that maps each unique word with an index
unique_words2 = list(unique_words)
d = {unique_words2[i].lower() : i for i in xrange(len(unique_words2))}

Create a matrix with incident summaries in rows and unique words in columns, and set each value as the number of occurrences of a specific word in a specific incident summary.


In [68]:
# remove punctuation
tokenizer = RegexpTokenizer(r'\w+')

feature_vectors_kabul = [[0 for x in range(len(list(unique_words)))] for x in range(3191)]

# create the feature vectors
row = 0
for s in summaries:
    # get all the tokens of the summary S, which still contains the acronyms
    tokens_with_acronyms = tokenizer.tokenize(str(s))
    
    # replace acronyms
    tokens_no_acronyms = []
    for t in tokens_with_acronyms:
        t = t.lower()
        if t in glossary:
            tokens_no_acronyms.append(glossary[t].lower())
        else:
            tokens_no_acronyms.append(t)
            
    # remove stopwords
    for w in tokens_no_acronyms:
        if (w in d):
            col = d[w]
            feature_vectors_kabul[row][col] += 1
    row += 1

In [69]:
cp.dump(feature_vectors_kabul, open("../data/nb/feature_vectors_kabul.p","wb"))
feature_vectors_kabul = cp.load(open("../data/nb/feature_vectors_kabul.p","rb"))

Assign an index to each category.


In [70]:
unique_categories = list(set(crimes_Kabul['Category']))

In [71]:
cat2index = {}
index2cat = {}

for i in range (len(unique_categories)):
    cat2index[unique_categories[i]] = i
    
for i in range(len(unique_categories)):
    index2cat[i] = unique_categories[i]

In [72]:
dic_categories = {}

# for each incident, look up its category and store the corresponding index
for i in range(len(categories)):
    dic_categories[i] = cat2index[categories[i]]

In [73]:
top11_categories = ['escalation of force','ied explosion','demonstration','indirect fire','meeting','accident','direct fire',
                    'medevac','ied found/cleared','cache found/cleared','meeting - security']

RANDOM FOREST


In [74]:
X_train, X_test, Y_train, Y_test = train_test_split(feature_vectors_kabul, dic_categories.values(), test_size=0.2)

In [75]:
forest = RandomForestClassifier(n_estimators = 50)

In [76]:
%timeit -n1 -r3 forest.fit(X_train,Y_train)


1 loops, best of 3: 2.31 s per loop

In [77]:
%timeit -n1 -r3 rf_predicted = forest.predict(X_test)
rf_predicted = forest.predict(X_test)


1 loops, best of 3: 144 ms per loop

In [78]:
forest.score(X_test,Y_test)


Out[78]:
0.75743348982785608

In [79]:
# count correct and incorrect predictions for each of the top 11 categories
results_rf = {}

for i in top11_categories:
    results_rf[i] = [0,0]
    
for i in range(len(Y_test)):
    key = index2cat[Y_test[i]]
    if key in results_rf.keys():                
        if Y_test[i] == rf_predicted[i]: # if correct prediction
            results_rf[index2cat[Y_test[i]]][0] += 1
        else:
            results_rf[index2cat[Y_test[i]]][1] += 1
            
for i in range(len(top11_categories)):
    c = results_rf[top11_categories[i]][0]
    b = results_rf[top11_categories[i]][1]
    results_rf[top11_categories[i]][0] = (c/(c+b))*100
    results_rf[top11_categories[i]][1] = (b/(c+b))*100
"random forest" data download

download this data


In [80]:
# save the data
data_rf = pandas.DataFrame()

data_rf['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_rf[i][0])
    bad.append(results_rf[i][1])
data_rf['Correct'] = correct
data_rf['Incorrect'] = bad
# save random forest results to csv
data_rf.to_csv('../data/nb/rf_top11_p.csv', sep=',', index=False)

K-NEIGHBORS


In [81]:
neigh = KNeighborsClassifier(n_neighbors=3)

In [82]:
%timeit -n1 -r3 neigh.fit(X_train, Y_train)


1 loops, best of 3: 841 ms per loop

In [83]:
%timeit -n1 -r3 nc_predicted = neigh.predict(X_test)
nc_predicted = neigh.predict(X_test)


1 loops, best of 3: 11.6 s per loop

In [84]:
neigh.score(X_test,Y_test)


Out[84]:
0.59624413145539901

In [85]:
results_neigh = {}

for i in top11_categories:
    results_neigh[i] = [0,0]
    
for i in range(len(Y_test)):
    key = index2cat[Y_test[i]]
    if key in results_neigh.keys():                
        if Y_test[i] == nc_predicted[i]: #if correct prediction
            results_neigh[index2cat[Y_test[i]]][0] += 1
        else:
            results_neigh[index2cat[Y_test[i]]][1] += 1
            
for i in range(len(top11_categories)):
    c = results_neigh[top11_categories[i]][0]
    b = results_neigh[top11_categories[i]][1]
    results_neigh[top11_categories[i]][0] = (c/(c+b))*100
    results_neigh[top11_categories[i]][1] = (b/(c+b))*100
"k-NN" data download

download this data


In [86]:
data_nc = pandas.DataFrame()

data_nc['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_neigh[i][0])
    bad.append(results_neigh[i][1])
data_nc['Correct'] = correct
data_nc['Incorrect'] = bad
# save k-NN results to csv
data_nc.to_csv('../data/nb/nc_top11_p.csv', sep=',', index=False)

GAUSSIAN Naive Bayes


In [87]:
clf = GaussianNB()

In [88]:
%timeit -n1 -r3 clf.fit(X_train,Y_train)


1 loops, best of 3: 617 ms per loop

In [89]:
%timeit -n1 -r3 clf_predicted = clf.predict(X_test)
clf_predicted = clf.predict(X_test)


1 loops, best of 3: 1.57 s per loop

In [90]:
clf.score(X_test,Y_test)


Out[90]:
0.36463223787167448

In [91]:
results_clf = {}

for i in list(set(categories)):
    results_clf[i] = [0,0]
    
for i in range(len(Y_test)):
    if Y_test[i] == clf_predicted[i]: #if correct prediction
        results_clf[index2cat[Y_test[i]]][0] += 1
    else:
        results_clf[index2cat[Y_test[i]]][1] += 1
        
for i in range(len(top11_categories)):
    c = results_clf[top11_categories[i]][0]
    b = results_clf[top11_categories[i]][1]
    results_clf[top11_categories[i]][0] = (c/(c+b))*100
    results_clf[top11_categories[i]][1] = (b/(c+b))*100
"Gaussian Naive Bayes" data download

download this data


In [92]:
data_clf = pandas.DataFrame()

data_clf['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_clf[i][0])
    bad.append(results_clf[i][1])
data_clf['Correct'] = correct
data_clf['Incorrect'] = bad
# save Gaussian Naive Bayes results to csv
data_clf.to_csv('../data/nb/clf_top11_p.csv', sep=',', index=False)

DECISION TREE


In [93]:
tr = tree.DecisionTreeClassifier()

In [94]:
%timeit -n1 -r3 tr.fit(X_train,Y_train)


1 loops, best of 3: 1.62 s per loop

In [95]:
%timeit -n1 -r3 tree_predicted = tr.predict(X_test)
tree_predicted = tr.predict(X_test)


1 loops, best of 3: 73.1 ms per loop

In [96]:
tr.score(X_test,Y_test)


Out[96]:
0.66666666666666663

In [97]:
results_tree = {}

for i in list(set(categories)):
    results_tree[i] = [0,0]
    
for i in range(len(Y_test)):
    if Y_test[i] == tree_predicted[i]: #if correct prediction
        results_tree[index2cat[Y_test[i]]][0] += 1
    else:
        results_tree[index2cat[Y_test[i]]][1] += 1
        
for i in range(len(top11_categories)):
    c = results_tree[top11_categories[i]][0]
    b = results_tree[top11_categories[i]][1]
    results_tree[top11_categories[i]][0] = (c/(c+b))*100
    results_tree[top11_categories[i]][1] = (b/(c+b))*100
"decission tree" data download

download this data


In [98]:
data_tr = pandas.DataFrame()

data_tr['Cat'] = top11_categories
correct = []
bad = []
for i in top11_categories:
    correct.append(results_tree[i][0])
    bad.append(results_tree[i][1])
data_tr['Correct'] = correct
data_tr['Incorrect'] = bad
# save decision tree results to csv
data_tr.to_csv('../data/nb/tr_top11_p.csv', sep=',', index=False)

Cluster Map

Visualization 5 - map

The map of Kabul with all the incidents was used to show the potential benefit of having models that are able to fit training data. That way the reader gets to see machine learning tools in action.


In [99]:
# get the rows that contain incidents in the capital
incidents_kabul = data[data['Region'] == 'RC CAPITAL']

In [100]:
def create_geodata(dataset):
    #get latitudes, longitudes and display colors from dataframe
    latitudes = []
    longitudes = []
    colors = []
    for item in dataset['Longitude'].iteritems():
        if ((math.isnan(item[1]) == False) and (item[1] != 70.20531464)): # to remove one outlier
            longitudes.append(float(item[1]))
            latitudes.append(dataset['Latitude'][item[0]])
            colors.append(dataset['DisplayColor'][item[0]])

    # get data in the format geoplotlib requires. We put the geodata in a dictionary structured as follows
    geo_data = {"lat": latitudes,
                "lon": longitudes}
    
    return geo_data, colors

In [101]:
geodata, colors = create_geodata(incidents_kabul)

In [102]:
def plot_geodata_kabul(geodata, enemy, friendly, neutral):

    for i in range(len(colors)):
        if colors[i] == 'RED':
            enemy['lat'].append(geodata['lat'][i])
            enemy['lon'].append(geodata['lon'][i])
        if colors[i] == 'BLUE':
            friendly['lat'].append(geodata['lat'][i])
            friendly['lon'].append(geodata['lon'][i])
        if colors[i] == 'GREEN':
            neutral['lat'].append(geodata['lat'][i])
            neutral['lon'].append(geodata['lon'][i])
    
    geoplotlib.dot(enemy, color='red', point_size = 2.5)
    geoplotlib.dot(friendly, color='blue', point_size = 2.5)
    geoplotlib.dot(neutral, color='green', point_size = 2.5)
    
    # bounding box on the minima and maxima of the data
    geoplotlib.set_bbox(
        BoundingBox(
            max(geodata['lat']), 
            max(geodata['lon']), 
            min(geodata['lat']), 
            min(geodata['lon'])
        ));

    geoplotlib.inline()

In [103]:
enemy = {}
enemy['lat'] = []
enemy['lon'] = []
friendly = {}
friendly['lat'] = []
friendly['lon'] = []
neutral = {}
neutral['lat'] = []
neutral['lon'] = []

plot_geodata_kabul(geodata, enemy, friendly, neutral)



In [104]:
def plot_clusters(n_clusters_in, d):
    print "K={}".format(n_clusters_in)
    X_train = pandas.DataFrame.from_dict(d)
    # create classifier for estimation
    cluster_mdl = cluster.KMeans(n_clusters=n_clusters_in)
    k_means = cluster_mdl.fit(X_train)

    centroids = {"lat": k_means.cluster_centers_[:,0].tolist(),
                 "lon": k_means.cluster_centers_[:,1].tolist()}
    
    # get cluster labels
    labels = k_means.labels_
    
    # plot the cluster centers of the respective group (enemy/friendly/neutral by DisplayColor)
    geoplotlib.dot(centroids, color ='blue', point_size=10);
    geoplotlib.inline();
    
    # recombine lat+lon in tuple
    centroids = zip(centroids['lat'], centroids['lon'])
    return centroids, labels, d

cen_e, l_e, d_e = plot_clusters(5, enemy)
cen_f, l_f, d_f = plot_clusters(4, friendly)
cen_n, l_n, d_n = plot_clusters(4, neutral)


K=5
K=4
K=4

In [105]:
# try from K = 2, ..., 10
def check_kneepoint(d):
    X_train = pandas.DataFrame.from_dict(d)
    kmean = [cluster.KMeans(n_clusters=k).fit(X_train).inertia_ for k in range(2,10)]
    plt.plot(range(2,10), kmean)
    plt.xlabel('# centroids');
    # sum of squared errors where the graph bends
    plt.ylabel('inertia (minimizing criterion)');

# find the optimal number of clusters (knee point) for each group
check_kneepoint(neutral) # 4
check_kneepoint(friendly) # 4
check_kneepoint(enemy) # 5


"cluster" data download

In [106]:
# concatenate data
df_geo = pandas.DataFrame()
df_geo['lat'] = d_e['lat'] + d_f['lat'] + d_n['lat']
df_geo['lon'] = d_e['lon'] + d_f['lon'] + d_n['lon']
df_geo['label'] = ['enemy']*len(enemy['lat']) + ['friendly']*len(friendly['lat']) + ['neutral']*len(neutral['lat'])
con = np.concatenate((l_e, l_f), axis=0)
df_geo['cluster'] = np.concatenate((con, l_n), axis=0)

centr = {
    'friendly': cen_f,
    'enemy': cen_e,
    'neutral': cen_n
}

# write files
with open('../data/nb/centr.json', 'w') as data_file:    
    json.dump(centr, data_file)
df_geo.to_csv('../data/nb/geodata_kabul.csv', sep=',', index=False)

pprint(centr)
df_geo.tail()


{'enemy': [(34.77685700414634, 69.2127621352439),
           (34.53447809092872, 69.14913472462203),
           (34.62446718365079, 69.76936146579365),
           (34.53366977011321, 69.32291406626415),
           (34.36171674742647, 69.18763418764706)],
 'friendly': [(34.54866221994118, 69.27396343588235),
              (34.57109507930693, 69.70428376138614),
              (34.78434195146341, 69.17174325341463),
              (34.50980557267081, 69.14575593354037)],
 'neutral': [(34.54672341751938, 69.29199337100775),
             (34.762533862955976, 69.13903868937106),
             (34.59424866611111, 69.67290836),
             (34.53745502713396, 69.11695638361371)]}
Out[106]:
lat lon label cluster
3138 34.512527 69.174309 neutral 3
3139 34.729282 69.861694 neutral 2
3140 34.544861 69.249039 neutral 0
3141 34.617100 69.078529 neutral 3
3142 34.516678 69.171379 neutral 3

Discussion

Generally there was no trouble finding enough sources for additional background information about the data, probably mainly because of the extensive press coverage that came along with the classified content. We hope that the insights into the war that we gained can also be conveyed to the reader, which would not be possible with just the raw data.

The uniqueness of the dataset does not leave many options to incorporate other data. We had high ambitions to detect something meaningful, inspired by the article from the lecture about terrorist classification and by an article about a decision tree that classifies whether or not to shoot at approaching vehicles in order to reduce the number of innocent civilians getting killed. In that article the German scientists describe how they extract threat cues such as how many people are in the car and whether it slows down; it is in such high-stress situations that soldiers, trying to protect themselves, make the most mistakes.

However, these approaches would need much more work on extracting proper features from the data before predicting something meaningful, which might be out of scope for the course; the scientists could simply devote more time to this issue. We therefore went with a rather descriptive use of the machine learning models. We also wanted to avoid summarizing too many facts from the datasets and instead tried to experiment more with some JavaScript libraries.

What else could be improved is the LDA visualization, where we were not able to trace the original topic back, which might be useful when it cannot be inferred from the words. Furthermore, the two map visualizations that use leaflet.js have a minor bug: initially the d3 layer is not detected (and shown as enabled in the leaflet control menu) because the DOM is created dynamically after the initialization. Although a few things were tried, we did not get rid of that behavior, so the reader unfortunately needs to click twice.