Project Assignment B

Loading and Cleaning the Data

Load the WikiLeaks Afghan War Diary, covering 2004 to 2009.


In [1]:
# Pandas contains useful functions for data structures with "relational" or "labeled" data
import pandas

# header as suggested
# by WikiLeaks: https://wikileaks.org/afg/
# by the Guardian: http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-data
header = [
    'ReportKey', # unique key used to find and reference messages
    'DateOccurred', 'EventType', 
    'Category', # describes what kind of event the message is about
    'TrackingNumber', 'Title', # internal tracking number and title
    'Summary', # actual description of the event
    'Region', # broader region of the event.
    'AttackOn', # who was attacked during an event
    'ComplexAttack', # signifies that an attack was a larger operation that required more planning, coordination and preparation
    'ReportingUnit', 'UnitName', 'TypeOfUnit', # information on the military unit that authored the report
    'FriendlyWounded', 'FriendlyKilled', 'HostNationWounded', 'HostNationKilled', 'CivilianWounded', 'CivilianKilled', 
    'EnemyWounded', 'EnemyKilled', 'EnemyDetained', # who was killed/wounded/captured
    'MilitaryGridReferenceSystem', 'Latitude', 'Longitude', # location
    'OriginatorGroup', 'UpdatedByGroup', # group the message originated from or was updated by
    'CommandersCriticalInformationRequirements', 
    'Significant', # significant events are analyzed and evaluated by a special group in the command centre
    'Affiliation', # whether the event was of friendly, neutral or enemy nature
    'DisplayColor', # enemy activity - RED, friendly activity - BLUE, friend on friend - GREEN
    'ClassificationLevel' # classification level of the message, e.g.: Secret
]

data = pandas.read_csv('../../../data/afg.csv', header=None, names=header)
data.head()


Out[1]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary Region AttackOn ComplexAttack ... MilitaryGridReferenceSystem Latitude Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel
0 D92871CA-D217-4124-B8FB-89B9A2CFFCB4 2004-01-01 00:00:00 Enemy Action Direct Fire 2007-033-004042-0756 DIRECT FIRE Other KAF-1BDE -S3 REPORTS: SUMMIT 09 B CO ELEMENT S... RC EAST ENEMY False ... 42SWB3900916257 32.683319 69.416107 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET
1 C592135C-1BFF-4AEC-B469-0A495FDA78D9 2004-01-01 00:00:00 Friendly Action Cache Found/Cleared 2007-033-004738-0185 CACHE FOUND/CLEARED Other USSF FINDS CACHE IN VILLAGE OF WALU TANGAY: US... RC EAST FRIEND False ... 42SXD7520076792 35.018608 70.920273 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET
2 D50F59F0-6F32-4E63-BC02-DB2B8422DE6E 2004-01-01 00:00:00 Non-Combat Event Propaganda 2007-033-010818-0798 PROPAGANDA Other (M) NIGHT LETTERS DISTRIBUTED AROUND HAZARJUFT... RC SOUTH NEUTRAL False ... 41RPQ1439743120 31.116390 64.199707 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET
3 E3F22EFB-F0CA-4821-9322-CC2250C05C8A 2004-01-01 00:00:00 Enemy Action Direct Fire 2007-033-004042-0850 DIRECT FIRE Other KAF-1BDE -S3: SUMMIT 6 REPORTS TIC SALUTE TO F... RC EAST ENEMY False ... 42SWB3399911991 32.645000 69.362511 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET
4 4D0E1E60-9535-4D58-A374-74367F058788 2004-01-01 00:00:00 Friendly Action Cache Found/Cleared 2007-033-004738-0279 CACHE FOUND/CLEARED Other KAF-1BDE -S3 REPORTS: GERONIMO 11 SALUTE AS FO... RC EAST FRIEND False ... 42SWB7580277789 33.236389 69.813606 UNKNOWN UNKNOWN NaN NaN FRIEND BLUE SECRET

5 rows × 32 columns

We start with a small exploratory analysis of the data: we extract the number of rows and columns, the years in which the incidents took place, the number of categories, and the most and least commonly occurring categories.


In [2]:
#data['Year'] = [int(date.split("-")[0]) for date in data['DateOccurred']]
data['DateOccurred'] = pandas.to_datetime(data['DateOccurred'])

data['Year'] = [date.year for date in data['DateOccurred']]
data['Hour'] = [date.hour for date in data['DateOccurred']]

#Number of rows/columns
print "Number of rows: %d" % data.shape[0]
print "Number of columns: %d" % data.shape[1]

date_range = set()
for date in data['DateOccurred']:
    date_range.add(date.year)

print "\nYears:\n"
print list(date_range)

#Occurrences of categories
print "\nNumber of unique categories: %d" % len(set(data['Category']))

#Distribution of categories
n_occurrences = data['Category'].value_counts()

print "\nMost commonly occurring categories:\n"
print n_occurrences.head()

print "\nMost commonly occurring category is %s with %d" % (n_occurrences.idxmax(), n_occurrences.max())
print "\nLeast commonly occurring category is %s with %d" % (n_occurrences.idxmin(), n_occurrences.min())


Number of rows: 76911
Number of columns: 34

Years:

[2004, 2005, 2006, 2007, 2008, 2009]

Number of unique categories: 168

Most commonly occurring categories:

Direct Fire          16293
IED Found/Cleared     8581
Indirect Fire         7237
IED Explosion         7202
Other                 4005
dtype: int64

Most commonly occurring category is Direct Fire with 16293

Least commonly occurring category is Security Breach with 1

Casualties recorded by the Afghanistan database

Four groups:

* Afghan forces
* Nato forces
* Taliban
* Civilians

We count the number of people wounded and killed in each group per year and plot the results in a grouped bar chart; the D3 visualization lets us toggle between the wounded and killed data.


In [3]:
#create a new dataframe with the columns of interest

db = data[['Year',
           'HostNationWounded', 'HostNationKilled',
           'EnemyWounded', 'EnemyKilled',
           'CivilianWounded', 'CivilianKilled',
           'FriendlyWounded', 'FriendlyKilled']]
db.head()


Out[3]:
Year HostNationWounded HostNationKilled EnemyWounded EnemyKilled CivilianWounded CivilianKilled FriendlyWounded FriendlyKilled
0 2004 0 0 0 3 0 0 0 0
1 2004 0 0 0 0 0 0 0 0
2 2004 0 0 0 0 0 0 0 0
3 2004 0 0 0 8 0 0 0 0
4 2004 0 0 0 0 0 0 0 0

In [4]:
casualties = {}

for date in date_range:
    #filter dataframe by year
    db_year = db[db['Year'] == date]
    casualties[str(date)]= {}
    for column in db.columns:
        if column != 'Year':
            #sum every column except 'year' and store it in a dictionary with 'years' as keys
            casualties[str(date)][column] = db_year[column].sum()
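
As a side note, the same yearly totals could also be obtained directly with a pandas groupby instead of the explicit loop above (a minimal sketch, not used in the rest of the notebook):

# hypothetical alternative: sum every casualty column per year in one call
yearly_totals = db.groupby('Year').sum()
print yearly_totals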

In [5]:
for key in sorted(casualties.iterkeys()):
    print "%s: %s" % (key, casualties[key])


2004: {'HostNationKilled': 218.0, 'FriendlyWounded': 133.0, 'HostNationWounded': 316.0, 'EnemyWounded': 93.0, 'FriendlyKilled': 22.0, 'EnemyKilled': 343.0, 'CivilianKilled': 219.0, 'CivilianWounded': 208.0}
2005: {'HostNationKilled': 180.0, 'FriendlyWounded': 480.0, 'HostNationWounded': 504.0, 'EnemyWounded': 200.0, 'FriendlyKilled': 71.0, 'EnemyKilled': 890.0, 'CivilianKilled': 178.0, 'CivilianWounded': 454.0}
2006: {'HostNationKilled': 605.0, 'FriendlyWounded': 1083.0, 'HostNationWounded': 1397.0, 'EnemyWounded': 252.0, 'FriendlyKilled': 142.0, 'EnemyKilled': 2689.0, 'CivilianKilled': 800.0, 'CivilianWounded': 1794.0}
2007: {'HostNationKilled': 951.0, 'FriendlyWounded': 1652.0, 'HostNationWounded': 2093.0, 'EnemyWounded': 377.0, 'FriendlyKilled': 186.0, 'EnemyKilled': 4044.0, 'CivilianKilled': 758.0, 'CivilianWounded': 1962.0}
2008: {'HostNationKilled': 699.0, 'FriendlyWounded': 1255.0, 'HostNationWounded': 1749.0, 'EnemyWounded': 285.0, 'FriendlyKilled': 244.0, 'EnemyKilled': 2816.0, 'CivilianKilled': 798.0, 'CivilianWounded': 1980.0}
2009: {'HostNationKilled': 1143.0, 'FriendlyWounded': 2693.0, 'HostNationWounded': 2444.0, 'EnemyWounded': 617.0, 'FriendlyKilled': 481.0, 'EnemyKilled': 4437.0, 'CivilianKilled': 1241.0, 'CivilianWounded': 2646.0}

In [6]:
#Extract wounded data and rename columns

Wounded = {}
for date in date_range:
    
    Wounded[str(date)]= {}
    
    Wounded[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationWounded']
    Wounded[str(date)]['Taliban'] = casualties[str(date)]['EnemyWounded']
    Wounded[str(date)]['Civilians'] = casualties[str(date)]['CivilianWounded']
    Wounded[str(date)]['Nato forces'] = casualties[str(date)]['FriendlyWounded']

In [7]:
for key in sorted(Wounded.iterkeys()):
    print "%s: %s" % (key, Wounded[key])


2004: {'Afghan forces': 316.0, 'Nato forces': 133.0, 'Taliban': 93.0, 'Civilians': 208.0}
2005: {'Afghan forces': 504.0, 'Nato forces': 480.0, 'Taliban': 200.0, 'Civilians': 454.0}
2006: {'Afghan forces': 1397.0, 'Nato forces': 1083.0, 'Taliban': 252.0, 'Civilians': 1794.0}
2007: {'Afghan forces': 2093.0, 'Nato forces': 1652.0, 'Taliban': 377.0, 'Civilians': 1962.0}
2008: {'Afghan forces': 1749.0, 'Nato forces': 1255.0, 'Taliban': 285.0, 'Civilians': 1980.0}
2009: {'Afghan forces': 2444.0, 'Nato forces': 2693.0, 'Taliban': 617.0, 'Civilians': 2646.0}

In [8]:
##writing wounded data to a dataframe for visualizing it with D3

data_wounded = pandas.DataFrame()

data_wounded['Year'] = [key for key in sorted(Wounded.iterkeys())]
data_wounded['Afghan forces'] = [Wounded[key]['Afghan forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['Nato forces'] = [Wounded[key]['Nato forces'] for key in sorted(Wounded.iterkeys())]
data_wounded['Taliban'] = [Wounded[key]['Taliban'] for key in sorted(Wounded.iterkeys())]
data_wounded['Civilians'] = [Wounded[key]['Civilians'] for key in sorted(Wounded.iterkeys())]

data_wounded


Out[8]:
Year Afghan forces Nato forces Taliban Civilians
0 2004 316 133 93 208
1 2005 504 480 200 454
2 2006 1397 1083 252 1794
3 2007 2093 1652 377 1962
4 2008 1749 1255 285 1980
5 2009 2444 2693 617 2646

In [9]:
#save wounded data to csv
data_wounded.to_csv('../../../data/Wounded', sep=',', index=False)

In [10]:
#multi bar plot of killed
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

N = 6
ind = np.arange(N)  # the x locations for the groups
width = 0.2     # the width of the bars

Afghan_forces = list(data_wounded['Afghan forces'])
Nato_forces = list(data_wounded['Nato forces'])
Taliban = list(data_wounded['Taliban'])
Civilians = list(data_wounded['Civilians'])

fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')

# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Wounded counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))

ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'Nato forces', 'Taliban', 'Civilians'))

def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)

plt.show()



In [11]:
#Extract killed data and rename columns

Killed = {}
for date in date_range:
    
    Killed[str(date)]= {}
    
    Killed[str(date)]['Afghan forces'] = casualties[str(date)]['HostNationKilled']
    Killed[str(date)]['Taliban'] = casualties[str(date)]['EnemyKilled']
    Killed[str(date)]['Civilians'] = casualties[str(date)]['CivilianKilled']
    Killed[str(date)]['Nato forces'] = casualties[str(date)]['FriendlyKilled']

In [12]:
for key in sorted(Killed.iterkeys()):
    print "%s: %s" % (key, Killed[key])


2004: {'Afghan forces': 218.0, 'Nato forces': 22.0, 'Taliban': 343.0, 'Civilians': 219.0}
2005: {'Afghan forces': 180.0, 'Nato forces': 71.0, 'Taliban': 890.0, 'Civilians': 178.0}
2006: {'Afghan forces': 605.0, 'Nato forces': 142.0, 'Taliban': 2689.0, 'Civilians': 800.0}
2007: {'Afghan forces': 951.0, 'Nato forces': 186.0, 'Taliban': 4044.0, 'Civilians': 758.0}
2008: {'Afghan forces': 699.0, 'Nato forces': 244.0, 'Taliban': 2816.0, 'Civilians': 798.0}
2009: {'Afghan forces': 1143.0, 'Nato forces': 481.0, 'Taliban': 4437.0, 'Civilians': 1241.0}

In [13]:
##writing killed data to a dataframe for visualizing it with D3

data_killed = pandas.DataFrame()

data_killed['Year'] = [key for key in sorted(Killed.iterkeys())]
data_killed['Afghan forces'] = [Killed[key]['Afghan forces'] for key in sorted(Killed.iterkeys())]
data_killed['Nato forces'] = [Killed[key]['Nato forces'] for key in sorted(Killed.iterkeys())]
data_killed['Taliban'] = [Killed[key]['Taliban'] for key in sorted(Killed.iterkeys())]
data_killed['Civilians'] = [Killed[key]['Civilians'] for key in sorted(Killed.iterkeys())]

data_killed


Out[13]:
Year Afghan forces Nato forces Taliban Civilians
0 2004 218 22 343 219
1 2005 180 71 890 178
2 2006 605 142 2689 800
3 2007 951 186 4044 758
4 2008 699 244 2816 798
5 2009 1143 481 4437 1241

In [14]:
#save killed data to csv
data_killed.to_csv('../../../data/Killed', sep=',', index=False)

In [15]:
Afghan_forces = list(data_killed['Afghan forces'])
Nato_forces = list(data_killed['Nato forces'])
Taliban = list(data_killed['Taliban'])
Civilians = list(data_killed['Civilians'])

fig, ax = plt.subplots(figsize=(20, 10))
rects1 = ax.bar(ind, Afghan_forces, width, color='r')
rects2 = ax.bar(ind + width, Nato_forces, width, color='b')
rects3 = ax.bar(ind + width*2, Taliban, width, color='g')
rects4 = ax.bar(ind + width*3, Civilians, width, color='y')

# add some text for labels, title and axes ticks
ax.set_ylabel('Counts')
ax.set_title('Killed counts recorded by the Wikileaks Afghanistan Database')
ax.set_xticks(ind + width)
ax.set_xticklabels(('2004', '2005', '2006', '2007', '2008', '2009'))

ax.legend((rects1[0], rects2[0], rects3[0], rects4[0]), ('Afghan forces', 'Nato forces', 'Taliban', 'Civilians'))

def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)

plt.show()


WordClouds

In this section we use a simple weighting scheme called TF-IDF to find important words within each category, and we visualize them in word clouds. Note that before looking for important words we replace the military acronyms in the summaries with their meanings, which gives us extended and more understandable descriptions.


In [16]:
import bs4
from bs4 import BeautifulSoup
import urllib2

#Replace military jargon in the summaries (abbreviations and acronyms the units use as shorthand)
#with the plain-language meanings from the Guardian's glossary. The extended summaries are used for the wordclouds.

#read url with the glossary of the acronyms
url = 'http://www.theguardian.com/world/datablog/2010/jul/25/wikileaks-afghanistan-war-logs-glossary'
#read url
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
#find the table where the acronyms are matched to their correspondent meaning
table = soup.find("table")

acronym = []
meaning = []

for row in table.findAll("tr"):
    cells = row.findAll("td")
    #For each "tr", assign each "td" to a variable.
    if len(cells) == 2:
        acronym.append(cells[0].find(text=True).rstrip())
        meaning.append(cells[1].find(text=True).rstrip())

#building dictionary
map_shortcuts = {}

#create a dictionary with acronyms as keys and meanings as values
for index in range(len(acronym)):
    map_shortcuts[acronym[index]] = meaning[index]

In [17]:
import operator
#print the first 10 acronym/meaning pairs, sorted by meaning
sorted(map_shortcuts.items(), key=operator.itemgetter(1))[:10]


Out[17]:
[(u'C/S', u' Call sign'),
 (u'L:', u' Location (in relation to S, A, L, T)'),
 (u'OBJ', u' Objective'),
 (u'BSN', u'(Camp) Bastion'),
 (u'cgbg 1 coy', u'1 company, Coldstream Guards battle group'),
 (u'GBU-31', u'2,000lb "smart bomb\''),
 (u'42 CDO RM', u'42 Commando Royal Marines'),
 (u'GBU-12', u'500lb laser-guided "smart bomb"'),
 (u'508 STB', u'508th special troops battalion'),
 (u'81', u'81mm mortar round')]

In [18]:
#Since we have 168 categories and computing a wordcloud for each of them is not feasible, we pick just a few
#categories to compute wordclouds for.

#indices (in the value_counts ranking) of the categories we compute wordclouds for
indices = [66,68,80,82,100,102]

#get category names
categ = [n_occurrences.index[idx] for idx in indices]

#filter the dataset. Only keep those rows with the categories selected
db_wc = data[data['Category'].isin(categ)]

In [19]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

#Replace each acronym in the tokenized summary with its meaning, producing an extended description.

def matchwords(words):

    text = []

    for word in words:
        if word in map_shortcuts:
            text.append(map_shortcuts[word])
        else:
            text.append(word)

    return ' '.join(text)

In [20]:
#remove rows with nans in summary
db_wc_filt = db_wc[db_wc['Summary'].notnull()]

#get indices of the filtered dataframe
indices = [index for index in db_wc_filt.index]

long_summary = []
for row in indices:
    #tokenize words in the summary
    words = tokenizer.tokenize(db_wc_filt['Summary'][row])
    #match acronyms to meanings and generate an extended summary
    long_summary.append(matchwords(words))

In [21]:
#create column with extended summary
db_wc_filt.insert(7, 'Extended Summary', long_summary)

In [22]:
db_wc_filt.head()


Out[22]:
ReportKey DateOccurred EventType Category TrackingNumber Title Summary Extended Summary Region AttackOn ... Longitude OriginatorGroup UpdatedByGroup CommandersCriticalInformationRequirements Significant Affiliation DisplayColor ClassificationLevel Year Hour
4697 971D1398-26F6-42A7-82C7-A03E16E95FA7 2007-02-17 14:43:00 Criminal Event Carjacking 2007-048-144324-0833 Second Maersk Line Driver, Truck, and Shipment... FROM: OPERATIONS OFFICER, AFGHAN DET - 831st ... FROM OPERATIONS OFFICER AFGHAN DET 831st TRANP... UNKNOWN ENEMY ... 70.628479 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2007 14
6301 D995545F-592A-49F1-8B8D-1131B45C50F6 2006-01-30 22:31:00 Criminal Event Carjacking 2007-033-004219-0104 300828Z Criminal Event LN report TF Phoenix (UNCONFIRMED REPORT) LN reported to TF Phoenix... UNCONFIRMED REPORT Local national reported to ... RC WEST ENEMY ... 64.011124 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2006 22
11260 E3B84BAD-7AF0-4F73-AA13-165612C20848 2007-04-01 09:27:00 Non-Combat Event Natural Disaster 2007-091-092933-0405 D6 010927ZAPR07 TF Cincinnatus Reports Natural... Initial Report: At 010927ZAPR07 Bamyan PRT not... Initial Report At 010927ZAPR07 Bamyan Provinci... RC EAST NEUTRAL ... 67.245644 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2007 9
17324 F0076CEB-9114-4BDF-917E-61E6343825E1 2007-02-08 15:30:00 Criminal Event Carjacking 2007-040-134513-0505 081530Z Hijacking of Supply Trucks in PAK 1. Category: 2\n\n \n\n2. TYPE OF IN... 1 Category 2 2 TYPE OF INCIDENT Three Addition... RC EAST ENEMY ... 71.067284 UNKNOWN UNKNOWN NaN NaN ENEMY RED SECRET 2007 15
17379 910B9A6D-5680-480C-A2AF-A7D802A9E915 2007-02-10 00:00:00 Non-Combat Event Natural Disaster 2007-042-071612-0929 10FEB TF Phoenix Rock Slide (No injuries, Rout... On 10FEB07 TF Phoenix reported a rock slide bl... On 10FEB07 Task force Phoenix reported a rock ... RC CAPITAL NEUTRAL ... 69.402046 UNKNOWN UNKNOWN NaN NaN NEUTRAL GREEN SECRET 2007 0

5 rows × 35 columns

Computing the tf-idf weight for each document in the corpus requires a series of steps:

  • Tokenize the corpus
  • Model the vector space
  • Compute the tf-idf weight for each document in the corpus

We start by building the corpus: for each selected category we join all of its extended summaries into a single document.
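
For reference, the helper functions defined below implement the standard definitions, with the document frequency smoothed by adding 1:

$$\mathrm{tf}(t, d) = \frac{\mathrm{freq}(t, d)}{|d|}, \qquad \mathrm{idf}(t) = \log\frac{N}{1 + \mathrm{df}(t)}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

where $N$ is the number of documents (here, the six selected categories), $|d|$ is the number of tokens in document $d$, and $\mathrm{df}(t)$ is the number of documents containing term $t$.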


In [23]:
branch_corpus = {}
for cat in categ:
    d = db_wc_filt[db_wc_filt['Category'] == cat]
    #for each category, join all text from the extended summary and save it in a dictionary
    branch_corpus[cat] = ' '.join(d['Extended Summary'])

In [24]:
import math
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#load english stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
#create a WordNetLemmatizer for lemmatizing the tokens
wordnet_lemmatizer = WordNetLemmatizer() 

#compute number of times a word appears in a document
def freq(word, doc):
    return doc.count(word)

#get number of words in the document
def word_count(doc):
    return len(doc)

#compute TF
def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

#compute number of documents containing a particular word
def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count


#compute IDF
def idf(word, list_of_docs):
    return math.log(len(list_of_docs) /
            float(num_docs_containing(word, list_of_docs)))

#compute TF-IDF
def tf_idf(word, doc, list_of_docs):
    return (tf(word, doc) * idf(word, list_of_docs))


vocabulary = []
docs = {}

for key,text in branch_corpus.iteritems():
        #tokenize text for a particular category
        tokens = nltk.word_tokenize(text)
        #lower case the words
        tokens = [token.lower() for token in tokens]
        #only keep words, disregard digits
        tokens = [token for token in tokens if token.isalpha()]
        #disregard stopwords and words of two letters or fewer
        tokens = [token for token in tokens if token not in stopwords and len(token) > 2]
        final_tokens = []
        final_tokens.extend(tokens)
        
        docs[key] = {'tf': {}, 'idf': {},
                        'tf-idf': {}, 'tokens': []}
        #compute TF 
        for token in final_tokens:
            #The term-frequency (Normalized Frequency)
            docs[key]['tf'][token] = tf(token, final_tokens)
            docs[key]['tokens'] = final_tokens
        vocabulary.append(final_tokens)

#Compute IDF and TF-IDF
for doc in docs:
    for token in docs[doc]['tf']:
        #The Inverse-Document-Frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
        #The tf-idf
        docs[doc]['tf-idf'][token] = tf_idf(token, docs[doc]['tokens'], vocabulary)

In [25]:
#get the 10 words with highest TF-IDF score for category 'Arrest'
sorted_dic = sorted(docs['Arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]


Out[25]:
[(u'arrested', 0.009088038866306866),
 (u'seized', 0.005450238143640517),
 (u'arrest', 0.004282329970003263),
 (u'detained', 0.004175585425059912),
 (u'rel', 0.0027251190718202583),
 (u'son', 0.002701849392685825),
 (u'house', 0.002701849392685825),
 (u'suspects', 0.0024562267206234773),
 (u'suspicious', 0.002335816347274507),
 (u'mines', 0.002335816347274507)]

In [26]:
#To create the wordclouds we round each tf-idf score to the nearest integer and then combine all words into one
#long string separated by spaces, repeating each word according to its rounded score.
#Note: because the raw tf-idf values are very small, we first scale them by a factor of 1000 before rounding.

for cat in categ:
    for tup in docs[cat]['tf-idf'].items():
        #scale each tf-idf value by a factor of 1000 and round it
        docs[cat]['tf-idf'][tup[0]] = int(round(tup[1]*1000))

In [27]:
#see how they have been scaled and rounded
sorted_dic = sorted(docs['Arrest']['tf-idf'].items(), key=operator.itemgetter(1), reverse=True)
sorted_dic[0:10]


Out[27]:
[(u'arrested', 9),
 (u'seized', 5),
 (u'arrest', 4),
 (u'detained', 4),
 (u'rel', 3),
 (u'son', 3),
 (u'house', 3),
 (u'detention', 2),
 (u'capture', 2),
 (u'boorhamodin', 2)]

In [28]:
#generate text for wordclouds
doc_wordcloud = {}

for cat in categ:
    string = []
    for tup in docs[cat]['tf-idf'].items():
        if tup[1] > 0:
            #repeat each word to its scaled and rounded tf-idf score
            string.extend(np.repeat(tup[0],tup[1]))
            
    doc_wordcloud[cat] = ' '.join(string)

In [29]:
from wordcloud import WordCloud

#generate wordclouds
for cat in categ:
    print "#### %s ####" % cat
    wordcloud = WordCloud().generate(doc_wordcloud[cat])
    img=plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


#### Natural Disaster ####
#### Arrest ####
#### Carjacking ####
#### Green-Green ####
#### Downed Aircraft ####
#### Refugees ####

In [30]:
#Save wordclouds to files so that we can read them in js for visualizing with D3
def savefile(category):
    filename = '../../../data/' + category
    with open(filename, "w") as text_file:
        text_file.write(doc_wordcloud[category])

for cat in categ:
    savefile(cat)

LDA

In this section we use Latent Dirichlet Allocation (LDA), a topic model that generates topics based on word frequencies across a set of documents.

We treat each extended summary in our data set as a document. Cleaning the text is crucial for generating a useful topic model, so we lemmatize, lowercase and filter the tokens first. To generate the LDA model we then need to know how frequently each term occurs within each document, which we capture by building a document-term matrix (bag of words) with the gensim package. Once the bag of words is constructed, we are ready to fit an LDA model.
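
As a small illustration of the bag-of-words representation gensim builds, here is a minimal sketch on a toy two-document corpus (the token lists are made up and are not taken from the war logs):

from gensim import corpora

# toy corpus: two documents, each already tokenized
toy_docs = [['afghan', 'police', 'checkpoint'],
            ['police', 'convoy', 'police']]
toy_dictionary = corpora.Dictionary(toy_docs)                # maps each unique token to an integer id
toy_bow = [toy_dictionary.doc2bow(doc) for doc in toy_docs]  # each document becomes a list of (token id, count) pairs
print toy_bow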


In [31]:
import logging, gensim, bz2
from gensim import corpora, models, similarities
import os
import logging
import gensim
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import lxml.html

In [32]:
#will be used for text cleaning
stopwords = set(nltk.corpus.stopwords.words('english')) #load english stopwords
stoplist = set('for a of the and to in'.split())
manual_list=set('singapore event none expo nan get ticket tickets http https singaporean'.split())
wordnet_lemmatizer = WordNetLemmatizer() #create a WordNetLemmatizer to lemmatize the tokens

In [33]:
documents = [summary for summary in db_wc_filt['Extended Summary']]

In [34]:
#For cleaning the text:
##Lemmatize words (get the root of the words)
##Only keep words and disregard digits
##Make them lowercase so that there is no difference between "Then" and "then" for instance
##Remove stopwords, which are common English words that do not carry meaning
##Remove words of two letters or fewer, since they usually do not provide any meaning

clean_dict = {}
i = 0
for document in documents:
    try:
        cleaned_html = lxml.html.fromstring(str(document)).text_content()
    except Exception:
        cleaned_html = ""
    temp_token_list = nltk.word_tokenize(str(cleaned_html.encode('ascii', 'ignore')))
    content = [wordnet_lemmatizer.lemmatize(w.lower()) for w in temp_token_list if 
               ( w.isalpha() and w.lower() not in stopwords and w.lower() not in stoplist \
                and w.lower() not in manual_list and len(w)>2)]
    
    #save the cleaned tokens
    clean_dict[i]=content
    i+=1

In [35]:
# remove words that appear only once
from collections import defaultdict

frequency = defaultdict(int)
for text in clean_dict.values():
    for token in text:
        frequency[token] += 1

for i in sorted(clean_dict):
    clean_dict[i] = [token for token in clean_dict[i] if frequency[token] > 1]

In [36]:
dictionary = corpora.Dictionary(clean_dict.values())
corpus = [dictionary.doc2bow(text) for text in clean_dict.values()]

In [37]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                  level=logging.INFO)

#Compute LDA with 10 topics and 10 passes over the corpus.
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=10)

In [38]:
lda.print_topics(num_topics=10, num_words=10)


Out[38]:
[u'0.062*afghan + 0.056*national + 0.032*army + 0.019*police + 0.016*force + 0.014*security + 0.010*arrested + 0.009*team + 0.008*assistance + 0.008*provincial',
 u'0.031*amp + 0.018*force + 0.013*apos + 0.012*refugee + 0.012*bagram + 0.012*assistance + 0.010*cincinnatus + 0.010*may + 0.009*afghan + 0.009*fuel',
 u'0.017*truck + 0.013*base + 0.011*report + 0.010*village + 0.009*driver + 0.009*people + 0.008*operating + 0.008*forward + 0.008*information + 0.007*air',
 u'0.080*afghan + 0.071*national + 0.061*police + 0.017*army + 0.014*force + 0.012*wounded + 0.011*action + 0.010*shot + 0.009*security + 0.009*reported',
 u'0.037*force + 0.022*national + 0.020*reported + 0.017*afghan + 0.016*task + 0.015*update + 0.013*base + 0.013*report + 0.011*forward + 0.011*operating',
 u'0.023*flood + 0.019*bridge + 0.016*road + 0.015*river + 0.011*damage + 0.011*valley + 0.010*destroyed + 0.010*canal + 0.009*work + 0.009*june',
 u'0.026*team + 0.025*provincial + 0.023*reconstruction + 0.021*road + 0.012*valley + 0.010*area + 0.010*amp + 0.008*response + 0.008*water + 0.008*flood',
 u'0.016*afghanistan + 0.013*refugee + 0.012*pakistan + 0.012*unhcr + 0.010*government + 0.009*office + 0.008*shah + 0.008*family + 0.008*interior + 0.008*ministry',
 u'0.031*lilley + 0.016*observation + 0.016*post + 0.015*fire + 0.012*boorhamodin + 0.011*sangar + 0.010*operating + 0.009*round + 0.009*military + 0.009*receiving',
 u'0.019*road + 0.011*vehicle + 0.010*mudslide + 0.010*team + 0.010*truck + 0.009*provincial + 0.009*reconstruction + 0.009*driver + 0.009*reported + 0.008*equipment']

In [39]:
topics = [lda.show_topic(i) for i in range(10)]

In [40]:
#Arrange the topics in the nested dict format expected by D3 (written to a JSON file in the next cell)

LDA_dicts = []

for index, topic in enumerate(topics):
    name = 'Topic ' + str(index+1)
    dic_out = {'name': name, 'children': []}
    children = []
    #in this gensim version show_topic returns (probability, word) tuples
    for term in topic:
        children.append({'name': term[1], 'size': term[0]})
    dic_out['children'] = children
    LDA_dicts.append(dic_out)

In [41]:
#dump dict to JSON file
import json

with open('../../../data/LDA_topics.json', 'w') as fp:
    json.dump(LDA_dicts, fp)

In [42]:
LDA_dicts


Out[42]:
[{'children': [{'name': u'afghan', 'size': 0.062315425599889389},
   {'name': u'national', 'size': 0.056257503543366047},
   {'name': u'army', 'size': 0.031511483825294867},
   {'name': u'police', 'size': 0.019104610680068641},
   {'name': u'force', 'size': 0.016445516870386919},
   {'name': u'security', 'size': 0.013785890869822423},
   {'name': u'arrested', 'size': 0.010252922222086841},
   {'name': u'team', 'size': 0.0092772241701006807},
   {'name': u'assistance', 'size': 0.0079729653003190563},
   {'name': u'provincial', 'size': 0.0079611543136888594}],
  'name': 'Topic 1'},
 {'children': [{'name': u'amp', 'size': 0.031297039577423129},
   {'name': u'force', 'size': 0.018291395802085015},
   {'name': u'apos', 'size': 0.012883990006675065},
   {'name': u'refugee', 'size': 0.012357515830011958},
   {'name': u'bagram', 'size': 0.011903925524808403},
   {'name': u'assistance', 'size': 0.011764468650899673},
   {'name': u'cincinnatus', 'size': 0.010273641671430286},
   {'name': u'may', 'size': 0.010247588961244048},
   {'name': u'afghan', 'size': 0.0092997843874975551},
   {'name': u'fuel', 'size': 0.0092396884393434082}],
  'name': 'Topic 2'},
 {'children': [{'name': u'truck', 'size': 0.017097972812603817},
   {'name': u'base', 'size': 0.013193259367702137},
   {'name': u'report', 'size': 0.011205433004581001},
   {'name': u'village', 'size': 0.0098841393216456287},
   {'name': u'driver', 'size': 0.0094745698592308848},
   {'name': u'people', 'size': 0.0088046651803829375},
   {'name': u'operating', 'size': 0.0082590509930689712},
   {'name': u'forward', 'size': 0.0082497993269839073},
   {'name': u'information', 'size': 0.0077153349791427902},
   {'name': u'air', 'size': 0.0073937484322570256}],
  'name': 'Topic 3'},
 {'children': [{'name': u'afghan', 'size': 0.080008706654433062},
   {'name': u'national', 'size': 0.070682794093763435},
   {'name': u'police', 'size': 0.061270968479054895},
   {'name': u'army', 'size': 0.017088488622939206},
   {'name': u'force', 'size': 0.014363316830068765},
   {'name': u'wounded', 'size': 0.011772569444254675},
   {'name': u'action', 'size': 0.011138010575500542},
   {'name': u'shot', 'size': 0.010166119090740403},
   {'name': u'security', 'size': 0.0091998045057210195},
   {'name': u'reported', 'size': 0.0090071141809841063}],
  'name': 'Topic 4'},
 {'children': [{'name': u'force', 'size': 0.037099508829362332},
   {'name': u'national', 'size': 0.02151683026427826},
   {'name': u'reported', 'size': 0.019643760083749334},
   {'name': u'afghan', 'size': 0.017310395459674246},
   {'name': u'task', 'size': 0.015695916925359005},
   {'name': u'update', 'size': 0.015459759815543136},
   {'name': u'base', 'size': 0.013365322833514439},
   {'name': u'report', 'size': 0.013238996284963072},
   {'name': u'forward', 'size': 0.011331483323143357},
   {'name': u'operating', 'size': 0.010807508926733329}],
  'name': 'Topic 5'},
 {'children': [{'name': u'flood', 'size': 0.022737421817795728},
   {'name': u'bridge', 'size': 0.019145863901261474},
   {'name': u'road', 'size': 0.01596498971698743},
   {'name': u'river', 'size': 0.014641666503861572},
   {'name': u'damage', 'size': 0.011191571426924417},
   {'name': u'valley', 'size': 0.010521320139011709},
   {'name': u'destroyed', 'size': 0.010028512519206868},
   {'name': u'canal', 'size': 0.0096454381964119987},
   {'name': u'work', 'size': 0.0091935112231104517},
   {'name': u'june', 'size': 0.0090550371546096273}],
  'name': 'Topic 6'},
 {'children': [{'name': u'team', 'size': 0.025729379170586118},
   {'name': u'provincial', 'size': 0.024694546701534415},
   {'name': u'reconstruction', 'size': 0.022652605980638299},
   {'name': u'road', 'size': 0.021412624570352281},
   {'name': u'valley', 'size': 0.011679529304061059},
   {'name': u'area', 'size': 0.010146679649514209},
   {'name': u'amp', 'size': 0.0095518666179083018},
   {'name': u'response', 'size': 0.0084712110757321872},
   {'name': u'water', 'size': 0.0078229887100702312},
   {'name': u'flood', 'size': 0.0075734417732718565}],
  'name': 'Topic 7'},
 {'children': [{'name': u'afghanistan', 'size': 0.016274118033858417},
   {'name': u'refugee', 'size': 0.013123844453884799},
   {'name': u'pakistan', 'size': 0.011720079301356302},
   {'name': u'unhcr', 'size': 0.011581859697880954},
   {'name': u'government', 'size': 0.0096538283923194076},
   {'name': u'office', 'size': 0.0086150120952563653},
   {'name': u'shah', 'size': 0.0084828262910372626},
   {'name': u'family', 'size': 0.0083662453954871149},
   {'name': u'interior', 'size': 0.0082424117262153394},
   {'name': u'ministry', 'size': 0.0081725180858644737}],
  'name': 'Topic 8'},
 {'children': [{'name': u'lilley', 'size': 0.030780101015036278},
   {'name': u'observation', 'size': 0.016463288049459148},
   {'name': u'post', 'size': 0.016379114046847408},
   {'name': u'fire', 'size': 0.015094894336554329},
   {'name': u'boorhamodin', 'size': 0.011515370894476381},
   {'name': u'sangar', 'size': 0.011378113328502953},
   {'name': u'operating', 'size': 0.010307814472560406},
   {'name': u'round', 'size': 0.0093769925076570004},
   {'name': u'military', 'size': 0.008727718053748227},
   {'name': u'receiving', 'size': 0.0085642648952184743}],
  'name': 'Topic 9'},
 {'children': [{'name': u'road', 'size': 0.019007725442469252},
   {'name': u'vehicle', 'size': 0.011250369111866157},
   {'name': u'mudslide', 'size': 0.010446155184588822},
   {'name': u'team', 'size': 0.010403892910543128},
   {'name': u'truck', 'size': 0.010046649058733877},
   {'name': u'provincial', 'size': 0.0094502246960107283},
   {'name': u'reconstruction', 'size': 0.0089106578560191399},
   {'name': u'driver', 'size': 0.008709857115010617},
   {'name': u'reported', 'size': 0.0085316534228923532},
   {'name': u'equipment', 'size': 0.0082691241511847619}],
  'name': 'Topic 10'}]

In [ ]: