Getting Started with ICEWS in Python

David Masad

The Integrated Crisis Early Warning System (ICEWS) is a machine-coded event dataset developed by Lockheed Martin and others for DARPA and the Office of Naval Research. For a long time, ICEWS was available only within the Department of Defense, and to a few select academics. Now, for the first time, a checkpointed version of ICEWS is being released to the general public (or, at least, the parts of the general public that care about political event data).

Unlike some event data sets, the public version of ICEWS will only be updated annually or so, but it still includes almost 20 years worth of event data that's been used successfully both in the government and academic research.

This document is mostly a cleaned-up version of my own initial exploration of the dataset. Hopefully it'll prove useful to others who want to use ICEWS in their own research.

UPDATE (03/29/15): Jennifer Lautenschlager, from the ICEWS team at Lockheed Martin, was kind enough to provide some clarifications, which I've added.

Technical note

This is done in Python 3.4.2, with pandas version 0.15.2. The only requirement that might be tricky to install is Basemap, which is only used for the mapping section. You won't miss much without it.



In [1]:

    
import os
from collections import defaultdict

# Other libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Show plots inline
%matplotlib inline

Downloading the data

The data is available via the Harvard Dataverse, at http://thedata.harvard.edu/dvn/dv/icews. The two datasets I use are the ICEWS Coded Event Data and the Ground Truth Data Set. The easiest way to download both is to go to the Data & Analysis tab, click Select all files at the top, and then Download Selected Files.

The ICEWS event data comes as one file per year, initially zipped. On OSX or Linux, you can unzip all the files in a directory at once from the terminal with

$ unzip "*.zip"

And you can delete all the zipped files with

$ rm *.zip

In this document, I assume that all the annual data files, as well as the one Ground Truth data file, are in the same directory.

Loading the data



In [2]:

    
# Path to directory where the data is stored
DATA = "/Users/dmasad/Data/ICEWS/"

For testing purposes, I start by loading a single year into a pandas DataFrame. The data files are tab-delimited, and have the column names as the first row.



In [3]:

    
one_year = pd.read_csv(DATA + "events.1995.20150313082510.tab", sep="\t")



In [4]:

    
one_year.head()









    Out[4]:






  
    
      
      Event ID
      Event Date
      Source Name
      Source Sectors
      Source Country
      Event Text
      CAMEO Code
      Intensity
      Target Name
      Target Sectors
      Target Country
      Story ID
      Sentence Number
      Publisher
      City
      District
      Province
      Country
      Latitude
      Longitude
    
  
  
    
      0
       926685
       1995-01-01
                        Extremist (Russia)
       Radicals / Extremists / Fundamentalists,Dissident
           Russian Federation
                                 Praise or endorse
        51
       3.4
                             Boris Yeltsin
         Elite,Executive,Executive Office,Government
           Russian Federation
       28235806
       5
       The Toronto Star
         Moscow
            NaN
        Moskva
           Russian Federation
       55.7522
       37.6156
    
    
      1
       926687
       1995-01-01
       Government (Bosnia and Herzegovina)
                                              Government
       Bosnia and Herzegovina
                       Express intent to cooperate
        30
       4.0
                          Citizen (Serbia)
       General Population / Civilian / Social,Social
                       Serbia
       28235807
       1
       The Toronto Star
            NaN
            NaN
        Bosnia
       Bosnia and Herzegovina
       44.0000
       18.0000
    
    
      2
       926686
       1995-01-01
                          Citizen (Serbia)
           General Population / Civilian / Social,Social
                       Serbia
                       Express intent to cooperate
        30
       4.0
       Government (Bosnia and Herzegovina)
                                          Government
       Bosnia and Herzegovina
       28235807
       1
       The Toronto Star
            NaN
            NaN
        Bosnia
       Bosnia and Herzegovina
       44.0000
       18.0000
    
    
      3
       926688
       1995-01-01
                                    Canada
                                                     NaN
                       Canada
                                 Praise or endorse
        51
       3.4
                       City Mayor (Canada)
                          Government,Local,Municipal
                       Canada
       28235809
       3
       The Toronto Star
            NaN
            NaN
       Ontario
                       Canada
       49.2501
      -84.4998
    
    
      4
       926689
       1995-01-01
                  Lawyer/Attorney (Canada)
                                            Legal,Social
                       Canada
       Arrest, detain, or charge with legal action
       173
      -5.0
                           Police (Canada)
                                   Government,Police
                       Canada
       28235964
       1
       The Toronto Star
       Montreal
       Montreal
        Quebec
                       Canada
       45.5088
      -73.5878



In [5]:

    
one_year.dtypes









    Out[5]:





Event ID             int64
Event Date          object
Source Name         object
Source Sectors      object
Source Country      object
Event Text          object
CAMEO Code           int64
Intensity          float64
Target Name         object
Target Sectors      object
Target Country      object
Story ID             int64
Sentence Number      int64
Publisher           object
City                object
District            object
Province            object
Country             object
Latitude           float64
Longitude          float64
dtype: object

Looks pretty good! Notice that the Event Date column is an object (meaning a string), so when we load in all of the data we should tell pandas to parse it automatically.

Loading all the data

The ICEWS data isn't too big to hold in memory all at once, so I go ahead and load the entire thing. To do it, we'll iterate over all the data files, read each into a DataFrame, and then concatenate them together.

Note that in this code, I added the parse_dates=[1] argument to the .read_csv(...) method, telling pandas to parse the second column as a date.

This code assumes that the ICEWS data files are the only .tab files in your DATA directory. If that isn't the case, adjust as needed.



In [6]:

    
all_data = []
for f in os.listdir(DATA): # Iterate over all files
    if f[-3:] != "tab":  # Skip non-tab files.
        continue
    df = pd.read_csv(DATA + f, sep='\t', parse_dates=[1])
    all_data.append(df)

data = pd.concat(all_data)

Some of the ICEWS column names have spaces in them, which means they can't be referenced using pandas's period notation. To fix this, I rename the columns to replace the spaces with underscores:



In [7]:

    
cols = {col: col.replace(" ", "_") for col in data.columns}
data.rename(columns=cols, inplace=True)



In [8]:

    
data.dtypes









    Out[8]:





Event_ID                    int64
Event_Date         datetime64[ns]
Source_Name                object
Source_Sectors             object
Source_Country             object
Event_Text                 object
CAMEO_Code                  int64
Intensity                 float64
Target_Name                object
Target_Sectors             object
Target_Country             object
Story_ID                    int64
Sentence_Number             int64
Publisher                  object
City                       object
District                   object
Province                   object
Country                    object
Latitude                  float64
Longitude                 float64
dtype: object



In [9]:

    
print(data.Event_Date.min())
print(data.Event_Date.max())









    



1995-01-01 00:00:00
2014-02-28 00:00:00



In [10]:

    
len(data)









    Out[10]:





13514121

Looks good! The data types are what we expect, and the dates seem to have been parsed correctly.

Examining the data

A good initial examination of the data is seeing who the most frequent actors are. The following code counts how often each actor appears as the source or target of an event:



In [11]:

    
actors_source = data.Source_Name.value_counts()
actors_target = data.Target_Name.value_counts()
actor_counts = pd.DataFrame({"SourceFreq": actors_source,
                             "TargetFreq": actors_target})
actor_counts.fillna(0, inplace=True)
actor_counts["Total"] = actor_counts.SourceFreq + actor_counts.TargetFreq

Now let's look at the top 50 actors. For people like me who are more used to GDELT and Phoenix, the actor list might look a little different than what we expect:



In [12]:

    
actor_counts.sort("Total", ascending=False, inplace=True)
actor_counts.head(50)









    Out[12]:






  
    
      
      SourceFreq
      TargetFreq
      Total
    
  
  
    
      United States
       330446
       341603
       672049
    
    
      Russia
       195571
       260635
       456206
    
    
      China
       192944
       254747
       447691
    
    
      Israel
       116427
       150810
       267237
    
    
      Japan
       103651
       145657
       249308
    
    
      India
        96871
       147406
       244277
    
    
      Iran
        84099
       150208
       234307
    
    
      Citizen (India)
        65350
       136966
       202316
    
    
      United Nations
        92022
       107413
       199435
    
    
      Unspecified Actor
            0
       198718
       198718
    
    
      European Union
        90824
       100310
       191134
    
    
      Iraq
        48025
       128246
       176271
    
    
      Vladimir Putin
       102453
        73104
       175557
    
    
      George W. Bush
       100763
        72909
       173672
    
    
      North Korea
        65394
       105958
       171352
    
    
      Turkey
        64995
       104946
       169941
    
    
      South Korea
        69509
        89696
       159205
    
    
      Pakistan
        57450
        98419
       155869
    
    
      Police (India)
       117726
        32771
       150497
    
    
      United Kingdom
        65195
        81465
       146660
    
    
      Palestinian Territory, Occupied
        39861
       101029
       140890
    
    
      France
        61803
        72704
       134507
    
    
      Australia
        41393
        71031
       112424
    
    
      Afghanistan
        31678
        75604
       107282
    
    
      North Atlantic Treaty Organization
        51437
        52715
       104152
    
    
      Syria
        33110
        64113
        97223
    
    
      Germany
        42116
        51114
        93230
    
    
      Egypt
        34961
        47969
        82930
    
    
      Barack Obama
        46660
        34782
        81442
    
    
      Indonesia
        28997
        43795
        72792
    
    
      Georgia
        26227
        45432
        71659
    
    
      Hu Jintao
        39679
        28938
        68617
    
    
      Citizen (Palestinian Territory, Occupied)
        19921
        48444
        68365
    
    
      Yasir Arafat
        31196
        35228
        66424
    
    
      Citizen (Russia)
        23473
        42401
        65874
    
    
      Thailand
        25541
        40200
        65741
    
    
      Mahmoud Abbas
        34629
        29995
        64624
    
    
      Citizen (Australia)
        22541
        41030
        63571
    
    
      Serbia
        21006
        41502
        62508
    
    
      Taliban
        31678
        30688
        62366
    
    
      Government (India)
        12117
        49629
        61746
    
    
      Israeli Defense Forces
        42364
        19130
        61494
    
    
      UN Security Council
        27184
        34210
        61394
    
    
      Ukraine
        23290
        37906
        61196
    
    
      Taiwan
        19421
        41052
        60473
    
    
      Kofi Annan
        37150
        22442
        59592
    
    
      Vietnam
        24456
        34951
        59407
    
    
      Tony Blair
        33083
        25713
        58796
    
    
      Lebanon
        17041
        40094
        57135
    
    
      Mexico
        24690
        32142
        56832

What stood out to me was the mix of country-level actors with named individuals. Unlike event datasets that use CAMEO coding, leaders or sub-state organizations don't seem to be coded as add-ons to a state actor code (e.g. USAGOV) but separate actors in their own right.

Update (03/29/2015): The _Sectors column contains the role information that would otherwise be contained in the chained CAMEO designations. For example, if you scroll back to the first row of 1995 data, the target name is Boris Yeltsin, and the target sectors associated with him are "Elite,Executive,Executive Office,Government".

The Citizen (Country) actor stood out to me in particular, especially since it isn't mentioned specifically in the included documentation -- so let's take a look at some of the rows that use it:



In [13]:

    
data[data.Source_Name=="Citizen (India)"].head()









    Out[13]:






  
    
      
      Event_ID
      Event_Date
      Source_Name
      Source_Sectors
      Source_Country
      Event_Text
      CAMEO_Code
      Intensity
      Target_Name
      Target_Sectors
      Target_Country
      Story_ID
      Sentence_Number
      Publisher
      City
      District
      Province
      Country
      Latitude
      Longitude
    
  
  
    
      676 
        927826
      1995-01-11
       Citizen (India)
       General Population / Civilian / Social,Social
       India
       Reject proposal to meet, discuss, or negotiate
        125
       -5.0
           Narasimha Rao
                   Executive Office,Executive,Government
               India
       28239021
       2
       The Associated Press Political Service
             NaN
                 NaN
                       State of Tamil Nadu
               India
       11.0000
       78.0000
    
    
      783 
        927996
      1995-01-12
       Citizen (India)
       General Population / Civilian / Social,Social
       India
                  Express intent to meet or negotiate
         36
        4.0
           United States
                                                     NaN
       United States
       28239081
       1
       The Associated Press Political Service
         Swanton
       Saline County
                                  Nebraska
       United States
       40.3781
      -97.0728
    
    
      2151
       2547954
      1995-01-26
       Citizen (India)
       Social,General Population / Civilian / Social
       India
                                 Demonstrate or rally
        141
       -6.5
       Unspecified Actor
                                             Unspecified
                 NaN
       28242253
       2
                                 Reuters News
           Jammu
                 NaN
                State of Jammu and Kashmir
               India
       32.7357
       74.8691
    
    
      5760
        935070
      1995-03-06
       Citizen (India)
       Social,General Population / Civilian / Social
       India
                             Kill by physical assault
       1823
      -10.0
          Congress Party
       (National) Major Party,Government Major Party ...
               India
       28915932
       4
       The Associated Press Political Service
       Hyderabad
           Hyderabad
                   State of Andhra Pradesh
               India
       17.3841
       78.4564
    
    
      5766
        935081
      1995-03-06
       Citizen (India)
       Social,General Population / Civilian / Social
       India
                          Use unconventional violence
        180
       -9.0
        Militant (India)
                                     Unidentified Forces
               India
       28915955
       6
       The Associated Press Political Service
       New Delhi
                 NaN
       National Capital Territory of Delhi
               India
       28.6358
       77.2244

So it looks like Citizen really means civilians, or possibly civil society actors unaffiliated with any organization the ICEWS coding system recognizes.

Update (03/29/2015): I had trouble finding news events that corresponded to the events above, but Jennifer Lautenschlager pointed me to this news article that indicates that there was election violence in India in that time frame.

To get country-level actors comparable to other event datasets, looks like we need to use the source and target country columns:



In [14]:

    
country_source = data.Source_Country.value_counts()
country_target = data.Target_Country.value_counts()
country_counts = pd.DataFrame({"SourceFreq": country_source,
                             "TargetFreq": country_target})
country_counts.fillna(0, inplace=True)
country_counts["Total"] = country_counts.SourceFreq + country_counts.TargetFreq



In [15]:

    
country_counts.sort("Total", ascending=False, inplace=True)
country_counts.head(10)









    Out[15]:






  
    
      
      SourceFreq
      TargetFreq
      Total
    
  
  
    
      United States
       997696
       803460
       1801156
    
    
      India
       773712
       760583
       1534295
    
    
      Russian Federation
       746829
       706700
       1453529
    
    
      China
       541432
       525955
       1067387
    
    
      Japan
       344413
       332380
        676793
    
    
      Australia
       340339
       320329
        660668
    
    
      Israel
       338118
       315501
        653619
    
    
      United Kingdom
       331735
       302389
        634124
    
    
      Occupied Palestinian Territory
       251678
       317883
        569561
    
    
      Iran
       286274
       283276
        569550

This looks pretty good too! India seems more represented compared to what I've seen in other datasets, and of course Israel/Palestine maintain their usual place on the event data leaderboard.

Update (03/29/2015): Since the Sectors are also an important way of understanding the relevant data, let's get their frequencies too. Sectors are a bit trickier, since each cell can contain multiple selectors, separated by commas. So we need to loop over each cell, split the selectors mentioned, and count each one.



In [16]:

    
# Count source sectors
source_sectors = defaultdict(int)
source_sector_counts = data.Source_Sectors.value_counts()
for sectors, count in source_sector_counts.iteritems():
    sectors = sectors.split(",")
    for sector in sectors:
        source_sectors[sector] += 1

# Count Target sectors
target_sectors = defaultdict(int)
target_sector_counts = data.Target_Sectors.value_counts()
for sectors, count in target_sector_counts.iteritems():
    sectors = sectors.split(",")
    for sector in sectors:
        target_sectors[sector] += 1
        
# Convert into series
source_sectors = pd.Series(source_sectors)
target_sectors = pd.Series(target_sectors)
# Combine into a dataframe, and fill missing with 0
sector_counts = pd.DataFrame({"SourceFreq": source_sectors,
                              "TargetFreq": target_sectors})

sector_counts.fillna(0, inplace=True)
sector_counts["Total"] = sector_counts.SourceFreq + sector_counts.TargetFreq



In [17]:

    
sector_counts.sort("Total", ascending=False, inplace=True)



In [18]:

    
sector_counts.head(10)









    Out[18]:






  
    
      
      SourceFreq
      TargetFreq
      Total
    
  
  
    
      Government
       176897
       138684
       315581
    
    
      Parties
       171411
       135383
       306794
    
    
      Ideological
       153750
       121842
       275592
    
    
      (National) Major Party
       134134
       106262
       240396
    
    
      Executive
       129382
       103265
       232647
    
    
      Elite
        92926
        80163
       173089
    
    
      Legislative / Parliamentary
        63654
        45670
       109324
    
    
      Executive Office
        54710
        49770
       104480
    
    
      Cabinet
        57273
        42038
        99311
    
    
      Center Left
        48678
        37593
        86271



In [19]:

    
sector_counts.tail(10)









    Out[19]:






  
    
      
      SourceFreq
      TargetFreq
      Total
    
  
  
    
      International Exiles
       2
       1
       3
    
    
      Bedouin
       2
       0
       2
    
    
      Nepali-Pahari
       1
       1
       2
    
    
       Western
       1
       1
       2
    
    
      Navy Headquarters
       1
       1
       2
    
    
      Army Education / Training
       0
       1
       1
    
    
      Unspecified
       0
       1
       1
    
    
      Consumer Services MNCs
       1
       0
       1
    
    
      State-Owned Consumer Goods
       1
       0
       1
    
    
       Saharan
       1
       0
       1

In addition to CAMEO-type actor designations (e.g. Government) it looks like some of the Sectors also resemble the Issues in Phoenix, or Themes in the GDELT GKG.

Daily Event Counts

An easy way to get an idea of whether there were significant changes in data collection over time is to look at total events over time. ICEWS events have the full daily date only, so let's go with that and look at daily events.



In [20]:

    
daily_events = data.groupby("Event_Date").aggregate(len)["Event_ID"]



In [21]:

    
daily_events.plot(color='k', lw=0.2, figsize=(12,6), 
                  title="ICEWS Daily Event Count")









    Out[21]:





<matplotlib.axes._subplots.AxesSubplot at 0x1044704e0>

There seems to be a definite ramp-up period from 1995 through 1999 or so, and some sort of fall in event volume around 2009. Notice that there are also a few individual days, especially around 2004, with very few events for some reason.

Update (03/29/2015): Jennifer Lautenschlager clarified that the jumps in the 1995-2001 period reflect publishers entering incrementally into the commercial data system that feeds into ICEWS. The post-2008 dip reflects a decline in number of stories overall, possibly driven by budget cuts due to the recession.

Since each event has an associated Story ID, we can count how many unique stories are processed by ICEWS every day and end up generating events.



In [22]:

    
daily_stories = data.groupby("Event_Date").aggregate(pd.Series.nunique)["Story_ID"]



In [23]:

    
daily_stories.plot(color='k', lw=0.2, figsize=(12,6), 
                   title="ICEWS Daily Story Count")









    Out[23]:





<matplotlib.axes._subplots.AxesSubplot at 0x11a31dfd0>

With these two series, we can measure the daily average events generated per story:



In [24]:

    
events_per_story = daily_events / daily_stories
events_per_story.plot(color='k', lw=0.2, figsize=(12,6), 
                      title="ICEWS Daily Events Per Story")









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x11a317080>

This confirms that indeed, except for a few anomalies, the number of events generated per story stays relatively consistent over time. Nevertheless, it's probably important to at least try to distinguish between fewer stories as caused by fewer newsworthy events, and fewer stories as caused by fewer journalists writing them.

Mapping

Another good way to get an idea of the dataset's coverage is to put the events on a map. To do that, let's group the data by the latitude and longitude for each event, and count the number of events at each point. Then we can put those points on a world map using the basemap package.



In [25]:

    
points = data.groupby(["Latitude", "Longitude"]).aggregate(len)["Event_ID"]
points = points.reset_index()

Nobody will be surprised that the distribution of events-per-point is very long-tailed, with many points having only a small number of events, and a small number of points having hundreds of thousands of events.



In [26]:

    
points.Event_ID.hist()
plt.yscale('log')

So the best way to deal with this is to plot point size based on the log of the number of events recorded there.

The following code draws a world map using Basemap's default, built-in map, and then iterates over all the points, putting a dot on the map for each one. Finally, it exports the resulting map to a PNG file



In [27]:

    
plt.figure(figsize=(16,16))

# Draw the world map itself
m = Basemap(projection='eck4',lon_0=0,resolution='c')
m.drawcoastlines()
m.fillcontinents()
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,120.,30.))
m.drawmeridians(np.arange(0.,360.,60.))
m.drawmapboundary()
m.drawcountries()
plt.title("ICEWS Total Events", fontsize=24)

# Plot the points
for row in points.iterrows():
    row = row[1]
    lat = row.Latitude
    lon = row.Longitude
    count = np.log10(row.Event_ID + 1) * 2
    x, y = m(lon, lat) # Convert lat-long to plot coordinates
    m.plot(x, y, 'ro', markersize=count, alpha=0.3)
plt.savefig("ICEWS.png", dpi=120, facecolor="#FFFFFF")

This looks... shockingly good to me. A few regions -- particularly the Indian subcontinent, East Asia and South America -- seem much better covered than in some other datasets. US Pacific Command was one of ICEWS's first customers, so it makes sense that its AOR would be well covered. Nigeria also seems to be relatively densly-covered, though whether this is because of particular attention or simply its population and regional significance isn't clear.

The ICEWS documentation says that purely domestic US events aren't included. This explains why the continental US appears sparser than some other datasets -- but there are obviously many points still left. Most of these events have at least one foreign actor, and apparently very few purely domestic events slip past the filters ICEWS have in place.

The Israeli-Palestinian Dyad

Back to everyone's favorite dyad, which has had more than its fair share of event data analysis pointed at it. Let's subset all events originating from Israel and targeting what ICEWS codes as the Occupied Palestinian Territory, and vice-versa.



In [28]:

    
dyad = ["Israel", "Occupied Palestinian Territory"]
ilpalcon = data[(data.Source_Country.isin(dyad)) & 
                (data.Target_Country.isin(dyad))]



In [29]:

    
ilpalcon.head()









    Out[29]:






  
    
      
      Event_ID
      Event_Date
      Source_Name
      Source_Sectors
      Source_Country
      Event_Text
      CAMEO_Code
      Intensity
      Target_Name
      Target_Sectors
      Target_Country
      Story_ID
      Sentence_Number
      Publisher
      City
      District
      Province
      Country
      Latitude
      Longitude
    
  
  
    
      50 
       926771
      1995-01-03
               Citizen (Palestinian Territory, Occupied)
           Social,General Population / Civilian / Social
       Occupied Palestinian Territory
                  Criticize or denounce
       111
      -2.0
             Yitzhak Rabin
       Government,State-Owned Defense / Security,Exec...
                               Israel
       28235898
       4
                     The Toronto Star
       Bethlehem
       NaN
                West Bank
       Occupied Palestinian Territory
       31.7049
       35.2038
    
    
      132
       926876
      1995-01-03
                                           Yitzhak Rabin
       Government,State-Owned Defense / Security,Exec...
                               Israel
                         Make statement
        10
       0.0
       Government (Israel)
                                              Government
                               Israel
       28242652
       3
       The Wall Street Journal Europe
       Jerusalem
       NaN
       Jerusalem District
                               Israel
       31.7690
       35.2163
    
    
      133
       926877
      1995-01-03
       Cabinet / Council of Ministers / Advisors (Isr...
                            Government,Cabinet,Executive
                               Israel
                      Praise or endorse
        51
       3.4
             Yitzhak Rabin
       Government,State-Owned Defense / Security,Exec...
                               Israel
       28242652
       4
       The Wall Street Journal Europe
       Jerusalem
       NaN
       Jerusalem District
                               Israel
       31.7690
       35.2163
    
    
      161
       926955
      1995-01-04
                                            Yasir Arafat
                   Executive Office,Government,Executive
       Occupied Palestinian Territory
       Engage in diplomatic cooperation
        50
       3.5
                    Israel
                                                     NaN
                               Israel
       28241261
       3
        The Christian Science Monitor
            Gaza
       NaN
               Gaza Strip
       Occupied Palestinian Territory
       31.5000
       34.4667
    
    
      162
       926956
      1995-01-04
                                            Yasir Arafat
                   Executive Office,Government,Executive
       Occupied Palestinian Territory
       Engage in diplomatic cooperation
        50
       3.5
                     Hamas
       Parties,(National) Major Party,Dissident,Nongo...
       Occupied Palestinian Territory
       28241261
       3
        The Christian Science Monitor
            Gaza
       NaN
               Gaza Strip
       Occupied Palestinian Territory
       31.5000
       34.4667

Unlike GDELT and Phoenix, ICEWS doesn't include a quad/penta-code categorizing events into broadly cooperative or conflict actions (though you can create them yourself using the ICEWS CAMEO code, e.g. as described in the Phoenix documentation). Instead, it provides an Intensity score -- positive intensity indicates positive events (providing assistance, etc.) while negative scores indicate conflict (criticism, physical attacks). Taking the average intensity for some period of time should provide a rough estimate of each side's posture towards the other.

Let's break down the subset further, one for Israeli-initiated actions and one for Palestinian-initiated ones. That will give us a rough estimate of reciprocity -- is one side behaving more peacefully towards the other, or are their actions relatively mirrored?

First, we select Israel-initiated events, and get the mean intensity by day.



In [30]:

    
il_initiated = ilpalcon[ilpalcon.Source_Country=="Israel"]
il_initiated = il_initiated.groupby("Event_Date")
il_initiated = il_initiated.aggregate(np.mean)["Intensity"]



In [31]:

    
il_initiated.plot()









    Out[31]:





<matplotlib.axes._subplots.AxesSubplot at 0x112ee3630>

It looks like daily events are too noisy to give us a good picture of what's going on. To let's use pandas's rolling mean tool to see the average intensity across a 30-day window:



In [32]:

    
pd.rolling_mean(il_initiated, 30).plot()









    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0x26d873908>

Notice the sharp drop that occurs in late 2000, marking the beginning of the Second Intifada.

Now let's get the same dataset for Palestinian-initiated actions. This time, I string together the pandas operations using the '\' operator, which allows multiple lines to be strung together for legibility as if they were a single line of code:



In [33]:

    
pal_initiated = ilpalcon[ilpalcon.Source_Country=="Occupied Palestinian Territory"] \
                        .groupby("Event_Date") \
                        .aggregate(np.mean) \
                        ["Intensity"]

Next, combining the two mean intensity series into a single dataframe:



In [34]:

    
df = pd.DataFrame({"IL_Initiated": pd.rolling_mean(il_initiated, 30),
                   "PAL_Initiated": pd.rolling_mean(pal_initiated, 30)})

And now we can plot the mean intensity of actions initiated by each side.



In [35]:

    
fig, ax = plt.subplots(figsize=(12,6))
df.plot(ax=ax)
ax.set_ylabel("Mean Intensity Coding")









    Out[35]:





<matplotlib.text.Text at 0x26db48550>

Not too surprisingly, they seem to overlap almost perfectly. There are a few points that stand out where the lines diverge significantly -- in a more in-depth analysis, they might warrant further examination to see whether they represent something interesting happening on the ground, or just a blip in the data collection.

We can correlate the series, and see that they do indeed track each other pretty closely (though not as perfectly as they may look on visual examination):



In [36]:

    
df.corr()









    Out[36]:






  
    
      
      IL_Initiated
      PAL_Initiated
    
  
  
    
      IL_Initiated
       1.000000
       0.802497
    
    
      PAL_Initiated
       0.802497
       1.000000

Ground Truth Dataset

One of ICEWS's biggest advantages is that it includes not only machine-coded event data, but hand-validated ground truth data on whether, on a monthly basis, each country is experiencing one of several types of conflict events.

Let's load it and take a look:



In [37]:

    
ground_truth = pd.read_csv(DATA + "gtds_2001.to.feb.2014.csv")



In [38]:

    
ground_truth.head()









    Out[38]:






  
    
      
      ccode
      country
      year
      month
      time
      ins
      reb
      dpc
      erv
      ic
      notes
      coder
      insnotes
      dpcnotes
      rebnotes
      ervnotes
      icnotes
    
  
  
    
      0
       20
       CANADA
       2001
       1
       2001m1
       0
       0
       0
       0
       0
       NaN
       Bentley & Leonard
       NaN
       NaN
       NaN
       NaN
       NaN
    
    
      1
       20
       CANADA
       2001
       2
       2001m2
       0
       0
       0
       0
       0
       NaN
       Bentley & Leonard
       NaN
       NaN
       NaN
       NaN
       NaN
    
    
      2
       20
       CANADA
       2001
       3
       2001m3
       0
       0
       0
       0
       0
       NaN
       Bentley & Leonard
       NaN
       NaN
       NaN
       NaN
       NaN
    
    
      3
       20
       CANADA
       2001
       4
       2001m4
       0
       0
       0
       0
       0
       NaN
       Bentley & Leonard
       NaN
       NaN
       NaN
       NaN
       NaN
    
    
      4
       20
       CANADA
       2001
       5
       2001m5
       0
       0
       0
       0
       0
       NaN
       Bentley & Leonard
       NaN
       NaN
       NaN
       NaN
       NaN

The columns ins to ic are 1 if the country experienced that event during that month, and 0 otherwise. They are:

ins: Insurgency
reb: Rebellion
dpc: Domestic political crisis
erv: Ethnic or religious violence
ic: International conflict

For more details, see the GTDS documentation.



In [39]:

    
# Convert the 'time' column to datetime:
ground_truth["time"] = pd.to_datetime(ground_truth.time, format="%Ym%m")

We can do some simple analysis on the ground truth dataset alone, for example see how many insurgencies are going on in the world on a month-by-month basis:



In [40]:

    
insurgency_count = ground_truth.groupby("time").aggregate(sum)["ins"]



In [41]:

    
insurgency_count.plot()
plt.ylabel("# of countries")
plt.title("Number of countries experiencing insurgencies")









    Out[41]:





<matplotlib.text.Text at 0x26d89afd0>

Combining the event data with ground truth

The real advantage that the ground truth data gives us is being able to combine it with the machine-coded event data for analysis and ultimately prediction.

In this example, I'm going to do a very simple analysis, and try and see whether countries experiencing one of the conflicts measured by the GTDS generate more events, and events of lower intensity.

First, we count how many 'bad things' are happening per country-month:



In [42]:

    
ground_truth["Conflict"] = 0
for col in ["ins", "reb", "dpc", "erv", "ic"]:
    ground_truth.Conflict += ground_truth[col]

All we care about for now is the country, the month, and the count of conflict types:



In [43]:

    
monthly_conflict = ground_truth[["time", "country", "Conflict"]]



In [44]:

    
monthly_conflict.head()

Now let's go back to the ICEWS event data, and aggregate it on a country-month basis too. For purposes of this analysis, I'll associate events with the country that ICEWS places them in, rather than the source or target country.

I'll collect two measures: how many events were generated per country-month, and what their average intensity was.

ICEWS events are on a daily basis, so we need to associate a year-month with each event. Unfortunately, pandas doesn't know how to deal with 'months' -- notice that we converted the ground truth event date into the first day of the relevant month. We'll do the same for the ICEWS events:



In [45]:

    
get_month = lambda x: pd.datetime(x.year, x.month, 1)
data["YearMonth"] = data.Event_Date.apply(get_month)

Now we'll group the data by country and month (really, first-day-of-the-month) and get the number and mean intensity of events for each.



In [46]:

    
monthly_grouped = data.groupby(["YearMonth", "Country"])
monthly_counts = monthly_grouped.aggregate(len)["Event_ID"]
monthly_intensity = monthly_grouped.aggregate(np.mean)["Intensity"]

And combine these series into a single DataFrame:



In [47]:

    
monthly_events = pd.DataFrame({"EventCounts": monthly_counts,
                               "MeanIntensity": monthly_intensity})
monthly_events.reset_index(inplace=True)



In [48]:

    
monthly_events.head()









    Out[48]:






  
    
      
      YearMonth
      Country
      EventCounts
      MeanIntensity
    
  
  
    
      0
      1995-01-01
       Afghanistan
        5
       1.640000
    
    
      1
      1995-01-01
           Albania
       10
       0.800000
    
    
      2
      1995-01-01
           Algeria
       25
      -3.064000
    
    
      3
      1995-01-01
            Angola
        3
      -6.333333
    
    
      4
      1995-01-01
         Argentina
       18
       1.022222

So this is fun: country names in the ICEWS event dataset are written with only the first letters capitalized, but the GTDS country names are in ALL CAPS. We need to convert one to the other in order to be able to match them -- and making country names all-caps is easier than dealing with title-casing multi-word all-cap country names.



In [49]:

    
capitalize = lambda x: x.upper()
monthly_events["Country"] = monthly_events.Country.apply(capitalize)

Now that we've done that, we can merge the dataframes on month and country name. The merge includes all the columns from both dataframes by default, so we need to only keep the ones we're interested in:



In [50]:

    
monthly_data = monthly_conflict.merge(monthly_events,
                       left_on=["time","country"], right_on=["YearMonth", "Country"])
monthly_data = monthly_data[["YearMonth", "Country", "Conflict", "EventCounts", "MeanIntensity"]]



In [51]:

    
monthly_data.head()









    Out[51]:






  
    
      
      YearMonth
      Country
      Conflict
      EventCounts
      MeanIntensity
    
  
  
    
      0
      2001-01-01
       CANADA
       0
       363
       0.307989
    
    
      1
      2001-02-01
       CANADA
       0
       438
       0.773973
    
    
      2
      2001-03-01
       CANADA
       0
       443
       0.583070
    
    
      3
      2001-04-01
       CANADA
       0
       716
       0.095670
    
    
      4
      2001-05-01
       CANADA
       0
       445
       0.352584

Now let's make some quick box plots and eyeball whether conflicts make a difference for data generation:



In [52]:

    
monthly_data.boxplot(column="EventCounts", by="Conflict")









    Out[52]:





<matplotlib.axes._subplots.AxesSubplot at 0x1cbf73198>

There are many more entries with low Conflict measures than high ones, and so not surprisingly there are many more outliers for those. Nevertheless, the median events generated *are* consistently higher for higher Conflict scores. And how about intensity?



In [53]:

    
monthly_data.boxplot(column="MeanIntensity", by="Conflict")









    Out[53]:





<matplotlib.axes._subplots.AxesSubplot at 0x1cc07da20>

We see a similar thing here -- no- or low-conflict country-months generate a wide variety of mean intensities, but the median mean intesity seems to become more negative with higher conflict scores.

However, what's the deal with the data points showing a very low mean intensity (which indicates conflict) when the ground-truth doesn't indicate that there were conflicts occuring? Let's check:



In [54]:

    
monthly_data[(monthly_data.Conflict==0) & (monthly_data.MeanIntensity<-9)]









    Out[54]:






  
    
      
      YearMonth
      Country
      Conflict
      EventCounts
      MeanIntensity
    
  
  
    
      1188 
      2010-12-01
                BARBADOS
       0
       1
      -10.0
    
    
      1491 
      2011-11-01
                  BELIZE
       0
       1
       -9.5
    
    
      1494 
      2012-02-01
                  BELIZE
       0
       1
      -10.0
    
    
      10373
      2007-05-01
              CAPE VERDE
       0
       1
       -9.5
    
    
      10395
      2010-09-01
              CAPE VERDE
       0
       1
      -10.0
    
    
      10614
      2007-11-01
           GUINEA-BISSAU
       0
       1
      -10.0
    
    
      10675
      2013-03-01
           GUINEA-BISSAU
       0
       1
      -10.0
    
    
      10791
      2010-10-01
       EQUATORIAL GUINEA
       0
       1
      -10.0
    
    
      10797
      2011-05-01
       EQUATORIAL GUINEA
       0
       1
      -10.0
    
    
      11495
      2004-06-01
              MAURITANIA
       0
       1
       -9.2
    
    
      13064
      2004-03-01
                   GABON
       0
       2
       -9.5
    
    
      16465
      2012-12-01
                BOTSWANA
       0
       1
       -9.5
    
    
      21876
      2010-03-01
                  BHUTAN
       0
       1
      -10.0
    
    
      21898
      2012-04-01
                  BHUTAN
       0
       1
       -9.5
    
    
      24581
      2012-05-01
         SOLOMON ISLANDS
       0
       2
       -9.1

Ah -- it looks like all of these were very low event counts. Remember that these are monthly, and one or two intensly-negative events generated in an entire month probably are not themselves strong indicators of conflict. At the very least, so few events probably also indicate that there isn't much collection of events happening for that country in general.

Summary

This was just a quick tour of things I tried playing around with ICEWS. There's a lot of public research that's already been done with ICEWS that it could be fun to attempt to replicate now that the data is finally public. It'll also be interesting to compare the data to other public event datasets, to figure out strengths and gaps, and improve both. The ground truth dataset alone could also be useful for building and testing models with completely different event data.

Comments? Suggestions? Questions? Find me on Twitter or let me know by email



In [55]:

    
# Putting the formatting out of the way
from IPython.core.display import HTML
styles = open("Style.css").read()
HTML(styles)









    Out[55]:

	Event ID	Event Date	Source Name	Source Sectors	Source Country	Event Text	CAMEO Code	Intensity	Target Name	Target Sectors	Target Country	Story ID	Sentence Number	Publisher	City	District	Province	Country	Latitude	Longitude
0	926685	1995-01-01	Extremist (Russia)	Radicals / Extremists / Fundamentalists,Dissident	Russian Federation	Praise or endorse	51	3.4	Boris Yeltsin	Elite,Executive,Executive Office,Government	Russian Federation	28235806	5	The Toronto Star	Moscow	NaN	Moskva	Russian Federation	55.7522	37.6156
1	926687	1995-01-01	Government (Bosnia and Herzegovina)	Government	Bosnia and Herzegovina	Express intent to cooperate	30	4.0	Citizen (Serbia)	General Population / Civilian / Social,Social	Serbia	28235807	1	The Toronto Star	NaN	NaN	Bosnia	Bosnia and Herzegovina	44.0000	18.0000
2	926686	1995-01-01	Citizen (Serbia)	General Population / Civilian / Social,Social	Serbia	Express intent to cooperate	30	4.0	Government (Bosnia and Herzegovina)	Government	Bosnia and Herzegovina	28235807	1	The Toronto Star	NaN	NaN	Bosnia	Bosnia and Herzegovina	44.0000	18.0000
3	926688	1995-01-01	Canada	NaN	Canada	Praise or endorse	51	3.4	City Mayor (Canada)	Government,Local,Municipal	Canada	28235809	3	The Toronto Star	NaN	NaN	Ontario	Canada	49.2501	-84.4998
4	926689	1995-01-01	Lawyer/Attorney (Canada)	Legal,Social	Canada	Arrest, detain, or charge with legal action	173	-5.0	Police (Canada)	Government,Police	Canada	28235964	1	The Toronto Star	Montreal	Montreal	Quebec	Canada	45.5088	-73.5878

	SourceFreq	TargetFreq	Total
United States	330446	341603	672049
Russia	195571	260635	456206
China	192944	254747	447691
Israel	116427	150810	267237
Japan	103651	145657	249308
India	96871	147406	244277
Iran	84099	150208	234307
Citizen (India)	65350	136966	202316
United Nations	92022	107413	199435
Unspecified Actor	0	198718	198718
European Union	90824	100310	191134
Iraq	48025	128246	176271
Vladimir Putin	102453	73104	175557
George W. Bush	100763	72909	173672
North Korea	65394	105958	171352
Turkey	64995	104946	169941
South Korea	69509	89696	159205
Pakistan	57450	98419	155869
Police (India)	117726	32771	150497
United Kingdom	65195	81465	146660
Palestinian Territory, Occupied	39861	101029	140890
France	61803	72704	134507
Australia	41393	71031	112424
Afghanistan	31678	75604	107282
North Atlantic Treaty Organization	51437	52715	104152
Syria	33110	64113	97223
Germany	42116	51114	93230
Egypt	34961	47969	82930
Barack Obama	46660	34782	81442
Indonesia	28997	43795	72792
Georgia	26227	45432	71659
Hu Jintao	39679	28938	68617
Citizen (Palestinian Territory, Occupied)	19921	48444	68365
Yasir Arafat	31196	35228	66424
Citizen (Russia)	23473	42401	65874
Thailand	25541	40200	65741
Mahmoud Abbas	34629	29995	64624
Citizen (Australia)	22541	41030	63571
Serbia	21006	41502	62508
Taliban	31678	30688	62366
Government (India)	12117	49629	61746
Israeli Defense Forces	42364	19130	61494
UN Security Council	27184	34210	61394
Ukraine	23290	37906	61196
Taiwan	19421	41052	60473
Kofi Annan	37150	22442	59592
Vietnam	24456	34951	59407
Tony Blair	33083	25713	58796
Lebanon	17041	40094	57135
Mexico	24690	32142	56832

	Event_ID	Event_Date	Source_Name	Source_Sectors	Source_Country	Event_Text	CAMEO_Code	Intensity	Target_Name	Target_Sectors	Target_Country	Story_ID	Sentence_Number	Publisher	City	District	Province	Country	Latitude	Longitude
676	927826	1995-01-11	Citizen (India)	General Population / Civilian / Social,Social	India	Reject proposal to meet, discuss, or negotiate	125	-5.0	Narasimha Rao	Executive Office,Executive,Government	India	28239021	2	The Associated Press Political Service	NaN	NaN	State of Tamil Nadu	India	11.0000	78.0000
783	927996	1995-01-12	Citizen (India)	General Population / Civilian / Social,Social	India	Express intent to meet or negotiate	36	4.0	United States	NaN	United States	28239081	1	The Associated Press Political Service	Swanton	Saline County	Nebraska	United States	40.3781	-97.0728
2151	2547954	1995-01-26	Citizen (India)	Social,General Population / Civilian / Social	India	Demonstrate or rally	141	-6.5	Unspecified Actor	Unspecified	NaN	28242253	2	Reuters News	Jammu	NaN	State of Jammu and Kashmir	India	32.7357	74.8691
5760	935070	1995-03-06	Citizen (India)	Social,General Population / Civilian / Social	India	Kill by physical assault	1823	-10.0	Congress Party	(National) Major Party,Government Major Party ...	India	28915932	4	The Associated Press Political Service	Hyderabad	Hyderabad	State of Andhra Pradesh	India	17.3841	78.4564
5766	935081	1995-03-06	Citizen (India)	Social,General Population / Civilian / Social	India	Use unconventional violence	180	-9.0	Militant (India)	Unidentified Forces	India	28915955	6	The Associated Press Political Service	New Delhi	NaN	National Capital Territory of Delhi	India	28.6358	77.2244

	SourceFreq	TargetFreq	Total
United States	997696	803460	1801156
India	773712	760583	1534295
Russian Federation	746829	706700	1453529
China	541432	525955	1067387
Japan	344413	332380	676793
Australia	340339	320329	660668
Israel	338118	315501	653619
United Kingdom	331735	302389	634124
Occupied Palestinian Territory	251678	317883	569561
Iran	286274	283276	569550

	SourceFreq	TargetFreq	Total
Government	176897	138684	315581
Parties	171411	135383	306794
Ideological	153750	121842	275592
(National) Major Party	134134	106262	240396
Executive	129382	103265	232647
Elite	92926	80163	173089
Legislative / Parliamentary	63654	45670	109324
Executive Office	54710	49770	104480
Cabinet	57273	42038	99311
Center Left	48678	37593	86271

	SourceFreq	TargetFreq	Total
International Exiles	2	1	3
Bedouin	2	0	2
Nepali-Pahari	1	1	2
Western	1	1	2
Navy Headquarters	1	1	2
Army Education / Training	0	1	1
Unspecified	0	1	1
Consumer Services MNCs	1	0	1
State-Owned Consumer Goods	1	0	1
Saharan	1	0	1

	Event_ID	Event_Date	Source_Name	Source_Sectors	Source_Country	Event_Text	CAMEO_Code	Intensity	Target_Name	Target_Sectors	Target_Country	Story_ID	Sentence_Number	Publisher	City	District	Province	Country	Latitude	Longitude
50	926771	1995-01-03	Citizen (Palestinian Territory, Occupied)	Social,General Population / Civilian / Social	Occupied Palestinian Territory	Criticize or denounce	111	-2.0	Yitzhak Rabin	Government,State-Owned Defense / Security,Exec...	Israel	28235898	4	The Toronto Star	Bethlehem	NaN	West Bank	Occupied Palestinian Territory	31.7049	35.2038
132	926876	1995-01-03	Yitzhak Rabin	Government,State-Owned Defense / Security,Exec...	Israel	Make statement	10	0.0	Government (Israel)	Government	Israel	28242652	3	The Wall Street Journal Europe	Jerusalem	NaN	Jerusalem District	Israel	31.7690	35.2163
133	926877	1995-01-03	Cabinet / Council of Ministers / Advisors (Isr...	Government,Cabinet,Executive	Israel	Praise or endorse	51	3.4	Yitzhak Rabin	Government,State-Owned Defense / Security,Exec...	Israel	28242652	4	The Wall Street Journal Europe	Jerusalem	NaN	Jerusalem District	Israel	31.7690	35.2163
161	926955	1995-01-04	Yasir Arafat	Executive Office,Government,Executive	Occupied Palestinian Territory	Engage in diplomatic cooperation	50	3.5	Israel	NaN	Israel	28241261	3	The Christian Science Monitor	Gaza	NaN	Gaza Strip	Occupied Palestinian Territory	31.5000	34.4667
162	926956	1995-01-04	Yasir Arafat	Executive Office,Government,Executive	Occupied Palestinian Territory	Engage in diplomatic cooperation	50	3.5	Hamas	Parties,(National) Major Party,Dissident,Nongo...	Occupied Palestinian Territory	28241261	3	The Christian Science Monitor	Gaza	NaN	Gaza Strip	Occupied Palestinian Territory	31.5000	34.4667

	ccode	country	year	month	time	notes	coder	insnotes	dpcnotes	rebnotes	ervnotes	icnotes
0	20	CANADA	2001	1	2001m1	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
1	20	CANADA	2001	2	2001m2	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
2	20	CANADA	2001	3	2001m3	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
3	20	CANADA	2001	4	2001m4	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
4	20	CANADA	2001	5	2001m5	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN

	time	country
0	2001-01-01	CANADA
1	2001-02-01	CANADA
2	2001-03-01	CANADA
3	2001-04-01	CANADA
4	2001-05-01	CANADA

	YearMonth	Country	EventCounts	MeanIntensity
0	1995-01-01	Afghanistan	5	1.640000
1	1995-01-01	Albania	10	0.800000
2	1995-01-01	Algeria	25	-3.064000
3	1995-01-01	Angola	3	-6.333333
4	1995-01-01	Argentina	18	1.022222

	YearMonth	Country	EventCounts	MeanIntensity
0	2001-01-01	CANADA	363	0.307989
1	2001-02-01	CANADA	438	0.773973
2	2001-03-01	CANADA	443	0.583070
3	2001-04-01	CANADA	716	0.095670
4	2001-05-01	CANADA	445	0.352584

	YearMonth	Country	EventCounts	MeanIntensity
1188	2010-12-01	BARBADOS	1	-10.0
1491	2011-11-01	BELIZE	1	-9.5
1494	2012-02-01	BELIZE	1	-10.0
10373	2007-05-01	CAPE VERDE	1	-9.5
10395	2010-09-01	CAPE VERDE	1	-10.0
10614	2007-11-01	GUINEA-BISSAU	1	-10.0
10675	2013-03-01	GUINEA-BISSAU	1	-10.0
10791	2010-10-01	EQUATORIAL GUINEA	1	-10.0
10797	2011-05-01	EQUATORIAL GUINEA	1	-10.0
11495	2004-06-01	MAURITANIA	1	-9.2
13064	2004-03-01	GABON	2	-9.5
16465	2012-12-01	BOTSWANA	1	-9.5
21876	2010-03-01	BHUTAN	1	-10.0
21898	2012-04-01	BHUTAN	1	-9.5
24581	2012-05-01	SOLOMON ISLANDS	2	-9.1