The Integrated Crisis Early Warning System (ICEWS) is a machine-coded event dataset developed by Lockheed Martin and others for DARPA and the Office of Naval Research. For a long time, ICEWS was available only within the Department of Defense, and to a few select academics. Now, for the first time, a checkpointed version of ICEWS is being released to the general public (or, at least, the parts of the general public that care about political event data).
Unlike some event datasets, the public version of ICEWS will only be updated annually or so, but it still includes almost 20 years' worth of event data that has been used successfully in both government and academic research.
This document is mostly a cleaned-up version of my own initial exploration of the dataset. Hopefully it'll prove useful to others who want to use ICEWS in their own research.
UPDATE (03/29/15): Jennifer Lautenschlager, from the ICEWS team at Lockheed Martin, was kind enough to provide some clarifications, which I've added.
This is done in Python 3.4.2, with pandas version 0.15.2. The only requirement that might be tricky to install is Basemap, which is only used for the mapping section. You won't miss much without it.
In [1]:
import os
from collections import defaultdict
# Other libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Show plots inline
%matplotlib inline
The data is available via the Harvard Dataverse, at http://thedata.harvard.edu/dvn/dv/icews. The two datasets I use are the ICEWS Coded Event Data and the Ground Truth Data Set. The easiest way to download both is to go to the Data & Analysis tab, click Select all files at the top, and then Download Selected Files.
The ICEWS event data comes as one file per year, initially zipped. On OSX or Linux, you can unzip all the files in a directory at once from the terminal with
$ unzip "*.zip"
And you can delete all the zipped files with
$ rm *.zip
In this document, I assume that all the annual data files, as well as the one Ground Truth data file, are in the same directory.
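If you would rather not use the shell (for example on Windows), the same unzip-and-clean-up step can be sketched with Python's standard library. The function name `unzip_all` and its `delete_archives` flag are my own, not part of any ICEWS tooling.

```python
import glob
import os
import zipfile


def unzip_all(directory, delete_archives=False):
    """Extract every .zip archive found directly inside `directory`.

    Hypothetical helper: returns the names of the extracted members.
    """
    extracted = []
    for path in sorted(glob.glob(os.path.join(directory, "*.zip"))):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(directory)
            extracted.extend(archive.namelist())
        if delete_archives:
            os.remove(path)  # Same effect as `rm *.zip`, one file at a time
    return extracted
```

Calling `unzip_all(DATA, delete_archives=True)` would then leave only the extracted files in the data directory.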
In [2]:
# Path to directory where the data is stored
DATA = "/Users/dmasad/Data/ICEWS/"
For testing purposes, I start by loading a single year into a pandas DataFrame. The data files are tab-delimited, and have the column names as the first row.
In [3]:
one_year = pd.read_csv(DATA + "events.1995.20150313082510.tab", sep="\t")
In [4]:
one_year.head()
Out[4]:
In [5]:
one_year.dtypes
Out[5]:
Looks pretty good! Notice that the Event Date column is an object (meaning a string), so when we load in all of the data we should tell pandas to parse it automatically.
The ICEWS data isn't too big to hold in memory all at once, so I go ahead and load the entire thing. To do it, we'll iterate over all the data files, read each into a DataFrame, and then concatenate them together.
Note that in this code, I added the parse_dates=[1] argument to the .read_csv(...) method, telling pandas to parse the second column as a date.
This code assumes that the ICEWS data files are the only .tab files in your DATA directory. If that isn't the case, adjust as needed.
In [6]:
all_data = []
for f in os.listdir(DATA): # Iterate over all files
    if f[-3:] != "tab": # Skip non-tab files.
        continue
    df = pd.read_csv(DATA + f, sep='\t', parse_dates=[1])
    all_data.append(df)
data = pd.concat(all_data)
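A slightly more defensive variant of the loading loop is sketched below. It matches files with `glob` rather than slicing the filename, and passes `ignore_index=True` so the combined frame gets a fresh index instead of repeating each file's own row numbers. The function name `load_tab_files` is hypothetical.

```python
import glob
import os

import pandas as pd


def load_tab_files(directory, date_column=1):
    """Read every .tab file in `directory` into a single DataFrame.

    Assumes all files share the same columns and that the column at
    position `date_column` holds dates.
    """
    paths = sorted(glob.glob(os.path.join(directory, "*.tab")))
    frames = [pd.read_csv(p, sep="\t", parse_dates=[date_column])
              for p in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)
```

With this helper, `data = load_tab_files(DATA)` would stand in for the loop above. Note that `pd.concat` raises an error if no files match.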
Some of the ICEWS column names have spaces in them, which means they can't be referenced using pandas's period notation. To fix this, I rename the columns to replace the spaces with underscores:
In [7]:
cols = {col: col.replace(" ", "_") for col in data.columns}
data.rename(columns=cols, inplace=True)
In [8]:
data.dtypes
Out[8]:
In [9]:
print(data.Event_Date.min())
print(data.Event_Date.max())
In [10]:
len(data)
Out[10]:
Looks good! The data types are what we expect, and the dates seem to have been parsed correctly.
In [11]:
actors_source = data.Source_Name.value_counts()
actors_target = data.Target_Name.value_counts()
actor_counts = pd.DataFrame({"SourceFreq": actors_source,
"TargetFreq": actors_target})
actor_counts.fillna(0, inplace=True)
actor_counts["Total"] = actor_counts.SourceFreq + actor_counts.TargetFreq
Now let's look at the top 50 actors. For people like me who are more used to GDELT and Phoenix, the actor list might look a little different than what we expect:
In [12]:
actor_counts.sort("Total", ascending=False, inplace=True)
actor_counts.head(50)
Out[12]:
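A note for readers on newer pandas releases: `DataFrame.sort` was later removed in favor of `sort_values`. The toy example below (made-up actor names, not ICEWS data) sketches the same source/target tally with the newer call.

```python
import pandas as pd

# Toy stand-in for the Source_Name / Target_Name columns
toy = pd.DataFrame({
    "Source_Name": ["A", "A", "B", "C"],
    "Target_Name": ["B", "B", "A", "B"],
})

# The two value_counts series are aligned on actor name;
# actors missing from one side get NaN, which we fill with 0
counts = pd.DataFrame({
    "SourceFreq": toy.Source_Name.value_counts(),
    "TargetFreq": toy.Target_Name.value_counts(),
}).fillna(0)
counts["Total"] = counts.SourceFreq + counts.TargetFreq

# sort_values returns a new, sorted frame
counts = counts.sort_values("Total", ascending=False)
```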
What stood out to me was the mix of country-level actors with named individuals. Unlike event datasets that use CAMEO coding, leaders or sub-state organizations don't seem to be coded as add-ons to a state actor code (e.g. USAGOV) but as separate actors in their own right.
Update (03/29/2015): The _Sectors column contains the role information that would otherwise be contained in the chained CAMEO designations. For example, if you scroll back to the first row of 1995 data, the target name is Boris Yeltsin, and the target sectors associated with him are "Elite,Executive,Executive Office,Government".
The Citizen (Country) actor stood out to me in particular, especially since it isn't mentioned specifically in the included documentation -- so let's take a look at some of the rows that use it:
In [13]:
data[data.Source_Name=="Citizen (India)"].head()
Out[13]:
So it looks like Citizen really means civilians, or possibly civil society actors unaffiliated with any organization the ICEWS coding system recognizes.
Update (03/29/2015): I had trouble finding news events that corresponded to the events above, but Jennifer Lautenschlager pointed me to this news article that indicates that there was election violence in India in that time frame.
To get country-level actors comparable to other event datasets, it looks like we need to use the source and target country columns:
In [14]:
country_source = data.Source_Country.value_counts()
country_target = data.Target_Country.value_counts()
country_counts = pd.DataFrame({"SourceFreq": country_source,
"TargetFreq": country_target})
country_counts.fillna(0, inplace=True)
country_counts["Total"] = country_counts.SourceFreq + country_counts.TargetFreq
In [15]:
country_counts.sort("Total", ascending=False, inplace=True)
country_counts.head(10)
Out[15]:
This looks pretty good too! India seems more represented compared to what I've seen in other datasets, and of course Israel/Palestine maintain their usual place on the event data leaderboard.
Update (03/29/2015): Since the Sectors are also an important way of understanding the relevant data, let's get their frequencies too. Sectors are a bit trickier, since each cell can contain multiple sectors, separated by commas. So we need to loop over each cell, split the sectors mentioned, and count each one.
In [16]:
# Count source sectors
source_sectors = defaultdict(int)
source_sector_counts = data.Source_Sectors.value_counts()
for sectors, count in source_sector_counts.iteritems():
    # Each key is a comma-separated sector list;
    # count is the number of events with that exact list
    for sector in sectors.split(","):
        source_sectors[sector] += count
# Count Target sectors
target_sectors = defaultdict(int)
target_sector_counts = data.Target_Sectors.value_counts()
for sectors, count in target_sector_counts.iteritems():
    for sector in sectors.split(","):
        target_sectors[sector] += count
# Convert into series
source_sectors = pd.Series(source_sectors)
target_sectors = pd.Series(target_sectors)
# Combine into a dataframe, and fill missing with 0
sector_counts = pd.DataFrame({"SourceFreq": source_sectors,
"TargetFreq": target_sectors})
sector_counts.fillna(0, inplace=True)
sector_counts["Total"] = sector_counts.SourceFreq + sector_counts.TargetFreq
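To make the counting logic concrete, here is a minimal sketch on toy data (the sector strings are illustrative). It weights each distinct sector string by the number of rows it appears in, so every event contributes once per sector it lists.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for a *_Sectors column
toy_sectors = pd.Series([
    "Elite,Executive,Government",
    "Government",
    "Elite,Executive,Government",
    None,  # Events with no sector information
])

sector_totals = Counter()
# value_counts() drops missing values and returns, for each distinct
# string, the number of rows that contain exactly that string
for cell, n_rows in toy_sectors.value_counts().items():
    for sector in cell.split(","):
        sector_totals[sector] += n_rows
```

Here "Government" ends up with a total of 3 and "Elite" with 2, matching a by-hand count of the three non-missing rows.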
In [17]:
sector_counts.sort("Total", ascending=False, inplace=True)
In [18]:
sector_counts.head(10)
Out[18]:
In [19]:
sector_counts.tail(10)
Out[19]:
In addition to CAMEO-type actor designations (e.g. Government) it looks like some of the Sectors also resemble the Issues in Phoenix, or Themes in the GDELT GKG.
In [20]:
daily_events = data.groupby("Event_Date").aggregate(len)["Event_ID"]
In [21]:
daily_events.plot(color='k', lw=0.2, figsize=(12,6),
title="ICEWS Daily Event Count")
Out[21]:
There seems to be a definite ramp-up period from 1995 through 1999 or so, and some sort of fall in event volume around 2009. Notice that there are also a few individual days, especially around 2004, with very few events for some reason.
Update (03/29/2015): Jennifer Lautenschlager clarified that the jumps in the 1995-2001 period reflect publishers entering incrementally into the commercial data system that feeds into ICEWS. The post-2008 dip reflects a decline in number of stories overall, possibly driven by budget cuts due to the recession.
Since each event has an associated Story ID, we can count how many unique stories are processed by ICEWS every day and end up generating events.
In [22]:
daily_stories = data.groupby("Event_Date").aggregate(pd.Series.nunique)["Story_ID"]
In [23]:
daily_stories.plot(color='k', lw=0.2, figsize=(12,6),
title="ICEWS Daily Story Count")
Out[23]:
With these two series, we can measure the daily average events generated per story:
In [24]:
events_per_story = daily_events / daily_stories
events_per_story.plot(color='k', lw=0.2, figsize=(12,6),
title="ICEWS Daily Events Per Story")
Out[24]:
This confirms that indeed, except for a few anomalies, the number of events generated per story stays relatively consistent over time. Nevertheless, it's probably important to at least try to distinguish between fewer stories as caused by fewer newsworthy events, and fewer stories as caused by fewer journalists writing them.
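The events-per-story calculation can be checked by hand on a tiny example. The sketch below uses made-up IDs and the per-column `count()` and `nunique()` groupby methods, which should give the same numbers as the aggregate calls above as long as no Event_ID is missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "Event_Date": pd.to_datetime(["2000-01-01"] * 4 + ["2000-01-02"] * 2),
    "Event_ID": [1, 2, 3, 4, 5, 6],
    "Story_ID": [10, 10, 11, 11, 12, 13],
})

by_day = toy.groupby("Event_Date")
daily_events = by_day["Event_ID"].count()      # Events per day
daily_stories = by_day["Story_ID"].nunique()   # Distinct stories per day

# The two series share the same date index, so division aligns by day
events_per_story = daily_events / daily_stories
```

The first day has four events from two stories (2.0 events per story); the second has two events from two stories (1.0).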
In [25]:
points = data.groupby(["Latitude", "Longitude"]).aggregate(len)["Event_ID"]
points = points.reset_index()
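An equivalent way to build the per-location counts, sketched on toy coordinates, is `groupby(...).size()`. One thing worth knowing either way: by default, groupby drops rows whose grouping keys are missing, so events without coordinates do not appear in the result. The column name `Event_Count` is my own choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "Latitude": [31.8, 31.8, 28.6, None],
    "Longitude": [35.2, 35.2, 77.2, None],
    "Event_ID": [1, 2, 3, 4],
})

# size() counts rows per (Latitude, Longitude) pair;
# reset_index(name=...) turns the result into a regular DataFrame
points = (toy.groupby(["Latitude", "Longitude"])
             .size()
             .reset_index(name="Event_Count"))
```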
Nobody will be surprised that the distribution of events-per-point is very long-tailed, with many points having only a small number of events, and a small number of points having hundreds of thousands of events.
In [26]:
points.Event_ID.hist()
plt.yscale('log')
So the best way to deal with this is to plot point size based on the log of the number of events recorded there.
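To get a feel for what the log scaling does, here is a small sketch with made-up event counts. It uses the same transformation as the mapping code, log10(count + 1) times 2, where the factor of 2 is just a visual choice.

```python
import numpy as np

# Hypothetical events-per-point values spanning five orders of magnitude
event_counts = np.array([1, 9, 99, 999, 99999])

# Adding 1 keeps the log defined even if a count were zero
marker_sizes = np.log10(event_counts + 1) * 2
```

A point with 99,999 events gets a marker size of 10, only five times the size of a point with 9 events, which keeps the busiest locations from swamping the map.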
The following code draws a world map using Basemap's default, built-in map, and then iterates over all the points, putting a dot on the map for each one. Finally, it exports the resulting map to a PNG file.
In [27]:
plt.figure(figsize=(16,16))
# Draw the world map itself
m = Basemap(projection='eck4',lon_0=0,resolution='c')
m.drawcoastlines()
m.fillcontinents()
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,120.,30.))
m.drawmeridians(np.arange(0.,360.,60.))
m.drawmapboundary()
m.drawcountries()
plt.title("ICEWS Total Events", fontsize=24)
# Plot the points
for row in points.iterrows():
    row = row[1]
    lat = row.Latitude
    lon = row.Longitude
    count = np.log10(row.Event_ID + 1) * 2
    x, y = m(lon, lat) # Convert lat-long to plot coordinates
    m.plot(x, y, 'ro', markersize=count, alpha=0.3)
plt.savefig("ICEWS.png", dpi=120, facecolor="#FFFFFF")
This looks... shockingly good to me. A few regions -- particularly the Indian subcontinent, East Asia and South America -- seem much better covered than in some other datasets. US Pacific Command was one of ICEWS's first customers, so it makes sense that its AOR would be well covered. Nigeria also seems to be relatively densely covered, though whether this is because of particular attention or simply its population and regional significance isn't clear.
The ICEWS documentation says that purely domestic US events aren't included. This explains why the continental US appears sparser than in some other datasets -- but there are obviously many points still left. Most of these events have at least one foreign actor, and apparently very few purely domestic events slip past the filters ICEWS has in place.
In [28]:
dyad = ["Israel", "Occupied Palestinian Territory"]
ilpalcon = data[(data.Source_Country.isin(dyad)) &
(data.Target_Country.isin(dyad))]
In [29]:
ilpalcon.head()
Out[29]:
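One subtlety of the `isin` filter above: it also keeps events where the source and target country are the same (for example, Israel to Israel). Whether to keep those is an analytical choice. The sketch below, on toy rows, shows how one could restrict the subset to cross-border events only; it is a variation, not what the rest of this document does.

```python
import pandas as pd

dyad = ["Israel", "Occupied Palestinian Territory"]

toy = pd.DataFrame({
    "Source_Country": ["Israel", "Israel",
                       "Occupied Palestinian Territory", "Egypt"],
    "Target_Country": ["Occupied Palestinian Territory", "Israel",
                       "Israel", "Israel"],
})

# Both actors belong to the dyad (includes same-country events)
in_dyad = toy.Source_Country.isin(dyad) & toy.Target_Country.isin(dyad)

# Additionally require that source and target countries differ
cross_border = in_dyad & (toy.Source_Country != toy.Target_Country)
```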
Unlike GDELT and Phoenix, ICEWS doesn't include a quad/penta-code categorizing events into broadly cooperative or conflict actions (though you can create them yourself using the ICEWS CAMEO code, e.g. as described in the Phoenix documentation). Instead, it provides an Intensity score -- positive intensity indicates positive events (providing assistance, etc.) while negative scores indicate conflict (criticism, physical attacks). Taking the average intensity for some period of time should provide a rough estimate of each side's posture towards the other.
Let's break down the subset further, one for Israeli-initiated actions and one for Palestinian-initiated ones. That will give us a rough estimate of reciprocity -- is one side behaving more peacefully towards the other, or are their actions relatively mirrored?
First, we select Israel-initiated events, and get the mean intensity by day.
In [30]:
il_initiated = ilpalcon[ilpalcon.Source_Country=="Israel"]
il_initiated = il_initiated.groupby("Event_Date")
il_initiated = il_initiated.aggregate(np.mean)["Intensity"]
In [31]:
il_initiated.plot()
Out[31]:
It looks like daily events are too noisy to give us a good picture of what's going on. So let's use pandas's rolling mean tool to see the average intensity across a 30-day window:
In [32]:
pd.rolling_mean(il_initiated, 30).plot()
Out[32]:
Notice the sharp drop that occurs in late 2000, marking the beginning of the Second Intifada.
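For readers on newer pandas releases, where the top-level `pd.rolling_mean` function is no longer available, the same smoothing can be written with the `.rolling()` method. The sketch below uses a synthetic series (a step from +1 to -1) rather than the ICEWS data.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=60, freq="D")
# Synthetic daily mean intensity: +1 for 30 days, then -1 for 30 days
daily_mean = pd.Series(np.where(np.arange(60) < 30, 1.0, -1.0), index=dates)

# Window of 30 observations; by default the first 29 results are NaN
# because the window is not yet full
smoothed = daily_mean.rolling(30).mean()
```

Keep in mind that an integer window counts rows, not calendar days, so if some dates have no events the window spans more than 30 days.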
Now let's get the same dataset for Palestinian-initiated actions. This time, I chain the pandas operations together using the '\' line-continuation character, which lets a single statement be split across multiple lines for legibility:
In [33]:
pal_initiated = ilpalcon[ilpalcon.Source_Country=="Occupied Palestinian Territory"] \
.groupby("Event_Date") \
.aggregate(np.mean) \
["Intensity"]
Next, combining the two mean intensity series into a single dataframe:
In [34]:
df = pd.DataFrame({"IL_Initiated": pd.rolling_mean(il_initiated, 30),
"PAL_Initiated": pd.rolling_mean(pal_initiated, 30)})
And now we can plot the mean intensity of actions initiated by each side.
In [35]:
fig, ax = plt.subplots(figsize=(12,6))
df.plot(ax=ax)
ax.set_ylabel("Mean Intensity Coding")
Out[35]:
Not too surprisingly, they seem to overlap almost perfectly. There are a few points that stand out where the lines diverge significantly -- in a more in-depth analysis, they might warrant further examination to see whether they represent something interesting happening on the ground, or just a blip in the data collection.
We can correlate the series, and see that they do indeed track each other pretty closely (though not as perfectly as they may look on visual examination):
In [36]:
df.corr()
Out[36]:
In [37]:
ground_truth = pd.read_csv(DATA + "gtds_2001.to.feb.2014.csv")
In [38]:
ground_truth.head()
Out[38]:
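The columns ins to ic are 1 if the country experienced that event during that month, and 0 otherwise. They are:
ins: Insurgency
reb: Rebellion
dpc: Domestic political crisis
erv: Ethnic or religious violence
ic: International crisis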
For more details, see the GTDS documentation.
In [39]:
# Convert the 'time' column to datetime:
ground_truth["time"] = pd.to_datetime(ground_truth.time, format="%Ym%m")
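To see what the format string does, here is a minimal sketch on a few made-up values. I am assuming the time column holds strings of the form year, a literal "m", then the month; the zero-padded sample values below are my assumption about the layout.

```python
import pandas as pd

# Hypothetical values in the assumed "<year>m<month>" layout
raw = pd.Series(["2001m01", "2001m12", "2014m02"])

# %Y matches the four-digit year, the literal "m" matches itself,
# and %m matches the month; the day defaults to the 1st
parsed = pd.to_datetime(raw, format="%Ym%m")
```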
We can do some simple analysis on the ground truth dataset alone, for example see how many insurgencies are going on in the world on a month-by-month basis:
In [40]:
insurgency_count = ground_truth.groupby("time").aggregate(sum)["ins"]
In [41]:
insurgency_count.plot()
plt.ylabel("# of countries")
plt.title("Number of countries experiencing insurgencies")
Out[41]:
The real advantage that the ground truth data gives us is being able to combine it with the machine-coded event data for analysis and ultimately prediction.
In this example, I'm going to do a very simple analysis, and try and see whether countries experiencing one of the conflicts measured by the GTDS generate more events, and events of lower intensity.
First, we count how many 'bad things' are happening per country-month:
In [42]:
ground_truth["Conflict"] = 0
for col in ["ins", "reb", "dpc", "erv", "ic"]:
    ground_truth.Conflict += ground_truth[col]
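The same per-row total can also be computed in one step with a row-wise sum. A small sketch on toy indicator values:

```python
import pandas as pd

conflict_cols = ["ins", "reb", "dpc", "erv", "ic"]

# Three toy country-months with 0/1 indicators
toy = pd.DataFrame(
    [[1, 0, 0, 1, 0],
     [0, 0, 0, 0, 0],
     [1, 1, 1, 1, 1]],
    columns=conflict_cols,
)

# axis=1 sums across the selected columns within each row
toy["Conflict"] = toy[conflict_cols].sum(axis=1)
```

The three rows end up with Conflict values of 2, 0 and 5.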
All we care about for now is the country, the month, and the coun