Website of data story: http://ada.gregunz.io/

To view the notebook, please use this nbviewer link

Summary

Most of the code is implemented in helper functions in separate files to keep this notebook as clean as possible.

Summary of what each file does:

Final_notebook.ipynb -- (jupyter notebook)

  • Shows our final progress

clean_data.py -- (python code functions)

  • Filters out data we don't use
  • Cleans the GDELT data

fetch_gdelt_data.py -- (python code functions)

  • Downloads/saves/loads GDELT data for given dates

fetch_location.py -- (python code functions)

  • Finds the country of each event ("Action") in a dataframe using its latitude and longitude

fetch_source_country.py -- (python code functions)

  • Finds the country of the sources (newspapers/websites)

high_level_fetch.py -- (python code function)

  • Provides functions to fetch/clean/load the data in chunks (for example, per month)

load_data.py -- (python code function)

  • Contains the function to download/clean/load the entire GDELT 2.0 dataset. If the final cleaned file is already present on disk, it skips the heavy work and simply loads that file. This should be the only function used to load the data.

Note

The complete GDELT 2.0 Events data represents about 200k files and 100GB uncompressed.

After cleaning and keeping only the information we need, we are left with a single file of 6.2GB uncompressed.


In [28]:
import pandas as pd
import datetime
import numpy as np

Data Pipeline

This pipeline summarizes all the data processing done before visualization. This includes data acquisition, data cleaning, and data augmentation.

  1. We download the data from the GDELT website. There is one file every 15 minutes (96 per day). The functions in fetch_gdelt_data.py make the download easy: by providing a date (or a range of dates) we can automatically download every file and save it.
  2. We load all the files into a single dataframe; this is done directly in fetch_gdelt_data.py after the download. (Note: if a file has already been downloaded, it is loaded automatically from storage.)
  3. Our project only needs a few columns, so we keep only those ('EventCode', 'SOURCEURL', 'ActionGeo_CountryCode', 'ActionGeo_Lat', 'ActionGeo_Long', 'IsRootEvent', 'QuadClass', 'GoldsteinScale', 'AvgTone', 'NumMentions', 'NumSources', 'NumArticles', 'ActionGeo_Type', 'Day'); please refer to the GDELT Codebook for the details about each field. This is done in the clean_data.py file.
  4. When some values are missing or invalid (e.g. no geographic position), we remove the row (done in the clean_data.py file as well).
  5. 'ActionGeo_CountryCode' does NOT use the ISO 3166-1 country codes. Hence, to be more consistent, we find the correct country using the latitude and longitude and construct a mapping from GDELT's country codes to the ISO ones. This is done in fetch_location.py.
  6. For each event we have the source URL of the article the event comes from, but unfortunately not the country that article comes from. For this reason we use this database and rely on country-code top-level domain (TLD) names to determine it, as sketched after this list. All of this is done in fetch_source_country.py.
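
As an illustration of step 6, here is a minimal sketch of how a source country could be derived from a URL's top-level domain. The tld_to_iso mapping and the source_country helper are hypothetical names for this sketch; the real fetch_source_country.py builds its mapping from the external database mentioned above.

    from urllib.parse import urlparse

    # Hypothetical excerpt of a ccTLD -> ISO 3166-1 alpha-3 mapping
    # (the real mapping is built from an external database).
    tld_to_iso = {'fr': 'FRA', 'de': 'DEU', 'ch': 'CHE', 'uk': 'GBR'}

    def source_country(url):
        """Guess the country of a news source from its country-code TLD."""
        domain = urlparse(url).netloc      # e.g. 'www.lemonde.fr'
        tld = domain.rsplit('.', 1)[-1]    # e.g. 'fr'
        return tld_to_iso.get(tld)         # None if the TLD is not in the mapping

    source_country('http://www.lemonde.fr/some-article')  # -> 'FRA'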

Important notice

Everything below, up to the section "Post milestone 2 work", is what was done before Milestone 3 and hence does not reflect 100% of what is present in the final work, but we keep it for completeness.

Data Fetching


In [2]:
# might be better to import the code and not a file (to show what we've done)

from fetch_gdelt_data import *
from clean_data import clean_df

Below, we specify the date interval of the data to load into a DataFrame, downloading it first if we do not already have it locally.


In [3]:
start = datetime.date(2015, 3, 1)
end = datetime.date(2015, 4, 1)

To download and load the data, a single function call is enough. We can specify whether we want the translingual version or the English-only one.


In [4]:
test_df = fetch_df(start, end, translingual=True, v='2')
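
Conceptually, fetch_df has to enumerate one file per 15-minute interval between the two dates before downloading and concatenating them. Below is a rough sketch of that enumeration; the URL pattern follows the public GDELT 2.0 file naming scheme, and gdeltv2_urls is a hypothetical helper, not the project's actual implementation.

    from datetime import datetime, timedelta

    def gdeltv2_urls(start, end, translingual=False):
        """Yield one GDELT 2.0 export URL per 15-minute interval in [start, end)."""
        base = 'http://data.gdeltproject.org/gdeltv2/'
        suffix = '.translation.export.CSV.zip' if translingual else '.export.CSV.zip'
        t = datetime.combine(start, datetime.min.time())
        stop = datetime.combine(end, datetime.min.time())
        while t < stop:
            yield base + t.strftime('%Y%m%d%H%M%S') + suffix
            t += timedelta(minutes=15)

For one month this yields 96 URLs per day, matching the figure given in the pipeline description.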

Data Cleaning and Selection

We only keep the information about the event type and location, the source URL and number of mentions, and the Goldstein scale and average tone of the event. We drop every event with missing entries and add a column with the ISO 3166-1 alpha-3 code of the country where the event happened.


In [5]:
selected_df = clean_df(test_df)
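
For reference, the column selection and row filtering of the cleaning step boil down to pandas operations of roughly this shape. This is only a sketch: the actual clean_df also validates fields and adds the ISO 3166-1 alpha-3 country column with the help of fetch_location.py.

    # Subset of the GDELT columns listed in the pipeline (step 3).
    KEPT_COLUMNS = ['EventCode', 'SOURCEURL', 'ActionGeo_CountryCode',
                    'ActionGeo_Lat', 'ActionGeo_Long', 'GoldsteinScale',
                    'AvgTone', 'NumMentions', 'Day']

    def clean_df_sketch(df):
        """Keep only the columns we use and drop rows with missing values."""
        df = df[KEPT_COLUMNS]
        return df.dropna()  # drop events with missing entries (e.g. no position)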

Data visualization

To show how the data can be visualized, we plan to use folium and plotly later on.


In [6]:
import json
import branca
import folium
from folium.plugins import HeatMap
from fetch_location import get_country_from_point, get_mapping

from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
init_notebook_mode()
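
As a first visual check, the cleaned event coordinates can be rendered as a heat map with folium. The snippet below is a minimal sketch using the latitude/longitude columns kept above; the sample size is an arbitrary choice to keep the map responsive.

    # Sample events so the heat map stays lightweight (sample size is arbitrary).
    points = selected_df[['ActionGeo_Lat', 'ActionGeo_Long']].sample(10000).values.tolist()

    world_map = folium.Map(location=[20, 0], zoom_start=2)
    HeatMap(points, radius=8).add_to(world_map)
    world_map  # rendered inline in the notebook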