Most of the code is implemented in side functions to keep this notebook as clean as possible.
Summary of what each file does:
The full GDELT Events (2.0) dataset comprises roughly 200k files, totaling 100 GB uncompressed.
After cleaning and keeping only the information we need, we are left with a single file of 6.2 GB uncompressed.
In :
import pandas as pd
import datetime
import numpy as np
This pipeline summarizes all the data processing done before visualization: data acquisition, data cleaning, and data augmentation.
Everything below, up to the section "Post milestone 2 work", was done before Milestone 3 and hence does not fully reflect the final work, but we keep it for completeness.
In :
# might be better to import the code and not a file (to show what we've done)
from fetch_gdelt_data import *
from clean_data import clean_df
Below, we specify the date interval of the data to load into a DataFrame, downloading it first if we do not already have it locally.
In :
start = datetime.date(2015, 3, 1)
end = datetime.date(2015, 4, 1)
Loading and downloading the data takes a single function call. We can specify whether we want the translingual version or the English-only one.
In :test_df = fetch_df(start, end, translingual=True, v='2')
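The actual download logic lives in `fetch_gdelt_data.py`, but conceptually the first step is enumerating the files covering the interval. A minimal sketch, assuming the standard GDELT 2.0 layout of one export file every 15 minutes under `http://data.gdeltproject.org/gdeltv2/` (the exact file-name pattern here is an assumption, not taken from our code):

```python
import datetime

GDELT_BASE = "http://data.gdeltproject.org/gdeltv2/"

def gdelt_urls(start, end, translingual=False):
    """Yield one export URL per 15-minute interval in [start, end).

    The file-name pattern is assumed from GDELT 2.0's public file
    list; `fetch_df` in our repo may build its URLs differently.
    """
    suffix = ".translation.export.CSV.zip" if translingual else ".export.CSV.zip"
    t = datetime.datetime.combine(start, datetime.time.min)
    stop = datetime.datetime.combine(end, datetime.time.min)
    while t < stop:
        yield GDELT_BASE + t.strftime("%Y%m%d%H%M%S") + suffix
        t += datetime.timedelta(minutes=15)

# A single day already maps to 24 * 4 = 96 files, which is why the
# full dataset reaches ~200k files.
urls = list(gdelt_urls(datetime.date(2015, 3, 1), datetime.date(2015, 3, 2)))
```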
We only keep the information about the event type and location, the source URL, the number of mentions, and the Goldstein scale and average tone of the event. We drop every event with missing entries and add a column containing the ISO 3166-1 alpha-3 code of the country where the event happened.
In :selected_df = clean_df(test_df)
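The real `clean_df` lives in `clean_data.py`; the sketch below only illustrates the steps described above (column subset, dropping incomplete rows, attaching an ISO code). The column names follow the GDELT 2.0 codebook, but `clean_df_sketch` and the injected `country_of_point` lookup are hypothetical stand-ins, not our actual implementation:

```python
import pandas as pd

# Columns we care about, named as in the GDELT 2.0 event codebook.
KEEP = ["EventCode", "ActionGeo_Lat", "ActionGeo_Long",
        "SOURCEURL", "NumMentions", "GoldsteinScale", "AvgTone"]

def clean_df_sketch(df, country_of_point):
    """Keep only the needed columns, drop events with missing
    entries, and add an ISO 3166-1 alpha-3 country code per event."""
    out = df[KEEP].dropna().copy()
    out["ISO"] = [country_of_point(lat, lon)
                  for lat, lon in zip(out["ActionGeo_Lat"], out["ActionGeo_Long"])]
    return out

# Toy example: the second row has no latitude and is dropped.
raw = pd.DataFrame({
    "EventCode": ["010", "020"],
    "ActionGeo_Lat": [46.52, None],
    "ActionGeo_Long": [6.57, 2.35],
    "SOURCEURL": ["http://a", "http://b"],
    "NumMentions": [3, 1],
    "GoldsteinScale": [1.0, -2.0],
    "AvgTone": [0.5, -1.2],
})
cleaned = clean_df_sketch(raw, lambda lat, lon: "CHE")
```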
To visualize the data, we plan to use folium and plotly later on.
In :
import json
import branca
import folium
from folium.plugins import HeatMap
from fetch_location import get_country_from_point, get_mapping
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
init_notebook_mode()
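As a taste of the visualization step, folium's `HeatMap` takes a list of `[lat, lon, weight]` triples. A minimal sketch of preparing that input from a cleaned DataFrame, weighting each point by its total number of mentions (the tiny `events` frame here is made-up sample data, not real GDELT output):

```python
import pandas as pd

# Made-up sample of cleaned events: two at the same location.
events = pd.DataFrame({
    "ActionGeo_Lat": [46.52, 46.52, 48.86],
    "ActionGeo_Long": [6.57, 6.57, 2.35],
    "NumMentions": [3, 2, 5],
})

# Merge duplicate coordinates so each point carries the summed
# mention count, then convert to the [lat, lon, weight] list
# format that folium's HeatMap plugin expects.
heat = (events.groupby(["ActionGeo_Lat", "ActionGeo_Long"], as_index=False)
              ["NumMentions"].sum())
heat_data = heat[["ActionGeo_Lat", "ActionGeo_Long", "NumMentions"]].values.tolist()

# Plugging it into folium would then look like (not executed here):
# m = folium.Map(location=[46.8, 8.2], zoom_start=5)
# HeatMap(heat_data).add_to(m)
```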