gdeltPyR
Usage

gdeltPyR retrieves Global Database of Events, Language, and Tone (GDELT) data (version 1.0 or version 2.0) via parallel HTTP GET requests and is an alternative to accessing GDELT data via Google BigQuery.
Performance will vary based on the number of available cores (i.e., CPUs), internet connection speed, and available RAM. For systems with limited RAM, later iterations of gdeltPyR
will include an option to store the output directly to disk.
Take your system's specifications into consideration when running large or complex queries. While gdeltPyR
loads each temporary file only long enough to convert it into a pandas
dataframe (each file covers 15 minutes for 2.0, or a full day for 1.0 events tables), GDELT data can be especially large and exhaust a computer's RAM. For example, Global Knowledge Graph (gkg) table queries can consume large amounts of RAM when pulling data for only a few days. Before trying month-long queries, try single-day queries, or create a pipeline that pulls several days' worth of data, writes to disk, flushes globals, and continues to pull more data, as sketched below.
It's best to use a system with at least 8 GB of RAM.
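As a rough illustration, here is a minimal sketch of such a pipeline, assuming the Search method demonstrated later in this section; the dates and file names are arbitrary examples, not part of gdeltPyR itself.

```python
##############################
# Day-by-day pipeline sketch:
# pull one day, write it to
# disk, free the dataframe,
# then continue
##############################
import gc

import gdelt

gd = gdelt.gdelt(version=2)

for day in ['2016 10 19', '2016 10 20', '2016 10 21']:
    df = gd.Search(day, table='events', coverage=True)
    # One CSV per day keeps only a single day's data in RAM at a time
    df.to_csv('events_' + day.replace(' ', '') + '.csv', index=False)
    del df        # drop the reference to the dataframe...
    gc.collect()  # ...and ask Python to reclaim the memory
```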
```bash
pip install gdeltPyR
```
You can also install directly from GitHub:
```bash
pip install git+https://github.com/linwoodc3/gdeltPyR
```
gdeltPyR
queries revolve around 4 concepts:
Name | Description | Input Possibilities/Examples
---|---|---
version | (integer) - Selects the version of GDELT data to query; defaults to version 2. | 1 or 2
date | (string or list of strings) - Date(s) to query. | "2016 10 23" or "2016 Oct 23"
coverage | (bool) - For GDELT 2.0, pulls every 15-minute interval for the dates passed in the date parameter. Defaults to False (or None), in which case gdeltPyR pulls the latest 15-minute interval for the current day or the last 15-minute interval for a historic day. | True or False or None
tables | (string) - The specific GDELT table to pull; defaults to the 'events' table. See the GDELT documentation page for more information. | 'events' or 'mentions' or 'gkg'
With these basic concepts, you can run any number of GDELT queries.
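For instance, the sketch below combines all four concepts in a single call; the 'mentions' table and the pair of dates are arbitrary illustrations.

```python
import gdelt

# Version 2, two dates passed as a list of strings,
# full 15-minute coverage, and the 'mentions' table
gd = gdelt.gdelt(version=2)
mentions = gd.Search(['2016 Oct 19', '2016 Oct 22'],
                     table='mentions',
                     coverage=True)
```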
In [18]:
```python
##############################
# Import the package
##############################
import gdelt
```
In [22]:
```python
##############################
# Instantiate the gdelt object
##############################
gd = gdelt.gdelt(version=2)
```
To launch your query, pass in your dates; when passing multiple dates, pass them as a list of strings. We will time this multi-day query, but first, some background on how each version of GDELT is retrieved.
For GDELT 2.0, every 15-minute interval is a zipped CSV file, and gdeltPyR
makes concurrent HTTP GET requests to each file. When the coverage
parameter is set to True, each full day of data has 96 15-minute interval files to pull. If you are pulling the current day with coverage set to True, gdeltPyR
pulls all the intervals leading up to the latest 15-minute interval. When coverage
is False, the package pulls the last 15-minute interval for a historical date and the latest 15-minute interval for the current date. Additionally, GDELT 2.0 data only goes back as far as February 2015. The additional features of GDELT 2.0 are discussed here.
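As an illustration of that default behavior, the sketch below leaves coverage at False, so only a single 15-minute interval is returned for the historical date (the date itself is an arbitrary example).

```python
import gdelt

gd = gdelt.gdelt(version=2)

# coverage=False (the default): only the last 15-minute interval
# of this historical day is pulled, not all 96 files
last_interval = gd.Search('2016 10 19', table='events', coverage=False)
print(last_interval.shape)
```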
GDELT 1.0 releases the previous day's data at 6 AM Eastern the next day (if today's date is 23 Oct, the 22 Oct results become available at 6 AM Eastern on 23 Oct).
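For comparison, a GDELT 1.0 query might look like the sketch below; since 1.0 events tables are daily files, 15-minute coverage does not apply (the date is again arbitrary).

```python
import gdelt

# GDELT 1.0: one zipped CSV per day for the events table
gd1 = gdelt.gdelt(version=1)
events_v1 = gd1.Search('2016 10 19', table='events')
```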
Now we can launch the query. First, some information on my OS.
In [27]:
```python
import platform
import multiprocessing

# OS details and the number of available cores
print(platform.platform())
print(multiprocessing.cpu_count())
```
And now the query.
In [28]:
```python
%time results = gd.Search(['2016 10 19', '2016 10 22'], table='events', coverage=True)
```
Let's get an idea of how many results were returned.
In [29]:
```python
results.info()
```
In about 36 seconds, gdeltPyR
returned a nearly 900,000 x 61 (rows x columns) pandas dataframe that consumes only 407.2 MB of memory. With the data in a tidy format, GDELT data can be analyzed with any number of pandas
data analysis pipelines and techniques, such as the sketch below.
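As one hypothetical example, assuming the standard GDELT 2.0 events column names (e.g., SQLDATE, Actor1CountryCode), a quick summary might look like this:

```python
# Count events per day and list the most frequent Actor1 country codes.
# Column names follow the GDELT 2.0 events schema.
daily_counts = results.groupby('SQLDATE').size()
top_countries = results['Actor1CountryCode'].value_counts().head(10)

print(daily_counts)
print(top_countries)
```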