In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)
From FAO:
FAO's three main goals are:
To support these goals, Article 1 of its constitution requires FAO to "collect, analyse, interpret and disseminate information related to nutrition, food and agriculture". Thus AQUASTAT started, with the aim to contribute to FAO's goals through the collection, analysis and dissemination of information related to water resources, water uses and agricultural water management, with an emphasis on countries in Africa, Asia, Latin America, and the Caribbean.
FAO offers data, metadata, reports, country profiles, river basin profiles, regional analyses, maps, tables, spatial data, guidelines, and other tools on:
Throughout the entire analysis you want to:
Write down hypotheses, things you need to find out to answer the question.
Make your data tidy
Transform data
Sometimes you will need to transform your data to be able to extract information from it. This step will usually occur after some of the other steps of EDA unless domain knowledge can inform these choices beforehand. Transforms include:
In [2]:
data = pd.read_csv('../../data/aquastat/aquastat.csv.gzip', compression='gzip')
In [3]:
data.head()
Out[3]:
In [4]:
data.info()
In [5]:
data[['variable','variable_full']].drop_duplicates()
Out[5]:
The data covers 199 unique countries:
In [6]:
data.country.nunique()
Out[6]:
In [7]:
countries = data.country.unique()
There are 12 time periods:
In [8]:
data.time_period.nunique()
Out[8]:
Each period is 5 years long, starting in 1958:
In [9]:
time_periods = data.time_period.unique()
print(time_periods)
In [10]:
mid_periods = range(1960,2017,5)
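As a quick sanity check, the midpoints can be paired with their 5-year windows. This is a minimal sketch that assumes each period runs from two years before its midpoint to two years after it, which is consistent with periods starting in 1958; the name `windows` is introduced here for illustration only.

```python
# Midpoints of the 5-year periods, as defined in the cell above
mid_periods = range(1960, 2017, 5)

# Assumption: each period spans (midpoint - 2) through (midpoint + 2)
windows = [(m - 2, m + 2) for m in mid_periods]

print(len(windows))             # number of periods
print(windows[0], windows[-1])  # first and last window
```

Under that assumption there are 12 windows, the first being (1958, 1962) and the last (2013, 2017).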
The dataset is unbalanced because data is not available for every country in every time period (more on missing data in the next notebook).
In [11]:
data[data.variable=='total_area'].value.isnull().sum()
Out[11]:
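To see where the gaps fall rather than just how many there are, the missing values can be counted per time period. The sketch below uses a small made-up frame (`toy`, with hypothetical countries `A` and `B`) that only mimics the column names used in this notebook; it is not the AQUASTAT data.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data with the same column names as the notebook
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'time_period': ['1958-1962', '1963-1967', '1958-1962', '1963-1967'],
    'variable': ['total_area'] * 4,
    'value': [10.0, 10.0, np.nan, 5.0],
})

# Count missing values of one variable within each time period
missing_by_period = (
    toy[toy.variable == 'total_area']
    .groupby('time_period')['value']
    .apply(lambda s: s.isnull().sum())
)
print(missing_by_period)
```

The same expression could presumably be applied to `data` to check whether missing values cluster in particular periods.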
We can look at this data set in a number of ways:
In [12]:
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period==time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
In [13]:
time_slice(data, time_periods[0]).head()
Out[13]:
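The pivot step is what turns the long format (one row per country/variable pair) into a wide table. A minimal sketch on made-up data, where the frame `toy` and its values are hypothetical and only the column names follow the notebook:

```python
import pandas as pd

# Hypothetical long-format data for a single time period
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'variable': ['total_area', 'total_pop', 'total_area', 'total_pop'],
    'value': [10.0, 100.0, 5.0, 50.0],
})

# One row per country, one column per variable
wide = toy.pivot(index='country', columns='variable', values='value')
print(wide)
```

Note that `pivot` raises a `ValueError` if any index/column pair occurs more than once, so it relies on the data having already been filtered down to a single time period.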
In [14]:
def country_slice(df, country):
    # Only take data for country of interest
    df = df[df.country==country]
    # Pivot table
    df = df.pivot(index='variable', columns='time_period', values='value')
    df.index.name = country
    return df
In [15]:
country_slice(data, countries[40]).head()
Out[15]:
In [16]:
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable==variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
In [17]:
variable_slice(data, 'total_pop').head()
Out[17]:
In [18]:
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country==country) & (df.variable==variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]
    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series
In [19]:
time_series(data, 'Belarus', 'total_pop')
Out[19]:
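One detail worth knowing: `dropna()` with no arguments drops a row if any of its columns is missing, not only `year_measured` or `value`. The sketch below shows a variant that selects the two relevant columns before dropping. The function name `time_series_subset`, the `note` column, and the values are all hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def time_series_subset(df, country, variable):
    """Variant of time_series that ignores NaNs in unrelated columns."""
    mask = (df.country == country) & (df.variable == variable)
    # Select the relevant columns first, then drop rows missing either one
    series = df.loc[mask, ['year_measured', 'value']].dropna().copy()
    series['year_measured'] = series['year_measured'].astype(int)
    return series.set_index('year_measured').rename(columns={'value': variable})

# Made-up data: 'note' is a hypothetical column that is always missing
toy = pd.DataFrame({
    'country': ['Belarus'] * 3,
    'variable': ['total_pop'] * 3,
    'year_measured': [1992.0, 1997.0, np.nan],
    'value': [10.2, 10.1, np.nan],
    'note': [np.nan, np.nan, np.nan],
})

ts = time_series_subset(toy, 'Belarus', 'total_pop')
print(ts)
```

On this toy frame, calling `dropna()` before selecting columns would discard every row because `note` is always missing, whereas the variant keeps the two rows that have a measured value.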
We may want to look at subsets of the data for certain assessments. Region is an intuitive way to subdivide the data.
In [20]:
data.region.unique()
Out[20]:
Reducing the number of regions makes patterns easier to assess.
Create a dictionary that maps each original region to a simpler one (Asia, North America, South America, Africa, Europe, Oceania):
In [21]:
simple_regions = {
    'World | Asia': 'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana': 'South America',
    'Americas | Southern America | Andean': 'South America',
    'Americas | Southern America | Brazil': 'South America',
    'Americas | Southern America | Southern America': 'South America',
    'World | Africa': 'Africa',
    'World | Europe': 'Europe',
    'World | Oceania': 'Oceania'
}
In [22]:
data.region = data.region.apply(lambda x: simple_regions[x])
In [23]:
print(data.region.unique())
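A lookup with `simple_regions[x]` raises a `KeyError` for any region not in the dictionary, which also happens if the cell is run a second time on already-simplified values. A possible alternative, sketched here on made-up data, is `Series.map`, which leaves unmapped entries as NaN so they can be inspected. The dictionary `simple` and the region `'World | Antarctica'` are hypothetical.

```python
import pandas as pd

# Trimmed-down, hypothetical version of the lookup dictionary
simple = {'World | Asia': 'Asia', 'World | Europe': 'Europe'}

# The last entry is deliberately absent from the dictionary
regions = pd.Series(['World | Asia', 'World | Europe', 'World | Antarctica'])

# map() returns NaN for keys missing from the dictionary instead of raising
mapped = regions.map(simple)

# List the original labels that had no mapping
unmapped = regions[mapped.isnull()].unique()
print(unmapped)
```

Whether a hard failure or a NaN is preferable depends on the workflow; the sketch only shows how to surface unmapped labels.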
Function for extracting a single region:
In [24]:
def subregion(data, region):
    return data[data.region==region]
Note: The functions defined in this notebook and the others are also available in scripts/aqua_helper.py, so later notebooks can reuse them without redefining them.