In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

The dataset

We will be using the Food and Agriculture Organization (FAO) of the United Nation's AQUASTAT dataset.

From FAO:

FAO's three main goals are:

  1. The eradication of hunger, food insecurity and malnutrition
  2. The elimination of poverty and the driving forward of economic and social progress for all
  3. The sustainable management and utilization of natural resources, including land, water, air, climate and genetic resources for the benefit of present and future generations.

To support these goals, Article 1 of its constitution requires FAO to "collect, analyse, interpret and disseminate information related to nutrition, food and agriculture". Thus AQUASTAT started, with the aim to contribute to FAO's goals through the collection, analysis and dissemination of information related to water resources, water uses and agricultural water management, with an emphasis on countries in Africa, Asia, Latin America, and the Caribbean.

FAO offers data, metadata, reports, country profiles, river basin profiles, regional analyses, maps, tables, spatial data, guidelines, and other tools on:

  • Water resources: internal, transboundary, total
  • Water uses: by sector, by source, wastewater
  • Irrigation: location, area, typology, technology, crops
  • Dams: location, height, capacity, surface area
  • Water-related institutions, policies and legistation

Question: Is water availability and water use related to GDP per capita?

Our plan

Exploratory data analysis consists of the following major tasks, which we present linearly here because each task doesn't make much sense to do without the ones prior to it. However, in reality, you are going to constantly jump around from step to step. You may want to do all the steps for a subset of the variables first. Or often, an observation will bring up a question you want to investigate and you'll branch off and explore to answer that question before returning down the main path of exhaustive EDA.

  1. Form hypotheses/develop investigation themes to explore
  2. Wrangle data
  3. Assess quality of data
  4. Profile data
  5. Explore each individual variable in the dataset
  6. Assess the relationship between each variable and the target
  7. Assess interactions between variables
  8. Explore data across many dimensions

Throughout the entire analysis you want to:

  • Capture a list of hypotheses and questions that come up for further exploration.
  • Record things to watch out for/ be aware of in future analyses.
  • Show intermediate results to colleagues to get a fresh perspective, feedback, domain knowledge. Don't do EDA in a bubble! Get feedback throughout especially from people removed from the problem and/or with relevant domain knowledge.
  • Position visuals and results together. EDA relies on your natural pattern recognition abilities so maximize what you'll find by putting visualizations and results in close proximity.

To do: Develop investigation themes for exploration

Write down hypotheses, things you need to find out to answer the question.

Wrangle

Things to consider doing

Make your data tidy

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Transform data
Sometimes you will need to transform your data to be able to extract information from it. This step will usually occur after some of the other steps of EDA unless domain knowledge can inform these choices beforehand. Transforms include:

  • Log: when data is highly skewed (versus normally distributed like a bell curve), sometimes it has a log-normal distribution and taking the log of each data point will normalize it.
  • Binning of continuous variables: Binning continuous variables and then analyzing the groups of observations created can allow for easier pattern identification. Especially with non-linear relationships.
  • Simplifying of categories: you really don't want more than 8-10 categories within a single data field. Try to aggregate to higher-level categories when it makes sense.

Load the data


In [2]:
data = pd.read_csv('../../data/aquastat/aquastat.csv.gzip', compression='gzip')

In [3]:
data.head()


Out[3]:
country region variable variable_full time_period year_measured value
0 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1958-1962 1962.0 65286.0
1 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1963-1967 1967.0 65286.0
2 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1968-1972 1972.0 65286.0
3 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1973-1977 1977.0 65286.0
4 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1978-1982 1982.0 65286.0

In [4]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143280 entries, 0 to 143279
Data columns (total 7 columns):
country          143280 non-null object
region           143280 non-null object
variable         143280 non-null object
variable_full    143280 non-null object
time_period      143280 non-null object
year_measured    96411 non-null float64
value            96411 non-null float64
dtypes: float64(2), object(5)
memory usage: 7.7+ MB

Research the variables


In [5]:
data[['variable','variable_full']].drop_duplicates()


Out[5]:
variable variable_full
0 total_area Total area of the country (1000 ha)
576 arable_land Arable land area (1000 ha)
1152 permanent_crop_area Permanent crops area (1000 ha)
1728 cultivated_area Cultivated area (arable land + permanent crops...
2304 percent_cultivated % of total country area cultivated (%)
2880 total_pop Total population (1000 inhab)
3456 rural_pop Rural population (1000 inhab)
4032 urban_pop Urban population (1000 inhab)
4608 gdp Gross Domestic Product (GDP) (current US$)
5184 gdp_per_capita GDP per capita (current US$/inhab)
5760 agg_to_gdp Agriculture, value added to GDP (%)
6336 human_dev_index Human Development Index (HDI) [highest = 1] (-)
6912 gender_inequal_index Gender Inequality Index (GII) [equality = 0; i...
7488 percent_undernourished Prevalence of undernourishment (3-year average...
8064 number_undernourished Number of people undernourished (3-year averag...
8640 avg_annual_rain_depth Long-term average annual precipitation in dept...
9216 avg_annual_rain_vol Long-term average annual precipitation in volu...
9792 national_rainfall_index National Rainfall Index (NRI) (mm/year)
10368 surface_water_produced Surface water produced internally (10^9 m3/year)
10944 groundwater_produced Groundwater produced internally (10^9 m3/year)
11520 surface_groundwater_overlap Overlap between surface water and groundwater ...
12096 irwr Total internal renewable water resources (IRWR...
12672 irwr_per_capita Total internal renewable water resources per c...
13248 surface_entering Surface water: entering the country (total) (1...
13824 surface_inflow_submit_no_treaty Surface water: inflow not submitted to treatie...
14400 surface_inflow_submit_treaty Surface water: inflow submitted to treaties (1...
14976 surface_inflow_secure_treaty Surface water: inflow secured through treaties...
15552 total_flow_border_rivers Surface water: total flow of border rivers (10...
16128 accounted_flow_border_rivers Surface water: accounted flow of border rivers...
16704 accounted_flow Surface water: accounted inflow (10^9 m3/year)
17280 surface_to_other_countries Surface water: leaving the country to other co...
17856 surface_outflow_submit_no_treaty Surface water: outflow to other countries not ...
18432 surface_outflow_submit_treaty Surface water: outflow to other countries subm...
19008 surface_outflow_secure_treaty Surface water: outflow to other countries secu...
19584 surface_total_external_renewable Surface water: total external renewable (10^9 ...
20160 groundwater_entering Groundwater: entering the country (total) (10^...
20736 groundwater_accounted_inflow Groundwater: accounted inflow (10^9 m3/year)
21312 groundwater_to_other_countries Groundwater: leaving the country to other coun...
21888 groundwater_accounted_outflow Groundwater: accounted outflow to other countr...
22464 water_total_external_renewable Water resources: total external renewable (10^...
23040 total_renewable_surface Total renewable surface water (10^9 m3/year)
23616 total_renewable_groundwater Total renewable groundwater (10^9 m3/year)
24192 overlap_surface_groundwater Overlap: between surface water and groundwater...
24768 total_renewable Total renewable water resources (10^9 m3/year)
25344 dependency_ratio Dependency ratio (%)
25920 total_renewable_per_capita Total renewable water resources per capita (m3...
26496 exploitable_regular_renewable_surface Exploitable: regular renewable surface water (...
27072 exploitable_irregular_renewable_surface Exploitable: irregular renewable surface water...
27648 exploitable_total_renewable_surface Exploitable: total renewable surface water (10...
28224 exploitable_regular_renewable_groundwater Exploitable: regular renewable groundwater (10...
28800 exploitable_total Total exploitable water resources (10^9 m3/year)
29376 interannual_variability Interannual variability (WRI) (-)
29952 seasonal_variability Seasonal variability (WRI) (-)
30528 total_dam_capacity Total dam capacity (km3)
31104 dam_capacity_per_capita Dam capacity per capita (m3/inhab)
31680 irrigation_potential Irrigation potential (1000 ha)
32256 flood_occurence Flood occurrence (WRI) (-)
32832 total_pop_access_drinking Total population with access to safe drinking-...
33408 rural_pop_access_drinking Rural population with access to safe drinking-...
33984 urban_pop_access_drinking Urban population with access to safe drinking-...

Describe the panel

199 unique countries involved


In [6]:
data.country.nunique()


Out[6]:
199

In [7]:
countries = data.country.unique()

For 12 time periods


In [8]:
data.time_period.nunique()


Out[8]:
12

Each 5 years in length since 1958


In [9]:
time_periods = data.time_period.unique()
print(time_periods)


['1958-1962' '1963-1967' '1968-1972' '1973-1977' '1978-1982' '1983-1987'
 '1988-1992' '1993-1997' '1998-2002' '2003-2007' '2008-2012' '2013-2017']

In [10]:
mid_periods = range(1960,2017,5)

Dataset is unbalanced because there is not data for every country at every time period (more on missing data in the next notebook).


In [11]:
data[data.variable=='total_area'].value.isnull().sum()


Out[11]:
220

Ways to look at this data

We can look at this data set in a number of ways:

  • Cross-section: all countries during a single time period
  • Time-series: a single country over time
  • Panel data: all countries over time (as data is given)
  • Geospatial: all countries in geographical relation to each other

Slicing

For a given time slice


In [12]:
def time_slice(df, time_period):

    # Only take data for time period of interest
    df = df[df.time_period==time_period] 

    # Pivot table 
    df = df.pivot(index='country', columns='variable', values='value')
    
    df.columns.name = time_period
    
    return df

In [13]:
time_slice(data, time_periods[0]).head()


Out[13]:
1958-1962 accounted_flow accounted_flow_border_rivers agg_to_gdp arable_land avg_annual_rain_depth avg_annual_rain_vol cultivated_area dam_capacity_per_capita dependency_ratio exploitable_irregular_renewable_surface ... total_flow_border_rivers total_pop total_pop_access_drinking total_renewable total_renewable_groundwater total_renewable_per_capita total_renewable_surface urban_pop urban_pop_access_drinking water_total_external_renewable
country
Afghanistan 19.00 9.0 NaN 7700.0 327.0 213.5000 7760.0 128.40 28.7200 NaN ... 33.4 9344.00 NaN 65.3300 10.650 6992.0 55.68 804.90 NaN 18.18
Albania 3.30 0.0 NaN 436.0 1485.0 42.6900 487.0 NaN 10.9300 NaN ... 0.0 1738.00 NaN 30.2000 6.200 17376.0 26.35 533.20 NaN 3.30
Algeria 0.39 0.0 NaN 6300.0 89.0 212.0000 6900.0 89.99 3.5990 5.0 ... 0.0 11690.00 NaN 11.6700 1.517 998.3 10.15 3934.00 NaN 0.42
Andorra NaN NaN NaN 1.0 NaN 0.4724 1.0 NaN NaN NaN ... NaN 15.38 NaN 0.3156 NaN 20520.0 NaN 9.76 NaN NaN
Angola 0.40 0.0 NaN 2700.0 1010.0 1259.0000 3200.0 25.96 0.2695 NaN ... 0.0 5466.00 NaN 148.4000 58.000 27150.0 145.40 577.00 NaN 0.40

5 rows × 60 columns

For a given country


In [14]:
def country_slice(df, country):
    
    # Only take data for country of interest
    df = df[df.country==country] 

    # Pivot table 
    df = df.pivot(index='variable', columns='time_period', values='value')
    
    df.index.name = country
    return df

In [15]:
country_slice(data, countries[40]).head()


Out[15]:
time_period 1958-1962 1963-1967 1968-1972 1973-1977 1978-1982 1983-1987 1988-1992 1993-1997 1998-2002 2003-2007 2008-2012 2013-2017
Thailand
accounted_flow 214.1 214.10 214.10 214.10 214.10 214.10 214.1 214.100 214.100 214.100 214.10 214.1
accounted_flow_border_rivers 214.1 214.10 214.10 214.10 214.10 214.10 214.1 214.100 214.100 214.100 214.10 214.1
agg_to_gdp 34.0 29.24 25.34 24.78 18.55 15.73 12.3 9.067 8.696 9.351 11.57 10.5
arable_land 10600.0 11600.00 13150.00 15773.00 17199.00 17930.00 17238.0 16242.000 15389.000 15200.000 16560.00 16810.0
avg_annual_rain_depth 1622.0 1622.00 1622.00 1622.00 1622.00 1622.00 1622.0 1622.000 1622.000 1622.000 1622.00 1622.0

By variable


In [16]:
def variable_slice(df, variable):
    
    # Only data for that variable
    df = df[df.variable==variable]
    
    # Get variable for each country over the time periods 
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

In [17]:
variable_slice(data, 'total_pop').head()


Out[17]:
time_period 1958-1962 1963-1967 1968-1972 1973-1977 1978-1982 1983-1987 1988-1992 1993-1997 1998-2002 2003-2007 2008-2012 2013-2017
country
Afghanistan 9344.00 10369.00 11717.00 13056.00 12667.00 11338.00 13746.0 18034.00 21487.00 25878.00 29727.00 32527.00
Albania 1738.00 1999.00 2254.00 2518.00 2788.00 3121.00 3241.0 3092.00 3123.00 3011.00 2881.00 2897.00
Algeria 11690.00 13354.00 15377.00 17690.00 20576.00 23918.00 27181.0 29888.00 31990.00 34262.00 37439.00 39667.00
Andorra 15.38 20.75 26.89 32.77 39.11 48.46 58.9 64.15 71.05 84.88 79.32 70.47
Angola 5466.00 5963.00 6588.00 7501.00 8808.00 10286.00 11849.0 13802.00 16110.00 19184.00 22686.00 25022.00

Time series for given country and variable


In [18]:
def time_series(df, country, variable):
    
    # Only take data for country/variable combo 
    series = df[(df.country==country) & (df.variable==variable)]
    
    # Drop years with no data 
    series = series.dropna()[['year_measured', 'value']]
    
    # Change years to int and set as index 
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series

In [19]:
time_series(data, 'Belarus', 'total_pop')


Out[19]:
total_pop
year_measured
1992 10235.0
1997 10091.0
2002 9826.0
2007 9556.0
2012 9491.0
2015 9496.0

By region

We may want to look at subsets of the data for certain assessments. Region is an intuitive way to subdivide the data.


In [20]:
data.region.unique()


Out[20]:
array(['World | Asia',
       'Americas | Central America and Caribbean | Central America',
       'Americas | Central America and Caribbean | Greater Antilles',
       'Americas | Central America and Caribbean | Lesser Antilles and Bahamas',
       'Americas | Northern America | Northern America',
       'Americas | Northern America | Mexico',
       'Americas | Southern America | Guyana',
       'Americas | Southern America | Andean',
       'Americas | Southern America | Brazil',
       'Americas | Southern America | Southern America', 'World | Africa',
       'World | Europe', 'World | Oceania'], dtype=object)

Reducing the number of regions will help for pattern assessment.

Create a dictionary to look up new, more simple region (Asia, North America, South America, Africa, Europe, Oceania)


In [21]:
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America', 
    'World | Africa':'Africa',
    'World | Europe':'Europe', 
    'World | Oceania':'Oceania'
}

In [22]:
data.region = data.region.apply(lambda x: simple_regions[x])

In [23]:
print(data.region.unique())


['Asia' 'North America' 'South America' 'Africa' 'Europe' 'Oceania']

Function for extracting a single region:


In [24]:
def subregion(data, region):
    return data[data.region==region]

Note: The functions created in this notebook and the others can also be found in scripts/aqua_helper.py so that they can be reused in following notebooks without redefinition.

Exercises

  • Create a dataframe containing each variable for every country for the time period of 1963-1967.
  • Create a dataframe containing the total renewable surface water for each country over each time period.
  • Create a dataframe containing the total population of each country in Asia over each time period.

To do: Update investigation themes