In [18]:
# must go first
%matplotlib inline
%config InlineBackend.figure_format='retina'
# plotting
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_context("poster", font_scale=1.3)
import folium
# system packages
import os, sys
import warnings
warnings.filterwarnings('ignore')
# basic wrangling
import numpy as np
import pandas as pd
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# File with functions from prior notebook(s)
sys.path.append('../../scripts/')
from aqua_helper import time_slice, country_slice, time_series, simple_regions, subregion, variable_slice
# Update matplotlib defaults to something nicer
mpl_update = {'font.size': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'figure.figsize': [12.0, 8.0],
              # 'axes.color_cycle' is deprecated (and gone from newer matplotlib); use 'axes.prop_cycle'
              'axes.prop_cycle': mpl.cycler(color=['#0055A7', '#2C3E4F', '#26C5ED', '#00cc66', '#D34100', '#FF9700', '#091D32']),
              'axes.labelsize': 20,
              'axes.labelcolor': '#677385',
              'axes.titlesize': 20,
              'lines.color': '#0055A7',
              'lines.linewidth': 3,
              'text.color': '#677385'}
mpl.rcParams.update(mpl_update)
Throughout the entire analysis you want to:
Write down the questions that the results raise as you go, and keep updating your list of hypotheses.
In [19]:
data = pd.read_csv('../../data/aquastat/aquastat.csv.gzip', compression='gzip')
In [20]:
data[['variable','variable_full']].drop_duplicates()
Out[20]:
       variable                                   variable_full
0      total_area                                 Total area of the country (1000 ha)
576    arable_land                                Arable land area (1000 ha)
1152   permanent_crop_area                        Permanent crops area (1000 ha)
1728   cultivated_area                            Cultivated area (arable land + permanent crops...
2304   percent_cultivated                         % of total country area cultivated (%)
2880   total_pop                                  Total population (1000 inhab)
3456   rural_pop                                  Rural population (1000 inhab)
4032   urban_pop                                  Urban population (1000 inhab)
4608   gdp                                        Gross Domestic Product (GDP) (current US$)
5184   gdp_per_capita                             GDP per capita (current US$/inhab)
5760   agg_to_gdp                                 Agriculture, value added to GDP (%)
6336   human_dev_index                            Human Development Index (HDI) [highest = 1] (-)
6912   gender_inequal_index                       Gender Inequality Index (GII) [equality = 0; i...
7488   percent_undernourished                     Prevalence of undernourishment (3-year average...
8064   number_undernourished                      Number of people undernourished (3-year averag...
8640   avg_annual_rain_depth                      Long-term average annual precipitation in dept...
9216   avg_annual_rain_vol                        Long-term average annual precipitation in volu...
9792   national_rainfall_index                    National Rainfall Index (NRI) (mm/year)
10368  surface_water_produced                     Surface water produced internally (10^9 m3/year)
10944  groundwater_produced                       Groundwater produced internally (10^9 m3/year)
11520  surface_groundwater_overlap                Overlap between surface water and groundwater ...
12096  irwr                                       Total internal renewable water resources (IRWR...
12672  irwr_per_capita                            Total internal renewable water resources per c...
13248  surface_entering                           Surface water: entering the country (total) (1...
13824  surface_inflow_submit_no_treaty            Surface water: inflow not submitted to treatie...
14400  surface_inflow_submit_treaty               Surface water: inflow submitted to treaties (1...
14976  surface_inflow_secure_treaty               Surface water: inflow secured through treaties...
15552  total_flow_border_rivers                   Surface water: total flow of border rivers (10...
16128  accounted_flow_border_rivers               Surface water: accounted flow of border rivers...
16704  accounted_flow                             Surface water: accounted inflow (10^9 m3/year)
17280  surface_to_other_countries                 Surface water: leaving the country to other co...
17856  surface_outflow_submit_no_treaty           Surface water: outflow to other countries not ...
18432  surface_outflow_submit_treaty              Surface water: outflow to other countries subm...
19008  surface_outflow_secure_treaty              Surface water: outflow to other countries secu...
19584  surface_total_external_renewable           Surface water: total external renewable (10^9 ...
20160  groundwater_entering                       Groundwater: entering the country (total) (10^...
20736  groundwater_accounted_inflow               Groundwater: accounted inflow (10^9 m3/year)
21312  groundwater_to_other_countries             Groundwater: leaving the country to other coun...
21888  groundwater_accounted_outflow              Groundwater: accounted outflow to other countr...
22464  water_total_external_renewable             Water resources: total external renewable (10^...
23040  total_renewable_surface                    Total renewable surface water (10^9 m3/year)
23616  total_renewable_groundwater                Total renewable groundwater (10^9 m3/year)
24192  overlap_surface_groundwater                Overlap: between surface water and groundwater...
24768  total_renewable                            Total renewable water resources (10^9 m3/year)
25344  dependency_ratio                           Dependency ratio (%)
25920  total_renewable_per_capita                 Total renewable water resources per capita (m3...
26496  exploitable_regular_renewable_surface      Exploitable: regular renewable surface water (...
27072  exploitable_irregular_renewable_surface    Exploitable: irregular renewable surface water...
27648  exploitable_total_renewable_surface        Exploitable: total renewable surface water (10...
28224  exploitable_regular_renewable_groundwater  Exploitable: regular renewable groundwater (10...
28800  exploitable_total                          Total exploitable water resources (10^9 m3/year)
29376  interannual_variability                    Interannual variability (WRI) (-)
29952  seasonal_variability                       Seasonal variability (WRI) (-)
30528  total_dam_capacity                         Total dam capacity (km3)
31104  dam_capacity_per_capita                    Dam capacity per capita (m3/inhab)
31680  irrigation_potential                       Irrigation potential (1000 ha)
32256  flood_occurence                            Flood occurrence (WRI) (-)
32832  total_pop_access_drinking                  Total population with access to safe drinking-...
33408  rural_pop_access_drinking                  Rural population with access to safe drinking-...
33984  urban_pop_access_drinking                  Urban population with access to safe drinking-...
Simplify regions
In [21]:
data.region = data.region.apply(lambda x: simple_regions[x])
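The region simplification above is a dictionary lookup applied row by row. A minimal sketch of the same idea with `Series.map`, assuming a small hypothetical mapping (the real `simple_regions` dictionary lives in `aqua_helper` and is not shown in this notebook):

```python
import pandas as pd

# Hypothetical mapping for illustration only; the real simple_regions
# dictionary is defined in aqua_helper and may look different.
toy_regions = {
    "World | Americas": "North America",
    "World | Asia": "Asia",
}

regions = pd.Series(["World | Americas", "World | Asia", "World | Oceania"])

# Series.map with a dict leaves unmapped labels as NaN instead of raising
# KeyError, which makes gaps in the mapping easy to find.
mapped = regions.map(toy_regions)
unmapped = regions[mapped.isnull()].unique().tolist()

print(mapped.tolist())
print(unmapped)
```

With the lambda form used in the cell above, a label missing from the dictionary raises `KeyError`; which behaviour is preferable depends on whether a silent NaN or a loud failure is easier to catch in your workflow.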
Before trying to understand what information is in the data, make sure you understand what the data represents and what's missing.
missingno is a package that provides a number of functions for visualizing what data is missing within a dataset.
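The numbers behind a completeness chart can also be computed directly with pandas. The sketch below uses a made-up wide table (one row per country, one column per variable); the values are illustrative and not taken from the AQUASTAT data:

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per country, one column per variable.
toy = pd.DataFrame(
    {
        "total_pop": [10.0, 20.0, 30.0, 40.0],
        "gdp": [1.0, np.nan, 3.0, 4.0],
        "exploitable_total": [np.nan, np.nan, np.nan, 5.0],
    },
    index=["A", "B", "C", "D"],
)

# Fraction of non-null values per column (what a completeness bar chart plots).
completeness = toy.notnull().mean()

# Number of missing values per column, from most to least missing.
n_missing = toy.isnull().sum().sort_values(ascending=False)

print(completeness)
print(n_missing)
```

A table like this is a useful companion to the plots when you need exact counts for a report or want to filter columns programmatically.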
In [22]:
recent = time_slice(data, '2013-2017')
In [23]:
msno.bar(recent, labels=True)
Discussion: What questions does this figure bring up?
Add these to your list of questions!
In [24]:
msno.matrix(recent, labels=True)
Discussion: What additional information does this provide or what additional questions does it suggest?
"Exploitable" variables are missing for most countries.
Question to consider: Does this happen in each time period?
In [25]:
msno.matrix(variable_slice(data, 'exploitable_total'), inline=False, sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing total exploitable water resources data across countries and time periods \n \n \n \n');
Total exploitable water resources is only reported for a fraction of the countries, and only a very small fraction of those countries have data for the most recent time period. Either a) the data has not been reported yet and will be at some point, b) most countries have stopped reporting on this factor, or c) we do not have the domain knowledge to understand what's happening.
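To put a number on "a fraction of the countries", the share of countries reporting a variable can be computed per time period. This is a sketch on made-up long-format data that reuses the column names seen in this notebook (`country`, `time_period`, `variable`, `value`):

```python
import numpy as np
import pandas as pd

# Toy long-format data; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "A", "B", "C"],
        "time_period": ["2008-2012"] * 3 + ["2013-2017"] * 3,
        "variable": ["exploitable_total"] * 6,
        "value": [1.0, 2.0, np.nan, np.nan, 2.5, np.nan],
    }
)

n_countries = toy["country"].nunique()

# count() ignores NaN, so this is the number of countries reporting per period.
reporting = toy.groupby("time_period")["value"].count()
share_reporting = reporting / n_countries

print(share_reporting)
```

A falling share over time would support explanation b); a share that is low only in the latest period is more consistent with a).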
We are going to remove the exploitable variables from future analysis because so few data points can cause a lot of problems.
In [26]:
data = data.loc[~data.variable.str.contains('exploitable'),:]
In [27]:
msno.matrix(variable_slice(data, 'national_rainfall_index'),
            inline=False, sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing national rainfall index data across countries and time periods \n \n \n \n');
The national rainfall index is no longer reported after 2002.
In [28]:
data = data.loc[~(data.variable=='national_rainfall_index')]
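The two filters above drop rows either by substring match or by exact match on the `variable` column. A small sketch of both on toy data, with one assumption added for safety: `na=False`, so that a missing label counts as "no match" instead of producing a NaN in the mask:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "variable": [
            "exploitable_total",
            "exploitable_regular_renewable_surface",
            "national_rainfall_index",
            "total_pop",
            None,  # a missing label, to show how na= is handled
        ],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# na=False treats missing labels as "no match", so the mask stays boolean
# and can be negated safely.
is_exploitable = toy["variable"].str.contains("exploitable", na=False)
is_rainfall = toy["variable"] == "national_rainfall_index"

# Keep everything that matches neither rule.
kept = toy.loc[~(is_exploitable | is_rainfall)]

print(kept)
```

Building the masks first and combining them in one step also makes it easy to check how many rows each rule removes before committing to the filter.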
Let's look at North America only.
In [29]:
north_america = subregion(data, 'North America')
In [30]:
msno.bar(msno.nullity_sort(time_slice(north_america, '2013-2017'), sort='descending').T, inline=False)
plt.title('Fraction of fields complete by country for North America \n \n');
Question: Is there any pattern in the countries with most missing data?
Question: What are potential reasons for missing data? What can we check?
In [31]:
folium.Map(location=[18.1160128,-77.8364762], tiles="CartoDB positron",
           zoom_start=5, width=1200, height=600)
Out[31]:
Spot check what data is missing for the Bahamas to get a more granular understanding.
In [32]:
msno.nullity_filter(country_slice(data, 'Bahamas').T, filter='bottom', p=0.1)
Out[32]:
Bahamas
time_period  dam_capacity_per_capita  flood_occurence  gender_inequal_index  groundwater_produced  interannual_variability
1958-1962    NaN                      NaN              NaN                   NaN                   NaN
1963-1967    NaN                      NaN              NaN                   NaN                   NaN
1968-1972    NaN                      NaN              NaN                   NaN                   NaN
1973-1977    NaN                      NaN              NaN                   NaN                   NaN
1978-1982    NaN                      NaN              NaN                   NaN                   NaN
1983-1987    NaN                      NaN              NaN                   NaN                   NaN
1988-1992    NaN                      NaN              NaN                   NaN                   NaN
1993-1997    NaN                      NaN              NaN                   NaN                   NaN
1998-2002    NaN                      NaN              NaN                   NaN                   NaN
2003-2007    NaN                      NaN              NaN                   NaN                   NaN
2008-2012    NaN                      NaN              NaN                   NaN                   NaN
2013-2017    NaN                      NaN              0.2979                NaN                   NaN

time_period  irrigation_potential  number_undernourished  overlap_surface_groundwater  percent_undernourished  seasonal_variability
1958-1962    NaN                   NaN                    NaN                          NaN                     NaN
1963-1967    NaN                   NaN                    NaN                          NaN                     NaN
1968-1972    NaN                   NaN                    NaN                          NaN                     NaN
1973-1977    NaN                   NaN                    NaN                          NaN                     NaN
1978-1982    NaN                   NaN                    NaN                          NaN                     NaN
1983-1987    NaN                   NaN                    NaN                          NaN                     NaN
1988-1992    NaN                   NaN                    NaN                          NaN                     NaN
1993-1997    NaN                   NaN                    NaN                          NaN                     NaN
1998-2002    NaN                   NaN                    NaN                          NaN                     NaN
2003-2007    NaN                   NaN                    NaN                          NaN                     NaN
2008-2012    NaN                   NaN                    NaN                          NaN                     NaN
2013-2017    NaN                   NaN                    NaN                          NaN                     NaN

time_period  surface_groundwater_overlap  surface_water_produced  total_dam_capacity  total_renewable_groundwater  total_renewable_surface
1958-1962    NaN                          NaN                     NaN                 NaN                          NaN
1963-1967    NaN                          NaN                     NaN                 NaN                          NaN
1968-1972    NaN                          NaN                     NaN                 NaN                          NaN
1973-1977    NaN                          NaN                     NaN                 NaN                          NaN
1978-1982    NaN                          NaN                     NaN                 NaN                          NaN
1983-1987    NaN                          NaN                     NaN                 NaN                          NaN
1988-1992    NaN                          NaN                     NaN                 NaN                          NaN
1993-1997    NaN                          NaN                     NaN                 NaN                          NaN
1998-2002    NaN                          NaN                     NaN                 NaN                          NaN
2003-2007    NaN                          NaN                     NaN                 NaN                          NaN
2008-2012    NaN                          NaN                     NaN                 NaN                          NaN
2013-2017    NaN                          NaN                     NaN                 NaN                          NaN
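The table above keeps only the sparsest variables for one country. A plain-pandas sketch of that kind of filter, assuming that `filter='bottom', p=0.1` means "keep columns that are at most 10% complete" (which is consistent with the table, where the only non-null value is 1 of 12 periods):

```python
import numpy as np
import pandas as pd

# Toy frame: rows are 12 time periods, columns are variables (invented values).
toy = pd.DataFrame(
    {
        "total_pop": [1.0, 2.0, 3.0, 4.0] * 3,          # always reported
        "gender_inequal_index": [np.nan] * 11 + [0.3],  # reported once in 12
        "total_dam_capacity": [np.nan] * 12,            # never reported
    }
)

# Completeness = share of non-null values per column.
completeness = toy.count() / len(toy)

# Keep only columns that are at most 10% complete.
sparse = toy.loc[:, completeness <= 0.1]

print(sparse.columns.tolist())
```

Changing the comparison to `>=` with a high threshold gives the opposite view: the variables a country reports most consistently.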
To do: Choose another region to assess for missing data.
In [33]:
# JSON with coordinates for country boundaries
geo = r'../../data/aquastat/world.json'
# 1 where agg_to_gdp is reported, 0 where it is missing
null_data = recent['agg_to_gdp'].notnull()*1
# named null_map rather than map to avoid shadowing the built-in map()
null_map = folium.Map(location=[48, -102], zoom_start=2)
null_map.choropleth(geo_path=geo,
                    data=null_data,
                    columns=['country', 'agg_to_gdp'],
                    key_on='feature.properties.name', reset=True,
                    fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
                    legend_name='Missing agricultural contribution to GDP data 2013-2017')
null_map
Out[33]:
Question: What does the very pale green mean, compared to the green? (E.g. Greenland versus Canada)
Now let's functionalize so we can look at other variables geospatially.
In [34]:
def plot_null_map(df, time_period, variable,
                  legend_name=None):
    """Map which countries report `variable` in `time_period` (1 = reported, 0 = missing)."""
    geo = r'../../data/aquastat/world.json'
    ts = time_slice(df, time_period).reset_index().copy()
    ts[variable] = ts[variable].notnull()*1
    # named null_map rather than map to avoid shadowing the built-in map()
    null_map = folium.Map(location=[48, -102], zoom_start=2)
    null_map.choropleth(geo_path=geo,
                        data=ts,
                        columns=['country', variable],
                        key_on='feature.properties.name', reset=True,
                        fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
                        legend_name=legend_name if legend_name else variable)
    return null_map
In [35]:
plot_null_map(data, '2013-2017', 'number_undernourished', 'Number undernourished is missing')
Out[35]:
Question: Are there any patterns in missing data? Any questions that come to mind for further investigation?
To do: Look at other variables
In [36]:
fig, ax = plt.subplots(figsize=(16, 16));
sns.heatmap(data.groupby(['time_period','variable']).value.count().unstack().T , ax=ax);
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Variable');
plt.title('Number of countries with data reported for each variable over time');
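The matrix behind the heatmap above is a count of non-null values per (time period, variable) pair, reshaped so that variables are rows and time periods are columns. A sketch of that reshaping on made-up long-format data with the same column names:

```python
import numpy as np
import pandas as pd

# Toy long-format data; values are invented for illustration.
toy = pd.DataFrame(
    {
        "country": ["A", "B", "A", "B", "A", "B", "A", "B"],
        "time_period": ["2008-2012", "2008-2012", "2013-2017", "2013-2017"] * 2,
        "variable": ["gdp"] * 4 + ["total_pop"] * 4,
        "value": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0],
    }
)

# count() skips NaN, unstack() moves 'variable' into columns,
# and .T puts variables on the rows and time periods on the columns.
counts = toy.groupby(["time_period", "variable"])["value"].count().unstack().T

print(counts)
```

Inspecting `counts` directly is handy when a heatmap cell looks suspicious and you want the exact number of reporting countries.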
Before trying to understand what information is in the data, make sure you understand what the data represents.
Sanity check! Do the values make sense?
Things to do:
Questions to consider:
This stage really morphs into the univariate exploration that comes next, as you are often diving into each variable one by one: first understanding it, then exploring it, then checking that understanding again. We can, however, do some initial profiling with a few handy Python packages.
In [37]:
pivottablejs.pivot_ui(time_slice(data, '2013-2017'),)
Out[37]:
In [38]:
pandas_profiling.ProfileReport(time_slice(data, '2013-2017'))
Out[38]:
Overview
Dataset info
Number of variables: 55
Number of observations: 199
Total Missing (%): 6.4%
Total size in memory: 85.6 KiB
Average record size in memory: 440.4 B
Variables types
Numeric: 30
Categorical: 0
Date: 0
Text (Unique): 1
Rejected: 24
Warnings
accounted_flow has 71 / 35.7% zeros
accounted_flow has 7 / 3.5% missing values [Missing]
accounted_flow_border_rivers has 150 / 75.4% zeros
accounted_flow_border_rivers has 7 / 3.5% missing values [Missing]
agg_to_gdp has 32 / 16.1% missing values [Missing]
arable_land has 3 / 1.5% zeros
arable_land has 3 / 1.5% missing values [Missing]
avg_annual_rain_depth has 18 / 9.0% missing values [Missing]
avg_annual_rain_vol has 16 / 8.0% missing values [Missing]
cultivated_area is highly correlated with arable_land (ρ = 0.99585) [Rejected]
dam_capacity_per_capita has 75 / 37.7% missing values [Missing]
dependency_ratio has 68 / 34.2% zeros
dependency_ratio has 7 / 3.5% missing values [Missing]
flood_occurence has 7 / 3.5% zeros
flood_occurence has 23 / 11.6% missing values [Missing]
gdp has 11 / 5.5% missing values [Missing]
gdp_per_capita has 11 / 5.5% missing values [Missing]
gender_inequal_index has 43 / 21.6% missing values [Missing]
groundwater_accounted_inflow has 178 / 89.4% zeros
groundwater_accounted_inflow has 7 / 3.5% missing values [Missing]
groundwater_accounted_outflow has 140 / 70.4% zeros
groundwater_accounted_outflow has 45 / 22.6% missing values [Missing]
groundwater_entering has 179 / 89.9% zeros
groundwater_entering has 7 / 3.5% missing values [Missing]
groundwater_produced has 2 / 1.0% zeros
groundwater_produced has 29 / 14.6% missing values [Missing]
groundwater_to_other_countries is highly correlated with groundwater_accounted_outflow (ρ = 1) [Rejected]
human_dev_index has 12 / 6.0% missing values [Missing]
interannual_variability has 33 / 16.6% missing values [Missing]
irrigation_potential has 88 / 44.2% missing values [Missing]
irwr is highly correlated with avg_annual_rain_vol (ρ = 0.96167) [Rejected]
irwr_per_capita has 18 / 9.0% missing values [Missing]
number_undernourished is highly correlated with irrigation_potential (ρ = 0.96711) [Rejected]
overlap_surface_groundwater is highly correlated with groundwater_produced (ρ = 0.9919) [Rejected]
percent_cultivated has 3 / 1.5% missing values [Missing]
percent_undernourished has 116 / 58.3% missing values [Missing]
permanent_crop_area has 6 / 3.0% zeros
permanent_crop_area has 3 / 1.5% missing values [Missing]
rural_pop is highly correlated with number_undernourished (ρ = 0.99226) [Rejected]
rural_pop_access_drinking has 16 / 8.0% missing values [Missing]
seasonal_variability has 33 / 16.6% missing values [Missing]
surface_entering is highly correlated with accounted_flow (ρ = 0.98177) [Rejected]
surface_groundwater_overlap is highly correlated with overlap_surface_groundwater (ρ = 1) [Rejected]
surface_inflow_secure_treaty has 178 / 89.4% zeros
surface_inflow_secure_treaty has 7 / 3.5% missing values [Missing]
surface_inflow_submit_no_treaty is highly correlated with surface_entering (ρ = 0.99629) [Rejected]
surface_inflow_submit_treaty is highly correlated with surface_inflow_secure_treaty (ρ = 0.979) [Rejected]
surface_outflow_secure_treaty has 179 / 89.9% zeros
surface_outflow_secure_treaty has 5 / 2.5% missing values [Missing]
surface_outflow_submit_no_treaty has 78 / 39.2% zeros
surface_outflow_submit_no_treaty has 16 / 8.0% missing values [Missing]
surface_outflow_submit_treaty is highly correlated with surface_outflow_secure_treaty (ρ = 0.97841) [Rejected]
surface_to_other_countries is highly correlated with surface_outflow_submit_no_treaty (ρ = 0.99643) [Rejected]
surface_total_external_renewable is highly correlated with surface_inflow_submit_no_treaty (ρ = 0.97923) [Rejected]
surface_water_produced is highly correlated with irwr (ρ = 0.99953) [Rejected]
total_area has 2 / 1.0% missing values [Missing]
total_dam_capacity is highly correlated with number_undernourished (ρ = 0.90598) [Rejected]
total_flow_border_rivers is highly correlated with accounted_flow_border_rivers (ρ = 0.96631) [Rejected]
total_pop is highly correlated with rural_pop (ρ = 0.96084) [Rejected]
total_pop_access_drinking is highly correlated with rural_pop_access_drinking (ρ = 0.94921) [Rejected]
total_renewable is highly correlated with surface_water_produced (ρ = 0.97515) [Rejected]
total_renewable_groundwater is highly correlated with surface_groundwater_overlap (ρ = 0.99191) [Rejected]
total_renewable_per_capita is highly correlated with irwr_per_capita (ρ = 0.97641) [Rejected]
total_renewable_surface is highly correlated with total_renewable (ρ = 0.99966) [Rejected]
urban_pop is highly correlated with total_pop (ρ = 0.95137) [Rejected]
urban_pop_access_drinking has 10 / 5.0% missing values [Missing]
water_total_external_renewable is highly correlated with surface_total_external_renewable (ρ = 1) [Rejected]
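Many of the warnings above reject a variable because it is highly correlated with another one. The sketch below shows one way such a rule could work: a greedy left-to-right scan over a Pearson correlation matrix. Both the 0.9 threshold and the scan order are assumptions made for illustration; this is not necessarily how pandas_profiling decides, and `reject_correlated` is a hypothetical helper name.

```python
import pandas as pd


def reject_correlated(df, threshold=0.9):
    """Greedy scan: flag a column if it correlates strongly with an earlier kept one.

    Returns {rejected_column: (kept_column, rho)}.
    The threshold is an assumed value for illustration.
    """
    corr = df.corr()  # Pearson correlation by default
    kept, rejected = [], {}
    for col in df.columns:
        for other in kept:
            rho = corr.loc[col, other]
            if abs(rho) > threshold:
                rejected[col] = (other, rho)
                break
        else:
            # No strong correlation with any column kept so far.
            kept.append(col)
    return rejected


# Invented values: cultivated_area tracks arable_land closely, gdp does not.
toy = pd.DataFrame(
    {
        "arable_land": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "cultivated_area": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
        "gdp": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    }
)

rejected = reject_correlated(toy)
print(rejected)
```

Note that a purely statistical rule like this can pair variables with no causal link (for example `total_dam_capacity` and `number_undernourished` above), so a rejected variable is a prompt to investigate, not an automatic drop.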
Variables
accounted_flow
Numeric
Distinct count
113
Unique (%)
58.9%
Missing (%)
3.5%
Missing (n)
7
Infinite (%)
0.0%
Infinite (n)
0
Mean
63.557
Minimum
0
Maximum
2986
Zeros (%)
35.7%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
3
Q3
26.065
95-th percentile
270.63
Maximum
2986
Range
2986
Interquartile range
26.065
Descriptive statistics
Standard deviation
250.32
Coef of variation
3.9386
Kurtosis
99.577
Mean
63.557
MAD
93.89
Skewness
9.0735
Sum
12203
Variance
62662
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
71
35.7%
3.0
3
1.5%
2.0
3
1.5%
11.0
3
1.5%
80.0
2
1.0%
10.15
2
1.0%
0.3
2
1.0%
1.0
2
1.0%
53.32
1
0.5%
524.7
1
0.5%
Other values (102)
102
51.3%
(Missing)
7
3.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
71
35.7%
0.015
1
0.5%
0.038
1
0.5%
0.096
1
0.5%
0.165
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
584.2
1
0.5%
610.0
1
0.5%
635.2
1
0.5%
1122.0
1
0.5%
2986.0
1
0.5%
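Several of the statistics reported for `accounted_flow` above are tied together by simple identities, which makes for a quick sanity check of the report. The sketch below copies the rounded values from the report and verifies them within rounding tolerance; the tolerances are chosen for illustration:

```python
import math

# Values copied from the accounted_flow summary above (already rounded).
n_obs, n_missing = 199, 7
mean, std, variance = 63.557, 250.32, 62662
total = 12203
q1, q3 = 0, 26.065
minimum, maximum = 0, 2986

n_valid = n_obs - n_missing            # observations with a value
missing_pct = 100 * n_missing / n_obs  # reported as 3.5%

checks = {
    "missing_pct": math.isclose(missing_pct, 3.5, abs_tol=0.05),
    "mean = sum / n_valid": math.isclose(total / n_valid, mean, rel_tol=1e-3),
    "coef of variation = std / mean": math.isclose(std / mean, 3.9386, rel_tol=1e-3),
    "variance = std ** 2": math.isclose(std ** 2, variance, rel_tol=1e-3),
    "IQR = Q3 - Q1": math.isclose(q3 - q1, 26.065),
    "range = max - min": maximum - minimum == 2986,
}

for name, ok in checks.items():
    print(name, "OK" if ok else "MISMATCH")
```

The same identities can be applied to any other numeric variable in the report; a mismatch would suggest a transcription problem or a statistic computed on a different subset of rows.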
accounted_flow_border_rivers
Numeric
Distinct count
41
Unique (%)
21.4%
Missing (%)
3.5%
Missing (n)
7
Infinite (%)
0.0%
Infinite (n)
0
Mean
8.3438
Minimum
0
Maximum
558
Zeros (%)
75.4%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
0
Q3
0
95-th percentile
29.198
Maximum
558
Range
558
Interquartile range
0
Descriptive statistics
Standard deviation
46.811
Coef of variation
5.6103
Kurtosis
103.31
Mean
8.3438
MAD
14.415
Skewness
9.4653
Sum
1602
Variance
2191.3
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
150
75.4%
10.15
2
1.0%
1.45
2
1.0%
11.0
2
1.0%
3.815
1
0.5%
7.74
1
0.5%
34.33
1
0.5%
25.0
1
0.5%
75.0
1
0.5%
1.25
1
0.5%
Other values (30)
30
15.1%
(Missing)
7
3.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
150
75.4%
0.035
1
0.5%
0.038
1
0.5%
0.14
1
0.5%
0.432
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
84.05
1
0.5%
110.0
1
0.5%
197.5
1
0.5%
214.1
1
0.5%
558.0
1
0.5%
agg_to_gdp
Numeric
Distinct count
165
Unique (%)
98.8%
Missing (%)
16.1%
Missing (n)
32
Infinite (%)
0.0%
Infinite (n)
0
Mean
12.545
Minimum
0.0349
Maximum
59.23
Zeros (%)
0.0%
Quantile statistics
Minimum
0.0349
5-th percentile
0.62919
Q1
2.8885
Median
8.498
Q3
19.465
95-th percentile
36.168
Maximum
59.23
Range
59.195
Interquartile range
16.576
Descriptive statistics
Standard deviation
11.997
Coef of variation
0.95628
Kurtosis
1.5798
Mean
12.545
MAD
9.5593
Skewness
1.3303
Sum
2095
Variance
143.92
Memory size
1.6 KiB
Value
Count
Frequency (%)
40.97
2
1.0%
2.384
2
1.0%
32.94
2
1.0%
1.652
1
0.5%
10.12
1
0.5%
4.725
1
0.5%
23.66
1
0.5%
24.09
1
0.5%
6.574
1
0.5%
17.39
1
0.5%
Other values (154)
154
77.4%
(Missing)
32
16.1%
Minimum 5 values
Value
Count
Frequency (%)
0.0349
1
0.5%
0.1363
1
0.5%
0.1815
1
0.5%
0.3004
1
0.5%
0.4083
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
42.89
1
0.5%
43.92
1
0.5%
47.52
1
0.5%
52.39
1
0.5%
59.23
1
0.5%
arable_land
Numeric
Distinct count
173
Unique (%)
88.3%
Missing (%)
1.5%
Missing (n)
3
Infinite (%)
0.0%
Infinite (n)
0
Mean
7229.3
Minimum
0
Maximum
156360
Zeros (%)
1.5%
Quantile statistics
Minimum
0
5-th percentile
1.9
Q1
120
Median
1204.5
Q3
4671.5
95-th percentile
30963
Maximum
156360
Range
156360
Interquartile range
4551.5
Descriptive statistics
Standard deviation
21029
Coef of variation
2.9089
Kurtosis
31.537
Mean
7229.3
MAD
9460.8
Skewness
5.3616
Sum
1416900
Variance
442230000
Memory size
1.6 KiB
Value
Count
Frequency (%)
2.0
4
2.0%
1.0
4
2.0%
3.0
3
1.5%
3800.0
3
1.5%
0.0
3
1.5%
5.0
3
1.5%
2350.0
2
1.0%
120.0
2
1.0%
800.0
2
1.0%
300.0
2
1.0%
Other values (162)
168
84.4%
(Missing)
3
1.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
3
1.5%
0.08
1
0.5%
0.56
1
0.5%
1.0
4
2.0%
1.6
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
80017.0
1
0.5%
106298.0
1
0.5%
123122.0
1
0.5%
154605.0
1
0.5%
156360.0
1
0.5%
avg_annual_rain_depth
Numeric
Distinct count
174
Unique (%)
96.1%
Missing (%)
9.0%
Missing (n)
18
Infinite (%)
0.0%
Infinite (n)
0
Mean
1166
Minimum
51
Maximum
3240
Zeros (%)
0.0%
Quantile statistics
Minimum
51
5-th percentile
121
Q1
562
Median
1030
Q3
1705
95-th percentile
2702
Maximum
3240
Range
3189
Interquartile range
1143
Descriptive statistics
Standard deviation
800.11
Coef of variation
0.68622
Kurtosis
-0.40545
Mean
1166
MAD
662.11
Skewness
0.65639
Sum
211040
Variance
640180
Memory size
1.6 KiB
Value
Count
Frequency (%)
250.0
2
1.0%
788.0
2
1.0%
900.0
2
1.0%
1500.0
2
1.0%
1274.0
2
1.0%
282.0
2
1.0%
2200.0
2
1.0%
228.0
2
1.0%
657.0
1
0.5%
241.0
1
0.5%
Other values (163)
163
81.9%
(Missing)
18
9.0%
Minimum 5 values
Value
Count
Frequency (%)
51.0
1
0.5%
56.0
1
0.5%
59.0
1
0.5%
74.0
1
0.5%
78.0
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
2928.0
1
0.5%
3028.0
1
0.5%
3142.0
1
0.5%
3200.0
1
0.5%
3240.0
1
0.5%
avg_annual_rain_vol
Numeric
Distinct count
181
Unique (%)
98.9%
Missing (%)
8.0%
Missing (n)
16
Infinite (%)
0.0%
Infinite (n)
0
Mean
595.32
Minimum
0.064
Maximum
14995
Zeros (%)
0.0%
Quantile statistics
Minimum
0.064
5-th percentile
0.86507
Q1
33.095
Median
127.8
Q3
434.5
95-th percentile
3427.4
Maximum
14995
Range
14995
Interquartile range
401.4
Descriptive statistics
Standard deviation
1590.4
Coef of variation
2.6715
Kurtosis
41.211
Mean
595.32
MAD
734.91
Skewness
5.7112
Sum
108940
Variance
2529400
Memory size
1.6 KiB
Value
Count
Frequency (%)
1259.0
2
1.0%
220.8
2
1.0%
297.2
2
1.0%
279.2
1
0.5%
3618.0
1
0.5%
199.8
1
0.5%
25.86
1
0.5%
1415.0
1
0.5%
7030.0
1
0.5%
513.1
1
0.5%
Other values (170)
170
85.4%
(Missing)
16
8.0%
Minimum 5 values
Value
Count
Frequency (%)
0.064
1
0.5%
0.1792
1
0.5%
0.371
1
0.5%
0.4532
1
0.5%
0.4724
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
5362.0
1
0.5%
6192.0
1
0.5%
7030.0
1
0.5%
7865.0
1
0.5%
14995.0
1
0.5%
country
Categorical, Unique
First 3 values
Haiti
Djibouti
Antigua and Barbuda
Last 3 values
Croatia
Nepal
Burkina Faso
First 10 values
Value
Count
Frequency (%)
Afghanistan
1
0.5%
Albania
1
0.5%
Algeria
1
0.5%
Andorra
1
0.5%
Angola
1
0.5%
Last 10 values
Value
Count
Frequency (%)
Venezuela (Bolivarian Republic of)
1
0.5%
Viet Nam
1
0.5%
Yemen
1
0.5%
Zambia
1
0.5%
Zimbabwe
1
0.5%
cultivated_area
Highly correlated
This variable is highly correlated with arable_land
and should be ignored for analysis
Correlation
0.99585
dam_capacity_per_capita
Numeric
Distinct count
124
Unique (%)
100.0%
Missing (%)
37.7%
Missing (n)
75
Infinite (%)
0.0%
Infinite (n)
0
Mean
1580.3
Minimum
0.1873
Maximum
36832
Zeros (%)
0.0%
Quantile statistics
Minimum
0.1873
5-th percentile
3.2488
Q1
71.072
Median
328.5
Q3
1205.5
95-th percentile
5561.6
Maximum
36832
Range
36832
Interquartile range
1134.4
Descriptive statistics
Standard deviation
4130.5
Coef of variation
2.6138
Kurtosis
48.996
Mean
1580.3
MAD
1934
Skewness
6.4315
Sum
195950
Variance
17061000
Memory size
1.6 KiB
Value
Count
Frequency (%)
633.1
2
1.0%
65.35
1
0.5%
3370.0
1
0.5%
61.76
1
0.5%
55.49
1
0.5%
1321.0
1
0.5%
4.704
1
0.5%
1055.0
1
0.5%
38.97
1
0.5%
2050.0
1
0.5%
Other values (113)
113
56.8%
(Missing)
75
37.7%
Minimum 5 values
Value
Count
Frequency (%)
0.1873
1
0.5%
0.6846
1
0.5%
1.948
1
0.5%
1.969
1
0.5%
2.16
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
6386.0
1
0.5%
6405.0
1
0.5%
7001.0
1
0.5%
23414.0
1
0.5%
36832.0
1
0.5%
dependency_ratio
Numeric
Distinct count
125
Unique (%)
65.1%
Missing (%)
3.5%
Missing (n)
7
Infinite (%)
0.0%
Infinite (n)
0
Mean
22.819
Minimum
0
Maximum
100
Zeros (%)
34.2%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
6.183
Q3
40.8
95-th percentile
87.299
Maximum
100
Range
100
Interquartile range
40.8
Descriptive statistics
Standard deviation
29.869
Coef of variation
1.309
Kurtosis
0.010321
Mean
22.819
MAD
25.123
Skewness
1.1509
Sum
4381.2
Variance
892.16
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
68
34.2%
30.52
2
1.0%
80.39
1
0.5%
4.123
1
0.5%
14.63
1
0.5%
7.407
1
0.5%
24.49
1
0.5%
21.77
1
0.5%
5.769
1
0.5%
64.27
1
0.5%
Other values (114)
114
57.3%
(Missing)
7
3.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
68
34.2%
0.2691
1
0.5%
0.2695
1
0.5%
0.7496
1
0.5%
0.7854
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
96.49
1
0.5%
96.55
1
0.5%
96.91
1
0.5%
97.0
1
0.5%
100.0
1
0.5%
flood_occurence
Numeric
Distinct count
41
Unique (%)
23.3%
Missing (%)
11.6%
Missing (n)
23
Infinite (%)
0.0%
Infinite (n)
0
Mean
2.6955
Minimum
0
Maximum
4.9
Zeros (%)
3.5%
Quantile statistics
Minimum
0
5-th percentile
0.35
Q1
2.3
Median
2.9
Q3
3.325
95-th percentile
3.825
Maximum
4.9
Range
4.9
Interquartile range
1.025
Descriptive statistics
Standard deviation
0.99495
Coef of variation
0.36912
Kurtosis
0.96957
Mean
2.6955
MAD
0.75212
Skewness
-1.0132
Sum
474.4
Variance
0.98992
Memory size
1.6 KiB
Value
Count
Frequency (%)
3.6
12
6.0%
3.0
11
5.5%
3.3
11
5.5%
3.1
11
5.5%
2.5
10
5.0%
2.9
8
4.0%
3.5
8
4.0%
2.8
8
4.0%
2.7
7
3.5%
0.0
7
3.5%
Other values (30)
83
41.7%
(Missing)
23
11.6%
Minimum 5 values
Value
Count
Frequency (%)
0.0
7
3.5%
0.1
1
0.5%
0.2
1
0.5%
0.4
1
0.5%
0.6
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
3.9
5
2.5%
4.0
1
0.5%
4.5
1
0.5%
4.7
1
0.5%
4.9
1
0.5%
gdp
Numeric
Distinct count
185
Unique (%)
98.4%
Missing (%)
5.5%
Missing (n)
11
Infinite (%)
0.0%
Infinite (n)
0
Mean
384890000000
Minimum
37860000
Maximum
17900000000000
Zeros (%)
0.0%
Quantile statistics
Minimum
37860000
5-th percentile
754760000
Q1
7675800000
Median
30475000000
Q3
192500000000
95-th percentile
1490500000000
Maximum
17900000000000
Range
17900000000000
Interquartile range
184820000000
Descriptive statistics
Standard deviation
1604200000000
Coef of variation
4.1681
Kurtosis
85.899
Mean
384890000000
MAD
549640000000
Skewness
8.7022
Sum
72359000000000
Variance
2.5736e+24
Memory size
1.6 KiB
Value
Count
Frequency (%)
195000000000.0
2
1.0%
167000000000.0
2
1.0%
296000000000.0
2
1.0%
292000000000.0
2
1.0%
2.85e+12
1
0.5%
13779570706.0
1
0.5%
35237742278.0
1
0.5%
199000000000.0
1
0.5%
52132289700.0
1
0.5%
11099473097.0
1
0.5%
Other values (174)
174
87.4%
(Missing)
11
5.5%
Minimum 5 values
Value
Count
Frequency (%)
37859554.0
1
0.5%
145237022.0
1
0.5%
186716626.0
1
0.5%
287400000.0
1
0.5%
318071979.0
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
2.85e+12
1
0.5%
3.36e+12
1
0.5%
4.12e+12
1
0.5%
1.09e+13
1
0.5%
1.79e+13
1
0.5%
gdp_per_capita
Numeric
Distinct count
188
Unique (%)
100.0%
Missing (%)
5.5%
Missing (n)
11
Infinite (%)
0.0%
Infinite (n)
0
Mean
12531
Minimum
276
Maximum
101910
Zeros (%)
0.0%
Quantile statistics
Minimum
276
5-th percentile
537.11
Q1
1732.2
Median
4911.5
Q3
14474
95-th percentile
50644
Maximum
101910
Range
101640
Interquartile range
12742
Descriptive statistics
Standard deviation
17543
Coef of variation
1.4
Kurtosis
5.4979
Mean
12531
MAD
12391
Skewness
2.2522
Sum
2355700
Variance
307760000
Memory size
1.6 KiB
Value
Count
Frequency (%)
3974.0
2
1.0%
26018.0
1
0.5%
1773.0
1
0.5%
3822.0
1
0.5%
6862.0
1
0.5%
1429.0
1
0.5%
23030.0
1
0.5%
17282.0
1
0.5%
1579.0
1
0.5%
3491.0
1
0.5%
Other values (177)
177
88.9%
(Missing)
11
5.5%
Minimum 5 values
Value
Count
Frequency (%)
276.0
1
0.5%
306.8
1
0.5%
359.0
1
0.5%
381.4
1
0.5%
411.8
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
55906.0
1
0.5%
74458.0
1
0.5%
74720.0
1
0.5%
80130.0
1
0.5%
101911.0
1
0.5%
gender_inequal_index
Numeric
Distinct count
155
Unique (%)
99.4%
Missing (%)
21.6%
Missing (n)
43
Infinite (%)
0.0%
Infinite (n)
0
Mean
0.36695
Minimum
0.0164
Maximum
0.744
Zeros (%)
0.0%
Quantile statistics
Minimum
0.0164
5-th percentile
0.066375
Q1
0.18738
Median
0.38605
Q3
0.52572
95-th percentile
0.65758
Maximum
0.744
Range
0.7276
Interquartile range
0.33835
Descriptive statistics
Standard deviation
0.19133
Coef of variation
0.52141
Kurtosis
-1.1218
Mean
0.36695
MAD
0.16375
Skewness
-0.10515
Sum
57.244
Variance
0.036608
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.1247
2
1.0%
0.1636
2
1.0%
0.1507
1
0.5%
0.3573
1
0.5%
0.4485
1
0.5%
0.5329
1
0.5%
0.6224
1
0.5%
0.0884
1
0.5%
0.4796
1
0.5%
0.4134
1
0.5%
Other values (144)
144
72.4%
(Missing)
43
21.6%
Minimum 5 values
Value
Count
Frequency (%)
0.0164
1
0.5%
0.0278
1
0.5%
0.0407
1
0.5%
0.0484
1
0.5%
0.0528
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
0.6789
1
0.5%
0.6934
1
0.5%
0.7065
1
0.5%
0.7132
1
0.5%
0.744
1
0.5%
groundwater_accounted_inflow
Numeric
Distinct count
15
Unique (%)
7.8%
Missing (%)
3.5%
Missing (n)
7
Infinite (%)
0.0%
Infinite (n)
0
Mean
0.012557
Minimum
-1.2
Maximum
1.33
Zeros (%)
89.4%
Quantile statistics
Minimum
-1.2
5-th percentile
0
Q1
0
Median
0
Q3
0
95-th percentile
0.0245
Maximum
1.33
Range
2.53
Interquartile range
0
Descriptive statistics
Standard deviation
0.1577
Coef of variation
12.559
Kurtosis
52.707
Mean
0.012557
MAD
0.036051
Skewness
2.4695
Sum
2.411
Variance
0.02487
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
178
89.4%
0.08
2
1.0%
0.01
1
0.5%
0.032
1
0.5%
-1.2
1
0.5%
0.725
1
0.5%
0.002
1
0.5%
1.33
1
0.5%
0.03
1
0.5%
0.02
1
0.5%
Other values (4)
4
2.0%
(Missing)
7
3.5%
Minimum 5 values
Value
Count
Frequency (%)
-1.2
1
0.5%
0.0
178
89.4%
0.002
1
0.5%
0.01
1
0.5%
0.02
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
0.1
1
0.5%
0.112
1
0.5%
0.725
1
0.5%
1.0
1
0.5%
1.33
1
0.5%
groundwater_accounted_outflow
Numeric
Distinct count
15
Unique (%)
9.7%
Missing (%)
22.6%
Missing (n)
45
Infinite (%)
0.0%
Infinite (n)
0
Mean
0.27273
Minimum
0
Maximum
26.12
Zeros (%)
70.4%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
0
Q3
0
95-th percentile
0.301
Maximum
26.12
Range
26.12
Interquartile range
0
Descriptive statistics
Standard deviation
2.2802
Coef of variation
8.3604
Kurtosis
112.51
Mean
0.27273
MAD
0.51012
Skewness
10.334
Sum
42.001
Variance
5.1991
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
140
70.4%
0.95
2
1.0%
0.032
1
0.5%
0.03
1
0.5%
0.34
1
0.5%
0.025
1
0.5%
0.1
1
0.5%
0.7
1
0.5%
26.12
1
0.5%
0.394
1
0.5%
Other values (4)
4
2.0%
(Missing)
45
22.6%
Minimum 5 values
Value
Count
Frequency (%)
0.0
140
70.4%
0.025
1
0.5%
0.03
1
0.5%
0.032
1
0.5%
0.08
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
0.7
1
0.5%
0.95
2
1.0%
1.0
1
0.5%
11.0
1
0.5%
26.12
1
0.5%
groundwater_entering
Numeric
Distinct count
14
Unique (%)
7.3%
Missing (%)
3.5%
Missing (n)
7
Infinite (%)
0.0%
Infinite (n)
0
Mean
0.070786
Minimum
0
Maximum
11.13
Zeros (%)
89.9%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
0
Q3
0
95-th percentile
0.0245
Maximum
11.13
Range
11.13
Interquartile range
0
Descriptive statistics
Standard deviation
0.80753
Coef of variation
11.408
Kurtosis
187.02
Mean
0.070786
MAD
0.13469
Skewness
13.6
Sum
13.591
Variance
0.6521
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
179
89.9%
0.08
2
1.0%
0.01
1
0.5%
0.032
1
0.5%
0.725
1
0.5%
0.002
1
0.5%
0.27
1
0.5%
0.03
1
0.5%
0.02
1
0.5%
0.112
1
0.5%
Other values (3)
3
1.5%
(Missing)
7
3.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
179
89.9%
0.002
1
0.5%
0.01
1
0.5%
0.02
1
0.5%
0.03
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
0.112
1
0.5%
0.27
1
0.5%
0.725
1
0.5%
1.0
1
0.5%
11.13
1
0.5%
groundwater_produced
Numeric
Distinct count
150
Unique (%)
88.2%
Missing (%)
14.6%
Missing (n)
29
Infinite (%)
0.0%
Infinite (n)
0
Mean
62.768
Minimum
0
Maximum
1383
Zeros (%)
1.0%
Quantile statistics
Minimum
0
5-th percentile
0.0767
Q1
2.2
Median
9.65
Q3
40.98
95-th percentile
398.05
Maximum
1383
Range
1383
Interquartile range
38.78
Descriptive statistics
Standard deviation
165.13
Coef of variation
2.6308
Kurtosis
29.206
Mean
62.768
MAD
82.406
Skewness
4.8804
Sum
10670
Variance
27268
Memory size
1.6 KiB
Value
Count
Frequency (%)
6.0
4
2.0%
20.0
4
2.0%
0.5
4
2.0%
2.5
3
1.5%
1.3
3
1.5%
4.0
3
1.5%
3.2
2
1.0%
55.0
2
1.0%
2.2
2
1.0%
10.0
2
1.0%
Other values (139)
141
70.9%
(Missing)
29
14.6%
Minimum 5 values
Value
Count
Frequency (%)
0.0
2
1.0%
0.01
1
0.5%
0.015
1
0.5%
0.02
1
0.5%
0.03
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
510.0
1
0.5%
645.6
1
0.5%
788.0
1
0.5%
828.8
1
0.5%
1383.0
1
0.5%
groundwater_to_other_countries
Highly correlated
This variable is highly correlated with groundwater_accounted_outflow and should be ignored for analysis.
Correlation: 1
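The report marks a variable as "Highly correlated" when it closely tracks another variable, and every correlation flagged in this section is above 0.9. The following is a minimal sketch of that idea in plain pandas, not the profiler's actual implementation: the function name `flag_correlated`, the 0.9 threshold, and the rule of comparing each column only against earlier columns are assumptions made for illustration.

```python
import pandas as pd


def flag_correlated(df: pd.DataFrame, threshold: float = 0.9) -> dict:
    """Map each flagged column to (earlier column, correlation). Hypothetical helper."""
    corr = df.corr()  # pairwise Pearson correlation, NaNs excluded pairwise
    columns = list(corr.columns)
    flagged = {}
    for i, col in enumerate(columns):
        # Compare only against columns that come earlier in the frame
        for earlier in columns[:i]:
            r = corr.loc[col, earlier]
            if pd.notna(r) and abs(r) > threshold:
                flagged[col] = (earlier, float(r))
                break  # record the first match only
    return flagged


# Toy data: b is an exact linear function of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
flagged = flag_correlated(toy)
```

A flag like this says the two columns carry nearly the same information. Which one to keep is still a judgment call, so "should be ignored" is better read as a prompt to check than as a rule.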
human_dev_index
Numeric
Distinct count
186
Unique (%)
99.5%
Missing (%)
6.0%
Missing (n)
12
Infinite (%)
0.0%
Infinite (n)
0
Mean
0.69128
Minimum
0.3483
Maximum
0.9439
Zeros (%)
0.0%
Quantile statistics
Minimum
0.3483
5-th percentile
0.41939
Q1
0.57255
Median
0.7238
Q3
0.80925
95-th percentile
0.91278
Maximum
0.9439
Range
0.5956
Interquartile range
0.2367
Descriptive statistics
Standard deviation
0.15429
Coef of variation
0.2232
Kurtosis
-0.91053
Mean
0.69128
MAD
0.12987
Skewness
-0.35983
Sum
129.27
Variance
0.023807
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.715
2
1.0%
0.7928
2
1.0%
0.514
1
0.5%
0.6087
1
0.5%
0.6276
1
0.5%
0.575
1
0.5%
0.8701
1
0.5%
0.8175
1
0.5%
0.6357
1
0.5%
0.9155
1
0.5%
Other values (175)
175
87.9%
(Missing)
12
6.0%
Minimum 5 values
Value
Count
Frequency (%)
0.3483
1
0.5%
0.3501
1
0.5%
0.3909
1
0.5%
0.3919
1
0.5%
0.3999
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
0.9218
1
0.5%
0.9233
1
0.5%
0.9296
1
0.5%
0.935
1
0.5%
0.9439
1
0.5%
interannual_variability
Numeric
Distinct count
33
Unique (%)
19.9%
Missing (%)
16.6%
Missing (n)
33
Infinite (%)
0.0%
Infinite (n)
0
Mean
1.7584
Minimum
0.6
Maximum
4.9
Zeros (%)
0.0%
Quantile statistics
Minimum
0.6
5-th percentile
0.8
Q1
1.1
Median
1.5
Q3
2.3
95-th percentile
3.5
Maximum
4.9
Range
4.3
Interquartile range
1.2
Descriptive statistics
Standard deviation
0.88408
Coef of variation
0.50276
Kurtosis
0.88868
Mean
1.7584
MAD
0.71277
Skewness
1.1336
Sum
291.9
Variance
0.7816
Memory size
1.6 KiB
Value
Count
Frequency (%)
1.0
15
7.5%
1.2
12
6.0%
0.9
11
5.5%
1.4
11
5.5%
1.5
10
5.0%
1.1
10
5.0%
1.3
9
4.5%
0.8
8
4.0%
2.7
7
3.5%
2.3
6
3.0%
Other values (22)
67
33.7%
(Missing)
33
16.6%
Minimum 5 values
Value
Count
Frequency (%)
0.6
3
1.5%
0.7
2
1.0%
0.8
8
4.0%
0.9
11
5.5%
1.0
15
7.5%
Maximum 5 values
Value
Count
Frequency (%)
3.6
2
1.0%
3.8
1
0.5%
4.2
2
1.0%
4.3
2
1.0%
4.9
1
0.5%
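The Value / Count / Frequency (%) listings follow a simple pattern: the most common values, an "Other values" bucket, and a "(Missing)" row, with each percentage taken over all rows (15 of 199 rows is 7.5% for `interannual_variability`). Below is a hedged sketch of how a table of that shape could be built with pandas. The helper name `frequency_table` and the toy series are assumptions, not the report's code.

```python
import numpy as np
import pandas as pd


def frequency_table(s: pd.Series, top: int = 10) -> pd.DataFrame:
    """Build a report-style frequency table for one column (hypothetical helper)."""
    n_rows = len(s)
    counts = s.value_counts()  # NaN excluded, most frequent first
    rows = [(str(value), int(count)) for value, count in counts.head(top).items()]
    rest = counts.iloc[top:]
    if len(rest) > 0:
        rows.append((f"Other values ({len(rest)})", int(rest.sum())))
    n_missing = int(s.isna().sum())
    if n_missing > 0:
        rows.append(("(Missing)", n_missing))
    table = pd.DataFrame(rows, columns=["Value", "Count"])
    # Percentages are relative to all rows, missing ones included
    table["Frequency (%)"] = (100 * table["Count"] / n_rows).round(1)
    return table


# Toy data, not taken from the dataset
toy = pd.Series([1.0, 1.0, 1.0, 1.2, 1.2, 0.9, np.nan, np.nan])
table = frequency_table(toy, top=2)
```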
irrigation_potential
Numeric
Distinct count
105
Unique (%)
94.6%
Missing (%)
44.2%
Missing (n)
88
Infinite (%)
0.0%
Infinite (n)
0
Mean
4638.7
Minimum
0.2
Maximum
139500
Zeros (%)
0.0%
Quantile statistics
Minimum
0.2
5-th percentile
7.465
Q1
183.5
Median
566
Q3
3099
95-th percentile
15500
Maximum
139500
Range
139500
Interquartile range
2915.5
Descriptive statistics
Standard deviation
15300
Coef of variation
3.2984
Kurtosis
58.212
Mean
4638.7
MAD
6008.1
Skewness
7.1447
Sum
514900
Variance
234100000
Memory size
1.6 KiB
Value
Count
Frequency (%)
2700.0
2
1.0%
5500.0
2
1.0%
165.0
2
1.0%
600.0
2
1.0%
1900.0
2
1.0%
30.0
2
1.0%
200.0
2
1.0%
70000.0
1
0.5%
40.0
1
0.5%
0.894
1
0.5%
Other values (94)
94
47.2%
(Missing)
88
44.2%
Minimum 5 values
Value
Count
Frequency (%)
0.2
1
0.5%
0.3
1
0.5%
0.894
1
0.5%
1.0
1
0.5%
2.4
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
21300.0
1
0.5%
29000.0
1
0.5%
29350.0
1
0.5%
70000.0
1
0.5%
139500.0
1
0.5%
irwr
Highly correlated
This variable is highly correlated with avg_annual_rain_vol and should be ignored for analysis.
Correlation: 0.96167
irwr_per_capita
Numeric
Distinct count
182
Unique (%)
100.6%
Missing (%)
9.0%
Missing (n)
18
Infinite (%)
0.0%
Infinite (n)
0
Mean
16036
Minimum
0
Maximum
516090
Zeros (%)
0.5%
Quantile statistics
Minimum
0
5-th percentile
93.01
Q1
913.2
Median
2599
Q3
11227
95-th percentile
72201
Maximum
516090
Range
516090
Interquartile range
10314
Descriptive statistics
Standard deviation
49232
Coef of variation
3.07
Kurtosis
66.768
Mean
16036
MAD
20476
Skewness
7.4684
Sum
2902600
Variance
2423700000
Memory size
1.6 KiB
Value
Count
Frequency (%)
822.2
1
0.5%
566.3
1
0.5%
1571.0
1
0.5%
11761.0
1
0.5%
19444.0
1
0.5%
2886.0
1
0.5%
3303.0
1
0.5%
1213.0
1
0.5%
5372.0
1
0.5%
3585.0
1
0.5%
Other values (171)
171
85.9%
(Missing)
18
9.0%
Minimum 5 values
Value
Count
Frequency (%)
0.0
1
0.5%
2.905
1
0.5%
16.38
1
0.5%
19.67
1
0.5%
25.06
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
100671.0
1
0.5%
105132.0
1
0.5%
182320.0
1
0.5%
314170.0
1
0.5%
516090.0
1
0.5%
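The "Unique (%)" of 100.6% reported for `irwr_per_capita` looks impossible at first glance. One explanation that fits the numbers in this report, offered as an inference rather than a confirmed description of the profiler, is that the distinct count includes the missing value as one category while the percentage is taken over non-missing rows only. The row count of 199 is itself inferred (18 missing rows reported as 9.0%). The helper `unique_pct` is hypothetical.

```python
import numpy as np
import pandas as pd

N_ROWS = 199  # assumption inferred from the report: 18 missing rows = 9.0%


def unique_pct(distinct_count: int, n_missing: int, n_rows: int = N_ROWS) -> float:
    """Distinct count divided by the number of non-missing rows, in percent."""
    return 100 * distinct_count / (n_rows - n_missing)


# (distinct count, missing count) as printed in the report
reported = {
    "irwr_per_capita": (182, 18),      # report says 100.6%
    "groundwater_entering": (14, 7),   # report says 7.3%
    "human_dev_index": (186, 12),      # report says 99.5%
}
recomputed = {name: round(unique_pct(d, m), 1) for name, (d, m) in reported.items()}

# The same effect on a toy series: NaN counted as a distinct value
toy = pd.Series([1.0, 2.0, np.nan, np.nan])
distinct_with_nan = toy.nunique(dropna=False)  # 3: the values 1.0, 2.0 and NaN
toy_pct = 100 * distinct_with_nan / toy.count()
```

Under that reading, a column whose non-missing values are all different and that has at least one missing row will show slightly more than 100%.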
number_undernourished
Highly correlated
This variable is highly correlated with irrigation_potential and should be ignored for analysis.
Correlation: 0.96711
overlap_surface_groundwater
Highly correlated
This variable is highly correlated with groundwater_produced and should be ignored for analysis.
Correlation: 0.9919
percent_cultivated
Numeric
Distinct count
195
Unique (%)
99.5%
Missing (%)
1.5%
Missing (n)
3
Infinite (%)
0.0%
Infinite (n)
0
Mean
18.513
Minimum
0.0862
Maximum
63.41
Zeros (%)
0.0%
Quantile statistics
Minimum
0.0862
5-th percentile
0.96727
Q1
5.9638
Median
14.68
Q3
27.88
95-th percentile
50.148
Maximum
63.41
Range
63.324
Interquartile range
21.916
Descriptive statistics
Standard deviation
15.496
Coef of variation
0.83705
Kurtosis
0.28068
Mean
18.513
MAD
12.504
Skewness
0.98211
Sum
3628.6
Variance
240.14
Memory size
1.6 KiB
Value
Count
Frequency (%)
27.88
2
1.0%
60.0
2
1.0%
16.13
1
0.5%
3.412
1
0.5%
18.02
1
0.5%
10.91
1
0.5%
31.98
1
0.5%
6.111
1
0.5%
21.4
1
0.5%
55.7
1
0.5%
Other values (184)
184
92.5%
(Missing)
3
1.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0862
1
0.5%
0.2223
1
0.5%
0.3658
1
0.5%
0.4334
1
0.5%
0.4473
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
56.76
1
0.5%
57.57
1
0.5%
60.0
2
1.0%
62.3
1
0.5%
63.41
1
0.5%
percent_undernourished
Numeric
Distinct count
69
Unique (%)
83.1%
Missing (%)
58.3%
Missing (n)
116
Infinite (%)
0.0%
Infinite (n)
0
Mean
17.223
Minimum
5.1
Maximum
53.4
Zeros (%)
0.0%
Quantile statistics
Minimum
5.1
5-th percentile
5.5
Q1
7.9
Median
13.5
Q3
23.45
95-th percentile
40.88
Maximum
53.4
Range
48.3
Interquartile range
15.55
Descriptive statistics
Standard deviation
11.416
Coef of variation
0.66285
Kurtosis
0.8264
Mean
17.223
MAD
9.2504
Skewness
1.1583
Sum
1429.5
Variance
130.33
Memory size
1.6 KiB
Value
Count
Frequency (%)
7.4
3
1.5%
14.2
3
1.5%
20.7
3
1.5%
7.5
2
1.0%
9.5
2
1.0%
16.4
2
1.0%
26.8
2
1.0%
6.2
2
1.0%
22.0
2
1.0%
15.9
2
1.0%
Other values (58)
60
30.2%
(Missing)
116
58.3%
Minimum 5 values
Value
Count
Frequency (%)
5.1
2
1.0%
5.2
1
0.5%
5.3
1
0.5%
5.5
2
1.0%
5.6
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
41.6
1
0.5%
42.3
1
0.5%
47.7
1
0.5%
47.8
1
0.5%
53.4
1
0.5%
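Missingness varies a lot across variables: `percent_undernourished` is missing in 116 rows (58.3%) and `irrigation_potential` in 88 rows (44.2%), while most others are missing in under 10%. A short sketch of how the columns could be ranked by missing share with pandas follows. The small frame reuses the five rows shown in the report's Sample, and the variable names `missing_pct` and `reported_pct` are illustrative.

```python
import numpy as np
import pandas as pd

# Five rows copied from the report's Sample (Afghanistan to Angola)
sample = pd.DataFrame(
    {
        "percent_undernourished": [26.8, np.nan, np.nan, np.nan, 14.2],
        "irrigation_potential": [np.nan, np.nan, 1300.0, np.nan, 3700.0],
        "human_dev_index": [0.4653, 0.7328, 0.7356, 0.8446, 0.5316],
    },
    index=["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
)

# Share of missing values per column, largest first
missing_pct = (100 * sample.isna().mean()).sort_values(ascending=False)

# Cross-check of the percentage printed in the report (199 rows is inferred)
reported_pct = round(100 * 116 / 199, 1)
```

With only five rows these shares are not representative of the full data. The point is the pattern `isna().mean()`, which gives the per-column fraction directly.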
permanent_crop_area
Numeric
Distinct count
155
Unique (%)
79.1%
Missing (%)
1.5%
Missing (n)
3
Infinite (%)
0.0%
Infinite (n)
0
Mean
839.45
Minimum
0
Maximum
22500
Zeros (%)
3.0%
Quantile statistics
Minimum
0
5-th percentile
0.575
Q1
14.35
Median
112
Q3
455.5
95-th percentile
4500
Maximum
22500
Range
22500
Interquartile range
441.15
Descriptive statistics
Standard deviation
2429.3
Coef of variation
2.8939
Kurtosis
42.919
Mean
839.45
MAD
1128.2
Skewness
5.9581
Sum
164530
Variance
5901600
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
6
3.0%
4.0
5
2.5%
3.0
4
2.0%
6.0
3
1.5%
2.0
3
1.5%
700.0
3
1.5%
100.0
3
1.5%
5.0
3
1.5%
1.0
3
1.5%
60.0
2
1.0%
Other values (144)
161
80.9%
(Missing)
3
1.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
6
3.0%
0.1
2
1.0%
0.4
1
0.5%
0.5
1
0.5%
0.6
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
6572.0
1
0.5%
6600.0
1
0.5%
13000.0
1
0.5%
16226.0
1
0.5%
22500.0
1
0.5%
rural_pop
Highly correlated
This variable is highly correlated with number_undernourished and should be ignored for analysis.
Correlation: 0.99226
rural_pop_access_drinking
Numeric
Distinct count
118
Unique (%)
64.5%
Missing (%)
8.0%
Missing (n)
16
Infinite (%)
0.0%
Infinite (n)
0
Mean
84.026
Minimum
28.2
Maximum
100
Zeros (%)
0.0%
Quantile statistics
Minimum
28.2
5-th percentile
45.65
Q1
72.95
Median
92.1
Q3
99.35
95-th percentile
100
Maximum
100
Range
71.8
Interquartile range
26.4
Descriptive statistics
Standard deviation
18.907
Coef of variation
0.22501
Kurtosis
0.37805
Mean
84.026
MAD
15.52
Skewness
-1.1836
Sum
15377
Variance
357.46
Memory size
1.6 KiB
Value
Count
Frequency (%)
100.0
39
19.6%
99.0
6
3.0%
98.3
3
1.5%
92.1
2
1.0%
67.3
2
1.0%
97.0
2
1.0%
73.8
2
1.0%
99.7
2
1.0%
95.1
2
1.0%
69.4
2
1.0%
Other values (107)
121
60.8%
(Missing)
16
8.0%
Minimum 5 values
Value
Count
Frequency (%)
28.2
1
0.5%
31.2
1
0.5%
31.5
1
0.5%
32.8
1
0.5%
35.3
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
99.6
1
0.5%
99.7
2
1.0%
99.8
1
0.5%
99.9
1
0.5%
100.0
39
19.6%
seasonal_variability
Numeric
Distinct count
43
Unique (%)
25.9%
Missing (%)
16.6%
Missing (n)
33
Infinite (%)
0.0%
Infinite (n)
0
Mean
2.2904
Minimum
0.3
Maximum
4.6
Zeros (%)
0.0%
Quantile statistics
Minimum
0.3
5-th percentile
0.625
Q1
1.525
Median
2.3
Q3
3.1
95-th percentile
3.875
Maximum
4.6
Range
4.3
Interquartile range
1.575
Descriptive statistics
Standard deviation
1.0288
Coef of variation
0.44917
Kurtosis
-0.87948
Mean
2.2904
MAD
0.87
Skewness
0.079612
Sum
380.2
Variance
1.0583
Memory size
1.6 KiB
Value
Count
Frequency (%)
2.5
8
4.0%
3.6
8
4.0%
2.1
8
4.0%
1.6
8
4.0%
1.9
7
3.5%
3.1
7
3.5%
3.5
7
3.5%
2.4
7
3.5%
1.0
6
3.0%
1.8
6
3.0%
Other values (32)
94
47.2%
(Missing)
33
16.6%
Minimum 5 values
Value
Count
Frequency (%)
0.3
1
0.5%
0.4
2
1.0%
0.5
1
0.5%
0.6
5
2.5%
0.7
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
4.0
3
1.5%
4.1
1
0.5%
4.2
1
0.5%
4.4
1
0.5%
4.6
2
1.0%
surface_entering
Highly correlated
This variable is highly correlated with accounted_flow and should be ignored for analysis.
Correlation: 0.98177
surface_groundwater_overlap
Highly correlated
This variable is highly correlated with overlap_surface_groundwater and should be ignored for analysis.
Correlation: 1
surface_inflow_secure_treaty
Numeric
Distinct count
16
Unique (%)
8.3%
Missing (%)
3.5%
Missing (n)
7
Infinite (%)
0.0%
Infinite (n)
0
Mean
2.1905
Minimum
0
Maximum
170.3
Zeros (%)
89.4%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
0
Q3
0
95-th percentile
2.6319
Maximum
170.3
Range
170.3
Interquartile range
0
Descriptive statistics
Standard deviation
14.253
Coef of variation
6.5067
Kurtosis
105.16
Mean
2.1905
MAD
4.1016
Skewness
9.5926
Sum
420.57
Variance
203.14
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
178
89.4%
2.208
1
0.5%
3.15
1
0.5%
65.65
1
0.5%
44.11
1
0.5%
170.3
1
0.5%
0.82
1
0.5%
0.05
1
0.5%
1.85
1
0.5%
16.09
1
0.5%
Other values (5)
5
2.5%
(Missing)
7
3.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
178
89.4%
0.05
1
0.5%
0.82
1
0.5%
1.85
1
0.5%
2.208
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
26.5
1
0.5%
44.11
1
0.5%
55.5
1
0.5%
65.65
1
0.5%
170.3
1
0.5%
surface_inflow_submit_no_treaty
Highly correlated
This variable is highly correlated with surface_entering and should be ignored for analysis.
Correlation: 0.99629
surface_inflow_submit_treaty
Highly correlated
This variable is highly correlated with surface_inflow_secure_treaty and should be ignored for analysis.
Correlation: 0.979
surface_outflow_secure_treaty
Numeric
Distinct count
17
Unique (%)
8.8%
Missing (%)
2.5%
Missing (n)
5
Infinite (%)
0.0%
Infinite (n)
0
Mean
2.22
Minimum
0
Maximum
170.3
Zeros (%)
89.9%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
0
Q3
0
95-th percentile
1.3058
Maximum
170.3
Range
170.3
Interquartile range
0
Descriptive statistics
Standard deviation
14.168
Coef of variation
6.382
Kurtosis
106.2
Mean
2.22
MAD
4.1863
Skewness
9.6002
Sum
430.69
Variance
200.74
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
179
89.9%
2.208
1
0.5%
54.86
1
0.5%
0.79
1
0.5%
0.82
1
0.5%
0.05
1
0.5%
170.3
1
0.5%
18.9
1
0.5%
25.87
1
0.5%
0.432
1
0.5%
Other values (6)
6
3.0%
(Missing)
5
2.5%
Minimum 5 values
Value
Count
Frequency (%)
0.0
179
89.9%
0.05
1
0.5%
0.335
1
0.5%
0.432
1
0.5%
0.79
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
26.5
1
0.5%
33.12
1
0.5%
54.86
1
0.5%
65.5
1
0.5%
170.3
1
0.5%
surface_outflow_submit_no_treaty
Numeric
Distinct count
105
Unique (%)
57.4%
Missing (%)
8.0%
Missing (n)
16
Infinite (%)
0.0%
Infinite (n)
0
Mean
55.924
Minimum
0
Maximum
1868
Zeros (%)
39.2%
Quantile statistics
Minimum
0
5-th percentile
0
Q1
0
Median
1.725
Q3
18.135
95-th percentile
193.68
Maximum
1868
Range
1868
Interquartile range
18.135
Descriptive statistics
Standard deviation
208.77
Coef of variation
3.7332
Kurtosis
43.517
Mean
55.924
MAD
84.158
Skewness
6.2161
Sum
10234
Variance
43586
Memory size
1.6 KiB
Value
Count
Frequency (%)
0.0
78
39.2%
3.0
2
1.0%
13.2
2
1.0%
48.0
1
0.5%
37.0
1
0.5%
160.0
1
0.5%
4.86
1
0.5%
9.655
1
0.5%
0.177
1
0.5%
6.145
1
0.5%
Other values (94)
94
47.2%
(Missing)
16
8.0%
Minimum 5 values
Value
Count
Frequency (%)
0.0
78
39.2%
0.015
1
0.5%
0.017
1
0.5%
0.057
1
0.5%
0.096
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
585.7
1
0.5%
718.8
1
0.5%
1142.0
1
0.5%
1375.0
1
0.5%
1868.0
1
0.5%
surface_outflow_submit_treaty
Highly correlated
This variable is highly correlated with surface_outflow_secure_treaty and should be ignored for analysis.
Correlation: 0.97841
surface_to_other_countries
Highly correlated
This variable is highly correlated with surface_outflow_submit_no_treaty and should be ignored for analysis.
Correlation: 0.99643
surface_total_external_renewable
Highly correlated
This variable is highly correlated with surface_inflow_submit_no_treaty and should be ignored for analysis.
Correlation: 0.97923
surface_water_produced
Highly correlated
This variable is highly correlated with irwr and should be ignored for analysis.
Correlation: 0.99953
total_area
Numeric
Distinct count
195
Unique (%)
99.0%
Missing (%)
1.0%
Missing (n)
2
Infinite (%)
0.0%
Infinite (n)
0
Mean
67954
Minimum
1
Maximum
1709800
Zeros (%)
0.0%
Quantile statistics
Minimum
1
5-th percentile
31.6
Q1
2207
Median
11760
Q3
51312
95-th percentile
235220
Maximum
1709800
Range
1709800
Interquartile range
49105
Descriptive statistics
Standard deviation
190910
Coef of variation
2.8094
Kurtosis
36.522
Mean
67954
MAD
85134
Skewness
5.6078
Sum
13387000
Variance
36445000000
Memory size
1.6 KiB
Value
Count
Frequency (%)
26.0
2
1.0%
75.0
2
1.0%
46.0
2
1.0%
25637.0
1
0.5%
60355.0
1
0.5%
44655.0
1
0.5%
2571.0
1
0.5%
126700.0
1
0.5%
54909.0
1
0.5%
11137.0
1
0.5%
Other values (184)
184
92.5%
(Missing)
2
1.0%
Minimum 5 values
Value
Count
Frequency (%)
1.0
1
0.5%
2.0
1
0.5%
3.0
1
0.5%
6.0
1
0.5%
16.0
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
851577.0
1
0.5%
960001.0
1
0.5%
983151.0
1
0.5%
998467.0
1
0.5%
1709825.0
1
0.5%
total_dam_capacity
Highly correlated
This variable is highly correlated with number_undernourished and should be ignored for analysis.
Correlation: 0.90598
total_flow_border_rivers
Highly correlated
This variable is highly correlated with accounted_flow_border_rivers and should be ignored for analysis.
Correlation: 0.96631
total_pop
Highly correlated
This variable is highly correlated with rural_pop and should be ignored for analysis.
Correlation: 0.96084
total_pop_access_drinking
Highly correlated
This variable is highly correlated with rural_pop_access_drinking and should be ignored for analysis.
Correlation: 0.94921
total_renewable
Highly correlated
This variable is highly correlated with surface_water_produced and should be ignored for analysis.
Correlation: 0.97515
total_renewable_groundwater
Highly correlated
This variable is highly correlated with surface_groundwater_overlap and should be ignored for analysis.
Correlation: 0.99191
total_renewable_per_capita
Highly correlated
This variable is highly correlated with irwr_per_capita and should be ignored for analysis.
Correlation: 0.97641
total_renewable_surface
Highly correlated
This variable is highly correlated with total_renewable and should be ignored for analysis.
Correlation: 0.99966
urban_pop
Highly correlated
This variable is highly correlated with total_pop and should be ignored for analysis.
Correlation: 0.95137
urban_pop_access_drinking
Numeric
Distinct count
88
Unique (%)
46.6%
Missing (%)
5.0%
Missing (n)
10
Infinite (%)
0.0%
Infinite (n)
0
Mean
94.787
Minimum
50.7
Maximum
100
Zeros (%)
0.0%
Quantile statistics
Minimum
50.7
5-th percentile
76.12
Q1
93.8
Median
98.1
Q3
99.9
95-th percentile
100
Maximum
100
Range
49.3
Interquartile range
6.1
Descriptive statistics
Standard deviation
8.4156
Coef of variation
0.088784
Kurtosis
7.5084
Mean
94.787
MAD
5.5776
Skewness
-2.6093
Sum
17915
Variance
70.822
Memory size
1.6 KiB
Value
Count
Frequency (%)
100.0
47
23.6%
99.7
7
3.5%
97.5
5
2.5%
99.6
5
2.5%
99.0
4
2.0%
98.9
4
2.0%
99.9
4
2.0%
99.5
3
1.5%
95.5
3
1.5%
97.0
3
1.5%
Other values (77)
104
52.3%
(Missing)
10
5.0%
Minimum 5 values
Value
Count
Frequency (%)
50.7
1
0.5%
58.4
1
0.5%
64.9
1
0.5%
66.0
1
0.5%
66.4
1
0.5%
Maximum 5 values
Value
Count
Frequency (%)
99.6
5
2.5%
99.7
7
3.5%
99.8
2
1.0%
99.9
4
2.0%
100.0
47
23.6%
water_total_external_renewable
Highly correlated
This variable is highly correlated with surface_total_external_renewable and should be ignored for analysis.
Correlation: 1
Sample
2013-2017
First five countries, shown with one row per variable and one column per country.
variable                            Afghanistan      Albania      Algeria      Andorra       Angola
accounted_flow                            19.00         3.30         0.39          NaN         0.40
accounted_flow_border_rivers                9.0          0.0          0.0          NaN          0.0
agg_to_gdp                              22.6000      22.0500      13.0500       0.5239          NaN
arable_land                              7771.0        615.6       7469.0          2.8       4900.0
avg_annual_rain_depth                     327.0       1485.0         89.0          NaN       1010.0
avg_annual_rain_vol                    213.5000      42.6900     212.0000       0.4724    1259.0000
cultivated_area                          7910.0        696.0       8439.0          2.8       5190.0
dam_capacity_per_capita                   61.76      1391.00       209.30          NaN       377.50
dependency_ratio                        28.7200      10.9300       3.5990          NaN       0.2695
flood_occurence                             3.7          2.7          2.8          3.3          1.7
gdp                                1.919944e+10 1.145560e+10 1.670000e+11 3.249101e+09 1.030000e+11
gdp_per_capita                            590.3       3954.0       4210.0      46106.0       4116.0
gender_inequal_index                     0.6934       0.2174       0.4131          NaN          NaN
groundwater_accounted_inflow               0.00         0.00         0.03          NaN         0.00
groundwater_accounted_outflow               NaN          0.0          0.1          NaN          0.0
groundwater_entering                       0.00         0.00         0.03          NaN         0.00
groundwater_produced                     10.650        6.200        1.487          NaN       58.000
groundwater_to_other_countries              NaN          0.0          0.1          NaN          0.0
human_dev_index                          0.4653       0.7328       0.7356       0.8446       0.5316
interannual_variability                     2.5          1.2          2.3          1.5          2.5
irrigation_potential                        NaN          NaN       1300.0          NaN       3700.0
irwr                                    47.1500      26.9000      11.2500       0.3156     148.0000
irwr_per_capita                          1450.0       9285.0        283.6       4479.0       5915.0
number_undernourished                    8600.0          NaN          NaN          NaN       3200.0
overlap_surface_groundwater                1.00         2.35         0.00          NaN        55.00
percent_cultivated                       12.120       24.210        3.543        5.957        4.163
percent_undernourished                     26.8          NaN          NaN          NaN         14.2
permanent_crop_area                       139.0         80.4        969.8          0.0        290.0
rural_pop                              23980.00      1062.00     10928.00         1.57     14970.00
rural_pop_access_drinking                  47.0         95.2         81.8        100.0         28.2
seasonal_variability                        2.5          2.4          1.9          1.6          3.1
surface_entering                          10.00         3.30         0.39          NaN         0.40
surface_groundwater_overlap                1.00         2.35         0.00          NaN        55.00
surface_inflow_secure_treaty                0.0          0.0          0.0          NaN          0.0
surface_inflow_submit_no_treaty           10.00         3.30         0.39          NaN         0.40
surface_inflow_submit_treaty                0.0          0.0          0.0          NaN          0.0
surface_outflow_secure_treaty              0.82         0.00         0.00          NaN         0.00
surface_outflow_submit_no_treaty          35.52        11.50         0.32          NaN       122.80
surface_outflow_submit_treaty               6.7          0.0          0.0          NaN          0.0
surface_to_other_countries                42.22        11.50         0.32          NaN       122.80
surface_total_external_renewable          18.18         3.30         0.39          NaN         0.40
surface_water_produced                    37.50        23.05         9.76          NaN       145.00
total_area                              65286.0       2875.0     238174.0         47.0     124670.0
total_dam_capacity                        2.009        4.030        8.304          NaN        9.445
total_flow_border_rivers                   33.4          0.0          0.0          NaN          0.0
total_pop                              32527.00      2897.00     39667.00        70.47     25022.00
total_pop_access_drinking                  55.3         95.1         83.6        100.0         49.0
total_renewable                         65.3300      30.2000      11.6700       0.3156     148.4000
total_renewable_groundwater              10.650        6.200        1.517          NaN       58.000
total_renewable_per_capita               2008.0      10425.0        294.2       4479.0       5931.0
total_renewable_surface                   55.68        26.35        10.15          NaN       145.40
urban_pop                                8547.0       1835.0      28739.0         68.9      10052.0
urban_pop_access_drinking                  78.2         94.9         84.3        100.0         75.4
water_total_external_renewable            18.18         3.30         0.42          NaN         0.40
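The Sample holds one value per country and variable for the 2013-2017 period, whereas the raw file loaded at the top of the notebook is in long form. The sketch below shows one way such a reshape could be done with `DataFrame.pivot`. It is an illustration under assumptions, not the notebook's `time_slice` helper: the column names `country`, `time_period`, and `value` are guesses (only `variable` is confirmed by the notebook), and the four values are copied from the Sample.

```python
import pandas as pd

# Long-form toy data; column names other than "variable" are assumed
long_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "time_period": ["2013-2017"] * 4,
    "variable": ["total_area", "total_pop", "total_area", "total_pop"],
    "value": [65286.0, 32527.0, 2875.0, 2897.0],
})

# Keep one time period, then spread variables into columns
recent = long_df[long_df["time_period"] == "2013-2017"]
wide = recent.pivot(index="country", columns="variable", values="value")
```

`pivot` raises an error if a country and variable pair appears more than once, which is a useful check that each period really has a single value per pair.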
Content source: cmawer/pycon-2017-eda-tutorial