In [18]:
# must go first 
%matplotlib inline 
%config InlineBackend.figure_format='retina'

# plotting
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_context("poster", font_scale=1.3)
import folium

# system packages 
import os, sys
import warnings
warnings.filterwarnings('ignore')

# basic wrangling 
import numpy as np
import pandas as pd

# eda tools 
import pivottablejs
import missingno as msno
import pandas_profiling

# File with functions from prior notebook(s)
sys.path.append('../../scripts/')
from aqua_helper import time_slice, country_slice, time_series, simple_regions, subregion, variable_slice

# Update matplotlib defaults to something nicer 
mpl_update = {'font.size':16,
              'xtick.labelsize':14,
              'ytick.labelsize':14,
              'figure.figsize':[12.0,8.0],
              'axes.prop_cycle':mpl.cycler(color=['#0055A7', '#2C3E4F', '#26C5ED', '#00cc66', '#D34100', '#FF9700','#091D32']),  # axes.color_cycle was removed in newer matplotlib
              'axes.labelsize':20,
              'axes.labelcolor':'#677385',
              'axes.titlesize':20,
              'lines.color':'#0055A7',
              'lines.linewidth':3,
              'text.color':'#677385'}
mpl.rcParams.update(mpl_update)

Our plan

Exploratory data analysis consists of the following major tasks. We present them linearly because each task makes little sense without the ones before it. In practice, however, you will constantly jump from step to step. You may want to do all the steps for a subset of the variables first. Often an observation will raise a question, and you'll branch off to answer it before returning to the main path of exhaustive EDA.

  1. Research the fields of the dataset
  2. Form hypotheses/develop investigation themes to explore
  3. Wrangle data
  4. Assess quality of data
  5. Profile data
  6. Explore each individual variable in the dataset
  7. Assess the relationship between each variable and the target
  8. Assess interactions between variables
  9. Explore data across many dimensions

Throughout the entire analysis you want to:

  • Capture a list of hypotheses and questions that come up for further exploration.
  • Record things to watch out for or be aware of in future analyses.
  • Show intermediate results to colleagues to get a fresh perspective, feedback, and domain knowledge. Don't do EDA in a bubble! Get feedback throughout, especially from people removed from the problem and/or with relevant domain knowledge.
  • Position visuals and results together. EDA relies on your natural pattern recognition abilities, so maximize what you'll find by putting visualizations and results in close proximity.

Write down the questions that results raise as you go, and keep updating your list of hypotheses.

Data wrangling


In [19]:
data = pd.read_csv('../../data/aquastat/aquastat.csv.gzip', compression='gzip')

In [20]:
data[['variable','variable_full']].drop_duplicates()


Out[20]:
variable variable_full
0 total_area Total area of the country (1000 ha)
576 arable_land Arable land area (1000 ha)
1152 permanent_crop_area Permanent crops area (1000 ha)
1728 cultivated_area Cultivated area (arable land + permanent crops...
2304 percent_cultivated % of total country area cultivated (%)
2880 total_pop Total population (1000 inhab)
3456 rural_pop Rural population (1000 inhab)
4032 urban_pop Urban population (1000 inhab)
4608 gdp Gross Domestic Product (GDP) (current US$)
5184 gdp_per_capita GDP per capita (current US$/inhab)
5760 agg_to_gdp Agriculture, value added to GDP (%)
6336 human_dev_index Human Development Index (HDI) [highest = 1] (-)
6912 gender_inequal_index Gender Inequality Index (GII) [equality = 0; i...
7488 percent_undernourished Prevalence of undernourishment (3-year average...
8064 number_undernourished Number of people undernourished (3-year averag...
8640 avg_annual_rain_depth Long-term average annual precipitation in dept...
9216 avg_annual_rain_vol Long-term average annual precipitation in volu...
9792 national_rainfall_index National Rainfall Index (NRI) (mm/year)
10368 surface_water_produced Surface water produced internally (10^9 m3/year)
10944 groundwater_produced Groundwater produced internally (10^9 m3/year)
11520 surface_groundwater_overlap Overlap between surface water and groundwater ...
12096 irwr Total internal renewable water resources (IRWR...
12672 irwr_per_capita Total internal renewable water resources per c...
13248 surface_entering Surface water: entering the country (total) (1...
13824 surface_inflow_submit_no_treaty Surface water: inflow not submitted to treatie...
14400 surface_inflow_submit_treaty Surface water: inflow submitted to treaties (1...
14976 surface_inflow_secure_treaty Surface water: inflow secured through treaties...
15552 total_flow_border_rivers Surface water: total flow of border rivers (10...
16128 accounted_flow_border_rivers Surface water: accounted flow of border rivers...
16704 accounted_flow Surface water: accounted inflow (10^9 m3/year)
17280 surface_to_other_countries Surface water: leaving the country to other co...
17856 surface_outflow_submit_no_treaty Surface water: outflow to other countries not ...
18432 surface_outflow_submit_treaty Surface water: outflow to other countries subm...
19008 surface_outflow_secure_treaty Surface water: outflow to other countries secu...
19584 surface_total_external_renewable Surface water: total external renewable (10^9 ...
20160 groundwater_entering Groundwater: entering the country (total) (10^...
20736 groundwater_accounted_inflow Groundwater: accounted inflow (10^9 m3/year)
21312 groundwater_to_other_countries Groundwater: leaving the country to other coun...
21888 groundwater_accounted_outflow Groundwater: accounted outflow to other countr...
22464 water_total_external_renewable Water resources: total external renewable (10^...
23040 total_renewable_surface Total renewable surface water (10^9 m3/year)
23616 total_renewable_groundwater Total renewable groundwater (10^9 m3/year)
24192 overlap_surface_groundwater Overlap: between surface water and groundwater...
24768 total_renewable Total renewable water resources (10^9 m3/year)
25344 dependency_ratio Dependency ratio (%)
25920 total_renewable_per_capita Total renewable water resources per capita (m3...
26496 exploitable_regular_renewable_surface Exploitable: regular renewable surface water (...
27072 exploitable_irregular_renewable_surface Exploitable: irregular renewable surface water...
27648 exploitable_total_renewable_surface Exploitable: total renewable surface water (10...
28224 exploitable_regular_renewable_groundwater Exploitable: regular renewable groundwater (10...
28800 exploitable_total Total exploitable water resources (10^9 m3/year)
29376 interannual_variability Interannual variability (WRI) (-)
29952 seasonal_variability Seasonal variability (WRI) (-)
30528 total_dam_capacity Total dam capacity (km3)
31104 dam_capacity_per_capita Dam capacity per capita (m3/inhab)
31680 irrigation_potential Irrigation potential (1000 ha)
32256 flood_occurence Flood occurrence (WRI) (-)
32832 total_pop_access_drinking Total population with access to safe drinking-...
33408 rural_pop_access_drinking Rural population with access to safe drinking-...
33984 urban_pop_access_drinking Urban population with access to safe drinking-...

Simplify regions


In [21]:
data.region = data.region.apply(lambda x: simple_regions[x])
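
The lookup above raises a KeyError as soon as a region is absent from simple_regions. A minimal sketch of a more forgiving variant follows, assuming simple_regions is a plain dict. Its real contents live in aqua_helper and are not shown here, so the toy mapping, region labels, and country names below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for aqua_helper.simple_regions (real contents not shown)
toy_regions = {"Northern America": "North America", "Western Europe": "Europe"}

toy = pd.DataFrame({"country": ["A", "B", "C"],
                    "region": ["Northern America", "Western Europe", "Unknown sea"]})

# Series.map returns NaN for keys missing from the dict instead of raising
mapped = toy["region"].map(toy_regions)

# List the region labels that the mapping does not cover
unmapped = toy.loc[mapped.isnull(), "region"].unique().tolist()

# Keep the original label where no simplified name exists
toy["region_simple"] = mapped.fillna(toy["region"])
```

Listing the unmapped labels first makes it a deliberate choice whether to extend the mapping or keep the original names.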

Data quality assessment and profiling

Before trying to understand what information is in the data, make sure you understand what the data represents and what's missing.

Overview

Basic things to do

  • Categorical: count, count distinct, assess unique values
  • Numerical: count, min, max
  • Spot-check random samples and samples that you are familiar with
  • Slice and dice
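
The basic checks above can be sketched with plain pandas. The toy frame and its column names are invented for illustration and are not the AQUASTAT data.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", None],
    "total_pop": [1500.0, 30.0, np.nan, 8.0],
})

# Categorical: count, count distinct, and the values themselves
cat_summary = {
    "count": toy["region"].count(),        # non-null entries only
    "distinct": toy["region"].nunique(),   # missing values excluded by default
    "values": toy["region"].value_counts().to_dict(),
}

# Numerical: count, min, max
num_summary = toy["total_pop"].agg(["count", "min", "max"])

# Spot-check a reproducible random sample of rows
sample = toy.sample(n=2, random_state=0)
```

Comparing count against the number of rows is the quickest first hint of how much is missing in each field.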

Main questions

  • What data isn't there?
  • Is the data that is there right?
  • Is the data being generated the way you think?

Helpful packages

  • missingno
  • pivottablejs
  • pandas_profiling

Example backlog

  • Assess the prevalence of missing data across all data fields, assess whether the missingness is random or systematic, and identify patterns in when data is missing
  • Identify any default values that imply missing data for a given field
  • Determine sampling strategy for quality assessment and initial EDA
  • For datetime data types, ensure consistent formatting and granularity of data, and perform sanity checks on all dates present in the data.
  • In cases where multiple fields capture the same or similar information, understand the relationships between them and assess the most effective field to use
  • Assess data type of each field
  • For discrete value types, ensure data formats are consistent
  • For discrete value types, assess the number of distinct values and the percent unique, and sanity-check the types of answers
  • For continuous data types, assess descriptive statistics and sanity-check the values
  • Understand relationships between timestamps and assess which to use in analysis
  • Slice data by device type, operating system, software version and ensure consistency in data across slices
  • For device or app data, identify version release dates and assess data for any changes in format or value around those dates
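
One backlog item, identifying default values that imply missing data, might be approached as in the sketch below. The -999 placeholder, the column name, and the 30% cutoff are all assumptions chosen for illustration. Nothing here says the AQUASTAT data uses such a placeholder.

```python
import pandas as pd

# Toy column where -999 stands in for "not measured" (made-up data)
rain = pd.Series([120.0, -999.0, 85.5, -999.0, -999.0, 240.0, 85.5], name="rain_mm")

# Share of rows taken by each value, most frequent first
freq = rain.value_counts(normalize=True)

# Flag values that occupy a suspiciously large share of rows;
# the 30% cutoff is arbitrary and only for illustration
suspects = freq[freq > 0.30].index.tolist()

# Once confirmed as a placeholder, convert it to a real missing value
rain_clean = rain.replace(-999.0, float("nan"))
```

A frequent value is only a suspect. Zeros, for instance, can be perfectly legitimate measurements, so confirm with the data documentation before replacing anything.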

Missing data

What data isn’t there?

Questions to be considering

  • Are there systematic reasons for missing data?
  • Are there fields that are always missing at the same time?
  • Is there information in what data is missing?
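
Whether fields go missing together can be checked numerically by correlating missingness indicators. This is a minimal sketch on made-up data using only pandas. The column names are borrowed from this dataset but the values are not.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "gdp":            [1.0, np.nan, 3.0, np.nan, 5.0],
    "gdp_per_capita": [0.1, np.nan, 0.3, np.nan, 0.5],
    "total_pop":      [10.0, 20.0, np.nan, 40.0, 50.0],
})

# True where a value is missing
is_missing = toy.isnull()

# Correlation between missingness indicators:
# +1 means two fields are always missing together
nullity_corr = is_missing.astype(int).corr()

# Number of rows where both GDP fields are missing at once
both_missing = (is_missing["gdp"] & is_missing["gdp_per_capita"]).sum()
```

A correlation near +1 suggests the two fields share a source or are derived from one another, which is itself information about how the data is generated.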

The missingno package provides a number of functions for visualizing what data is missing within a dataset.

By variable


In [22]:
recent = time_slice(data, '2013-2017')
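
The time_slice helper comes from a prior notebook and is not shown here. Assuming the raw data is in long format with country, time_period, variable, and value columns (the columns other cells rely on), a helper of this kind might look roughly like the sketch below. The name time_slice_sketch and the toy data are hypothetical, and this is not the actual implementation.

```python
import numpy as np
import pandas as pd

def time_slice_sketch(df, time_period):
    """Hypothetical helper: one row per country, one column per variable."""
    subset = df[df["time_period"] == time_period]
    wide = subset.pivot(index="country", columns="variable", values="value")
    wide.columns.name = time_period
    return wide

# Toy long-format data, made up for illustration
toy_long = pd.DataFrame({
    "country": ["A", "A", "B", "B", "A"],
    "time_period": ["2013-2017"] * 4 + ["2008-2012"],
    "variable": ["gdp", "total_pop", "gdp", "total_pop", "gdp"],
    "value": [1.0, 10.0, np.nan, 20.0, 0.5],
})

toy_recent = time_slice_sketch(toy_long, "2013-2017")
```

The wide layout is what the missingno plots expect: each column is a field and each row an observation, so gaps show up per variable.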

In [23]:
msno.bar(recent, labels=True)


Discussion: What questions does this figure bring up?

Add these to your list of questions!


In [24]:
msno.matrix(recent, labels=True)


Discussion: What additional information does this provide or what additional questions does it suggest?

Deep dive: exploitable variables

"Exploitable" variables are missing for most countries.

Question to consider: Does this happen in each time period?


In [25]:
msno.matrix(variable_slice(data, 'exploitable_total'), inline=False, sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing total exploitable water resources data across countries and time periods \n \n \n \n');


Total exploitable water resources is reported for only a fraction of the countries, and only a very small fraction of those have data for the most recent time period. Either a) the data has not been reported yet but will be at some point, b) most countries have stopped reporting this variable, or c) we lack the domain knowledge to understand what's happening.

We are going to remove the exploitable variables from future analysis because so few data points can cause a lot of problems.


In [26]:
data = data.loc[~data.variable.str.contains('exploitable'),:]

Deep dive: National rainfall index


In [27]:
msno.matrix(variable_slice(data, 'national_rainfall_index'), 
            inline=False, sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing national rainfall index data across countries and time periods \n \n \n \n');


National rainfall index is no longer reported on after 2002.
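
A claim like this can be confirmed numerically as well as read off the plot. The sketch below finds the last period with any reported value, using made-up long-format data with the same column names as this dataset.

```python
import numpy as np
import pandas as pd

# Toy long-format data, made up for illustration
toy_long = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "time_period": ["1998-2002", "2003-2007", "2008-2012"] * 2,
    "variable": ["national_rainfall_index"] * 6,
    "value": [900.0, np.nan, np.nan, 1200.0, np.nan, np.nan],
})

# Number of countries with a non-null value in each period
reported = (toy_long[toy_long["variable"] == "national_rainfall_index"]
            .groupby("time_period")["value"]
            .count())

# Periods sort correctly as strings because they share the 'YYYY-YYYY' format
last_period = reported[reported > 0].index.max()
```

Running the same check for every variable gives a quick list of fields that were discontinued, which is useful before deciding what to drop.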


In [28]:
data = data.loc[~(data.variable=='national_rainfall_index')]

By country

Let's look at North America only.


In [29]:
north_america = subregion(data, 'North America')

In [30]:
msno.bar(msno.nullity_sort(time_slice(north_america, '2013-2017'), sort='descending').T, inline=False)
plt.title('Fraction of fields complete by country for North America \n \n');


Question: Is there any pattern in the countries with most missing data?

Question: What are potential reasons for missing data? What can we check?
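
To follow up on these questions, it can help to rank countries by completeness and list exactly which fields the least complete one lacks. This is a minimal sketch on a made-up wide table with placeholder country names.

```python
import numpy as np
import pandas as pd

# Toy wide table: rows are countries, columns are variables (made-up values)
toy_wide = pd.DataFrame(
    {"gdp": [1.0, np.nan, 3.0],
     "total_pop": [10.0, 20.0, 30.0],
     "irwr": [np.nan, np.nan, 7.0]},
    index=["A", "B", "C"],
)

# Fraction of fields filled in for each country, least complete first
completeness = toy_wide.notnull().mean(axis=1).sort_values()

# Names of the fields missing for the least complete country
worst = completeness.index[0]
missing_fields = toy_wide.columns[toy_wide.loc[worst].isnull()].tolist()
```

Joining the completeness scores to attributes such as population or area is one way to test whether small countries are the ones with the most gaps.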


In [31]:
folium.Map(location=[18.1160128,-77.8364762], tiles="CartoDB positron", 
           zoom_start=5, width=1200, height=600)


Out[31]:

Spot-check what data is missing for the Bahamas to get a more granular understanding.


In [32]:
msno.nullity_filter(country_slice(data, 'Bahamas').T, filter='bottom', p=0.1)


Out[32]:
Bahamas dam_capacity_per_capita flood_occurence gender_inequal_index groundwater_produced interannual_variability irrigation_potential number_undernourished overlap_surface_groundwater percent_undernourished seasonal_variability surface_groundwater_overlap surface_water_produced total_dam_capacity total_renewable_groundwater total_renewable_surface
time_period
1958-1962 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1963-1967 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1968-1972 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1973-1977 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1978-1982 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1983-1987 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1988-1992 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-1997 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1998-2002 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2003-2007 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008-2012 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-2017 NaN NaN 0.2979 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

To do: Choose another region to assess for missing data.

By country for a single variable


In [33]:
# JSON with coordinates for country boundaries 
geo = r'../../data/aquastat/world.json'

# 1 where the value is present, 0 where it is missing
null_data = recent['agg_to_gdp'].notnull()*1
# named null_map to avoid shadowing the built-in map()
null_map = folium.Map(location=[48, -102], zoom_start=2)
null_map.choropleth(geo_path=geo, 
                    data=null_data,
                    columns=['country', 'agg_to_gdp'],
                    key_on='feature.properties.name', reset=True,
                    fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
                    legend_name='Missing agricultural contribution to GDP data 2013-2017')
null_map


Out[33]:

Question: What does the pale green mean compared to the darker green? (E.g. Greenland versus Canada)

Now let's functionalize so we can look at other variables geospatially.


In [34]:
def plot_null_map(df, time_period, variable, 
                  legend_name=None):
    """Map which countries report `variable` in `time_period` (1 = present, 0 = missing)."""
    # JSON with coordinates for country boundaries
    geo = r'../../data/aquastat/world.json'
    
    ts = time_slice(df, time_period).reset_index().copy()
    ts[variable]=ts[variable].notnull()*1
    # named null_map to avoid shadowing the built-in map()
    null_map = folium.Map(location=[48, -102], zoom_start=2)
    null_map.choropleth(geo_path=geo, 
                        data=ts,
                        columns=['country', variable],
                        key_on='feature.properties.name', reset=True,
                        fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
                        legend_name=legend_name if legend_name else variable)
    return null_map

In [35]:
plot_null_map(data, '2013-2017', 'number_undernourished', 'Number undernourished is missing')


Out[35]:

Question: Are there any patterns in missing data? Any questions that come to mind for further investigation?

To do: Look at other variables

Over time


In [36]:
fig, ax = plt.subplots(figsize=(16, 16));
sns.heatmap(data.groupby(['time_period','variable']).value.count().unstack().T , ax=ax);
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Variable');
plt.title('Number of countries with data reported for each variable over time');
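
The table behind this heatmap is built by a groupby, count, and unstack chain. Here is a small sketch of what each step does, on made-up long-format data with the same column names.

```python
import numpy as np
import pandas as pd

# Toy long-format data, made up for illustration
toy_long = pd.DataFrame({
    "country": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "time_period": ["2008-2012", "2008-2012", "2013-2017", "2013-2017"] * 2,
    "variable": ["gdp"] * 4 + ["irwr"] * 4,
    "value": [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan],
})

counts = (toy_long.groupby(["time_period", "variable"])["value"]
          .count()     # count() skips NaN, so this is "countries reporting"
          .unstack()   # variables become columns
          .T)          # variables as rows, time periods as columns
```

Because count() ignores missing values, a cell of 0 means no country reported that variable in that period, even though rows for it exist in the long table.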


Profiling

Before trying to understand what information is in the data, make sure you understand what the data represents.

Sanity check! Do the values make sense?

Things to do:

  • Categorical: count, count distinct, assess unique values
  • Numerical: count, min, max
  • Spot-check random samples
  • Slice and dice

Questions to consider:

  • Are there frequent values that are default values?
  • Are there fields that represent the same information?
  • What timestamp should you use?
  • Are there numerical values reported as strings?
  • Are there special values?
  • Are there variables that are numerical but really should be categorical?
  • Is data consistent across different operating systems, device type, platforms, countries?
  • Are there any direct relationships between fields (e.g. a value of x always implies a specific value of y)?
  • What are the units of measurement? Are they consistent?
  • Is data consistent across the population and time?
  • Are there obvious changes in reported data around the time of important events that affect data generation (e.g. version release)?
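
One of the questions above, whether numerical values are reported as strings, can be checked by coercing a column and inspecting what fails to parse. This is a minimal sketch on a made-up column. The name reported_value and its contents are illustrative only.

```python
import pandas as pd

# Toy column of numbers stored as text (made-up values)
raw = pd.Series(["12.5", "7", "n/a", "1,200", None], name="reported_value")

# Strip thousands separators, then coerce anything unparseable to NaN
as_number = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Entries that were present but could not be parsed deserve a closer look
unparsed = raw[as_number.isnull() & raw.notnull()].tolist()
```

The unparsed entries are often the special values the list asks about, such as text placeholders for missing data.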

This stage really morphs into the univariate exploration that comes next: you often dive into each variable one by one, first understanding it, then exploring it, then checking that understanding again. We can, however, do some initial profiling with a few handy Python packages.

pivottablejs


In [37]:
pivottablejs.pivot_ui(time_slice(data, '2013-2017'))


Out[37]:

pandas_profiling


In [38]:
pandas_profiling.ProfileReport(time_slice(data, '2013-2017'))


Out[38]:

Overview

Dataset info

Number of variables 55
Number of observations 199
Total Missing (%) 6.4%
Total size in memory 85.6 KiB
Average record size in memory 440.4 B

Variables types

Numeric 30
Categorical 0
Date 0
Text (Unique) 1
Rejected 24

Warnings

  • accounted_flow has 71 / 35.7% zeros
  • accounted_flow has 7 / 3.5% missing values Missing
  • accounted_flow_border_rivers has 150 / 75.4% zeros
  • accounted_flow_border_rivers has 7 / 3.5% missing values Missing
  • agg_to_gdp has 32 / 16.1% missing values Missing
  • arable_land has 3 / 1.5% zeros
  • arable_land has 3 / 1.5% missing values Missing
  • avg_annual_rain_depth has 18 / 9.0% missing values Missing
  • avg_annual_rain_vol has 16 / 8.0% missing values Missing
  • cultivated_area is highly correlated with arable_land (ρ = 0.99585) Rejected
  • dam_capacity_per_capita has 75 / 37.7% missing values Missing
  • dependency_ratio has 68 / 34.2% zeros
  • dependency_ratio has 7 / 3.5% missing values Missing
  • flood_occurence has 7 / 3.5% zeros
  • flood_occurence has 23 / 11.6% missing values Missing
  • gdp has 11 / 5.5% missing values Missing
  • gdp_per_capita has 11 / 5.5% missing values Missing
  • gender_inequal_index has 43 / 21.6% missing values Missing
  • groundwater_accounted_inflow has 178 / 89.4% zeros
  • groundwater_accounted_inflow has 7 / 3.5% missing values Missing
  • groundwater_accounted_outflow has 140 / 70.4% zeros
  • groundwater_accounted_outflow has 45 / 22.6% missing values Missing
  • groundwater_entering has 179 / 89.9% zeros
  • groundwater_entering has 7 / 3.5% missing values Missing
  • groundwater_produced has 2 / 1.0% zeros
  • groundwater_produced has 29 / 14.6% missing values Missing
  • groundwater_to_other_countries is highly correlated with groundwater_accounted_outflow (ρ = 1) Rejected
  • human_dev_index has 12 / 6.0% missing values Missing
  • interannual_variability has 33 / 16.6% missing values Missing
  • irrigation_potential has 88 / 44.2% missing values Missing
  • irwr is highly correlated with avg_annual_rain_vol (ρ = 0.96167) Rejected
  • irwr_per_capita has 18 / 9.0% missing values Missing
  • number_undernourished is highly correlated with irrigation_potential (ρ = 0.96711) Rejected
  • overlap_surface_groundwater is highly correlated with groundwater_produced (ρ = 0.9919) Rejected
  • percent_cultivated has 3 / 1.5% missing values Missing
  • percent_undernourished has 116 / 58.3% missing values Missing
  • permanent_crop_area has 6 / 3.0% zeros
  • permanent_crop_area has 3 / 1.5% missing values Missing
  • rural_pop is highly correlated with number_undernourished (ρ = 0.99226) Rejected
  • rural_pop_access_drinking has 16 / 8.0% missing values Missing
  • seasonal_variability has 33 / 16.6% missing values Missing
  • surface_entering is highly correlated with accounted_flow (ρ = 0.98177) Rejected
  • surface_groundwater_overlap is highly correlated with overlap_surface_groundwater (ρ = 1) Rejected
  • surface_inflow_secure_treaty has 178 / 89.4% zeros
  • surface_inflow_secure_treaty has 7 / 3.5% missing values Missing
  • surface_inflow_submit_no_treaty is highly correlated with surface_entering (ρ = 0.99629) Rejected
  • surface_inflow_submit_treaty is highly correlated with surface_inflow_secure_treaty (ρ = 0.979) Rejected
  • surface_outflow_secure_treaty has 179 / 89.9% zeros
  • surface_outflow_secure_treaty has 5 / 2.5% missing values Missing
  • surface_outflow_submit_no_treaty has 78 / 39.2% zeros
  • surface_outflow_submit_no_treaty has 16 / 8.0% missing values Missing
  • surface_outflow_submit_treaty is highly correlated with surface_outflow_secure_treaty (ρ = 0.97841) Rejected
  • surface_to_other_countries is highly correlated with surface_outflow_submit_no_treaty (ρ = 0.99643) Rejected
  • surface_total_external_renewable is highly correlated with surface_inflow_submit_no_treaty (ρ = 0.97923) Rejected
  • surface_water_produced is highly correlated with irwr (ρ = 0.99953) Rejected
  • total_area has 2 / 1.0% missing values Missing
  • total_dam_capacity is highly correlated with number_undernourished (ρ = 0.90598) Rejected
  • total_flow_border_rivers is highly correlated with accounted_flow_border_rivers (ρ = 0.96631) Rejected
  • total_pop is highly correlated with rural_pop (ρ = 0.96084) Rejected
  • total_pop_access_drinking is highly correlated with rural_pop_access_drinking (ρ = 0.94921) Rejected
  • total_renewable is highly correlated with surface_water_produced (ρ = 0.97515) Rejected
  • total_renewable_groundwater is highly correlated with surface_groundwater_overlap (ρ = 0.99191) Rejected
  • total_renewable_per_capita is highly correlated with irwr_per_capita (ρ = 0.97641) Rejected
  • total_renewable_surface is highly correlated with total_renewable (ρ = 0.99966) Rejected
  • urban_pop is highly correlated with total_pop (ρ = 0.95137) Rejected
  • urban_pop_access_drinking has 10 / 5.0% missing values Missing
  • water_total_external_renewable is highly correlated with surface_total_external_renewable (ρ = 1) Rejected
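
The "Rejected" warnings above come from pairwise correlations. A rough sketch of that kind of check with plain pandas follows, assuming a Pearson correlation and a 0.9 cutoff. Both are assumptions for illustration, and the report's exact rule is not shown here. The values are made up, and flood_score is a hypothetical column.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "arable_land":     [10.0, 20.0, 30.0, 40.0, 50.0],
    "cultivated_area": [11.0, 21.0, 33.0, 41.0, 52.0],
    "flood_score":     [3.0, 1.0, 4.0, 1.0, 5.0],
})

threshold = 0.9  # illustrative cutoff, not taken from the report
corr = toy.corr().abs()

# Walk the columns in order; flag a column if it correlates strongly
# with any earlier column that has been kept
kept, rejected = [], {}
for col in corr.columns:
    partner = next((k for k in kept if corr.loc[col, k] > threshold), None)
    if partner is None:
        kept.append(col)
    else:
        rejected[col] = partner
```

A flagged pair is not necessarily redundant. Several pairs in the warnings, such as number_undernourished with irrigation_potential, may simply both scale with country size, so review each rejection before dropping a field.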

Variables

accounted_flow
Numeric

Distinct count 113
Unique (%) 58.9%
Missing (%) 3.5%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 63.557
Minimum 0
Maximum 2986
Zeros (%) 35.7%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 3
Q3 26.065
95-th percentile 270.63
Maximum 2986
Range 2986
Interquartile range 26.065

Descriptive statistics

Standard deviation 250.32
Coef of variation 3.9386
Kurtosis 99.577
Mean 63.557
MAD 93.89
Skewness 9.0735
Sum 12203
Variance 62662
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 71 35.7%
 
3.0 3 1.5%
 
2.0 3 1.5%
 
11.0 3 1.5%
 
80.0 2 1.0%
 
10.15 2 1.0%
 
0.3 2 1.0%
 
1.0 2 1.0%
 
53.32 1 0.5%
 
524.7 1 0.5%
 
Other values (102) 102 51.3%
 
(Missing) 7 3.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 71 35.7%
 
0.015 1 0.5%
 
0.038 1 0.5%
 
0.096 1 0.5%
 
0.165 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
584.2 1 0.5%
 
610.0 1 0.5%
 
635.2 1 0.5%
 
1122.0 1 0.5%
 
2986.0 1 0.5%
 

accounted_flow_border_rivers
Numeric

Distinct count 41
Unique (%) 21.4%
Missing (%) 3.5%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 8.3438
Minimum 0
Maximum 558
Zeros (%) 75.4%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 29.198
Maximum 558
Range 558
Interquartile range 0

Descriptive statistics

Standard deviation 46.811
Coef of variation 5.6103
Kurtosis 103.31
Mean 8.3438
MAD 14.415
Skewness 9.4653
Sum 1602
Variance 2191.3
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 150 75.4%
 
10.15 2 1.0%
 
1.45 2 1.0%
 
11.0 2 1.0%
 
3.815 1 0.5%
 
7.74 1 0.5%
 
34.33 1 0.5%
 
25.0 1 0.5%
 
75.0 1 0.5%
 
1.25 1 0.5%
 
Other values (30) 30 15.1%
 
(Missing) 7 3.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 150 75.4%
 
0.035 1 0.5%
 
0.038 1 0.5%
 
0.14 1 0.5%
 
0.432 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
84.05 1 0.5%
 
110.0 1 0.5%
 
197.5 1 0.5%
 
214.1 1 0.5%
 
558.0 1 0.5%
 

agg_to_gdp
Numeric

Distinct count 165
Unique (%) 98.8%
Missing (%) 16.1%
Missing (n) 32
Infinite (%) 0.0%
Infinite (n) 0
Mean 12.545
Minimum 0.0349
Maximum 59.23
Zeros (%) 0.0%

Quantile statistics

Minimum 0.0349
5-th percentile 0.62919
Q1 2.8885
Median 8.498
Q3 19.465
95-th percentile 36.168
Maximum 59.23
Range 59.195
Interquartile range 16.576

Descriptive statistics

Standard deviation 11.997
Coef of variation 0.95628
Kurtosis 1.5798
Mean 12.545
MAD 9.5593
Skewness 1.3303
Sum 2095
Variance 143.92
Memory size 1.6 KiB
Value Count Frequency (%)  
40.97 2 1.0%
 
2.384 2 1.0%
 
32.94 2 1.0%
 
1.652 1 0.5%
 
10.12 1 0.5%
 
4.725 1 0.5%
 
23.66 1 0.5%
 
24.09 1 0.5%
 
6.574 1 0.5%
 
17.39 1 0.5%
 
Other values (154) 154 77.4%
 
(Missing) 32 16.1%
 

Minimum 5 values

Value Count Frequency (%)  
0.0349 1 0.5%
 
0.1363 1 0.5%
 
0.1815 1 0.5%
 
0.3004 1 0.5%
 
0.4083 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
42.89 1 0.5%
 
43.92 1 0.5%
 
47.52 1 0.5%
 
52.39 1 0.5%
 
59.23 1 0.5%
 

arable_land
Numeric

Distinct count 173
Unique (%) 88.3%
Missing (%) 1.5%
Missing (n) 3
Infinite (%) 0.0%
Infinite (n) 0
Mean 7229.3
Minimum 0
Maximum 156360
Zeros (%) 1.5%

Quantile statistics

Minimum 0
5-th percentile 1.9
Q1 120
Median 1204.5
Q3 4671.5
95-th percentile 30963
Maximum 156360
Range 156360
Interquartile range 4551.5

Descriptive statistics

Standard deviation 21029
Coef of variation 2.9089
Kurtosis 31.537
Mean 7229.3
MAD 9460.8
Skewness 5.3616
Sum 1416900
Variance 442230000
Memory size 1.6 KiB
Value Count Frequency (%)  
2.0 4 2.0%
 
1.0 4 2.0%
 
3.0 3 1.5%
 
3800.0 3 1.5%
 
0.0 3 1.5%
 
5.0 3 1.5%
 
2350.0 2 1.0%
 
120.0 2 1.0%
 
800.0 2 1.0%
 
300.0 2 1.0%
 
Other values (162) 168 84.4%
 
(Missing) 3 1.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 3 1.5%
 
0.08 1 0.5%
 
0.56 1 0.5%
 
1.0 4 2.0%
 
1.6 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
80017.0 1 0.5%
 
106298.0 1 0.5%
 
123122.0 1 0.5%
 
154605.0 1 0.5%
 
156360.0 1 0.5%
 

avg_annual_rain_depth
Numeric

Distinct count 174
Unique (%) 96.1%
Missing (%) 9.0%
Missing (n) 18
Infinite (%) 0.0%
Infinite (n) 0
Mean 1166
Minimum 51
Maximum 3240
Zeros (%) 0.0%

Quantile statistics

Minimum 51
5-th percentile 121
Q1 562
Median 1030
Q3 1705
95-th percentile 2702
Maximum 3240
Range 3189
Interquartile range 1143

Descriptive statistics

Standard deviation 800.11
Coef of variation 0.68622
Kurtosis -0.40545
Mean 1166
MAD 662.11
Skewness 0.65639
Sum 211040
Variance 640180
Memory size 1.6 KiB
Value Count Frequency (%)  
250.0 2 1.0%
 
788.0 2 1.0%
 
900.0 2 1.0%
 
1500.0 2 1.0%
 
1274.0 2 1.0%
 
282.0 2 1.0%
 
2200.0 2 1.0%
 
228.0 2 1.0%
 
657.0 1 0.5%
 
241.0 1 0.5%
 
Other values (163) 163 81.9%
 
(Missing) 18 9.0%
 

Minimum 5 values

Value Count Frequency (%)  
51.0 1 0.5%
 
56.0 1 0.5%
 
59.0 1 0.5%
 
74.0 1 0.5%
 
78.0 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
2928.0 1 0.5%
 
3028.0 1 0.5%
 
3142.0 1 0.5%
 
3200.0 1 0.5%
 
3240.0 1 0.5%
 

avg_annual_rain_vol
Numeric

Distinct count 181
Unique (%) 98.9%
Missing (%) 8.0%
Missing (n) 16
Infinite (%) 0.0%
Infinite (n) 0
Mean 595.32
Minimum 0.064
Maximum 14995
Zeros (%) 0.0%

Quantile statistics

Minimum 0.064
5-th percentile 0.86507
Q1 33.095
Median 127.8
Q3 434.5
95-th percentile 3427.4
Maximum 14995
Range 14995
Interquartile range 401.4

Descriptive statistics

Standard deviation 1590.4
Coef of variation 2.6715
Kurtosis 41.211
Mean 595.32
MAD 734.91
Skewness 5.7112
Sum 108940
Variance 2529400
Memory size 1.6 KiB
Value Count Frequency (%)  
1259.0 2 1.0%
 
220.8 2 1.0%
 
297.2 2 1.0%
 
279.2 1 0.5%
 
3618.0 1 0.5%
 
199.8 1 0.5%
 
25.86 1 0.5%
 
1415.0 1 0.5%
 
7030.0 1 0.5%
 
513.1 1 0.5%
 
Other values (170) 170 85.4%
 
(Missing) 16 8.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.064 1 0.5%
 
0.1792 1 0.5%
 
0.371 1 0.5%
 
0.4532 1 0.5%
 
0.4724 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
5362.0 1 0.5%
 
6192.0 1 0.5%
 
7030.0 1 0.5%
 
7865.0 1 0.5%
 
14995.0 1 0.5%
 

country
Categorical, Unique

First 3 values
Haiti
Djibouti
Antigua and Barbuda
Last 3 values
Croatia
Nepal
Burkina Faso

First 10 values

Value Count Frequency (%)  
Afghanistan 1 0.5%
 
Albania 1 0.5%
 
Algeria 1 0.5%
 
Andorra 1 0.5%
 
Angola 1 0.5%
 

Last 10 values

Value Count Frequency (%)  
Venezuela (Bolivarian Republic of) 1 0.5%
 
Viet Nam 1 0.5%
 
Yemen 1 0.5%
 
Zambia 1 0.5%
 
Zimbabwe 1 0.5%
 

cultivated_area
Highly correlated

This variable is highly correlated with arable_land and should be ignored for analysis

Correlation 0.99585

dam_capacity_per_capita
Numeric

Distinct count 124
Unique (%) 100.0%
Missing (%) 37.7%
Missing (n) 75
Infinite (%) 0.0%
Infinite (n) 0
Mean 1580.3
Minimum 0.1873
Maximum 36832
Zeros (%) 0.0%

Quantile statistics

Minimum 0.1873
5-th percentile 3.2488
Q1 71.072
Median 328.5
Q3 1205.5
95-th percentile 5561.6
Maximum 36832
Range 36832
Interquartile range 1134.4

Descriptive statistics

Standard deviation 4130.5
Coef of variation 2.6138
Kurtosis 48.996
Mean 1580.3
MAD 1934
Skewness 6.4315
Sum 195950
Variance 17061000
Memory size 1.6 KiB
Value Count Frequency (%)  
633.1 2 1.0%
 
65.35 1 0.5%
 
3370.0 1 0.5%
 
61.76 1 0.5%
 
55.49 1 0.5%
 
1321.0 1 0.5%
 
4.704 1 0.5%
 
1055.0 1 0.5%
 
38.97 1 0.5%
 
2050.0 1 0.5%
 
Other values (113) 113 56.8%
 
(Missing) 75 37.7%
 

Minimum 5 values

Value Count Frequency (%)  
0.1873 1 0.5%
 
0.6846 1 0.5%
 
1.948 1 0.5%
 
1.969 1 0.5%
 
2.16 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
6386.0 1 0.5%
 
6405.0 1 0.5%
 
7001.0 1 0.5%
 
23414.0 1 0.5%
 
36832.0 1 0.5%
 

dependency_ratio
Numeric

Distinct count 125
Unique (%) 65.1%
Missing (%) 3.5%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 22.819
Minimum 0
Maximum 100
Zeros (%) 34.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 6.183
Q3 40.8
95-th percentile 87.299
Maximum 100
Range 100
Interquartile range 40.8

Descriptive statistics

Standard deviation 29.869
Coef of variation 1.309
Kurtosis 0.010321
Mean 22.819
MAD 25.123
Skewness 1.1509
Sum 4381.2
Variance 892.16
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 68 34.2%
 
30.52 2 1.0%
 
80.39 1 0.5%
 
4.123 1 0.5%
 
14.63 1 0.5%
 
7.407 1 0.5%
 
24.49 1 0.5%
 
21.77 1 0.5%
 
5.769 1 0.5%
 
64.27 1 0.5%
 
Other values (114) 114 57.3%
 
(Missing) 7 3.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 68 34.2%
 
0.2691 1 0.5%
 
0.2695 1 0.5%
 
0.7496 1 0.5%
 
0.7854 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
96.49 1 0.5%
 
96.55 1 0.5%
 
96.91 1 0.5%
 
97.0 1 0.5%
 
100.0 1 0.5%
 

flood_occurence
Numeric

Distinct count 41
Unique (%) 23.3%
Missing (%) 11.6%
Missing (n) 23
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.6955
Minimum 0
Maximum 4.9
Zeros (%) 3.5%

Quantile statistics

Minimum 0
5-th percentile 0.35
Q1 2.3
Median 2.9
Q3 3.325
95-th percentile 3.825
Maximum 4.9
Range 4.9
Interquartile range 1.025

Descriptive statistics

Standard deviation 0.99495
Coef of variation 0.36912
Kurtosis 0.96957
Mean 2.6955
MAD 0.75212
Skewness -1.0132
Sum 474.4
Variance 0.98992
Memory size 1.6 KiB
Value Count Frequency (%)  
3.6 12 6.0%
 
3.0 11 5.5%
 
3.3 11 5.5%
 
3.1 11 5.5%
 
2.5 10 5.0%
 
2.9 8 4.0%
 
3.5 8 4.0%
 
2.8 8 4.0%
 
2.7 7 3.5%
 
0.0 7 3.5%
 
Other values (30) 83 41.7%
 
(Missing) 23 11.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 7 3.5%
 
0.1 1 0.5%
 
0.2 1 0.5%
 
0.4 1 0.5%
 
0.6 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
3.9 5 2.5%
 
4.0 1 0.5%
 
4.5 1 0.5%
 
4.7 1 0.5%
 
4.9 1 0.5%
 

gdp
Numeric

Distinct count 185
Unique (%) 98.4%
Missing (%) 5.5%
Missing (n) 11
Infinite (%) 0.0%
Infinite (n) 0
Mean 384890000000
Minimum 37860000
Maximum 17900000000000
Zeros (%) 0.0%

Quantile statistics

Minimum 37860000
5-th percentile 754760000
Q1 7675800000
Median 30475000000
Q3 192500000000
95-th percentile 1490500000000
Maximum 17900000000000
Range 17900000000000
Interquartile range 184820000000

Descriptive statistics

Standard deviation 1604200000000
Coef of variation 4.1681
Kurtosis 85.899
Mean 384890000000
MAD 549640000000
Skewness 8.7022
Sum 72359000000000
Variance 2.5736e+24
Memory size 1.6 KiB
Value Count Frequency (%)  
195000000000.0 2 1.0%
 
167000000000.0 2 1.0%
 
296000000000.0 2 1.0%
 
292000000000.0 2 1.0%
 
2.85e+12 1 0.5%
 
13779570706.0 1 0.5%
 
35237742278.0 1 0.5%
 
199000000000.0 1 0.5%
 
52132289700.0 1 0.5%
 
11099473097.0 1 0.5%
 
Other values (174) 174 87.4%
 
(Missing) 11 5.5%
 

Minimum 5 values

Value Count Frequency (%)  
37859554.0 1 0.5%
 
145237022.0 1 0.5%
 
186716626.0 1 0.5%
 
287400000.0 1 0.5%
 
318071979.0 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
2.85e+12 1 0.5%
 
3.36e+12 1 0.5%
 
4.12e+12 1 0.5%
 
1.09e+13 1 0.5%
 
1.79e+13 1 0.5%
 

gdp_per_capita
Numeric

Distinct count 188
Unique (%) 100.0%
Missing (%) 5.5%
Missing (n) 11
Infinite (%) 0.0%
Infinite (n) 0
Mean 12531
Minimum 276
Maximum 101910
Zeros (%) 0.0%

Quantile statistics

Minimum 276
5-th percentile 537.11
Q1 1732.2
Median 4911.5
Q3 14474
95-th percentile 50644
Maximum 101910
Range 101640
Interquartile range 12742

Descriptive statistics

Standard deviation 17543
Coef of variation 1.4
Kurtosis 5.4979
Mean 12531
MAD 12391
Skewness 2.2522
Sum 2355700
Variance 307760000
Memory size 1.6 KiB
Value Count Frequency (%)  
3974.0 2 1.0%
 
26018.0 1 0.5%
 
1773.0 1 0.5%
 
3822.0 1 0.5%
 
6862.0 1 0.5%
 
1429.0 1 0.5%
 
23030.0 1 0.5%
 
17282.0 1 0.5%
 
1579.0 1 0.5%
 
3491.0 1 0.5%
 
Other values (177) 177 88.9%
 
(Missing) 11 5.5%
 

Minimum 5 values

Value Count Frequency (%)  
276.0 1 0.5%
 
306.8 1 0.5%
 
359.0 1 0.5%
 
381.4 1 0.5%
 
411.8 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
55906.0 1 0.5%
 
74458.0 1 0.5%
 
74720.0 1 0.5%
 
80130.0 1 0.5%
 
101911.0 1 0.5%
 

gender_inequal_index
Numeric

Distinct count 155
Unique (%) 99.4%
Missing (%) 21.6%
Missing (n) 43
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.36695
Minimum 0.0164
Maximum 0.744
Zeros (%) 0.0%

Quantile statistics

Minimum 0.0164
5-th percentile 0.066375
Q1 0.18738
Median 0.38605
Q3 0.52572
95-th percentile 0.65758
Maximum 0.744
Range 0.7276
Interquartile range 0.33835

Descriptive statistics

Standard deviation 0.19133
Coef of variation 0.52141
Kurtosis -1.1218
Mean 0.36695
MAD 0.16375
Skewness -0.10515
Sum 57.244
Variance 0.036608
Memory size 1.6 KiB
Value Count Frequency (%)  
0.1247 2 1.0%
 
0.1636 2 1.0%
 
0.1507 1 0.5%
 
0.3573 1 0.5%
 
0.4485 1 0.5%
 
0.5329 1 0.5%
 
0.6224 1 0.5%
 
0.0884 1 0.5%
 
0.4796 1 0.5%
 
0.4134 1 0.5%
 
Other values (144) 144 72.4%
 
(Missing) 43 21.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.0164 1 0.5%
 
0.0278 1 0.5%
 
0.0407 1 0.5%
 
0.0484 1 0.5%
 
0.0528 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
0.6789 1 0.5%
 
0.6934 1 0.5%
 
0.7065 1 0.5%
 
0.7132 1 0.5%
 
0.744 1 0.5%
 

groundwater_accounted_inflow
Numeric

Distinct count 15
Unique (%) 7.8%
Missing (%) 3.5%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.012557
Minimum -1.2
Maximum 1.33
Zeros (%) 89.4%

Quantile statistics

Minimum -1.2
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0.0245
Maximum 1.33
Range 2.53
Interquartile range 0

Descriptive statistics

Standard deviation 0.1577
Coef of variation 12.559
Kurtosis 52.707
Mean 0.012557
MAD 0.036051
Skewness 2.4695
Sum 2.411
Variance 0.02487
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 178 89.4%
 
0.08 2 1.0%
 
0.01 1 0.5%
 
0.032 1 0.5%
 
-1.2 1 0.5%
 
0.725 1 0.5%
 
0.002 1 0.5%
 
1.33 1 0.5%
 
0.03 1 0.5%
 
0.02 1 0.5%
 
Other values (4) 4 2.0%
 
(Missing) 7 3.5%
 

Minimum 5 values

Value Count Frequency (%)  
-1.2 1 0.5%
 
0.0 178 89.4%
 
0.002 1 0.5%
 
0.01 1 0.5%
 
0.02 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
0.1 1 0.5%
 
0.112 1 0.5%
 
0.725 1 0.5%
 
1.0 1 0.5%
 
1.33 1 0.5%
 

groundwater_accounted_outflow
Numeric

Distinct count 15
Unique (%) 9.7%
Missing (%) 22.6%
Missing (n) 45
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.27273
Minimum 0
Maximum 26.12
Zeros (%) 70.4%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0.301
Maximum 26.12
Range 26.12
Interquartile range 0

Descriptive statistics

Standard deviation 2.2802
Coef of variation 8.3604
Kurtosis 112.51
Mean 0.27273
MAD 0.51012
Skewness 10.334
Sum 42.001
Variance 5.1991
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 140 70.4%
 
0.95 2 1.0%
 
0.032 1 0.5%
 
0.03 1 0.5%
 
0.34 1 0.5%
 
0.025 1 0.5%
 
0.1 1 0.5%
 
0.7 1 0.5%
 
26.12 1 0.5%
 
0.394 1 0.5%
 
Other values (4) 4 2.0%
 
(Missing) 45 22.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 140 70.4%
 
0.025 1 0.5%
 
0.03 1 0.5%
 
0.032 1 0.5%
 
0.08 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
0.7 1 0.5%
 
0.95 2 1.0%
 
1.0 1 0.5%
 
11.0 1 0.5%
 
26.12 1 0.5%
 

groundwater_entering
Numeric

Distinct count 14
Unique (%) 7.3%
Missing (%) 3.5%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.070786
Minimum 0
Maximum 11.13
Zeros (%) 89.9%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0.0245
Maximum 11.13
Range 11.13
Interquartile range 0

Descriptive statistics

Standard deviation 0.80753
Coef of variation 11.408
Kurtosis 187.02
Mean 0.070786
MAD 0.13469
Skewness 13.6
Sum 13.591
Variance 0.6521
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 179 89.9%
 
0.08 2 1.0%
 
0.01 1 0.5%
 
0.032 1 0.5%
 
0.725 1 0.5%
 
0.002 1 0.5%
 
0.27 1 0.5%
 
0.03 1 0.5%
 
0.02 1 0.5%
 
0.112 1 0.5%
 
Other values (3) 3 1.5%
 
(Missing) 7 3.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 179 89.9%
 
0.002 1 0.5%
 
0.01 1 0.5%
 
0.02 1 0.5%
 
0.03 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
0.112 1 0.5%
 
0.27 1 0.5%
 
0.725 1 0.5%
 
1.0 1 0.5%
 
11.13 1 0.5%
 

groundwater_produced
Numeric

Distinct count 150
Unique (%) 88.2%
Missing (%) 14.6%
Missing (n) 29
Infinite (%) 0.0%
Infinite (n) 0
Mean 62.768
Minimum 0
Maximum 1383
Zeros (%) 1.0%

Quantile statistics

Minimum 0
5-th percentile 0.0767
Q1 2.2
Median 9.65
Q3 40.98
95-th percentile 398.05
Maximum 1383
Range 1383
Interquartile range 38.78

Descriptive statistics

Standard deviation 165.13
Coef of variation 2.6308
Kurtosis 29.206
Mean 62.768
MAD 82.406
Skewness 4.8804
Sum 10670
Variance 27268
Memory size 1.6 KiB
Value Count Frequency (%)  
6.0 4 2.0%
 
20.0 4 2.0%
 
0.5 4 2.0%
 
2.5 3 1.5%
 
1.3 3 1.5%
 
4.0 3 1.5%
 
3.2 2 1.0%
 
55.0 2 1.0%
 
2.2 2 1.0%
 
10.0 2 1.0%
 
Other values (139) 141 70.9%
 
(Missing) 29 14.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 2 1.0%
 
0.01 1 0.5%
 
0.015 1 0.5%
 
0.02 1 0.5%
 
0.03 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
510.0 1 0.5%
 
645.6 1 0.5%
 
788.0 1 0.5%
 
828.8 1 0.5%
 
1383.0 1 0.5%
 

groundwater_to_other_countries
Highly correlated

This variable is highly correlated with groundwater_accounted_outflow and should be ignored for analysis

Correlation 1

human_dev_index
Numeric

Distinct count 186
Unique (%) 99.5%
Missing (%) 6.0%
Missing (n) 12
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.69128
Minimum 0.3483
Maximum 0.9439
Zeros (%) 0.0%

Quantile statistics

Minimum 0.3483
5-th percentile 0.41939
Q1 0.57255
Median 0.7238
Q3 0.80925
95-th percentile 0.91278
Maximum 0.9439
Range 0.5956
Interquartile range 0.2367

Descriptive statistics

Standard deviation 0.15429
Coef of variation 0.2232
Kurtosis -0.91053
Mean 0.69128
MAD 0.12987
Skewness -0.35983
Sum 129.27
Variance 0.023807
Memory size 1.6 KiB
Value Count Frequency (%)  
0.715 2 1.0%
 
0.7928 2 1.0%
 
0.514 1 0.5%
 
0.6087 1 0.5%
 
0.6276 1 0.5%
 
0.575 1 0.5%
 
0.8701 1 0.5%
 
0.8175 1 0.5%
 
0.6357 1 0.5%
 
0.9155 1 0.5%
 
Other values (175) 175 87.9%
 
(Missing) 12 6.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.3483 1 0.5%
 
0.3501 1 0.5%
 
0.3909 1 0.5%
 
0.3919 1 0.5%
 
0.3999 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
0.9218 1 0.5%
 
0.9233 1 0.5%
 
0.9296 1 0.5%
 
0.935 1 0.5%
 
0.9439 1 0.5%
 

interannual_variability
Numeric

Distinct count 33
Unique (%) 19.9%
Missing (%) 16.6%
Missing (n) 33
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.7584
Minimum 0.6
Maximum 4.9
Zeros (%) 0.0%

Quantile statistics

Minimum 0.6
5-th percentile 0.8
Q1 1.1
Median 1.5
Q3 2.3
95-th percentile 3.5
Maximum 4.9
Range 4.3
Interquartile range 1.2

Descriptive statistics

Standard deviation 0.88408
Coef of variation 0.50276
Kurtosis 0.88868
Mean 1.7584
MAD 0.71277
Skewness 1.1336
Sum 291.9
Variance 0.7816
Memory size 1.6 KiB
Value Count Frequency (%)  
1.0 15 7.5%
 
1.2 12 6.0%
 
0.9 11 5.5%
 
1.4 11 5.5%
 
1.5 10 5.0%
 
1.1 10 5.0%
 
1.3 9 4.5%
 
0.8 8 4.0%
 
2.7 7 3.5%
 
2.3 6 3.0%
 
Other values (22) 67 33.7%
 
(Missing) 33 16.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.6 3 1.5%
 
0.7 2 1.0%
 
0.8 8 4.0%
 
0.9 11 5.5%
 
1.0 15 7.5%
 

Maximum 5 values

Value Count Frequency (%)  
3.6 2 1.0%
 
3.8 1 0.5%
 
4.2 2 1.0%
 
4.3 2 1.0%
 
4.9 1 0.5%
 

irrigation_potential
Numeric

Distinct count 105
Unique (%) 94.6%
Missing (%) 44.2%
Missing (n) 88
Infinite (%) 0.0%
Infinite (n) 0
Mean 4638.7
Minimum 0.2
Maximum 139500
Zeros (%) 0.0%

Quantile statistics

Minimum 0.2
5-th percentile 7.465
Q1 183.5
Median 566
Q3 3099
95-th percentile 15500
Maximum 139500
Range 139500
Interquartile range 2915.5

Descriptive statistics

Standard deviation 15300
Coef of variation 3.2984
Kurtosis 58.212
Mean 4638.7
MAD 6008.1
Skewness 7.1447
Sum 514900
Variance 234100000
Memory size 1.6 KiB
Value Count Frequency (%)  
2700.0 2 1.0%
 
5500.0 2 1.0%
 
165.0 2 1.0%
 
600.0 2 1.0%
 
1900.0 2 1.0%
 
30.0 2 1.0%
 
200.0 2 1.0%
 
70000.0 1 0.5%
 
40.0 1 0.5%
 
0.894 1 0.5%
 
Other values (94) 94 47.2%
 
(Missing) 88 44.2%
 

Minimum 5 values

Value Count Frequency (%)  
0.2 1 0.5%
 
0.3 1 0.5%
 
0.894 1 0.5%
 
1.0 1 0.5%
 
2.4 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
21300.0 1 0.5%
 
29000.0 1 0.5%
 
29350.0 1 0.5%
 
70000.0 1 0.5%
 
139500.0 1 0.5%
 

irwr
Highly correlated

This variable is highly correlated with avg_annual_rain_vol and should be ignored for analysis

Correlation 0.96167

irwr_per_capita
Numeric

Distinct count 182
Unique (%) 100.6%
Missing (%) 9.0%
Missing (n) 18
Infinite (%) 0.0%
Infinite (n) 0
Mean 16036
Minimum 0
Maximum 516090
Zeros (%) 0.5%

Quantile statistics

Minimum 0
5-th percentile 93.01
Q1 913.2
Median 2599
Q3 11227
95-th percentile 72201
Maximum 516090
Range 516090
Interquartile range 10314

Descriptive statistics

Standard deviation 49232
Coef of variation 3.07
Kurtosis 66.768
Mean 16036
MAD 20476
Skewness 7.4684
Sum 2902600
Variance 2423700000
Memory size 1.6 KiB
Value Count Frequency (%)  
822.2 1 0.5%
 
566.3 1 0.5%
 
1571.0 1 0.5%
 
11761.0 1 0.5%
 
19444.0 1 0.5%
 
2886.0 1 0.5%
 
3303.0 1 0.5%
 
1213.0 1 0.5%
 
5372.0 1 0.5%
 
3585.0 1 0.5%
 
Other values (171) 171 85.9%
 
(Missing) 18 9.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 1 0.5%
 
2.905 1 0.5%
 
16.38 1 0.5%
 
19.67 1 0.5%
 
25.06 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
100671.0 1 0.5%
 
105132.0 1 0.5%
 
182320.0 1 0.5%
 
314170.0 1 0.5%
 
516090.0 1 0.5%
 

number_undernourished
Highly correlated

This variable is highly correlated with irrigation_potential and should be ignored for analysis

Correlation 0.96711

overlap_surface_groundwater
Highly correlated

This variable is highly correlated with groundwater_produced and should be ignored for analysis

Correlation 0.9919

percent_cultivated
Numeric

Distinct count 195
Unique (%) 99.5%
Missing (%) 1.5%
Missing (n) 3
Infinite (%) 0.0%
Infinite (n) 0
Mean 18.513
Minimum 0.0862
Maximum 63.41
Zeros (%) 0.0%

Quantile statistics

Minimum 0.0862
5-th percentile 0.96727
Q1 5.9638
Median 14.68
Q3 27.88
95-th percentile 50.148
Maximum 63.41
Range 63.324
Interquartile range 21.916

Descriptive statistics

Standard deviation 15.496
Coef of variation 0.83705
Kurtosis 0.28068
Mean 18.513
MAD 12.504
Skewness 0.98211
Sum 3628.6
Variance 240.14
Memory size 1.6 KiB
Value Count Frequency (%)  
27.88 2 1.0%
 
60.0 2 1.0%
 
16.13 1 0.5%
 
3.412 1 0.5%
 
18.02 1 0.5%
 
10.91 1 0.5%
 
31.98 1 0.5%
 
6.111 1 0.5%
 
21.4 1 0.5%
 
55.7 1 0.5%
 
Other values (184) 184 92.5%
 
(Missing) 3 1.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0862 1 0.5%
 
0.2223 1 0.5%
 
0.3658 1 0.5%
 
0.4334 1 0.5%
 
0.4473 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
56.76 1 0.5%
 
57.57 1 0.5%
 
60.0 2 1.0%
 
62.3 1 0.5%
 
63.41 1 0.5%
 

percent_undernourished
Numeric

Distinct count 69
Unique (%) 83.1%
Missing (%) 58.3%
Missing (n) 116
Infinite (%) 0.0%
Infinite (n) 0
Mean 17.223
Minimum 5.1
Maximum 53.4
Zeros (%) 0.0%

Quantile statistics

Minimum 5.1
5-th percentile 5.5
Q1 7.9
Median 13.5
Q3 23.45
95-th percentile 40.88
Maximum 53.4
Range 48.3
Interquartile range 15.55

Descriptive statistics

Standard deviation 11.416
Coef of variation 0.66285
Kurtosis 0.8264
Mean 17.223
MAD 9.2504
Skewness 1.1583
Sum 1429.5
Variance 130.33
Memory size 1.6 KiB
Value Count Frequency (%)  
7.4 3 1.5%
 
14.2 3 1.5%
 
20.7 3 1.5%
 
7.5 2 1.0%
 
9.5 2 1.0%
 
16.4 2 1.0%
 
26.8 2 1.0%
 
6.2 2 1.0%
 
22.0 2 1.0%
 
15.9 2 1.0%
 
Other values (58) 60 30.2%
 
(Missing) 116 58.3%
 

Minimum 5 values

Value Count Frequency (%)  
5.1 2 1.0%
 
5.2 1 0.5%
 
5.3 1 0.5%
 
5.5 2 1.0%
 
5.6 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
41.6 1 0.5%
 
42.3 1 0.5%
 
47.7 1 0.5%
 
47.8 1 0.5%
 
53.4 1 0.5%
 

permanent_crop_area
Numeric

Distinct count 155
Unique (%) 79.1%
Missing (%) 1.5%
Missing (n) 3
Infinite (%) 0.0%
Infinite (n) 0
Mean 839.45
Minimum 0
Maximum 22500
Zeros (%) 3.0%

Quantile statistics

Minimum 0
5-th percentile 0.575
Q1 14.35
Median 112
Q3 455.5
95-th percentile 4500
Maximum 22500
Range 22500
Interquartile range 441.15

Descriptive statistics

Standard deviation 2429.3
Coef of variation 2.8939
Kurtosis 42.919
Mean 839.45
MAD 1128.2
Skewness 5.9581
Sum 164530
Variance 5901600
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 6 3.0%
 
4.0 5 2.5%
 
3.0 4 2.0%
 
6.0 3 1.5%
 
2.0 3 1.5%
 
700.0 3 1.5%
 
100.0 3 1.5%
 
5.0 3 1.5%
 
1.0 3 1.5%
 
60.0 2 1.0%
 
Other values (144) 161 80.9%
 
(Missing) 3 1.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 6 3.0%
 
0.1 2 1.0%
 
0.4 1 0.5%
 
0.5 1 0.5%
 
0.6 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
6572.0 1 0.5%
 
6600.0 1 0.5%
 
13000.0 1 0.5%
 
16226.0 1 0.5%
 
22500.0 1 0.5%
 

rural_pop
Highly correlated

This variable is highly correlated with number_undernourished and should be ignored for analysis

Correlation 0.99226

rural_pop_access_drinking
Numeric

Distinct count 118
Unique (%) 64.5%
Missing (%) 8.0%
Missing (n) 16
Infinite (%) 0.0%
Infinite (n) 0
Mean 84.026
Minimum 28.2
Maximum 100
Zeros (%) 0.0%

Quantile statistics

Minimum 28.2
5-th percentile 45.65
Q1 72.95
Median 92.1
Q3 99.35
95-th percentile 100
Maximum 100
Range 71.8
Interquartile range 26.4

Descriptive statistics

Standard deviation 18.907
Coef of variation 0.22501
Kurtosis 0.37805
Mean 84.026
MAD 15.52
Skewness -1.1836
Sum 15377
Variance 357.46
Memory size 1.6 KiB
Value Count Frequency (%)  
100.0 39 19.6%
 
99.0 6 3.0%
 
98.3 3 1.5%
 
92.1 2 1.0%
 
67.3 2 1.0%
 
97.0 2 1.0%
 
73.8 2 1.0%
 
99.7 2 1.0%
 
95.1 2 1.0%
 
69.4 2 1.0%
 
Other values (107) 121 60.8%
 
(Missing) 16 8.0%
 

Minimum 5 values

Value Count Frequency (%)  
28.2 1 0.5%
 
31.2 1 0.5%
 
31.5 1 0.5%
 
32.8 1 0.5%
 
35.3 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
99.6 1 0.5%
 
99.7 2 1.0%
 
99.8 1 0.5%
 
99.9 1 0.5%
 
100.0 39 19.6%
 

seasonal_variability
Numeric

Distinct count 43
Unique (%) 25.9%
Missing (%) 16.6%
Missing (n) 33
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.2904
Minimum 0.3
Maximum 4.6
Zeros (%) 0.0%

Quantile statistics

Minimum 0.3
5-th percentile 0.625
Q1 1.525
Median 2.3
Q3 3.1
95-th percentile 3.875
Maximum 4.6
Range 4.3
Interquartile range 1.575

Descriptive statistics

Standard deviation 1.0288
Coef of variation 0.44917
Kurtosis -0.87948
Mean 2.2904
MAD 0.87
Skewness 0.079612
Sum 380.2
Variance 1.0583
Memory size 1.6 KiB
Value Count Frequency (%)  
2.5 8 4.0%
 
3.6 8 4.0%
 
2.1 8 4.0%
 
1.6 8 4.0%
 
1.9 7 3.5%
 
3.1 7 3.5%
 
3.5 7 3.5%
 
2.4 7 3.5%
 
1.0 6 3.0%
 
1.8 6 3.0%
 
Other values (32) 94 47.2%
 
(Missing) 33 16.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.3 1 0.5%
 
0.4 2 1.0%
 
0.5 1 0.5%
 
0.6 5 2.5%
 
0.7 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
4.0 3 1.5%
 
4.1 1 0.5%
 
4.2 1 0.5%
 
4.4 1 0.5%
 
4.6 2 1.0%
 

surface_entering
Highly correlated

This variable is highly correlated with accounted_flow and should be ignored for analysis

Correlation 0.98177

surface_groundwater_overlap
Highly correlated

This variable is highly correlated with overlap_surface_groundwater and should be ignored for analysis

Correlation 1

surface_inflow_secure_treaty
Numeric

Distinct count 16
Unique (%) 8.3%
Missing (%) 3.5%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.1905
Minimum 0
Maximum 170.3
Zeros (%) 89.4%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 2.6319
Maximum 170.3
Range 170.3
Interquartile range 0

Descriptive statistics

Standard deviation 14.253
Coef of variation 6.5067
Kurtosis 105.16
Mean 2.1905
MAD 4.1016
Skewness 9.5926
Sum 420.57
Variance 203.14
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 178 89.4%
 
2.208 1 0.5%
 
3.15 1 0.5%
 
65.65 1 0.5%
 
44.11 1 0.5%
 
170.3 1 0.5%
 
0.82 1 0.5%
 
0.05 1 0.5%
 
1.85 1 0.5%
 
16.09 1 0.5%
 
Other values (5) 5 2.5%
 
(Missing) 7 3.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 178 89.4%
 
0.05 1 0.5%
 
0.82 1 0.5%
 
1.85 1 0.5%
 
2.208 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
26.5 1 0.5%
 
44.11 1 0.5%
 
55.5 1 0.5%
 
65.65 1 0.5%
 
170.3 1 0.5%
 

surface_inflow_submit_no_treaty
Highly correlated

This variable is highly correlated with surface_entering and should be ignored for analysis

Correlation 0.99629

surface_inflow_submit_treaty
Highly correlated

This variable is highly correlated with surface_inflow_secure_treaty and should be ignored for analysis

Correlation 0.979

surface_outflow_secure_treaty
Numeric

Distinct count 17
Unique (%) 8.8%
Missing (%) 2.5%
Missing (n) 5
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.22
Minimum 0
Maximum 170.3
Zeros (%) 89.9%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1.3058
Maximum 170.3
Range 170.3
Interquartile range 0

Descriptive statistics

Standard deviation 14.168
Coef of variation 6.382
Kurtosis 106.2
Mean 2.22
MAD 4.1863
Skewness 9.6002
Sum 430.69
Variance 200.74
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 179 89.9%
 
2.208 1 0.5%
 
54.86 1 0.5%
 
0.79 1 0.5%
 
0.82 1 0.5%
 
0.05 1 0.5%
 
170.3 1 0.5%
 
18.9 1 0.5%
 
25.87 1 0.5%
 
0.432 1 0.5%
 
Other values (6) 6 3.0%
 
(Missing) 5 2.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 179 89.9%
 
0.05 1 0.5%
 
0.335 1 0.5%
 
0.432 1 0.5%
 
0.79 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
26.5 1 0.5%
 
33.12 1 0.5%
 
54.86 1 0.5%
 
65.5 1 0.5%
 
170.3 1 0.5%
 

surface_outflow_submit_no_treaty
Numeric

Distinct count 105
Unique (%) 57.4%
Missing (%) 8.0%
Missing (n) 16
Infinite (%) 0.0%
Infinite (n) 0
Mean 55.924
Minimum 0
Maximum 1868
Zeros (%) 39.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1.725
Q3 18.135
95-th percentile 193.68
Maximum 1868
Range 1868
Interquartile range 18.135

Descriptive statistics

Standard deviation 208.77
Coef of variation 3.7332
Kurtosis 43.517
Mean 55.924
MAD 84.158
Skewness 6.2161
Sum 10234
Variance 43586
Memory size 1.6 KiB
Value Count Frequency (%)  
0.0 78 39.2%
 
3.0 2 1.0%
 
13.2 2 1.0%
 
48.0 1 0.5%
 
37.0 1 0.5%
 
160.0 1 0.5%
 
4.86 1 0.5%
 
9.655 1 0.5%
 
0.177 1 0.5%
 
6.145 1 0.5%
 
Other values (94) 94 47.2%
 
(Missing) 16 8.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 78 39.2%
 
0.015 1 0.5%
 
0.017 1 0.5%
 
0.057 1 0.5%
 
0.096 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
585.7 1 0.5%
 
718.8 1 0.5%
 
1142.0 1 0.5%
 
1375.0 1 0.5%
 
1868.0 1 0.5%
 

surface_outflow_submit_treaty
Highly correlated

This variable is highly correlated with surface_outflow_secure_treaty and should be ignored for analysis

Correlation 0.97841

surface_to_other_countries
Highly correlated

This variable is highly correlated with surface_outflow_submit_no_treaty and should be ignored for analysis

Correlation 0.99643

surface_total_external_renewable
Highly correlated

This variable is highly correlated with surface_inflow_submit_no_treaty and should be ignored for analysis

Correlation 0.97923

surface_water_produced
Highly correlated

This variable is highly correlated with irwr and should be ignored for analysis

Correlation 0.99953

total_area
Numeric

Distinct count 195
Unique (%) 99.0%
Missing (%) 1.0%
Missing (n) 2
Infinite (%) 0.0%
Infinite (n) 0
Mean 67954
Minimum 1
Maximum 1709800
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 31.6
Q1 2207
Median 11760
Q3 51312
95-th percentile 235220
Maximum 1709800
Range 1709800
Interquartile range 49105

Descriptive statistics

Standard deviation 190910
Coef of variation 2.8094
Kurtosis 36.522
Mean 67954
MAD 85134
Skewness 5.6078
Sum 13387000
Variance 36445000000
Memory size 1.6 KiB
Value Count Frequency (%)  
26.0 2 1.0%
 
75.0 2 1.0%
 
46.0 2 1.0%
 
25637.0 1 0.5%
 
60355.0 1 0.5%
 
44655.0 1 0.5%
 
2571.0 1 0.5%
 
126700.0 1 0.5%
 
54909.0 1 0.5%
 
11137.0 1 0.5%
 
Other values (184) 184 92.5%
 
(Missing) 2 1.0%
 

Minimum 5 values

Value Count Frequency (%)  
1.0 1 0.5%
 
2.0 1 0.5%
 
3.0 1 0.5%
 
6.0 1 0.5%
 
16.0 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
851577.0 1 0.5%
 
960001.0 1 0.5%
 
983151.0 1 0.5%
 
998467.0 1 0.5%
 
1709825.0 1 0.5%
 

total_dam_capacity
Highly correlated

This variable is highly correlated with number_undernourished and should be ignored for analysis

Correlation 0.90598

total_flow_border_rivers
Highly correlated

This variable is highly correlated with accounted_flow_border_rivers and should be ignored for analysis

Correlation 0.96631

total_pop
Highly correlated

This variable is highly correlated with rural_pop and should be ignored for analysis

Correlation 0.96084

total_pop_access_drinking
Highly correlated

This variable is highly correlated with rural_pop_access_drinking and should be ignored for analysis

Correlation 0.94921

total_renewable
Highly correlated

This variable is highly correlated with surface_water_produced and should be ignored for analysis

Correlation 0.97515

total_renewable_groundwater
Highly correlated

This variable is highly correlated with surface_groundwater_overlap and should be ignored for analysis

Correlation 0.99191

total_renewable_per_capita
Highly correlated

This variable is highly correlated with irwr_per_capita and should be ignored for analysis

Correlation 0.97641

total_renewable_surface
Highly correlated

This variable is highly correlated with total_renewable and should be ignored for analysis

Correlation 0.99966

urban_pop
Highly correlated

This variable is highly correlated with total_pop and should be ignored for analysis

Correlation 0.95137

urban_pop_access_drinking
Numeric

Distinct count 88
Unique (%) 46.6%
Missing (%) 5.0%
Missing (n) 10
Infinite (%) 0.0%
Infinite (n) 0
Mean 94.787
Minimum 50.7
Maximum 100
Zeros (%) 0.0%

Quantile statistics

Minimum 50.7
5-th percentile 76.12
Q1 93.8
Median 98.1
Q3 99.9
95-th percentile 100
Maximum 100
Range 49.3
Interquartile range 6.1

Descriptive statistics

Standard deviation 8.4156
Coef of variation 0.088784
Kurtosis 7.5084
Mean 94.787
MAD 5.5776
Skewness -2.6093
Sum 17915
Variance 70.822
Memory size 1.6 KiB
Value Count Frequency (%)  
100.0 47 23.6%
 
99.7 7 3.5%
 
97.5 5 2.5%
 
99.6 5 2.5%
 
99.0 4 2.0%
 
98.9 4 2.0%
 
99.9 4 2.0%
 
99.5 3 1.5%
 
95.5 3 1.5%
 
97.0 3 1.5%
 
Other values (77) 104 52.3%
 
(Missing) 10 5.0%
 

Minimum 5 values

Value Count Frequency (%)  
50.7 1 0.5%
 
58.4 1 0.5%
 
64.9 1 0.5%
 
66.0 1 0.5%
 
66.4 1 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
99.6 5 2.5%
 
99.7 7 3.5%
 
99.8 2 1.0%
 
99.9 4 2.0%
 
100.0 47 23.6%
 

water_total_external_renewable
Highly correlated

This variable is highly correlated with surface_total_external_renewable and should be ignored for analysis

Correlation 1
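
The "Highly correlated" notices above each name one earlier variable and a correlation, and every reported value is above 0.9. Some notices chain: rural_pop points at number_undernourished, which is itself flagged against irrigation_potential. It is therefore worth reproducing the pairs yourself before dropping anything. The following is a minimal sketch of such a check, not the profiler's own implementation; the function name `flag_correlated`, the 0.9 threshold, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def flag_correlated(df, threshold=0.9):
    """Map each flagged column to (earlier_column, correlation).

    Columns are scanned left to right. A column is flagged when its
    Pearson correlation with any earlier column exceeds `threshold`;
    the first such earlier column is recorded.
    """
    corr = df.corr()  # pairwise Pearson correlation, NaNs ignored per pair
    columns = list(corr.columns)
    flagged = {}
    for i, col in enumerate(columns):
        for prev in columns[:i]:
            r = corr.loc[col, prev]
            if r > threshold:  # a NaN correlation compares False and is skipped
                flagged[col] = (prev, float(r))
                break
    return flagged


# Hypothetical toy data: "b" is almost exactly 2 * "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
flagged = flag_correlated(toy)
for col, (other, r) in flagged.items():
    print(f"{col} is highly correlated with {other} (r = {r:.4f})")
```

Running the same function on the real data frame would give a list of pairs to compare against the report, which helps in deciding which member of each pair to keep.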

Sample

2013-2017 accounted_flow accounted_flow_border_rivers agg_to_gdp arable_land avg_annual_rain_depth avg_annual_rain_vol cultivated_area dam_capacity_per_capita dependency_ratio flood_occurence gdp gdp_per_capita gender_inequal_index groundwater_accounted_inflow groundwater_accounted_outflow groundwater_entering groundwater_produced groundwater_to_other_countries human_dev_index interannual_variability irrigation_potential irwr irwr_per_capita number_undernourished overlap_surface_groundwater percent_cultivated percent_undernourished permanent_crop_area rural_pop rural_pop_access_drinking seasonal_variability surface_entering surface_groundwater_overlap surface_inflow_secure_treaty surface_inflow_submit_no_treaty surface_inflow_submit_treaty surface_outflow_secure_treaty surface_outflow_submit_no_treaty surface_outflow_submit_treaty surface_to_other_countries surface_total_external_renewable surface_water_produced total_area total_dam_capacity total_flow_border_rivers total_pop total_pop_access_drinking total_renewable total_renewable_groundwater total_renewable_per_capita total_renewable_surface urban_pop urban_pop_access_drinking water_total_external_renewable
country
Afghanistan 19.00 9.0 22.6000 7771.0 327.0 213.5000 7910.0 61.76 28.7200 3.7 1.919944e+10 590.3 0.6934 0.00 NaN 0.00 10.650 NaN 0.4653 2.5 NaN 47.1500 1450.0 8600.0 1.00 12.120 26.8 139.0 23980.00 47.0 2.5 10.00 1.00 0.0 10.00 0.0 0.82 35.52 6.7 42.22 18.18 37.50 65286.0 2.009 33.4 32527.00 55.3 65.3300 10.650 2008.0 55.68 8547.0 78.2 18.18
Albania 3.30 0.0 22.0500 615.6 1485.0 42.6900 696.0 1391.00 10.9300 2.7 1.145560e+10 3954.0 0.2174 0.00 0.0 0.00 6.200 0.0 0.7328 1.2 NaN 26.9000 9285.0 NaN 2.35 24.210 NaN 80.4 1062.00 95.2 2.4 3.30 2.35 0.0 3.30 0.0 0.00 11.50 0.0 11.50 3.30 23.05 2875.0 4.030 0.0 2897.00 95.1 30.2000 6.200 10425.0 26.35 1835.0 94.9 3.30
Algeria 0.39 0.0 13.0500 7469.0 89.0 212.0000 8439.0 209.30 3.5990 2.8 1.670000e+11 4210.0 0.4131 0.03 0.1 0.03 1.487 0.1 0.7356 2.3 1300.0 11.2500 283.6 NaN 0.00 3.543 NaN 969.8 10928.00 81.8 1.9 0.39 0.00 0.0 0.39 0.0 0.00 0.32 0.0 0.32 0.39 9.76 238174.0 8.304 0.0 39667.00 83.6 11.6700 1.517 294.2 10.15 28739.0 84.3 0.42
Andorra NaN NaN 0.5239 2.8 NaN 0.4724 2.8 NaN NaN 3.3 3.249101e+09 46106.0 NaN NaN NaN NaN NaN NaN 0.8446 1.5 NaN 0.3156 4479.0 NaN NaN 5.957 NaN 0.0 1.57 100.0 1.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 47.0 NaN NaN 70.47 100.0 0.3156 NaN 4479.0 NaN 68.9 100.0 NaN
Angola 0.40 0.0 NaN 4900.0 1010.0 1259.0000 5190.0 377.50 0.2695 1.7 1.030000e+11 4116.0 NaN 0.00 0.0 0.00 58.000 0.0 0.5316 2.5 3700.0 148.0000 5915.0 3200.0 55.00 4.163 14.2 290.0 14970.00 28.2 3.1 0.40 55.00 0.0 0.40 0.0 0.00 122.80 0.0 122.80 0.40 145.00 124670.0 9.445 0.0 25022.00 49.0 148.4000 58.000 5931.0 145.40 10052.0 75.4 0.40

To do: Write down observations from profiling
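
One way to make the observations easier to write down is to collect the figures the report repeats for every variable (missing %, zeros %, skewness) into a single table and sort it. This is a rough sketch under stated assumptions: `profile_summary` is a hypothetical helper, and the small frame below only copies three columns from the five sample rows shown above, so its percentages describe those five rows and not the full dataset.

```python
import numpy as np
import pandas as pd


def profile_summary(df):
    """Return missing %, zeros % and skewness for each numeric column."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        # Share of rows that are NaN, as in the report's "Missing (%)"
        "missing_pct": numeric.isna().mean() * 100,
        # Share of all rows equal to zero, as in the report's "Zeros (%)"
        "zeros_pct": (numeric == 0).mean() * 100,
        # Sample skewness; NaN when fewer than three values are present
        "skewness": numeric.skew(),
    })
    return summary.sort_values("missing_pct", ascending=False)


# Three columns from the five sample rows printed in the report.
sample = pd.DataFrame(
    {
        "gdp_per_capita": [590.3, 3954.0, 4210.0, 46106.0, 4116.0],
        "groundwater_accounted_inflow": [0.00, 0.00, 0.03, np.nan, 0.00],
        "irrigation_potential": [np.nan, np.nan, 1300.0, np.nan, 3700.0],
    },
    index=["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
)
summary = profile_summary(sample)
print(summary)
```

Sorting by missing percentage puts the columns most in need of a decision (impute, drop, or ask the source) at the top; sorting by skewness instead would surface candidates for a log transform.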

To do: Collect the questions raised by the quality assessment and profiling

  • Quality concerns to ask the data source
  • Questions and hypotheses you want to explore further during exploration
  • Things you need to understand better and would ask a domain expert
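
Quality concerns for the data source are easier to raise with concrete rows in hand. For example, the report shows a minimum of -1.2 for groundwater_accounted_inflow, and whether a negative inflow is valid is a question for the source. Below is a small sketch of a range check that lists such values; the helper `out_of_range`, the bounds, and the row labels are hypothetical, and the values are taken from the frequency tables in the report without knowing which countries they belong to.

```python
import pandas as pd


def out_of_range(df, bounds):
    """List non-missing values falling outside the expected (low, high) range.

    `bounds` maps a column name to an inclusive (low, high) tuple.
    """
    findings = []
    for column, (low, high) in bounds.items():
        values = df[column].dropna()
        bad = values[(values < low) | (values > high)]
        for label, value in bad.items():
            findings.append({"column": column, "row": label, "value": value})
    return pd.DataFrame(findings, columns=["column", "row", "value"])


# Hypothetical rows; the values themselves appear in the report above.
toy = pd.DataFrame(
    {
        "groundwater_accounted_inflow": [0.0, 0.03, -1.2, 1.33],
        "urban_pop_access_drinking": [50.7, 100.0, 99.7, 99.9],
    },
    index=["row_a", "row_b", "row_c", "row_d"],
)

# Assumed expectations: volumes are non-negative, percentages lie in 0-100.
bounds = {
    "groundwater_accounted_inflow": (0.0, float("inf")),
    "urban_pop_access_drinking": (0.0, 100.0),
}
findings = out_of_range(toy, bounds)
print(findings)
```

Each row of the result is a candidate question for the data source or a domain expert, not proof of an error.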