Diagnostics, analysis and visualization tools
for Integrated Assessment timeseries data

First steps with the pyam_analysis package

The pyam-analysis package provides a range of diagnostic tools and functions
for analyzing and working with IAMC-style timeseries data.

The package can be used with data that follows the data template convention of the Integrated Assessment Modeling Consortium (IAMC). An illustrative example is shown below; see data.ene.iiasa.ac.at/database for more information.

model        scenario      region  variable        unit  2005   2010   2015
MESSAGE V.4  AMPERE3-Base  World   Primary Energy  EJ/y  454.5  479.6  ...
...          ...           ...     ...             ...   ...    ...    ...
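
To make the layout concrete, the following sketch reads a table in this wide format (one row per timeseries, one column per year) and turns it into one record per year. It is only an illustration of the data template using the standard library; the helper name wide_to_long and the constant ID_COLUMNS are assumptions made here and are not part of the package.

```python
import csv
import io

# Illustrative IAMC-style table; the two values are the ones shown in the example above.
WIDE_CSV = """model,scenario,region,variable,unit,2005,2010
MESSAGE V.4,AMPERE3-Base,World,Primary Energy,EJ/y,454.5,479.6
"""

ID_COLUMNS = ["model", "scenario", "region", "variable", "unit"]


def wide_to_long(text):
    """Hypothetical helper: one record per (timeseries, year) instead of one row per timeseries."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, cell in row.items():
            if column in ID_COLUMNS or cell in ("", None):
                continue  # skip identifier columns and empty year cells
            record = {key: row[key] for key in ID_COLUMNS}
            record["year"] = int(column)
            record["value"] = float(cell)
            records.append(record)
    return records


long_records = wide_to_long(WIDE_CSV)
for record in long_records:
    print(record)
```

Each printed record carries the five identifier columns plus a year and a value.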

This notebook illustrates some basic functionality of the pyam-analysis package and the IamDataFrame class:

  1. Importing timeseries data from a csv file.
  2. Listing models, scenarios and variables included in the data.
  3. Displaying timeseries data as a dataframe and visualizing it using simple plotting functions.
  4. Evaluating the model data and executing a range of diagnostic checks to identify data outliers.
  5. Categorizing scenarios according to timeseries data.

Tutorial data

The timeseries data used in this tutorial is a partial snapshot of the scenario database compiled for the IPCC's Fifth Assessment Report (AR5):

Krey V., O. Masera, G. Blanford, T. Bruckner, R. Cooke, K. Fisher-Vanden, H. Haberl, E. Hertwich, E. Kriegler, D. Mueller, S. Paltsev, L. Price, S. Schlömer, D. Ürge-Vorsatz, D. van Vuuren, and T. Zwickel, 2014: Annex II: Metrics & Methodology.
In: Climate Change 2014: Mitigation of Climate Change. Contribution of Working Group III to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change [Edenhofer, O., R. Pichs-Madruga, Y. Sokona, E. Farahani, S. Kadner, K. Seyboth, A. Adler, I. Baum, S. Brunner, P. Eickemeier, B. Kriemann, J. Savolainen, S. Schlömer, C. von Stechow, T. Zwickel and J.C. Minx (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.

The complete database is publicly available at tntcat.iiasa.ac.at/AR5DB/.

The data snapshot used for this tutorial consists of selected data from two model intercomparison projects:

  • Energy Modeling Forum Round 27 (EMF27), see the Special Issue in Climatic Change 3-4, 2014.
  • EU FP7 project AMPERE, see the following scientific publications:
    • Riahi, K., et al. (2015). "Locked into Copenhagen pledges — Implications of short-term emission targets for the cost and feasibility of long-term climate goals." Technological Forecasting and Social Change 90(Part A): 8-23.
      DOI: 10.1016/j.techfore.2013.09.016
    • Kriegler, E., et al. (2015). "Making or breaking climate targets: The AMPERE study on staged accession scenarios for climate policy." Technological Forecasting and Social Change 90(Part A): 24-44.
      DOI: 10.1016/j.techfore.2013.09.021

Note: The data used in this tutorial is ONLY a partial snapshot of the IPCC AR5 scenario database! This tutorial is intended only as an illustration of the pyam-analysis package.

Import package and load data from the AR5 tutorial csv snapshot file

First, we import the snapshot timeseries data from the file tutorial_AR5_data.csv in the tutorial folder.

We then list all models, scenarios, regions, and variables (with units) included in the snapshot.


In [1]:
import pyam_analysis as iam



In [2]:
data = 'tutorial_AR5_data.csv'  # snapshot file, path relative to the tutorial folder
df = iam.IamDataFrame(data=data)

What's in our dataset?


In [3]:
df.models()


Out[3]:
['AIM-Enduse 12.1',
 'GCAM 3.0',
 'IMAGE 2.4',
 'MERGE_EMF27',
 'MESSAGE V.4',
 'REMIND 1.5',
 'WITCH_EMF27']

In [4]:
df.scenarios()


Out[4]:
['EMF27-450-Conv',
 'EMF27-450-NoCCS',
 'EMF27-550-LimBio',
 'EMF27-Base-FullTech',
 'EMF27-G8-EERE',
 'AMPERE3-450',
 'AMPERE3-450P-CE',
 'AMPERE3-450P-EU',
 'AMPERE3-550',
 'AMPERE3-Base-EUback',
 'AMPERE3-CF450P-EU',
 'AMPERE3-RefPol',
 'AMPERE3-550P-EU']

In [5]:
df.regions()


Out[5]:
['ASIA', 'LAM', 'MAF', 'OECD90', 'REF', 'World']

In [6]:
df.variables(include_units=True)


Out[6]:
   variable                                           unit
0  Emissions|CO2                                      Mt CO2/yr
1  Emissions|CO2|Fossil Fuels and Industry            Mt CO2/yr
2  Primary Energy                                     EJ/yr
3  Emissions|CO2|Fossil Fuels and Industry|Energy...  Mt CO2/yr
4  Emissions|CO2|Fossil Fuels and Industry|Energy...  Mt CO2/yr
5  Price|Carbon                                       US$2005/t CO2
6  Primary Energy|Coal                                EJ/yr
7  Primary Energy|Fossil|w/ CCS                       EJ/yr
8  Temperature|Global Mean|MAGICC6|MED                deg C

Filtering Data

Most functions of the IamDataFrame class take an (optional) argument filters, i.e., a dictionary of filter criteria.

Filtering by model names, scenarios and regions

Filtering by model, scenario or region is implemented with regular expressions (regex, the Python re module) and the re.match() function. As a consequence, a filter string is matched from the beginning of the text string only.

Applying the filter 'model': 'MESSAGE' to the function scenarios() will return all MESSAGE V.4 scenarios included in the snapshot.
Filtering for ESSAGE will return an empty set.
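
The anchoring behaviour of re.match() can be checked with plain Python, independent of the package. In this small sketch the helper match_from_start is a name chosen here for illustration, and the list holds three of the model names from the snapshot.

```python
import re

models = ["GCAM 3.0", "MESSAGE V.4", "REMIND 1.5"]


def match_from_start(pattern, names):
    """Keep the names for which re.match() succeeds (the match is anchored at the start)."""
    return [name for name in names if re.match(pattern, name)]


print(match_from_start("MESSAGE", models))  # ['MESSAGE V.4']
print(match_from_start("ESSAGE", models))   # []

# re.search() scans the whole string, so it would find the substring:
print([name for name in models if re.search("ESSAGE", name)])  # ['MESSAGE V.4']
```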


In [7]:
df.scenarios({'model': 'MESSAGE'})


Out[7]:
['AMPERE3-450',
 'AMPERE3-450P-EU',
 'AMPERE3-550',
 'AMPERE3-RefPol',
 'EMF27-550-LimBio',
 'EMF27-Base-FullTech']

In [8]:
df.scenarios({'model': 'ESSAGE'})


Out[8]:
[]

Filtering by variables and hierarchy levels

Filtering for variable strings using regex is problematic due to the frequent use of the "|" character in the IAMC template to specify a hierarchical structure. Therefore, this package implements a pseudo-regex syntax, where | is escaped, * is used as a wildcard and exact matching at the end of the string is enforced (in regex lingo, * is replaced by .* and $ is appended to the filter string).

Filtering for Primary Energy will return only exactly those data.
Filtering for Primary Energy|* will return all sub-categories of Primary Energy (and only the sub-categories).
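
The translation described above can be sketched in a few lines. This is not the package's own code: the helper names pseudo_regex and select are assumptions made here, and only the three rules stated in the text are applied (escape |, replace * by .*, append $).

```python
import re


def pseudo_regex(filter_string):
    """Hypothetical helper: translate a variable filter into a regular expression."""
    pattern = filter_string.replace("|", r"\|")  # treat | literally, not as "or"
    pattern = pattern.replace("*", ".*")         # * becomes a wildcard
    return pattern + "$"                         # require a match up to the end


variables = ["Primary Energy", "Primary Energy|Coal", "Primary Energy|Fossil|w/ CCS"]


def select(filter_string):
    """Return the variables matched by the translated filter."""
    regex = re.compile(pseudo_regex(filter_string))
    return [v for v in variables if regex.match(v)]


print(select("Primary Energy"))    # ['Primary Energy']
print(select("Primary Energy|*"))  # ['Primary Energy|Coal', 'Primary Energy|Fossil|w/ CCS']
```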

In addition, IAM variables can be filtered by the level, i.e., the "depth" of the variable in a hierarchical reading of the string separated by "|". That is, the variable Primary Energy has level 0, while Primary Energy|Coal has level 1. Filtering by both variables and level will search for the hierarchical depth following the string, so filter arguments Primary Energy|* and level = 0 will return all variables immediately below Primary Energy. Filtering by level only will return all variables up to that depth.
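
One way to picture the level is as the number of "|" separators, counted either over the full variable name or over the part that follows the filter prefix. The helper level below is a hypothetical illustration of that reading, not a function of the package.

```python
def level(variable, prefix=""):
    """Hypothetical helper: hierarchical depth of a variable, counted from 0.

    If a prefix such as 'Emissions|' is given, the depth is counted below it.
    """
    remainder = variable[len(prefix):] if variable.startswith(prefix) else variable
    return remainder.count("|")


print(level("Primary Energy"))                      # 0
print(level("Primary Energy|Coal"))                 # 1
print(level("Emissions|CO2", prefix="Emissions|"))  # 0

deep = "Emissions|CO2|Fossil Fuels and Industry|Energy Supply|Electricity"
print(level(deep, prefix="Emissions|"))             # 3
```

Under this reading, a filter with level 2 on Emissions|* keeps everything up to depth 2 below "Emissions|" and drops the Electricity variable at depth 3.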

To illustrate the functionality of the filters, we first show all sub-categories of the Emissions variable.
Then, we reduce variables to only two hierarchical levels below "Emissions|"; the list returned by the function call will not include Emissions|CO2|Fossil Fuels and Industry|Energy Supply|Electricity, because this variable is three hierarchical levels below "Emissions|".

The third example shows how to filter only by hierarchical level. The function returns all variables that are at the top hierarchical level (i.e., Primary Energy) and those at the first sub-category level. Keep in mind that there are no variables Emissions or Price (no top level).


In [9]:
df.variables(filters={'variable': 'Emissions|*'})


Out[9]:
['Emissions|CO2',
 'Emissions|CO2|Fossil Fuels and Industry',
 'Emissions|CO2|Fossil Fuels and Industry|Energy Supply',
 'Emissions|CO2|Fossil Fuels and Industry|Energy Supply|Electricity']

In [10]:
df.variables(filters={'variable': 'Emissions|*', 'level': 2})


Out[10]:
['Emissions|CO2',
 'Emissions|CO2|Fossil Fuels and Industry',
 'Emissions|CO2|Fossil Fuels and Industry|Energy Supply']

In [11]:
df.variables(filters={'level': 1})


Out[11]:
['Emissions|CO2', 'Price|Carbon', 'Primary Energy', 'Primary Energy|Coal']

Filtering by year

Filtering for years can be done by integer number, a list of integers, or the Python class range.
Note that the last year of a range is not included, so range(2010,2015) is interpreted as [2010, 2011, 2012, 2013, 2014].
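
The three accepted forms can be reduced to a common set of years, as in the sketch below. The helpers normalize_years and keep_year are assumptions for illustration and do not come from the package.

```python
def normalize_years(years):
    """Hypothetical helper: accept an int, a list of ints, or a range."""
    if isinstance(years, int):
        return {years}
    return set(years)


def keep_year(year, years):
    """True if the given year passes the year filter."""
    return year in normalize_years(years)


print(sorted(normalize_years(range(2010, 2015))))  # [2010, 2011, 2012, 2013, 2014]
print(keep_year(2015, range(2010, 2015)))          # False, the end of a range is excluded
print(keep_year(2015, range(2010, 2016)))          # True
```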

Getting help

When in doubt, you can look at the help for any function by appending a ? to its name.


In [12]:
df.models?
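
The ? suffix is specific to IPython and Jupyter. In a plain Python session, the built-in help() function or inspect.getdoc() give access to the docstring instead. The sketch below uses the built-in function sorted as a stand-in for a method such as df.models.

```python
import inspect

# sorted is only a stand-in here; any function or method object works the same way.
doc = inspect.getdoc(sorted)
print(doc.splitlines()[0])

# help(sorted) would print the full, formatted help text.
```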

Working with Timeseries

As a next step, we want to view a selection of the data in the tutorial snapshot using the IAMC standard. The filtered data can be exported as a csv file by appending .to_csv('selected_data.csv') to the next command.

For displaying data in a different format, the class IamDataFrame has a wrapper of the pandas.DataFrame.pivot_table() function, which lets you flexibly specify the columns and rows. The function automatically aggregates by summation or counting (specified by the parameter aggfunc) over all timeseries data identifiers ('model', 'scenario', 'variable', 'region', 'unit', 'year') which are not used as index or columns.

In the pivot table example below, the data is filtered for the variable 'Primary Energy' in the region 'World'; the values are then summed over all models for each scenario and year.
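
What "aggregating over all identifiers not used as index or columns" means can be seen in a small standard-library sketch. The records are made up for illustration (they are not values from the snapshot), and pivot_sum is a hypothetical helper, not the package's wrapper.

```python
from collections import defaultdict

# Made-up records for illustration only (not values from the snapshot).
records = [
    {"model": "model_a", "scenario": "scen_1", "year": 2005, "value": 1.0},
    {"model": "model_b", "scenario": "scen_1", "year": 2005, "value": 2.0},
    {"model": "model_a", "scenario": "scen_2", "year": 2005, "value": 4.0},
    {"model": "model_a", "scenario": "scen_1", "year": 2010, "value": 8.0},
]


def pivot_sum(rows, index, columns):
    """Sum 'value' for every (index, columns) pair; all other identifiers collapse."""
    table = defaultdict(float)
    for row in rows:
        table[(row[index], row[columns])] += row["value"]
    return dict(table)


table = pivot_sum(records, index="year", columns="scenario")
print(table[(2005, "scen_1")])  # 3.0, summed over model_a and model_b
```

Because 'model' is neither the index nor a column, the two models reporting scen_1 in 2005 end up in the same cell.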


In [13]:
df.timeseries(filters={
    'scenario': 'AMPERE3-450', 
    'variable': 'Primary Energy|Coal', 
    'region': 'World'
}).head()


Out[13]:
year                                                            2005    2010    2020    2030    2040    2050    2060    2070    2080    2090    2100
model      scenario         region  variable             unit
GCAM 3.0   AMPERE3-450      World   Primary Energy|Coal  EJ/yr  120.76  144.95  176.44  204.42  212.84  186.02  138.23  106.98  82.44   36.55   14.89
           AMPERE3-450P-CE  World   Primary Energy|Coal  EJ/yr  120.76  144.95  178.98  218.24  213.35  192.45  142.64  108.72  82.73   36.89   15.22
           AMPERE3-450P-EU  World   Primary Energy|Coal  EJ/yr  120.76  144.95  189.86  241.25  224.25  191.70  136.72  102.51  80.72   35.70   14.45
IMAGE 2.4  AMPERE3-450      World   Primary Energy|Coal  EJ/yr  111.62  138.69  148.60  121.24  102.62  101.41  111.41  138.40  181.03  224.03  264.77
           AMPERE3-450P-CE  World   Primary Energy|Coal  EJ/yr  111.62  138.69  161.72  154.18  125.14  105.32  120.83  151.58  192.67  249.50  294.40

In [14]:
df.pivot_table(
    index=['year'], 
    columns=['scenario'], 
    values='value', 
    aggfunc='sum',
    filters={'variable': 'Primary Energy', 'region': 'World'}
).head()


Out[14]:
scenario AMPERE3-450 AMPERE3-450P-CE AMPERE3-450P-EU AMPERE3-550 AMPERE3-550P-EU AMPERE3-Base-EUback AMPERE3-CF450P-EU AMPERE3-RefPol EMF27-450-Conv EMF27-450-NoCCS EMF27-550-LimBio EMF27-Base-FullTech EMF27-G8-EERE
year
2005 1821.09 1366.48 1821.09 1818.71 464.82 922.58 925.23 1818.44 2234.35 1381.84 3130.81 3130.60 868.79
2010 1972.13 1492.28 1972.02 1969.57 514.07 1015.78 1018.44 1969.50 2504.99 1542.90 3457.08 3459.28 985.16
2020 2253.49 1787.40 2399.41 2322.23 611.34 1258.24 1262.07 2401.37 2428.61 1424.26 3781.16 4135.65 947.08
2030 2530.95 2101.60 2863.85 2670.22 734.11 1532.12 1536.54 2869.96 2545.94 1470.64 4057.28 4846.37 933.08
2040 2795.47 2206.09 2940.96 3000.37 789.70 1802.62 1574.14 3305.70 2698.99 1670.51 4355.16 5588.19 1007.83

If you are familiar with the python package pandas, you can access the pd.DataFrame directly.


In [15]:
df.data.head()


Out[15]:
   model            scenario        region  variable                                 unit       year  value
0  AIM-Enduse 12.1  EMF27-450-Conv  ASIA    Emissions|CO2                            Mt CO2/yr  2005  10540.74
1  AIM-Enduse 12.1  EMF27-450-Conv  ASIA    Emissions|CO2|Fossil Fuels and Industry  Mt CO2/yr  2005  9126.18
2  AIM-Enduse 12.1  EMF27-450-Conv  ASIA    Primary Energy                           EJ/yr      2005  133.56
3  AIM-Enduse 12.1  EMF27-450-Conv  LAM     Emissions|CO2                            Mt CO2/yr  2005  3285.00
4  AIM-Enduse 12.1  EMF27-450-Conv  LAM     Emissions|CO2|Fossil Fuels and Industry  Mt CO2/yr  2005  1422.06

Plotting Timeseries

As a next step, we want to visualize timeseries data. In the plot below, we show CO2 emissions over time for all scenarios provided in the tutorial snapshot data.


In [16]:
df.plot_lines({'variable': 'Emissions|CO2', 'region': 'World'})


Validating and querying timeseries data

When analyzing scenario results, it is often useful to check whether certain timeseries exist or the values are within a specific range. For example, it may make sense to ensure that reported data for historical periods are close to established reference data.

The following section provides three illustrations:

  1. Check whether a timeseries 'Primary Energy' exists in each scenario (in at least one year).
  2. Check for every scenario whether the value for 'Primary Energy' at the global level exceeds 515 EJ/y in the reference year 2010 (the value must satisfy an upper bound of 515 EJ/y in this notation).
  3. Check for every scenario whether the value for 'Primary Energy|Coal' exceeds 400 EJ/y in mid-century (year 2050).

The validate() function takes a filters dictionary to perform the checks on a selection of models/scenarios similar to the functions introduced above.
The criteria argument can specify a valid range by an upper and lower bound (up, lo) for a variable and a subset of years to which the validation is applied. Any scenario with a value outside that range in at least one year is considered to fail the validation.

By setting the argument exclude=True, all scenarios failing the validation will be categorized as exclude. These scenarios will not be shown by default in any subsequent data tables or plots.
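
The logic of such a check can be sketched without the package. The records below are made up for illustration, find_violations is a hypothetical helper, and treating the bounds as inclusive is an assumption of this sketch rather than documented behaviour of validate().

```python
# Made-up records for illustration only (not values from the snapshot).
records = [
    {"scenario": "scen_1", "variable": "Primary Energy", "year": 2010, "value": 500.0},
    {"scenario": "scen_2", "variable": "Primary Energy", "year": 2010, "value": 520.0},
    {"scenario": "scen_2", "variable": "Primary Energy", "year": 2020, "value": 600.0},
]


def find_violations(rows, variable, up=None, lo=None, year=None):
    """Hypothetical helper: return the data points outside the range [lo, up]."""
    failed = []
    for row in rows:
        if row["variable"] != variable:
            continue
        if year is not None and row["year"] != year:
            continue  # the check is restricted to the requested year
        too_high = up is not None and row["value"] > up
        too_low = lo is not None and row["value"] < lo
        if too_high or too_low:
            failed.append(row)
    return failed


violations = find_violations(records, "Primary Energy", up=515, year=2010)
print([row["scenario"] for row in violations])  # ['scen_2']
```

The 2020 value of scen_2 also exceeds the bound, but it is not reported because the check is limited to 2010.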


In [17]:
df.validate?

In [18]:
df.validate('Primary Energy')


INFO:root:48 scenarios satisfy the criteria

In [19]:
df.validate({'Primary Energy': {'up': 515, 'year': 2010}})


INFO:root:9 data points do not satisfy the criteria (out of 48 scenarios)
Out[19]:
                                                                           value
model            scenario             region  variable        unit   year
AIM-Enduse 12.1  EMF27-450-Conv       World   Primary Energy  EJ/yr  2010  518.89
                 EMF27-450-NoCCS      World   Primary Energy  EJ/yr  2010  518.81
                 EMF27-550-LimBio     World   Primary Energy  EJ/yr  2010  518.81
                 EMF27-Base-FullTech  World   Primary Energy  EJ/yr  2010  518.81
                 EMF27-G8-EERE        World   Primary Energy  EJ/yr  2010  518.64
REMIND 1.5       EMF27-450-Conv       World   Primary Energy  EJ/yr  2010  519.64
                 EMF27-450-NoCCS      World   Primary Energy  EJ/yr  2010  519.64
                 EMF27-550-LimBio     World   Primary Energy  EJ/yr  2010  519.64
                 EMF27-Base-FullTech  World   Primary Energy  EJ/yr  2010  519.64

In [20]:
df.validate(
    {'Primary Energy|Coal': {'up': 400, 'year': 2050}}, 
    filters={'region': 'World'}, 
    exclude=False
)


INFO:root:2 data points do not satisfy the criteria (out of 48 scenarios)
Out[20]:
                                                                            value
model        scenario             region  variable             unit   year
GCAM 3.0     AMPERE3-Base-EUback  World   Primary Energy|Coal  EJ/yr  2050  424.09
MERGE_EMF27  EMF27-Base-FullTech  World   Primary Energy|Coal  EJ/yr  2050  605.76

Categorization of scenarios by timeseries characteristics

It is often useful to group scenarios into categories according to specific characteristics of the timeseries data.

In the following example, we use the temperature change assessment by MAGICC 6 to group scenarios by the median global warming by the end of the century (year 2100).

We proceed in the following steps:

  1. Plot the timeseries data of the variable that we want to use. This provides some insights on useful thresholds for the categorization.
  2. Use the function category() to apply a categorization (and colour code for later use) to all scenarios that satisfy a number of specific criteria.
  3. Use the categorization of scenarios for analysis of other timeseries data.

In [21]:
df.plot_lines({'variable': 'Temperature*'})


We now use the categorization feature of the pyam-analysis package. By default, each model/scenario combination is labeled "uncategorized".

The next function resets all scenarios back to "uncategorized". This may be helpful in this tutorial if you are going back and forth between cells.


In [22]:
df.reset_category()

In [23]:
df.category(
    'Below 1.6C',
    {'Temperature|Global Mean|MAGICC6|MED': {'up': 1.6, 'year': 2100}},
    color='cornflowerblue',
    display='list'
)


INFO:root:4 scenarios categorized as 'Below 1.6C'
Out[23]:
model       scenario
GCAM 3.0    EMF27-450-Conv
            EMF27-450-NoCCS
REMIND 1.5  EMF27-450-Conv
            EMF27-450-NoCCS

In [24]:
df.category(
    'Below 2.0C',
    {'Temperature|Global Mean|MAGICC6|MED': {'up': 2.0, 'year': 2100}},
    filters={'category': 'uncategorized'}, 
    color='forestgreen'
)


INFO:root:8 scenarios categorized as 'Below 2.0C'

In [25]:
df.category(
    'Below 2.5C',
    {'Temperature|Global Mean|MAGICC6|MED': {'up': 2.5, 'year': 2100}},
    filters={'category': 'uncategorized'}, 
    color='gold'
)


INFO:root:16 scenarios categorized as 'Below 2.5C'

In [26]:
df.category(
    'Below 3.5C',
    {'Temperature|Global Mean|MAGICC6|MED': {'up': 3.5, 'year': 2100}},
    filters={'category': 'uncategorized'}, 
    color='firebrick'
)


INFO:root:3 scenarios categorized as 'Below 3.5C'

In [27]:
df.category(
    'Above 3.5C',
    {'Temperature|Global Mean|MAGICC6|MED': {}},
    filters={'category': 'uncategorized'}, 
    color='magenta'
)


INFO:root:9 scenarios categorized as 'Above 3.5C'

Two models included in the snapshot have not been assessed by MAGICC6 regarding their long-term climate and warming impact. Therefore, the timeseries 'Temperature|Global Mean|MAGICC6|MED' does not exist, and they have not been categorized.
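
The cascade of category() calls above can be pictured with the following sketch, which also shows why scenarios without the temperature variable stay uncategorized. The temperatures are made up for illustration, categorize is a hypothetical helper, and treating the upper bounds as inclusive is an assumption of this sketch.

```python
# Made-up 2100 temperatures for illustration; None means the variable is missing.
temperature_2100 = {"scen_1": 1.5, "scen_2": 1.9, "scen_3": 4.2, "scen_4": None}

# Upper bounds in the order in which the categories are assigned above.
thresholds = [("Below 1.6C", 1.6), ("Below 2.0C", 2.0),
              ("Below 2.5C", 2.5), ("Below 3.5C", 3.5)]


def categorize(temperatures):
    """Hypothetical helper: assign each scenario to the first category it satisfies."""
    categories = {name: "uncategorized" for name in temperatures}
    for label, upper in thresholds:
        for name, value in temperatures.items():
            # Only scenarios that are still uncategorized are considered.
            if categories[name] == "uncategorized" and value is not None and value <= upper:
                categories[name] = label
    # Catch-all: every remaining scenario that reports the variable at all.
    for name, value in temperatures.items():
        if categories[name] == "uncategorized" and value is not None:
            categories[name] = "Above 3.5C"
    return categories


result = categorize(temperature_2100)
print(result)
```

Restricting each step to uncategorized scenarios is what keeps a scenario below 1.6C from being relabeled by the later, looser thresholds.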

Below, we display all scenarios that are uncategorized at this point.


In [28]:
df.category('uncategorized', display='list')


Out[28]:
category       model            scenario
uncategorized  AIM-Enduse 12.1  EMF27-450-Conv
                                EMF27-450-NoCCS
                                EMF27-550-LimBio
                                EMF27-Base-FullTech
                                EMF27-G8-EERE
               WITCH_EMF27      EMF27-450-Conv
                                EMF27-550-LimBio
                                EMF27-Base-FullTech

Now, we again display the median global temperature increase for all scenarios, but we use the colouring by category to illustrate the common characteristics across scenarios.


In [29]:
df.plot_lines({'variable': 'Temperature*'}, color_by_cat=True)


As a last step, we display the aggregate CO2 emissions by category. This highlights alternative pathways within the same category.


In [30]:
df.plot_lines(
    {'variable': 'Emissions|CO2', 'region': 'World'}, 
    color_by_cat=True
)