Tools for specific data sources

“Centre d’études prospectives et d’informations internationales” (CEPII) (tools.cepii)

Handle data from CEPII.

CEPII is the “Centre d’études prospectives et d’informations internationales” (fr).

class message_ix_models.tools.cepii.BACI(*args, **kwargs)[source]

Provider of data from the BACI data source.

BACI is the “Base pour l’Analyse du Commerce International” (fr). The source is documented at:

Currently the class supports:

  • The 202501 release only.

  • The 1992 Harmonized System (HS92) only.

Todo

  • Aggregate to MESSAGE regions.

  • Test with additional HS categorizations.

  • Test with additional releases.

class Options(aggregate: bool = False, interpolate: bool = False, measure: str = 'quantity', name: str = '', dims: tuple[str, ...] = ('t', 'i', 'j', 'k'), filter_pattern: dict[str, 'str | Pattern'] = <factory>, test: bool = False)[source]
aggregate: bool = False

By default, do not aggregate.

dims: tuple[str, ...] = ('t', 'i', 'j', 'k')

Dimensions for the returned Key/Quantity.

Per the BACI README file, these are:

  • “t”: year

  • “i”: exporter

  • “j”: importer

  • “k”: product

filter_pattern: dict[str, str | Pattern]

Regular expressions for filtering on any of dims. Keys must be in dims; values must be regular expressions or compiled re.Pattern that fullmatch the str representation of labels on the respective dimension.

For example, filter_pattern=dict(k="270(4..|576)") matches any 6-digit label on the \(k\) dimension starting with ‘2704’, or the exact label ‘270576’.
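The fullmatch behaviour described above can be checked with the standard library alone. This is a minimal sketch: the labels are hypothetical and not taken from the BACI data, and it does not show how the class itself applies the filter.

```python
import re

# Pattern from the example above; fullmatch() requires the whole label to match
pattern = re.compile(r"270(4..|576)")

# Hypothetical product labels, given as str
labels = ["270400", "270499", "270576", "270577", "2704001", "127040"]

# Keep only labels that the pattern matches in full
matched = [label for label in labels if pattern.fullmatch(label)]
print(matched)  # ['270400', '270499', '270576']
```

Labels that only contain the pattern as a prefix or substring, such as ‘2704001’ or ‘127040’, are not matched.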

interpolate: bool = False

By default, do not interpolate.

measure: str = 'quantity'

Either “quantity” or “value”.

test: bool = False

Set to True to use test data from the message_ix_models repository.

get() AnyQuantity[source]

Return the raw data.

This method performs the following steps:

  1. If needed, retrieve the data archive from pooch.SOURCE using the entry “CEPII_BACI”. The file is stored in the Config.cache_path, and is about 2.2 GiB.

  2. If needed, extract all the members of the archive to a …/cepii-baci/ subdirectory of the cache directory. The extracted size is about 7.9 GiB, containing about 2.6 × 10⁸ observations.

  3. Call baci_data_from_files() to read the data files and apply Options.measure and Options.filter_pattern. The function is decorated with cached(), so identical parameters and file paths result in a cache hit.

  4. Convert to genno.Quantity and return.

options: Options

Instance of the Options class.

A concrete class that overrides Options should redefine this attribute, to facilitate type checking.

transform(c: genno.Computer, base_key: Key) Key[source]

Prepare c to transform raw data from base_key.

  1. Map BACI codes for the \((i, j)\) dimensions from numeric (mainly ISO 3166-1 numeric) to ISO 3166-1 alpha-3. See get_mapping().

message_ix_models.tools.cepii.COUNTRY_CODES = [(58, 'BEL'), (251, 'FRA'), (490, 'S19'), (530, 'ANT'), (579, 'NOR'), (699, 'IND'), (711, 'ZA1'), (736, 'SDN'), (757, 'CHE'), (842, 'USA'), (849, 'PUS'), (891, 'SCG')]

Labels appearing in the \((i, j)\) dimensions of the BACI data that are not current ISO 3166-1 numeric codes. These are generally of 3 kinds:

  • Numeric codes that are in ISO 3166-3 (“Code for formerly used names of countries”), not ISO 3166-1.

  • Numeric codes for countries that exist in ISO 3166-1, but for which BACI uses a different number. For example, ISO has 250 for “France”, but BACI uses 251.

  • Numeric codes for countries or country groups that do not appear in ISO 3166.

This is a subset of the labels appearing in the country_code column of the file country_codes_V202501.csv in the archive BACI_HS92_V202501.zip. Only the labels appearing in the data files are included.

message_ix_models.tools.cepii.DTYPE = {'i': <class 'numpy.uint16'>, 'j': <class 'numpy.uint16'>, 'k': <class 'numpy.uint32'>, 't': <class 'numpy.uint16'>}

Dimensions and data types for input data. In order to reduce memory and disk usage:

  • np.uint16 (0 to 65_535) is used for t (year), i (exporter), and j (importer)

  • np.uint32 (0 to 4_294_967_295) is used for k (product), since these values can be as large as 999_999.
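The choice of widths above follows from simple arithmetic, sketched here in plain Python. The helper max_unsigned() is hypothetical and is not part of the module.

```python
def max_unsigned(bits: int) -> int:
    """Largest value representable by an unsigned integer of `bits` bits."""
    return 2**bits - 1


# Upper bound for 6-digit labels on the k (product) dimension
largest_product_code = 999_999

print(max_unsigned(16))  # 65535
print(largest_product_code <= max_unsigned(16))  # False: uint16 is too narrow
print(largest_product_code <= max_unsigned(32))  # True: uint32 is sufficient
```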

message_ix_models.tools.cepii.baci_data_from_files(paths: list[Path], measure: str, filters: dict[str, str | Pattern]) DataFrame[source]

Read the BACI data from files.

dask.dataframe.read_csv() and pyarrow are used for better performance. DTYPE is used to specify columns and dtypes.

Data returned by this function is cached using cached(); see also SKIP_CACHE.

message_ix_models.tools.cepii.get_mapping() MappingAdapter[source]

Return an adapter from codes appearing in BACI data.

The BACI data for dimensions \(i\) (exporter) and \(j\) (importer) contain ISO 3166-1 numeric codes, plus some other idiosyncratic codes from COUNTRY_CODES. The returned adapter maps these to the corresponding alpha-3 code.

Using the adapter makes data suitable for aggregation using the message_ix_models node code lists, which include those alpha-3 codes as children of each region code.
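The idea behind the mapping can be sketched with plain dictionaries. This is only an illustration, not the MappingAdapter implementation: to_alpha3() is a hypothetical helper, BACI_SPECIFIC copies a few entries from COUNTRY_CODES, and ISO_NUMERIC holds a handful of standard ISO 3166-1 pairs chosen as examples.

```python
# A few entries copied from COUNTRY_CODES above
BACI_SPECIFIC = {251: "FRA", 757: "CHE", 842: "USA"}

# A few standard ISO 3166-1 numeric → alpha-3 pairs, for illustration only
ISO_NUMERIC = {124: "CAN", 276: "DEU", 392: "JPN"}


def to_alpha3(code: int) -> str:
    """Map a BACI numeric label to an alpha-3 code (hypothetical helper)."""
    if code in BACI_SPECIFIC:
        return BACI_SPECIFIC[code]
    # Raises KeyError for labels missing from both tables
    return ISO_NUMERIC[code]


print([to_alpha3(c) for c in (251, 276, 842)])  # ['FRA', 'DEU', 'USA']
```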

Global Fuel Economy Initiative (GFEI) (tools.gfei)

Handle data from the Global Fuel Economy Initiative (GFEI).

class message_ix_models.tools.gfei.GFEI(*args, **kwargs)[source]

Provider of exogenous data from the GFEI 2017 data source.

To use data from this source, call exo_data.prepare_computer() with the arguments:

  • source: “GFEI”.

  • source_kw including:

    • plot (optional, default False): add a task with the key “plot GFEI debug” to generate a diagnostic plot using Plot.

    • aggregate, interpolate: see ExoDataSource.transform().

The source data:

  • is derived from https://theicct.org/publications/gfei-tech-policy-drivers-2005-2017, specifically the data underlying “Figure 37. Fuel consumption range by type of powertrain and vehicle size, 2017”.

  • has resolution of individual countries.

  • corresponds to new vehicle registrations in 2017.

  • has units of megajoule / kilometre, converted from original litres of gasoline equivalent per 100 km.

Note

If source_kw[“aggregate”] is True, the aggregation performed is an unweighted sum(). To produce meaningful values for multi-country regions, instead perform a weighted mean using appropriate weights, for instance the vehicle activity for each country. The class currently does not do this automatically.
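The difference between the two aggregations can be illustrated with a small sketch. All values here are hypothetical, including the choice of countries and weights; the code does not use the GFEI class.

```python
# Hypothetical fuel economy [MJ / km] for two countries in one region
fuel_economy = {"AUT": 2.0, "DEU": 3.0}

# Hypothetical weights, for instance vehicle activity in each country
activity = {"AUT": 1.0, "DEU": 3.0}

# Unweighted sum: not a meaningful fuel economy for the region
unweighted_sum = sum(fuel_economy.values())

# Weighted mean: each country contributes in proportion to its weight
weighted_mean = sum(fuel_economy[n] * activity[n] for n in fuel_economy) / sum(
    activity.values()
)

print(unweighted_sum)  # 5.0
print(weighted_mean)  # 2.75
```

The weighted mean stays within the range of the per-country values, while the sum grows with the number of countries in the region.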

class Options(aggregate: bool = False, interpolate: bool = False, measure: str = '', name: str = 'fuel economy', dims: tuple[str, ...] = ('n', 'y'), plot: bool = False)[source]
aggregate: bool = False

By default, do not aggregate.

interpolate: bool = False

By default, do not interpolate.

name: str = 'fuel economy'

Name for the returned quantity.

plot: bool = False

Also generate diagnostic plots.

get() AnyQuantity[source]

Return the data.

Implementations in concrete classes may load data from file, retrieve from remote sources or local caches, generate data, or anything else.

The Quantity returned by this method must have dimensions corresponding to key. If the original/upstream/raw data has different dimensionality (fewer or more dimensions; different dimension IDs), a concrete class must transform these, make appropriate selections, etc.

options: Options

Instance of the Options class.

A concrete class that overrides Options should redefine this attribute, to facilitate type checking.

transform(c: Computer, base_key: Key) Key[source]

Prepare c to transform raw data from base_key.

where: list['str | Path'] = ['private']

where keyword argument to path_fallback(). See _where().

class message_ix_models.tools.gfei.Plot[source]

Diagnostic plot of processed data.

basename = 'GFEI-fuel-economy-t'

File name base for saving the plot.

generate(data)[source]

Generate and return the plot.

A subclass of Plot must implement this method.

Parameters:

args (Sequence of pandas.DataFrame or other) –

One argument is given corresponding to each of the inputs.

Because plotnine operates on pandas data structures, save() automatically converts any Quantity inputs to pandas.DataFrame before they are passed to generate().

International Energy Agency (IEA) (tools.iea)

The IEA publishes many kinds of data. Each distinct data source is handled by a separate submodule of message_ix_models.tools.iea.

Documentation for all module contents:

iea

Tools for working with IEA data and structures.

Energy efficiency indicators (tools.iea.eei)

See IEA_EEI. This data is produced by the IEA and retrieved from the Energy Efficiency Indicators database. It is proprietary.

The data:

  • Has the geographic resolution of individual countries, and scope including 41 countries:

    • 24 IEA member countries for which data covering most end-uses are available: Australia, Austria, Belgium, Canada, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Italy, Japan, Korea, Luxembourg, the Netherlands, New Zealand, Poland, Portugal, Slovak Republic, Spain, Switzerland, the United Kingdom and the United States.

    • Others including Brazil, Chile, Lithuania, Morocco, Armenia, Azerbaijan, Belarus, Georgia, Kazakhstan, Kyrgyzstan, Republic of Moldova, Ukraine, Uzbekistan.

  • Includes measures/variables for energy consumption, efficiency, carbon emissions, and others for four conceptual sectors: Residential, Services, Industry and Transport.

  • The December 2020 edition covers the time periods 2000–2018 with annual resolution.

Note

Currently, iea.eei mainly retrieves and processes data useful for MESSAGEix-Transport. To retrieve other end-use sectoral data, the code can be extended.

(Extended) World Energy Balances (tools.iea.web)

Note

These data are proprietary and require a paid subscription.

The approach to handling proprietary data is the same as in project.advance and project.ssp:

  • Copies of the data are stored in the (private) message-static-data repository using Git LFS. This repository is accessible only to users who have a license for the data.

  • message_ix_models contains only a ‘fuzzed’ version of the data (same structure, random values) for testing purposes.

  • Non-IIASA users must obtain their own license to access and use the data; obtain the data themselves; and place it on the system where they use message_ix_models.

The module message_ix_models.tools.iea.web attempts to detect and support both the providers/formats described below. The code supports using data from any of the above locations and formats, in multiple ways:

The documentation for the 2023 edition of the IEA source/format is publicly available.

Structure

The data have the following conceptual dimensions, each enumerated by a different list of codes:

  • FLOW, PRODUCT: for both of these, the lists of codes appearing in the data are the same from 2021 to 2023 inclusive.

  • COUNTRY: The data provided by IEA directly contain codes that are all caps, abbreviated country names, for instance ‘DOMINICANR’. The data provided by the OECD contain ISO 3166-1 alpha-3 codes, for instance ‘DOM’. In both cases, there are additional labels denoting country groupings; these are defined in the documentation linked above.

    Changes visible in these lists include:

    • 2022 → 2023:

      • New codes: ASEAN, BFA, GREENLAND, MALI, MRT, PSE, TCD.

      • Removed: MASEAN.

    • 2021 → 2022:

      • New codes: GNQ, MDG, MKD, RWA, SWZ, UGA.

      • Removed: EQGUINEA, GREENLAND, MALI, MBURKINAFA, MCHAD, MMADAGASCA, MMAURITANI, MPALESTINE, MRWANDA, MUGANDA, NORTHMACED.

    See the transform=... source keyword argument and IEA_EWEB.transform() for different methods of handling this dimension.

  • TIME: always a year.

  • UNIT_MEASURE (not labeled): unit of measurement, either ‘TJ’ or ‘ktoe’.

message_ix_models is packaged with SDMX structure data (stored in message_ix_models/data/sdmx/) comprising code lists extracted from the raw data for the COUNTRY, FLOW, and PRODUCT dimensions. These can be used with other package utilities, for instance:

>>> from message_ix_models.util.sdmx import read

# Read a code list from file: codes used in the
# 2022 edition data from the OECD provider
>>> cl = read("IEA:PRODUCT_OECD(2022)")

# Show some of its elements
>>> print("\n".join(sorted(cl.items)[:5]))
ADDITIVE
ANTCOAL
AVGAS
BIODIESEL
BIOGASES

The documentation linked above has full descriptions of each code.

IEA provider/format

From 2023 (or earlier), the data are provided directly on the IEA website at https://www.iea.org/data-and-statistics/data-product/world-energy-balances. These data are available in two formats: ‘IVT’ or “Beyond 20/20” format (not supported by this module), or fixed-width text files. The latter are characterized by:

  • Multiple ZIP archives with names like WBIG[12].zip, each containing a portion of the data and typically 110–130 MiB compressed size

  • …each containing a single, fixed-width TXT file with a name like WORLDBIG[12].TXT, typically 3–4 GiB uncompressed,

  • …with no column headers, but data resembling:

    WORLD  HARDCOAL  1960  INDPROD  KTOE ..
    

    …that appear to correspond, respectively, to the COUNTRY, PRODUCT, TIME, FLOW, and MEASURE dimensions and “Value” column of the above data.
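A row like the one above can be split into named fields as in the following sketch. It rests on several assumptions that the text does not confirm: that fields never contain white space (so splitting on white space stands in for true fixed-width parsing), that the column order is as inferred above, and that “..” marks a missing value. parse_row() is a hypothetical helper, not part of message_ix_models.tools.iea.web.

```python
# Column names in the order inferred above (an assumption)
COLUMNS = ("COUNTRY", "PRODUCT", "TIME", "FLOW", "MEASURE", "Value")


def parse_row(line: str) -> dict:
    """Split one header-less row into named fields (hypothetical helper)."""
    fields = dict(zip(COLUMNS, line.split()))
    # TIME is always a year
    fields["TIME"] = int(fields["TIME"])
    # Assumption: ".." marks a missing value
    raw = fields["Value"]
    fields["Value"] = None if raw == ".." else float(raw)
    return fields


row = parse_row("WORLD  HARDCOAL  1960  INDPROD  KTOE ..")
print(row["COUNTRY"], row["TIME"], row["Value"])  # WORLD 1960 None
```

For files of 3–4 GiB, a chunked or columnar reader would be needed in practice; the sketch only shows the mapping from positions to dimensions.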

OECD provider/format

Up until 2023, the EWEB data were available from the OECD iLibrary with DOI 10.1787/enestats-data-en. These files were characterized by:

  • Single ZIP archives with names like cac5fa90-en.zip; typically ~850 MiB compressed size,

  • …containing a single CSV file with a name like WBIG_2022-2022-1-EN-20230406T100006.csv, typically >20 GiB uncompressed,

  • …with a particular list of columns like: “MEASURE”, “Unit”, “COUNTRY”, “Country”, “PRODUCT”, “Product”, “FLOW”, “Flow”, “TIME”, “Time”, “Value”, “Flag Codes”, “Flags”,

  • …with contents that duplicated code IDs—for instance, in the “FLOW” column—with human-readable labels—for instance in the “Flow” column:

    Column name    Example value
    -----------    --------------------------------
    MEASURE [1]    KTOE
    Unit           ktoe
    COUNTRY        WLD
    Country        World
    PRODUCT        COAL
    Product        Coal and coal products
    FLOW           INDPROD
    Flow           Production
    TIME           2012
    Time           2012
    Value          1234.5678
    Flag Codes     M
    Flags          Missing value; data cannot exist

This source is discontinued and will not publish subsequent editions of the data.

NewClimate Institute (tools.newclimate)

Handle data from the NewClimate Institute’s Climate Policy Database (CPDB).

This module provides:

These enable programmatic use of the information in the database. For example:

from message_ix_models.tools.newclimate import SECTOR, get
from pycountry import countries

# Fetch and parse the 2024 edition of the database
policies = get("2024")
print(len(policies))  # 6507 objects

# Filter the dict to a list of policy objects matching a certain sector
p_transport = list(filter(lambda p: SECTOR.Transport in p.sector, policies.values()))
print(len(p_transport))  # 1298 objects

# Filter for any policies concerning the country of Austria, or the EU
match = {countries.lookup("Austria"), "European Union"}
p_AUT = list(filter(lambda p: set(p.geo) & match, policies.values()))
print(len(p_AUT))  # 259 objects

Todo

Extend the module:

  • Serialize NewClimatePolicy objects in 1 or more formats, preferably standards-based.

  • fetch() versions of the database more recent than the latest Zenodo record, using the cpdb_api package or other code.

  • Convert to/from other data models.

class message_ix_models.tools.newclimate.HIGH_IMPACT(*values)[source]

Enumeration for NewClimatePolicy.high_impact.

Todo

If a codebook is available identifying what criteria were used to assign these codes to the primary source data, reference or quote the definitions for each code.

Do the same for the other enumerations in this module.

unclear = 5

NB both ‘unclear’ and ‘Unclear’ appear in the 2025 draft database as of 2026-04-17.

class message_ix_models.tools.newclimate.JURISDICTION(*values)[source]

Enumeration for NewClimatePolicy.jurisdiction.

class message_ix_models.tools.newclimate.NewClimatePolicy(country_update: ~message_ix_models.tools.newclimate.structure.UPDATE, decision_date: str, description: str, end_date: str, high_impact: ~message_ix_models.tools.newclimate.structure.HIGH_IMPACT, id: str, impact_indicators_base_year: str, impact_indicators_comments: str, impact_indicators_name: str, impact_indicators_target_year: str, impact_indicators_value: str, instrument: str, jurisdiction: ~message_ix_models.tools.newclimate.structure.JURISDICTION, last_update: str, name: str, objective: list[~message_ix_models.tools.newclimate.structure.OBJECTIVE], reference: str, sector: list[~message_ix_models.tools.newclimate.structure.SECTOR], start_date: str, status: ~message_ix_models.tools.newclimate.structure.STATUS, stringency: ~message_ix_models.tools.newclimate.structure.STRINGENCY, title: str, type: list[~message_ix_models.tools.newclimate.structure.TYPE], geo: list[str | Country] = <factory>)[source]

Policy in the NewClimate data model.

Properties of this class match the column names appearing in the NewClimate CSV file format as of 2024, with the following exceptions:

  • The redundant prefix “policy_” is omitted, for instance “name” instead of “policy_name”.

  • geo for geography; see the attribute documentation.

  • impact_indicators_base_year and similar have an underscore, rather than period (“.”) in the name.

For some attributes such as country_update, the type is an enumeration: only members of the enumeration may be used. For others, such as objective, the type is a list of enumeration members, because the database contains multiple values, separated by commas. It is unclear if the order in the database is meaningful or not, so list (rather than set) is used to preserve the original order.
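The handling of comma-separated cells can be sketched as follows. The Sector enumeration and its members are hypothetical stand-ins for the enumerations in this module, and parse_members() is not part of the package.

```python
from enum import Enum


class Sector(Enum):
    """Hypothetical stand-in for an enumeration such as SECTOR."""

    Transport = 1
    Buildings = 2
    Industry = 3


def parse_members(cell: str) -> list[Sector]:
    """Split a comma-separated cell, preserving the order in the database."""
    return [Sector[part.strip()] for part in cell.split(",") if part.strip()]


print(parse_members("Transport, Industry"))
```

Looking up a name that is not a member raises KeyError, which reflects the rule that only members of the enumeration may be used.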

Todo

  • Add reference(s) to documentation of the data model, if any.

  • Add docstrings for individual fields, quoting the documentation.

  • Parse dates to Python datetime objects.

property country: Country

Return a pycountry object from geo.

Raises ValueError if none exists.

country_update: UPDATE

Country update.

decision_date: str

MAY be empty.

description: str

MAY be empty.

end_date: str

End date.

classmethod from_csv_dict(data: dict[str, str]) NewClimatePolicy[source]

Create from a dict produced by csv.DictReader.

This method handles the following transformations:

  • Strip leading white space. Some cells have leading non-printing white space, like the UTF-8 byte-order mark (U+FEFF).

  • Replace “.” with “_”, since the former cannot be used in Python names. For example, “impact_indicators.base_year” becomes “impact_indicators_base_year”.

  • Remove the redundant prefix “policy_”.

  • Handle geographical fields. The CSV format has at least 5 fields that express geographical concepts, as well as older aliases:

    1. “city_or_local”, “city”: Identifier for a city or local geographical unit.

    2. “country”: name of a country.

    3. “country_iso”, “country_iso_code”: ISO 3166 alpha-3 code of a country.

    4. “subnational_region”, “subnational_region_or_state”: name of a region within a country.

    5. “supranational_region”, “supernational_region”: not used in the 2025 database. May be in use elsewhere. The name implies it is a name for parts or all of 2 or more countries.

    (1), (2), (4) and likely (5) are given in English.

    These are transformed into a single value for the NewClimatePolicy.geo field; see its documentation.

  • Handle older versions of field names appearing in 2022 and earlier database versions, per CSV_FIELD_ALIASES.
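The first three transformations above can be sketched for field names as follows. normalize_key() is a hypothetical helper; it does not reproduce from_csv_dict(), and it omits the handling of geographical fields and of CSV_FIELD_ALIASES.

```python
import string


def normalize_key(key: str) -> str:
    """Normalize one CSV field name (hypothetical helper)."""
    # Strip leading white space and the byte-order mark U+FEFF, which
    # str.lstrip() without arguments does not treat as white space
    key = key.lstrip("\ufeff" + string.whitespace)
    # "." cannot be used in Python names
    key = key.replace(".", "_")
    # Remove the redundant prefix
    return key.removeprefix("policy_")


print(normalize_key("\ufeffpolicy_name"))  # name
print(normalize_key("impact_indicators.base_year"))  # impact_indicators_base_year
```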

geo: list[str | Country]

Geography. MUST be length 1 or greater. Items MAY include:

  1. English name of a supranational region.

  2. A pycountry.db.Country object.

  3. English name of a subnational region.

  4. English name of a city or locality.

Some forms visible in the database include:

  • Only (1), for instance ["European Union"].

  • Only (2).

  • Both (2) and (3).

  • Both (2) and (4).

high_impact: HIGH_IMPACT

High impact.

id: str

Unique identifier for the policy.

impact_indicators_base_year: str

Impact indicator base year.

impact_indicators_comments: str

Impact indicator comments.

impact_indicators_name: str

Impact indicator name.

impact_indicators_target_year: str

Impact indicator target year.

impact_indicators_value: str

Impact indicator value.

instrument: str

Instrument.

jurisdiction: JURISDICTION

Jurisdiction.

last_update: str

Last update (of data in the database).

name: str

Name.

objective: list[OBJECTIVE]

Objective.

reference: str

MAY be empty.

sector: list[SECTOR]

Sector.

start_date: str

Start date.

status: STATUS

Status.

stringency: STRINGENCY

Stringency.

title: str

Title.

type: list[TYPE]

Type.

class message_ix_models.tools.newclimate.OBJECTIVE(*values)[source]

Enumeration for NewClimatePolicy.objective.

class message_ix_models.tools.newclimate.SECTOR(*values)[source]

Enumeration for NewClimatePolicy.sector.

class message_ix_models.tools.newclimate.STATUS(*values)[source]

Enumeration for NewClimatePolicy.status.

class message_ix_models.tools.newclimate.STRINGENCY(*values)[source]

Enumeration for NewClimatePolicy.stringency.

class message_ix_models.tools.newclimate.TYPE(*values)[source]

Enumeration for NewClimatePolicy.type.

class message_ix_models.tools.newclimate.UPDATE(*values)[source]

Enumeration for NewClimatePolicy.country_update.

NOTSET = 3

NB This member added only to accommodate database version 2022 and earlier, in which the respective field does not exist. In versions where the field does exist, its value MUST be one of the two other members.

message_ix_models.tools.newclimate.fetch(version: str) Path[source]

Retrieve data for version of the Climate Policy Database from Zenodo.

message_ix_models.tools.newclimate.get(version: str) dict[str, NewClimatePolicy][source]

fetch() and then read() data for version of the database.

message_ix_models.tools.newclimate.read(path: Path, **kwargs) dict[str, NewClimatePolicy][source]

Read a CSV file into a dict of NewClimatePolicy objects.

Returns:

Keys are NewClimatePolicy.id. If the file contains records with the same IDs, only the last appears, and a warning is logged.

Return type:

dict
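The treatment of duplicate IDs described above can be sketched as follows. index_by_id() is a hypothetical helper operating on plain dicts; it is not the implementation of read(), and the record contents are illustrative.

```python
import logging

log = logging.getLogger(__name__)


def index_by_id(records):
    """Return a dict keyed by "id"; later records replace earlier ones."""
    result = {}
    for record in records:
        if record["id"] in result:
            log.warning("Duplicate id %r; keeping the last record", record["id"])
        result[record["id"]] = record
    return result


# Hypothetical records, two of which share an ID
records = [
    {"id": "a", "name": "first"},
    {"id": "b", "name": "other"},
    {"id": "a", "name": "second"},
]
policies = index_by_id(records)
print(policies["a"]["name"])  # second
```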