Tools for specific data sources

“Centre d’études prospectives et d’informations internationales” (CEPII) (tools.cepii)

Handle data from CEPII.

CEPII is the “Centre d’études prospectives et d’informations internationales” (fr).

class message_ix_models.tools.cepii.BACI(*args, **kwargs)[source]

Provider of data from the BACI data source.

BACI is the “Base pour l’Analyse du Commerce International” (fr). The source is documented at:

Currently the class supports:

  • The 202501 release only.

  • The 1992 Harmonized System (HS92) only.

Todo

  • Aggregate to MESSAGE regions.

  • Test with additional HS categorizations.

  • Test with additional releases.

class Options(aggregate: bool = False, interpolate: bool = False, measure: str = 'quantity', name: str = '', dims: tuple[str, ...] = ('t', 'i', 'j', 'k'), filter_pattern: dict[str, 'str | Pattern'] = <factory>, test: bool = False)[source]
aggregate: bool = False

By default, do not aggregate.

dims: tuple[str, ...] = ('t', 'i', 'j', 'k')

Dimensions for the returned Key/Quantity.

Per the BACI README file, these are:

  • “t”: year

  • “i”: exporter

  • “j”: importer

  • “k”: product

filter_pattern: dict[str, str | Pattern]

Regular expressions for filtering on any of dims. Keys must be in dims; values must be regular expressions or compiled re.Pattern that fullmatch the str representation of labels on the respective dimension.

For example, filter_pattern=dict(k="270(4..|576)") matches any 6-digit label on the \(k\) dimension starting with ‘2704’, or the exact label ‘270576’.
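The matching rule in this example can be checked with the standard re module alone. The following is a minimal sketch, not the package's own filtering code: the helper label_matches() and the sample labels are hypothetical, and the only assumption carried over from the text is that the str representation of each label must be matched in full.

```python
import re

# Pattern from the example above
pattern = re.compile(r"270(4..|576)")


def label_matches(label, pattern) -> bool:
    """Return True if str(label) is matched in full (hypothetical helper)."""
    return re.fullmatch(pattern, str(label)) is not None


# Illustrative labels on the k dimension; these are not taken from the data
labels = [270400, 270499, 270576, 270577, 2704, 127040]
selected = [k for k in labels if label_matches(k, pattern)]
print(selected)
```

Because fullmatch is used rather than search, the 4-digit label 2704 and the label 127040 (which merely contains '2704') are both rejected.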

interpolate: bool = False

By default, do not interpolate.

measure: str = 'quantity'

Either “quantity” or “value”.

test: bool = False

Set to True to use test data from the message_ix_models repository.

get() AnyQuantity[source]

Return the raw data.

This method performs the following steps:

  1. If needed, retrieve the data archive from pooch.SOURCE using the entry “CEPII_BACI”. The file is stored in the Config.cache_path, and is about 2.2 GiB.

  2. If needed, extract all the members of the archive to a …/cepii-baci/ subdirectory of the cache directory. The extracted size is about 7.9 GiB, containing about 2.6 × 10⁸ observations.

  3. Call baci_data_from_files() to read the data files and apply Options.measure and Options.filter_pattern. The function is decorated with cached(), so identical parameters and file paths result in a cache hit.

  4. Convert to genno.Quantity and return.
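The “if needed” logic of step 2 can be sketched with the standard library. This is an illustration under stated assumptions, not the code used by BACI.get(): the function extract_if_needed(), the archive name, and the file contents are all made up, and the download in step 1 (handled through pooch) is not shown.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_if_needed(archive: Path, target_dir: Path) -> list[Path]:
    """Extract members of `archive` to `target_dir`, skipping existing files."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # Ignore directory entries; keep only file members
        names = [n for n in zf.namelist() if not n.endswith("/")]
        for name in names:
            if not (target_dir / name).exists():
                zf.extract(name, path=target_dir)
    return sorted(target_dir / n for n in names)


# Demonstrate with a tiny stand-in archive; name and contents are hypothetical
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    archive = cache / "example.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("data_1995.csv", "t,i,j,k,v,q\n1995,4,8,270400,1.0,2.0\n")

    first = extract_if_needed(archive, cache / "cepii-baci")
    # Second call finds the file already present and extracts nothing
    second = extract_if_needed(archive, cache / "cepii-baci")
    names_extracted = [p.name for p in second]
```

Returning the same list of paths on every call is what allows a downstream cache keyed on file paths and parameters, as in step 3, to produce a hit.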

options: Options

Instance of the Options class.

A concrete class that overrides Options should redefine this attribute, to facilitate type checking.

transform(c: genno.Computer, base_key: Key) Key[source]

Prepare c to transform raw data from base_key.

  1. Map BACI codes for the \((i, j)\) dimensions from numeric (mainly ISO 3166-1 numeric) to ISO 3166-1 alpha-3. See get_mapping().

message_ix_models.tools.cepii.COUNTRY_CODES = [(58, 'BEL'), (251, 'FRA'), (490, 'S19'), (530, 'ANT'), (579, 'NOR'), (699, 'IND'), (711, 'ZA1'), (736, 'SDN'), (757, 'CHE'), (842, 'USA'), (849, 'PUS'), (891, 'SCG')]

Labels appearing in the \((i, j)\) dimensions of the BACI data that are not current ISO 3166-1 numeric codes. These are generally of 3 kinds:

  • Numeric codes that are in ISO 3166-3 (“Code for formerly used names of countries”), not ISO 3166-1.

  • Numeric codes for countries that exist in ISO 3166-1, but simply differ. For example, ISO has 250 for “France”, but BACI uses 251.

  • Numeric codes for countries or country groups that do not appear in ISO 3166.

This is a subset of the labels appearing in the country_code column of the file country_codes_V202501.csv in the archive BACI_HS92_V202501.zip. Only the labels appearing in the data files are included.

message_ix_models.tools.cepii.DTYPE = {'i': <class 'numpy.uint16'>, 'j': <class 'numpy.uint16'>, 'k': <class 'numpy.uint32'>, 't': <class 'numpy.uint16'>}

Dimensions and data types for input data. In order to reduce memory and disk usage:

  • np.uint16 (0 to 65_535) is used for t (year), i (exporter), and j (importer)

  • np.uint32 (0 to 4_294_967_295) is used for k (product), since these values can be as large as 999_999.
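The effect of these choices can be estimated with simple arithmetic. The sketch below uses the standard struct module for the byte widths, takes the approximate observation count from the description of BACI.get(), and compares against 64-bit integers for all four columns; that baseline is an assumption chosen for illustration, and the measure columns are not counted.

```python
import struct

# Bytes per value for each integer width (standard sizes, no padding)
SIZE = {
    "uint16": struct.calcsize("<H"),
    "uint32": struct.calcsize("<I"),
    "int64": struct.calcsize("<q"),
}

# The largest product code does not fit in 16 bits, but does fit in 32
assert 999_999 > 2**16 - 1
assert 999_999 <= 2**32 - 1

n_obs = 2.6e8  # approximate number of observations

# t, i, j as uint16 plus k as uint32
compact = n_obs * (3 * SIZE["uint16"] + SIZE["uint32"])
# Assumed baseline: the same four columns stored as 64-bit integers
baseline = n_obs * 4 * SIZE["int64"]

print(f"compact: {compact / 2**30:.1f} GiB; 64-bit: {baseline / 2**30:.1f} GiB")
```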

message_ix_models.tools.cepii.baci_data_from_files(paths: list[Path], measure: str, filters: dict[str, str | Pattern]) DataFrame[source]

Read the BACI data from files.

dask.dataframe.read_csv() and pyarrow are used for better performance. DTYPE is used to specify columns and dtypes.

Data returned by this function is cached using cached(); see also SKIP_CACHE.

message_ix_models.tools.cepii.get_mapping() MappingAdapter[source]

Return an adapter from codes appearing in BACI data.

The BACI data for dimensions \(i\) (exporter) and \(j\) (importer) contain ISO 3166-1 numeric codes, plus some other idiosyncratic codes from COUNTRY_CODES. The returned adapter maps these to the corresponding alpha-3 code.

Using the adapter makes data suitable for aggregation using the message_ix_models node code lists, which include those alpha-3 codes as children of each region code.
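The idea behind the mapping can be sketched with a plain dict. This is not the MappingAdapter API: to_alpha_3() is a hypothetical helper, ISO_NUMERIC is a three-entry stand-in for the full ISO 3166-1 table, and only a subset of COUNTRY_CODES is repeated here.

```python
# Subset of COUNTRY_CODES: BACI-specific numeric code -> alpha-3 label
COUNTRY_CODES = [(251, "FRA"), (579, "NOR"), (699, "IND"), (757, "CHE"), (842, "USA")]

# Stand-in for the full ISO 3166-1 table (numeric -> alpha-3)
ISO_NUMERIC = {76: "BRA", 276: "DEU", 392: "JPN"}

# BACI-specific entries are merged with the standard ones
mapping = {**ISO_NUMERIC, **dict(COUNTRY_CODES)}


def to_alpha_3(code: int) -> str:
    """Return the alpha-3 label for a numeric BACI code (hypothetical helper)."""
    return mapping[code]


# (exporter, importer) pairs as they might appear on the (i, j) dimensions
pairs = [(251, 276), (842, 392)]
relabeled = [(to_alpha_3(i), to_alpha_3(j)) for i, j in pairs]
print(relabeled)
```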

Global Fuel Economy Initiative (GFEI) (tools.gfei)

Handle data from the Global Fuel Economy Initiative (GFEI).

class message_ix_models.tools.gfei.GFEI(*args, **kwargs)[source]

Provider of exogenous data from the GFEI 2017 data source.

To use data from this source, call exo_data.prepare_computer() with the arguments:

  • source: “GFEI”.

  • source_kw including:

    • plot (optional, default False): add a task with the key “plot GFEI debug” to generate a diagnostic plot using Plot.

    • aggregate, interpolate: see ExoDataSource.transform().

The source data:

  • is derived from https://theicct.org/publications/gfei-tech-policy-drivers-2005-2017, specifically the data underlying “Figure 37. Fuel consumption range by type of powertrain and vehicle size, 2017”.

  • has resolution of individual countries.

  • corresponds to new vehicle registrations in 2017.

  • has units of megajoule / kilometre, converted from original litres of gasoline equivalent per 100 km.

Note

If source_kw[“aggregate”] is True, the aggregation performed is an unweighted sum(). To produce meaningful values for multi-country regions, instead perform a weighted mean using appropriate weights, for instance the vehicle activity for each country. The class currently does not do this automatically.
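The difference between the two aggregations can be illustrated with plain Python. All numbers and the country labels “AAA” and “BBB” below are made up; the choice of vehicle activity as the weight follows the suggestion in the note, and this is not something the class does itself.

```python
# Hypothetical fuel economy [MJ / km] for two countries in one region
fuel_economy = {"AAA": 2.0, "BBB": 3.0}
# Hypothetical vehicle activity [vehicle-km], used as weights
activity = {"AAA": 300.0, "BBB": 100.0}

# Unweighted sum: grows with the number of countries, so it is not a
# meaningful intensity for the region
unweighted_sum = sum(fuel_economy.values())

# Activity-weighted mean: stays within the range of the country values
weighted_mean = sum(fuel_economy[n] * activity[n] for n in fuel_economy) / sum(
    activity[n] for n in fuel_economy
)

print(unweighted_sum, weighted_mean)
```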

class Options(aggregate: bool = False, interpolate: bool = False, measure: str = '', name: str = 'fuel economy', dims: tuple[str, ...] = ('n', 'y'), plot: bool = False)[source]
aggregate: bool = False

By default, do not aggregate.

interpolate: bool = False

By default, do not interpolate.

name: str = 'fuel economy'

Name for the returned quantity.

plot: bool = False

Also generate diagnostic plots.

get() AnyQuantity[source]

Return the data.

Implementations in concrete classes may load data from file, retrieve from remote sources or local caches, generate data, or anything else.

The Quantity returned by this method must have dimensions corresponding to key. If the original/upstream/raw data has different dimensionality (fewer or more dimensions; different dimension IDs), a concrete class must transform these, make appropriate selections, etc.

options: Options

Instance of the Options class.

A concrete class that overrides Options should redefine this attribute, to facilitate type checking.

transform(c: Computer, base_key: Key) Key[source]

Prepare c to transform raw data from base_key.

where: list['str | Path'] = ['private']

where keyword argument to path_fallback(). See _where().

class message_ix_models.tools.gfei.Plot[source]

Diagnostic plot of processed data.

basename = 'GFEI-fuel-economy-t'

File name base for saving the plot.

generate(data)[source]

Generate and return the plot.

A subclass of Plot must implement this method.

Parameters:

args (Sequence of pandas.DataFrame or other) –

One argument is given corresponding to each of the inputs.

Because plotnine operates on pandas data structures, save() automatically converts any Quantity inputs to pandas.DataFrame before they are passed to generate().

International Energy Agency (IEA) (tools.iea)

The IEA publishes many kinds of data. Each distinct data source is handled by a separate submodule of message_ix_models.tools.iea.

Documentation for all module contents:

iea

Tools for working with IEA data and structures.

Energy efficiency indicators (tools.iea.eei)

See IEA_EEI. This data is produced by the IEA and retrieved from the Energy Efficiency Indicators database. It is proprietary.

The data:

  • Has the geographic resolution of individual countries, and scope including 41 countries:

    • 24 IEA member countries for which data covering most end-uses are available: Australia, Austria, Belgium, Canada, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Italy, Japan, Korea, Luxembourg, the Netherlands, New Zealand, Poland, Portugal, Slovak Republic, Spain, Switzerland, the United Kingdom and the United States.

    • Others including Brazil, Chile, Lithuania, Morocco, Armenia, Azerbaijan, Belarus, Georgia, Kazakhstan, Kyrgyzstan, Republic of Moldova, Ukraine, Uzbekistan.

  • Includes measures/variables for energy consumption, efficiency, carbon emissions, and others for four conceptual sectors: Residential, Services, Industry and Transport.

  • The December 2020 edition covers the time periods 2000–2018 with annual resolution.

Note

Currently, iea.eei mainly retrieves and processes data useful for MESSAGEix-Transport. To retrieve other end-use sectoral data, the code can be extended.

(Extended) World Energy Balances (tools.iea.web)

Note

These data are proprietary and require a paid subscription.

The approach to handling proprietary data is the same as in project.advance and project.ssp:

  • Copies of the data are stored in the (private) message-static-data repository using Git LFS. This repository is accessible only to users who have a license for the data.

  • message_ix_models contains only a ‘fuzzed’ version of the data (same structure, random values) for testing purposes.

  • Non-IIASA users must obtain their own license to access and use the data; obtain the data themselves; and place it on the system where they use message_ix_models.
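What “same structure, random values” means can be sketched as follows. The function fuzz() and the sample records are hypothetical and only illustrate the idea; the procedure actually used to prepare the test data in message_ix_models may differ.

```python
import random


def fuzz(rows, value_key="Value", seed=0):
    """Return copies of `rows` with the numeric value replaced by a random one."""
    rng = random.Random(seed)
    return [{**row, value_key: rng.uniform(0.0, 1000.0)} for row in rows]


# Illustrative records; labels are kept, values are replaced
original = [
    {"COUNTRY": "WLD", "PRODUCT": "COAL", "TIME": 2012, "Value": 1234.5678},
    {"COUNTRY": "WLD", "PRODUCT": "COAL", "TIME": 2013, "Value": 2345.6789},
]
fuzzed = fuzz(original)
```

Code that only depends on the labels and dimensions of the data can then be tested against the fuzzed copy without access to the licensed values.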

The module message_ix_models.tools.iea.web attempts to detect and support both the providers/formats described below. The code supports using data from any of the above locations and formats, in multiple ways:

The documentation for the 2023 edition of the IEA source/format is publicly available.

Structure

The data have the following conceptual dimensions, each enumerated by a different list of codes:

  • FLOW, PRODUCT: for both of these, the lists of codes appearing in the data are the same from 2021 to 2023 inclusive.

  • COUNTRY: The data provided by IEA directly contain codes that are all caps, abbreviated country names, for instance ‘DOMINICANR’. The data provided by the OECD contain ISO 3166-1 alpha-3 codes, for instance ‘DOM’. In both cases, there are additional labels denoting country groupings; these are defined in the documentation linked above.

    Changes visible in these lists include:

    • 2022 → 2023:

      • New codes: ASEAN, BFA, GREENLAND, MALI, MRT, PSE, TCD.

      • Removed: MASEAN.

    • 2021 → 2022:

      • New codes: GNQ, MDG, MKD, RWA, SWZ, UGA.

      • Removed: EQGUINEA, GREENLAND, MALI, MBURKINAFA, MCHAD, MMADAGASCA, MMAURITANI, MPALESTINE, MRWANDA, MUGANDA, NORTHMACED.

    See the transform=... source keyword argument and IEA_EWEB.transform() for different methods of handling this dimension.

  • TIME: always a year.

  • UNIT_MEASURE (not labeled): unit of measurement, either ‘TJ’ or ‘ktoe’.

message_ix_models is packaged with SDMX structure data (stored in message_ix_models/data/sdmx/) comprising code lists extracted from the raw data for the COUNTRY, FLOW, and PRODUCT dimensions. These can be used with other package utilities, for instance:

>>> from message_ix_models.util.sdmx import read

# Read a code list from file: codes used in the
# 2022 edition data from the OECD provider
>>> cl = read("IEA:PRODUCT_OECD(2022)")

# Show some of its elements
>>> print("\n".join(sorted(cl.items)[:5]))
ADDITIVE
ANTCOAL
AVGAS
BIODIESEL
BIOGASES

The documentation linked above has full descriptions of each code.

IEA provider/format

From 2023 (or earlier), the data are provided directly on the IEA website at https://www.iea.org/data-and-statistics/data-product/world-energy-balances. These data are available in two formats: ‘IVT’ or “Beyond 20/20” format (not supported by this module), or fixed-width text files. The latter are characterized by:

  • Multiple ZIP archives with names like WBIG[12].zip, each containing a portion of the data and typically 110–130 MiB compressed size

  • …each containing a single, fixed-width TXT file with a name like WORLDBIG[12].TXT, typically 3–4 GiB uncompressed,

  • …with no column headers, but data resembling:

    WORLD  HARDCOAL  1960  INDPROD  KTOE ..
    

    …that appear to correspond to the COUNTRY, PRODUCT, TIME, FLOW, and MEASURE dimensions and “Value” column of the above data, respectively.
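A line in this layout can be split into named fields as in the sketch below. This is not the parser used by message_ix_models.tools.iea.web: parse_line() is hypothetical, it assumes that codes contain no whitespace (so that splitting on whitespace is sufficient instead of fixed column offsets), and it assumes that “..” marks a missing value.

```python
COLUMNS = ("COUNTRY", "PRODUCT", "TIME", "FLOW", "MEASURE", "Value")


def parse_line(line: str) -> dict:
    """Split one whitespace-separated line into a record (hypothetical helper)."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"Expected {len(COLUMNS)} fields, got {len(fields)}")

    record = dict(zip(COLUMNS, fields))
    record["TIME"] = int(record["TIME"])

    # Assumption: ".." denotes a missing value
    raw = record["Value"]
    record["Value"] = None if raw == ".." else float(raw)
    return record


record = parse_line("WORLD  HARDCOAL  1960  INDPROD  KTOE ..")
print(record)
```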

OECD provider/format

Up until 2023, the EWEB data were available from the OECD iLibrary with DOI 10.1787/enestats-data-en. These files were characterized by:

  • Single ZIP archives with names like cac5fa90-en.zip; typically ~850 MiB compressed size,

  • …containing a single CSV file with a name like WBIG_2022-2022-1-EN-20230406T100006.csv, typically >20 GiB uncompressed,

  • …with a particular list of columns like: “MEASURE”, “Unit”, “COUNTRY”, “Country”, “PRODUCT”, “Product”, “FLOW”, “Flow”, “TIME”, “Time”, “Value”, “Flag Codes”, “Flags”,

  • …with contents that duplicated code IDs—for instance, in the “FLOW” column—with human-readable labels—for instance in the “Flow” column:

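    ===========  ================================
    Column name  Example value
    ===========  ================================
    MEASURE [1]  KTOE
    Unit         ktoe
    COUNTRY      WLD
    Country      World
    PRODUCT      COAL
    Product      Coal and coal products
    FLOW         INDPROD
    Flow         Production
    TIME         2012
    Time         2012
    Value        1234.5678
    Flag Codes   M
    Flags        Missing value; data cannot exist
    ===========  ================================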
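Because each label column repeats the information in the corresponding code column, a reader of this format only needs the code columns and “Value”. The following is a minimal sketch using the standard csv module, not the module's own reader; the single sample row is made up from the column names and example values listed above, with the flag columns left empty.

```python
import csv
import io

# Columns holding code IDs, plus the value; label columns duplicate the codes
KEEP = ("MEASURE", "COUNTRY", "PRODUCT", "FLOW", "TIME", "Value")

# One illustrative row in the OECD column layout
SAMPLE = (
    "MEASURE,Unit,COUNTRY,Country,PRODUCT,Product,FLOW,Flow,TIME,Time,"
    "Value,Flag Codes,Flags\n"
    "KTOE,ktoe,WLD,World,COAL,Coal and coal products,INDPROD,Production,"
    "2012,2012,1234.5678,,\n"
)

# Keep only the code columns of every row
rows = [{c: row[c] for c in KEEP} for row in csv.DictReader(io.StringIO(SAMPLE))]
print(rows)
```

For a file of more than 20 GiB, the same column selection would be applied while streaming or reading in chunks rather than loading everything at once.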

This source is discontinued and will not publish subsequent editions of the data.