Tools for specific data sources
“Centre d’études prospectives et d’informations internationales” (CEPII) (tools.cepii)
Handle data from CEPII.
CEPII is the “Centre d’études prospectives et d’informations internationales” (fr).
- class message_ix_models.tools.cepii.BACI(*args, **kwargs)[source]
Provider of data from the BACI data source.
BACI is the “Base pour l’Analyse du Commerce International” (fr). The source is documented at:
https://www.cepii.fr/DATA_DOWNLOAD/baci/doc/baci_webpage.html
https://www.cepii.fr/CEPII/en/bdd_modele/bdd_modele_item.asp?id=37
Currently the class supports:
The 202501 release only.
The 1992 Harmonized System (HS92) only.
Todo
Aggregate to MESSAGE regions.
Test with additional HS categorizations.
Test with additional releases.
- class Options(aggregate: bool = False, interpolate: bool = False, measure: str = 'quantity', name: str = '', dims: tuple[str, ...] = ('t', 'i', 'j', 'k'), filter_pattern: dict[str, 'str | Pattern'] = <factory>, test: bool = False)[source]
- dims: tuple[str, ...] = ('t', 'i', 'j', 'k')
Dimensions for the returned Key/Quantity. Per the BACI README file, these are:
“t”: year
“i”: exporter
“j”: importer
“k”: product
- filter_pattern: dict[str, str | Pattern]
Regular expressions for filtering on any of dims. Keys must be in dims; values must be regular expressions or compiled re.Pattern that fullmatch the str representation of labels on the respective dimension.
For example, filter_pattern=dict(k="270(4..|576)") matches any 6-digit label on the \(k\) dimension starting with ‘2704’, or the exact label ‘270576’.
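The matching rule can be checked with the standard library alone. The following is a minimal sketch, not the class's own filtering code; the product labels are hypothetical values chosen for illustration.

```python
import re

# The pattern from the example above, for the product dimension "k"
pattern = re.compile(r"270(4..|576)")

# Hypothetical 6-digit product labels, as they might appear on the "k" dimension
labels = [270400, 270499, 270576, 270500, 127049]

# fullmatch() requires the whole string to match, not only a prefix or substring
selected = [k for k in labels if pattern.fullmatch(str(k))]
print(selected)  # [270400, 270499, 270576]
```

Because the whole label must match, ‘127049’ is rejected even though it contains ‘2704’.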
- get() AnyQuantity[source]
Return the raw data.
This method performs the following steps:
If needed, retrieve the data archive from pooch.SOURCE using the entry “CEPII_BACI”. The file is stored in the Config.cache_path, and is about 2.2 GiB.
If needed, extract all the members of the archive to a …/cepii-baci/ subdirectory of the cache directory. The extracted size is about 7.9 GiB, containing about 2.6 × 10⁸ observations.
Call baci_data_from_files() to read the data files and apply Options.measure and Options.filter_pattern. The function is decorated with cached(), so identical parameters and file paths result in a cache hit.
Convert to genno.Quantity and return.
- options: Options
Instance of the Options class.
A concrete class that overrides Options should redefine this attribute, to facilitate type checking.
- transform(c: genno.Computer, base_key: Key) Key[source]
Prepare c to transform raw data from base_key.
Map BACI codes for the \((i, j)\) dimensions from numeric (mainly ISO 3166-1 numeric) to ISO 3166-1 alpha_3. See
get_mapping().
- message_ix_models.tools.cepii.COUNTRY_CODES = [(58, 'BEL'), (251, 'FRA'), (490, 'S19'), (530, 'ANT'), (579, 'NOR'), (699, 'IND'), (711, 'ZA1'), (736, 'SDN'), (757, 'CHE'), (842, 'USA'), (849, 'PUS'), (891, 'SCG')]
Labels appearing in the \((i, j)\) dimensions of the BACI data that are not current ISO 3166-1 numeric codes. These are generally of 3 kinds:
Numeric codes that are in ISO 3166-3 (“Code for formerly used names of countries”), not ISO 3166-1.
Numeric codes for countries that exist in ISO 3166-1, but simply differ. For example, ISO has 250 for “France”, but BACI uses 251.
Numeric codes for countries or country groups that do not appear in ISO 3166.
This is a subset of the labels appearing in the country_code column of the file country_codes_V202501.csv in the archive BACI_HS92_V202501.zip. Only the labels appearing in the data files are included.
- message_ix_models.tools.cepii.DTYPE = {'i': <class 'numpy.uint16'>, 'j': <class 'numpy.uint16'>, 'k': <class 'numpy.uint32'>, 't': <class 'numpy.uint16'>}
Dimensions and data types for input data. In order to reduce memory and disk usage:
np.uint16 (0 to 65_535) is used for t (year), i (exporter), and j (importer).
np.uint32 (0 to 4_294_967_295) is used for k (product), since these values can be as large as 999_999.
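The choice of bit widths can be verified with plain integer arithmetic. This sketch uses only built-in Python integers rather than NumPy; the helper uint_max() and the comparison against 64-bit integers are illustrative assumptions, not part of the module.

```python
# Largest value representable by an unsigned integer of a given bit width
def uint_max(bits: int) -> int:
    return 2**bits - 1


# Largest 6-digit product code on the "k" dimension
K_MAX = 999_999

print(uint_max(16))  # 65535
print(uint_max(32))  # 4294967295

# 16 bits are too few for "k", while 32 bits suffice
assert uint_max(16) < K_MAX <= uint_max(32)

# Bytes per observation for t, i, j (2 bytes each) and k (4 bytes),
# compared with 8 bytes (64 bits) for each of the four columns
print(3 * 2 + 4, "vs.", 4 * 8)  # 10 vs. 32
```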
- message_ix_models.tools.cepii.baci_data_from_files(paths: list[Path], measure: str, filters: dict[str, str | Pattern]) DataFrame[source]
Read the BACI data from files.
dask.dataframe.read_csv() and pyarrow are used for better performance. DTYPE is used to specify columns and dtypes.
Data returned by this function is cached using cached(); see also SKIP_CACHE.
- message_ix_models.tools.cepii.get_mapping() MappingAdapter[source]
Return an adapter from codes appearing in BACI data.
The BACI data for dimensions \(i\) (exporter) and \(j\) (importer) contain ISO 3166-1 numeric codes, plus some other idiosyncratic codes from COUNTRY_CODES. The returned adapter maps these to the corresponding alpha-3 code.
Using the adapter makes data suitable for aggregation using the message_ix_models node code lists, which include those alpha-3 codes as children of each region code.
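The effect of such a mapping can be illustrated with an ordinary dict. This is only a sketch of the idea, not the MappingAdapter returned by get_mapping(): the ISO_NUMERIC subset and the (exporter, importer) pairs are assumptions made for illustration.

```python
# Idiosyncratic BACI codes, copied from COUNTRY_CODES above (subset)
COUNTRY_CODES = [(58, "BEL"), (251, "FRA"), (699, "IND"), (842, "USA")]

# Assumption for illustration: a small subset of standard ISO 3166-1
# numeric → alpha-3 pairs. The actual adapter covers all codes in the data.
ISO_NUMERIC = {40: "AUT", 124: "CAN", 392: "JPN"}

# Combine both into one lookup; BACI-specific codes are added last
mapping = {**ISO_NUMERIC, **dict(COUNTRY_CODES)}

# Hypothetical (exporter, importer) pairs as they appear in the raw data
pairs = [(251, 842), (40, 699), (392, 58)]

relabeled = [(mapping[i], mapping[j]) for i, j in pairs]
print(relabeled)  # [('FRA', 'USA'), ('AUT', 'IND'), ('JPN', 'BEL')]
```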
Global Fuel Economy Initiative (GFEI) (tools.gfei)
Handle data from the Global Fuel Economy Initiative (GFEI).
- class message_ix_models.tools.gfei.GFEI(*args, **kwargs)[source]
Provider of exogenous data from the GFEI 2017 data source.
To use data from this source, call exo_data.prepare_computer() with the arguments:
source: “GFEI”.
source_kw including:
plot (optional, default False): add a task with the key “plot GFEI debug” to generate a diagnostic plot using Plot.
aggregate, interpolate: see ExoDataSource.transform().
The source data:
is derived from https://theicct.org/publications/gfei-tech-policy-drivers-2005-2017, specifically the data underlying “Figure 37. Fuel consumption range by type of powertrain and vehicle size, 2017”.
has resolution of individual countries.
corresponds to new vehicle registrations in 2017.
has units of megajoule / kilometre, converted from original litres of gasoline equivalent per 100 km.
Note
If source_kw[“aggregate”] is True, the aggregation performed is an unweighted sum(). To produce meaningful values for multi-country regions, instead perform a weighted mean using appropriate weights; for instance the vehicle activity for each country. The class currently does not do this automatically.
- class Options(aggregate: bool = False, interpolate: bool = False, measure: str = '', name: str = 'fuel economy', dims: tuple[str, ...] = ('n', 'y'), plot: bool = False)[source]
- get() AnyQuantity[source]
Return the data.
Implementations in concrete classes may load data from file, retrieve from remote sources or local caches, generate data, or anything else.
The Quantity returned by this method must have dimensions corresponding to
key. If the original/upstream/raw data has different dimensionality (fewer or more dimensions; different dimension IDs), a concrete class must transform these, make appropriate selections, etc.
- options: Options
Instance of the Options class.
A concrete class that overrides Options should redefine this attribute, to facilitate type checking.
- where: list['str | Path'] = ['private']
where keyword argument to path_fallback(). See _where().
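The note on aggregation for the GFEI class recommends a weighted mean rather than the unweighted sum. The difference can be sketched in plain Python as follows; all country values and weights are hypothetical, and the class itself does not perform this calculation.

```python
# Hypothetical fuel economy [MJ / km] for new vehicles in three countries
fuel_economy = {"AUT": 2.0, "DEU": 2.5, "POL": 3.0}

# Hypothetical weights: vehicle activity [10⁹ vehicle-km / year] per country
activity = {"AUT": 10.0, "DEU": 80.0, "POL": 10.0}

# Unweighted sum: not a meaningful fuel economy for the region
unweighted_sum = sum(fuel_economy.values())

# Activity-weighted mean: each country contributes in proportion to its weight
weighted_mean = sum(fuel_economy[n] * activity[n] for n in fuel_economy) / sum(
    activity.values()
)

print(unweighted_sum)  # 7.5
print(weighted_mean)  # 2.5
```

The weighted mean stays within the range of the country values, whereas the sum grows with the number of countries in the region.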
- class message_ix_models.tools.gfei.Plot[source]
Diagnostic plot of processed data.
- basename = 'GFEI-fuel-economy-t'
File name base for saving the plot.
- generate(data)[source]
Generate and return the plot.
A subclass of Plot must implement this method.
- Parameters:
args (Sequence of pandas.DataFrame or other) – One argument is given corresponding to each of the inputs.
Because plotnine operates on pandas data structures, save() automatically converts any Quantity inputs to pandas.DataFrame before they are passed to generate().
International Energy Agency (IEA) (tools.iea)
The IEA publishes many kinds of data.
Each distinct data source is handled by a separate submodule of message_ix_models.tools.iea.
Documentation for all module contents:
Tools for working with IEA data and structures.
Energy efficiency indicators (tools.iea.eei)
See IEA_EEI.
This data is produced by the IEA and retrieved from the Energy Efficiency Indicators database.
It is proprietary.
The data:
Has the geographic resolution of individual countries, and scope including 41 countries:
24 IEA member countries for which data covering most end-uses are available: Australia, Austria, Belgium, Canada, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Italy, Japan, Korea, Luxembourg, the Netherlands, New Zealand, Poland, Portugal, Slovak Republic, Spain, Switzerland, the United Kingdom and the United States.
Others including Brazil, Chile, Lithuania, Morocco, Armenia, Azerbaijan, Belarus, Georgia, Kazakhstan, Kyrgyzstan, Republic of Moldova, Ukraine, Uzbekistan.
Includes measures/variables for energy consumption, efficiency, carbon emissions, and others for four conceptual sectors: Residential, Services, Industry and Transport.
The December 2020 edition covers the time periods 2000–2018 with annual resolution.
Note
Currently, iea.eei mainly retrieves and processes data useful for MESSAGEix-Transport.
To retrieve other end-use sectoral data, the code can be extended.
(Extended) World Energy Balances (tools.iea.web)
Note
These data are proprietary and require a paid subscription.
The approach to handling proprietary data is the same as in project.advance and project.ssp:
Copies of the data are stored in the (private) message-static-data repository using Git LFS. This repository is accessible only to users who have a license for the data.
message_ix_models contains only a ‘fuzzed’ version of the data (same structure, random values) for testing purposes.
Non-IIASA users must obtain their own license to access and use the data; obtain the data themselves; and place it on the system where they use message_ix_models.
The module message_ix_models.tools.iea.web attempts to detect and support both the providers/formats described below.
The code supports using data from any of the above locations and formats, in multiple ways:
Use IEA_EWEB via exo_data.prepare_computer() to use the data in genno structured calculations.
Use iea.web.load_data() to load data as pandas.DataFrame and apply further processing using pandas.
The documentation for the 2023 edition of the IEA source/format is publicly available.
Structure
The data have the following conceptual dimensions, each enumerated by a different list of codes:
FLOW, PRODUCT: for both of these, the lists of codes appearing in the data are the same from 2021 to 2023 inclusive.
COUNTRY: The data provided by IEA directly contain codes that are all caps, abbreviated country names, for instance ‘DOMINICANR’. The data provided by the OECD contain ISO 3166-1 alpha-3 codes, for instance ‘DOM’. In both cases, there are additional labels denoting country groupings; these are defined in the documentation linked above.
Changes visible in these lists include:
2022 → 2023:
New codes: ASEAN, BFA, GREENLAND, MALI, MRT, PSE, TCD.
Removed: MASEAN.
2021 → 2022:
New codes: GNQ, MDG, MKD, RWA, SWZ, UGA.
Removed: EQGUINEA, GREENLAND, MALI, MBURKINAFA, MCHAD, MMADAGASCA, MMAURITANI, MPALESTINE, MRWANDA, MUGANDA, NORTHMACED.
See the transform=... source keyword argument and IEA_EWEB.transform() for different methods of handling this dimension.
TIME: always a year.
UNIT_MEASURE (not labeled): unit of measurement, either ‘TJ’ or ‘ktoe’.
message_ix_models is packaged with SDMX structure data (stored in message_ix_models/data/sdmx/) comprising code lists extracted from the raw data for the COUNTRY, FLOW, and PRODUCT dimensions.
These can be used with other package utilities, for instance:
>>> from message_ix_models.util.sdmx import read
# Read a code list from file: codes used in the
# 2022 edition data from the OECD provider
>>> cl = read("IEA:PRODUCT_OECD(2022)")
# Show some of its elements
>>> print("\n".join(sorted(cl.items)[:5]))
ADDITIVE
ANTCOAL
AVGAS
BIODIESEL
BIOGASES
The documentation linked above has full descriptions of each code.
IEA provider/format
From 2023 (or earlier), the data are provided directly on the IEA website at https://www.iea.org/data-and-statistics/data-product/world-energy-balances. These data are available in two formats: ‘IVT’ or “Beyond 20/20” format (not supported by this module), or fixed-width text files. The latter are characterized by:
Multiple ZIP archives with names like WBIG[12].zip, each containing a portion of the data and typically 110–130 MiB compressed size…
…each containing a single, fixed-width TXT file with a name like WORLDBIG[12].TXT, typically 3–4 GiB uncompressed,…
…with no column headers, but data resembling:
WORLD HARDCOAL 1960 INDPROD KTOE ..
…that appear to correspond to the COUNTRY, PRODUCT, TIME, FLOW, and MEASURE dimensions and “Value” column of the above data, respectively.
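A line in this layout can be split into named fields as sketched below. This is not the parser used by iea.web: it assumes that fields are separated by white space and contain none internally, and that ‘..’ marks a missing value. The function name parse_line() is hypothetical.

```python
# Column names, in the order the fields appear to occur in each line
COLUMNS = ["COUNTRY", "PRODUCT", "TIME", "FLOW", "MEASURE", "Value"]


def parse_line(line: str) -> dict:
    """Parse one line of a headerless, white space-separated data file."""
    # Assumption: no field contains internal white space
    record = dict(zip(COLUMNS, line.split()))
    record["TIME"] = int(record["TIME"])
    # Assumption: ".." marks a missing value
    value = record["Value"]
    record["Value"] = None if value == ".." else float(value)
    return record


print(parse_line("WORLD HARDCOAL 1960 INDPROD KTOE .."))
```

For files of several GiB, a function like this would be applied line by line while iterating over the open file, rather than after reading the whole file into memory.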
OECD provider/format
Up until 2023, the EWEB data were available from the OECD iLibrary with DOI 10.1787/enestats-data-en. These files were characterized by:
Single ZIP archives with names like cac5fa90-en.zip; typically ~850 MiB compressed size,…
…containing a single CSV file with a name like WBIG_2022-2022-1-EN-20230406T100006.csv, typically >20 GiB uncompressed,…
…with a particular list of columns like: “MEASURE”, “Unit”, “COUNTRY”, “Country”, “PRODUCT”, “Product”, “FLOW”, “Flow”, “TIME”, “Time”, “Value”, “Flag Codes”, “Flags”,…
…with contents that duplicated code IDs (for instance, in the “FLOW” column) with human-readable labels (for instance, in the “Flow” column):
Column name    Example value
-----------    --------------------------------
MEASURE [1]    KTOE
Unit           ktoe
COUNTRY        WLD
Country        World
PRODUCT        COAL
Product        Coal and coal products
FLOW           INDPROD
Flow           Production
TIME           2012
Time           2012
Value          1234.5678
Flag Codes     M
Flags          Missing value; data cannot exist
This source is discontinued and will not publish subsequent editions of the data.
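Because the OECD format duplicates each code ID with a human-readable label, the label columns can be dropped while reading. The following is a minimal sketch using only the standard library, not the module's own loader; the single data row and the comma delimiter are assumptions for illustration.

```python
import csv
import io

# One hypothetical row in the column layout described above
TEXT = (
    "MEASURE,Unit,COUNTRY,Country,PRODUCT,Product,FLOW,Flow,TIME,Time,Value,"
    "Flag Codes,Flags\n"
    "KTOE,ktoe,WLD,World,COAL,Coal and coal products,INDPROD,Production,2012,"
    "2012,1234.5678,,\n"
)

# Columns that carry code IDs or values; the others duplicate these as labels
KEEP = ["MEASURE", "COUNTRY", "PRODUCT", "FLOW", "TIME", "Value"]

# csv.DictReader yields one dict per row, keyed by the header line
rows = [
    {name: row[name] for name in KEEP}
    for row in csv.DictReader(io.StringIO(TEXT))
]
print(rows[0])
```

csv.DictReader reads one row at a time, so the same selection can be applied to a large file without loading it whole, provided the rows are processed in a loop rather than collected in a list.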
NewClimate Institute (tools.newclimate)
Handle data from the NewClimate Institute’s Climate Policy Database (CPDB).
This module provides:
NewClimatePolicy, a concrete subclass of the abstract/generic Policy, that reflects the data model appearing in the CPDB.
Enumerations that reflect values appearing in fields of the database which appear to be enumerated (as opposed to free text): HIGH_IMPACT, JURISDICTION, OBJECTIVE, SECTOR, STATUS, STRINGENCY, TYPE, and UPDATE.
A method NewClimatePolicy.from_csv_dict() that interprets the CSV data format in which the database is expressed.
Functions to fetch() versions of the database from Zenodo, read() into collections of Python objects, or do both (get()).
These enable programmatic use of the information in the database. For example:
from message_ix_models.tools.newclimate import SECTOR, get
from pycountry import countries
# Fetch and parse the 2024 edition of the database
policies = get("2024")
print(len(policies)) # 6507 objects
# Filter the dict to a list of policy objects matching a certain sector
p_transport = list(filter(lambda p: SECTOR.Transport in p.sector, policies.values()))
print(len(p_transport)) # 1298 objects
# Filter for any policies concerning the country of Austria, or the EU
match = {countries.lookup("Austria"), "European Union"}
p_AUT = list(filter(lambda p: set(p.geo) & match, policies.values()))
print(len(p_AUT)) # 259 objects
Todo
Extend the module:
Serialize NewClimatePolicy objects in 1 or more formats, preferably standards-based.
fetch() versions of the database more recent than the latest Zenodo record, using the cpdb_api package or other code.
Convert to/from other data models.
- class message_ix_models.tools.newclimate.HIGH_IMPACT(*values)[source]
Enumeration for NewClimatePolicy.high_impact.
Todo
If a codebook is available identifying what criteria were used to assign these codes to the primary source data, reference or quote the definitions for each code.
Do the same for the other enumerations in this module.
- unclear = 5
NB both ‘unclear’ and ‘Unclear’ appear in the 2025 draft database as of 2026-04-17.
- class message_ix_models.tools.newclimate.JURISDICTION(*values)[source]
Enumeration for
NewClimatePolicy.jurisdiction.
- class message_ix_models.tools.newclimate.NewClimatePolicy(country_update: ~message_ix_models.tools.newclimate.structure.UPDATE, decision_date: str, description: str, end_date: str, high_impact: ~message_ix_models.tools.newclimate.structure.HIGH_IMPACT, id: str, impact_indicators_base_year: str, impact_indicators_comments: str, impact_indicators_name: str, impact_indicators_target_year: str, impact_indicators_value: str, instrument: str, jurisdiction: ~message_ix_models.tools.newclimate.structure.JURISDICTION, last_update: str, name: str, objective: list[~message_ix_models.tools.newclimate.structure.OBJECTIVE], reference: str, sector: list[~message_ix_models.tools.newclimate.structure.SECTOR], start_date: str, status: ~message_ix_models.tools.newclimate.structure.STATUS, stringency: ~message_ix_models.tools.newclimate.structure.STRINGENCY, title: str, type: list[~message_ix_models.tools.newclimate.structure.TYPE], geo: list[str | Country] = <factory>)[source]
Policy in the NewClimate data model.
Properties of this class match the column names appearing in the NewClimate CSV file format as of 2024, with the following exceptions:
The redundant prefix “policy_” is omitted, for instance “name” instead of “policy_name”.
geo for geography; see the attribute documentation.
impact_indicators_base_year and similar have an underscore, rather than period (“.”) in the name.
For some attributes such as country_update, the type is an enumeration: only members of the enumeration may be used. For others, such as objective, the type is a list of enumeration members, because the database contains multiple values, separated by commas. It is unclear if the order in the database is meaningful or not, so list (rather than set) is used to preserve the original order.
Todo
Add reference(s) to documentation of the data model, if any.
Add docstrings for individual fields, quoting the documentation.
Parse dates to Python datetime objects.
- property country: Country
Return a pycountry object from geo.
Raises ValueError if none exists.
- classmethod from_csv_dict(data: dict[str, str]) NewClimatePolicy[source]
Create from a dict from a csv.CsvReader.
This method handles the following transformations:
Strip leading white space. Some cells have leading non-printing white space, like the UTF-8 byte-order mark U+FEFF.
Replace “.” with “_”, since the former cannot be used in Python names. For example, “impact_indicators.base_year” becomes “impact_indicators_base_year”.
Remove the redundant prefix “policy_”.
Handle geographical fields. The CSV format has at least 5 fields that express geographical concepts, as well as older aliases:
1. “city_or_local”, “city”: Identifier for a city or local geographical unit.
2. “country”: name of a country.
3. “country_iso”, “country_iso_code”: ISO 3166 alpha-3 code of a country.
4. “subnational_region”, “subnational_region_or_state”: name of a region within a country.
5. “supranational_region”, “supernational_region”: not used in the 2025 database. May be in use elsewhere. The name implies it is a name for parts or all of 2 or more countries.
(1), (2), (4) and likely (5) are given in English.
These are transformed into a single value for the NewClimatePolicy.geo field; see its documentation.
Handle older versions of field names appearing in 2022 and earlier database versions, per CSV_FIELD_ALIASES.
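The first three transformations listed above can be sketched with string methods alone. This is an illustration under stated assumptions, not the body of from_csv_dict(): the helper names clean() and normalize_key() and the sample record are hypothetical, and str.removeprefix() requires Python 3.9 or later.

```python
BOM = "\ufeff"


def clean(text: str) -> str:
    """Strip surrounding white space and any byte-order mark."""
    # str.strip() without arguments does not remove U+FEFF, so name it explicitly
    return text.strip(BOM + " \t\r\n")


def normalize_key(key: str) -> str:
    """Convert a CSV column name to a Python-style field name."""
    return clean(key).replace(".", "_").removeprefix("policy_")


# Hypothetical record, such as csv.DictReader might yield
raw = {
    "\ufeffpolicy_id": "123",
    "policy_name": " Example policy",
    "impact_indicators.base_year": "2010",
}

record = {normalize_key(k): clean(v) for k, v in raw.items()}
print(record)
# {'id': '123', 'name': 'Example policy', 'impact_indicators_base_year': '2010'}
```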
- geo: list[str | Country]
Geography. MUST be length 1 or greater. Items MAY include:
1. English name of a supranational region.
2. A pycountry.db.Country object.
3. English name of a subnational region.
4. English name of a city or locality.
Some forms visible in the database include:
Only (1), for instance ["European Union"].
Only (2).
Both (2) and (3).
Both (2) and (4).
- high_impact: HIGH_IMPACT
High impact.
- jurisdiction: JURISDICTION
Jurisdiction.
- stringency: STRINGENCY
Stringency.
- class message_ix_models.tools.newclimate.OBJECTIVE(*values)[source]
Enumeration for
NewClimatePolicy.objective.
- class message_ix_models.tools.newclimate.SECTOR(*values)[source]
Enumeration for
NewClimatePolicy.sector.
- class message_ix_models.tools.newclimate.STATUS(*values)[source]
Enumeration for
NewClimatePolicy.status.
- class message_ix_models.tools.newclimate.STRINGENCY(*values)[source]
Enumeration for
NewClimatePolicy.stringency.
- class message_ix_models.tools.newclimate.TYPE(*values)[source]
Enumeration for
NewClimatePolicy.type.
- class message_ix_models.tools.newclimate.UPDATE(*values)[source]
Enumeration for
NewClimatePolicy.country_update.
- NOTSET = 3
NB This member is added only to accommodate database version 2022 and earlier, in which the respective field does not exist. In versions where the field does exist, its value MUST be one of the two other members.
- message_ix_models.tools.newclimate.fetch(version: str) Path[source]
Retrieve data for version of the Climate Policy Database from Zenodo.
- message_ix_models.tools.newclimate.read(path: Path, **kwargs) dict[str, NewClimatePolicy][source]
Read a CSV file into a dict of Policy objects.
- Returns:
Keys are NewClimatePolicy.id. If the file contains records with the same IDs, only the last appears, and a warning is logged.
- Return type: