Configuration and (meta)data

Many, varied kinds of data are used to prepare and modify MESSAGEix-GLOBIOM scenarios. Other data are produced by code as incidental or final output. These can be categorized in several ways. One is by the purpose they serve:

configuration: settings that affect how code works, where (meta)data are located, how they should be processed, etc.
data: actual numerical values used or produced by code,
metadata: information describing how data is structured, separate from the data itself.

Another is by whether the (meta)data are input, output, or both.

This page describes how configuration and data are handled in message_ix_models and message_data. [1] In many cases it also specifies what to do for new additions to the code, using RFC 2119 keywords like must and should. The HOWTO Work with paths to files and data contains some suggested ways to handle particular situations.

Configuration 

Context objects 

Context objects are used to carry configuration, environment information, and other data between parts of the code. Scripts and user code can also store values in a Context object.

There is always at least 1 Context instance available; if necessary, additional instances can be created to be used for only part of a program.

# Get an existing instance of Context
c = Context.get_instance()

# Store a value using attribute syntax
c.foo = 42

# Store a value with spaces in the name using item syntax
c["PROJECT data source"] = "Source A"

# my_function() responds to 'foo' or 'PROJECT data source'
my_function(c)

# Store a sub-dictionary of values
c["PROJECT2"] = {"setting A": 123, "setting B": 456}

# Create a subcontext with all the settings of `c`
c2 = deepcopy(c)

# Modify one setting
c2.foo = 43

# Run code with this alternate setting
my_function(c2)

For the Command-line interface, every command decorated with @click.pass_obj gets a first positional argument context, which is an instance of this class. The settings are populated based on the command-line parameters given to mix-models or its (sub)commands.

Core configuration 

The Config class (always stored at context.core) defines configuration settings used across message_ix_models. See its documentation for details. In particular, the settings Config.cache_path and Config.local_data are relevant to this page.

Specific modules for model variants, projects, tools, etc. should…

Define a dataclass named Config to express the configuration options they understand. See for example:
- model.Config for describing existing models or constructing new models,
- report.Config for reporting,
- tools.costs.Config for a general-purpose tool in a complex module, and
- model.transport.Config for a particular model variant, here MESSAGEix-Transport.
Store this on the Context at a documented key. For example model.Config is stored at context.model or context["model"]. Usually this key should match part or all of the module name.
Retrieve and respect configuration from existing objects.

For example, module-specific code that needs to understand which node code list is used by the scenario on which it operates should retrieve this from model.Config.regions and should not create another key/setting to store the same information.

Config settings with duplicate names should only be created and used when they have a different purpose or meaning than existing settings.
Control the behaviour of other modules by setting the appropriate configuration values.

Data and metadata 

Locations 

message_ix_models contains code and tools for handling the following data locations. This section gives a brief description of these locations, using short labels (like “package data”) that also appear elsewhere in this documentation. The following sections describe how they are and should be used.

User cache

Typically this is in the user’s home directory at a path like $HOME/.cache/message-ix-models/.

Config.cache_path (equivalently Context.core.cache_path) identifies this directory. Config.get_cache_path constructs sub-paths.

Package data

These are stored in the message_ix_models/data/ subdirectory of the iiasa/message-ix-models git repository.

Some of these data are included in the packaged distributions of message_ix_models available on PyPI. Other files are omitted to keep the size of these distributions small.

package_data_path(), load_package_data(), and other more specialized code access this directory and subdirectories.

Test data: The directory message_ix_models/data/test/ contains data that is (only) used for testing. Some of the files in this directory mirror the name and structure of data files stored elsewhere, but contain reduced and/or randomized/fuzzed data.

Private data

These are stored in two non-public Git repositories.

message_data repository

These are stored in the top-level data/ directory of iiasa/message_data. This repository also contains the message_data Python package. This repository is not public; and the Python package is not published on or installable from PyPI. Users with access to the repository can read more in its its documentation.

private_data_path(), load_private_data(), and other more specialized code access this directory and subdirectories.

Static private data

These are stored in iiasa/message-static-data. This repo contains specific data files that cannot (currently, or ever) be made public, for instance because of restrictive license conditions. It contains no code.

Files are collected in this repository for convenience; users who have valid licenses to the data are granted access to the repository. In most cases, these data can also be obtained from the original source(s) with an appropriate license, registration, payment, or other conditions.

See HOWTO Connect static data to local data.

(User-)Local data

This is any arbitrary path on a user’s system.

Config.local_data (equivalently context.core.local_data) point to this directory. Context.get_local_path() and local_data_path() construct paths under this directory.

The path can be set in multiple ways. From lowest to highest precedence:

The default location is the current working directory: the directory in which the mix-models command-line interface is invoked, or in which Python code is run that imports and uses message_ix_models.
The ixmp configuration file setting message local data. See Configuration in the ixmp documentation.
The environment variable MESSAGE_LOCAL_DATA.
The mix-models --local-data=… CLI option and related options for subcommands, for instance mix-models report --output=….
Code that directly modifies the local_data setting.

Choose where to store (meta)data 

Developers of message_ix_models code must follow this order of priority in choosing where to store input and output (meta)data.

(1) Not in `message_ix_models`

Data that are available from public, stable sources should not be added to the message_ix_models repository. Instead:

Fetch the code from their original location. This should be done by extending or using message_ix_models.util.pooch, which stores the retrieved files in the user cache.
If message_ix_models relies on certain adjustments to the data, do not commit the adjusted data. Instead:
1. Commit code that performs the adjustments. This makes methods for data transformation (and any assumptions involved) transparent.
2. If necessary, store the result in the user cache.

(2) Local data or user cache 

See local data and user cache above. These locations are recommended for:

Outputs, such as data or plot files generated by reporting.
Caches: temporary data files used to speed up other code by avoiding repeat of slow operations.

These kinds of data must not be committed as message_ix_models package data. Caches and output should not be committed as message_data private data.

Thus each user should configure a local data path appropriate to their system, using either the ixmp configuration file or environment variable as described above. For example:

mix-models config set message_local_data /path/to/a/local-data/dir

It is recommended to use a directory outside any other Git-controlled directories, for instance clones of message_ix_models or message_data. (If not, users should use .gitignore files to hide the local data directory from Git.)

(3) Package data 

See above. This location is recommended for:

Configuration files used to populate Config classes for specific modules.
General-purpose metadata for the MESSAGEix-GLOBIOM base global model or variants.
Data for publicized model variants and completed/published projects.

These files may be packaged and published so that they are installable from PyPI with message_ix_models; configuration and metadata generally should be packaged. Data, if they are large, may also be excluded via MANIFEST.in. See Large/binary input data, below.

Document these data in files like doc/pkg-data/*.rst that are included in the present documentation, for example Node code lists.

(4) Static data 

This location is recommended for data that is subject to license or other conditions that prohibit their being made public, especially data provided by other people and organizations. (If this is not the case, store these as package data or fetch them.)

(Sub)directories in message-static-data, if they match directories under message_ix_models/data/, must have a matching structure.

Document these data on the page data-sources or together with other code modules that handle them. The documentation must indicate the original source and process to obtain the data files.

(5) Private data 

See above. This location is recommended for:

Data for model variants and projects under current development.
Specific data files that cannot (currently, or ever) be made public, for instance because of restrictive licenses, especially in cases where there is no public documentation or information about how users could obtain the data.

General recommendations 

Always consider: “Will this code work on another researcher’s computer?”

Prefer text formats

…such as CSV, over binary formats like Excel. CSV files are compressed by Git automatically, and Git can handle diffs to these files easily. Code that reads/writes these files is much faster, especially for files with thousands or more data points.

Do not hard-code paths

message_ix_models utility functions and Config settings allow to access all the (meta)data locations described above. It should not ever be necessary to use a hard-coded path; this is a clue that data are not in a proper location.

For system-specific paths (local data and user cache), get a Context object and use it to get an appropriate Path pointing to a file:

# Store a base path
project_path = context.get_local_path("myproject", "output")

# Use the Path object to generate a subpath
run_id = "foo"
output_file = project_path.joinpath("reporting", run_id, "all.xlsx")

Keep input and output data separate

Any directory should contain either input or output data—never both. Output data should not be stored in package data, private data, or static data paths; it must not be committed to those repositories.

Use a consistent scheme for directory trees

For a submodule for a specific model variant or project named, for instance, message_ix_models.model.[name] or message_ix_models.project.[name], keep input data in a well-organized directory under:

[base]/[name]/ —preferred, flatter,
[base]/model/[name]/,
[base]/project/[name]/,

…or similar.

Keep project-specific configuration files in the same locations:

# Located in `message_ix_models/data/`:
config = load_package_data("myproject", "config.yaml")

# Located in `data/` in the message_data repo:
config = load_private_data("myproject", "config.yaml")

# Not recommended: located in the same directory as a code file
config = yaml.safe_load(open(Path(__file__).with_name("config.yaml")))

Use a similar scheme for output data, except under the local data path.

Re-use configuration

Configuration to run a set of scenarios or to prepare reported submissions should re-use or extend existing, general-purpose code. Do not duplicate code or configuration. Instead, adjust or selectively overwrite its behaviour via project-specific configuration read from a file.

Large/binary input data 

Large, binary input data, such as Microsoft Excel spreadsheets, must not be committed as ordinary Git objects. This is because the entire file is re-added to the Git history for even small modifications, making it very large (see issue #37).

Instead, use one or more of the following patterns, in order of preference. Whichever pattern is used, code for handling large input data must be in message_ix_models, even if the data itself is private, for instance in message_data or another location.

Fetch directly from a remote source 

This corresponds to section (1) above. Preferably, do this via message_ix_models.util.pooch:

Extend pooch.SOURCE to store the Internet location, file name(s), and hash(es) of the file(s).
Call pooch.fetch() to retrieve the file and cache it locally.
Write code in message_ix_models that processes the data into a common format,
for instance by subclassing ExoDataSource.

This pattern is preferred because it can be replicated by anyone, and the reference data is public.

This pattern may be applied to:

Data published and maintained by others, or
Data created by the IIASA ECE program to be used in message_ix_models, such as Zenodo records.

Use Git Large File Storage (LFS)

Git LFS is a Git extension that allows for storing large, binary files without bloating the commit history. Essentially, Git stores a 3-line text file with a hash of the full file, and the full file is stored separately. The IIASA GitHub organization has up to 300 GB of space for such LFS objects.

To use this pattern, git add ... and git commit files in an appropriate location (above). New or unusual binary file extensions may require a git lfs command or modification to .gitattributes to ensure they are tracked by LFS and not by Git itself. See the Git LFS documentation for more detail.

For large files stored as package data using Git LFS, these:

must be added to MANIFEST.in. This avoids including the files in distributions published on PyPI.
should be added to util.pooch. This allows users who install message_ix_models from PyPI to easily retrieve the data. This usage must be included in the documentation that describes the data files.

Retrieve data from existing databases 

These include the same IIASA ECE Program ixmp databases that are used to store scenarios. Documentation must be provided that ensures this data is reproducible: that is, any original sources and code to create the database used by message_ix_models.

Other patterns 

Some other patterns exist, but should not be repeated in new code, and should be migrated to one of the above patterns.

SQL queries against a Oracle/JDBC database. See m-data:data-iea (in message_data) and issue #53 for a description of how to replace/simplify this code.