Data, metadata, and configuration
*********************************

Many, varied kinds of data are used to prepare and modify MESSAGEix-GLOBIOM scenarios.
Other data are produced by code as incidental or final output.

These can be categorized in several ways.
One is by the purpose they serve:

- **data** (actual numerical values) used or produced by code;
- **metadata**: information describing where data are, how to manipulate them, how they are structured, etc.;
- **configuration** that otherwise affects how code works.

Another is by whether the data are **input**, **output**, or both.

This page describes how to store and handle such data in :mod:`message_ix_models` and :mod:`message_data`. [1]_

.. contents::
   :local:

.. [1] Unless specifically distinguished in the text, all of the following applies to *both* :mod:`message_ix_models` and :mod:`message_data`.

.. _data-goes-where:

Choose locations for data
=========================

These are listed in order of preference.

(1) **Not** in :mod:`message_ix_models`
---------------------------------------

Data that are available from public, stable sources **should not** be added to the :mod:`message_ix_models` repository.
Instead:

1. Fetch the data from their original location.
   If possible, this **should** be done by extending or using :mod:`message_ix_models.util.pooch`.
2. If :mod:`message_ix_models` relies on certain adjustments to the data, *do not* commit the adjusted data.
   Instead:

   a. Commit code that performs the adjustments.
      This makes methods for data transformation (and any assumptions involved) transparent.
   b. If necessary, cache the result—see below.
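
As a minimal sketch of step 2, the adjustment can live in a small function that is applied each time the original data are read, so that only the code is committed.
The function name ``adjust``, the column names, and the unit conversion below are hypothetical, chosen only for illustration; they are not part of :mod:`message_ix_models`.

.. code-block:: python

   import csv
   import io

   # Hypothetical excerpt of a public source file, as downloaded
   RAW = "region,value_mt\nR11_AFR,1.5\nR11_WEU,2.0\n"


   def adjust(raw_text: str) -> list[dict]:
       """Convert values from Mt to kt; the assumption is visible in code."""
       rows = []
       for row in csv.DictReader(io.StringIO(raw_text)):
           rows.append(
               {"region": row["region"], "value_kt": float(row["value_mt"]) * 1000.0}
           )
       return rows


   # The adjusted values are computed on demand and never committed
   adjusted = adjust(RAW)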

.. _package-data:

(2) :file:`message_ix_models/data/`
-----------------------------------

- Files in this directory are **public**.
- In standard Python terms, these are “package data”.
- This is the preferred location for:

  - General-purpose metadata for the MESSAGEix-GLOBIOM base global model or variants.
  - Configuration.
  - Data for publicized model variants and completed/published projects.

- These files are packaged, published, and installable from PyPI with :mod:`message_ix_models`—*unless* specifically excluded via :file:`MANIFEST.in` (see :ref:`large-input-data`, below).
- These data can be reached with :func:`.package_data_path`, :func:`.load_package_data`, or other, more specialized code.
- Documentation files like :file:`doc/pkg-data/*.rst` describe the contents of these files, and appear in the automatically-built documentation.
  For example: :doc:`pkg-data/node`.

.. _private-data:

(3) :file:`data/` directory in the :mod:`message_data` repository
-----------------------------------------------------------------

- Files in this directory are **private** and not installable from PyPI (because :mod:`message_data` is not packaged for or installable from PyPI).
- This is the preferred location for:

  - Data for model variants and projects under current development.
  - Specific data files that cannot (currently, or ever) be made public, for instance because of restrictive licenses.

- These data can be reached with :func:`.private_data_path`, :func:`.load_private_data` or other, more specialized code.

.. _local-data:

(4) Other, system-specific (“local”) directories
------------------------------------------------

These are the preferred location for:

- Outputs, such as data or plot files generated by reporting.
- Data files not distributable with :mod:`message_ix_models`, for instance those with access conditions (registration, payment, etc.).
- Caches: temporary data files used to speed up other code by avoiding repeat of slow operations.

These kinds of data **must not** be committed to :mod:`message_ix_models`.
Caches and output **should not** be committed to :mod:`message_data`.

(4A) Local data
~~~~~~~~~~~~~~~

Each user **may** configure a location for these data, appropriate to their system, and then use :meth:`.Context.get_local_path` and/or :func:`.local_data_path` to construct paths under this directory.

This setting can be made in multiple ways.
From lowest to highest precedence:

1. The default location is the *current working directory*: the directory in which the :program:`mix-models` :doc:`cli` is invoked, or in which Python code is run that imports and uses :mod:`message_ix_models`.
2. The :mod:`ixmp` configuration file setting ``message local data``.
3. The ``MESSAGE_LOCAL_DATA`` environment variable.
4. The ``--local-data`` CLI option and related options such as the ``--output`` option to the ``report`` command.
5. Code that directly modifies the ``local_data`` setting on :class:`.Context`.

This location **should** be outside the Git-controlled directories for :mod:`message_ix_models` or :mod:`message_data`.
In other words, users **should** use at least option 2 or 3 in the list above to specify such a directory.
If not, they **may** use :file:`.gitignore` files to hide these from Git.
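
The precedence rules above can be sketched as a function in which each later source overrides the earlier ones.
``resolve_local_data`` and its parameters are hypothetical and only illustrate the ordering; this is not the implementation used by :class:`.Context`, and item 5 (code that sets the value directly) is omitted.

.. code-block:: python

   import os
   from pathlib import Path
   from typing import Mapping, Optional


   def resolve_local_data(
       config_value: Optional[str] = None,
       environ: Mapping[str, str] = os.environ,
       cli_value: Optional[str] = None,
   ) -> Path:
       """Return the local data directory; later sources override earlier ones."""
       result = Path.cwd()  # 1. Default: current working directory
       if config_value:  # 2. ixmp configuration file setting
           result = Path(config_value)
       if environ.get("MESSAGE_LOCAL_DATA"):  # 3. Environment variable
           result = Path(environ["MESSAGE_LOCAL_DATA"])
       if cli_value:  # 4. --local-data CLI option
           result = Path(cli_value)
       return result


   # The environment variable overrides the configuration file setting
   path = resolve_local_data(
       "/data/from-config", {"MESSAGE_LOCAL_DATA": "/data/from-env"}
   )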

(4B) Cache data
~~~~~~~~~~~~~~~

Code **should** use :func:`.platformdirs.user_cache_path` to identify a system-specific path to a cache directory.
For example:

.. code-block:: python

   from platformdirs import user_cache_path

   # Always use "message-ix-models" as the `appname` parameter
   ucp = user_cache_path("message-ix-models")

   # Construct the sub-directory for the current module
   dir_ = ucp.joinpath("my-project", "subdir")
   dir_.mkdir(parents=True, exist_ok=True)

   # Construct a file path within this directory
   p = dir_.joinpath("data-file-name.csv")
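
A cache file is typically used by checking whether it exists before repeating a slow operation.
The following is a minimal sketch of that idea using only the standard library; ``cached_text`` and ``slow_operation`` are hypothetical names, and a temporary directory stands in for the user cache path so that the example runs anywhere.

.. code-block:: python

   import tempfile
   from pathlib import Path


   def cached_text(path: Path, compute) -> str:
       """Return the contents of `path`, calling `compute` only if it is missing."""
       if not path.exists():
           path.parent.mkdir(parents=True, exist_ok=True)
           path.write_text(compute())
       return path.read_text()


   # A temporary directory stands in for the user cache path in this sketch
   with tempfile.TemporaryDirectory() as tmp:
       p = Path(tmp, "my-project", "subdir", "data-file-name.csv")
       calls = []

       def slow_operation() -> str:
           calls.append(1)  # Record each time the slow code actually runs
           return "a,b\n1,2\n"

       first = cached_text(p, slow_operation)  # Computes and writes the file
       second = cached_text(p, slow_operation)  # Reads from the cache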


General guidelines
==================

Always consider: “Will this code work on another researcher's computer?”

Prefer text formats
   …such as CSV, over binary formats like Excel.
   CSV files up to several thousand lines are compressed by Git automatically, and Git can handle diffs to these files easily.

*Do not* hard-code paths
   Data stored with (2–4) above can be retrieved with the utility functions mentioned, instead of hard-coded paths.

   For system-specific paths (4) only, get a :obj:`.Context` object and use it to get an appropriate :class:`~pathlib.Path` object pointing to a file:

   .. code-block:: python

       # Store a base path
       project_path = context.get_local_path("myproject", "output")

       # Use the Path object to generate a subpath
       run_id = "foo"
       output_file = project_path.joinpath("reporting", run_id, "all.xlsx")

Keep input and output data separate
   Where possible, use (1–3) above for input data, and (4A) for output data.

Use a consistent scheme for data locations
   For a submodule for a specific model variant or project named, for instance, :py:`message_ix_models.model.[name]` or :py:`message_ix_models.project.[name]`, keep input data in a well-organized directory under:

   - :file:`[base]/[name]/` —preferred, flatter,
   - :file:`[base]/model/[name]/`,
   - :file:`[base]/project/[name]/`,
   - or similar,

   where ``[base]`` is (2) or (3), above.

   Keep *project-specific configuration files* in the same locations, or (less preferable) alongside Python code files:

   .. code-block:: python

      from pathlib import Path

      import yaml
      from message_ix_models.util import load_package_data, load_private_data

      # Located in `message_ix_models/data/`:
      config = load_package_data("myproject", "config.yaml")

      # Located in `data/` in the message_data repo:
      config = load_private_data("myproject", "config.yaml")

      # Located in the same directory as the code; read_text() closes the file
      config = yaml.safe_load(Path(__file__).with_name("config.yaml").read_text())

   Use a similar scheme for output data, except under (4A).

Re-use configuration
   Configuration to run a set of scenarios or to prepare reported submissions **should** re-use or extend existing, general-purpose code.
   Do not duplicate code or configuration.
   Instead, adjust or selectively overwrite its behaviour via project-specific configuration read from a file.
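
   As a minimal sketch of selective overriding, general-purpose defaults can be merged with a project-specific file that states only the differences.
   The keys, values, and the ``merge`` helper below are hypothetical and not part of :mod:`message_ix_models`; JSON is used only to keep the example within the standard library.

   .. code-block:: python

      import json

      # General-purpose defaults, shared by all projects (hypothetical keys)
      DEFAULTS = {"regions": "R12", "report": {"units": "GWa", "plots": True}}

      # Contents of a hypothetical project-specific file; only differences appear
      PROJECT_FILE = '{"report": {"plots": false}}'


      def merge(base: dict, override: dict) -> dict:
          """Return a copy of `base` with `override` applied recursively."""
          result = dict(base)
          for key, value in override.items():
              if isinstance(value, dict) and isinstance(result.get(key), dict):
                  result[key] = merge(result[key], value)
              else:
                  result[key] = value
          return result


      # Defaults are left untouched; only "plots" differs in the result
      config = merge(DEFAULTS, json.loads(PROJECT_FILE))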


.. _large-input-data:
.. _binary-input-data:

Large/binary input data
=======================

These data, such as Microsoft Excel spreadsheets, **must not** be committed as ordinary Git objects.
This is because the entire file is re-added to the Git history for even small modifications, making the repository very large (see `issue #37 <https://github.com/iiasa/message_data/issues/37>`_).

Instead, use one or more of the following patterns, in order of preference.
Whichever pattern is used, code for handling large input data **must** be in :mod:`message_ix_models`, even if the data themselves are private, for instance in :mod:`message_data` or another location.

Fetch directly from a remote source
-----------------------------------

This corresponds to section (1) above.
Preferably, do this via :mod:`message_ix_models.util.pooch`:

- Extend :data:`.pooch.SOURCE` to store the Internet location, file name(s), and hash(es) of the file(s).
- Call :func:`.pooch.fetch` to retrieve the file and cache it locally.
- Write code in :mod:`message_ix_models` that processes the data into a common format, for instance by subclassing :class:`.ExoDataSource`.

This pattern is preferred because it can be replicated by anyone, and the reference data is public.
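
Recording a hash alongside the location and file name lets anyone who fetches a file confirm that they have exactly the same bytes.
The following standard-library sketch shows only that check; ``sha256_of`` and ``verify`` are hypothetical names, and the code does not reproduce how :mod:`message_ix_models.util.pooch` stores or checks hashes.

.. code-block:: python

   import hashlib
   import tempfile
   from pathlib import Path


   def sha256_of(path: Path) -> str:
       """Return the hexadecimal SHA-256 digest of the file at `path`."""
       return hashlib.sha256(path.read_bytes()).hexdigest()


   def verify(path: Path, known_hash: str) -> bool:
       """Return True if the file matches the recorded hash."""
       return sha256_of(path) == known_hash


   with tempfile.TemporaryDirectory() as tmp:
       p = Path(tmp, "example.csv")
       p.write_bytes(b"a,b\n1,2\n")

       # Hash recorded when the source is registered
       known = hashlib.sha256(b"a,b\n1,2\n").hexdigest()
       ok = verify(p, known)

       # Any change to the file is detected
       p.write_bytes(b"a,b\n1,3\n")
       changed = verify(p, known)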

This pattern may be applied to:

- Data published and maintained by others, or
- Data created by the IIASA ECE program to be used in :mod:`message_ix_models`, such as `Zenodo <https://zenodo.org>`_ records.

Use Git Large File Storage (LFS)
--------------------------------

`Git LFS <https://git-lfs.github.com/>`_ is a Git extension that allows for storing large, binary files without bloating the commit history.
Essentially, Git stores a 3-line text file with a hash of the full file, and the full file is stored separately.
The IIASA GitHub organization has up to 300 GB of space for such LFS objects.
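
To make the idea of the 3-line text file concrete, the following sketch builds the text of a pointer in the format of version 1 of the Git LFS specification.
It is for illustration only: :program:`git lfs` creates these pointers itself, and the function name ``lfs_pointer`` is hypothetical.

.. code-block:: python

   import hashlib


   def lfs_pointer(content: bytes) -> str:
       """Return pointer text: spec version, SHA-256 object ID, size in bytes."""
       oid = hashlib.sha256(content).hexdigest()
       return (
           "version https://git-lfs.github.com/spec/v1\n"
           f"oid sha256:{oid}\n"
           f"size {len(content)}\n"
       )


   # A 1 MB binary payload is represented by a pointer of about 130 bytes
   pointer = lfs_pointer(b"\x00" * 1_000_000)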

To use this pattern, simply :program:`git add ...` and :program:`git commit` files in an appropriate location (above).
New or unusual binary file extensions may require a :program:`git lfs` command or modification to :file:`.gitattributes` to ensure they are tracked by LFS and not by Git itself.
See the Git LFS documentation at the link above for more detail.

Large files stored in :file:`message_ix_models/data/` (2, above) using Git LFS:

- **must** be excluded via :file:`MANIFEST.in`.
  This avoids including the files in Python distributions published on PyPI.
- **should** be added to :mod:`.util.pooch`.
  This allows users who install :mod:`message_ix_models` from PyPI to easily retrieve the data.
  This usage **must** be included in the documentation that describes the data files.

Retrieve data from existing databases
-------------------------------------

These include the same IIASA ENE ixmp databases that are used to store scenarios.
Documentation **must** be provided that ensures these data are reproducible: that is, it must identify any original sources and the code used to create the database used by :mod:`message_data`.

Other patterns
--------------

Some other patterns exist, but should not be repeated in new code, and should be migrated to one of the above patterns.

- SQL queries against an Oracle/JDBC database.
  See :ref:`message_data:data-iea` (in :mod:`message_data`) and `issue #53 <https://github.com/iiasa/message_data/issues/53#issuecomment-669117393>`_ for a description of how to replace/simplify this code.


Configuration
=============

:class:`.Context` objects are used to carry configuration, environment information, and other data between parts of the code.
Scripts and user code can also store values in a Context object.

.. code-block:: python

    from copy import deepcopy

    from message_ix_models import Context

    # Get an existing instance of Context. There is always at
    # least 1 instance available
    c = Context.get_instance()

    # Store a value using attribute syntax
    c.foo = 42

    # Store a value with spaces in the name using item syntax
    c["PROJECT data source"] = "Source A"

    # my_function() responds to 'foo' or 'PROJECT data source'
    my_function(c)

    # Store a sub-dictionary of values
    c["PROJECT2"] = {"setting A": 123, "setting B": 456}

    # Create a subcontext with all the settings of `c`
    c2 = deepcopy(c)

    # Modify one setting
    c2.foo = 43

    # Run code with this alternate setting
    my_function(c2)


For the CLI, every command decorated with ``@click.pass_obj`` gets a first positional argument ``context``, which is an instance of this class.
The settings are populated based on the command-line parameters given to ``mix-models`` or (sub)commands.

.. _context:

Top-level settings
------------------

These are defined by :class:`message_ix_models.Config`.

Specific modules for model variants, projects, etc. **should**:

- Define a single :mod:`dataclass <dataclasses>` to express the configuration options they understand.
  See for example:

  - :class:`.model.Config` for describing existing models or constructing new models,
  - :class:`.report.Config` for reporting,
  - :class:`message_data.model.buildings.Config` (for the MESSAGEix-Buildings model variant / linkage).

- Store this on the :class:`.Context` at a simple key.
  For example :class:`.model.Config` is stored at :py:`context.model` or :py:`context["model"]`.
- Retrieve and respect configuration from existing objects, i.e. only duplicate settings with the same meaning when strictly necessary.
- Communicate to other modules by setting the appropriate configuration values.
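
The recommendations above can be sketched as follows.
This is a minimal illustration under stated assumptions: the ``Config`` class, its fields, and the key ``"myproject"`` are hypothetical, and a plain :class:`dict` stands in for a :class:`.Context` object so that the example runs without :mod:`message_ix_models` installed.

.. code-block:: python

   from dataclasses import dataclass, field


   @dataclass
   class Config:
       """Hypothetical settings understood by a module `myproject`."""

       #: Name of a data source
       data_source: str = "Source A"
       #: Periods to include
       years: list = field(default_factory=lambda: [2030, 2050])


   # A plain dict stands in for a Context object in this sketch
   context = {}

   # Store the settings at a simple key
   context["myproject"] = Config(data_source="Source B")


   def my_function(context) -> str:
       """Retrieve and respect the stored settings."""
       cfg = context["myproject"]
       return f"Using {cfg.data_source} for {len(cfg.years)} periods"


   message = my_function(context)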

.. autoclass:: message_ix_models.Config
   :members: