Data, metadata, and configuration¶
Many, varied kinds of data are used to prepare and modify MESSAGEix-GLOBIOM scenarios. Other data are produced by code as incidental or final output. These can be categorized in several ways. One is by the purpose they serve:
data: actual numerical values, used or produced by code;
metadata: information describing where data is, how to manipulate it, how it is structured, etc.;
configuration that otherwise affects how code works.
Another is whether they are input or output data.
This page describes how to store and handle such files in both message_ix_models and message_data.
Choose locations for data¶
These are listed in order of preference.
(1) message_ix_models/data/¶
Files in this directory are public, and are packaged, published, and installable from PyPI with message_ix_models; in standard Python terms, these are “package data”.
This is the preferred location for:
General-purpose metadata.
Basic configuration, e.g. for reporting, not specific to any model variant or project.
Data for publicized model variants and completed/published projects.
Data here can be loaded with load_package_data() or other, more specialized code.
Documentation files like doc/pkg-data/*.rst describe the contents of these files. For example: Node code lists.
(2) data/ directory in the message_data repo¶
Files in this directory are private and not installable from PyPI (because message_data is not installable).
This is the preferred location for:
Data for model variants and projects under current development.
Specific data files that cannot be made public, e.g. due to licensing issues.
Data here can be loaded with load_private_data() or other, more specialized code.
(3) Other, system-specific (“local”) directories¶
These are the preferred location for:
Caches, i.e. temporary data files used to speed up other code.
Output, e.g. data or figure files generated by reporting.
These kinds of data should not be committed to either message_ix_models or message_data.
Each user may configure a location for these data, appropriate to their system.
This setting can be made in multiple ways. In order of ascending precedence:
The default location is the current working directory, i.e. whichever directory the Command-line interface is invoked in, or in which Python code is run that imports and uses message_ix_models.
The ixmp configuration setting message local data.
The MESSAGE_LOCAL_DATA environment variable.
The --local-data CLI option and related options such as --cache-path or the --output option to the report command.
Code that directly modifies the local_data setting on Context.
This location should be outside the Git-controlled directories for message_ix_models or message_data. If not, use .gitignore files to hide these from Git.
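The precedence order described above can be sketched as a small resolver. This is only an illustration and not the implementation in message_ix_models: the function name `resolve_local_data` and its arguments are hypothetical, and the ixmp configuration is represented here by a plain mapping.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_local_data(
    cli_value: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    ixmp_config: Optional[Mapping[str, str]] = None,
) -> Path:
    """Return the local data path, applying the sources in ascending precedence.

    Hypothetical helper: each later source overrides the earlier ones.
    """
    environ = os.environ if environ is None else environ
    ixmp_config = {} if ixmp_config is None else ixmp_config

    # Lowest precedence: the current working directory
    result = Path.cwd()

    # The ixmp configuration setting "message local data"
    if "message local data" in ixmp_config:
        result = Path(ixmp_config["message local data"])

    # The MESSAGE_LOCAL_DATA environment variable
    if "MESSAGE_LOCAL_DATA" in environ:
        result = Path(environ["MESSAGE_LOCAL_DATA"])

    # The --local-data CLI option (highest precedence in this sketch)
    if cli_value is not None:
        result = Path(cli_value)

    return result
```

Code that sets `local_data` on a Context directly would override all of these; that step is left out because it happens after the value is resolved.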
General guidelines¶
Always consider: “Will this code work on another researcher’s computer?”
- Prefer text formats
…such as CSV and YAML. CSV files up to several thousand lines are compressed by Git automatically, and Git can handle diffs to these files easily.
- Do not hard-code paths
Data stored with (1) or (2) above can be retrieved with the utility functions mentioned, instead of hard-coded paths.
For system-specific paths (3) only, get a Context object and use it to get an appropriate Path object pointing to a file:
# Store a base path
project_path = context.get_local_path("myproject", "output")
# Use the Path object to generate a subpath
run_id = "foo"
output_file = project_path.joinpath("reporting", run_id, "all.xlsx")
- Keep input and output data separate
Use (1) or (2), above, for the former, and (3) for the latter.
- Use a consistent scheme for data locations
For a submodule for a specific model variant or project named, e.g. message_ix_models.model.[name] or message_ix_models.projects.[name], keep input data in a well-organized directory under [base]/model/[name]/, [base]/project/[name]/, or similar, where [base] is (1) or (2), above.
Keep project-specific configuration files in the same locations, or (less preferable) alongside Python code files:
# Located in `message_ix_models/data/`:
config = load_package_data("myproject", "config.yaml")
# Located in `data/` in the message_data repo:
config = load_private_data("myproject", "config.yaml")
# Located in the same directory as the code
config = yaml.safe_load(open(Path(__file__).with_name("config.yaml")))
Use a similar scheme for output data, except under (3).
- Re-use configuration
Configuration to run a set of scenarios or to prepare reported submissions should re-use or extend existing, general-purpose code. Do not duplicate code or configuration. Instead, adjust or selectively overwrite its behaviour via project-specific configuration read from a file.
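One way to overwrite general-purpose configuration selectively is a recursive merge of two mappings. The following is a minimal sketch under that assumption; `merge_config` is a hypothetical helper, not a function provided by message_ix_models, and the configuration values are invented for illustration.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with values from `override` applied recursively."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Merge nested sections instead of replacing them wholesale
            result[key] = merge_config(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# General-purpose configuration (illustrative values only)
general = {"reporting": {"units": "GWa", "format": "xlsx"}, "years": [2020, 2030]}
# Project-specific configuration, e.g. as read from a YAML file; changes 1 setting
project = {"reporting": {"format": "csv"}}

config = merge_config(general, project)
```

Because the general-purpose mapping is copied rather than modified, several projects can derive their own settings from the same base without affecting one another.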
Large/binary input data¶
These data, such as Microsoft Excel spreadsheets, must not be committed as ordinary Git objects. This is because the entire file is re-added to the Git history for even small modifications, making it very large (see issue #37).
Instead, use one of the following patterns, in order of preference.
Whichever pattern is used, code for handling large input data must be in message_ix_models, even if the data itself is private, e.g. in message_data or another location.
Fetch from a remote source¶
Use a configuration file in message_ix_models to store metadata, i.e. the Internet location and other information needed to retrieve the data.
Then, write code that retrieves the data and caches it locally:
import requests
from message_ix_models import Context
from message_ix_models.util import load_package_data

context = Context.get_instance()

# Load some configuration; load_package_data() returns the parsed contents
config = load_package_data("big-data-source", "config.yaml")

# Local paths for the cached raw files and extracted file(s)
cache_path = context.get_cache_path("big-data-source")
downloaded = cache_path / "downloaded_file.zip"
extracted = cache_path / "extracted_file.csv"

# Open the cache file for binary writing
with open(downloaded, "wb") as f:
    remote_data = requests.get(config["url"])
    # Handle the data, writing to `f`
    f.write(remote_data.content)

# Extract the data from `downloaded` to `extracted`
This pattern is preferred because it can be replicated by anyone, and the reference data is public.
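The caching step can be made explicit by skipping the download when a local copy already exists. The sketch below uses only the standard library; `fetch_cached`, `download`, and the example URL are hypothetical names chosen for illustration, and a stand-in fetcher replaces the network request so that the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def download(url: str) -> bytes:
    """Default fetcher: return the raw bytes at `url`."""
    with urlopen(url) as response:
        return response.read()


def fetch_cached(
    url: str, target: Path, fetcher: Callable[[str], bytes] = download
) -> Path:
    """Download `url` to `target` unless a cached copy already exists."""
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetcher(url))
    return target


# Usage with a stand-in fetcher that records each call
calls = []


def fake_fetcher(url: str) -> bytes:
    calls.append(url)
    return b"a,b\n1,2\n"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "big-data-source" / "downloaded_file.csv"
    fetch_cached("https://example.com/data.csv", target, fake_fetcher)
    # Second call finds the cached file and does not fetch again
    fetch_cached("https://example.com/data.csv", target, fake_fetcher)
    cached_content = target.read_bytes()
```

Passing the fetcher as an argument keeps the caching logic testable without network access.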
Use Git Large File Storage (LFS)¶
Git LFS is a Git extension that allows for storing large, binary files without bloating the commit history. Essentially, Git stores a small text pointer file containing a hash of the full file, and the full file is stored separately. The IIASA GitHub account has up to 300 GB of space for LFS objects.
To use this pattern, simply git add ... and git commit files in an appropriate location (above).
New or unusual binary file extensions may require a git lfs command or modification to .gitattributes to ensure they are tracked by LFS and not by ordinary Git history.
See the Git LFS documentation at the link above for more detail.
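When Git LFS is not installed on a system, a checkout contains the small text pointer files in place of the actual data, and code that reads them fails in confusing ways. A guard like the following can detect this case. It is a sketch: `is_lfs_pointer` is a hypothetical helper, and it relies only on the fact that LFS pointer files begin with a fixed `version` line.

```python
import tempfile
from pathlib import Path

# First line of every Git LFS pointer file
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like a Git LFS pointer, not the real file."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


with tempfile.TemporaryDirectory() as tmp:
    # A file with the layout of an LFS pointer (dummy hash and size)
    pointer = Path(tmp) / "pointer.xlsx"
    pointer.write_bytes(
        LFS_POINTER_PREFIX + b"\noid sha256:" + b"0" * 64 + b"\nsize 12345\n"
    )
    # A file with arbitrary binary content
    real = Path(tmp) / "real.xlsx"
    real.write_bytes(b"PK\x03\x04 binary spreadsheet content")

    pointer_detected = is_lfs_pointer(pointer)
    real_detected = is_lfs_pointer(real)
```

Data-loading code could call such a check first and raise a clear error asking the user to install Git LFS and fetch the files.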
Retrieve data from existing databases¶
These include the same IIASA ENE ixmp databases that are used to store scenarios.
Documentation must be provided that ensures this data is reproducible, i.e. any original source and code to create the database used by message_data.
Other patterns¶
Some other patterns exist, but should not be repeated in new code, and should be migrated to one of the above patterns.
SQL queries against an Oracle/JDBC database. See data-iea, below, and issue #53 for a description of how to replace/simplify this code.
Configuration¶
Context objects are used to carry configuration, environment information, and other data between parts of the code.
Scripts and user code can also store values in a Context object.
from copy import deepcopy
from message_ix_models import Context
# Get an existing instance of Context. There is always at
# least 1 instance available
c = Context.get_instance()
# Store a value using attribute syntax
c.foo = 42
# Store a value with spaces in the name using item syntax
c["PROJECT data source"] = "Source A"
# my_function() responds to 'foo' or 'PROJECT data source'
my_function(c)
# Store a sub-dictionary of values
c["PROJECT2"] = {"setting A": 123, "setting B": 456}
# Create a subcontext with all the settings of `c`
c2 = deepcopy(c)
# Modify one setting
c2.foo = 43
# Run code with this alternate setting
my_function(c2)
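The behaviour shown above, with attribute and item access to the same settings and independent copies made with deepcopy, can be imitated with a small stand-in class. `MiniContext` is hypothetical and is not the Context class from message_ix_models; it only illustrates why a deep copy is used to create a subcontext.

```python
from copy import deepcopy


class MiniContext(dict):
    """Hypothetical stand-in: a dict whose items are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Store attributes as dictionary items
        self[name] = value


c = MiniContext()
c.foo = 42
c["PROJECT data source"] = "Source A"
c["PROJECT2"] = {"setting A": 123, "setting B": 456}

# A deep copy also duplicates nested dictionaries
c2 = deepcopy(c)
c2.foo = 43
c2["PROJECT2"]["setting A"] = 0
```

With a shallow copy, the change to `c2["PROJECT2"]` would also appear in `c`, because both objects would share the same nested dictionary.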
For the CLI, every command decorated with @click.pass_obj gets a first positional argument context, which is an instance of this class.
The settings are populated based on the command-line parameters given to mix-models or (sub)commands.
Top-level settings¶
See model- and project-specific documentation for further context settings, e.g. model.bare.
Setting | Type | Description |
---|---|---|
cache_path | Path | Base path cache, e.g. as given by the |
dry_run | bool | Whether an operation should be carried out, or only previewed. |
local_data | Path | Base path for system-specific (3) data, e.g. as given by the |
platform_info | dict | Dictionary with keyword arguments for the |
scenario_info | dict | Dictionary with keys ‘model’ and ‘scenario’ as given by the |
url | dict | A scenario URL, e.g. as given by the |
units | pint.UnitRegistry | Deprecated. Use |