Concepts & Terminology

This section explains key ESGF concepts that are essential for understanding how esgprep works. Read this before diving into the tool-specific documentation.

Data Reference Syntax (DRS)

What it is: The Data Reference Syntax (DRS) is a standardized way to organize climate data files in a hierarchical directory structure. Each project (CMIP7, CORDEX, etc.) defines its own DRS specification.

Why it matters: A consistent DRS structure enables:

Automated data discovery across ESGF nodes
Predictable file locations for scripts and tools
Standardized dataset identification
Efficient data management and versioning

Structure example (CMIP7 - MIP-DRS7):

<root>/
└── MIP-DRS7/
    └── <mip_era>/
        └── <activity>/
            └── <organisation>/
                └── <source>/
                    └── <experiment>/
                        └── <variant_label>/
                            └── <region>/
                                └── <frequency>/
                                    └── <variable>/
                                        └── <branded_suffix>/
                                            └── <grid_label>/
                                                └── <directory_date>/
                                                    └── <filename>.nc

Concrete example:

/data/
└── MIP-DRS7/
    └── CMIP7/
        └── CMIP/
            └── IPSL/
                └── IPSL-CM7A-LR/
                    └── historical/
                        └── r1i1p1f1/
                            └── glb/
                                └── mon/
                                    └── tas/
                                        └── none/
                                            └── g1/
                                                ├── d20250101/
                                                │   └── tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc
                                                └── latest -> d20250101/

How esgprep uses it: esgdrs reads your NetCDF files, extracts facet values from filenames and global attributes, then organizes files into the correct DRS hierarchy.
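As an illustration of how facet values map onto the hierarchy, the sketch below assembles a dataset directory from a dictionary of facets, following the MIP-DRS7 template above. It is not esgdrs code: the names DRS_ORDER and build_drs_dir are hypothetical, and the facet order is simply copied from the structure example.

```python
from pathlib import PurePosixPath

# Directory order copied from the MIP-DRS7 structure example (assumption:
# the real specification may define additional rules not modelled here).
DRS_ORDER = [
    "mip_era", "activity", "organisation", "source", "experiment",
    "variant_label", "region", "frequency", "variable",
    "branded_suffix", "grid_label",
]


def build_drs_dir(root, facets, directory_date):
    """Return the DRS directory for one dataset (hypothetical helper)."""
    missing = [name for name in DRS_ORDER if name not in facets]
    if missing:
        raise KeyError(f"missing facets: {', '.join(missing)}")
    parts = [facets[name] for name in DRS_ORDER]
    return PurePosixPath(root, "MIP-DRS7", *parts, directory_date)


example_facets = {
    "mip_era": "CMIP7", "activity": "CMIP", "organisation": "IPSL",
    "source": "IPSL-CM7A-LR", "experiment": "historical",
    "variant_label": "r1i1p1f1", "region": "glb", "frequency": "mon",
    "variable": "tas", "branded_suffix": "none", "grid_label": "g1",
}

print(build_drs_dir("/data", example_facets, "d20250101"))
```

Keeping the facet order in a single list means a missing facet is reported before any path is produced, which is the behaviour a consistent hierarchy needs.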

Facets 

Definition: Facets are metadata attributes that categorize and identify datasets. Each project defines which facets are required and their allowed values.

Common CMIP7 facets:

Facet	Description	Example
`mip_era`	MIP generation	CMIP7
`activity`	MIP activity	CMIP, C4MIP, AerChemMIP
`organisation`	Modeling center	IPSL, CCCma, MOHC
`source`	Model name	IPSL-CM7A-LR, CanESM6-MR
`experiment`	Experiment type	historical, 1pctCO2-bgc
`variant_label`	Ensemble member	r1i1p1f1
`region`	Geographic region	glb (global)
`frequency`	Temporal frequency	mon, day, 1hr, 6hr
`variable`	Variable name	tas, pr, tos
`grid_label`	Grid type	g1, g99

How facets are extracted:

NetCDF File
    │
    ├── Filename parsing
    │   tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc
    │    │   │        │          │         │      │  │
    │    │   │        │          │         │      │  └── grid_label
    │    │   │        │          │         │      └── region
    │    │   │        │          │         └── variant_label
    │    │   │        │          └── experiment
    │    │   │        └── source
    │    │   └── frequency
    │    └── variable
    │
    ├── Global attributes (NetCDF metadata)
    │   organisation = "IPSL"
    │   activity = "CMIP"
    │   mip_era = "CMIP7"
    │
    └── Command-line overrides (--set-value)
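The three sources in the diagram can be combined as in the following sketch. It is only an approximation of the idea, not the esgdrs implementation: parse_filename and resolve_facets are hypothetical names, the token order is taken from the filename shown above, and the precedence (filename, then global attributes, then overrides) is an assumption of this example.

```python
# Token order taken from the annotated filename above.
FILENAME_FACETS = [
    "variable", "frequency", "source", "experiment",
    "variant_label", "region", "grid_label",
]


def parse_filename(filename):
    """Split a filename into facets; the trailing time range is optional."""
    stem = filename[:-3] if filename.endswith(".nc") else filename
    tokens = stem.split("_")
    expected = len(FILENAME_FACETS)
    if len(tokens) not in (expected, expected + 1):
        raise ValueError(f"unexpected number of tokens in {filename!r}")
    facets = dict(zip(FILENAME_FACETS, tokens))
    if len(tokens) == expected + 1:
        facets["time_range"] = tokens[-1]
    return facets


def resolve_facets(filename, global_attrs, overrides=None):
    """Merge facet sources; later sources win (assumed precedence)."""
    facets = parse_filename(filename)
    facets.update(global_attrs)
    facets.update(overrides or {})
    return facets


resolved = resolve_facets(
    "tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc",
    {"organisation": "IPSL", "activity": "CMIP", "mip_era": "CMIP7"},
)
print(resolved["source"], resolved["organisation"], resolved["time_range"])
```

Splitting on underscores only works because the facet values in this filename contain hyphens rather than underscores; a value containing an underscore would need a stricter pattern.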

Facet flow:

Facets → Dataset ID → Directory Path → Mapfile Entry

Dataset IDs 

What they are: Dataset IDs are unique identifiers for datasets, constructed by joining facet values with dots, followed by the version.

Format (new - recommended):

<project>.<facet1>.<facet2>.<facet3>...<facetN>.v<YYYYMMDD>

Note

The dataset ID format is transitioning from a #YYYYMMDD suffix to a .vYYYYMMDD suffix. The new format makes the version part of the dot-separated identifier, which aligns it with the vYYYYMMDD directory naming convention.

Example (CMIP7):

# New format (recommended):
CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1.v20250101

# Legacy format (deprecated):
CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1#20250101

Components breakdown:

CMIP7 - MIP era
CMIP - Activity
IPSL - Organisation
IPSL-CM7A-LR - Source (model)
historical - Experiment
r1i1p1f1 - Variant label (realization, initialization, physics, forcing)
glb - Region (global)
mon - Frequency (monthly)
tas - Variable (near-surface air temperature)
g1 - Grid label
v20250101 - Version (date: vYYYYMMDD)
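To make the two formats concrete, the sketch below converts a legacy ID to the new form and splits an ID into its facet values and version. The function names are hypothetical and the code assumes that facet values contain no dots, as in the example above.

```python
import re

# Legacy IDs end in "#YYYYMMDD"; new IDs end in ".vYYYYMMDD".
LEGACY_ID = re.compile(r"^(?P<base>[^#]+)#(?P<date>\d{8})$")


def to_new_format(dataset_id):
    """Rewrite a legacy '#YYYYMMDD' ID as '.vYYYYMMDD'; leave others as-is."""
    match = LEGACY_ID.match(dataset_id)
    if match is None:
        return dataset_id
    return f"{match['base']}.v{match['date']}"


def split_dataset_id(dataset_id):
    """Return (list of facet values, version) for either ID format."""
    base, _, version = to_new_format(dataset_id).rpartition(".")
    if not re.fullmatch(r"v\d{8}", version):
        raise ValueError(f"no vYYYYMMDD version in {dataset_id!r}")
    return base.split("."), version


legacy = "CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1#20250101"
print(to_new_format(legacy))
print(split_dataset_id(legacy))
```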

Why they matter: Dataset IDs are used in:

ESGF search and discovery
Mapfile generation
Data citation
Cross-node data replication

Versions 

Purpose: Versions track dataset updates over time, allowing users to access specific data releases and ensuring reproducibility of scientific analyses.

Format: vYYYYMMDD (e.g., v20250101) for version directories; dYYYYMMDD (e.g., d20250101) for the dated subdirectories of files/.

Version management in DRS:

tas/g1/
├── files/
│   ├── d20250101/
│   │   └── tas_*.nc          (original files)
│   └── d20250615/
│       └── tas_*.nc          (updated files)
│
├── v20250101/
│   └── tas_*.nc -> ../files/d20250101/tas_*.nc    (symlinks)
│
├── v20250615/
│   ├── tas_001.nc -> ../files/d20250101/tas_001.nc  (unchanged, reuses old)
│   └── tas_002.nc -> ../files/d20250615/tas_002.nc  (new version)
│
└── latest -> v20250615/      (always points to newest)

Key concepts:

files/ directory: Contains actual data files organized by date
vYYYYMMDD/ directories: Contain symlinks to files
Symlink reuse: Unchanged files in new versions link to original files (saves disk space)
latest symlink: Always points to the most recent version
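The layout described by these concepts can be reproduced in a temporary directory with a few lines of Python. This is a simplified sketch of the idea, not what esgdrs does internally: publish_version is a hypothetical helper, file contents are dummy bytes, and the code assumes a POSIX file system where symlinks can be created without special privileges.

```python
import os
import tempfile
from pathlib import Path


def publish_version(dataset_dir, date, new_files, previous=None):
    """Create files/d<date> and v<date>, then repoint 'latest' (sketch)."""
    dataset_dir = Path(dataset_dir)
    files_dir = dataset_dir / "files" / f"d{date}"
    version_dir = dataset_dir / f"v{date}"
    files_dir.mkdir(parents=True)
    version_dir.mkdir()

    # Reuse unchanged files: copy the relative links of the previous version.
    if previous is not None:
        for link in (dataset_dir / previous).iterdir():
            os.symlink(os.readlink(link), version_dir / link.name)

    # Store new or updated files and link to them from the version directory.
    for name, payload in new_files.items():
        (files_dir / name).write_bytes(payload)
        link = version_dir / name
        if link.is_symlink():
            link.unlink()  # replace a link carried over from 'previous'
        os.symlink(os.path.join("..", "files", f"d{date}", name), link)

    latest = dataset_dir / "latest"
    if latest.is_symlink():
        latest.unlink()
    os.symlink(f"v{date}", latest)
    return version_dir


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "tas" / "g1"
    dataset.mkdir(parents=True)
    publish_version(dataset, "20250101",
                    {"tas_001.nc": b"old-1", "tas_002.nc": b"old-2"})
    publish_version(dataset, "20250615", {"tas_002.nc": b"new-2"},
                    previous="v20250101")
    reused_target = os.readlink(dataset / "v20250615" / "tas_001.nc")
    updated_bytes = (dataset / "v20250615" / "tas_002.nc").read_bytes()
    latest_target = os.readlink(dataset / "latest")

print(reused_target)
print(latest_target)
```

Because every link is relative, the whole dataset directory can be moved or replicated without breaking the version directories.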

How esgprep handles versions:

esgdrs upgrade: Creates new version directories with appropriate symlinks
esgdrs latest: Updates the latest symlink
--upgrade-from-latest: Reuses unchanged files from previous version

Mapfiles 

What they are: Mapfiles are text files that list all files in a dataset for ESGF publication. They serve as the input to the ESGF publication gateway.

Format:

dataset_id | file_path | size_bytes | mod_time=<timestamp> | checksum=<checksum_value> | checksum_type=<algorithm> | ...

Note

The dataset ID in mapfiles is transitioning from #YYYYMMDD to .vYYYYMMDD format. esgmapfile generates mapfiles using the new format by default.

Example mapfile content:

CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1.v20250101 | /data/MIP-DRS7/.../tas_mon_*.nc | 2456789012 | mod_time=1704067200.0 | checksum=abc123... | checksum_type=SHA256

Field breakdown:

Field	Description
`dataset_id`	Full dataset identifier with version (`.vYYYYMMDD` format)
`file_path`	Absolute path to the file
`size_bytes`	File size in bytes
`mod_time`	File modification timestamp
`checksum`	File integrity hash value
`checksum_type`	Hash algorithm used (SHA256, SHA2-256, etc.)
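The fields above can be written and read back with plain string handling, as in this sketch. It mirrors the example entry rather than the esgmapfile implementation; the function names and the file path are hypothetical, and the checksum is computed over in-memory bytes to keep the example self-contained.

```python
import hashlib


def make_mapfile_line(dataset_id, file_path, data, mod_time):
    """Format one mapfile entry for the given file contents (sketch)."""
    digest = hashlib.sha256(data).hexdigest()
    fields = [
        dataset_id,
        file_path,
        str(len(data)),
        f"mod_time={mod_time}",
        f"checksum={digest}",
        "checksum_type=SHA256",
    ]
    return " | ".join(fields)


def parse_mapfile_line(line):
    """Split an entry into positional fields and key=value extras."""
    dataset_id, file_path, size, *extras = [f.strip() for f in line.split("|")]
    entry = {
        "dataset_id": dataset_id,
        "file_path": file_path,
        "size_bytes": int(size),
    }
    for item in extras:
        key, _, value = item.partition("=")
        entry[key] = value
    return entry


line = make_mapfile_line(
    "CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1.v20250101",
    "/data/example/tas.nc",  # hypothetical path, for illustration only
    b"dummy file contents",
    1704067200.0,
)
entry = parse_mapfile_line(line)
print(entry["size_bytes"], entry["checksum_type"])
```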

How to generate:

$> esgmapfile make --project cmip7 /data/MIP-DRS7/

Output location: By default, mapfiles are written to the current directory, using the naming pattern <dataset_id>.map

CMOR (Climate Model Output Rewriter)

What it is: CMOR is a library that standardizes climate model output according to CF conventions and project-specific requirements (CMIP7, CORDEX, etc.).

CMOR-compliant files have:

Standardized variable names and units
Required global attributes (organisation, source, etc.)
Consistent filename conventions
CF-compliant coordinate systems

What esgprep expects:

NetCDF files processed by CMOR (or equivalent)
Filenames following project naming conventions
Required global attributes present
Valid facet values in vocabularies

Example CMOR filename (CMIP7):

<variable>_<frequency>_<source>_<experiment>_<variant_label>_<region>_<grid_label>[_<time_range>].nc

tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc

Link: CMOR Documentation

Controlled Vocabularies 

What they are: Controlled vocabularies (CVs) are approved lists of valid values for each facet. They ensure consistency across all ESGF data providers.

Managed by: The esgvoc library provides vocabulary access for esgprep.

Examples of controlled values (CMIP7):

mip_era:      CMIP7
activity:     CMIP, C4MIP, AerChemMIP, HighResMIP, ...
organisation: IPSL, CCCma, MOHC, MPI-M, ...
experiment:   historical, 1pctCO2-bgc, esm-flat10, ...
frequency:    mon, day, 6hr, 3hr, 1hr, ...

What happens with invalid values:

$> esgdrs list --project cmip7 /data/
ERROR: Invalid value 'invalid_experiment' for facet 'experiment'
Valid values: historical, 1pctCO2-bgc, esm-flat10, ...
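Conceptually, this check is a membership test of each facet value against its vocabulary. The sketch below shows the idea with a small hard-coded dictionary built from the (truncated) value lists above; it is a stand-in for illustration and does not use or reproduce the esgvoc API.

```python
# Partial vocabularies copied from the examples above; the real lists are
# longer and are provided by esgvoc, not hard-coded like this.
VOCAB = {
    "mip_era": {"CMIP7"},
    "activity": {"CMIP", "C4MIP", "AerChemMIP", "HighResMIP"},
    "experiment": {"historical", "1pctCO2-bgc", "esm-flat10"},
    "frequency": {"mon", "day", "6hr", "3hr", "1hr"},
}


def find_invalid(facets, vocab=VOCAB):
    """Return the facets whose value is not in the matching vocabulary."""
    return {
        name: value
        for name, value in facets.items()
        if name in vocab and value not in vocab[name]
    }


invalid = find_invalid({
    "mip_era": "CMIP7",
    "experiment": "invalid_experiment",
    "frequency": "mon",
})
for name, value in invalid.items():
    print(f"Invalid value {value!r} for facet {name!r}")
```

Facets without an entry in the dictionary are skipped here; a real validator would also have to decide how to treat facets with free-form values.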

Exploring available values:

# List all projects
$> esgvoc list-projects

# Get values for a specific collection
$> esgvoc get cmip7:activity:

# Validate a DRS path
$> esgvoc drsvalid cmip7 directory /path/to/data

Updating vocabularies:

$> pip install --upgrade esgvoc

Glossary 

CF Conventions: Climate and Forecast conventions for NetCDF metadata standardization.
Dataset: A collection of files sharing the same facet values (except time range).
DRS: Data Reference Syntax - hierarchical directory structure for climate data.
ESGF: Earth System Grid Federation - distributed data infrastructure for climate science.
Facet: A metadata attribute that categorizes data (e.g., variable, experiment, source).
Mapfile: Text file listing dataset files for ESGF publication.
MIP-DRS7: The Data Reference Syntax specification designed for CMIP7 and future MIP projects.
Variant Label: Ensemble identifier in format r<N>i<M>p<L>f<K> (realization, initialization, physics, forcing).
MIP: Model Intercomparison Project (e.g., CMIP, CORDEX).
Symlink: Symbolic link - a file pointing to another file’s location.
Version: Dataset release identifier in format vYYYYMMDD.