Concepts & Terminology
This section explains key ESGF concepts that are essential for understanding how esgprep works.
Read this before diving into the tool-specific documentation.
Data Reference Syntax (DRS)
What it is: The Data Reference Syntax (DRS) is a standardized way to organize climate data files in a hierarchical directory structure. Each project (CMIP7, CORDEX, etc.) defines its own DRS specification.
Why it matters: A consistent DRS structure enables:
Automated data discovery across ESGF nodes
Predictable file locations for scripts and tools
Standardized dataset identification
Efficient data management and versioning
Structure example (CMIP7 - MIP-DRS7):
<root>/
└── MIP-DRS7/
    └── <mip_era>/
        └── <activity>/
            └── <organisation>/
                └── <source>/
                    └── <experiment>/
                        └── <variant_label>/
                            └── <region>/
                                └── <frequency>/
                                    └── <variable>/
                                        └── <branded_suffix>/
                                            └── <grid_label>/
                                                └── <directory_date>/
                                                    └── <filename>.nc
Concrete example:
/data/
└── MIP-DRS7/
    └── CMIP7/
        └── CMIP/
            └── IPSL/
                └── IPSL-CM7A-LR/
                    └── historical/
                        └── r1i1p1f1/
                            └── glb/
                                └── mon/
                                    └── tas/
                                        └── none/
                                            └── g1/
                                                ├── d20250101/
                                                │   └── tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc
                                                └── latest -> d20250101/
How esgprep uses it:
esgdrs reads your NetCDF files, extracts facet values from filenames and global attributes,
then organizes files into the correct DRS hierarchy.
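As an illustration of how facet values map onto the hierarchy above, here is a minimal Python sketch. It is not esgdrs code: the helper `build_drs_path` and the `DRS_ORDER` list are hypothetical names, and the directory order is simply copied from the MIP-DRS7 structure shown in this section.

```python
from pathlib import PurePosixPath

# Directory order copied from the MIP-DRS7 structure example (assumption:
# this sketch hard-codes it; the real tool reads it from the project definition).
DRS_ORDER = [
    "mip_era", "activity", "organisation", "source", "experiment",
    "variant_label", "region", "frequency", "variable",
    "branded_suffix", "grid_label",
]


def build_drs_path(root, facets, directory_date, filename):
    """Join facet values in DRS order under <root>/MIP-DRS7/ (hypothetical helper)."""
    missing = [name for name in DRS_ORDER if name not in facets]
    if missing:
        raise KeyError(f"missing facets: {missing}")
    parts = [facets[name] for name in DRS_ORDER]
    return PurePosixPath(root, "MIP-DRS7", *parts, directory_date, filename)


example_facets = {
    "mip_era": "CMIP7", "activity": "CMIP", "organisation": "IPSL",
    "source": "IPSL-CM7A-LR", "experiment": "historical",
    "variant_label": "r1i1p1f1", "region": "glb", "frequency": "mon",
    "variable": "tas", "branded_suffix": "none", "grid_label": "g1",
}
drs_path = build_drs_path(
    "/data", example_facets, "d20250101",
    "tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc",
)
print(drs_path)
```

The sketch only computes a path; it does not move or link any files.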
Facets
Definition: Facets are metadata attributes that categorize and identify datasets. Each project defines which facets are required and their allowed values.
Common CMIP7 facets:
| Facet | Description | Example |
|---|---|---|
| mip_era | MIP generation | CMIP7 |
| activity | MIP activity | CMIP, C4MIP, AerChemMIP |
| organisation | Modeling center | IPSL, CCCma, MOHC |
| source | Model name | IPSL-CM7A-LR, CanESM6-MR |
| experiment | Experiment type | historical, 1pctCO2-bgc |
| variant_label | Ensemble member | r1i1p1f1 |
| region | Geographic region | glb (global) |
| frequency | Temporal frequency | mon, day, 1hr, 6hr |
| variable | Variable name | tas, pr, tos |
| grid_label | Grid type | g1, g99 |
How facets are extracted:
NetCDF File
│
├── Filename parsing
│   tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc
│   │   │   │            │          │        │   │
│   │   │   │            │          │        │   └── grid_label
│   │   │   │            │          │        └── region
│   │   │   │            │          └── variant_label
│   │   │   │            └── experiment
│   │   │   └── source
│   │   └── frequency
│   └── variable
│
├── Global attributes (NetCDF metadata)
│   organisation = "IPSL"
│   activity = "CMIP"
│   mip_era = "CMIP7"
│
└── Command-line overrides (--set-value)
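The extraction steps in the diagram can be sketched in Python as follows. This is an illustrative sketch, not the esgdrs implementation: `parse_filename` and `merge_facets` are hypothetical helpers, the parser assumes facet values never contain underscores, and the merge order (filename, then global attributes, then command-line overrides, with later sources winning) is an assumption.

```python
# Facet order in the filename, as shown in the diagram above.
FILENAME_FACETS = [
    "variable", "frequency", "source", "experiment",
    "variant_label", "region", "grid_label",
]


def parse_filename(filename):
    """Split a CMIP7-style filename into facets; the time range is optional."""
    if not filename.endswith(".nc"):
        raise ValueError(f"not a NetCDF filename: {filename}")
    tokens = filename[: -len(".nc")].split("_")
    expected = len(FILENAME_FACETS)
    if len(tokens) not in (expected, expected + 1):
        raise ValueError(f"unexpected number of fields in: {filename}")
    facets = dict(zip(FILENAME_FACETS, tokens))
    if len(tokens) == expected + 1:
        facets["time_range"] = tokens[-1]
    return facets


def merge_facets(from_filename, global_attrs, overrides):
    """Combine the three sources; later sources win (assumed precedence)."""
    return {**from_filename, **global_attrs, **overrides}


name = "tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc"
attrs = {"organisation": "IPSL", "activity": "CMIP", "mip_era": "CMIP7"}
merged = merge_facets(parse_filename(name), attrs, overrides={})
print(merged)
```

In a real workflow the global attributes would be read from the NetCDF file itself; here they are passed as a plain dictionary to keep the sketch self-contained.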
Facet flow:
Facets → Dataset ID → Directory Path → Mapfile Entry
Dataset IDs
What they are: Dataset IDs are unique identifiers for datasets, constructed by joining facet values with dots, followed by the version.
Format (new - recommended):
<project>.<facet1>.<facet2>.<facet3>...<facetN>.v<YYYYMMDD>
Note
The dataset ID format is transitioning from #YYYYMMDD suffix to .vYYYYMMDD suffix.
The new format integrates the version as part of the identifier, making it more consistent
with directory naming conventions.
Example (CMIP7):
# New format (recommended):
CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1.v20250101
# Legacy format (deprecated):
CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1#20250101
Components breakdown:
CMIP7 - MIP era
CMIP - Activity
IPSL - Organisation
IPSL-CM7A-LR - Source (model)
historical - Experiment
r1i1p1f1 - Variant label (realization, initialization, physics, forcing)
glb - Region (global)
mon - Frequency (monthly)
tas - Variable (near-surface air temperature)
g1 - Grid label
v20250101 - Version (date: vYYYYMMDD)
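The following Python sketch shows how such an identifier could be assembled from facet values, and how a legacy `#YYYYMMDD` identifier could be rewritten in the new form. The helper names `make_dataset_id` and `upgrade_legacy_id` are hypothetical, and the facet order is taken from the CMIP7 example above rather than from any project definition.

```python
import re

# Facet order taken from the CMIP7 example ID in this section (assumption).
ID_ORDER = [
    "mip_era", "activity", "organisation", "source", "experiment",
    "variant_label", "region", "frequency", "variable", "grid_label",
]


def make_dataset_id(facets, version):
    """Join facet values with dots and append the version (vYYYYMMDD)."""
    if not re.fullmatch(r"v\d{8}", version):
        raise ValueError(f"version must look like vYYYYMMDD: {version}")
    return ".".join([facets[name] for name in ID_ORDER] + [version])


def upgrade_legacy_id(dataset_id):
    """Rewrite '<id>#YYYYMMDD' as '<id>.vYYYYMMDD'; leave other IDs unchanged."""
    if "#" not in dataset_id:
        return dataset_id
    base, date = dataset_id.rsplit("#", 1)
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError(f"unexpected legacy version: {date}")
    return f"{base}.v{date}"


id_facets = {
    "mip_era": "CMIP7", "activity": "CMIP", "organisation": "IPSL",
    "source": "IPSL-CM7A-LR", "experiment": "historical",
    "variant_label": "r1i1p1f1", "region": "glb", "frequency": "mon",
    "variable": "tas", "grid_label": "g1",
}
new_id = make_dataset_id(id_facets, "v20250101")
legacy_id = "CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1#20250101"
print(new_id)
print(upgrade_legacy_id(legacy_id))
```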
Why they matter: Dataset IDs are used in:
ESGF search and discovery
Mapfile generation
Data citation
Cross-node data replication
Versions
Purpose: Versions track dataset updates over time, allowing users to access specific data releases and ensuring reproducibility of scientific analyses.
Format:
vYYYYMMDD (e.g., v20250101) for version directories
dYYYYMMDD (e.g., d20250101) for the dated subdirectories under files/
Version management in DRS:
tas/g1/
├── files/
│   ├── d20250101/
│   │   └── tas_*.nc  (original files)
│   └── d20250615/
│       └── tas_*.nc  (updated files)
│
├── v20250101/
│   └── tas_*.nc -> ../files/d20250101/tas_*.nc  (symlinks)
│
├── v20250615/
│   ├── tas_001.nc -> ../files/d20250101/tas_001.nc  (unchanged, reuses old)
│   └── tas_002.nc -> ../files/d20250615/tas_002.nc  (new version)
│
└── latest -> v20250615/  (always points to newest)
Key concepts:
files/ directory: Contains actual data files organized by date
vYYYYMMDD/ directories: Contain symlinks to files
Symlink reuse: Unchanged files in new versions link to original files (saves disk space)
latest symlink: Always points to the most recent version
How esgprep handles versions:
esgdrs upgrade: Creates new version directories with appropriate symlinks
esgdrs latest: Updates the latest symlink
--upgrade-from-latest: Reuses unchanged files from previous version
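The layout described above can be reproduced on a POSIX filesystem with a short Python sketch. This illustrates the symlink scheme only and is not how esgdrs is implemented: `create_version` is a hypothetical helper, and the caller decides which dated directory each file comes from (the real tool determines that itself).

```python
import os
import tempfile
from pathlib import Path


def create_version(dataset_dir, version, file_sources):
    """Create <version>/ with relative symlinks into files/ and repoint 'latest'.

    file_sources maps a filename to the dYYYYMMDD directory that holds it.
    """
    dataset_dir = Path(dataset_dir)
    version_dir = dataset_dir / version
    version_dir.mkdir()
    for name, date_dir in file_sources.items():
        # Relative target, so the tree stays valid if the root is moved.
        (version_dir / name).symlink_to(Path("..") / "files" / date_dir / name)
    latest = dataset_dir / "latest"
    if latest.is_symlink():
        latest.unlink()
    latest.symlink_to(version, target_is_directory=True)
    return version_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Real files live under files/<dYYYYMMDD>/.
    for date_dir, names in {"d20250101": ["tas_001.nc", "tas_002.nc"],
                            "d20250615": ["tas_002.nc"]}.items():
        (root / "files" / date_dir).mkdir(parents=True)
        for name in names:
            (root / "files" / date_dir / name).write_text(f"{date_dir}/{name}")

    create_version(root, "v20250101",
                   {"tas_001.nc": "d20250101", "tas_002.nc": "d20250101"})
    # Second version: tas_001.nc is unchanged and reuses the original file.
    create_version(root, "v20250615",
                   {"tas_001.nc": "d20250101", "tas_002.nc": "d20250615"})

    latest_target = os.readlink(root / "latest")
    link_targets = {name: os.readlink(root / "v20250615" / name)
                    for name in ("tas_001.nc", "tas_002.nc")}
    content_via_latest = (root / "latest" / "tas_002.nc").read_text()

print(latest_target, link_targets)
```

Because unchanged files are only linked, the second version adds one new data file on disk instead of two.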
Mapfiles
What they are: Mapfiles are text files that list all files in a dataset for ESGF publication. They serve as the input to the ESGF publication gateway.
Format:
dataset_id | file_path | size_bytes | mod_time=<timestamp> | checksum=<checksum_value> | checksum_type=<algorithm> | ...
Note
The dataset ID in mapfiles is transitioning from #YYYYMMDD to .vYYYYMMDD format.
esgmapfile generates mapfiles using the new format by default.
Example mapfile content:
CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1.v20250101 | /data/MIP-DRS7/.../tas_mon_*.nc | 2456789012 | mod_time=1704067200.0 | checksum=abc123... | checksum_type=SHA256
Field breakdown:
| Field | Description |
|---|---|
| dataset_id | Full dataset identifier, including the version |
| file_path | Absolute path to the file |
| size_bytes | File size in bytes |
| mod_time | File modification timestamp |
| checksum | File integrity hash value |
| checksum_type | Hash algorithm used (SHA256, SHA2-256, etc.) |
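To make the field layout concrete, the sketch below parses one mapfile line into a dictionary. It assumes the layout of the example line in this section (three positional fields followed by `key=value` fields); `parse_mapfile_line` is a hypothetical helper, and the path and checksum in the sample line are placeholders, not real data.

```python
def parse_mapfile_line(line):
    """Split a mapfile line on '|' into positional and key=value fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 3:
        raise ValueError("expected at least dataset_id | file_path | size_bytes")
    entry = {
        "dataset_id": fields[0],
        "file_path": fields[1],
        "size_bytes": int(fields[2]),
    }
    for extra in fields[3:]:
        key, sep, value = extra.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {extra}")
        entry[key.strip()] = value.strip()
    return entry


# Hypothetical sample line: the path and checksum are placeholders.
sample_line = (
    "CMIP7.CMIP.IPSL.IPSL-CM7A-LR.historical.r1i1p1f1.glb.mon.tas.g1.v20250101"
    " | /data/example/tas_example.nc | 2456789012"
    " | mod_time=1704067200.0 | checksum=abc123 | checksum_type=SHA256"
)
entry = parse_mapfile_line(sample_line)
print(entry["dataset_id"], entry["size_bytes"], entry["checksum_type"])
```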
How to generate:
$> esgmapfile make --project cmip7 /data/MIP-DRS7/
Output location:
By default, mapfiles are written to the current directory, using the naming pattern:
<dataset_id>.map
CMOR (Climate Model Output Rewriter)
What it is: CMOR is a library that standardizes climate model output according to CF conventions and project-specific requirements (CMIP7, CORDEX, etc.).
CMOR-compliant files have:
Standardized variable names and units
Required global attributes (organisation, source, etc.)
Consistent filename conventions
CF-compliant coordinate systems
What esgprep expects:
NetCDF files processed by CMOR (or equivalent)
Filenames following project naming conventions
Required global attributes present
Valid facet values in vocabularies
Example CMOR filename (CMIP7):
<variable>_<frequency>_<source>_<experiment>_<variant_label>_<region>_<grid_label>[_<time_range>].nc
tas_mon_IPSL-CM7A-LR_historical_r1i1p1f1_glb_g1_185001-201412.nc
Link: CMOR Documentation
Controlled Vocabularies
What they are: Controlled vocabularies (CVs) are approved lists of valid values for each facet. They ensure consistency across all ESGF data providers.
Managed by:
The esgvoc library provides vocabulary access for esgprep.
Examples of controlled values (CMIP7):
mip_era: CMIP7
activity: CMIP, C4MIP, AerChemMIP, HighResMIP, ...
organisation: IPSL, CCCma, MOHC, MPI-M, ...
experiment: historical, 1pctCO2-bgc, esm-flat10, ...
frequency: mon, day, 6hr, 3hr, 1hr, ...
What happens with invalid values:
$> esgdrs list --project cmip7 /data/
ERROR: Invalid value 'invalid_experiment' for facet 'experiment'
Valid values: historical, 1pctCO2-bgc, esm-flat10, ...
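The check behind such an error can be pictured with a small Python sketch. It deliberately does not call the esgvoc API: the vocabulary is a hand-written dictionary containing only the example values listed in this section, and `find_invalid_facets` is a hypothetical helper.

```python
# Hand-written subset of the example values above; the real lists are longer
# and are provided by esgvoc, which this sketch does not call.
EXAMPLE_VOCAB = {
    "mip_era": {"CMIP7"},
    "activity": {"CMIP", "C4MIP", "AerChemMIP", "HighResMIP"},
    "organisation": {"IPSL", "CCCma", "MOHC", "MPI-M"},
    "experiment": {"historical", "1pctCO2-bgc", "esm-flat10"},
    "frequency": {"mon", "day", "6hr", "3hr", "1hr"},
}


def find_invalid_facets(facets, vocab):
    """Return (facet, value) pairs whose value is not in the vocabulary.

    Facets without an entry in vocab are skipped, not rejected.
    """
    return [
        (name, value)
        for name, value in facets.items()
        if name in vocab and value not in vocab[name]
    ]


candidate = {
    "mip_era": "CMIP7",
    "activity": "CMIP",
    "experiment": "invalid_experiment",
    "frequency": "mon",
}
problems = find_invalid_facets(candidate, EXAMPLE_VOCAB)
for name, value in problems:
    print(f"Invalid value '{value}' for facet '{name}'")
```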
Exploring available values:
# List all projects
$> esgvoc list-projects
# Get values for a specific collection
$> esgvoc get cmip7:activity:
# Validate a DRS path
$> esgvoc drsvalid cmip7 directory /path/to/data
Updating vocabularies:
$> pip install --upgrade esgvoc
Glossary
- CF Conventions
Climate and Forecast conventions for NetCDF metadata standardization.
- Dataset
A collection of files sharing the same facet values (except time range).
- DRS
Data Reference Syntax - hierarchical directory structure for climate data.
- ESGF
Earth System Grid Federation - distributed data infrastructure for climate science.
- Facet
A metadata attribute that categorizes data (e.g., variable, experiment, source).
- Mapfile
Text file listing dataset files for ESGF publication.
- MIP-DRS7
The Data Reference Syntax specification designed for CMIP7 and future MIP projects.
- Variant Label
Ensemble identifier in format r<N>i<M>p<L>f<K> (realization, initialization, physics, forcing).
- MIP
Model Intercomparison Project (e.g., CMIP, CORDEX).
- Symlink
Symbolic link - a file pointing to another file’s location.
- Version
Dataset release identifier in format vYYYYMMDD.
See Also
Getting Started - Hands-on tutorial
Manage local data through the DRS - Detailed DRS command reference
Generate mapfile for ESGF publication - Mapfile generation guide