Getting Started

This guide will walk you through your first use of esgprep to prepare climate data for ESGF publication. By the end, you’ll understand the complete workflow from raw NetCDF files to publication-ready mapfiles.

Note

This guide assumes you have already installed esgprep. If not, see Installation.

Prerequisites

Before starting, ensure you have:

System Requirements:
  • Python 3.12 or higher

  • Linux/Unix environment

  • Access to the filesystem containing your data

  • Sufficient disk space (at least 2x your data size for DRS structure)

Data Requirements:
  • CMOR-compliant NetCDF files

  • Files from a supported ESGF project (CMIP6, CMIP5, CORDEX, etc.)

  • Files with proper global attributes and naming conventions

Verify Your Environment:

# Check Python version
$ python3 --version
Python 3.12.7

# Check esgprep is installed
$ esgdrs --version
esgdrs (from esgprep v3.0.0)

$ esgmapfile --version
esgmapfile (from esgprep v3.0.0)

Important

Initialize Controlled Vocabularies

Before using esgprep for the first time, you must initialize the controlled vocabularies:

$ esgvoc install

This downloads ESGF project vocabularies and builds local databases. The installation may take a few minutes.

What happens if you skip this? You’ll see an error like: RuntimeError: universe connection is not initialized

Keep your vocabularies updated: Run esgvoc install periodically to get the latest controlled vocabulary updates from ESGF projects. This ensures you can work with newly added experiments, models, or updated facet values.

For more details about controlled vocabularies, see the esgvoc documentation.

Understanding the Workflow

The esgprep workflow has two main stages:

┌─────────────────┐
│  NetCDF Files   │  Your incoming CMOR-compliant standardized data
│  (any location) │  (following project norms: CMIP7, CMIP6, etc.)
└────────┬────────┘
         │
         │ esgdrs list    (preview datasets)
         │ esgdrs tree    (preview structure)
         │ esgdrs upgrade (organize files)
         ↓
┌─────────────────┐
│  DRS Structure  │  Files organized following the project DRS
│  (versioned)    │
└────────┬────────┘
         │
         │ esgmapfile make (generate publication metadata)
         ↓
┌─────────────────┐
│    Mapfiles     │  Ready for ESGF publication
└─────────────────┘

Key Concepts:

  • DRS (Data Reference Syntax): A standardized directory structure for organizing climate data

  • Facets: Metadata attributes (like project, model, experiment) that define your data

  • Dataset ID: A unique identifier constructed from facets, like CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical...

  • Mapfiles: Text files listing your data files with checksums, required for ESGF publication

Example Scenario

Let’s prepare CMIP6 data from the IPSL-CM6A-LR model for publication.

Starting Point:

Note

Your files must be CMOR-compliant standardized NetCDF files that follow your project’s conventions (CMIP7, CMIP6, CORDEX, etc.). Files produced by CMOR or following the same standards will work correctly.

You have standardized NetCDF files in an incoming directory:

/data/incoming/
├── tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc
├── tas_Amon_IPSL-CM6A-LR_historical_r2i1p1f1_gr_185001-201412.nc
└── pr_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc

Goal:

Create a DRS-compliant structure and generate mapfiles for ESGF publication.

Step 1: List Datasets

First, let’s see what datasets esgprep detects from your files:

$ esgdrs make list --project cmip6 /data/incoming/

Expected Output:

DRS tree generation [----<-] ...
DRS tree generation [<<<<<<] Completed
Number of success(es): 3
Number of error(s): 0
===================================================================================================================================
                    Publication level                       Latest version   ->  Upgrade version   Files to upgrade      Total size
-----------------------------------------------------------------------------------------------------------------------------------
CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr      Initial       ->     v20250125                     1            1.2G
CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/tas/gr      Initial       ->     v20250125                     1            1.2G
CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/pr/gr       Initial       ->     v20250125                     1            2.1G
===================================================================================================================================

What This Shows:

  • Progress spinner during file scanning: [----<-][<<<<<<]

  • Success/error counts for processed files

  • Table showing:

    • Publication level: Dataset path in DRS structure

    • Latest version: “Initial” for new datasets, or existing version number

    • Upgrade version: New version to be created (vYYYYMMDD format, typically today’s date)

    • Files to upgrade: Count of files in this dataset

    • Total size: Human-readable size of the dataset

Tip

If you see errors here about invalid facets or unrecognized project, check that:

  • Your files are CMOR-compliant

  • The project name is spelled correctly (case-sensitive: cmip6 not CMIP6)

  • Files have proper global attributes

Step 2: Preview DRS Structure

Before making changes, preview what the DRS structure will look like:

$ esgdrs make tree --project cmip6 /data/incoming/ --root /data/esgf-data

Expected Output:

DRS tree generation [----<-] ...
DRS tree generation [<<<<<<] Completed
Number of success(es): 3
Number of error(s): 0
===================================================================================================================================
                                                          Upgrade DRS Tree
-----------------------------------------------------------------------------------------------------------------------------------
/data/esgf-data
CMIP6
└── CMIP
    └── IPSL
        └── IPSL-CM6A-LR
            └── historical
                ├── r1i1p1f1
                │   └── Amon
                │       ├── tas
                │       │   └── gr
                │       │       ├── files
                │       │       │   └── d20250125
                │       │       │       └── tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc
                │       │       ├── latest -> v20250125
                │       │       └── v20250125
                │       │           └── tas_Amon_*.nc -> ../files/d20250125/tas_Amon_*.nc
                │       └── pr
                │           └── gr
                │               ├── files
                │               │   └── d20250125
                │               ├── latest -> v20250125
                │               └── v20250125
                └── r2i1p1f1
                    └── Amon
                        └── tas
                            └── gr
                                ├── files
                                │   └── d20250125
                                ├── latest -> v20250125
                                └── v20250125
===================================================================================================================================

What This Shows:

  • Complete directory hierarchy following project DRS

  • Files organized by facets: activity, institution, model, experiment, variant, frequency, variable, grid

  • files/dYYYYMMDD/ directories containing actual data files

  • vYYYYMMDD/ directories with symlinks pointing to files

  • latest symlinks pointing to newest version directory

  • Arrow notation (->) showing symlink targets

Note

The --root option specifies where to create the DRS structure. If omitted, it uses your current directory.

Step 3: See Planned Operations

For more detail on what operations will be performed:

$ esgdrs make todo --project cmip6 /data/incoming/ --root /data/esgf-data --link

Expected Output:

DRS tree generation [----<-] ...
DRS tree generation [<<<<<<] Completed
Number of success(es): 3
Number of error(s): 0
===================================================================================================================================
                                                    Unix command-lines (DRY-RUN)
-----------------------------------------------------------------------------------------------------------------------------------
mkdir -p /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/v20250125
ln -s ../files/d20250125/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/v20250125/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc
mkdir -p /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr
ln -s v20250125 /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/latest
mkdir -p /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/files/d20250125
ln /data/incoming/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/files/d20250125/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc

[... similar commands for other files ...]
===================================================================================================================================

What This Shows:

  • Header: “Unix command-lines (DRY-RUN)” indicates no actual changes

  • Exact Unix commands that will be executed:

    1. mkdir -p - Create version directories

    2. ln -s - Create symlinks from version dir to files/

    3. ln -s - Create latest symlink pointing to version

    4. mkdir -p - Create files/dYYYYMMDD directory

    5. ln - Create hard link (with --link flag) or mv - Move file (default)

  • Sequence shows complete operation for each dataset

Tip

Use --copy instead of --link if you want to preserve the original files separately. Use --symlink for symbolic links (use with caution - broken if source moves).

Step 4: Apply DRS Structure

Now let’s actually create the DRS structure:

$ esgdrs make upgrade --project cmip6 /data/incoming/ --root /data/esgf-data --link

Expected Output:

DRS tree generation [----<-] ...
DRS tree generation [<<<<<<] Completed
Number of success(es): 3
Number of error(s): 0
===================================================================================================================================
                                                         Unix command-lines
-----------------------------------------------------------------------------------------------------------------------------------
mkdir -p /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/v20250125
ln -s ../files/d20250125/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/v20250125/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc
ln -s v20250125 /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/latest
mkdir -p /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/files/d20250125
ln /data/incoming/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/files/d20250125/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc

[... similar commands for other datasets ...]
===================================================================================================================================

What Just Happened:

  • Commands shown are actually being executed (no longer “DRY-RUN”)

  • DRS directory structure created under /data/esgf-data/CMIP6/

  • Files hard-linked (with --link) to their DRS locations

  • latest symlinks created pointing to v20250125

  • Success count confirms all operations completed

Verify the Result:

$ ls -lh /data/esgf-data/cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/

drwxr-xr-x  2 user group 4.0K Nov 25 10:30 v20250125
lrwxrwxrwx  1 user group   10 Nov 25 10:30 latest -> v20250125

$ ls /data/esgf-data/cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/v20250125/

tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc

Perfect! Your data is now organized according to the project DRS.

Step 5: Generate Mapfiles

Now generate the mapfiles needed for ESGF publication:

$ esgmapfile make --project cmip6 --directory /data/esgf-data/CMIP6/ --outdir /data/mapfiles

Expected Output:

Mapfiles generation [----<-] ...
Mapfiles generation [<<<<<<] Completed

Mapfile(s) generated: 3 (in /data/mapfiles)
Number of success(es): 3
Number of error(s): 0

What Was Created:

Mapfiles are text files with one line per data file:

$ cat /data/mapfiles/CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r1i1p1f1.Amon.tas.gr.v20250125.map

Content:

CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r1i1p1f1.Amon.tas.gr#20250125 | /data/esgf-data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/tas/gr/v20250125/tas_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc | 1234567890 | mod_time=1732531200.0 | checksum=a3d5e6f7890b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
Format:
  • dataset_id#version | file_path | size_bytes | mod_time=TIMESTAMP | checksum=HEXDIGEST

  • Note: Version uses # separator (not .v) in the dataset ID within mapfiles

  • Checksum type defaults to SHA256 unless specified with --checksum-type

These mapfiles are ready for ESGF publication!

Step 6: Verify Everything

Let’s verify the complete workflow succeeded:

# Check DRS structure exists
$ tree /data/esgf-data/CMIP6/ | head -20
# or use: find /data/esgf-data/CMIP6/ -type f -o -type l

# Check mapfiles were created
$ ls -lh /data/mapfiles/
-rw-r--r-- 1 user group 350 Jan 25 10:30 CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r1i1p1f1.Amon.tas.gr.v20250125.map
-rw-r--r-- 1 user group 350 Jan 25 10:30 CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r2i1p1f1.Amon.tas.gr.v20250125.map
-rw-r--r-- 1 user group 350 Jan 25 10:30 CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r1i1p1f1.Amon.pr.gr.v20250125.map

# View mapfile content
$ cat /data/mapfiles/*.map | head -1
CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r1i1p1f1.Amon.tas.gr#20250125 | /data/esgf-data/CMIP6/... | 1234567890 | ...

Common Options

During your workflow, you may want to use these common options:

Performance:

# Use multiple processors for faster processing
$ esgdrs make upgrade --project cmip6 /data/incoming/ --max-processes 8

# Skip checksums for faster testing (not for production!)
$ esgmapfile make --project cmip6 --directory /data/esgf-data/CMIP6/ --no-checksum

Checksums:

# Use multihash format (recommended for new data)
$ esgmapfile make --project cmip6 --directory /data/esgf-data/CMIP6/ --checksum-type sha2-256

# Provide pre-calculated checksums
$ esgmapfile make --project cmip6 --directory /data/esgf-data/CMIP6/ --checksums-from checksums.txt

Versioning:

# Specify a custom version
$ esgdrs make upgrade --project cmip6 /data/incoming/ --version 20241201

# Process all versions (not just latest)
$ esgmapfile make --project cmip6 --directory /data/esgf-data/CMIP6/ --all-versions

Logging:

# Save output to logfile
$ esgdrs make upgrade --project cmip6 /data/incoming/ --log /var/log/esgprep/

# Enable debug mode for troubleshooting
$ esgdrs make upgrade --project cmip6 /data/incoming/ --debug

What’s Next?

Congratulations! You’ve successfully:

  • ✓ Organized your data into project DRS structure

  • ✓ Generated mapfiles for ESGF publication

  • ✓ Computed checksums for data integrity

Next Steps:

  1. Publish to ESGF:

    Use the ESGF publisher tools with your mapfiles:

    esgpublish --map /data/mapfiles/*.map --service fileservice
    
  2. Learn More:

  3. Handle Updates:

    When you need to publish a new version:

    # New files in incoming directory
    $ esgdrs make list --project cmip6 /data/incoming/
    # Will show version increment: v20250126
    
    $ esgdrs make upgrade --project cmip6 /data/incoming/ --upgrade-from-latest
    # Only processes changed files, symlinks unchanged ones
    
    $ esgmapfile make --project cmip6 --directory /data/esgf-data/CMIP6/
    # Generates new mapfiles for updated version
    

Common Issues

Problem: “Project ‘cmip6’ not found in vocabulary”

ValueError: Project 'cmip6' not found in esgvoc

Solution: Check project name spelling (case-sensitive). Update esgvoc:

pip install --upgrade esgvoc

Problem: “Invalid facet value” errors

Solution: Your NetCDF files may not be CMOR-compliant. Verify:

# Check file attributes
ncdump -h your_file.nc | grep ":"

Ensure global attributes match project requirements.

Problem: Checksumming is very slow

Solution: For large files, pre-calculate checksums:

# Generate checksums separately
sha256sum /data/incoming/*.nc > checksums.txt

# Use them in esgmapfile
esgmapfile make --project cmip6 --directory /data/esgf-data/CMIP6/ --checksums-from checksums.txt

Problem: Need to test without modifying data

Solution: Use a test directory:

# Create test directory
mkdir -p /tmp/test

# Test in /tmp
esgdrs make upgrade --project cmip6 /data/incoming/ --root /tmp/test --link

# Review output with tree or find
tree /tmp/test/CMIP6/

# If satisfied, run on production location

Getting Help

If you encounter issues:

  1. Check the logs: Use --log and --debug options

  2. Read the FAQ: See Frequently Asked Questions for common questions

  3. Consult detailed docs: Generic usage, Manage local data through the DRS, Generate mapfile for ESGF publication

  4. Report bugs: https://github.com/ESGF/esgf-prepare/issues

Useful Commands for Debugging:

# Check esgprep version
esgdrs --version
esgmapfile --version

# Check available projects (if esgvoc accessible)
python -c "import esgvoc; print(esgvoc.list_projects())"

# Verify NetCDF file structure
ncdump -h your_file.nc

# Test with single file
esgdrs make list --project cmip6 /path/to/single/file.nc

Quick Reference

Essential Commands:

# Preview what datasets will be created
esgdrs make list --project PROJECT /path/to/incoming/

# Preview DRS directory structure
esgdrs make tree --project PROJECT /path/to/incoming/ --root /output/path

# See planned operations (dry-run)
esgdrs make todo --project PROJECT /path/to/incoming/ --root /output/path --link

# Apply DRS structure
esgdrs make upgrade --project PROJECT /path/to/incoming/ --root /output/path --link

# Generate mapfiles
esgmapfile make --project PROJECT --directory /path/to/drs/ --outdir /path/to/mapfiles/

# Show mapfile paths (dry-run)
esgmapfile show --project PROJECT --directory /path/to/drs/
Common Projects:
  • cmip6 - CMIP6

  • cmip5 - CMIP5

  • cordex - CORDEX

  • input4mips - input4MIPs

  • obs4mips - obs4MIPs

For a complete list, check the esgvoc library documentation.

Ready to dive deeper? Continue to Generic usage for comprehensive command-line options, or Manage local data through the DRS for detailed DRS management features.