Manage local data through the DRS

The Data Reference Syntax (DRS) defines how your data must be organised on your filesystem so that it can be published properly on the ESGF node. esgdrs is designed to help ESGF data node managers prepare incoming data for publication by placing files in the DRS directory structure, and manage multiple versions of publication-level datasets in a way that minimises disk usage.

Warning

Only CMORized netCDF files are supported as incoming files.

Several esgdrs actions are available to manage your local archive:
  • list lists publication-level datasets,
  • tree displays the final DRS tree,
  • todo shows file operations pending for the next version,
  • upgrade makes the changes to upgrade datasets to the next version.

esgdrs deduces the expected DRS by scanning the incoming files and checking the facets against the corresponding esg.<project>.ini file. The DRS facet values are deduced from:

  1. The command line, using --set-value FACET_KEY=VALUE. This flag can be repeated to set several facet values.
  2. The filename pattern, using the filename_format from the esg.<project>.ini.
  3. The NetCDF global attributes, by picking the attribute whose name is closest to the facet key.
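
As a rough illustration of how the three sources above could combine, the sketch below picks the first non-empty candidate for a facet. It assumes the list is an order of precedence (command line first, then filename, then NetCDF attribute), which the text does not state explicitly; resolve_facet and the sample values are hypothetical and are not part of esgdrs.

```shell
# Hypothetical helper, not an esgdrs function: print the first non-empty
# candidate, in the assumed order --set-value, filename, NetCDF attribute.
resolve_facet() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1  # no source provided a value
}

# product: forced on the command line, absent from the filename,
# set to "output" in the file attributes (sample values).
product=$(resolve_facet "output1" "" "output")

# institute: only available from a NetCDF attribute (sample value).
institute=$(resolve_facet "" "" "IPSL")

echo "product=$product institute=$institute"
```

Under that assumption, a value forced on the command line always wins over what the filename or the file attributes provide.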

Warning

The incoming files are expected to be produced by CMOR (or at least to be CMOR-compliant) and to be unversioned. esgdrs applies a version regardless of the incoming file path. The applied version depends only on the --version flag and on the dataset versions that already exist under the DRS --root.

Set a facet value

In some cases, a DRS facet value cannot be properly deduced from the above sources. To solve this issue, a facet value can be set for the whole scan. By repeating the flag, several facet values can be enforced. If the same facet key is used more than once, only the last value is considered.

$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-value FACET_KEY=VALUE
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-value FACET_KEY1=VALUE1 --set-value FACET_KEY2=VALUE2

Note

For instance, the product facet in the CMIP5 project is not part of the filename and is often set to output in the CMIP5 NetCDF global attributes, whereas it should be output1 or output2. Consequently, you can use --set-value product=output1 or --set-value product=output2 depending on the dataset.

Enforce a facet mapping

Following the same scheme as the --set-value argument, the mapping between one or more facet keys and particular NetCDF attributes can be enforced for the whole scan.

$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-key FACET_KEY=ATTRIBUTE
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-key FACET_KEY1=ATTRIBUTE1 --set-key FACET_KEY2=ATTRIBUTE2

Note

For instance, the institute facet in the CORDEX project is not part of the filename and corresponds to the institute_id NetCDF global attribute. Consequently, you can use --set-key institute=institute_id.

Set up the version upgrade

The upgraded version can be set using --version YYYYMMDD instead of the current date (the default).

$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --version YYYYMMDD

Visualize the expected DRS tree

To save disk space, the scanned files are moved into files/dYYYYMMDD folders. Each vYYYYMMDD folder holds a skeleton of symbolic links, which avoids duplicating files between two versions.

$> esgdrs tree --project PROJECT_ID /PATH/TO/SCAN/
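
To make the layout described above more concrete, the following sketch builds a toy dataset directory by hand in a temporary sandbox. The dataset name, file names, dates and the use of relative link targets are assumptions chosen for illustration; this is not output produced by esgdrs.

```shell
# Throwaway sandbox; every path below is illustrative.
root=$(mktemp -d)
dataset="$root/dataset"

# First version: the file is stored once under files/dYYYYMMDD and
# exposed through a symbolic link in the vYYYYMMDD folder.
mkdir -p "$dataset/files/d20200101" "$dataset/v20200101"
echo "tas data" > "$dataset/files/d20200101/tas.nc"
ln -s ../files/d20200101/tas.nc "$dataset/v20200101/tas.nc"

# Second version: tas.nc is unchanged, pr.nc is new. Only pr.nc is stored
# again; tas.nc is a link to the copy that already exists.
mkdir -p "$dataset/files/d20210101" "$dataset/v20210101"
echo "pr data" > "$dataset/files/d20210101/pr.nc"
ln -s ../files/d20200101/tas.nc "$dataset/v20210101/tas.nc"
ln -s ../files/d20210101/pr.nc "$dataset/v20210101/pr.nc"

# Both versions are browsable, yet each file exists only once on disk.
ls -l "$dataset/v20210101"
```

In this toy layout the two version folders together expose three entries, while only two regular files are stored under files/.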

Warning

Some unexpected characters may appear in the tree because of a wrong encoding configuration. To display the tree correctly, choose another UTF-8 font in your console setup.

Set up a root directory

By default, the DRS tree is built from your current directory. This can be changed by submitting a root path.

$> esgdrs tree --project PROJECT_ID /PATH/TO/SCAN/ --root /PATH/TO/MY_ROOT

Warning

The DRS tree is automatically rebuilt from the project level. Take care not to submit a root path that includes the project.

List the Unix commands to apply

The todo action can be seen as a dry run to check which Unix commands would be applied to build the expected DRS tree. At this step, no files are moved or copied to the final DRS.

$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/

These Unix command lines can also be written to a file for further processing:

$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --commands-file /PATH/TO/COMMANDS.txt

Note

Only the command statements are written to the file. This is not a logfile.

By default, another esgdrs todo run appends new command lines to the file if it already exists. To overwrite the existing file:

$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --commands-file /PATH/TO/COMMANDS.txt --overwrite-commands-file
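
The difference between the default append behaviour and --overwrite-commands-file can be pictured with ordinary shell redirections. This is only an analogy: a plain text file stands in for the commands file, and the mv lines written to it are made-up samples rather than real esgdrs output.

```shell
# Illustration only: a plain text file stands in for the --commands-file target.
commands_file="$(mktemp -d)/commands.txt"

# Two runs in the default mode: the second run appends to the first.
echo "mv /incoming/a.nc /root/a.nc" >> "$commands_file"
echo "mv /incoming/b.nc /root/b.nc" >> "$commands_file"
lines_after_append=$(wc -l < "$commands_file" | tr -d ' ')

# A run in overwrite mode: the earlier content is discarded.
echo "mv /incoming/c.nc /root/c.nc" > "$commands_file"
lines_after_overwrite=$(wc -l < "$commands_file" | tr -d ' ')

echo "after append: $lines_after_append lines, after overwrite: $lines_after_overwrite line"
```

If the same scan is repeated without overwriting, the file may therefore contain the same commands several times, which is worth checking before processing it further.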

Change the migration mode

esgdrs supports different file migration modes. The default is to move the files from the incoming path to the root directory. Use --copy to make hard copies, --link to make hard links or --symlink to make symbolic links from the incoming path. We recommend using --link and removing the incoming directory after checking the DRS. This does not affect the symbolic link skeleton used for dataset versioning.

$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --copy
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --link
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --symlink
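
For readers less familiar with the four modes, the sketch below reproduces their effect on the filesystem with the standard mv, cp, ln and ln -s commands in a temporary sandbox. It assumes these are the Unix operations that correspond to each mode; the directory and file names are illustrative.

```shell
# Sandbox with an "incoming" and a "root" directory (illustrative names).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/incoming" "$sandbox/root"
for mode in move copy link symlink; do
    echo "data" > "$sandbox/incoming/$mode.nc"
done

# Default mode: the file leaves incoming.
mv "$sandbox/incoming/move.nc" "$sandbox/root/move.nc"

# --copy: two independent files, disk usage doubles.
cp "$sandbox/incoming/copy.nc" "$sandbox/root/copy.nc"

# --link: two names for the same data; removing incoming later is safe.
ln "$sandbox/incoming/link.nc" "$sandbox/root/link.nc"

# --symlink: root points at incoming; removing incoming breaks the link.
ln -s "$sandbox/incoming/symlink.nc" "$sandbox/root/symlink.nc"

# Inode numbers show which entries share their data.
ls -li "$sandbox/incoming" "$sandbox/root"
```

Hard links need both paths to be on the same filesystem, which is worth checking before choosing --link.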

Warning

esgdrs temporarily stores the result of the list action to quickly generate the DRS tree afterwards. This requires submitting exactly the same arguments to the following actions as to the list action. If not, the incoming files are automatically scanned again.

Run the DRS upgrade

This applies all the Unix commands that the todo action prints.

$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/

Run the DRS upgrade from the latest version

esgdrs supports two upgrade methods:

(a) (the default) The incoming directory must contain the complete contents of the new version of the dataset. If a file is unchanged from the previous version, it must still be supplied in incoming, although esgprep will detect that it is unmodified, and will optimise disk space by removing duplicates and symlinking to the old version instead. Any files that are not supplied are treated as removed in the new version.

(b) The new version of the dataset is based primarily on the previous published version. The user supplies in the incoming directory (or directories) only the files which are modified in the new version. Any file not supplied in incoming is considered to be the same as in the previous version, and a symlink is created accordingly.

The option --upgrade-from-latest allows you to toggle to method (b):

$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --upgrade-from-latest

By construction, method (b) does not by itself allow a file to be simply deleted between versions rather than modified. The associated flag --ignore-from-latest allows you to submit a list of filenames to ignore during the version upgrade (i.e., files to be deleted between versions).

$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --ignore-from-latest /PATH/TO/FILENAMES.TXT
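
The two methods and the ignore list can be compared on a small example by computing which filenames end up in the new version. The sketch below only manipulates lists of names in text files. The file names are invented, and the one-filename-per-line format of the ignore list is an assumption.

```shell
work=$(mktemp -d)
printf '%s\n' a.nc b.nc c.nc > "$work/previous.txt"  # latest published version
printf '%s\n' b.nc d.nc      > "$work/incoming.txt"  # files supplied in incoming
printf '%s\n' c.nc           > "$work/ignore.txt"    # names to drop (assumed format)

# Method (a): the new version is exactly what incoming supplies,
# so a.nc and c.nc are treated as removed.
sort "$work/incoming.txt" > "$work/method_a.txt"

# Method (b): start from the previous version, overlay incoming,
# then drop the names listed in the ignore file.
sort -u "$work/previous.txt" "$work/incoming.txt" \
    | grep -v -x -F -f "$work/ignore.txt" > "$work/method_b.txt"

echo "method (a):"; cat "$work/method_a.txt"
echo "method (b):"; cat "$work/method_b.txt"
```

With these sample lists, method (a) keeps b.nc and d.nc only, while method (b) keeps a.nc, b.nc and d.nc, with c.nc removed through the ignore list.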

Warning

If --ignore-from-latest is submitted, --upgrade-from-latest is set to True by default.

Note

We highly recommend using the tree action to see what the upgraded tree looks like before applying the upgrade.

Rescanning data

By default, the list action scans the data and records the rebuilt DRS tree in a temporary Pickle file. This file is then read to skip the data scan when other actions (i.e., tree, todo or upgrade) are invoked, unless key options have changed since the previous list call. In that case the scan is redone automatically. To force a rescan in any case:

$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --rescan

Exit status

  • Status = 0
    All the files have been successfully scanned and the DRS tree properly generated.
  • Status > 0
    Some scan errors occurred. Some files have been skipped or have failed during the scan, potentially leading to an incomplete DRS tree.
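
In a script, the exit status can be used to decide whether to go on with the next action. The wrapper below is a minimal sketch. check_scan is a hypothetical helper, and the true and false commands stand in for a successful and a failing esgdrs call so that the example runs without esgdrs installed.

```shell
# Hypothetical wrapper: run a command and report on its exit status.
check_scan() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "scan complete"
    else
        echo "scan reported errors (status $status), the DRS tree may be incomplete"
    fi
    return "$status"
}

# Stand-ins for real calls; in practice one would write for instance:
#   check_scan esgdrs list --project PROJECT_ID /PATH/TO/SCAN/
ok_msg=$(check_scan true)
fail_status=0
fail_msg=$(check_scan false) || fail_status=$?

echo "$ok_msg"
echo "$fail_msg"
```

A non-zero status is a signal to inspect the scan output before running the upgrade action.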