Manage local data through the DRS¶
The Data Reference Syntax (DRS) defines the way your data must be organised on your filesystem. This allows a proper
publication on the ESGF node. esgdrs is designed to help ESGF data node managers to prepare incoming data for
publication, placing files in the DRS directory structure, and to manage multiple versions of publication-level datasets
in a way that minimises disk usage.
Warning
Only CMORized netCDF files are supported as incoming files.
Several esgdrs actions are available to manage your local archive:
- list lists publication-level datasets,
- tree displays the final DRS tree,
- todo shows file operations pending for the next version,
- upgrade makes the changes to upgrade datasets to the next version.
esgdrs deduces the expected DRS by scanning the incoming files and checking the facets against the
corresponding esg.<project>.ini file. The DRS facet values are deduced from:
- The command line, using --set-value facet=value. This flag can be used several times to set several facet values.
- The filename pattern, using the filename_format from the esg.<project>.ini.
- The NetCDF global attributes, by picking the attribute whose name is nearest to the facet key.
 
Warning
The incoming files are assumed to be produced by CMOR (or at least to be
CMOR-compliant) and unversioned. esgdrs will apply a version regardless of the incoming file path. The
applied version depends only on the --version flag and the existing dataset versions in the DRS --root.
Set a facet value¶
In some cases, a DRS facet value cannot be properly deduced from the above sources. To solve this issue, a facet value can be set for the whole scan. By repeating the flag, several facet values can be enforced. If the same facet key is used more than once, only the last value is considered.
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-value FACET_KEY=VALUE
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-value FACET_KEY1=VALUE1 --set-value FACET_KEY2=VALUE2
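The "last value wins" rule can be pictured with a small shell sketch. It only mimics the behaviour described above and is not how esgdrs parses its arguments; the resolve_facets helper and the sample values are hypothetical, and values containing an "=" sign are not handled.

```shell
# Hypothetical helper: keep only the last value given for each facet key.
resolve_facets() {
    printf '%s\n' "$@" \
        | awk -F= '{ value[$1] = $2 } END { for (key in value) print key "=" value[key] }' \
        | sort
}

# "product" is given twice, so only its last value is kept.
resolve_facets product=output institute=IPSL product=output1
```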
Note
For instance, the product facet in the CMIP5 project is not part of the filename and is often set to
output in CMIP5 NetCDF global attributes, whereas it should be output1 or output2. Consequently, you can
use --set-value product=output1 or --set-value product=output2 depending on the dataset.
Enforce a facet mapping¶
Following the same scheme as the --set-value argument, the mapping between a facet key (or a list of keys) and a
particular NetCDF attribute (or a list of attributes) can be enforced for the whole scan.
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-key FACET_KEY=ATTRIBUTE
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-key FACET_KEY1=ATTRIBUTE1 --set-key FACET_KEY2=ATTRIBUTE2
Note
For instance, the institute facet in the CORDEX project is not part of the filename and corresponds to the
institute_id NetCDF global attribute. Consequently, you can use --set-key institute=institute_id.
Set up the version upgrade¶
The upgraded version can be set using --version YYYYMMDD instead of the current date (the default).
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --version YYYYMMDD
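A wrapper script may want to build or sanity-check the YYYYMMDD string before handing it to --version. The following is a minimal sketch using the standard date utility; the is_version helper is hypothetical and not part of esgdrs, and it only checks the shape of the string (eight digits), not that it is a valid calendar date.

```shell
# Hypothetical helper: succeed only if the argument is exactly eight digits.
is_version() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

version=$(date +%Y%m%d)   # current date in YYYYMMDD form
if is_version "$version"; then
    echo "would pass: --version $version"
fi
```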
Visualize the expected DRS tree¶
In order to save disk space, the scanned files are moved into files/dYYYYMMDD folders. The vYYYYMMDD folder holds a
skeleton of symbolic links that avoids duplicating files between two versions.
$> esgdrs tree --project PROJECT_ID /PATH/TO/SCAN/
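To make the symbolic link skeleton more concrete, the sketch below rebuilds a comparable layout by hand with standard Unix tools. It assumes a simplified structure in which files/ and the vYYYYMMDD folders sit side by side in one dataset directory; the dataset path, filenames and dates are made up, and the exact layout esgdrs produces depends on the project.

```shell
# Hypothetical dataset directory created in a temporary location.
dataset=$(mktemp -d)/dataset

# First version: the data file lives under files/, the version folder only holds a symlink.
mkdir -p "$dataset/files/d20200101" "$dataset/v20200101"
echo "tas data" > "$dataset/files/d20200101/tas.nc"
ln -s ../files/d20200101/tas.nc "$dataset/v20200101/tas.nc"

# Second version: tas.nc is unchanged and pr.nc is new.
# Only pr.nc is stored again; tas.nc is a symlink to the file of the first version.
mkdir -p "$dataset/files/d20210101" "$dataset/v20210101"
echo "pr data" > "$dataset/files/d20210101/pr.nc"
ln -s ../files/d20200101/tas.nc "$dataset/v20210101/tas.nc"
ln -s ../files/d20210101/pr.nc "$dataset/v20210101/pr.nc"

ls -l "$dataset/v20210101"
```

Both version folders expose tas.nc, but its content is stored only once under files/d20200101.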
Warning
Some unexpected characters could appear due to a wrong encoding configuration. To display the tree characters properly, choose another UTF-8 font in your console setup.
Set up a root directory¶
By default, the DRS tree is built from your current directory. This can be changed by submitting a root path.
$> esgdrs tree --project PROJECT_ID /PATH/TO/SCAN/ --root /PATH/TO/MY_ROOT
Warning
The DRS tree is automatically rebuilt from the project level. Be careful not to submit a root path that includes the project.
List the Unix commands to apply¶
The todo action can be seen as a dry run to check which Unix commands should be applied to build the expected DRS
tree. At this step, no files are moved or copied to the final DRS.
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/
These Unix command lines can also be written to a file for further processing:
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --commands-file /PATH/TO/COMMANDS.txt
Note
Only the command statements are written to the file. This is not a logfile.
By default, another esgdrs todo run appends new command lines to the file (if it exists).
To overwrite an existing file:
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --commands-file /PATH/TO/COMMANDS.txt --overwrite-commands-file
Change the migration mode¶
esgdrs allows different file migration modes.
The default is to move the files from the incoming path to the root directory. Use --copy to make hard copies,
--link to make hard links or --symlink to make symbolic links from the incoming path. We recommend using
--link and removing the incoming directory after DRS checking. This doesn’t affect the symbolic link skeleton used
for the dataset versioning.
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --copy
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --link
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --symlink
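The difference between these modes can be reproduced with plain cp and ln, which helps explain the --link recommendation. This is a sketch with made-up paths in a temporary directory, not the commands esgdrs itself issues; note that hard links can only be created within a single filesystem.

```shell
work=$(mktemp -d)
mkdir -p "$work/incoming" "$work/root"
echo "data" > "$work/incoming/tas.nc"

# Print the inode number of a regular file.
inode() { ls -i "$1" | awk '{ print $1 }'; }

cp "$work/incoming/tas.nc" "$work/root/copied.nc"         # like --copy: independent duplicate
ln "$work/incoming/tas.nc" "$work/root/linked.nc"         # like --link: second name for the same inode
ln -s "$work/incoming/tas.nc" "$work/root/symlinked.nc"   # like --symlink: pointer to the incoming path

incoming_inode=$(inode "$work/incoming/tas.nc")
copied_inode=$(inode "$work/root/copied.nc")
linked_inode=$(inode "$work/root/linked.nc")
echo "incoming=$incoming_inode copied=$copied_inode linked=$linked_inode"

# Removing the incoming file leaves the hard link usable but breaks the symlink.
rm "$work/incoming/tas.nc"
cat "$work/root/linked.nc"
[ -e "$work/root/symlinked.nc" ] || echo "symlinked.nc is now dangling"
```

The hard link shares its inode with the incoming file, so it takes no extra disk space and survives the removal of the incoming directory.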
Warning
esgdrs temporarily stores the result of the list action to quickly generate the DRS tree
afterwards. This requires submitting strictly the same arguments to the list action and to the following ones.
If not, the incoming files are automatically scanned again.
Run the DRS upgrade¶
This will apply all the Unix commands that the todo action prints.
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/
Run the DRS upgrade from the latest version¶
esgdrs supports two upgrade methods:
(a) (the default) The incoming directory must contain the complete contents of the new version of the dataset. If a file is unchanged from the previous version, it must still be supplied in incoming, although esgprep will detect that it is unmodified, and will optimise disk space by removing duplicates and symlinking to the old version instead. Any files that are not supplied are treated as removed in the new version.
(b) The new version of the dataset is based primarily on the previous published version. The user supplies in the incoming directory (or directories) only the files which are modified in the new version. Any file not supplied in incoming is considered to be the same as in the previous version, and a symlink is created accordingly.
The option --upgrade-from-latest allows you to toggle to method (b):
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --upgrade-from-latest
By construction, method (b) does not by itself support simply deleting a file between versions, as opposed to modifying it.
The associated flag --ignore-from-latest allows you to submit a list of filenames to ignore during the version
upgrade (i.e., files to be deleted between versions).
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --ignore-from-latest /PATH/TO/FILENAMES.TXT
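The two methods differ in how the content of the new version is derived. The sketch below works on plain lists of filenames to show the resulting file sets; it illustrates the rules described above and is not esgdrs code. The filenames are made up, and the ignore list is assumed to hold one filename per line.

```shell
work=$(mktemp -d)
printf '%s\n' a.nc b.nc c.nc > "$work/previous.txt"   # files in the latest published version
printf '%s\n' b.nc d.nc      > "$work/incoming.txt"   # b.nc modified, d.nc new
printf '%s\n' c.nc           > "$work/ignore.txt"     # file to drop when upgrading from latest

# Method (a): the new version is exactly what is supplied in incoming.
method_a=$(sort -u "$work/incoming.txt")

# Method (b): start from the previous version, add or replace with incoming,
# then drop the filenames listed in the ignore file.
method_b=$(sort -u "$work/previous.txt" "$work/incoming.txt" \
    | grep -vxF -f "$work/ignore.txt" || true)

echo "method (a):" $method_a
echo "method (b):" $method_b
```

With these lists, method (a) yields b.nc and d.nc only, while method (b) also keeps a.nc from the previous version and drops c.nc because it is listed in the ignore file.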
Warning
If --ignore-from-latest is submitted, --upgrade-from-latest is set to True by default.
Note
We highly recommend using the tree action to see what the upgraded tree looks like before applying
the upgrade.
Rescanning data¶
By default, the list action scans data and records the rebuilt DRS tree into a temporary Pickle file. This file is
then read to skip the data scan when other actions (i.e., tree, todo or upgrade) are invoked, unless key
options have been changed since the previous list call. In such a case, the scan is redone automatically.
To force the rescan in any case:
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --rescan
Exit status¶
- Status = 0
 - All the files have been successfully scanned and the DRS tree properly generated.
 
- Status > 0
 - Some scan errors occurred. Some files have been skipped or have failed during the scan, potentially leading to an incomplete DRS tree.
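In a wrapper script, the exit status can be used to stop before publishing an incomplete tree. The sketch below shows one way to do this; the run_step helper is hypothetical, and true and false stand in for an actual esgdrs command line so that the example runs anywhere.

```shell
# Hypothetical helper: run a command and report on its exit status.
run_step() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "scan complete"
    else
        echo "scan errors (status $status), the DRS tree may be incomplete"
    fi
    return "$status"
}

# Stand-ins for a successful and a failing run; replace them with the real
# command, e.g. run_step esgdrs list --project PROJECT_ID /PATH/TO/SCAN/
run_step true
run_step false || echo "stopping before the upgrade"
```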