Manage local data through the DRS
The Data Reference Syntax (DRS) defines the way your data must be organised on your filesystem. This allows a proper
publication on the ESGF node.
esgdrs is designed to help ESGF data node managers to prepare incoming data for
publication, placing files in the DRS directory structure, and to manage multiple versions of publication-level datasets
in a way that minimises disk usage.
Only CMORized netCDF files are supported as incoming files.
The following esgdrs actions are available to manage your local archive:
- list lists publication-level datasets,
- tree displays the final DRS tree,
- todo shows file operations pending for the next version,
- upgrade makes the changes to upgrade datasets to the next version.
esgdrs deduces the expected DRS by scanning the incoming files and checking the facets against the
esg.<project>.ini file. The DRS facet values are deduced from:
- The command-line using --set-value facet=value. This flag can be used several times to set several facet values.
- The filename pattern using the
- The NetCDF global attributes, by picking the attribute whose name is closest to the facet key.
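The lookup described above can be sketched in Python. This is only an illustration, not the esgdrs implementation: the function name resolve_facet is hypothetical, the sources are assumed to be tried in the order they are listed, and difflib is used as a stand-in for whatever "nearest name" matching esgdrs really applies.

```python
import difflib


def resolve_facet(key, cli_values, filename_facets, nc_attributes):
    """Return (value, source) for one facet key, or (None, None)."""
    # 1. A value enforced on the command line wins (assumed precedence).
    if key in cli_values:
        return cli_values[key], "command-line"
    # 2. Otherwise use what was parsed from the filename, if anything.
    if key in filename_facets:
        return filename_facets[key], "filename"
    # 3. Fall back to the global attribute with the closest name.
    matches = difflib.get_close_matches(key, list(nc_attributes), n=1, cutoff=0.6)
    if matches:
        return nc_attributes[matches[0]], "attribute:" + matches[0]
    return None, None


# Hypothetical global attributes read from one incoming file.
attributes = {"institute_id": "IPSL", "experiment_id": "historical"}
print(resolve_facet("institute", {}, {}, attributes))
print(resolve_facet("product", {"product": "output1"}, {}, {"product": "output"}))
```

In this sketch the institute facet falls through to the institute_id attribute, while the product facet is taken from the command line even though an attribute of the same name exists.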
The incoming files are expected to be produced by CMOR (or at least to be CMOR-compliant) and to be unversioned.
esgdrs applies a version regardless of the incoming file path. The applied version depends only on the --version flag and on the dataset versions that already exist in the DRS tree.
Set a facet value
In some cases, a DRS facet value cannot be properly deduced from the above sources. To solve this issue, a facet value can be set for the whole scan. By repeating the flag, several facet values can be enforced. If the same facet key is used more than once, only the last value is considered.
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-value FACET_KEY=VALUE
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-value FACET_KEY1=VALUE1 --set-value FACET_KEY2=VALUE2
For instance, the product facet in the CMIP5 project is not part of the filename and is often set to output in CMIP5 NetCDF global attributes, whereas it should be output1 or output2. Consequently, you can use --set-value product=output1 or --set-value product=output2 depending on the dataset.
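To illustrate the rule that only the last value counts when the same facet key is given several times, here is a minimal Python sketch. The helper parse_pairs is hypothetical; it only mimics how a list of FACET_KEY=VALUE strings could be collapsed into a mapping and is not taken from esgdrs.

```python
def parse_pairs(pairs):
    """Collapse FACET_KEY=VALUE strings into a dict; later pairs override."""
    values = {}
    for pair in pairs:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError("expected FACET_KEY=VALUE, got: " + pair)
        values[key] = value  # a repeated key silently replaces the old value
    return values


print(parse_pairs(["product=output1", "institute=IPSL", "product=output2"]))
```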
Enforce a facet mapping
Following the same scheme as the --set-value argument, the mapping between one or more facet keys and particular NetCDF attributes can be enforced for the whole scan.
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-key FACET_KEY=ATTRIBUTE
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --set-key FACET_KEY1=ATTRIBUTE1 --set-key FACET_KEY2=ATTRIBUTE2
For instance, the institute facet in the CORDEX project is not part of the filename and corresponds to the institute_id NetCDF global attribute. Consequently, you can use --set-key institute=institute_id.
Set up the version upgrade
The upgraded version can be set using
--version YYYYMMDD instead of the current date (the default).
$> esgdrs list --project PROJECT_ID /PATH/TO/SCAN/ --version YYYYMMDD
Visualize the expected DRS tree
To save disk space, the scanned files are moved into files/dYYYYMMDD folders. Each vYYYYMMDD folder holds a skeleton of symbolic links, which avoids duplicating files between two versions.
$> esgdrs tree --project PROJECT_ID /PATH/TO/SCAN/
Unexpected characters may appear because of a wrong encoding configuration. To display the tree correctly, select a UTF-8 font in your console setup.
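The layout described in this section can be reproduced on a small scale with the following Python sketch. It is an assumption-laden illustration, not esgdrs code: the helpers store_file and build_version, the file names and the use of relative link targets are all hypothetical. It runs in a temporary directory on a filesystem that supports symbolic links.

```python
import os
import tempfile


def store_file(dataset_dir, data_version, name, content):
    """Write one data file into files/dYYYYMMDD (done once per file)."""
    data_dir = os.path.join(dataset_dir, "files", "d" + data_version)
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, name), "w") as handle:
        handle.write(content)


def build_version(dataset_dir, version, file_versions):
    """Create vYYYYMMDD as symlinks; file_versions maps name -> data version."""
    version_dir = os.path.join(dataset_dir, "v" + version)
    os.makedirs(version_dir)
    for name, data_version in file_versions.items():
        # Relative target (an assumption), pointing into files/dYYYYMMDD.
        target = os.path.join("..", "files", "d" + data_version, name)
        os.symlink(target, os.path.join(version_dir, name))


dataset = tempfile.mkdtemp()
store_file(dataset, "20200101", "tas.nc", "tas, first version")
store_file(dataset, "20200101", "pr.nc", "pr, first version")
build_version(dataset, "20200101", {"tas.nc": "20200101", "pr.nc": "20200101"})

# Second version: only tas.nc changed, pr.nc is shared with the first one.
store_file(dataset, "20210101", "tas.nc", "tas, second version")
build_version(dataset, "20210101", {"tas.nc": "20210101", "pr.nc": "20200101"})

for name in sorted(os.listdir(os.path.join(dataset, "v20210101"))):
    link = os.path.join(dataset, "v20210101", name)
    print(name, "->", os.readlink(link))
```

The second version folder lists both files, but pr.nc exists only once on disk, under files/d20200101.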
Set up a root directory
By default, the DRS tree is built from your current directory. This can be changed by submitting a root path.
$> esgdrs tree --project PROJECT_ID /PATH/TO/SCAN/ --root /PATH/TO/MY_ROOT
The DRS tree is automatically rebuilt from the project level. Be careful not to submit a root path that includes the project.
List the Unix commands to apply
The todo action can be seen as a dry run that shows which Unix commands will be applied to build the expected DRS tree. At this step, no files are moved or copied to the final DRS.
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/
These Unix command-lines can also be written to a file for further processing:
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --commands-file /PATH/TO/COMMANDS.txt
Only the command statements are written to the file. This is not a logfile.
By default, another esgdrs todo run appends the new command-lines to the file if it already exists.
To overwrite an existing file:
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --commands-file /PATH/TO/COMMANDS.txt --overwrite-commands-file
Change the migration mode
esgdrs supports different file migration modes.
The default is to move the files from the incoming path to the root directory. Use --copy to make hard copies, --link to make hard links or --symlink to make symbolic links from the incoming path. We recommend using --link and removing the incoming directory after checking the DRS. This doesn't affect the symbolic link skeleton used for the dataset versioning.
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --copy
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --link
$> esgdrs todo --project PROJECT_ID /PATH/TO/SCAN/ --symlink
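The difference between the four migration modes can be sketched with standard Python calls. The function migrate and the paths below are hypothetical and only illustrate the filesystem operations that each mode corresponds to; they say nothing about how esgdrs performs them internally.

```python
import os
import shutil
import tempfile


def migrate(src, dst, mode="move"):
    """Place src at dst using one of the four migration modes."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if mode == "move":
        shutil.move(src, dst)      # default: the incoming file disappears
    elif mode == "copy":
        shutil.copy2(src, dst)     # independent copy, uses extra disk space
    elif mode == "link":
        os.link(src, dst)          # hard link: same inode, same filesystem
    elif mode == "symlink":
        # Absolute target: the link breaks if the incoming file is removed.
        os.symlink(os.path.abspath(src), dst)
    else:
        raise ValueError("unknown mode: " + mode)


work = tempfile.mkdtemp()
incoming = os.path.join(work, "incoming", "tas.nc")
os.makedirs(os.path.dirname(incoming))
with open(incoming, "w") as handle:
    handle.write("data")

final = os.path.join(work, "root", "files", "d20200101", "tas.nc")
migrate(incoming, final, mode="link")
os.remove(incoming)  # the hard-linked data stays reachable from the DRS side
with open(final) as handle:
    print(handle.read())
```

The example shows why a hard link fits the recommendation above: the incoming copy can be deleted afterwards without losing the data and without ever storing it twice.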
esgdrs temporarily stores the result of the list action to quickly generate the DRS tree afterwards. This requires submitting exactly the same arguments to the following actions as to the list action. If not, the incoming files are automatically scanned again.
Run the DRS upgrade
This will apply all the Unix commands you can print with the todo action.
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/
Run the DRS upgrade from the latest version
esgdrs supports two upgrade methods:
(a) (the default) The incoming directory must contain the complete contents of the new version of the dataset. If a file is unchanged from the previous version, it must still be supplied in incoming, although esgprep will detect that it is unmodified, and will optimise disk space by removing duplicates and symlinking to the old version instead. Any files that are not supplied are treated as removed in the new version.
(b) The new version of the dataset is based primarily on the previous published version. The user supplies in the incoming directory (or directories) only the files which are modified in the new version. Any file not supplied in incoming is considered to be the same as in the previous version, and a symlink is created accordingly.
--upgrade-from-latest allows you to toggle to method (b):
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --upgrade-from-latest
By construction, method (b) does not let you simply delete a file between versions, as opposed to modifying it.
The associated flag
--ignore-from-latest allows you to submit a list of filenames to ignore during the version
upgrade (i.e., files to be deleted between versions).
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --ignore-from-latest /PATH/TO/FILENAMES.TXT
If --ignore-from-latest is submitted, --upgrade-from-latest is set to True by default.
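The two upgrade methods and the ignore list can be compared with a short Python sketch. Everything here is hypothetical: plan_upgrade is not an esgdrs function, the inputs are plain mappings from filenames to checksums, and comparing checksums is only an assumed way of detecting that a file is unmodified.

```python
def plan_upgrade(previous, incoming, from_latest=False, ignore=()):
    """Map each filename of the new version to 'store' or 'link-previous'.

    previous and incoming map filenames to checksums (hypothetical inputs).
    """
    plan = {}
    for name, checksum in incoming.items():
        # An unchanged file is not stored twice: link to the old copy.
        unchanged = previous.get(name) == checksum
        plan[name] = "link-previous" if unchanged else "store"
    if from_latest:
        # Method (b): files not supplied are carried over, unless ignored.
        for name in previous:
            if name not in incoming and name not in ignore:
                plan[name] = "link-previous"
    return plan


previous = {"tas.nc": "aaa", "pr.nc": "bbb", "old.nc": "ccc"}

# Method (a): the full new version is supplied; old.nc is absent, so removed.
print(plan_upgrade(previous, {"tas.nc": "aaa", "pr.nc": "zzz"}))

# Method (b): only the modified file is supplied; old.nc is explicitly ignored.
print(plan_upgrade(previous, {"pr.nc": "zzz"}, from_latest=True, ignore={"old.nc"}))
```

Both calls lead to the same new version (tas.nc linked to the previous copy, pr.nc stored anew, old.nc gone), which shows why method (b) needs the ignore list to express a deletion.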
We highly recommend using the tree action to check what the upgraded tree will look like before applying the upgrade.
By default, the list action scans the data and records the rebuilt DRS tree in a temporary Pickle file. This file is then read to skip the data scan when other actions (i.e., tree, todo or upgrade) are invoked, unless key options have changed since the previous list call. In that case, the scan is redone automatically.
To force the rescan in any case:
$> esgdrs upgrade --project PROJECT_ID /PATH/TO/SCAN/ --rescan
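The caching behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: load_or_scan, the cache file name and the shape of the options and tree objects are hypothetical, and the real set of "key options" that invalidate the cache is not specified here.

```python
import os
import pickle
import tempfile


def load_or_scan(cache_path, options, scan, rescan=False):
    """Return (tree, scanned); reuse the pickled tree when options match."""
    if not rescan and os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            cached_options, tree = pickle.load(handle)
        if cached_options == options:
            return tree, False  # cache hit: the scan is skipped
    tree = scan(options)        # first call, changed options or forced rescan
    with open(cache_path, "wb") as handle:
        pickle.dump((options, tree), handle)
    return tree, True


def fake_scan(options):
    """Stand-in for the real scan of the incoming directory."""
    return {"project": options["project"], "files": ["tas.nc"]}


cache = os.path.join(tempfile.mkdtemp(), "drs_tree.pkl")
options = {"project": "PROJECT_ID", "directory": "/PATH/TO/SCAN/"}

print(load_or_scan(cache, options, fake_scan)[1])                # scan needed
print(load_or_scan(cache, options, fake_scan)[1])                # cache reused
print(load_or_scan(cache, options, fake_scan, rescan=True)[1])   # forced rescan
```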
- Status = 0: All the files have been successfully scanned and the DRS tree properly generated.
- Status > 0: Some scan errors occurred. Some files have been skipped or failed during the scan, potentially leading to an incomplete DRS tree.
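A calling script can act on these status values. The Python sketch below is an assumption-based example: run_and_report is a hypothetical helper, and the Python interpreter is used as a stand-in command so that the example runs without esgdrs installed.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command and translate its exit status as described above."""
    status = subprocess.run(command).returncode
    if status == 0:
        return "complete"    # every file scanned, DRS tree fully generated
    return "incomplete"      # some files skipped or failed during the scan


# Stand-in commands so the sketch runs without esgdrs installed.
succeeded = [sys.executable, "-c", "raise SystemExit(0)"]
failed = [sys.executable, "-c", "raise SystemExit(3)"]
print(run_and_report(succeeded), run_and_report(failed))
```

With esgdrs available on the PATH, the command list would instead hold an invocation such as ["esgdrs", "list", "--project", "PROJECT_ID", "/PATH/TO/SCAN/"].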