Prepare the configuration files
The ESGF publisher needs two sorts of configuration files:
- esg.ini: This file is the primary means of configuring the default behavior of the publisher.
- esg.<project>.ini: The project configuration files contain the project-specific sections for publication of particular projects, e.g. esg.cordex.ini for CORDEX.
An esg.ini template and several esg.<project>.ini files are available from the ESGF config repo on GitHub.
The default location of all publisher related configuration files is
/esg/config/esgcet/. The ESGF publisher will automatically read from the files in this directory.
Alternatively, you may specify the location of the ini files via
Although it is recommended to generate one configuration file per project, the publisher will also work with a single esg.ini that contains all necessary project sections.
After any modification of the configuration files or the models table you need to update the PostgreSQL database by running:
$ esginitialize -c
The default config file: esg.ini
esg.ini contains all basic configs needed for the ESGF publisher, i.e. the
sections. This file will be set up during the ESGF installation process.
Check that hessian_service_url contains the correct index node, as this value has been known to be overwritten.
Section name and default configs
[DEFAULT]
checksum = sha256sum | SHA256
log_format = %(levelname)-10s %(asctime)s %(message)s
log_level = INFO
root_id = <your_node>
Ensure that the project you wish to publish is found in the project_options. Also make sure that it has its own project-specific ini file, esg.<project>.ini.
Format of the project_options:
project_options =
    <project_name_1> | <description> | <next_integer_value_from_above_project>
    <project_name_2> | <description> | <next_integer_value_from_above_project>

project_options =
    cmip5 | CMIP5 | 1
    cordex | CORDEX | 2
    obs4MIPs | obs4MIPs | 3
    test | TEST | 4
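The entry format above can be read mechanically. The following is a minimal sketch, assuming one entry per line with three pipe-separated fields; the helper names parse_project_options and next_project_number are hypothetical and not part of the publisher.

```python
def parse_project_options(value):
    """Split a multi-line project_options value into (name, description, number) tuples."""
    entries = []
    for line in value.strip().splitlines():
        name, description, number = (field.strip() for field in line.split("|"))
        entries.append((name, description, int(number)))
    return entries


def next_project_number(entries):
    """Return the integer that a newly appended project entry would use."""
    return max((number for _, _, number in entries), default=0) + 1


# Example value copied from the documentation above.
project_options = """
    cmip5 | CMIP5 | 1
    cordex | CORDEX | 2
    obs4MIPs | obs4MIPs | 3
    test | TEST | 4
"""
entries = parse_project_options(project_options)
print(entries[1], next_project_number(entries))
```

Such a check is only a convenience for spotting a missing project or a reused integer before running the publisher.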
project_options are updated automatically if you fetch the project ini files by running
dburl = postgresql://esgcet:<esgcet_password>@localhost:5432/esgcet
Ensure that the directory or mountpoint where you store the files for publication is set under thredds_dataset_roots.
Dataset roots should never contain one another. If the data for a particular project is contained within a single directory, that directory should be a subdirectory of the root and should not itself be part of the dataset root path, even if the node publishes just that project.
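The rule that dataset roots must not contain one another can be checked before publishing. This is a minimal sketch under the assumption that the roots are available as a name-to-path mapping; nested_roots is a hypothetical helper, and identical paths under two names are not flagged.

```python
from pathlib import PurePosixPath


def nested_roots(roots):
    """Return (outer, inner) name pairs where one root path contains another."""
    paths = {name: PurePosixPath(path) for name, path in roots.items()}
    clashes = []
    for outer_name, outer in paths.items():
        for inner_name, inner in paths.items():
            # inner.parents lists every ancestor directory of inner
            if outer_name != inner_name and outer in inner.parents:
                clashes.append((outer_name, inner_name))
    return clashes


# Roots taken from the thredds_dataset_roots example: no nesting expected.
print(nested_roots({"esg_dataroot": "/esg/data", "cmip5": "/data/cmip5"}))
```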
thredds_aggregation_services =
    OpenDAP | /thredds/dodsC/ | gridded
    LAS | http://<fqdn>/las/getUI.do | LASat<your_node>
thredds_authentication_realm = THREDDS Data Server
thredds_catalog_basename = %(dataset_id)s.v%(version)s.xml
thredds_dataset_roots =
    esg_dataroot | /esg/data
    cmip5 | /data/cmip5
thredds_error_pattern = Catalog init
thredds_fatal_error_pattern = **Fatal
thredds_file_services =
    HTTPServer | /thredds/fileServer/ | HTTPServer | fileservice
    GridFTP | gsiftp://<fqdn>:2811/ | GRIDFTP | fileservice
    OpenDAP | /thredds/dodsC/ | OpenDAPServer | fileservice
    Globus | globus:#DEFAULTENDPOINTNAME#/ | Globus | fileservice
thredds_master_catalog_name = Earth System Grid catalog
thredds_use_numbered_directories = True
thredds_max_catalogs_per_directory = 500
thredds_offline_services =
    SRM | srm://<fqdn>:6288/srm/v2/server?SFN=/archive.sample.gov | HRMatPCMDI
thredds_password = <thredds_password>
thredds_reinit_error_url = https://localhost:443/thredds/admin/content/logs/catalogInit.log
thredds_reinit_success_pattern = reinit ok
thredds_reinit_url = https://localhost:443/thredds/admin/debug?Catalogs/recheck
thredds_restrict_access = esg-user
thredds_root = /esg/content/thredds/esgcet
thredds_root_catalog_name = Earth System Root catalog
thredds_url = http://<fqdn>/thredds/catalog/esgcet
thredds_username = dnode_user
It is recommended to have all thredds_file_services, including HTTPServer, GridFTP, OpenDAP and Globus, unless specific node configuration is needed.
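To see whether a node follows this recommendation, the service names can be compared against the four recommended ones. The sketch below assumes the value is passed as the multi-line text of thredds_file_services; missing_file_services is a hypothetical helper.

```python
RECOMMENDED_SERVICES = {"HTTPServer", "GridFTP", "OpenDAP", "Globus"}


def missing_file_services(thredds_file_services):
    """Return the recommended service names that are absent from the option value."""
    configured = {
        line.split("|")[0].strip()  # the first field is the service name
        for line in thredds_file_services.splitlines()
        if line.strip()
    }
    return sorted(RECOMMENDED_SERVICES - configured)


value = """
    HTTPServer | /thredds/fileServer/ | HTTPServer | fileservice
    OpenDAP | /thredds/dodsC/ | OpenDAPServer | fileservice
"""
print(missing_file_services(value))
```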
Index node configuration
hessian_service_certfile = %(home)s/.globus/certificate-file
hessian_service_keyfile = %(home)s/.globus/certificate-file
hessian_service_certs_location = %(home)s/.globus/certificates
hessian_service_debug = false
hessian_service_polling_delay = 3
hessian_service_polling_iterations = 10
hessian_service_port = 443
hessian_service_remote_metadata_url = http://host/esgcet/remote/hessian/guest/remoteMetadataService
hessian_service_url = https://<index_fqdn>/esg-search/remote/secure/client-cert/hessian/publishingService
To specify project-specific configuration in esg.ini you can add a separate config section for each project. If the project uses PIDs, the PID configs are set in that section; the same applies to the citation settings. If the section specifies a hessian_service_url, it overrides the default one.
[config:cmip6]
hessian_service_url = https://esgf-data.dkrz.de/esg-search/remote/secure/client-cert/hessian/publishingService
citation_url = http://cera-www.dkrz.de/WDCC/meta/CMIP6/%(dataset_id)s.%(version)s.json
# not mandatory for CMIP6
pid_prefix = 21.14100
# not mandatory for CMIP6
pid_exchange_name = esgffed-exchange
# not mandatory for CMIP6
pid_credentials =
    # hostname | port | virtual_host | username | password | ssl_enabled
    handle-esgf-trusted.dkrz.de | 5671 | esgf-pid | esgf-publisher | <secret> | true
    pcmdi10.llnl.gov | 5671 | esgf-pid | esgf-publisher | <secret> | true
    220.127.116.11 | 19102 | esgf-pid | esgf-publisher | <secret> | true
    18.104.22.168 | 5671 | esgf-pid | esgf-publisher | <secret> | true
pid_credentials are available on Confluence. In case you don’t have access to that page please contact your tier1 node admin.
Please change the order of the lines: put the host closest to your location first.
Please ensure that the firewall is open for all PID hosts on port 5671.
This option is optional for most projects, except CMIP6.
If your project requires ES-DOC documentation access, the ES-DOC configs are set in that section too. ES-DOC authorization is controlled through GitHub’s invitation-based organization structure. Authentication is set up by creating a personal access token.
- A verified GitHub account is required, as well as a personal access token generated through your GitHub profile settings page.
- At the bottom of the left menu, open “Developer Settings”.
- Click on “Personal Access Token”.
- Click on “Generate new token”.
- Generate your token.
- Give your newly generated token a meaningful name and description to help you manage your tokens.
- The next important step is to set the minimum required scope for your personal access token: orgs:READ. Limiting the number of scopes increases the security of the personal data associated with your GitHub account.
- Add your GitHub username and token to your esg.ini section as follows:
[config:cmip6]
CDF2CIM_CLIENT_WS_HOST = https://cdf2cim.es-doc.org
CDF2CIM_CLIENT_GITHUB_USER = <username>
CDF2CIM_CLIENT_GITHUB_ACCESS_TOKEN = <secret_token>
Please then ask your ES-DOC officer to grant you authorization. The ES-DOC officer is the only person qualified to add GitHub users to the requested teams. To be authorized, a user needs to be part of the organization team specified for the institute and project on whose behalf they wish to publish data.
- If your project needs to work with CMOR tables, the following attributes help you manage table versions:
cmor_table_path: Default is /usr/local/<project>-cmor-table. Use this attribute to change the default root path of the CMOR tables for the project.
data_specs_version: The CMOR table version to use during the publication process. If your netCDF files include a data_specs_version global attribute, you can set this option to file to switch automatically from one table version to another depending on the file being published.
cmor_table_subdirs: Default is False. Set it to True if your CMOR table versions are stored in separate subfolders of cmor_table_path. By default, the CMOR table folder is initialized as a git repository with a git checkout mechanism to switch to the appropriate branch depending on the data_specs_version.
In case of CV failure during the publication process, we recommend fetching the CMOR tables using esgfetchtables and enabling subfolders for table version management:
[config:cmip6]
cmor_table_path = /PATH/TO/CMOR/TABLES/
data_specs_version = file
cmor_table_subdirs = true
[initialize]
initial_models_table = /esg/config/esgcet/esgcet_models_table.txt
log_level = INFO
esgcet_models_table is a separate file for the configuration of all models. The default location of this file is /esg/config/esgcet/esgcet_models_table.txt.
Format of the models table:
<project> | <model_1> | <model_url_1> | <model_description>
<project> | <model_2> | <model_url_2> | <model_description>

cmip5 | MPI-ESM-P | | MPI-ESM-P, Max Planck Institute for Meteorology (MPI-M)
cmip5 | MPI-ESM-LR | | MPI-ESM-LR, Max Planck Institute for Meteorology (MPI-M)
cmip5 | MPI-ESM-MR | | MPI-ESM-MR, Max Planck Institute for Meteorology (MPI-M)
If you are defining a new project but using an existing model name, you need to add a new entry to the table file for your new pairing as well.
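Whether a project and model pairing is already present in the models table can be checked with a few lines of code. This is a minimal sketch, assuming four pipe-separated fields per line as in the format above; parse_models_table and has_pairing are hypothetical helpers, and "myproject" is an illustrative project name.

```python
def parse_models_table(text):
    """Parse models table lines into dictionaries with the four documented fields."""
    rows = []
    for line in text.strip().splitlines():
        # Split on the first three pipes only, so the description stays intact.
        project, model, url, description = (f.strip() for f in line.split("|", 3))
        rows.append({"project": project, "model": model, "url": url, "description": description})
    return rows


def has_pairing(rows, project, model):
    """Return True if the table already lists this project and model together."""
    return any(r["project"] == project and r["model"] == model for r in rows)


table = """
cmip5 | MPI-ESM-P | | MPI-ESM-P, Max Planck Institute for Meteorology (MPI-M)
cmip5 | MPI-ESM-LR | | MPI-ESM-LR, Max Planck Institute for Meteorology (MPI-M)
"""
rows = parse_models_table(table)
print(has_pairing(rows, "cmip5", "MPI-ESM-LR"), has_pairing(rows, "myproject", "MPI-ESM-LR"))
```

In this example the model MPI-ESM-LR exists for cmip5 but not for the new project, so a new line would be needed for that pairing.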
After modifying the models table please run
$ esginitialize -c
to update the postgres database.
esgunpublish will automatically generate or renew your globus certificate using the credentials specified here.
[myproxy]
hostname = <openid_server>
username = <esgf_user>
password = <password>
If this section is not specified and the globus certificate is not present or valid the user will be prompted for the credentials during
This section is not present by default.
Other sections, e.g. for scanning the files and the offline services
[extract]
log_level = INFO
validate_standard_names = True

[srmls]
offline_lister_executable = %(home)s/work/Esgcet/esgcet/scripts/srmls.py
srm_archive = /garchive.nersc.gov
srm_server = srm://somehost.llnl.gov:6288/srm/v2/server
srmls = /usr/local/esg/bin/srm-ls

[hsi]
hsi = /usr/local/bin/hsi
The project specific config files: esg.<project>.ini
Set the section name
Each project specific configuration file starts with a section name following the
Please note: The <project_name> is case sensitive and needs to match the file name and the project name you specify with
categories to be used for the project
categories define the facet fields. All facets listed as enum will be checked against the facet_options, facet_map or facet_pattern. Facets that are listed as string will not be checked unless they are part of the
Format of the categories:
name | category_type | is_mandatory | is_thredds_property | display_order
If the value for is_thredds_property is set to true, the facet will appear in the Thredds Catalog and in the Index.
categories =
    project | enum | true | true | 0
    product | enum | true | true | 1
    institute | string | true | true | 2
    model | enum | true | true | 3
    experiment | enum | true | true | 4
    time_frequency | enum | true | true | 5
    realm | enum | true | true | 6
    cmor_table | enum | true | true | 7
    ensemble | string | true | true | 8
    description | text | false | false | 99
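To see which facets will be validated, the categories value can be parsed into records and filtered by type. This is a minimal sketch, assuming five pipe-separated fields per line as in the format above; Category and parse_categories are hypothetical names.

```python
from collections import namedtuple

Category = namedtuple(
    "Category", "name category_type is_mandatory is_thredds_property display_order"
)


def parse_categories(value):
    """Parse a multi-line categories value into Category records."""
    categories = []
    for line in value.strip().splitlines():
        name, ctype, mandatory, thredds, order = (f.strip() for f in line.split("|"))
        categories.append(
            Category(name, ctype, mandatory == "true", thredds == "true", int(order))
        )
    return categories


value = """
    project | enum | true | true | 0
    institute | string | true | true | 2
    description | text | false | false | 99
"""
categories = parse_categories(value)
# Facets of type enum are the ones checked against options, maps or patterns.
enum_facets = [c.name for c in categories if c.category_type == "enum"]
print(enum_facets)
```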
You can also set a default value for particular categories, e.g.:
category_defaults = project | cmip5
Ensure that the directory_format is spelled out for the project, and check it carefully for typos. Data files must be found in the rightmost set of subdirectories specified. The non-project-specific root part in front of the project-specific DRS elements can be specified as
%(root); all project-related elements must be defined separately, following the
directory_format = %(root)/%(project)s/%(model)s/%(experiment)s/%(realm)s

or

directory_format = /some_mountpoint/data/%(project)s/%(model)s/%(experiment)s/%(realm)s

/some_mountpoint/data/cmip5/CESM/historical/atmos/blah.nc - valid
/some_mountpoint/data/cmip5/CESM/historical/atmos/1/blah.nc - not valid
/some_mountpoint/data/cmip5/CESM/historical/blah.nc - not valid
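Why the second and third paths are rejected can be illustrated by turning the directory_format into a regular expression. This is a minimal sketch, not the publisher's own matching code; it assumes every %(facet)s placeholder stands for exactly one directory level, and the helper names are hypothetical.

```python
import re


def directory_format_to_regex(directory_format):
    """Turn each %(facet)s placeholder into a named group matching one path component."""
    pattern = ""
    position = 0
    for match in re.finditer(r"%\((\w+)\)s", directory_format):
        pattern += re.escape(directory_format[position:match.start()])
        pattern += "(?P<{}>[^/]+)".format(match.group(1))
        position = match.end()
    pattern += re.escape(directory_format[position:])
    return re.compile(pattern)


def facets_from_path(directory_format, file_path):
    """Return the facet values for a file, or None if its directory does not fit."""
    directory = file_path.rsplit("/", 1)[0]
    match = directory_format_to_regex(directory_format).fullmatch(directory)
    return match.groupdict() if match else None


DIRECTORY_FORMAT = "/some_mountpoint/data/%(project)s/%(model)s/%(experiment)s/%(realm)s"
print(facets_from_path(DIRECTORY_FORMAT, "/some_mountpoint/data/cmip5/CESM/historical/atmos/blah.nc"))
```

A file one level too deep or one level too shallow leaves the directory with the wrong number of components, so the full match fails.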
In the example above, /some_mountpoint/data must be included in the thredds_dataset_roots entry in the [DEFAULT] section of esg.ini.
Ensure that you have a dataset_id and optionally a dataset_name_format.
The dataset_id is project specific and may mirror the directory structure to a point.
dataset_id = %(project)s.%(model)s.%(experiment)s.%(realm)s
The facets used for the dataset_id must be a subset of those used in the directory_format. In other words, the facet names for the dataset_id must appear as variables within the directory_format, using the same names with the %(name)s syntax, or must be derived from some other category using a <facet>_map in esg.<project>.ini. Otherwise an error or undefined behavior, such as the sudden absence of that facet value from the dataset_id, might result.
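The subset requirement can be verified by comparing the placeholder names of both options. This is a minimal sketch; missing_facets is a hypothetical helper, and the mapped argument stands for facets that are derived through a map rather than taken from the directory structure.

```python
import re

PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def facet_names(template):
    """Return the set of facet names used as %(name)s placeholders in a template."""
    return set(PLACEHOLDER.findall(template))


def missing_facets(dataset_id, directory_format, mapped=()):
    """Return dataset_id facets found neither in directory_format nor among mapped facets."""
    return sorted(facet_names(dataset_id) - facet_names(directory_format) - set(mapped))


directory_format = "/some_mountpoint/data/%(project)s/%(model)s/%(experiment)s/%(realm)s"
dataset_id = "%(project)s.%(model)s.%(experiment)s.%(realm)s"
print(missing_facets(dataset_id, directory_format))
```

An empty result means every facet of the dataset_id has a source; any name that is returned needs either a directory level or a map.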
dataset_name_format is a description of the dataset and will appear in the Thredds catalogs and in the Index.
dataset_name_format = project=%(project_description)s, model=%(model_description)s, experiment=%(experiment_description)s, time_frequency=%(time_frequency)s
<facet>_pattern for each facet
The metadata for each facet that is part of the directory_format (except for version and variable) is checked against the values in facet_options, facet_map or facet_pattern.
This is a simple list that contains all possible values for a facet, e.g.:
model_options = MPI-ESM-LR, MPI-ESM-MR, MPI-ESM-P
time_frequency_options = 3hr, 6hr, day, fx, mon, monClim, subhr, yr
The option list for the experiments does not follow the above syntax. Each experiment has the format:
<project> | <experiment> | <experiment_description>
experiment_options =
    cmip5 | 1pctCO2 | 1 percent per year CO2
    cmip5 | abrupt4xCO2 | Abrupt 4xCO2
    cmip5 | amip | AMIP
    cmip5 | amip4K | AMIP plus 4K anomaly
    cmip5 | amip4xCO2 | 4xCO2 AMIP
    cmip5 | amipFuture | AMIP plus patterned anomaly
    cmip5 | aqua4K | Aqua planet plus 4K anomaly
    cmip5 | aqua4xCO2 | 4xCO2 aqua planet
<facet>_map is recommended if the facet is not part of the directory_structure and needs to be mapped to another value, e.g. for CORDEX:
rcm_name_map = map(project, rcm_model : rcm_name)
    cordex | AWI-HIRHAM5 | HIRHAM5
    cordex | GERICS-REMO2009 | REMO2009
    cordex | KNMI-RACMO22E | RACMO22E
    cordex | MPI-CSC-REMO2009 | REMO2009
    cordex | UCLM-PROMES | PROMES
All <facet>_maps need to be listed in the project ini file, e.g.:
maps = rcm_name_map, las_time_delta_map
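The map syntax has a header naming the source and target facets, followed by one row per mapping. This is a minimal sketch of reading such a value, assuming the header form map(<source facets> : <target facets>) shown above; parse_facet_map is a hypothetical helper.

```python
import re


def parse_facet_map(value):
    """Parse a <facet>_map value into (source_names, target_names, mapping)."""
    lines = [line.strip() for line in value.strip().splitlines() if line.strip()]
    header = re.fullmatch(r"map\((.+):(.+)\)", lines[0])
    sources = [name.strip() for name in header.group(1).split(",")]
    targets = [name.strip() for name in header.group(2).split(",")]
    mapping = {}
    for line in lines[1:]:
        fields = [field.strip() for field in line.split("|")]
        # The leading fields are the source values, the remaining ones the targets.
        mapping[tuple(fields[:len(sources)])] = tuple(fields[len(sources):])
    return sources, targets, mapping


rcm_name_map = """
map(project, rcm_model : rcm_name)
    cordex | AWI-HIRHAM5 | HIRHAM5
    cordex | MPI-CSC-REMO2009 | REMO2009
"""
sources, targets, mapping = parse_facet_map(rcm_name_map)
print(mapping[("cordex", "MPI-CSC-REMO2009")])
```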
A pattern should be used for facets that follow a known syntax, e.g. the ensemble facet:
ensemble_pattern = r%(digit)si%(digit)sp%(digit)s
The <facet>_pattern currently supports %(digit)s, which matches any number, and %(string)s, which matches one or more characters.
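How such a pattern constrains a facet value can be illustrated by translating it into a regular expression. This is a minimal sketch, not the publisher's own translation; it assumes %(digit)s corresponds to one or more digits and %(string)s to one or more arbitrary characters, and facet_pattern_to_regex is a hypothetical helper.

```python
import re

# Assumed regex equivalents of the two supported tokens.
TOKENS = {"digit": r"\d+", "string": r".+"}


def facet_pattern_to_regex(facet_pattern):
    """Translate %(digit)s and %(string)s tokens into a compiled regular expression."""
    # With one capture group, re.split alternates literal text and token names.
    parts = re.split(r"%\((digit|string)\)s", facet_pattern)
    regex = ""
    for index, part in enumerate(parts):
        regex += TOKENS[part] if index % 2 else re.escape(part)
    return re.compile(regex)


ensemble = facet_pattern_to_regex("r%(digit)si%(digit)sp%(digit)s")
print(bool(ensemble.fullmatch("r1i1p1")), bool(ensemble.fullmatch("r1i1")))
```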
You can either use the publisher’s default handler, a pre-installed project handler or generate a custom handler.
The setup and configuration of a custom handler needs expert knowledge. For most projects the default handler will be sufficient. The handlers for major projects like CMIP5 are pre-installed and for some minor projects you can find customized handlers on github.
To use the default handler please add the following to your project configuration file:
project_handler_name = basic_builtin
For the pre-installed project handler for CMIP5 add the following:
handler = esgcet.config.ipcc5_handler:IPCC5Handler
For creating a new customized handler you can run the following command that will generate the basic package:
$ esgsetup --handler
Now you can customize the handler by editing the project_handler.py file and install the handler package with:
$ cd <handler_name>
$ python setup.py install
In the esg.<project>.ini file, simply add whatever you specified for the project_handler_name during the setup.
project_handler_name = <project_handler_name>
As mentioned above, there is no need to create a variable_options list. Instead, add a thredds_exclude_variables list that names all variables that might be part of the file content but are not the target variable.
thredds_exclude_variables = a, a_bnds, alev1, alevel, alevhalf, alt40, b, ...
variable_per_file should always be set to true. If it is set to false, no aggregations will be generated and all variables that are part of the dataset are wrongly assigned to every file.
variable_per_file = true
If an exclude variable is missing in the thredds_exclude_variables list and variable_per_file is set to true, this might result in publishing the same file multiple times to Thredds.
If a variable can be both a target variable and an exclude variable, it must be listed in variable_locate.
variable_locate is a list of variable and begin-of-filename pairs, following the syntax:
variable_locate = <var1>,<begin_of_filename1> | <var2>,<begin_of_filename2>
variable_locate = ps,ps_ | basin,basin_
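The pair syntax can be read as a mapping from variable name to filename prefix. This is a minimal sketch of that reading; parse_variable_locate and is_target_file are hypothetical helpers, the filenames are illustrative, and how the publisher itself applies the pairs is an assumption here.

```python
def parse_variable_locate(value):
    """Parse 'var,prefix | var,prefix' into a dict of variable -> filename prefix."""
    pairs = {}
    for item in value.split("|"):
        variable, prefix = (part.strip() for part in item.split(","))
        pairs[variable] = prefix
    return pairs


def is_target_file(variable, filename, locate):
    """Assume a listed variable is the target only in files starting with its prefix."""
    prefix = locate.get(variable)
    return prefix is not None and filename.startswith(prefix)


locate = parse_variable_locate("ps,ps_ | basin,basin_")
print(is_target_file("ps", "ps_example.nc", locate), is_target_file("ps", "ta_example.nc", locate))
```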
Enable and disable the LAS access
The Live Access Server (LAS) is part of the ESGF Installation and can be used to visualize the data.
If LAS is enabled the publisher will generate and publish a LAS-link for each dataset and aggregation.
# disable LAS
las_configure = false
# enable LAS
las_configure = true
For LAS you also need a las_time_delta_map:
las_time_delta_map = map(time_frequency : las_time_delta)
    yr | 1 year
    mon | 1 month
    day | 1 day
    6hr | 6 hours
    3hr | 3 hours
    subhr | 1 minute
    monClim | 1 month
    fx | fixed
If skip_aggregations is set to true, aggregations will not be created. By default this option is set to
Prepare user and permissions for publication
Publish to an index node at another site
Please coordinate with that site’s node administrator.
Publish to your own index node
Verify publishing permissions:
Specifications for datasets are given by regular expression. This could include a data_node or a project, institution, model, etc. If you want to publish within a specified collection, ensure that an entry exists for that with a specified ESGF group, publisher role, and Write action.
Example for publication of CMIP5 data only:
<policy resource=".*cmip5.*" attribute_type="cmip5_publisher" attribute_value="publisher" action="Write"/>
Example for publication of all projects from a particular ESGF node:
<policy resource=".*esgf-test.dkrz.de.*" attribute_type="cmip5_publisher" attribute_value="publisher" action="Write"/>
Make sure you have the correct permissions for both policy files:
-rw-r----- 1 tomcat tomcat 5840 Aug 8 10:32 /esg/config/esgf_policies_local.xml
-rw-r----- 1 tomcat tomcat 1381 Mar 21 2016 /esg/config/esgf_policies_common.xml
Group, role and permission in the Postgres database:
For publication you need to create an ESGF account and add the appropriate role and group to that user. Therefore you have to modify the postgres database:
# login to the esgcet database
$ psql -U dbsuper esgcet
# add a new group named cmip5_publisher
esgcet=# INSERT INTO esgf_security.group VALUES(3, 'cmip5_publisher', 'CMIP5 Publisher', true, true);
# update permission table
esgcet=# INSERT INTO esgf_security.permission VALUES(2, 3, 4, true);
For the example above the tables in esgcet should look like:
esgcet=# SELECT * FROM esgf_security.user;
 id | firstname | middlename | lastname | email         | username     | ...
----+-----------+------------+----------+---------------+--------------+-----
  2 | Publish   |            | User     | email@address | publish_user | ...

esgcet=# SELECT * FROM esgf_security.group;
 id | name            | description     | visible | automatic_approval
----+-----------------+-----------------+---------+--------------------
  3 | cmip5_publisher | CMIP5 Publisher | t       | t

esgcet=# SELECT * FROM esgf_security.role;
 id | name      | description
----+-----------+----------------
  4 | publisher | Data Publisher

esgcet=# SELECT * FROM esgf_security.permission;
 user_id | group_id | role_id | approved
---------+----------+---------+----------
       2 |        3 |       4 | t
Ensure that the ESGF group has an entry in the /esg/config/esgf_ats_static.xml file for the attribute service for that group, e.g.:
<attribute type="cmip5_publisher" attributeService="https://<fqdn>/esgf-idp/saml/soap/secure/attributeService.htm" description="Publisher group for CMIP5 data" registrationService="https://<fqdn>/esgf-idp/secure/registrationService.htm"/>
For publication to an index node you need to have a valid globus certificate for a user with Write permissions.
$ mkdir $HOME/.globus # if not already present
$ myproxy-logon [ -b ] -s <openid_server> -l <esgf_username> -p 7512 -t 72 -o $HOME/.globus/certificate-file
The certificate is valid for 72 hours, as specified by -t. If you are publishing for the first time, you will need to use -b to bootstrap its trustroots with the server.
Please get the esgf_username from your ESGF OpenID, e.g.
openid: https://pcmdi.llnl.gov/esgf-idp/openid/publish_user
openid_server: pcmdi.llnl.gov
esgf_username: publish_user