Configuration

Prepare the configuration files

The ESGF publisher needs two sort of configuration files:

esg.ini

This file is the primary means of configuring the default behavior of the publisher.

esg.<project>.ini

The project configuration files contain the project specific sections for publication of particular projects, e.g. esg.cmip5.ini for CMIP5, esg.cordex.ini for CORDEX, etc.

A esg.ini template and several esg.<project>.ini files are available from the ESGF config repo on GitHub.

The default location of all publisher related configuration files is /esg/config/esgcet/. The ESGF publisher will automatically read from the files in this directory. Alternatively, you may specify the location of the ini files via -i option.

Note

Although it is recommended to generate one configuration file per project the publisher will also work for a single esg.ini that contains all necessary project sections.

Warning

After any modification of the configuration files or the models table you need to update the postgreSQL database by running: $ esginitialize -c

The default config file: esg.ini

esg.ini contains all basic configs needed for the ESGF publisher, i.e. the [DEFAULT], [initialize], [extract], [srmls] and [hsi] sections. This file will be set up during the ESGF installation process.

Warning

Ensure the hessian_service_url contains the correct index node as this has been known to be overwritten.

  1. The [DEFAULT] section

    • Section name and default configs

      [DEFAULT]
      checksum = sha256sum | SHA256
      log_format = %(levelname)-10s %(asctime)s %(message)s
      log_level = INFO
      root_id = <your_node>
      
    • Project options

      Ensure that the project you wish to publish is found in the project_options. Also make sure that it has it’s own project specific ini file, esg.<project>.ini, e.g. esg.cmip5.ini.

      Format of the project_options:

      project_options =
          <project_name_1> | <description> | <next_integer_value_from_above_project>
          <project_name_2> | <description> | <next_integer_value_from_above_project>
      

      Example:

      project_options =
          cmip5    | CMIP5    | 1
          cordex   | CORDEX   | 2
          obs4MIPs | obs4MIPs | 3
          test     | TEST     | 4
      

      Note

      The project_options are updated automatically if you fetch the project ini files by running $ esgfetchini.

    • Postgres configuration

      dburl = postgresql://esgcet:<esgcet_password>@localhost:5432/esgcet
      
    • Thredds configuration

      Ensure that the directory/mountpoint where you store the files for publication is set under thredds_dataset_roots.

      Warning

      Dataset roots should never contain one another. If the data for a particular project is contained within a single directory even if the node publishes just that project, it should be a subdirectory of the root, not included in the dataset_root directory.

      thredds_aggregation_services =
              OpenDAP | /thredds/dodsC/            | gridded
              LAS     | http://<fqdn>/las/getUI.do | LASat<your_node>
      thredds_authentication_realm = THREDDS Data Server
      thredds_catalog_basename = %(dataset_id)s.v%(version)s.xml
      thredds_dataset_roots =
              esg_dataroot | /esg/data
              cmip5        | /data/cmip5
      thredds_error_pattern = Catalog init
      thredds_fatal_error_pattern = **Fatal
      thredds_file_services =
              HTTPServer | /thredds/fileServer/          | HTTPServer    | fileservice
              GridFTP    | gsiftp://<fqdn>:2811/         | GRIDFTP       | fileservice
              OpenDAP    | /thredds/dodsC/               | OpenDAPServer | fileservice
              Globus     | globus:#DEFAULTENDPOINTNAME#/ | Globus        | fileservice
      thredds_master_catalog_name = Earth System Grid catalog
      thredds_use_numbered_directories = True
      thredds_max_catalogs_per_directory = 500
      thredds_offline_services =
              SRM | srm://<fqdn>:6288/srm/v2/server?SFN=/archive.sample.gov | HRMatPCMDI
      thredds_password = <thredds_password>
      thredds_reinit_error_url = https://localhost:443/thredds/admin/content/logs/catalogInit.log
      thredds_reinit_success_pattern = reinit ok
      thredds_reinit_url = https://localhost:443/thredds/admin/debug?Catalogs/recheck
      thredds_restrict_access = esg-user
      thredds_root = /esg/content/thredds/esgcet
      thredds_root_catalog_name = Earth System Root catalog
      thredds_url = http://<fqdn>/thredds/catalog/esgcet
      thredds_username = dnode_user
      

      Note

      It is recommended to have all thredds_file_services including HTTPServer, GridFTP, OpenDAP and Globus unless specific node configuration is needed.

    • Index node configuration

      hessian_service_certfile = %(home)s/.globus/certificate-file
      hessian_service_keyfile = %(home)s/.globus/certificate-file
      hessian_service_certs_location = %(home)s/.globus/certificates
      hessian_service_debug = false
      hessian_service_polling_delay = 3
      hessian_service_polling_iterations = 10
      hessian_service_port = 443
      hessian_service_remote_metadata_url = http://host/esgcet/remote/hessian/guest/remoteMetadataService
      hessian_service_url = https://<index_fqdn>/esg-search/remote/secure/client-cert/hessian/publishingService
      
  1. The [config:<project>] section

    To specify project specific configuration in esg.ini you can add a separate config section for each project. If PIDs are used by the project, the PID configs are set in that section, the same applies for the Citation. It also overrides the hessian_service_url, if specified.

    Example:

    [config:cmip6]
    hessian_service_url = https://esgf-data.dkrz.de/esg-search/remote/secure/client-cert/hessian/publishingService
    citation_url = http://cera-www.dkrz.de/WDCC/meta/CMIP6/%(dataset_id)s.%(version)s.json # not mandatory for CMIP6
    pid_prefix = 21.14100                   # not mandatory for CMIP6
    pid_exchange_name = esgffed-exchange    # not mandatory for CMIP6
    pid_credentials =
      # hostname                  | port  | virtual_host |    username    | password | ssl_enabled
      handle-esgf-trusted.dkrz.de | 5671  |   esgf-pid   | esgf-publisher | <secret> |    true
      pcmdi10.llnl.gov            | 5671  |   esgf-pid   | esgf-publisher | <secret> |    true
      207.38.94.86                | 19102 |   esgf-pid   | esgf-publisher | <secret> |    true
      140.208.31.31               | 5671  |   esgf-pid   | esgf-publisher | <secret> |    true
    

    The pid_credentials are available on Confluence. In case you don’t have access to that page please contact your tier1 node admin.

    Note

    Please change the order of the lines – put the host closest to your location first

    Note

    Please ensure that the firewall is open for all PID hosts on port 5671.

    Note

    This option is optional for most projects, except CMIP6.

    If your project requires ES-DOC documentation access, the ES-DOC configs are set in that section too. ES-DOC authorization is controlled using GitHub’s organizations invitational based structure. The authentication part is set by creating the personal access token.

    1. A verified GitHub account is required, as well as a personal access token generated through your GitHub profile setting page.

    2. Go on the bottom of the left menu to access to “Developer Settings”

    3. Click on “Personal Access Token”:

    4. Click on “Generate new token”

    5. Generate your token

    6. Make sure you associate a meaningful name and description for your newly generated token, to help you manage your tokens.

    7. The next important step is to set the minimum required scope for your personal access token: orgs:READ. Limiting the number of scopes increases the security of your own personal data associated with your github account.

    8. Add your GitHub username and token into your esg.ini section as follow:

    [config:cmip6]
    CDF2CIM_CLIENT_WS_HOST = https://cdf2cim.es-doc.org
    CDF2CIM_CLIENT_GITHUB_USER = <username>
    CDF2CIM_CLIENT_GITHUB_ACCESS_TOKEN = <secret_token>
    

    Please then ask to your ES-DOC officer to get granted authorization. He is the only person qualified to add GitHub users to the requested teams. For the authorization, a user needs to be part of the organization team specified for the institute and project he/she on behalf of which wishes to publish data.

    If your project requires to deal with CMOR tables, the following attributes could help you to manage tables version:
    1. cmor_table_path: Default is /usr/local/<project>-cmor-table. Use this attribute to change the default root path of the CMOR table for the considered project.

    2. data_specs_version: this is the CMOR table version to take into account during publication process. If your netCDF files includes a data_specs_version global attribute, you can set file to automatically switch from one table version to another depending on the file to publish.

    3. cmor_table_subdirs: Default is False. Set True if your CMOR table versions are stored in separate subfolder in the cmor_table_path. By default, the CMOR table folder is initialized as a git repository with a git checkout mechanism to switch to the appriopriate branch depending on the data_specs_version.

    In case of CV failure during publication process, we recommend to fetch CMOR tables using esgfetchtables and enabled subfolders for table version management:

    [config:cmip6]
    cmor_table_path = /PATH/TO/CMOR/TABLES/
    data_specs_version = file
    cmor_table_subdirs = true
    
  2. The [initialize] section

    [initialize]
    initial_models_table = /esg/config/esgcet/esgcet_models_table.txt
    log_level = INFO
    

    The esgcet_models_table is a separate file for the configuration of all models. The default location of this file is /esg/config/esgcet/esgcet_models_table.txt.

    Format of the models table:

    <project> | <model_1> | <model_url_1> | <model_description>
    <project> | <model_2> | <model_url_2> | <model_description>
    

    Example:

    cmip5 | MPI-ESM-P  | | MPI-ESM-P, Max Planck Institute for Meteorology (MPI-M)
    cmip5 | MPI-ESM-LR | | MPI-ESM-LR, Max Planck Institute for Meteorology (MPI-M)
    cmip5 | MPI-ESM-MR | | MPI-ESM-MR, Max Planck Institute for Meteorology (MPI-M)
    

    If you are defining a new project but using an existing model name, you need to add a new entry to the table file for your new pairing as well.

    Note

    After modifying the models table please run $ esginitialize -c to update the postgres database.

  1. The myproxy section

    esgpublish and esgunpublish will automatically generate or renew your globus certificate using the credentials specified here.

    [myproxy]
    hostname = <openid_server>
    username = <esgf_user>
    password = <password>
    

    Note

    If this section is not specified and the globus certificate is not present or valid the user will be prompted for the credentials during esgpublish and esgunpublish.

    Note

    This section is not present by default.

  2. Other sections, e.g. for scanning the files and the offline services

    [extract]
    log_level = INFO
    validate_standard_names = True
    
    [srmls]
    offline_lister_executable = %(home)s/work/Esgcet/esgcet/scripts/srmls.py
    srm_archive = /garchive.nersc.gov
    srm_server = srm://somehost.llnl.gov:6288/srm/v2/server
    srmls = /usr/local/esg/bin/srm-ls
    
    [hsi]
    hsi = /usr/local/bin/hsi
    

The project specific config files: esg.<project>.ini

  1. Set the section name

    Each project specific configuration file starts with a section name following the [project:<project_name>] syntax.

    Warning

    Please note: The <project_name> is case sensitive and needs to match the file name and the project name you specify with --project, e.g. esg.cmip5.ini, [project:cmip5], --project cmip5.

  2. Set the categories to be used for the project

    The categories define the facet fields. All facets listed as enum will be checked against the facet_options, facet_map or facet_pattern. Facets that are listed as string will not be checked unless they are part of the directory_format.

    Format of the categories:

    name | category_type | is_mandatory | is_thredds_property | display_order
    

    If the value for is_thredds_property is set to true the facet will appear in the Thredds Catalog and in the Index.

    Example:

    categories =
        project        | enum   | true  | true  | 0
        product        | enum   | true  | true  | 1
        institute      | string | true  | true  | 2
        model          | enum   | true  | true  | 3
        experiment     | enum   | true  | true  | 4
        time_frequency | enum   | true  | true  | 5
        realm          | enum   | true  | true  | 6
        cmor_table     | enum   | true  | true  | 7
        ensemble       | string | true  | true  | 8
        description    | text   | false | false | 99
    

    You can also set a default value for particular categories, e.g.:

    category_defaults =
        project | cmip5
    
  3. The directory_format

    Ensure that the directory_format is spelled out for the project, check carefully for typos. Data files must be found in the rightmost set of subdirectories specified, the not-project-specific root part in front of the project-specific DRS elements can be specified as %(root), all project related elements must be defined separately, following the %(name)s syntax, e.g.:

    directory_format = %(root)/%(project)s/%(model)s/%(experiment)s/%(realm)s
    or
    directory_format = /some_mountpoint/data/%(project)s/%(model)s/%(experiment)s/%(realm)s
    

    Example:

    /some_mountpoint/data/cmip5/CESM/historical/atmos/blah.nc   - valid
    /some_mountpoint/data/cmip5/CESM/historical/atmos/1/blah.nc - not valid
    /some_mountpoint/data/cmip5/CESM/historical/blah.nc         - not valid
    

    In the example above, /some_mountpoint/data must be included in the thredds_dataset_roots entry in the [DEFAULT] section of esg.ini.

  4. Ensure that you have a dataset_id and optional a dataset_name_format

    The dataset_id is project specific and may mirror the directory structure to a point.

    dataset_id = %(project)s.%(model)s.%(experiment)s.%(realm)s
    

    Note

    The facets used for the dataset_id must be a subset of those used in the directory_format. In other words, the facet names for the dataset_id must appear as variables within the directory_format using the same corresponding names with the %(name)s syntax or must be derived from some other category using a category_map entry in esg.<project>.ini. An error or undefined behavior, such as the sudden absence of that facet value from the dataset_id, might result otherwise.

    The dataset_name_format is a description of the dataset and will appear in the Thredds catalogs and in the Index.

    dataset_name_format = project=%(project_description)s, model=%(model_description)s, experiment=%(experiment_description)s, time_frequency=%(time_frequency)s
    
  5. Generate a <facet>_options list, a <facet>_map or a <facet>_pattern for each facet

    The metadata for each facet that is part of the directory_format (except for version and variable) is checked against the values in facet_options, facet_map or facet_pattern.

    • <facet>_options

      This is a simple list that contains all possible values for a facet, e.g.:

      model_options = MPI-ESM-LR, MPI-ESM-MR, MPI-ESM-P
      time_frequency_options = 3hr, 6hr, day, fx, mon, monClim, subhr, yr
      

      Warning

      The option list for the experiments does not follow the above syntax. Each experiment has the format: <project> | <experiment> | <experiment_description>

      Example:

      experiment_options =
          cmip5 | 1pctCO2     | 1 percent per year CO2
          cmip5 | abrupt4xCO2 | Abrupt 4xCO2
          cmip5 | amip        | AMIP
          cmip5 | amip4K      | AMIP plus 4K anomaly
          cmip5 | amip4xCO2   | 4xCO2 AMIP
          cmip5 | amipFuture  | AMIP plus patterned anomaly
          cmip5 | aqua4K      | Aqua planet plus 4K anomaly
          cmip5 | aqua4xCO2   | 4xCO2 aqua planet
      
    • <facet>_map

      Using a <facet>_map is recommended if the facet is not part of the directory_structure and needs to be mapped to another value, e.g. for CORDEX:

      rcm_name_map = map(project, rcm_model : rcm_name)
          cordex | AWI-HIRHAM5      | HIRHAM5
          cordex | GERICS-REMO2009  | REMO2009
          cordex | KNMI-RACMO22E    | RACMO22E
          cordex | MPI-CSC-REMO2009 | REMO2009
          cordex | UCLM-PROMES      | PROMES
      

      Note

      All <facet>_maps needs to be listed in the project ini file, e.g. maps = rcm_name_map, las_time_delta_map.

    • <facet>_pattern

      A pattern should be used for facets that follow a known syntax, e.g. the ensemble facet:

      ensemble_pattern = r%(digit)si%(digit)sp%(digit)s
      

      Note

      The <facet>_pattern currently supports %(digit)s and %(string)s where %(digit)s matches any number and %(string)s one or more character(s).

  6. Project Handler

    You can either use the publisher’s default handler, a pre-installed project handler or generate a custom handler.

    Note

    The setup and configuration of a custom handler needs expert knowledge. For most projects the default handler will be sufficient. The handlers for major projects like CMIP5 are pre-installed and for some minor projects you can find customized handlers on github.

    To use the default handler please add the following to your project configuration file:

    project_handler_name = basic_builtin
    

    For the pre-installed project handler for CMIP5 add the following:

    handler = esgcet.config.ipcc5_handler:IPCC5Handler
    

    For creating a new customized handler you can run the following command that will generate the basic package:

    $ esgsetup --handler
    

    Now you can customize the handler by editing the project_handler.py file and install the handler package with:

    $ cd <handler_name>
    $ python setup.py install
    

    In your esg.<project>.ini file simply add whatever you have specified for the project_handler_name during the setup.

    project_handler_name = <project_handler_name>
    
  7. The thredds_exclude_variables and variable_per_file.

    As mentioned above it is not needed to create a variable_options list. Instead we need to add a thredds_exclude_variables list that lists all variables that might be part of the file content but are not the target variable.

    thredds_exclude_variables = a, a_bnds, alev1, alevel, alevhalf, alt40, b, ...
    

    The variable_per_file should be always set to true. If this is set to false no aggregations will be generated and all variables that are part of the dataset are wrongly assigned to every file.

    variable_per_file = true
    

    Warning

    If a excludes variable is missing in the thredds_exclude_variables and variable_per_file is set to true this might result in publishing the same file multiple times to Thredds.

    If a variable can be taget variable and exclude variable it must be listed in the variable_locate. The variable_locate is a list of variable and begin-of-filename pairs, following the syntax:

    variable_locate = <var1>,<begin_of_filename1> | <var2>,<begin_of_filename2>
    

    Example:

    variable_locate = ps,ps_ | basin,basin_
    
  8. Enable and disable the LAS access

    The Live Access Server (LAS) is part of the ESGF Installation and can be used to visualize the data.

    If LAS is enabled the publisher will generate and publish a LAS-link for each dataset and aggregation.

    # disable LAS
    las_configure = false
    
    # enable LAS
    las_configure = true
    

    For LAS you also need a las_time_delta_map, e.g.:

    las_time_delta_map = map(time_frequency : las_time_delta)
        yr      | 1 year
        mon     | 1 month
        day     | 1 day
        6hr     | 6 hours
        3hr     | 3 hours
        subhr   | 1 minute
        monClim | 1 month
        fx      | fixed
    
  9. (Optional) The skip_aggregations option:

    If skip_aggregations is set to true, aggregations will not be created. By default this option is set to false.

Prepare user and permissions for publication

Publish to an index node at another side

Please coordinate with that site’s node administrator.

Publish to your own index node

  1. Verify publishing permissions: /esg/config/esgf_policies_local.xml

    Specifications for datasets are given by regular expression. This could include a data_node or a project, institution, model, etc. If you want to publish within a specified collection, ensure that an entry exists for that with a specified ESGF group, publisher role, and Write action.

    Example for publication of CMIP5 data only:

    1 <policy resource=".*cmip5.*" attribute_type="cmip5_publisher" attribute_value="publisher" action="Write"/>
    

    Example for publication of all projects from a particular ESGF node:

    1 <policy resource=".*esgf-test.dkrz.de.*" attribute_type="cmip5_publisher" attribute_value="publisher" action="Write"/>
    

    Note

    Make sure you have the correct permission for both policies files:

    -rw-r----- 1 tomcat tomcat 5840 Aug  8 10:32 /esg/config/esgf_policies_local.xml
    -rw-r----- 1 tomcat tomcat 1381 Mar 21  2016 /esg/config/esgf_policies_common.xml
    
  2. Group, role and permission in the Postgres database:

    For publication you need to create an ESGF account and add the appropriate role and group to that user. Therefore you have to modify the postgres database:

    # login to the escet database
    $ psql -U dbsuper esgcet
    
    # add a new group named cmip5_publisher
    esgcet=# INSERT INTO esgf_security.group VALUES(3, 'cmip5_publisher', 'CMIP5 Publisher', true, true);
    
    # update permission table
    esgcet=# INSERT INTO esgf_security.permission VALUES(2, 3, 4, true);
    

    For the example above the tables in esgcet should look like:

    esgcet=# SELECT * FROM esgf_security.user;
    
    id | firstname | middlename | lastname |     email     |  username    | ...
    ---+-----------+------------+----------+---------------+--------------+-----
    2  | Publish   |            | User     | email@address | publish_user | ...
    
    esgcet=# SELECT * FROM esgf_security.group;
    
        id | name            | description     | visible | automatic_approval
    ----+-----------------+-----------------+---------+--------------------
        3  | cmip5_publisher | CMIP5 Publisher | t       | t
    
    esgcet=# SELECT * FROM esgf_security.role;
    
        id | name      | description
    ----+-----------+----------------
        4  | publisher | Data Publisher
    
    esgcet=# SELECT * FROM esgf_security.permission;
    
        user_id | group_id | role_id | approved
    ---------+----------+---------+----------
        2       | 3        | 4       | t
    
  3. Ensure that the ESGF group has an entry in the /esg/config/esgf_ats_static.xml file for the attribute service for that group, e.g.:

    1 <attribute type="cmip5_publisher"
    2     attributeService="https://<fqdn>/esgf-idp/saml/soap/secure/attributeService.htm"
    3     description="Publisher group for CMIP5 data"
    4     registrationService="https://<fqdn>/esgf-idp/secure/registrationService.htm"/>
    

Myproxy Logon

For publication to an index node you need to have a valid globus certificate for an user with Write permissions.

$ mkdir $HOME/.globus   # if not already present
$ myproxy-logon [ -b ] -s <openid_server> -l <esgf_username> -p 7512 -t 72 -o $HOME/.globus/certificate-file

Note

The certificate is valid for 72 hours when specified by -t. If you are publishing for the first time, you will need to use -b to bootstrap it’s trustroots with the server.

Note

Please get the openid_server and esgf_username from your ESGF OpenID, e.g.

openid:        https://pcmdi.llnl.gov/esgf-idp/openid/publish_user
openid_server: pcmdi.llnl.gov
esgf_username: publish_user