Usage

This site contains Legacy ESGF Publisher Software documentation. For documentation for recent versions of the software, please see: https://esg-publisher.readthedocs.org/

Preliminary

Set up default environment for scripts:

ESGF v2.5 or later

$ source /usr/local/conda/bin/activate esgf-pub

ESGF v2.4.x or earlier

$ source /etc/esg.env

Publication and unpublication to an index node also requires a valid globus certificate. By running esgpublish or esgunpublish a globus certificate will be generated automatically. Please specify your credentials for the certificate either by filling out the user prompts during the publication or add the credential information to your esg.ini file, see myproxy section.

In case the certificate generation fails for some reason, please create the certificate manually, see Myproxy Logon.

Tip

If your intent is to “script” the publisher, you should diasable the UVCDAT anonymous logging as the user prompt will repeat periodically.

$ export UVCDAT_ANONYMOUS_LOG=no

Publication

The data publication has three components:

  1. A local postgres database

  2. Local Thredds (TDS) catalogs

  3. A Solr Index, either local or on another ESGF node

To be visible in the federation you need to publish to all three components and use one of the federated Indexes.

Some basics:

Show the help message and all options of esgpublish:

$ esgpublish -h

The publisher takes a directory containing mapfiles or a single mapfile as input. If a mapfile directory is specified, it is scanned recursively and all containing mapfiles are published.

$ esgpublish --project <project_name> --map <mapfile or mapfile_directory>

Use esg-node to show or change the index node you are publishing to:

$ esg-node --get-index-peer
$ esg-node --set-index-peer <index_fqdn>

Publish to local postgres database

This step will scan the data files using the uvcdat toolbox , validate the facet values using esg.<project>.ini and publish all data from the mapfile(s) to the local postgres database.

$ esgpublish [optional: -i <path_to_ini_files>] --project <project_name>  --map <input_mapfile or mapfile_directory> --service fileservice [--set-replica]

Note

--service fileservice will add the services specified as fileservice in esg.ini’s thredds_file_services.

Example:

$ esgpublish --project cmip5 --service fileservice --map /esg/mapfiles

 INFO       2016-08-09 13:59:37,073 Creating dataset: cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1
 INFO       2016-08-09 13:59:37,079 Scanning /esg/data/cmip5/output1/MPI-M/MPI-ESM-P/historical/day/atmos/day/v20120315/clt/clt_day_MPI-ESM-P_historical_r1i1p1_19500101-19591231.nc
 INFO       2016-08-09 13:59:37,161 Scanning /esg/data/cmip5/output1/MPI-M/MPI-ESM-P/historical/day/atmos/day/v20120315/clt/clt_day_MPI-ESM-P_historical_r1i1p1_19600101-19691231.nc
 ...
 INFO       2016-08-09 13:59:37,383 New dataset version = 20120315
 INFO       2016-08-09 13:59:37,385 Adding file info to database
 INFO       2016-08-09 13:59:37,587 Aggregating variables

For the above example this will add two datasets and several files in the postgres database:

esgcet=# SELECT * FROM dataset WHERE name LIKE 'cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day%';

   id  |                             name                              | project | ...
 ------+---------------------------------------------------------------+---------+-----
  3161 | cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1 | cmip5   | ...
  3162 | cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r2i1p1 | cmip5   | ...
 (2 rows)

 esgcet=# SELECT * FROM file WHERE dataset_id=3161 OR dataset_id=3162;

   id   | dataset_id |                              base                        | format
 -------+------------+----------------------------------------------------------+--------
  92804 |       3161 | clt_day_MPI-ESM-P_historical_r1i1p1_19500101-19591231.nc | netCDF
  92805 |       3161 | clt_day_MPI-ESM-P_historical_r1i1p1_19600101-19691231.nc | netCDF
   ...  |       ...  |                         ...                              |  ...
  93912 |       3162 | zg_day_MPI-ESM-P_historical_r2i1p1_19820101-19821231.nc  | netCDF
   ...  |       ...  |                         ...                              |  ...
 (1132 rows)

Publish to local Thredds server

The publication of the Thredds catalogs will use the local postgres database as input and generate one catalog per dataset in XML format, added to the default location /esg/content/thredds/esgcet It is recommended to set a umask so files are world readable, directories accessible, i.e. r-x. Also make sure the (unix-) user you use for publication has write access to the THREDDS catalogs in /esg/content/thredds/esgcet/.

$ esgpublish [optional: -i <path_to_ini_files>] --project <project_name>  --map <input_mapfile or mapfile_directory> --service fileservice --noscan --thredds [--no-thredds-reinit]

Note

--service fileservice is required to publish Globus, GridFTP and OpenDAP urls in default esg-publisher configurations. If omitted, rerunning esgpublish with --thredds can be performed to add those urls.

Note

--noscan skips the netcdf scan of each file. This is useful since the scan was already done in the previous publication step to the database.

Note

If you use a mapfile_directory as input the thredds catalog is reinitialized/rechecked only once, after all mapfiles have been processed. If you prefer to pass only one mapfile per esgpublish call and you are publishing a series of mapfiles its unnecessary to have THREDDS reinitialize the catalog on each call to esgpublish. Use the additional argument --no-thredds-reinit to all calls and finish the publication with $ esgpublish --thredds-reinit to reinitialize/recheck the catalog.

Example:

$ esgpublish --project cmip5 --service fileservice --map /esg/mapfiles --noscan --thredds

INFO       2016-08-09 14:07:21,767 Writing THREDDS catalog /esg/content/thredds/esgcet/13/cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1.v20120315.xml
INFO       2016-08-09 14:07:21,767 Writing THREDDS catalog /esg/content/thredds/esgcet/13/cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r2i1p1.v20120315.xml
INFO       2016-08-09 14:07:21,945 Writing THREDDS ESG master catalog /esg/content/thredds/esgcet/catalog.xml
INFO       2016-08-09 14:07:21,993 Reinitializing THREDDS server

For the above example this will generate two Thredds catalogs and add the catalog entry to the postgres database:

$ ls /esg/content/thredds/esgcet/13

/esg/content/thredds/esgcet/13/cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1.v20120315.xml
/esg/content/thredds/esgcet/13/cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r2i1p1.v20120315.xml
esgcet=# SELECT * FROM catalog WHERE dataset_name LIKE 'cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day%';

                             dataset_name                      | version  |                                       location                                 | rootpath
---------------------------------------------------------------+----------+--------------------------------------------------------------------------------+----------
 cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1 | 20120315 | 13/cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1.v20120315.xml | cmip5
 cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r2i1p1 | 20120315 | 13/cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r2i1p1.v20120315.xml | cmip5

Note

You can check for the Thredds catalogs on your local Thredds server: http://<fqdn>/thredds/catalog/esgcet/catalog.html

Publish to index node

The publication to the Index node will read the Thredds catalogs and publish the datasets to Solr using ESGF’s esg-search.

Note

Version v3.4.4 or later: By default the publication will use the REST web service protocol. For the HESSIAN service please use the --hessian-api flag.

Version v3.4.2-3: By default the publication will use the REST web service protocol. The HESSIAN service has been disabled in this version.

Earlier versions: By default the publication will use the HESSIAN web service protocol. For the REST service please use the --rest-api flag.

$ esgpublish [optional: -i <path_to_ini_files>] --project <project_name> --map <input_mapfile or mapfile_directory> --service fileservice --noscan --publish [--hessian-api]

Example:

$ esgpublish --project cmip5 --service fileservice --map /esg/mapfiles --noscan --publish

INFO       2016-08-09 14:10:23,767 Publishing: cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1
INFO       2016-08-09 14:10:28,116   Result: SUCCESSFUL
INFO       2016-08-09 14:10:28,767 Publishing: cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r2i1p1
INFO       2016-08-09 14:10:31,116   Result: SUCCESSFUL

Note

The data should now be visible in the Index (http://<index_fqdn>/esg-search/search?) and in CoG: http://<index_fqdn>.

Publish to postgres, Thredds and the Index in one step

Warning

It is not recommended to publish to all components in one step. Please use this call only in case you are sure your configuration is set up correctly.

$ esgpublish [optional: -i <path_to_ini_files>] --project <project_name> --map <input_mapfile or mapfile_directory> --service fileservice --thredds --publish

Adding a Technical Note to a dataset

Some projects require to add a Technical Note to the datasets (e.g. obs4MIPs). This can be done by adding the tech note information to the mapfile, see section Adding a Technical Note to the mapfile. The publisher will automatically use the information in the mapfile to publish the Technical Note to the postgres, Thredds and Solr.

Useful options

  • Echo all SQL commands:

    $ esgpublish --project <project> --map <map> --echo-sql
    
  • Specify the directory containing all configuration files, By default it is set to /esg/config/esgcet.

    $ esgpublish --project <project> --map <map> --i <init_directory>
    
  • Name of output log file. Overrides the configuration log_filename option. Default is standard output.

    $ esgpublish --project <project> --map <map> --log <log_file>
    
  • Specify the version number. This option is only needed if the version is not included in the mapfile (using the dataset_name#version syntax).

    $ esgpublish --project <project> --map <map> --new-version <version_number>
    
  • This will skip the scan of the files. Assumes that the scan has already been done and all information was added to the database. Use this option only with --thredds or --publish.

    $ esgpublish --project <project> --map <map> --noscan [--thredds] [--publish]
    
  • Skip the reinitialization/recheck of the Thredds catalogs. This option can be used if you run a series of esgpublish calls with a single mapfile as input. Finish the publication with --thredds-reinit to reinitialize/recheck the catalog. This option is not necessary if you pass a mapfile_directory as input, in this case the thredds catalog is reinitialized/rechecked only once, after all mapfiles have been processed.

    $ esgpublish --project <project> --map <map> --no-thredds-reinit
    $ esgpublish --thredds-reinit
    
  • Publish the dataset to the index node. Needs Thredds catalogs of the dataset. (Use --noscan to skip the scan of the files.)

    $ esgpublish --project <project> --map <map> --publish [--noscan]
    
  • Set a replica flag to the data.

    $ esgpublish --project <project> --map <map> --set-replica
    
  • Create the Thredds catalogs and reinitialize/recheck the Thredds Server unless --no-thredds-reinit is set. (Use --noscan to skip the scan of the files.)

    $ esgpublish --project <project> --map <map> --thredds [--noscan]
    
  • Publish a single dataset to Thredds or the index, assumes the file information are already in database.

    $ esgpublish --project <project> --use-existing <dataset_name[#version]>
    
  • Like use-existing, but read the list of dataset names from a file, containing one dataset name per line.

    $ esgpublish --project <project> --use-list <dataset_list>
    
  • Use the version indicated in the version_list. version_list is a file, each line of which has the form: dataset_id | version. Not needed if you use the dataset#version syntax in the mapfile(s).

    $ esgpublish --project <project> --map <map> --version-list <version_list>
    

Unpublication

Warning

If you unpublish a dataset passing only the dataset_name it will unpublish all versions of the dataset. To unpublish a single version use the dataset_name#version syntax, e.g.: cmip5.output1.MPI-M.MPI-ESM-P.historical.day.atmos.day.r1i1p1#20120315.

You could either use a mapfile directory, a single mapfile a dataset or a dataset_list as input for the data unpublication:

Note

By default the unpublication from the Solr index will use the REST web service protocol. Thus it is mandatory to specify the version number for each dataset (i.e. dataset_name#version). For the Hessian service please use the --hessian-api flag. When using Hessian API, the version number is not mandatory to specify with the dataset_id.

Note

By default upublication will occur on the index node. In this case, it is required to either specify whether it is desired to retract or delete the dataset. The --retract flag gives the option to leave a publication record of the dataset on the index node, but the data will no longer be available for download. In contrast, --delete completely removes the dataset record from the index. Dataset project guidelines should suggest which of these options should be considered.

  • Using a mapfile directory or a single mapfile; retract the dataset

    $ esgunpublish --project <project> --map <input_mapfile or mapfile_directory> --retract
    
  • Using a list; delete the dataset

    $ esgunpublish --project <project> --use-list <list-of-datasets-filename> --delete
    

    Note

    To obtain the a list of datasets, there are several alternatives. On the command line you can use $ esglist_datasets --no-header --select name <project>

  • Using a single dataset_name; retract

    $ esgunpublish --project <project> dataset_name[#version] --retract
    

Delete from Index and Thredds

Delete the data from Index, remove the THREDDS catalog, reinitialize/recheck the Thredds Server but keep the data on postgres.

$ esgunpublish --project cmip5 --map /esg/mapfiles --delete

Delete from Index

Delete the data from Index but keep the Thredds catalogs and postgres entries.

$ esgunpublish --project cmip5 --map /esg/mapfiles --skip-thredds --delete

Delete from Thredds

Delete the Thredds Catalogs, but keep the data available on the Index node and on the postgres database. In this case the

$ esgunpublish --project cmip5 --map /esg/mapfiles --skip-index

Delete from all components

The data will be removed from postgres, Thredds and the Index node.

$ esgunpublish --project cmip5 --map /esg/mapfiles --database-delete --delete

Warning

Use --database-delete to unpublish test data only. It is highly recommended to keep a history of all production data in postgres.