from datetime import datetime

datetime.now()
datetime.datetime(2025, 3, 21, 11, 3, 21, 470482)

ESGVOC library tutorial#

Prerequisites:

pip install esgvoc  
esgvoc install # in order to get the latest CVs

The esgvoc library supports a wide range of use cases, including:

  • Listing:
    All data descriptors from the universe.
    All terms of one data descriptor from the universe.
    All available projects.
    All collections from a project.
    All terms from a project.
    All terms of a collection from a project.

  • Validating an input string against:
    All terms of a project.
    All terms of a collection from a project.
    All terms from all projects (cross-validation).

Universe and projects organization#

The universe CV (Controlled Vocabularies) follows this organizational pattern:

<universe><DataDescriptor><Term>

Similarly, all CVs are organized as:

<project><collection><Term>   
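The two patterns above can be pictured as nested mappings. The following is only a sketch built on a hand-made dictionary: the `cv` variable and the `lookup` helper are hypothetical, and the entries are a small subset of the names that appear in the outputs of this tutorial. esgvoc itself is queried through its API, not through such a dictionary.

```python
# Hypothetical in-memory picture of the two organizational patterns.
cv = {
    "universe": {                      # <universe><DataDescriptor><Term>
        "activity": ["aerchemmip", "lumip", "pmip"],
        "institution": ["ipsl"],
    },
    "cmip6plus": {                     # <project><collection><Term>
        "activity_id": ["cmip", "lesfmip"],
        "institution_id": ["ipsl"],
    },
}


def lookup(root: str, group: str, term: str) -> bool:
    """Return True if `term` is reachable through root -> group -> term."""
    return term in cv.get(root, {}).get(group, [])


print(lookup("universe", "activity", "lumip"))        # True
print(lookup("cmip6plus", "activity_id", "lumip"))    # False
```

The second call returns False because a project collection only holds the terms that the project selected, which may be fewer than those of the matching data descriptor in the universe.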

ESGVOC API organization#

The API functions are sorted as follows:

  • get functions return a list of something based on an id (collections from a project, terms from a collection, etc.)

  • find functions try to find terms, data descriptors or collections corresponding to an expression.

  • valid functions check whether an input string complies with the DRS name of a term.

import esgvoc.api as ev

Universe#

Listing#

ev.get_all_data_descriptors_in_universe()
['physic_index',
 'realisation_index',
 'temporal_label',
 'mip_era',
 'horizontal_label',
 'directory_date',
 'initialisation_index',
 'sub_experiment',
 'forcing_index',
 'consortium',
 'license',
 'variable',
 'frequency',
 'source_type',
 'activity',
 'vertical_label',
 'source',
 'date',
 'model_component',
 'product',
 'institution',
 'resolution',
 'time_range',
 'table',
 'variant_label',
 'organisation',
 'experiment',
 'area_label',
 'realm',
 'grid']
ev.get_all_terms_in_data_descriptor(data_descriptor_id="activity")[:3]
# any data descriptor listed in the previous output can be used as the argument
# [:3] only limits the result to the first three terms
[Activity(id='dynvarmip', type='activity', drs_name='DynVarMIP', name='DynVarMIP', long_name='Dynamics and Variability Model Intercomparison Project', url=None, @context='000_context.jsonld', cmip_acronym='DynVarMIP'),
 Activity(id='lumip', type='activity', drs_name='LUMIP', name='LUMIP', long_name='Land-Use Model Intercomparison Project', url=None, @context='000_context.jsonld', cmip_acronym='LUMIP'),
 Activity(id='pmip', type='activity', drs_name='PMIP', name='PMIP', long_name='Palaeoclimate Modelling Intercomparison Project', url=None, @context='000_context.jsonld', cmip_acronym='PMIP')]
ev.get_term_in_data_descriptor(data_descriptor_id="activity", term_id="aerchemmip")
Activity(id='aerchemmip', type='activity', drs_name='AerChemMIP', name='AerChemMIP', long_name='Aerosols and Chemistry Model Intercomparison Project', url=None, @context='000_context.jsonld', cmip_acronym='AerChemMIP')

Little detour: pydantic model instance return#

The previous call returns an instance of the pydantic model of the requested data descriptor (get_all_terms_in_data_descriptor returns a list of such instances). In the example above, the result is an Activity object that can be queried directly in Python.

my_activity = ev.get_term_in_data_descriptor(data_descriptor_id="activity", term_id="aerchemmip")
print(my_activity.id)
print(my_activity.drs_name)
print(my_activity.long_name)
print(my_activity)
aerchemmip
AerChemMIP
Aerosols and Chemistry Model Intercomparison Project
id='aerchemmip' type='activity' drs_name='AerChemMIP' name='AerChemMIP' long_name='Aerosols and Chemistry Model Intercomparison Project' url=None @context='000_context.jsonld' cmip_acronym='AerChemMIP'
ev.get_term_in_universe(term_id="aerchemmip") # gives the same result as above
Activity(id='aerchemmip', type='activity', drs_name='AerChemMIP', name='AerChemMIP', long_name='Aerosols and Chemistry Model Intercomparison Project', url=None, @context='000_context.jsonld', cmip_acronym='AerChemMIP')

Find terms in universe#

The find functions perform a full-text search (FTS) over term or data descriptor specs. They accept expressions composed of keywords and, optionally, boolean operators that combine them. Results are sorted by hit rank (bm25): the best match comes first in the list (index zero).

# The headquarters of IPSL and CNES are both located in Paris.
# We want to find the term that corresponds to the IPSL institution, but not the CNES one:
ev.find_terms_in_data_descriptor(expression='pArIs NOT CNES',
                                 data_descriptor_id='institution',
                                 selected_term_fields=['location'])
[DataDescriptorSubSet(id='ipsl', type='institution', location={'city': 'Paris', 'country': ['France', 'FR'], 'lat': 48.855675, 'lon': 2.332105})]
# We can also search in the whole universe, but expect to find many more terms:
ev.find_terms_in_universe(expression='pArIs NOT CNES',
                          selected_term_fields=['location'])
[DataDescriptorSubSet(id='ipsl', type='institution', location={'city': 'Paris', 'country': ['France', 'FR'], 'lat': 48.855675, 'lon': 2.332105}),
 DataDescriptorSubSet(id='institution/ipsl', type='institution', location={'city': 'Paris', 'country': ['France', 'FR'], 'lat': 48.855675, 'lon': 2.332105})]
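To see how a boolean expression and bm25 ranking behave in isolation, here is a stand-alone sketch using the FTS5 extension of SQLite from the Python standard library. It only illustrates the search technique: that esgvoc relies on this engine internally is an assumption, the `terms` table, its rows and the `search` helper are made up for the example, and FTS5 must be enabled in your SQLite build.

```python
import sqlite3

# Toy full-text index: one row per term, with the fields we want to search.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE terms USING fts5(id, city)")
conn.executemany(
    "INSERT INTO terms (id, city) VALUES (?, ?)",
    [("ipsl", "Paris"), ("cnes", "Paris"), ("mohc", "Exeter")],
)


def search(expression: str) -> list[str]:
    """Return the ids matching `expression`, best bm25 rank first."""
    rows = conn.execute(
        "SELECT id FROM terms WHERE terms MATCH ? ORDER BY bm25(terms)",
        (expression,),
    )
    return [row[0] for row in rows]


print(search("pArIs NOT CNES"))  # ['ipsl']
```

Keywords are matched case-insensitively, while the operator NOT must be written in upper case. In FTS5, bm25() returns lower values for better matches, so an ascending sort puts the best hit at index zero.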

Find terms or data descriptors in universe#

# We want to find the data descriptor time_range and its terms:
ev.find_items_in_universe(expression='time_range')
[Item(id='daily', kind=<ItemKind.TERM: 'term'>, parent_id='time_range'),
 Item(id='monthly', kind=<ItemKind.TERM: 'term'>, parent_id='time_range'),
 Item(id='hourly', kind=<ItemKind.TERM: 'term'>, parent_id='time_range'),
 Item(id='time_range', kind=<ItemKind.DATA_DESCRIPTOR: 'data_descriptor'>, parent_id='universe')]

Project example: CMIP6plus#

The API provides the same functions for the projects (get, find) and adds the validation functions.

ev.get_all_projects()
['cmip6', 'cmip6plus']
ev.get_all_collections_in_project(project_id="cmip6plus")
['member_id',
 'activity_id',
 'mip_era',
 'institution_id',
 'source_id',
 'time_range',
 'version',
 'table_id',
 'grid_label',
 'experiment_id',
 'variable_id']
ev.get_all_terms_in_collection(project_id="cmip6plus", collection_id="activity_id")
[Activity(id='cmip', type='activity', drs_name='CMIP', name='CMIP', long_name='CMIP DECK: 1pctCO2, abrupt4xCO2, amip, esm-piControl, esm-historical, historical, and piControl experiments', url='https://gmd.copernicus.org/articles/9/1937/2016/gmd-9-1937-2016.pdf', @context='000_context.jsonld', cmip_acronym='CMIP'),
 Activity(id='lesfmip', type='activity', drs_name='LESFMIP', name='LESFMIP', long_name='The Large Ensemble Single Forcing Model Intercomparison Project', url='https://www.frontiersin.org/articles/10.3389/fclim.2022.955414/full', @context='000_context.jsonld', cmip_acronym='LESFMIP')]
ev.get_term_in_collection(project_id="cmip6plus", collection_id="activity_id", term_id="cmip")
Activity(id='cmip', type='activity', drs_name='CMIP', name='CMIP', long_name='CMIP DECK: 1pctCO2, abrupt4xCO2, amip, esm-piControl, esm-historical, historical, and piControl experiments', url='https://gmd.copernicus.org/articles/9/1937/2016/gmd-9-1937-2016.pdf', @context='000_context.jsonld', cmip_acronym='CMIP')

Find terms in a project#

# We want to find all the terms related to miroc:
ev.find_terms_in_project(expression='mir*', project_id='cmip6plus', selected_term_fields=[])
[DataDescriptorSubSet(id='miroc6', type='source'),
 DataDescriptorSubSet(id='miroc', type='organisation')]

Find terms and collections#

# We want to find the collection named 'institution_id'
items_found = ev.find_items_in_project(expression='instit*', project_id='cmip6plus')
print(f'number of items: {len(items_found)}')
for item in items_found:
    if item.kind == 'collection':
        break
print(item)
number of items: 41
id='institution_id' kind=<ItemKind.COLLECTION: 'collection'> parent_id='cmip6plus'
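Note that the loop above leaves `item` bound to the last element when no collection is found. A slightly safer pattern uses `next()` with a default. It is sketched here with stand-in objects rather than real esgvoc items: the `first_of_kind` helper and the `SimpleNamespace` values are hypothetical.

```python
from types import SimpleNamespace


def first_of_kind(items, kind):
    """Return the first item whose `kind` equals `kind`, or None if there is none."""
    return next((item for item in items if item.kind == kind), None)


# Stand-ins mimicking the id/kind/parent_id attributes shown in the output above.
items_found = [
    SimpleNamespace(id="ipsl", kind="term", parent_id="institution_id"),
    SimpleNamespace(id="institution_id", kind="collection", parent_id="cmip6plus"),
]

print(first_of_kind(items_found, "collection").id)    # institution_id
print(first_of_kind(items_found, "data_descriptor"))  # None
```

Returning None makes the "nothing found" case explicit, so the caller can test for it instead of printing an unrelated item.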
# To search for collections only, this function is more direct:
ev.find_collections_in_project(expression='instit*', project_id='cmip6plus')
[('institution_id',
  {'@context': {'id': '@id',
    'type': '@type',
    '@base': 'https://espri-mod.github.io/mip-cmor-tables/organisation/',
    'organisation': 'https://espri-mod.github.io/mip-cmor-tables/organisation',
    'myprop': 'http://TEST',
    'established': {'@id': 'https://schema.org/foundingDate'}}})]

Validating a string against the project CV#

valid_string = "IPSL" # the DRS name of the institution "Institut Pierre Simon Laplace"
unvalid_string = "ipsl" # NOT the DRS name: here it is the 'id' of the term

Queries based on the project and the collection ids#

ev.valid_term_in_collection(value=valid_string, project_id="cmip6plus", collection_id="institution_id")
[MatchingTerm(project_id='cmip6plus', collection_id='institution_id', term_id='ipsl')]
ev.valid_term_in_collection(value=unvalid_string, project_id="cmip6plus", collection_id="institution_id")
[]
# An empty result list is falsy, so the result can be used directly as a condition:
if ev.valid_term_in_collection(value=valid_string, project_id="cmip6plus", collection_id="institution_id"):
    print("Valid")
else:
    print("Invalid")
Valid
if ev.valid_term_in_collection(value=unvalid_string, project_id="cmip6plus", collection_id="institution_id"):
    print("Valid")
else:
    print("Invalid")
Invalid
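The two strings behave differently because validation compares the input with the DRS name of a term, not with its id. The sketch below mimics that rule on a hand-written list. The `TERMS` data and the `valid_drs_name` helper are hypothetical and only approximate the behaviour shown above; the real functions return MatchingTerm objects rather than ids.

```python
# Toy terms: each has an id and a DRS name, as in the outputs of this tutorial.
TERMS = [
    {"id": "ipsl", "drs_name": "IPSL"},
    {"id": "cmip", "drs_name": "CMIP"},
]


def valid_drs_name(value: str) -> list[str]:
    """Return the ids of the terms whose DRS name equals `value` exactly (case-sensitive)."""
    return [term["id"] for term in TERMS if term["drs_name"] == value]


print(valid_drs_name("IPSL"))        # ['ipsl']
print(valid_drs_name("ipsl"))        # []
print(bool(valid_drs_name("ipsl")))  # False: an empty list is falsy
```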

Queries based only on the project id#

ev.valid_term_in_project(value=valid_string, project_id="cmip6plus")
[MatchingTerm(project_id='cmip6plus', collection_id='institution_id', term_id='ipsl')]

Across all projects#

print(ev.valid_term_in_all_projects(value=valid_string))
print(ev.valid_term_in_all_projects(value=unvalid_string))
[MatchingTerm(project_id='cmip6', collection_id='institution_id', term_id='ipsl'), MatchingTerm(project_id='cmip6plus', collection_id='institution_id', term_id='ipsl')]
[]
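When validating across projects, it is often enough to know which projects accept the value. Here is a minimal sketch using a `namedtuple` stand-in with the same three fields as the MatchingTerm objects printed above; the `Match` stand-in and the `projects_accepting` helper are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same field names as the MatchingTerm objects.
Match = namedtuple("Match", ["project_id", "collection_id", "term_id"])


def projects_accepting(matches):
    """Return the sorted, de-duplicated project ids found in a list of matches."""
    return sorted({m.project_id for m in matches})


matches = [
    Match("cmip6", "institution_id", "ipsl"),
    Match("cmip6plus", "institution_id", "ipsl"),
]
print(projects_accepting(matches))  # ['cmip6', 'cmip6plus']
print(projects_accepting([]))       # []
```

Because the helper only reads the `project_id` attribute, it should work unchanged on any objects that expose that field.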