
Conversation

@mishaschwartz
Collaborator

Overview

New data deploy scheduler jobs no longer need lots of copy/pasted boilerplate code. Instead, they can
simply define specific environment variables and the optional-components/scheduler-job-deploy_data
component will automatically generate the new data deploy job.

For example, if XXXX is added to the SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS variable and the following
variables are defined:

  • SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_NAME
  • SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_COMMENT
  • SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CHECKOUT_CACHE
  • SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_LOG_FILENAME
  • SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_SCHEDULE
  • SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CONFIG_FILE

a deploy data job will be automatically created.
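For example, a hypothetical default.env for an XXXX job id could look like the following (the job name, schedule, config path, and the way XXXX is appended to the job ids list are illustrative assumptions, not actual repository contents):

# Hypothetical sketch only: values and the append mechanism below are assumptions.
export SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS="${SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS} XXXX"
export SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_NAME=deploy_example_data
export SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_COMMENT="Auto-deploy example data."
export SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_SCHEDULE='30 * * * *'
export SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CONFIG_FILE='${COMPOSE_DIR}/deployment/deploy-example-data.yml'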

See optional-components/scheduler-job-deploy_raven_testdata/default.env for an example.

See birdhouse/deployment/deploy-data for details on how the deploy data job works.

Changes

Non-breaking changes

  • Reconfigures deploy data jobs for scheduler components

Breaking changes

Related Issue / Discussion

Additional Information

CI Operations

birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false

@github-actions bot added the documentation label May 27, 2025
Base automatically changed from configurable-crontab to master May 27, 2025 15:10
Collaborator

@tlvu left a comment


I really like your approach of having all extra jobs drive everything through their default.env! It looks really clean!

I had started a previous consolidation of both the new style (using components) and the old style (sourcing deploy_data_job.env) so that both would come from exactly the same source template job, see ae66ef1...24c3869. My approach is less clean than yours since I have a pre-docker-compose-up.include, in addition to default.env, for each new component.

I never got around to testing my attempt since all the back-compat var issues were blocking me, so do not assume the code there actually works, but the idea is there.

Since we are retaining deploy_data_job.env for backward compatibility, how about changing your optional-components/scheduler-job-deploy_data/pre-docker-compose-up.include to let deploy_data_job.env perform the actual job template generation? This will not only avoid a duplicate job definition for all the new-component-style jobs but also ensure consistency with all the existing "source deploy_data_job.env" style jobs.

deploy_data_job.env will have to be slightly modified to be usable from pre-docker-compose-up.include, as in my untested branch, but all the configurable var names should remain unchanged so existing jobs do not break.

Each new component default.env will look like https://github.com/bird-house/birdhouse-deploy/blob/2.14.0/birdhouse/components/scheduler/deploy_raven_testdata_to_thredds.env.

All specific mappings (e.g. SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_NAME to the DEPLOY_DATA_JOB_JOB_NAME used by deploy_data_job.env) are performed by optional-components/scheduler-job-deploy_data/pre-docker-compose-up.include.

All default values should be defined in deploy_data_job.env to keep the old style standalone, without having to enable optional-components/scheduler-job-deploy_data.
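For illustration, the mapping inside pre-docker-compose-up.include could look something like this (untested sketch, like my branch; the loop structure and the deploy_data_job.env path are assumptions):

# Untested sketch of the proposed mapping, illustrative only.
for job_id in ${SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS}; do
    # Map the new-style vars onto the old-style vars expected by deploy_data_job.env.
    eval "DEPLOY_DATA_JOB_JOB_NAME=\"\${SCHEDULER_JOB_${job_id}_DEPLOY_DATA_JOB_NAME}\""
    eval "DEPLOY_DATA_JOB_JOB_DESCRIPTION=\"\${SCHEDULER_JOB_${job_id}_DEPLOY_DATA_JOB_COMMENT}\""
    eval "DEPLOY_DATA_JOB_SCHEDULE=\"\${SCHEDULER_JOB_${job_id}_DEPLOY_DATA_JOB_SCHEDULE}\""
    eval "DEPLOY_DATA_JOB_CONFIG=\"\${SCHEDULER_JOB_${job_id}_DEPLOY_DATA_JOB_CONFIG_FILE}\""
    # Let the existing template generate the actual job boilerplate.
    . "${COMPOSE_DIR}/components/scheduler/deploy_data_job.env"
done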

This way we have the best of both worlds:

  1. the new component style is clean with only default.env
  2. the old "source deploy_data_job.env" style stays fully backward-compatible
  3. any future template or default value changes happen in one single place for both old and new styles, avoiding inconsistency bugs between them.

- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CHECKOUT_CACHE`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_LOG_FILENAME`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_SCHEDULE`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CONFIG_FILE`
Collaborator


Very nice interface!

@@ -1,29 +1,20 @@
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_NAME=deploy_raven_testdata_to_thredds
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_COMMENT="Auto-deploy Raven testdata to Thredds for Raven tutorial notebooks."
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CHECKOUT_CACHE='${BIRDHOUSE_DATA_PERSIST_ROOT}/deploy_data_cache/deploy_raven_testdata_to_thredds'
Collaborator


JOB_CHECKOUT_CACHE has a default generated from JOB_NAME if unset, see

# Location for local cache of git clone to save bandwidth and time from always
# re-cloning from scratch.
if [ -z "$DEPLOY_DATA_JOB_CHECKOUT_CACHE" ]; then
DEPLOY_DATA_JOB_CHECKOUT_CACHE="${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${DEPLOY_DATA_JOB_JOB_NAME}"
fi

Collaborator


With your new commit ec9e810, I think this job-specific SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CHECKOUT_CACHE and the SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_LOG_FILENAME below can be deleted since they will derive from the defaults.

export SCHEDULER_JOB_DEPLOY_DATA_JOB_VERSION='19.03.6-git'
export SCHEDULER_JOB_DEPLOY_DATA_JOB_IMAGE='${SCHEDULER_JOB_DEPLOY_DATA_JOB_DOCKER}:${SCHEDULER_JOB_DEPLOY_DATA_JOB_VERSION}'

export SCHEDULER_JOB_DEPLOY_EXTRA_DOCKER_ARGS='$([ -n "$DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE" ] && echo "--volume ${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE}:${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE}:ro --env DEPLOY_DATA_GIT_SSH_IDENTITY_FILE=${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE} ")'
Collaborator


This JOB_GIT_SSH_IDENTITY_FILE can be set per job as well, because different private repos can potentially have different keys, see

# Location of ssh private key for git clone over ssh, useful for private repos.
#DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE="/path/to/id_rsa"
#DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE=/home/vagrant/.ssh/id_rsa_git_ssh_read_only

This var is reset at the end, allowing a different key per job, see

# Reset all config vars to prevent cross-contamination between successive invocations.
DEPLOY_DATA_JOB_SCHEDULE=""
DEPLOY_DATA_JOB_JOB_NAME=""
DEPLOY_DATA_JOB_CONFIG=""
DEPLOY_DATA_JOB_CHECKOUT_CACHE=""
DEPLOY_DATA_JOB_LOGFILE=""
DEPLOY_DATA_JOB_JOB_DESCRIPTION=""
DEPLOY_DATA_JOB_DOCKER_IMAGE=""
DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE=""
DEPLOY_DATA_JOB_DOCKER_RUN_EXTRA_OPTS=""

Now looking back at this list, DEPLOY_DATA_JOB_DOCKER_IMAGE could also be set to a different image per job, with a default if unset!
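For example, two old-style jobs in env.local could each use their own key, and one could override the image (the paths, job names and image tag below are made up; schedule and config vars are omitted for brevity):

# Illustrative env.local excerpt: per-job SSH key and docker image (other required vars omitted).
DEPLOY_DATA_JOB_JOB_NAME="deploy_private_repo_a"
DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE="/path/to/id_rsa_repo_a"
. ./components/scheduler/deploy_data_job.env

DEPLOY_DATA_JOB_JOB_NAME="deploy_private_repo_b"
DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE="/path/to/id_rsa_repo_b"
DEPLOY_DATA_JOB_DOCKER_IMAGE="docker:20-git"  # override the default image for this job only
. ./components/scheduler/deploy_data_job.env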

Collaborator


With your new commit ec9e810, I think this SCHEDULER_JOB_DEPLOY_EXTRA_DOCKER_ARGS is unused and can be deleted.

--volume ${checkout_cache}:${checkout_cache}:rw
--volume ${BIRDHOUSE_LOG_DIR}:${BIRDHOUSE_LOG_DIR}:rw
--env DEPLOY_DATA_CHECKOUT_CACHE=${checkout_cache}
--env DEPLOY_DATA_LOGFILE=${log_file_name:-"deploy-data-${name}.log"} ${SCHEDULER_JOB_DEPLOY_EXTRA_DOCKER_ARGS} ${extra_args}
Collaborator


Missing support for DEPLOY_DATA_JOB_DOCKER_RUN_EXTRA_OPTS, see https://github.com/bird-house/birdhouse-deploy/blob/b5cd7f6501793ea8e67f76d2208e02dc797ba031/birdhouse/components/scheduler/deploy_data_job.env#L99C81-L99C118

and

# Docker run extra opts.
# 4 spaces in front of --env very important to respect.
#DEPLOY_DATA_JOB_DOCKER_RUN_EXTRA_OPTS="
#    --env ENV1=val1
#    --env ENV2=val2"

Collaborator Author


This functionality is supported with the SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_EXTRA_ARGS variable, but I'll change the name to SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_EXTRA_OPTIONS since you're right that they're options, not arguments.
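For example, the renamed variable would be used like this (illustrative values only):

# Illustrative only: extra docker run options for a hypothetical XXXX job.
export SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_EXTRA_OPTIONS='
    --env ENV1=val1
    --env ENV2=val2'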

export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_NAME=deploy_raven_testdata_to_thredds
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_COMMENT="Auto-deploy Raven testdata to Thredds for Raven tutorial notebooks."
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CHECKOUT_CACHE='${BIRDHOUSE_DATA_PERSIST_ROOT}/deploy_data_cache/deploy_raven_testdata_to_thredds'
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_LOG_FILENAME='deploy_raven_testdata_to_thredds.log'
Collaborator


JOB_LOG_FILENAME also has a default generated from JOB_NAME, see

# Log file location. Default location under /var/log/birdhouse/ has built-in logrotate.
if [ -z "$DEPLOY_DATA_JOB_LOGFILE" ]; then
DEPLOY_DATA_JOB_LOGFILE="${BIRDHOUSE_LOG_DIR}/${DEPLOY_DATA_JOB_JOB_NAME}.log"
fi

In the previous deploy_raven_testdata_to_thredds.env, only 4 vars need to be set; the rest have generated defaults. This default.env should be similar. See

# Source this file in env.local before sourcing deploy_data_job.env.
# This will configure deploy_data_job.env.
DEPLOY_DATA_JOB_SCHEDULE="*/30 * * * *" # UTC
DEPLOY_DATA_JOB_JOB_NAME="deploy_raven_testdata_to_thredds"
DEPLOY_DATA_JOB_CONFIG="${COMPOSE_DIR}/deployment/deploy-data-raven-testdata-to-thredds.yml"
DEPLOY_DATA_JOB_JOB_DESCRIPTION="Auto-deploy Raven testdata to Thredds for Raven tutorial notebooks."

Collaborator Author


Yup this is the same: only name, schedule and config file need to be specified.
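So the new-style default.env boils down to roughly this (sketch; the actual file in the PR may differ slightly):

export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_NAME=deploy_raven_testdata_to_thredds
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_SCHEDULE='*/30 * * * *'
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CONFIG_FILE='${COMPOSE_DIR}/deployment/deploy-data-raven-testdata-to-thredds.yml'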

@mishaschwartz
Collaborator Author

Since we are retaining deploy_data_job.env for backward compatibility, how about changing your optional-components/scheduler-job-deploy_data/pre-docker-compose-up.include to leave deploy_data_job.env to perform the actual job template generation?

I don't think that's a good idea: that script does something different by appending to the BIRDHOUSE_AUTODEPLOY_EXTRA_SCHEDULER_JOBS variable, and I don't think that will play nicely with the new configurable cron job framework. I'm happy to leave it there so that you can still use it to build data deploy jobs like you used to, though.

But all the configurable var names should be unchanged to not break existing jobs.

None of this will break existing jobs; everything is backwards compatible with how you're doing it at PAVICS right now. This will only apply to new jobs or to existing jobs that you want to convert to the new method of defining them as optional components.

This way we have the best of both worlds:

  1. the new component style is clean with only default.env

Already done

  2. the old "source deploy_data_job.env" style stays fully backward-compatible

Already done

  3. any future template or default value changes happen in one single place for both old and new styles, avoiding inconsistency bugs between them.

I disagree: the old style can be kept for backwards compatibility for PAVICS, but it is deprecated and shouldn't be encouraged for any new jobs.

@tlvu
Collaborator

tlvu commented May 28, 2025

that script does something different by appending to the BIRDHOUSE_AUTODEPLOY_EXTRA_SCHEDULER_JOBS variable and I don't think that will play nicely with the new configurable cron job framework.

Why would it not play nicely with the new framework? The new framework does not use the BIRDHOUSE_AUTODEPLOY_EXTRA_SCHEDULER_JOBS variable, so no conflict there, and it writes all jobs to a different file, optional-components/scheduler-job-deploy_data/config.yml, so again no conflict with the previous framework.

Both frameworks will generate exactly the same boilerplate, so wouldn't it mean less code duplication and fewer debugging errors to have one single source of boilerplate instead of two?

But all the configurable var names should be unchanged to not break existing jobs.

None of this will break existing jobs, everything is backwards compatible with how you're doing it at PAVICS right now. This will only apply to new jobs or existing jobs that you want to convert to using the new method of defining them as optional components.

I was referring to keeping some config var names when adapting deploy_data_job.env to work with optional-components/scheduler-job-deploy_data/pre-docker-compose-up.include, so as not to break compatibility with existing usage of deploy_data_job.env. I did not mean the new framework also needs to use the same config var names.

the old style is deprecated and shouldn't be encouraged for any new jobs.

Agreed, and since the docs and examples of the old style are removed, we are absolutely not encouraging new jobs to use it. My fear is that all existing jobs might suddenly break if we make a change in the new-style boilerplate and forget to make the corresponding update to the old-style boilerplate. By having one boilerplate instead of 2 identical ones, we will save ourselves this trouble in the future.

Collaborator

@tlvu left a comment


Quick review after your recent changes. This is exactly what I was trying to avoid: developing another identical boilerplate and having to fix bugs (wrong defaults) and regressions (missing features) while the existing boilerplate has already been battle-tested for 4 years. Had we reused the existing boilerplate, we wouldn't have wrong defaults and missing features.

I was going to submit this kind of PR myself because I need it for all my external jobs relying on deploy-data.

I am very grateful you started this and I do like very much your approach that the new components only need default.env.

I can take over this PR if you don't mind. I am in a much better position to test this and to ensure existing features and defaults do not break for all the existing jobs.

[ -z "$config_file" ] && log ERROR "$(sed $error_msg | 's/XXX/config_file/')" && return 1

comment="${comment:-"${name}"}"
checkout_cache="${checkout_cache:-"${name}"}"
Collaborator


The default for checkout_cache is not just $name, it should be ${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${name}, see

# Location for local cache of git clone to save bandwidth and time from always
# re-cloning from scratch.
if [ -z "$DEPLOY_DATA_JOB_CHECKOUT_CACHE" ]; then
DEPLOY_DATA_JOB_CHECKOUT_CACHE="${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${DEPLOY_DATA_JOB_JOB_NAME}"
fi
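i.e. the default in the new script would presumably become something like this (sketch of the suggested fix, not the actual change):

# Sketch of the suggested default (illustrative, not the committed fix):
checkout_cache="${checkout_cache:-"${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${name}"}"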

checkout_cache="${checkout_cache:-"${name}"}"
image="${image:-"${SCHEDULER_JOB_DEPLOY_DATA_JOB_IMAGE}"}"
git_ssh_id_file="${git_ssh_id_file:-"${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE}"}"
log_file_name="${log_file_name:-"${name}.log"}"
Collaborator


log_file_name default should be ${BIRDHOUSE_LOG_DIR}/${name}.log, see

# Log file location. Default location under /var/log/birdhouse/ has built-in logrotate.
if [ -z "$DEPLOY_DATA_JOB_LOGFILE" ]; then
DEPLOY_DATA_JOB_LOGFILE="${BIRDHOUSE_LOG_DIR}/${DEPLOY_DATA_JOB_JOB_NAME}.log"
fi
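i.e. presumably something like this (sketch of the suggested fix, not the actual change):

# Sketch of the suggested default (illustrative, not the committed fix):
log_file_name="${log_file_name:-"${BIRDHOUSE_LOG_DIR}/${name}.log"}"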

@mishaschwartz
Collaborator Author

I can take over this PR if you don't mind. I am in a much better position to test this and to ensure existing features and defaults are not breaking for all the existing jobs.

Sounds good. You can take over please
