Create template component for data deploy jobs #538
base: master
Conversation
I really like your approach of having all extra jobs drive everything through their default.env! Looking really clean!
I had started a previous consolidation of both the new style (using components) and the old style (sourcing deploy_data_job.env) so that both would come from exactly the same source template job, see ae66ef1...24c3869. My approach is less clean than yours since I have a pre-docker-compose-up.include, in addition to default.env, for each new component.
I never got around to testing my attempt since all the back-compat var issues were blocking me. So do not assume the code there actually works, but the idea is there.
Since we are retaining deploy_data_job.env for backward compatibility, how about changing your optional-components/scheduler-job-deploy_data/pre-docker-compose-up.include to let deploy_data_job.env perform the actual job template generation (a rough sketch follows at the end of this comment)? This would not only avoid duplicate job definitions for all the "new component style" jobs but also ensure consistency with all the existing "source deploy_data_job.env" style jobs.
The deploy_data_job.env will have to be slightly modified to be usable by pre-docker-compose-up.include, as in my untested branch. But all the configurable var names should be unchanged so as not to break existing jobs.
Each new component default.env will look like https://github.com/bird-house/birdhouse-deploy/blob/2.14.0/birdhouse/components/scheduler/deploy_raven_testdata_to_thredds.env.
All specific mappings (e.g. SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_NAME to the DEPLOY_DATA_JOB_JOB_NAME used by deploy_data_job.env) are performed by optional-components/scheduler-job-deploy_data/pre-docker-compose-up.include.
All default values should be defined in deploy_data_job.env to keep the old style standalone, without having to enable optional-components/scheduler-job-deploy_data.
This way we have the best of both worlds:
- the new component style stays clean, with only default.env
- the old "source deploy_data_job.env" style is fully backward-compatible
- any future template or default value changes land in one single place for both old and new styles, avoiding inconsistency bugs between the styles.
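To make the idea concrete, here is a minimal sketch of what that pre-docker-compose-up.include could do; the loop, the eval-based indirection and the sourced path are illustrative assumptions, not the actual implementation:

```sh
# Hypothetical sketch only: map each enabled job id's
# SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_* vars onto the DEPLOY_DATA_JOB_* vars
# expected by deploy_data_job.env, then let that single template generate the job.
for JOB_ID in ${SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS}; do
    eval "DEPLOY_DATA_JOB_JOB_NAME=\"\${SCHEDULER_JOB_${JOB_ID}_DEPLOY_DATA_JOB_NAME}\""
    eval "DEPLOY_DATA_JOB_SCHEDULE=\"\${SCHEDULER_JOB_${JOB_ID}_DEPLOY_DATA_JOB_SCHEDULE}\""
    eval "DEPLOY_DATA_JOB_CONFIG=\"\${SCHEDULER_JOB_${JOB_ID}_DEPLOY_DATA_JOB_CONFIG_FILE}\""
    eval "DEPLOY_DATA_JOB_JOB_DESCRIPTION=\"\${SCHEDULER_JOB_${JOB_ID}_DEPLOY_DATA_JOB_COMMENT}\""
    # deploy_data_job.env fills in the remaining defaults (cache, log file, image)
    # and resets the DEPLOY_DATA_JOB_* vars afterwards, so looping is safe.
    . "${COMPOSE_DIR}/components/scheduler/deploy_data_job.env"
done
```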
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CHECKOUT_CACHE`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_LOG_FILENAME`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_SCHEDULE`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CONFIG_FILE`
Very nice interface!
birdhouse/optional-components/scheduler-job-deploy_raven_testdata/default.env
```diff
@@ -1,29 +1,20 @@
 export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_NAME=deploy_raven_testdata_to_thredds
 export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_COMMENT="Auto-deploy Raven testdata to Thredds for Raven tutorial notebooks."
 export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CHECKOUT_CACHE='${BIRDHOUSE_DATA_PERSIST_ROOT}/deploy_data_cache/deploy_raven_testdata_to_thredds'
```
The JOB_CHECKOUT_CACHE default is generated from JOB_NAME if unset, see birdhouse-deploy/birdhouse/components/scheduler/deploy_data_job.env, lines 39 to 43 in b5cd7f6:
```sh
# Location for local cache of git clone to save bandwidth and time from always
# re-cloning from scratch.
if [ -z "$DEPLOY_DATA_JOB_CHECKOUT_CACHE" ]; then
    DEPLOY_DATA_JOB_CHECKOUT_CACHE="${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${DEPLOY_DATA_JOB_JOB_NAME}"
fi
```
With your new commit ec9e810, I think this job-specific SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CHECKOUT_CACHE and the SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_LOG_FILENAME below can be deleted since they will derive from the default.
```sh
export SCHEDULER_JOB_DEPLOY_DATA_JOB_VERSION='19.03.6-git'
export SCHEDULER_JOB_DEPLOY_DATA_JOB_IMAGE='${SCHEDULER_JOB_DEPLOY_DATA_JOB_DOCKER}:${SCHEDULER_JOB_DEPLOY_DATA_JOB_VERSION}'

export SCHEDULER_JOB_DEPLOY_EXTRA_DOCKER_ARGS='$([ -n "$DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE" ] && echo "--volume ${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE}:${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE}:ro --env DEPLOY_DATA_GIT_SSH_IDENTITY_FILE=${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE} ")'
```
This JOB_GIT_SSH_IDENTITY_FILE can be set per job as well, because different private repos can potentially have different keys, see birdhouse-deploy/birdhouse/components/scheduler/deploy_data_job.env, lines 55 to 57 in b5cd7f6:
```sh
# Location of ssh private key for git clone over ssh, useful for private repos.
#DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE="/path/to/id_rsa"
#DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE=/home/vagrant/.ssh/id_rsa_git_ssh_read_only
```
This var is reset at the end, allowing a different key per job, see birdhouse-deploy/birdhouse/components/scheduler/deploy_data_job.env, lines 106 to 115 in b5cd7f6:
```sh
# Reset all config vars to prevent cross-contamination between successive invocations.
DEPLOY_DATA_JOB_SCHEDULE=""
DEPLOY_DATA_JOB_JOB_NAME=""
DEPLOY_DATA_JOB_CONFIG=""
DEPLOY_DATA_JOB_CHECKOUT_CACHE=""
DEPLOY_DATA_JOB_LOGFILE=""
DEPLOY_DATA_JOB_JOB_DESCRIPTION=""
DEPLOY_DATA_JOB_DOCKER_IMAGE=""
DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE=""
DEPLOY_DATA_JOB_DOCKER_RUN_EXTRA_OPTS=""
```
Now looking back at this list, DEPLOY_DATA_JOB_DOCKER_IMAGE also made it possible to have a different image per job, with a default if unset!
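To illustrate how the end-of-template reset enables per-job keys in the old style, an env.local could source the template once per job; all paths and names below are hypothetical:

```sh
# Hypothetical env.local fragment: two old-style jobs with different ssh keys.
DEPLOY_DATA_JOB_JOB_NAME="deploy_data_job_a"
DEPLOY_DATA_JOB_SCHEDULE="0 * * * *"
DEPLOY_DATA_JOB_CONFIG="/path/to/deploy-data-job-a.yml"
DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE="/path/to/id_rsa_job_a"
. ./components/scheduler/deploy_data_job.env  # generates job A, then resets vars

DEPLOY_DATA_JOB_JOB_NAME="deploy_data_job_b"
DEPLOY_DATA_JOB_SCHEDULE="30 * * * *"
DEPLOY_DATA_JOB_CONFIG="/path/to/deploy-data-job-b.yml"
DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE="/path/to/id_rsa_job_b"
. ./components/scheduler/deploy_data_job.env  # job B picks up its own key
```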
With your new commit ec9e810, I think this SCHEDULER_JOB_DEPLOY_EXTRA_DOCKER_ARGS is unused and can be deleted.
```sh
    --volume ${checkout_cache}:${checkout_cache}:rw
    --volume ${BIRDHOUSE_LOG_DIR}:${BIRDHOUSE_LOG_DIR}:rw
    --env DEPLOY_DATA_CHECKOUT_CACHE=${checkout_cache}
    --env DEPLOY_DATA_LOGFILE=${log_file_name:-"deploy-data-${name}.log"} ${SCHEDULER_JOB_DEPLOY_EXTRA_DOCKER_ARGS} ${extra_args}
```
Missing support for DEPLOY_DATA_JOB_DOCKER_RUN_EXTRA_OPTS, see https://github.com/bird-house/birdhouse-deploy/blob/b5cd7f6501793ea8e67f76d2208e02dc797ba031/birdhouse/components/scheduler/deploy_data_job.env#L99C81-L99C118 and birdhouse-deploy/birdhouse/components/scheduler/deploy_data_job.env, lines 59 to 63 in b5cd7f6:
```sh
# Docker run extra opts.
# 4 spaces in front of --env very important to respect.
#DEPLOY_DATA_JOB_DOCKER_RUN_EXTRA_OPTS="
#    --env ENV1=val1
#    --env ENV2=val2"
```
This functionality is supported with the SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_EXTRA_ARGS variable, but I'll change the name to SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_EXTRA_OPTIONS since you're right that they're options, not arguments.
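For instance, a sketch of the usage, assuming the renamed variable and the same multi-line format as the old DEPLOY_DATA_JOB_DOCKER_RUN_EXTRA_OPTS:

```sh
# Hypothetical env.local usage of the renamed per-job variable:
export SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_EXTRA_OPTIONS='
    --env ENV1=val1
    --env ENV2=val2'
```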
```sh
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_NAME=deploy_raven_testdata_to_thredds
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_COMMENT="Auto-deploy Raven testdata to Thredds for Raven tutorial notebooks."
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CHECKOUT_CACHE='${BIRDHOUSE_DATA_PERSIST_ROOT}/deploy_data_cache/deploy_raven_testdata_to_thredds'
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_LOG_FILENAME='deploy_raven_testdata_to_thredds.log'
```
JOB_LOG_FILENAME also has a default generated from JOB_NAME, see birdhouse-deploy/birdhouse/components/scheduler/deploy_data_job.env, lines 45 to 48 in b5cd7f6:
```sh
# Log file location. Default location under /var/log/birdhouse/ has built-in logrotate.
if [ -z "$DEPLOY_DATA_JOB_LOGFILE" ]; then
    DEPLOY_DATA_JOB_LOGFILE="${BIRDHOUSE_LOG_DIR}/${DEPLOY_DATA_JOB_JOB_NAME}.log"
fi
```
In the previous deploy_raven_testdata_to_thredds.env, only 4 vars needed to be set; the rest have generated defaults. This default.env should be similar. See birdhouse-deploy/birdhouse/components/scheduler/deploy_raven_testdata_to_thredds.env, lines 1 to 7 in 3a1fa93:
```sh
# Source this file in env.local before sourcing deploy_data_job.env.
# This will configure deploy_data_job.env.
DEPLOY_DATA_JOB_SCHEDULE="*/30 * * * *"  # UTC
DEPLOY_DATA_JOB_JOB_NAME="deploy_raven_testdata_to_thredds"
DEPLOY_DATA_JOB_CONFIG="${COMPOSE_DIR}/deployment/deploy-data-raven-testdata-to-thredds.yml"
DEPLOY_DATA_JOB_JOB_DESCRIPTION="Auto-deploy Raven testdata to Thredds for Raven tutorial notebooks."
```
Yup, this is the same: only the name, schedule and config file need to be specified.
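As a sketch, the minimal new-style default.env could then reduce to something like this (assuming the SCHEDULER_JOB_RAVEN_* names from this PR; derived defaults cover everything else):

```sh
# Sketch of a minimal default.env: only the three required vars are set;
# the checkout cache, log filename and docker image derive from defaults.
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_NAME=deploy_raven_testdata_to_thredds
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_SCHEDULE='*/30 * * * *'  # UTC
export SCHEDULER_JOB_RAVEN_DEPLOY_DATA_JOB_CONFIG_FILE='${COMPOSE_DIR}/deployment/deploy-data-raven-testdata-to-thredds.yml'
```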
I don't think that's a good idea; that script does something different by appending to the …
None of this will break existing jobs; everything is backwards compatible with how you're doing it at PAVICS right now. This will only apply to new jobs, or to existing jobs that you want to convert to the new method of defining them as optional components.
Already done
Already done
I disagree: the old style can be kept for backwards compatibility for PAVICS, but the old style is deprecated and shouldn't be encouraged for any new jobs.
Why would it not play nicely with the new framework? The new framework does not use … Both frameworks will generate exactly the same boilerplate, so wouldn't it mean less code duplication and fewer debugging errors to have one single source of boilerplate instead of 2?
I was referring to "keep some config var names" when adapting …
Agreed, and since docs and examples of the old style are removed, we are absolutely not encouraging new jobs. My fear is that all existing jobs might suddenly break if we make a change in the new-style boilerplate and forget to make the corresponding update to the old-style boilerplate. By having one boilerplate instead of 2 identical ones, we will save ourselves this trouble in the future.
tlvu left a comment:
Quick review after your recent changes. This is exactly what I was trying to avoid: developing another identical boilerplate and having to fix bugs (wrong defaults) and regressions (missing features) while the existing boilerplate was already battle-tested for 4 years. Had we reused the existing boilerplate, we wouldn't have wrong defaults and missing features.
I was going to submit this kind of PR myself because I need it for all my external jobs relying on deploy-data.
I am very grateful you started this, and I very much like your approach where the new components only need default.env.
I can take over this PR if you don't mind. I am in a much better position to test this and to ensure existing features and defaults do not break for all the existing jobs.
| [ -z "$config_file" ] && log ERROR "$(sed $error_msg | 's/XXX/config_file/')" && return 1 | ||
|
|
||
| comment="${comment:-"${name}"}" | ||
| checkout_cache="${checkout_cache:-"${name}"}" |
The default for checkout_cache is not just $name; it should be ${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${name}, see birdhouse-deploy/birdhouse/components/scheduler/deploy_data_job.env, lines 39 to 43 in b5cd7f6:
```sh
# Location for local cache of git clone to save bandwidth and time from always
# re-cloning from scratch.
if [ -z "$DEPLOY_DATA_JOB_CHECKOUT_CACHE" ]; then
    DEPLOY_DATA_JOB_CHECKOUT_CACHE="${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${DEPLOY_DATA_JOB_JOB_NAME}"
fi
```
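A one-line sketch of the suggested fix, using the PR's lower-case variable names:

```sh
# Sketch: derive the cache path the same way the old template does.
checkout_cache="${checkout_cache:-"${BIRDHOUSE_DATA_PERSIST_ROOT:-/data}/deploy_data_cache/${name}"}"
```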
```sh
checkout_cache="${checkout_cache:-"${name}"}"
image="${image:-"${SCHEDULER_JOB_DEPLOY_DATA_JOB_IMAGE}"}"
git_ssh_id_file="${git_ssh_id_file:-"${DEPLOY_DATA_JOB_GIT_SSH_IDENTITY_FILE}"}"
log_file_name="${log_file_name:-"${name}.log"}"
```
The log_file_name default should be ${BIRDHOUSE_LOG_DIR}/${name}.log, see birdhouse-deploy/birdhouse/components/scheduler/deploy_data_job.env, lines 45 to 48 in b5cd7f6:
```sh
# Log file location. Default location under /var/log/birdhouse/ has built-in logrotate.
if [ -z "$DEPLOY_DATA_JOB_LOGFILE" ]; then
    DEPLOY_DATA_JOB_LOGFILE="${BIRDHOUSE_LOG_DIR}/${DEPLOY_DATA_JOB_JOB_NAME}.log"
fi
```
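The corresponding one-line sketch of the fix, same assumptions as above:

```sh
# Sketch: default the log file into BIRDHOUSE_LOG_DIR like the old template.
log_file_name="${log_file_name:-"${BIRDHOUSE_LOG_DIR}/${name}.log"}"
```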
Sounds good. You can take over, please.
Overview
New data deploy scheduler jobs no longer need to copy/paste lots of boilerplate code to create a new job.
Instead, they can simply define a few specific environment variables; the `optional-components/scheduler-job-deploy_data` component will automatically generate a new data deploy job.
For example, if `XXXX` is added to the `SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS` variable and the following variables are defined:
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_NAME`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_COMMENT`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CHECKOUT_CACHE`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_LOG_FILENAME`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_SCHEDULE`
- `SCHEDULER_JOB_XXXX_DEPLOY_DATA_JOB_CONFIG_FILE`

a deploy data job will be automatically created.
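As a concrete illustration, an env.local fragment along these lines could enable such a job (the job id `MYDATA` and all values below are made up for the example):

```sh
# Illustrative only: register a hypothetical job id "MYDATA" and configure it.
export SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS="${SCHEDULER_JOB_DEPLOY_DATA_JOB_IDS} MYDATA"
export SCHEDULER_JOB_MYDATA_DEPLOY_DATA_JOB_NAME=deploy_my_data
export SCHEDULER_JOB_MYDATA_DEPLOY_DATA_JOB_COMMENT="Auto-deploy my data."
export SCHEDULER_JOB_MYDATA_DEPLOY_DATA_JOB_SCHEDULE='0 * * * *'  # UTC
export SCHEDULER_JOB_MYDATA_DEPLOY_DATA_JOB_CONFIG_FILE='${COMPOSE_DIR}/deployment/deploy-data-my-data.yml'
# CHECKOUT_CACHE and LOG_FILENAME are left unset to use the derived defaults.
```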
See `optional-components/scheduler-job-deploy_raven_testdata/default.env` for an example. See `birdhouse/deployment/deploy-data` for details on how the deploy data job works.
Changes
Non-breaking changes
Breaking changes
Related Issue / Discussion
Additional Information
CI Operations
birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false