Add lvk-alerts-to-storage module #290
base: develop
New file: Dockerfile (+21 lines)

```dockerfile
# Use the official lightweight Python image.
# https://hub.docker.com/_/python
FROM python:3.12-slim

# Allow statements and log messages to immediately appear in the Knative logs
ENV PYTHONUNBUFFERED True

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install --no-cache-dir -r requirements.txt

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
# Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling.
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
```
New file: cloudbuild.yaml (+29 lines)

```yaml
# https://cloud.google.com/build/docs/deploying-builds/deploy-cloud-run
# containerize the module and deploy it to Cloud Run
steps:
  # Build the image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_MODULE_IMAGE_NAME}', '.']
  # Push the image to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_MODULE_IMAGE_NAME}']
  # Deploy image to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args: ['run', 'deploy', '${_MODULE_NAME}', '--image', '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_MODULE_IMAGE_NAME}', '--region', '${_REGION}', '--set-env-vars', '${_ENV_VARS}']
images:
  - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_MODULE_IMAGE_NAME}'
substitutions:
  _SURVEY: 'lvk'
  _TESTID: 'testid'
  _MODULE_NAME: '${_SURVEY}-alerts-to-storage-${_TESTID}'
  _MODULE_IMAGE_NAME: 'gcr.io/${PROJECT_ID}/${_REPOSITORY}/${_MODULE_NAME}'
  _REPOSITORY: 'cloud-run-services'
  # cloud functions automatically sets the projectid env var using the name "GCP_PROJECT"
  # use the same name here for consistency
  # [TODO] PROJECT_ID is set in setup.sh. this is confusing and we should revisit the decision.
  # i (Raen) think i didn't make it a substitution because i didn't want to set a default for it.
  _ENV_VARS: 'GCP_PROJECT=${PROJECT_ID},SURVEY=${_SURVEY},TESTID=${_TESTID},VERSIONTAG=${_VERSIONTAG}'
  _REGION: 'us-central1'
options:
  dynamic_substitutions: true
```
New file: deployment script (+93 lines)

```bash
#! /bin/bash
# Deploys or deletes broker Cloud Run service
# This script will not delete Cloud Run services that are in production

# "False" uses production resources
# any other string will be appended to the names of all resources
testid="${1:-test}"
# "True" tears down/deletes resources, else setup
teardown="${2:-False}"
# name of the survey this broker instance will ingest
survey="${3:-lvk}"
region="${4:-us-central1}"
versiontag="${5:-v1_0}"
# get the environment variable
PROJECT_ID=$GOOGLE_CLOUD_PROJECT

MODULE_NAME="alerts-to-storage"  # lower case required by cloud run
ROUTE_RUN="/"  # url route that will trigger main.run()

define_GCP_resources() {
    local base_name="$1"
    local separator="${2:--}"
    local testid_suffix=""

    if [ "$testid" != "False" ] && [ -n "$testid" ]; then
        testid_suffix="${separator}${testid}"
    fi
    echo "${base_name}${testid_suffix}"
}

#--- GCP resources used in this script
artifact_registry_repo=$(define_GCP_resources "${survey}-cloud-run-services")
cr_module_name=$(define_GCP_resources "${survey}-${MODULE_NAME}")  # lower case required by cloud run
gcs_json_bucket=$(define_GCP_resources "${PROJECT_ID}-${survey}_alerts")
ps_deadletter_topic=$(define_GCP_resources "${survey}-deadletter")
ps_input_subscrip=$(define_GCP_resources "${survey}-alerts_raw")  # pub/sub subscription used to trigger cloud run module
ps_topic_alert_in_bucket=$(define_GCP_resources "projects/${PROJECT_ID}/topics/${survey}-alert_in_bucket")
ps_trigger_topic=$(define_GCP_resources "${survey}-alerts_raw")
runinvoker_svcact="cloud-run-invoker@${PROJECT_ID}.iam.gserviceaccount.com"

if [ "${teardown}" = "True" ]; then
    # ensure that we do not teardown production resources
    if [ "${testid}" != "False" ]; then
        echo
        echo "Deleting resources for ${MODULE_NAME} module..."
        gsutil rm -r "gs://${gcs_json_bucket}"
        gcloud pubsub subscriptions delete "${ps_input_subscrip}"
        gcloud pubsub topics delete "${ps_topic_alert_in_bucket}"
        gcloud run services delete "${cr_module_name}" --region "${region}"
    fi

else
    echo
    echo "Creating gcs_json_bucket and uploading files..."

    if ! gsutil ls -b "gs://${gcs_json_bucket}" >/dev/null 2>&1; then
        gsutil mb -b on -l "${region}" "gs://${gcs_json_bucket}"
        gsutil uniformbucketlevelaccess set on "gs://${gcs_json_bucket}"
        gsutil requesterpays set on "gs://${gcs_json_bucket}"
        gcloud storage buckets add-iam-policy-binding "gs://${gcs_json_bucket}" \
            --member="allUsers" \
            --role="roles/storage.objectViewer"
    else
        echo "${gcs_json_bucket} already exists."
    fi

    echo
    echo "Configuring Pub/Sub notifications on GCS bucket..."
    trigger_event=OBJECT_FINALIZE
    format=json  # json or none; if json, file metadata sent in message body
    gsutil notification create \
        -t "$ps_topic_alert_in_bucket" \
        -e "$trigger_event" \
        -f "$format" \
        "gs://${gcs_json_bucket}"

    #--- Deploy Cloud Run service
    echo
    echo "Creating container image for ${MODULE_NAME} module and deploying to Cloud Run..."
    moduledir="."  # assumes deploying what's in our current directory
    config="${moduledir}/cloudbuild.yaml"
    url=$(gcloud builds submit --config="${config}" \
        --substitutions="_SURVEY=${survey},_TESTID=${testid},_MODULE_NAME=${cr_module_name},_REPOSITORY=${artifact_registry_repo},_VERSIONTAG=${versiontag}" \
        "${moduledir}" | sed -n 's/^Step #2: Service URL: \(.*\)$/\1/p')

    echo
    echo "Creating trigger subscription for ${MODULE_NAME} Cloud Run service..."
    gcloud pubsub subscriptions create "${ps_input_subscrip}" \
        --topic "${ps_trigger_topic}" \
        --topic-project "${PROJECT_ID}" \
        --ack-deadline=600 \
        --push-endpoint="${url}${ROUTE_RUN}" \
        --push-auth-service-account="${runinvoker_svcact}" \
        --dead-letter-topic="${ps_deadletter_topic}" \
        --max-delivery-attempts=5
fi
```
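Since the trigger subscription pushes each Pub/Sub message to the service's HTTP endpoint, one way to exercise the module before wiring up Pub/Sub is to POST a hand-built push envelope at it. Below is a minimal sketch (not part of this PR), assuming the service is running locally on port 8080, that the `requests` library is available, and that a fabricated base64-encoded alert payload is enough for `pittgoogle.Alert.from_cloud_run` to parse; the envelope fields follow the standard Pub/Sub push format.

```python
import base64
import json

import requests  # assumed available in the test environment

# Fabricated LVK-like alert for illustration only.
fake_alert = json.dumps(
    {"superevent_id": "S240101ab", "alert_type": "PRELIMINARY", "time_created": "2024-01-01T00:00:00Z"}
).encode("utf-8")

# Standard Pub/Sub push envelope: the alert bytes go in "message.data" (base64),
# and "messageId" is what main.py records as file_origin_message_id.
envelope = {
    "message": {
        "data": base64.b64encode(fake_alert).decode("utf-8"),
        "messageId": "1234567890",
        "attributes": {},
    },
    "subscription": "projects/my-project/subscriptions/lvk-alerts_raw-mytestid",
}

# ROUTE_RUN is "/", so POST the envelope to the service root.
resp = requests.post("http://localhost:8080/", json=envelope)
print(resp.status_code)  # expect 204 on success, 400 if the envelope can't be parsed
```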
New file: main.py (+104 lines)

```python
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

"""This module stores LVK alert data as a JSON file in Cloud Storage."""

import os
import flask
import pittgoogle
from google.cloud import logging, storage
from google.cloud.exceptions import PreconditionFailed

# [FIXME] Make this helpful or else delete it.
# Connect the python logger to the google cloud logger.
# By default, this captures INFO level and above.
# pittgoogle uses the python logger.
# We don't currently use the python logger directly in this script, but we could.
logging.Client().setup_logging()

PROJECT_ID = os.getenv("GCP_PROJECT")
TESTID = os.getenv("TESTID")
SURVEY = os.getenv("SURVEY")
VERSIONTAG = os.getenv("VERSIONTAG")

# Variables for incoming data
# A url route is used in setup.sh when the trigger subscription is created.
# It is possible to define multiple routes in a single module and trigger them using different subscriptions.
ROUTE_RUN = "/"  # HTTP route that will trigger run(). Must match deploy.sh

# Variables for outgoing data
HTTP_204 = 204  # HTTP code: Success
HTTP_400 = 400  # HTTP code: Bad Request

# GCP resources used in this module
TOPIC_ALERTS = pittgoogle.Topic.from_cloud(
    "alerts", survey=SURVEY, testid=TESTID, projectid=PROJECT_ID
)
bucket_name = f"{PROJECT_ID}-{SURVEY}_alerts"
if TESTID != "False":
    bucket_name = f"{bucket_name}-{TESTID}"

client = storage.Client()
bucket = client.get_bucket(client.bucket(bucket_name, user_project=PROJECT_ID))

app = flask.Flask(__name__)


@app.route(ROUTE_RUN, methods=["POST"])
def run():
    """Uploads alert data to a GCS bucket. Publishes a de-duplicated JSON-serialized "alerts" stream
    (${survey}-alerts) containing the original alert bytes. A BigQuery subscription is used to write
    alert data to the appropriate BigQuery table.

    This module is intended to be deployed as a Cloud Run service. It will operate as an HTTP endpoint
    triggered by Pub/Sub messages. This function will be called once for every message sent to this route.
    It should accept the incoming HTTP request and return a response.

    Returns
    -------
    response : tuple(str, int)
        Tuple containing the response body (string) and HTTP status code (int). Flask will convert the
        tuple into a proper HTTP response. Note that the response is a status message for the web server.
    """
    # extract the envelope from the request that triggered the endpoint
    # this contains a single Pub/Sub message with the alert to be processed
    envelope = flask.request.get_json()
    try:
        alert = pittgoogle.Alert.from_cloud_run(envelope, "lvk")
    except pittgoogle.exceptions.BadRequest as exc:
        return str(exc), HTTP_400

    blob = bucket.blob(_name_in_bucket(alert))
    blob.metadata = _create_file_metadata(alert, event_id=envelope["message"]["messageId"])

    # raise a PreconditionFailed exception if filename already exists in the bucket using "if_generation_match=0"
    try:
        blob.upload_from_string(alert.msg.data, if_generation_match=0)
    except PreconditionFailed:
        # this alert is a duplicate. drop it.
        return "", HTTP_204

    # publish the same alert as JSON
    TOPIC_ALERTS.publish(alert)

    return "", HTTP_204


def _create_file_metadata(alert: pittgoogle.Alert, event_id: str) -> dict:
    """Return key/value pairs to be attached to the file as metadata."""
    # https://git.ligo.org/emfollow/igwn-gwalert-schema/-/blob/main/igwn.alerts.v1_0.Alert.schema.json
    metadata = {"file_origin_message_id": event_id}
    metadata["_".join("time_created")] = alert.dict["time_created"]
    metadata["_".join("alert_type")] = alert.dict["alert_type"]
    metadata["_".join("id")] = alert.dict["superevent_id"]
```

(The remainder of main.py, including the `_name_in_bucket()` helper, is collapsed in this diff view.)
Suggested change (from review):

```diff
-    metadata["_".join("time_created")] = alert.dict["time_created"]
-    metadata["_".join("alert_type")] = alert.dict["alert_type"]
-    metadata["_".join("id")] = alert.dict["superevent_id"]
+    metadata["time_created"] = alert.dict["time_created"]
+    metadata["alert_type"] = alert.dict["alert_type"]
+    metadata["superevent_id"] = alert.dict["superevent_id"]
```
"_".join("time_created") == "t_i_m_e___c_r_e_a_t_e_d", obviously not what's intended.- In general, use the survey's field name. In particular, "id" is way too vague for a name.
In terms of other metadata would could add, the only other option I see is their "urls" field and I am not inclined to add that here. All of their other fields depend on the type of alert so we can't rely on them being available for this. I suppose we could add a try/except to deal with that but I think it's not worth the effort right now.
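For anyone puzzled by the bug, a quick interpreter check (not part of the PR) shows what `str.join` does when handed a bare string — it iterates over the characters:

```python
>>> "_".join("time_created")  # join iterates over the string's characters
't_i_m_e___c_r_e_a_t_e_d'
>>> # the intended behavior is simply to use the field name itself as the key
>>> metadata = {}
>>> metadata["time_created"] = "2024-01-01T00:00:00Z"  # hypothetical value for illustration
```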
Thanks for pointing this out. What we decided on zoom was:

`f"{VERSIONTAG}/{_id}/{_alert_type}-{_date}.json"`

BUT, I went to patch our client `Alert.name_in_bucket` and realized that would be impossible. These alert packets do not contain the version, so there is no way to construct the name_in_bucket using only the alert data. I don't know the best solution here. Let's think and discuss. Options off the top of my head are:

1. Our pipeline knows the schema version, so this module could simply add it to the alert packet. We'd need to propagate that all the way through in pubsub messages, bucket objects, and bigquery schemas. LVK alerts from us would be different than from LVK itself or any other broker. That may occasionally be a pain but not a huge problem, I think. I haven't fully thought things through but don't see any other problems with this at the moment. So this is seeming like the best option.
2. Maybe there's a way to get the consumer VM to add the schema version to the metadata (it does not open the alert packets, so it cannot add to the actual data). A downside is that it's relatively easy for data and metadata to get disconnected. Also, I am very reluctant to mess with the consumer. It performs so well. I don't want to interfere.
3. If our pipeline never drops fields from LVK alerts (so, no lite module, etc.), maybe it's not crucial for the client to be able to construct the name_in_bucket. I can still think of edge cases where we'd regret that and users would hate us for it.
4. We could drop VERSIONTAG from the name in bucket. I really don't like the idea of making superevent_id the top-level folder, so maybe we use the LVK run for that, like "O4". The problem is that bigquery tables MUST be separated by schema version, and there's no guarantee of a clean mapping between run and schema version, so this would make it really hard to regenerate bigquery tables from the bucket.
Also, this reminds me that it would be much better to remove this function altogether and just use Alert.name_in_bucket so that we don't have to keep the two in sync.
We decided yesterday on zoom to implement option 1. I'll wait for that before reviewing again.
@troyraen I've implemented option 1! Ready for your review.
```python
def run():
    ...
    # the schema version is not defined in the schema
    # add it manually using the environment variable defined in this script
    alert.attributes["schema_version"] = VERSIONTAG

    # publish the same alert as JSON
    TOPIC_ALERTS.publish(alert)

    return "", HTTP_204


def _create_file_metadata(alert: pittgoogle.Alert, event_id: str) -> dict:
    ...
    metadata["schema_version"] = VERSIONTAG
    return metadata


def _name_in_bucket(alert: pittgoogle.Alert) -> str:
    ...
    _id = alert.sourceid
    return f"{VERSIONTAG}/{_id}/{_alert_type}-{_date}.json"
```
The schema version is already propagated for BigQuery schemas
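For readers following the thread, here is a minimal sketch (not from the PR) of what the elided pieces of `_name_in_bucket()` could look like under option 1. The handling of `_alert_type` and `_date` is an assumption for illustration, taking them from the packet's `alert_type` and `time_created` fields; the actual helper in this PR may differ.

```python
def _name_in_bucket(alert: pittgoogle.Alert) -> str:
    """Return the bucket object name, shaped like "<versiontag>/<superevent_id>/<alert_type>-<date>.json"."""
    _id = alert.sourceid  # assumed to map to superevent_id for LVK alerts
    _alert_type = alert.dict["alert_type"]            # assumption: taken directly from the packet
    _date = alert.dict["time_created"].split("T")[0]  # assumption: date portion of the ISO timestamp
    return f"{VERSIONTAG}/{_id}/{_alert_type}-{_date}.json"
```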
New file: requirements.txt (+14 lines)

```text
# As explained here
# https://cloud.google.com/functions/docs/writing/specifying-dependencies-python
# dependencies for a Cloud Function must be specified in a `requirements.txt`
# file (or packaged with the function) in the same directory as `main.py`

google-cloud-logging
google-cloud-storage
pittgoogle-client>=0.3.15

# for Cloud Run
# https://cloud.google.com/run/docs/quickstarts/build-and-deploy/deploy-python-service
Flask
gunicorn
Werkzeug
```