Data Lifecycle Monitoring table #1083

sf-dcp · 2024-08-15T16:16:27Z

sf-dcp
Aug 15, 2024
Maintainer

Motivation

The need for a promotions logging table came up in the PLUTO DE QA project, where we wanted to monitor how long it takes to get from product build to publish. More generally, we would like an easy way to get a product's status (though our whiteboard is nice too 😊).

@fvankrieken created a feature a while back to scrape s3, which we could build on to estimate a given product's lifecycle. A limitation of this approach: if an s3 file gets automatically or manually overwritten, deleted, or modified, we won't see it unless we store historical information from scraping. Additionally, there may be costs from making too many calls to s3 (I couldn't confirm our plan details because I have no access to billing info in DO). From what I'm reading about DO s3 plans, each plan includes a set amount of outbound traffic; any additional calls outside of the limit are charged extra.

Proposal

Some questions I would like to answer with the future logging table are:

❓How long did it take from build to publish for product X?
❓What is the current status for product X version V?
❓How long did QA take for product X? What's the average QA timeline?

Sample table

The table below shows what the log would look like after we publish a product and later patch the published version. The questions above could be answered by creating custom views in the DB. For example, one view could cover product status, another the whole lifecycle timeline per product, or we could have one view per product.

| product name | version | build name | product path | GH Action | GH Action link | timestamp |
| --- | --- | --- | --- | --- | --- | --- |
| db-pluto | 24v3 | sf-build-1 | db-pluto/build/sf-build-1 | build | `link` | `timestamp` |
| db-pluto | 24v3 | sf-build-1 | db-pluto/draft/24v3/1 | promote_to_draft | `link` | `timestamp` |
| db-pluto | 24v3 | sf-build-1 | db-pluto/publish/24v3 | publish | `link` | `timestamp` |
| db-pluto | 24v3 | sf-build-2 | db-pluto/build/sf-build-2 | build | `link` | `timestamp` |
| db-pluto | 24v3 | sf-build-2 | db-pluto/draft/24v3/2-fix-zoning | promote_to_draft | `link` | `timestamp` |
| db-pluto | 24v3 | sf-build-2 | db-pluto/publish/24v3.0.1 | publish | `link` | `timestamp` |
| db-pluto | 24v3 | sf-build-2 | db-pluto/publish/latest | publish | `link` | `timestamp` |
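As a rough sketch of how a view over such a table could answer the "current status" question, the snippet below loads a few rows into an in-memory SQLite database. Everything here is an assumption made for illustration: the table name `product_lifecycle`, the snake_case column names, the invented timestamps, and SQLite itself (the real table would live in our own database).

```python
import sqlite3

# Hypothetical rows mirroring the first publish in the sample table.
# The timestamps are invented for this example.
ROWS = [
    ("db-pluto", "24v3", "sf-build-1", "db-pluto/build/sf-build-1", "build", "2024-08-01T09:00:00"),
    ("db-pluto", "24v3", "sf-build-1", "db-pluto/draft/24v3/1", "promote_to_draft", "2024-08-02T09:00:00"),
    ("db-pluto", "24v3", "sf-build-1", "db-pluto/publish/24v3", "publish", "2024-08-05T09:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_lifecycle (
        product_name TEXT, version TEXT, build_name TEXT,
        product_path TEXT, gh_action TEXT, timestamp TEXT
    )
    """
)
conn.executemany("INSERT INTO product_lifecycle VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# View: the most recent event per product and version.
# ISO 8601 strings sort chronologically, so MAX() on text works here.
conn.execute(
    """
    CREATE VIEW product_status AS
    SELECT p.product_name, p.version, p.gh_action AS status, p.timestamp
    FROM product_lifecycle AS p
    WHERE p.timestamp = (
        SELECT MAX(q.timestamp)
        FROM product_lifecycle AS q
        WHERE q.product_name = p.product_name AND q.version = p.version
    )
    """
)

status = conn.execute(
    "SELECT status FROM product_status WHERE product_name = ? AND version = ?",
    ("db-pluto", "24v3"),
).fetchone()[0]
print(status)
```

A timeline view per product could follow the same pattern, selecting all rows for a product ordered by timestamp instead of only the latest one.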

Implementation

📌 Log what: build, promote to draft, publish, and distribute GH actions. Later we could add packaging when it's automated.

📌 Where: in our database. Either create a separate db or use default-db, public schema. We could also dump a copy of the table somewhere in S3 occasionally as a backup.

📌 How:

1. Add a build-name attribute to product metadata. Having this attribute helps sort out any draft-draft builds like nightly qa. It will be added to metadata during the build action.
2. Create a GH action to log an event. Inputs to the action are the logging table columns.
3. Call the GH action from step 2 as part of the build/promote/publish/distribute actions.
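The logging step described above could be sketched as a small Python helper that a GH action (or anything else) calls with the table columns as inputs. This is only an illustration under stated assumptions: `log_event`, `create_table`, the column names, and the use of SQLite are all hypothetical and not existing dcpy code.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical column set, loosely following the sample table in this proposal.
COLUMNS = ("product_name", "version", "build_name", "product_path", "event", "link", "timestamp")


def create_table(conn: sqlite3.Connection) -> None:
    """Create the (assumed) logging table if it does not exist yet."""
    column_defs = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS product_lifecycle ({column_defs})")


def log_event(conn, *, product_name, version, build_name, product_path,
              event, link=None, timestamp=None) -> dict:
    """Append one lifecycle event and return the row that was written."""
    row = {
        "product_name": product_name,
        "version": version,
        "build_name": build_name,
        "product_path": product_path,
        "event": event,
        "link": link,
        # Default to "now" in UTC when the caller does not supply a timestamp.
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
    }
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        f"INSERT INTO product_lifecycle ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        [row[name] for name in COLUMNS],
    )
    conn.commit()
    return row


conn = sqlite3.connect(":memory:")
create_table(conn)
logged = log_event(
    conn,
    product_name="db-pluto",
    version="24v3",
    build_name="sf-build-1",
    product_path="db-pluto/build/sf-build-1",
    event="build",
)
```

Keeping the write in one function means the build, promote, publish, and distribute steps would all log through the same code path.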

Feedback needed

What kind of questions would you like to be answered with the logging table?
Should nightly QA actions be included in the logging table?
Should we log failed actions?
Thoughts about the type of data to be included
Thoughts about the implementation proposal

It would be nice to get feedback asap for this to be implemented in this sprint.

alexrichey · 2024-08-19T16:23:09Z

alexrichey
Aug 19, 2024
Maintainer

@sf-dcp I like where this is going!

My immediate thought is to orient this around a product/dataset stage in our lifecycle: anything in dcpy.lifecycle that changes the status of a product's lifecycle should also add a record to the logging table. So the columns would look like:

Product
(Optional) Dataset Name
Version
New Stage (e.g. build, promoted, distributed)
Old Stage
timestamp
custom attributes (JSON field)

I love the idea of being able to track builds etc, but maybe we just chuck those specifics in a JSON field. I'm thinking we'll probably always do things manually sometimes... so maybe someone distributes to Socrata from their local machine. I'd want that to be logged, but there wouldn't be an action to track back to.
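One way to picture the column list above, with run-specific details pushed into a JSON field, is the sketch below. The class name `LifecycleEvent`, the field names, and the attribute keys (`run_from`, `destination`) are made up for illustration and are not part of dcpy.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class LifecycleEvent:
    """One row of the proposed logging table (hypothetical shape)."""

    product: str
    version: str
    new_stage: str
    old_stage: Optional[str] = None
    dataset: Optional[str] = None  # optional dataset name
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    # Free-form details such as a GH action link or a build name.
    custom_attributes: Dict[str, Any] = field(default_factory=dict)

    def to_row(self) -> Dict[str, Any]:
        """Flatten to a dict with custom_attributes serialized as a JSON string."""
        row = asdict(self)
        row["custom_attributes"] = json.dumps(row["custom_attributes"], sort_keys=True)
        return row


# A manual distribution from a local machine: no GH action to link back to,
# so the specifics simply go into the JSON field.
event = LifecycleEvent(
    product="db-pluto",
    version="24v3",
    new_stage="distributed",
    old_stage="published",
    custom_attributes={"run_from": "local", "destination": "socrata"},
)
row = event.to_row()
```

The fixed columns stay queryable, while anything that only applies to some runs (action links, build names, who ran it) does not force a schema change.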

And maybe we set up some automation around Draft Review issues in Github, where changes to the issue trigger something to add to your table.

4 replies

sf-dcp Aug 19, 2024
Maintainer Author

Thanks for reviewing the proposal.

Do we actually run production tasks locally? Speaking of build/promote/publish/distribute.
Do we want to track events from nightly qa?
What is the value of old stage and dataset name columns?

I would still like to add build-name to metadata json as well as to the logging table because this field is needed for calculating a view for a product timeline. Specifically, identifying which build was a starting point in the lifecycle.

sf-dcp Aug 19, 2024
Maintainer Author

Also, I intended to have the table for logging our products. Are you suggesting to track datasets here as well?

alexrichey Aug 19, 2024
Maintainer

> What is the value of old stage name columns?

Good question... I was thinking about the case when we have a product that moves back and forth between GIS Review and Build. But I suppose we know what the prior stage was based on the last logged event for that product-version?

> Do we want to track events from nightly qa?

Might be nice, especially if we tracked build times? But I don't know. No strong thoughts.

> Do we actually run production tasks locally? Speaking of build/promote/publish/distribute.

I think we probably always will. More broadly, I think there's value in dcpy.lifecycle working the same regardless of where we run code from (i.e. it should log the events whether running in GHA or on your local machine).

> Are you suggesting to track datasets here as well? / What is the value of the dataset name columns?

Yeah, my thought was that if we're eventually including distribution, then we'd need dataset granularity, since we distribute them individually. But maybe that lives in a different table?

sf-dcp Aug 19, 2024
Maintainer Author

Makes sense.

To the first question, yes, we will be able to tell the previous stage from another logging event.

Yeah, I'm not sure why we would want to track nightly builds. I'm leaning towards logging nightly builds being more of a liability (new records increasing table size quickly) than a source of value. But I can be convinced.

Should we call a logging function inside of dcpy.lifecycle CLI call?

If we need more granularity for distribution, I think it makes sense creating a separate table for the ease of use rather than constantly unpacking custom attributes column... If we go that route, does this mean all distribution, dataset or product, would be only logged in a separate table?

fvankrieken · 2024-08-20T16:23:19Z

fvankrieken
Aug 20, 2024
Maintainer

Not hugely related to this issue, but just noting that my "s3" scraping was really

- a db logging framework for library (to log when events occur in realtime)
- a util to scrape edm-recipes to populate the db table for runs that occurred in the past

The source data page in the QA app is aimed at this db table, not s3. The scraping only happened once!

Back to actual discussion. I like this a lot. A couple random notes

- I think logging should happen via dcpy utility - long term, if we had full orchestration of build steps in python, this gives us a lot of ability to incorporate things easily (timing steps, etc) and log directly during each of these actions. Since right now a "build" doesn't really happen in python, it's a little clunkier. So a CLI target makes sense. Can still have a GHA to call it!
- for now I'd say no nightly qa? If we agree/enforce that we don't promote from nightly_qa
- need to think a little about structure of table in terms of how we'd query it to answer your questions. Let's start with the first. Currently, if there's a "publish" row, I don't think we'd be able to query the table to see which draft revision it came from (can be figured out logically if we assume latest). This could be done with either
  - having column for draft_revision
  - having link to previous (as per Alex's comment) - be it "path", or db pk, etc. I sort of prefer this, as then we can create an index on just (product, version, path) and then could performantly join a row we have to its "previous_path" via that index (and can do so with a recursive query so that for any stage - be it distributed, published, whatever - we can look up its whole tree quickly - at that point we don't necessarily need build name as its own column)
  - slight augmentation on above - could just have "product", "stage/action", and "stage/action identifier" as the main columns - this effectively is a dcpy ProductKey. So "db-pluto", "draft", "23v2/2-fix-zoning" as an example. Though in our case, it works a little nicer I think to have path as the unique identifier rather than a combo of columns so we can more easily join to "prev" (and not need to store 2 prev columns)
- Thinking about question 2
  - I think having a view that recursively queries the table to give us "leaves" of the tree helps us here too - for a given product, we can see its max version and its max "stage/action" - maybe this is an enum so that we can sort it. If that makes sense - nice to quickly be able to query pluto, see 23v3 is max, and that the latest step we have is publish (and see the whole upstream history of that publish row, with timestamps, draft revisions, etc)
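To make the "link to previous" idea concrete, here is a minimal sketch of a recursive query that walks from a publish row back up to its build. It assumes each path appears only once, and the table name `lifecycle_events`, the `previous_path` column, the timestamps, and the use of SQLite are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lifecycle_events (
        product TEXT, version TEXT, stage TEXT,
        path TEXT, previous_path TEXT, timestamp TEXT
    );
    -- Index used to join a row to its previous_path.
    CREATE INDEX idx_events_path ON lifecycle_events (product, version, path);
    """
)

# Invented example chain: build -> draft -> publish.
conn.executemany(
    "INSERT INTO lifecycle_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("db-pluto", "24v3", "build", "db-pluto/build/sf-build-1", None, "2024-08-01T09:00:00"),
        ("db-pluto", "24v3", "draft", "db-pluto/draft/24v3/1", "db-pluto/build/sf-build-1", "2024-08-02T09:00:00"),
        ("db-pluto", "24v3", "publish", "db-pluto/publish/24v3", "db-pluto/draft/24v3/1", "2024-08-05T09:00:00"),
    ],
)

# Start at one row and repeatedly follow previous_path upstream.
UPSTREAM_QUERY = """
WITH RECURSIVE upstream AS (
    SELECT * FROM lifecycle_events WHERE path = :path
    UNION ALL
    SELECT e.*
    FROM lifecycle_events AS e
    JOIN upstream AS u
      ON e.product = u.product
     AND e.version = u.version
     AND e.path = u.previous_path
)
SELECT stage, path, timestamp FROM upstream ORDER BY timestamp
"""

history = conn.execute(UPSTREAM_QUERY, {"path": "db-pluto/publish/24v3"}).fetchall()
stages = [stage for stage, _path, _ts in history]
print(stages)
```

The same query works from any starting row, so a distributed or published row can be traced back through its draft revision to the original build.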

5 replies

damonmcc Aug 20, 2024
Maintainer

> I think logging should happen via dcpy utility

had this thought too. and a new GHA with a lot of inputs seems like something to avoid (actions have a max of 10 inputs and we probably shouldn't have an input for each logging table column)

sf-dcp Aug 21, 2024
Maintainer Author

Thank you all for the feedback. This is really helpful.

To answer the question of how long the lifecycle took for product X version V using the build-name attribute, you could use a query like this:

-- All logged events for one product version
WITH filtered_product_version_events AS (
    SELECT *
    FROM product_lifecycle
    WHERE product_name = 'db-pluto'
      AND version = '24v3'
),
-- Builds of that version that were promoted to draft
filtered_product_version_builds AS (
    SELECT build_name
    FROM filtered_product_version_events
    WHERE GH_Action = 'promote_to_draft'
)
SELECT
    *
FROM
    filtered_product_version_events
WHERE
    build_name IN (SELECT build_name FROM filtered_product_version_builds);

This query would return all events relevant to PLUTO 24v3. Taking the min and max of the timestamp values would tell you how long the lifecycle took. In this case, the link between 2 events (ex: draft and publish) consists of product name, version, and build name.
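For a concrete feel of the min/max step, here is a small sketch that runs a query of this shape against an in-memory SQLite table and computes the duration in Python. The rows, timestamps, lowercase column names, and the `abandoned-build` name are invented for illustration.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_lifecycle "
    "(product_name TEXT, version TEXT, build_name TEXT, gh_action TEXT, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO product_lifecycle VALUES (?, ?, ?, ?, ?)",
    [
        # A build that was never promoted, so it should not count.
        ("db-pluto", "24v3", "abandoned-build", "build", "2024-07-20T09:00:00"),
        ("db-pluto", "24v3", "sf-build-1", "build", "2024-08-01T09:00:00"),
        ("db-pluto", "24v3", "sf-build-1", "promote_to_draft", "2024-08-02T12:00:00"),
        ("db-pluto", "24v3", "sf-build-1", "publish", "2024-08-05T15:00:00"),
    ],
)

QUERY = """
WITH events AS (
    SELECT * FROM product_lifecycle
    WHERE product_name = ? AND version = ?
),
promoted_builds AS (
    SELECT build_name FROM events WHERE gh_action = 'promote_to_draft'
)
SELECT MIN(timestamp), MAX(timestamp)
FROM events
WHERE build_name IN (SELECT build_name FROM promoted_builds)
"""

first, last = conn.execute(QUERY, ("db-pluto", "24v3")).fetchone()
# Lifecycle length = latest event minus earliest event among promoted builds.
lifecycle = datetime.fromisoformat(last) - datetime.fromisoformat(first)
print(lifecycle)
```

Because the filter keys on builds that were promoted to draft, the earlier unpromoted build does not stretch the measured timeline.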

Regarding product + version + path as the db unique key: it does make sense to use this instead, esp for the index benefits. How would we deal with publishing to the latest folder? When you re-publish the same version (and push to latest again), the unique key would be the same for multiple logging events like db-pluto/24v3/publish/latest.

damonmcc Aug 22, 2024
Maintainer

your example of pushing to latest twice is one reason product + version + path isn't unique. I think another example is just running a build twice

seems like product + version + path + timestamp would be unique. maybe another good key would be product + version + event + timestamp, so that any other columns are considered details about this unique event
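A quick sketch of the second key suggestion (product + version + event + timestamp), using SQLite as a stand-in for our database. The table name `lifecycle_events`, the helper `insert_event`, and the timestamps are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lifecycle_events (
        product TEXT NOT NULL,
        version TEXT NOT NULL,
        event TEXT NOT NULL,
        timestamp TEXT NOT NULL,
        path TEXT,  -- a detail of the event, not part of the key
        PRIMARY KEY (product, version, event, timestamp)
    )
    """
)


def insert_event(product, version, event, timestamp, path) -> bool:
    """Insert one event; return False if the same key was already logged."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO lifecycle_events VALUES (?, ?, ?, ?, ?)",
                (product, version, event, timestamp, path),
            )
        return True
    except sqlite3.IntegrityError:
        return False


build_path = "db-pluto/build/sf-build-1"
# Running the same build twice yields the same path but different timestamps.
first_build = insert_event("db-pluto", "24v3", "build", "2024-08-01T09:00:00", build_path)
second_build = insert_event("db-pluto", "24v3", "build", "2024-08-03T09:00:00", build_path)
# Logging the exact same event again is rejected by the primary key.
duplicate = insert_event("db-pluto", "24v3", "build", "2024-08-01T09:00:00", build_path)
```

With this key, repeated builds or repeated pushes to the same path are all valid rows, and only a true double-write of the same event is blocked.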

fvankrieken Aug 22, 2024
Maintainer

Good point on the builds not being unique, Damon. Hmm.

And I think you're sort of right, @sf-dcp - I was thinking that we'd have this simple tree linking a published version to its upstream folders, but given that every draft of a published version is really relevant to the published version, it's a little trickier to actually get all the relevant info and I was oversimplifying it a little in my head.

Patches are a little bit of a headache - currently if we have 2 drafts for 24v3, then publish, then build again, then patch, we'd have 3 draft versions for 24v3, but the published 24v3 came from the second one, and then we need the patched version to be linked as well. Need to think for a sec about what I would want out of the db here.

Do we want latest rows? If we think of a row as an action (in the sense of build, promote, publish, distribute), I don't think pushing to latest should be its own row. To me this is maybe something that could go in a freeform json field ({'latest': True}) or something - it's more of a decorator to the action than an action in and of itself in my opinion.

With that, every action other than a build is unique by path at least.

sf-dcp Aug 22, 2024
Maintainer Author

Yeah, that makes sense for latest to be in custom attributes column. Regarding patches: I don't think they are a headache because version will be the same (i.e. 24v3) while paths and timestamps would indicate something to be a patch.

you could still grab all events relevant to a specific version and sort accordingly (refer to my table above).

damonmcc · 2024-08-20T20:46:53Z

damonmcc
Aug 20, 2024
Maintainer

seems like we do use python for all the events we'd wanna log

build: dcpy.connectors.edm.publishing upload
promote to draft: dcpy.connectors.edm.publishing promote_to_draft
publish: dcpy.connectors.edm.publishing publish
package: dcpy.connectors edm packaging package
distribute: dcpy.cli lifecycle distribute socrata from_s3

so maybe it'd be cleanest to just call a new python function for logging during those existing python steps? @fvankrieken you said "right now a "build" doesn't really happen in python" and I agreed at first. but looking at things now it doesn't seem like we need a new GHA or CLI to log relevant events

0 replies

damonmcc · 2024-08-20T20:59:28Z

damonmcc
Aug 20, 2024
Maintainer

> What kind of questions would you like to be answered with the logging table?

What's the status of a data update?
What was the time between each event in a data update?

> Should nightly QA actions be included in the logging table?

I don't think so. part of me likes the idea of including nightly QA, but we never promote those builds to draft so we might as well treat them like test artifacts

might be nice to have a "blacklist" of build names to ignore in the logging function. that way we don't have to add any logic to any GHAs

> Should we log failed actions?

I don't think so. since we'll only care about things that happened (e.g. files in DO changed), logs of failed actions in this type of table don't seem worth it

3 replies

sf-dcp Aug 21, 2024
Maintainer Author

> might be nice to have a "blacklist" of build names to ignore in the logging function. that way we don't have to add any logic to any GHAs

Could you give an example of a product to ignore and why to ignore it?

damonmcc Aug 22, 2024
Maintainer

rather than ignoring an entire product, I was thinking we could ignore certain build_name values like nightly_qa

we may also wanna ignore events related to Template DB builds triggered by PR tests (see template_test.yml), but I think it'd be beneficial to have this logging happen during our only end-to-end test
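A minimal sketch of what ignoring certain build names inside the logging function could look like. The names `IGNORED_BUILD_NAMES`, `should_log`, and `log_event` are hypothetical, and a plain list stands in for the database table.

```python
from typing import Dict, FrozenSet, List

# Build names whose events are skipped (assumed value, taken from the
# nightly_qa example in this thread).
IGNORED_BUILD_NAMES: FrozenSet[str] = frozenset({"nightly_qa"})


def should_log(build_name: str, ignored: FrozenSet[str] = IGNORED_BUILD_NAMES) -> bool:
    """Return True when events for this build name should be recorded."""
    return build_name not in ignored


def log_event(events: List[Dict[str, str]], build_name: str, event: str) -> bool:
    """Append the event unless its build name is ignored; report whether it was logged."""
    if not should_log(build_name):
        return False
    events.append({"build_name": build_name, "event": event})
    return True


events: List[Dict[str, str]] = []
logged_nightly = log_event(events, "nightly_qa", "build")
logged_real = log_event(events, "sf-build-1", "build")
```

Since the check lives in the logging function, the GHAs can call it unconditionally and need no extra logic of their own.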

sf-dcp Aug 22, 2024
Maintainer Author

oh gotcha. makes sense.

sf-dcp · 2024-08-22T14:13:21Z

sf-dcp
Aug 22, 2024
Maintainer Author

Thank you all for the feedback. I think I have enough info to keep in mind to start coding :)

0 replies
