Conversation

@hernandezc1 hernandezc1 commented Aug 21, 2025

PR Summary

Changed

  • /setup_broker/lsst/setup_broker.sh
    • updated the bq mk command to include the --time_partitioning_type=DAY flag for the LSST alerts table (see the sketch below)

Partitioning the alerts table improves query performance by allowing BigQuery to scan only the relevant partitions instead of the full table. The maximum number of partitions allowed for a single BigQuery table is 10,000. Additional types of partitioning are outlined here.
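A minimal sketch of what the updated table-creation command could look like with this flag; the project, dataset, table, and schema names below are placeholders, not the actual values used in setup_broker.sh:

```bash
# Sketch only: project, dataset, table, and schema names are placeholders,
# not the values used by setup_broker.sh.
bq mk \
    --table \
    --time_partitioning_type=DAY \
    "${PROJECT_ID}:${DATASET}.alerts" \
    path/to/alerts_schema.json
```

If no --time_partitioning_field is also set, the table is partitioned by ingestion time and exposes the _PARTITIONTIME / _PARTITIONDATE pseudo-columns for filtering.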

@hernandezc1 hernandezc1 self-assigned this Aug 21, 2025
@hernandezc1 hernandezc1 added the Enhancement and Pipeline: Storage labels Aug 21, 2025
@hernandezc1 hernandezc1 requested a review from troyraen August 21, 2025 20:13

@troyraen troyraen left a comment

I definitely agree that we should partition and cluster our tables. I'll leave the specific decision about what to partition this one by up to you since you're working much more closely with the data and use cases these days. My input is the following. I'm doing a lot of guessing about how BigQuery works. You'd need to look into that more if/when you feel it's relevant.

BigQuery can probably execute a query by parallelizing over partitions, and in that case could return results faster from a table that has partitioning vs one that doesn't. That could be faster regardless of what the table is partitioned by and without any requirements on the specifics of a given query.

However, the monetary cost of a query is determined by the amount of data that needs to be processed. So I'm guessing that a partitioned table should facilitate cheaper queries by allowing the query engine to skip entire partitions, but it needs some way to determine which partitions can be skipped. To do this, it can probably use a constraint on the partitioning column itself (i.e., the given query has to include something like "WHERE partition_column < value") and/or compare other constraints in the given query with partition metadata (e.g., min/max of any column).

Because of the distributions of data in these alert tables, if we partition by time/day, I think users will typically have to include a constraint on the partitioning column itself in order to pay significantly less for a query. Things like magnitude won't be correlated with time, so they won't be useful. Spatial constraints will have some correlation with time because of LSST's survey strategy, which could mean significant cost savings compared to no partitioning at all. By contrast, if we partition spatially (e.g., by HEALPix), our users will be much more likely to pay even less for their queries: (1) spatial constraints are the most common need shared across time-domain astronomy use cases, and (2) all alerts for a given astronomical object will be in the same partition. That's why we've always intended to partition these tables spatially.
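To make the pruning point concrete, here is a rough example (dataset, table, and date range are hypothetical) of a query that lets BigQuery skip partitions on a DAY ingestion-time partitioned table, because it constrains the partition pseudo-column directly:

```bash
# Illustrative only; dataset/table names and dates are placeholders.
# With ingestion-time DAY partitioning, BigQuery can prune partitions when the
# query filters on _PARTITIONDATE (or _PARTITIONTIME); without such a filter,
# the full table is scanned and billed.
bq query --use_legacy_sql=false "
  SELECT COUNT(*) AS n_alerts
  FROM my_dataset.alerts
  WHERE _PARTITIONDATE BETWEEN DATE '2025-08-20' AND DATE '2025-08-27'
"
```

A spatially partitioned alternative could use integer-range partitioning (the --range_partitioning flag) on a HEALPix index column, so the analogous pruning filter would be a constraint on the HEALPix column rather than on a date.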

Our buckets are organized by time, and that's where we've intended to direct users who have a strong dependence on time-based searches. I know it's possible to configure a BigQuery table to use a bucket as its underlying data store. I have no idea how easy that is to set up. I'm pretty sure that queries will take longer and it might cost more as well, but by how much, I don't know. You could look into it as a potential additional service we could offer and/or suggest it to a user who wants a time-partitioned BigQuery table.
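For reference, the bucket-backed option mentioned above is a BigQuery external table over Cloud Storage. A rough sketch, assuming Avro files and placeholder bucket/table names:

```bash
# Rough sketch only; bucket path, source format, and table name are placeholders.
# An external table reads objects directly from GCS at query time instead of
# loading them into BigQuery-managed storage.
bq mkdef --source_format=AVRO \
    "gs://example-alerts-bucket/alerts/*.avro" > /tmp/alerts_ext_def.json

bq mk --external_table_definition=/tmp/alerts_ext_def.json \
    my_dataset.alerts_external
```

Queries against external tables generally run slower than against native tables, consistent with the caveat above.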

If we only have one potential user right now (that we know of) and they strongly prefer a BigQuery table with time-based partitioning, of course it makes sense to consider that. But also consider what you're going to do when we have more potential users who want spatial lookups.

@hernandezc1 hernandezc1 merged commit f3d9ed9 into develop Aug 27, 2025
4 checks passed
@hernandezc1 hernandezc1 deleted the u/ch/bq/partition branch August 27, 2025 17:48