-
Notifications
You must be signed in to change notification settings - Fork 750
Description
Describe the bug
I am seeing a performance bottleneck with the Flyte database. With my workload, the query SELECT * FROM "tags" WHERE ("tags"."artifact_id","tags"."dataset_uuid") IN (($1,$2))
is getting executed frequently (380,000 times in the last day) against the datacatalog
database. The workload I'm executing does have ~380k tasks, so the number of queries makes sense. On average, it takes 2 seconds to complete and returns 0 rows. I believe this query is executed as part of the task cache, which I do have enabled (but I expect all cache misses). I am using v1.0.1 of Datacatalog, and v1.1.47 of FlytePropeller. This is a database with 32 vCPU and 64GB of memory, 200GB storage with ~50GB of that used.
Expected behavior
I expect that the database would not be a performance bottleneck, and that the datacatalog.tags
table is properly indexed in order to support the queries that are executed against it.
Additional context to reproduce
If I add an index (CREATE INDEX tags_dataset_uuid_artifact_id_idx ON tags (dataset_uuid, artifact_id);
), the query gets significantly faster (1000x in my local testing). The postgres planner seems to vary the way it actually executes the query. The attached screenshots are from a different DB instance under slightly lesser load, but I'm adding them to show the 3 different ways I've seen the planner choose to execute the query. (Note that these are all without the index).
Context: https://flyte-org.slack.com/archives/CP2HDHKE1/p1701797879836439
Screenshots



Are you sure this issue hasn't been raised already?
- Yes
Have you read the Code of Conduct?
- Yes