Description
I am trying to understand a performance degradation that appeared after upgrading from Trino 419 to Trino 463. Queries against Hive tables with zstd-compressed data in S3 seem to run significantly slower in Trino 463 than in Trino 419.
I have a symlink Hive table built on top of zstd-compressed data in S3. Querying this table is relatively fast in Trino 419; however, when I upgraded to the most recent version of Trino, I saw a significant decrease in execution speed and a spike in CPU usage.
It is a simple group-by query like:
SELECT date, count(*)
FROM <table_name>
WHERE date IN ('2024-11-07')
AND column1='foo'
AND column2='bar'
GROUP BY 1
ORDER BY 1
These are the query stats in Trino 419


And these are the query stats in Trino 463


The same query now takes twice as long and CPU usage is much higher.
I tried to pinpoint where in the upgrade path this performance degradation occurred, but when I ran the query on Trino 430, for example, I got errors like:
io.trino.spi.TrinoException: Unsupported input format: serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
I am running this self-hosted on AWS EC2 instances. The coordinator is an r7g.4xlarge and there are 5 workers of type r7g.8xlarge.
I don't think the fact that this is a symlink table has anything to do with the performance issue. It is a symlink table because the data is stored in S3 in an unusual layout that does not fit the Hive partitioning scheme.
This is some info about one of the zstd files:
$> zstd -lv file.csv.zst
# Zstandard Frames: 1
DictID: 0
Window Size: 8.00 MiB (8388608 B)
Compressed Size: 26.4 MiB (27635730 B)
Check: XXH64 109a2c1c
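
If it helps to compare several files, the listing above can be turned into structured values. This is only a minimal Python sketch: it assumes the `zstd -lv` output has the same `Key: value` shape as the listing shown here, and `parse_zstd_listing` is a hypothetical helper name, not part of zstd or Trino.

```python
import re

# Sample text copied from the listing above. In practice it would be the
# captured output of `zstd -lv <file>` for each file being compared.
SAMPLE = """\
# Zstandard Frames: 1
DictID: 0
Window Size: 8.00 MiB (8388608 B)
Compressed Size: 26.4 MiB (27635730 B)
Check: XXH64 109a2c1c
"""


def parse_zstd_listing(text):
    """Parse `Key: value` lines into a dict, preferring exact byte counts."""
    info = {}
    for line in text.splitlines():
        # Drop the leading "# " that zstd prints before the frame count.
        key, sep, value = line.lstrip("# ").partition(":")
        if not sep:
            continue  # not a `Key: value` line
        key, value = key.strip(), value.strip()
        exact_bytes = re.search(r"\((\d+) B\)", value)
        if exact_bytes:
            info[key] = int(exact_bytes.group(1))  # e.g. "(8388608 B)"
        elif value.isdigit():
            info[key] = int(value)
        else:
            info[key] = value  # keep free-form values such as the checksum
    return info


info = parse_zstd_listing(SAMPLE)
print(info)
```

Running the same parser over the output for a handful of files would show whether they all share the same frame count and window size as this one.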