Replies: 8 comments 1 reply
-
Here is the previous discussion in Slack: https://prestodb.slack.com/archives/C07JH9WMQ/p1590046937223500
-
I have done very limited testing of LBM with spill, and it doesn't outright fail, but spill is quite broken. Broadcast join doesn't work; it just generates an error. Aggregation doesn't yet support distinct or sorted accumulators. It looks like spill will be getting attention soon, though. Materializing to local disk for exchange is basically re-implementing a distributed filesystem or shuffle service. I think it's questionable whether that is a good idea versus just using an existing distributed filesystem or shuffle service. Would doing it inside Presto be too "not invented here"? To reframe the approach, one thing you might consider instead is re-using already materialized CTEs?
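A rough way to see why local-disk materialization amounts to re-implementing a shuffle service: each materialized partition lives on exactly one worker's disk, so the coordinator must track the partition-to-worker mapping and invalidate it when a worker dies. The sketch below is purely illustrative (all names are hypothetical, none of this is Presto code):

```python
class PartitionRegistry:
    """Hypothetical bookkeeping a local-disk materialization scheme would need."""

    def __init__(self):
        # (stage_id, partition_id) -> worker host that holds the partition file
        self._locations = {}

    def register(self, stage_id, partition_id, worker):
        self._locations[(stage_id, partition_id)] = worker

    def locate(self, stage_id, partition_id):
        return self._locations.get((stage_id, partition_id))

    def on_worker_lost(self, worker):
        # Every partition on the lost worker is gone; the producing stage must
        # be re-run for those partitions. This fault-tolerance work comes "for
        # free" when the data sits in a distributed filesystem instead.
        lost = [key for key, w in self._locations.items() if w == worker]
        for key in lost:
            del self._locations[key]
        return lost
```

With S3/HDFS the "location" is just a path, and losing a worker loses no already-materialized data, which is the crux of the argument above.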
-
@aweisberg Thank you for the quick response.
Could you please share some more details on why spill is incompatible with materialized exchanges, and the work required to get materialized exchanges working with spill? Do we have GitHub issues on this? What is LBM?
I have a couple of questions:
Many Presto users run it with cloud storage services like S3 as the distributed file system, which has different performance characteristics. For instance, S3 provides very high aggregate throughput but higher latency to the first byte. I don't understand the I/O patterns of the intermediate data well enough to tell whether storing intermediate data such as materialized exchanges on S3 could be a problem. If you have some thoughts on this, I would appreciate them.
Sorry, I did not understand and would appreciate some context. Does Presto support materializing CTEs?
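One way to reason about the S3 question is a simple first-byte-latency plus streaming-throughput cost model. The figures below are illustrative assumptions, not measurements:

```python
def read_time_seconds(num_reads, bytes_per_read, first_byte_latency_s,
                      throughput_bytes_per_s):
    """Estimated time to issue num_reads sequential reads of bytes_per_read each."""
    return (num_reads * first_byte_latency_s
            + num_reads * bytes_per_read / throughput_bytes_per_s)

# Assumed, illustrative figures: ~50 ms to first byte on S3 vs ~0.1 ms on a
# local SSD; single-stream throughput of ~90 MB/s (S3) vs ~500 MB/s (SSD).
S3 = dict(first_byte_latency_s=0.05, throughput_bytes_per_s=90e6)
SSD = dict(first_byte_latency_s=1e-4, throughput_bytes_per_s=500e6)

# One large 1 GB sequential scan: throughput dominates, so S3 is competitive
# (parallel streams would close the gap further).
large_s3 = read_time_seconds(1, 1e9, **S3)
large_ssd = read_time_seconds(1, 1e9, **SSD)

# The same 1 GB as 10,000 scattered 100 KB reads: first-byte latency dominates
# on S3 and the gap blows up.
small_s3 = read_time_seconds(10_000, 100_000, **S3)
small_ssd = read_time_seconds(10_000, 100_000, **SSD)
```

Under these assumptions, materialized exchanges on S3 look fine as long as writers produce a few large objects and readers scan them sequentially; many small random reads would be the pattern to avoid.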
-
@aweisberg :
Since it's just materializing an intermediate CTE table, it doesn't have to be a disaggregated shuffle. Essentially, disaggregated storage (like S3/HDFS) is fine. Think of it this way: we just need a Connector that supports temporary table writes (e.g.
-
The general question here is whether we want to make sure all "intermediate tables" are in disaggregated storage, or whether they can be on local disk. In general the trend in distributed computation is toward disaggregated services, so each service can scale independently. We have seen disaggregated storage (S3/HDFS) become prevalent in the last decade; even a low-latency service such as Raptor is moving to disaggregated flash via RaptorX (#13205). Also, as explained in this very nice blog by Dipti Borkar and Steven Mih, "The database stack is completely disaggregated": https://ahana.io/blog/introducing-ahana/ . Thus I believe disaggregating temporary storage is also the right direction. Besides that, materializing a CTE as a Hive temporary table might actually be easier to implement, since it can leverage most of the existing Presto Unlimited infrastructure. I remember @tdcmeehan had a prototype of the planner-side change to allow generating "multiple query plans" when materializing CTEs last year -- it probably needs some revision now, but I believe most of the code should still work.
-
They solve similar problems from different angles, so they are "compatible" in the sense that they don't block each other -- you can have both materialized exchange and spill to disk enabled for a single query.
-
This makes sense. Still, to use local disk, you would need to implement a Connector (e.g. What about the following: start the implementation with HiveConnector, since all the temporary table APIs are already implemented (it needs small tuning for materialized CTE). In the meantime, we can see what other committers think about whether we want a connector that writes temporary tables to local disk.
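To make the proposed flow concrete, here is a hypothetical sketch of the temporary-table lifecycle a materialized CTE would go through. This is an in-memory stand-in, not the Hive connector API; all names are invented for illustration:

```python
class TemporaryTableConnector:
    """In-memory stand-in for a connector that supports temporary table writes."""

    def __init__(self):
        self._tables = {}

    def create_temporary_table(self, name, columns):
        # Temporary tables are scoped to a single query and never catalogued.
        self._tables[name] = {"columns": columns, "rows": []}

    def append(self, name, rows):
        self._tables[name]["rows"].extend(rows)

    def scan(self, name):
        return list(self._tables[name]["rows"])

    def drop(self, name):
        del self._tables[name]

# Lifecycle of one materialized CTE:
conn = TemporaryTableConnector()
conn.create_temporary_table("tmp_cte_1", ["a", "b"])  # planner splits the query
conn.append("tmp_cte_1", [(1, "x"), (2, "y")])        # first fragment writes
rows = conn.scan("tmp_cte_1")                         # later fragments read
conn.drop("tmp_cte_1")                                # cleaned up at query end
```

The point of starting with HiveConnector is that each of these steps already has a real counterpart there, so only the planner wiring for CTEs would be new.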
-
cc @viczhang861
-
I was thinking about reusing materialized exchanges to speed up queries by reusing exchanges that occur multiple times in the same query. I have a couple of questions about materialized exchanges:
Currently the materialized exchanges write the output to a DFS (such as Crail). Is it possible to materialize to local disks instead (similar to spill-to-disk)?
Are materialized exchanges compatible with the spill-to-disk feature?