Replies: 8 comments 1 reply
-
Here is the previous discussion in Slack: https://prestodb.slack.com/archives/C07JH9WMQ/p1590046937223500
-
I have done very limited testing of LBM with spill, and it doesn't outright fail, but spill is quite broken. Broadcast join doesn't work; it just generates an error. Aggregation doesn't yet support distinct or sorted accumulators. It looks like spill will be getting attention soon, though. Materializing to local disk for exchange is basically re-implementing a distributed filesystem or shuffle service. I think it's questionable whether that is a good idea versus just using an existing distributed filesystem or shuffle service. Would doing it inside Presto be too "not invented here"? To reframe the approach, one thing you might consider instead is re-using already materialized CTEs?
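A rough way to see why local-disk materialization amounts to re-implementing a shuffle service: each materialized partition lives on exactly one worker's disk, so the coordinator must track the partition-to-worker mapping and invalidate it when a worker dies. The sketch below is purely illustrative (all names are hypothetical, none of this is Presto code):

```python
class PartitionRegistry:
    """Hypothetical bookkeeping a local-disk materialization scheme would need."""

    def __init__(self):
        # (stage_id, partition_id) -> worker host that holds the partition file
        self._locations = {}

    def register(self, stage_id, partition_id, worker):
        self._locations[(stage_id, partition_id)] = worker

    def locate(self, stage_id, partition_id):
        return self._locations.get((stage_id, partition_id))

    def on_worker_lost(self, worker):
        # Every partition on the lost worker is gone; the producing stage must
        # be re-run for those partitions. This fault-tolerance work comes "for
        # free" when the data sits in a distributed filesystem instead.
        lost = [key for key, w in self._locations.items() if w == worker]
        for key in lost:
            del self._locations[key]
        return lost
```

With S3/HDFS the "location" is just a path, and losing a worker loses no already-materialized data, which is the crux of the argument above.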
-
@aweisberg Thank you for the quick response.
Could you please share some more details on why spill is incompatible with materialized exchanges, and the work required to get materialized exchanges working with spill? Do we have GitHub issues on this? What is LBM?
I have a couple of questions:
Many Presto users run it with cloud storage services like S3 as the distributed file system, which has different performance characteristics. For instance, S3 provides very high aggregate throughput but higher latency to the first byte. I don't understand the I/O patterns of the intermediate data well enough to tell whether storing intermediate data such as materialized exchanges on S3 could be a problem. If you have some thoughts on this, I would appreciate them.
Sorry, I did not understand and would appreciate some context. Does Presto support materializing CTEs?
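One way to reason about the S3 question is a simple first-byte-latency plus streaming-throughput cost model. The figures below are illustrative assumptions, not measurements:

```python
def read_time_seconds(num_reads, bytes_per_read, first_byte_latency_s,
                      throughput_bytes_per_s):
    """Estimated time to issue num_reads sequential reads of bytes_per_read each."""
    return (num_reads * first_byte_latency_s
            + num_reads * bytes_per_read / throughput_bytes_per_s)

# Assumed, illustrative figures: ~50 ms to first byte on S3 vs ~0.1 ms on a
# local SSD; single-stream throughput of ~90 MB/s (S3) vs ~500 MB/s (SSD).
S3 = dict(first_byte_latency_s=0.05, throughput_bytes_per_s=90e6)
SSD = dict(first_byte_latency_s=1e-4, throughput_bytes_per_s=500e6)

# One large 1 GB sequential scan: throughput dominates, so S3 is competitive
# (parallel streams would close the gap further).
large_s3 = read_time_seconds(1, 1e9, **S3)
large_ssd = read_time_seconds(1, 1e9, **SSD)

# The same 1 GB as 10,000 scattered 100 KB reads: first-byte latency dominates
# on S3 and the gap blows up.
small_s3 = read_time_seconds(10_000, 100_000, **S3)
small_ssd = read_time_seconds(10_000, 100_000, **SSD)
```

Under these assumptions, materialized exchanges on S3 look fine as long as writers produce a few large objects and readers scan them sequentially; many small random reads would be the pattern to avoid.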
-
@aweisberg :
Since it's just materializing an intermediate CTE table, it doesn't have to be a disaggregated shuffle. Essentially, disaggregated storage (like S3/HDFS) is fine. Think of it this way: we just need a Connector that supports temporary table writes (e.g.
-
The general question here is whether we want to make sure all "intermediate tables" are in disaggregated storage, or whether they can be on local disk. In general the trend in distributed computation is toward disaggregated services, so each service can scale independently. We have seen disaggregated storage (S3/HDFS) become prevalent in the last decade; even a low-latency service such as Raptor is moving to disaggregated flash via RaptorX (#13205). Also, as explained in this very nice blog by Dipti Borkar and Steven Mih, "The database stack is completely disaggregated": https://ahana.io/blog/introducing-ahana/ . Thus I believe disaggregating temporary storage is also the right direction. Besides that, materializing a CTE as a Hive temporary table might actually be easier to implement, since it can leverage most of the existing Presto Unlimited infrastructure. I remember @tdcmeehan had a prototype of the planner-side change to allow generating "multiple query plans" when materializing CTEs last year -- it probably needs some revision now, but I believe most of the code should still work.
-
They solve similar problems from different angles, so they are "compatible" in the sense that they don't block each other -- you can have both materialized exchange and spill to disk enabled for a single query.
-
This makes sense. Still, to use local disk, you would need to implement a Connector (e.g. What about the following: start the implementation with HiveConnector, since all the temporary table APIs are already implemented (it needs small tuning for materialized CTE). In the meantime, we can see what other committers think about whether we want a connector that writes temporary tables to local disk.
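To make the proposed flow concrete, here is a hypothetical sketch of the temporary-table lifecycle a materialized CTE would go through. This is an in-memory stand-in, not the Hive connector API; all names are invented for illustration:

```python
class TemporaryTableConnector:
    """In-memory stand-in for a connector that supports temporary table writes."""

    def __init__(self):
        self._tables = {}

    def create_temporary_table(self, name, columns):
        # Temporary tables are scoped to a single query and never catalogued.
        self._tables[name] = {"columns": columns, "rows": []}

    def append(self, name, rows):
        self._tables[name]["rows"].extend(rows)

    def scan(self, name):
        return list(self._tables[name]["rows"])

    def drop(self, name):
        del self._tables[name]

# Lifecycle of one materialized CTE:
conn = TemporaryTableConnector()
conn.create_temporary_table("tmp_cte_1", ["a", "b"])  # planner splits the query
conn.append("tmp_cte_1", [(1, "x"), (2, "y")])        # first fragment writes
rows = conn.scan("tmp_cte_1")                         # later fragments read
conn.drop("tmp_cte_1")                                # cleaned up at query end
```

The point of starting with HiveConnector is that each of these steps already has a real counterpart there, so only the planner wiring for CTEs would be new.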
-
cc @viczhang861
-
I was thinking about reusing materialized exchanges to speed up queries by reusing exchanges that occur multiple times in the same query. I have a couple of questions about materialized exchanges:
Currently the materialized exchanges write the output to a DFS (such as Crail). Is it possible to materialize to local disks instead (similar to spill-to-disk)?
Are materialized exchanges compatible with the spill-to-disk feature?