Integrating VegaFusion with distributed DataFrames #567

OlegWock · 2025-07-04T17:49:30Z


I use VegaFusion to chart big DataFrames. In addition to Pandas, I want to support PySpark DataFrames. But since they might be too big to fit into the memory of a single machine, I want to outsource actual aggregation from VegaFusion directly to Spark. I'm aware that VegaFusion doesn't support this natively (as far as I understand), but I was wondering if you could point me in a direction I should look into?

As I understand it, in VF 1.x this could be done with the "SQL API" (SqlConnection and SqlDataset), but that was removed in VF 2.x.

Initially I was thinking about using build_pre_transform_spec_plan and then manually translating the transforms from the server spec into PySpark API calls, but then I realized I'd be re-implementing a lot of functionality that VF already has (e.g. the Vega expression parser). I saw you mention DataFusion's new unparse API and that it would power the new DuckDB integration. But as far as I understand, that's not currently the case: DuckDB tables are just converted to Arrow and passed to VF. Was there an issue with the unparse API, or did you decide not to implement it for other reasons?

Answered by jonmmease


jonmmease · 2025-07-05T15:02:55Z

Collaborator

Hi @OlegWock, as you noted, we had some architecture in place in version 1 for generating SQL for different dialects. This was removed for v2 because it was a significant source of complexity (given how it worked) and wasn't being used beyond DuckDB. And after the optimizations in VegaFusion 2 (and probably in DataFusion as well), the DuckDB path wasn't faster, so I didn't put the work into exploring the unparse workflow after all.

That said, I'm definitely interested in bringing this functionality back if there's demand and we develop a cleaner architecture.

Here's what I'm thinking might work.

We still convert the Vega transforms to DataFusion DataFrame operations, but instead of only evaluating these to Arrow at the end of the transform pipeline with collect_to_table(), we would extract the DataFusion LogicalPlan from the DataFrame. Given this logical plan, there are a few things you could do with it for evaluation:

1. Convert to SQL with DataFusion's unparser support (though I don't see support for the Spark dialect right now).
2. Convert to Substrait, and look into using substrait-java to create a Spark plan.
3. Process the logical plan directly and map it to the Spark DataFrame API.

I think what I'd propose is having VegaFusion restore the Datasets interface that was present in version 1. The SQL dialects would be limited to what DataFusion's unparser supports, but there would be other variants that return the DataFusion logical plan, or Substrait.
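
To make the "extract the plan, then unparse it" idea concrete, here is a minimal sketch in plain Python. It does not use DataFusion or VegaFusion: the Scan, Filter, and Aggregate classes and the to_sql function are hypothetical stand-ins for a logical plan and an unparser, and the SQL they emit is only illustrative.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    """Read every row of a named table."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Keep rows matching a predicate (stored as SQL text for brevity)."""
    input: "Plan"
    predicate: str


@dataclass(frozen=True)
class Aggregate:
    """Group rows and compute aggregate expressions."""
    input: "Plan"
    group_by: Tuple[str, ...]
    aggregates: Tuple[str, ...]  # e.g. "count(*) AS n"


Plan = Union[Scan, Filter, Aggregate]


def to_sql(plan: Plan) -> str:
    """Render a plan as nested SQL instead of executing it."""
    if isinstance(plan, Scan):
        return f"SELECT * FROM {plan.table}"
    if isinstance(plan, Filter):
        return f"SELECT * FROM ({to_sql(plan.input)}) AS t WHERE {plan.predicate}"
    if isinstance(plan, Aggregate):
        columns = ", ".join(plan.group_by + plan.aggregates)
        sql = f"SELECT {columns} FROM ({to_sql(plan.input)}) AS t"
        if plan.group_by:
            sql += " GROUP BY " + ", ".join(plan.group_by)
        return sql
    raise TypeError(f"unsupported plan node: {plan!r}")


# A filter followed by an aggregate, described but never evaluated locally.
plan = Aggregate(
    input=Filter(Scan("movies"), "rating > 5"),
    group_by=("genre",),
    aggregates=("count(*) AS n",),
)
sql = to_sql(plan)
print(sql)
```

The same plan object could instead be handed to a different backend (a Substrait serializer, or a translator to the Spark DataFrame API), which is what makes the plan a convenient interchange point.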

Take a look and see what you think!

4 replies

OlegWock Jul 6, 2025
Author

I have high hopes that I can make the unparser work (even though it doesn't support Spark explicitly). I think that would be the easiest option.

Thank you for the reply

OlegWock Jul 14, 2025
Author

Hello, Jon! I was able to build a very shaky proof of concept, but it works! The generated SQL needs to be edited to adjust for Spark specifics, but the semantics seem to be right. I believe a bit of tinkering with the LogicalPlan (before converting to SQL) and/or writing a custom Spark dialect for DataFusion would make it work.
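
As an illustration of the kind of Spark-specific adjustment this can involve, the sketch below rewrites double-quoted identifiers into backtick-quoted ones, since Spark SQL by default uses backticks for identifiers. This is an assumed example: the thread does not say which adjustments were actually needed, and to_spark_identifiers is a hypothetical helper, not part of VegaFusion or DataFusion.

```python
import re

# A single-quoted string literal ('' is an escaped quote); left untouched.
STRING_LITERAL = r"'(?:[^']|'')*'"
# A double-quoted identifier ("" is an escaped quote); captured in group 1.
QUOTED_IDENT = r'"((?:[^"]|"")*)"'
_TOKEN = re.compile(f"{STRING_LITERAL}|{QUOTED_IDENT}")


def to_spark_identifiers(sql: str) -> str:
    """Rewrite "ident" as `ident`, leaving string literals unchanged."""

    def replace(match: "re.Match[str]") -> str:
        ident = match.group(1)
        if ident is None:
            # The string-literal alternative matched; keep it as is.
            return match.group(0)
        name = ident.replace('""', '"')
        # Backticks inside a backtick-quoted identifier are doubled.
        return "`" + name.replace("`", "``") + "`"

    return _TOKEN.sub(replace, sql)


generated = 'SELECT "genre", count(*) AS "n" FROM "movies" GROUP BY "genre"'
print(to_spark_identifiers(generated))
```

A text-level rewrite like this is only a development hack for end-to-end testing. It works on tokens rather than a parsed query, so it will miss anything that is not a simple quoting difference.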

I noticed that in some places VF works with raw bytes in a synchronous manner through the VegaTable struct. Since this wouldn't work for cases when we want to outsource aggregation to another engine (Spark or not), I had to update it where possible to work with DataFrame.

So in short, I basically built a parallel flow. I added a DataFrame variant to the TaskValue enum to allow passing DataFusion DataFrames between tasks (instead of tables with materialized data). At the beginning, Python passes an inline dataset's name and schema, and VF creates an empty DataFrame using the dataset name as the source table name. From there it goes much like the usual flow: preparing a spec plan and then trying to execute the server spec. But instead of operating on tables, it keeps each DataFrame unmaterialized for as long as possible to preserve lineage. After executing the server spec, we get a set of DataFrames that are converted into logical plans, then into SQL, and returned to Python.
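
From the Python side, the flow described above might look roughly like the sketch below. The plan_to_sql argument is a hypothetical callable standing in for whatever entry point would return SQL per output dataset; no such function exists in VegaFusion today. The Spark calls used (DataFrame.dtypes, DataFrame.createOrReplaceTempView, SparkSession.sql) are standard PySpark, but the function takes the session and DataFrame as arguments so the sketch itself does not import pyspark.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: (vega_spec, {dataset_name: {column: type}})
#   -> {output_dataset_name: sql_text}
PlanToSql = Callable[[Dict[str, Any], Dict[str, Dict[str, str]]], Dict[str, str]]


def run_spec_on_spark(
    spark: Any,
    spark_df: Any,
    dataset_name: str,
    spec: Dict[str, Any],
    plan_to_sql: PlanToSql,
) -> Dict[str, Any]:
    """Plan the chart's transforms locally, but execute them in Spark."""
    # Only the dataset's name and schema are handed to the planner;
    # the rows themselves never leave Spark.
    schema = dict(spark_df.dtypes)

    # Register the DataFrame under the same name the generated SQL will use.
    spark_df.createOrReplaceTempView(dataset_name)

    # Ask the (hypothetical) planner for one SQL query per output dataset.
    queries = plan_to_sql(spec, {dataset_name: schema})

    # Each query becomes a lazy Spark DataFrame holding the aggregated result.
    return {name: spark.sql(sql) for name, sql in queries.items()}
```

The results should be small aggregated tables, so the caller can collect them (for example with toPandas()) and embed them into the client spec.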

With this approach I had to introduce a lot of repeated code: the new flow was almost identical to the existing one, just operating on different types. This made me think that the two flows could be unified. VF could operate on DataFrames by default and return DataFrame(s) as the result of spec execution. Then, depending on the original goal, the calling method could either materialize them (e.g., to embed into the spec) or convert them to a logical plan. Maybe that would also give a small performance boost from materializing DataFusion DataFrames less often (but that's just a guess).

One thing I'm not sure about is the VegaFusionCallable::Data functions. Those seem to operate directly on data, and when aggregations are outsourced there won't be any data. So if I understand correctly, this might require updating the signal compiler to work on DataFrames and potentially return a DataFrame as well (instead of scalars), which the calling function could then materialize or convert to a plan.

What do you think of this approach?

jonmmease Jul 14, 2025
Collaborator

Hi @OlegWock, that sounds like great progress!

> I noticed that in some places VF works with raw bytes in a synchronous manner through the VegaTable struct.

Yes, today for each individual Vega dataset we build a query that consists of the pipeline of transforms. Then we evaluate the result to Arrow (the VegaFusionTable) and store it in the cache. Derived datasets then pull this value from the cache.

I think the cleanest extension would be to add DataFusion's LogicalPlan as a new enum variant to TaskValue. A LogicalPlan is the declarative data structure that a DataFrame builds up. DataFrames add additional info about the connection. Importantly, LogicalPlans are hashable, and so will be easier to integrate into TaskValue than DataFrames. You can create a LogicalPlan from a DataFrame and vice versa.
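
The point about hashability can be illustrated with a small Python analogue. TaskValue is a Rust enum in VegaFusion; the TableValue and PlanValue classes and the TaskCache below are hypothetical stand-ins, and the assumption that hashing matters because task values feed into cache keys is a reading of the comment above, not something the thread states.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class TableValue:
    """Stand-in for materialized Arrow data (a VegaFusionTable)."""
    rows: Tuple[Tuple[Any, ...], ...]


@dataclass(frozen=True)
class PlanValue:
    """Stand-in for an unevaluated logical plan: pure, declarative data."""
    source: str
    steps: Tuple[str, ...]


# Python analogue of an enum with two variants.
TaskValue = Union[TableValue, PlanValue]


class TaskCache:
    """Caches task results keyed by the (hashable) input value."""

    def __init__(self) -> None:
        self._store: Dict[TaskValue, Any] = {}
        self.misses = 0

    def get_or_compute(self, key: TaskValue, compute: Callable[[], Any]) -> Any:
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]


cache = TaskCache()
plan_a = PlanValue("movies", ("filter rating > 5", "aggregate by genre"))
plan_b = PlanValue("movies", ("filter rating > 5", "aggregate by genre"))

# Two structurally identical plans hash the same, so the second is a cache hit.
first = cache.get_or_compute(plan_a, lambda: "derived dataset")
second = cache.get_or_compute(plan_b, lambda: "derived dataset")
print(cache.misses)
```

A value that also carries live connection state has no obvious structural hash, which is one way to read the preference for plans over DataFrames here.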

So the flow would still be that a derived dataset creates a DataFrame based on its input, but this DataFrame could be based on either a VegaFusionTable (as it is today) or a LogicalPlan.

> With this approach I had to introduce a lot of repeated code: the new flow was almost identical to the existing one, just operating on different types. This made me think that the two flows could be unified. VF could operate on DataFrames by default and return DataFrame(s) as the result of spec execution.

Each transform in VegaFusion operates on, and returns, a DataFrame. So the only code that should need to change is the code that currently creates a DataFrame from a VegaFusionTable (as this will also need to handle LogicalPlans), and the code that currently uses df.collect() to extract the in-memory Arrow record batches (this will need to optionally extract the logical plan rather than Arrow). We'll need to work out how to decide which approach to use, but we could probably start with something simple where we only use LogicalPlans for remote DataFrames.
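
A toy Python model of those two change points is sketched below. Frame, frame_from_value, and finish are hypothetical names, the filter step stands in for any transform, and the rule "return a plan only when the input was a plan" is the simple starting policy suggested above rather than a settled design.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple, Union

Row = Tuple[Any, ...]
Step = Tuple[str, Callable[[Row], bool]]  # (description, row predicate)


@dataclass(frozen=True)
class TableValue:
    """Stand-in for in-memory Arrow data."""
    rows: Tuple[Row, ...]


@dataclass(frozen=True)
class PlanValue:
    """Stand-in for a logical plan over a remote table."""
    source: str
    steps: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Frame:
    """Stand-in for a lazy DataFrame: an origin plus pending steps."""
    origin: Union[TableValue, PlanValue]
    steps: Tuple[Step, ...] = ()

    def filter(self, description: str, predicate: Callable[[Row], bool]) -> "Frame":
        return Frame(self.origin, self.steps + ((description, predicate),))


def frame_from_value(value: Union[TableValue, PlanValue]) -> Frame:
    """Change point 1: build a frame from either kind of input."""
    return Frame(value)


def finish(frame: Frame) -> Union[TableValue, PlanValue]:
    """Change point 2: materialize local data, or hand back a plan."""
    if isinstance(frame.origin, PlanValue):
        descriptions = tuple(description for description, _ in frame.steps)
        return PlanValue(frame.origin.source, frame.origin.steps + descriptions)
    rows = frame.origin.rows
    for _, predicate in frame.steps:
        rows = tuple(row for row in rows if predicate(row))
    return TableValue(rows)


def keep_high_ratings(frame: Frame) -> Frame:
    """A transform: takes a frame, returns a frame, unaware of the origin."""
    return frame.filter("rating > 5", lambda row: row[1] > 5)


local = finish(keep_high_ratings(frame_from_value(TableValue((("a", 3), ("b", 8))))))
remote = finish(keep_high_ratings(frame_from_value(PlanValue("spark_movies"))))
print(local)
print(remote)
```

The transform in the middle is identical for both inputs; only the entry and exit functions know whether data is local.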

> One thing I'm not sure about is the VegaFusionCallable::Data functions. Those seem to operate directly on data, and when aggregations are outsourced there won't be any data. So if I understand correctly, this might require updating the signal compiler to work on DataFrames and potentially return a DataFrame as well (instead of scalars), which the calling function could then materialize or convert to a plan.

The data functions are typically only called on relatively small inline datasets (in particular for how Vega-Lite manages selections). I picture that we will continue to use DataFusion's regular Arrow approach for datasets that don't originate as Spark tables, and I think it would be pretty unusual for a Vega spec to call a data function on a dataset derived from a Spark table. When it does happen, we can decide whether it's an error or whether it results in pulling data into memory.

> I believe a bit of tinkering with the LogicalPlan (before converting to SQL) and/or writing a custom Spark dialect for DataFusion would make it work.

I'd prefer going down the path of contributing a custom Spark dialect to DataFusion. I'd like to keep dialect-specific code out of VegaFusion if at all possible (this got overwhelming in VegaFusion 1.x). But we can certainly use a hack to get the SQL working during development so that we can test things end-to-end.

> What do you think of this approach?

I think your general approach above makes sense, and I'm impressed you got a POC working so quickly! See if my comments above make sense, and then we can work through a plan.

OlegWock Jul 14, 2025
Author

> I'd prefer going down the path of contributing a custom Spark dialect to DataFusion. I'd like to keep dialect-specific code out of VegaFusion if at all possible (this got overwhelming in VegaFusion 1.x). But we can certainly use a hack to get the SQL working during development so that we can test things end-to-end.

Oh yeah, definitely. We might want to include those tweaks in our fork of VF, but there is no reason to put them upstream. I think it makes sense to support the most generic use case (i.e. logical plans), at least to start with.

The approach with logical plan sounds good. I'll let you know how it goes. Thanks for your help.
