Comparing read times between dense and sparse parquet files #4258
mikeprince4 started this conversation in General
Replies: 1 comment 7 replies
I would expect reading data from a sparse table while filtering out NULL values to be much faster due to predicate pushdown. Hopefully, the Daft team can provide some suggestions.
I wrote a test to compare the time to read a dense vs. a sparse Parquet file. I expected the times to be very similar, so I was surprised when reading the sparse columns took much longer, despite filtering out the nulls. This is not what I expected, as I had been told that Daft could perform this kind of operation efficiently. Is this behavior in fact expected, or am I doing something wrong?
The code and results table are below. Thanks in advance!
Here are the results of the experiment with `iterations=100`.