[Variant] Speedup validation #7878


Open · wants to merge 3 commits into base: main

Conversation

@friendlymatthew (Contributor) commented Jul 7, 2025

Rationale for this change

This PR contains algorithmic modifications to the validation logic and the associated benchmarks, specifically targeting complex object and list validation.

Previously, the approach involved iterating over each element and repeatedly fetching the same slice of the backing buffer, then slicing into that buffer again for each individual element. This led to redundant buffer access.

The new validation approach runs in multiple passes that take advantage of the variant's memory layout. For example, dictionary field names are stored contiguously; instead of checking each field name for valid UTF-8 separately, we now validate the entire field-name buffer in a single pass.

The benchmarks were adapted from the test_json_to_variant_object_very_large, test_json_to_variant_object_complex, and test_json_to_variant_array_nested_large tests.

Compared to #7871, we observe a significant improvement in performance:

(screenshot of benchmark results, Jul 7, 2025)

@scovich @alamb

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 7, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch from f2ff2a3 to 1a9ff6c Compare July 7, 2025 16:02
@friendlymatthew (Contributor, Author) commented Jul 7, 2025

Note: this is still a POC. There's some code movement to do, but I figured it would be best to leave the diff like this to make it more readable.

But the core concept is fully fleshed out.

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch from 1a9ff6c to b6af863 Compare July 7, 2025 18:12
@alamb (Contributor) commented Jul 7, 2025

BTW, I ran the test from this ticket locally, before and with this PR:

Previously

        PASS [  21.886s] parquet-variant::test_json_to_variant test_json_to_variant_object_very_large
        PASS [  15.711s] parquet-variant::test_json_to_variant test_json_to_variant_object_very_large

With this PR

        PASS [   6.570s] parquet-variant::test_json_to_variant test_json_to_variant_object_very_large

Only 6.5 seconds 🤗 -- nice work

@scovich (Contributor) left a comment

Very interesting ideas. Made a partial pass with some ideas for improved control flow.

let offset_bytes = slice_from_slice(self.value, byte_range)?;

let offset_chunks = offset_bytes.chunks_exact(self.header.offset_size());
assert!(offset_chunks.remainder().is_empty()); // guaranteed by shallow validation
Contributor

We're in a method that returns Result... why not use it instead of having to worry whether shallow validation covered the corner case?

Contributor Author

I made this a debug_assert rather than an explicit branch.

The way ranges are created, we can always guarantee the length is a multiple of self.header.offset_size().

Shallow validation already does the work of deriving these offsets (.checked_mul(self.header.offset_size())), so I want to avoid as much extraneous work as possible.

u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize
}
OffsetSizeBytes::Four => {
u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
Contributor

Is this faster and/or simpler than try_into from slice to array?

let end_offset = offsets[i + 1];

let value_bytes = slice_from_slice(value_buffer, start_offset..end_offset)?;
Variant::try_new_with_metadata(self.metadata, value_bytes)?;
Contributor

At some point, copying that fat metadata over and over will be a meaningful contributor to the total cost.

Are we sure it's ~free?

If not sure, should we somehow allow creating variant instances with metadata references, at least internally for validation purposes?

Contributor Author

Yeah, I've been thinking about this. Any time we need to recursively validate a variant, by way of object fields or list elements, we will eagerly validate the VariantMetadata again and again.

We can definitely avoid work here. I will think about this more and follow up in a later PR.

Comment on lines +241 to +244
.collect::<Vec<_>>();

// (3)
let monotonic_offsets = offsets.is_sorted_by(|a, b| a < b);
Contributor

Why not Iterator::is_sorted_by?

let offsets_monotonically_increasing = offset_chunks
    .map(...)
    .is_sorted_by(...);

(again below)

Contributor

Update: The code below uses the materialized offsets.

But for large objects, wouldn't it be cheaper (and simpler) to combine the checks instead of materializing the offset array? Most likely inside a helper method so the code can leverage ?:

    let mut offsets = offset_chunks.map(|chunk| match chunk {
        ...
    });
    let Some(mut prev_offset) = offsets.next() else {
        return Ok(());
    };
    for offset in offsets {
        if prev_offset >= offset {
            return Err(... not monotonic ...);
        }

        let value_bytes = slice_from_slice(value_buffer, prev_offset..offset)?;
        prev_offset = offset;
        let _ = Variant::try_new_with_metadata(self.metadata, value_bytes)?;
    }
    Ok(())
}

Comment on lines +269 to +271
.collect::<Vec<usize>>();

let offsets_monotonically_increasing = offsets.is_sorted_by(|a, b| a < b);
Contributor

As above, no need to materialize the iterator if we combine the checks into a single loop?
(but also see comment below)

Comment on lines +314 to +316
if let Some(prev_field_name) = prev_field_name {
is_sorted = is_sorted && prev_field_name < field_name;
}
Contributor

By deferring the check to the next iteration, don't we risk missing the case where the last pair breaks ordering? Combined with the above, a better control flow might be (again, in a helper function so we can use ?):

    let Some(prev_offset) = offsets.next() else {
        return Ok(());
    };
    let mut field_names = offsets.scan(prev_offset, |prev_offset, offset| {
        if *prev_offset >= offset {
            return Some(Err(... offset not monotonic ...));
        }

        let offset_range = *prev_offset..offset;
        *prev_offset = offset;

        Some(
            value_str
                .get(offset_range)
                .ok_or_else(|| overflow_error("overflowed")),
        )
    });

    let Some(mut prev_field_name) = field_names.next().transpose()? else {
        return Ok(()); // empty dictionary, nothing to validate
    };

    for field_name in field_names {
        let field_name = field_name?;
        if self.header.is_sorted && prev_field_name >= field_name {
            return Err(... not sorted or has a duplicate key ...);
        }
        prev_field_name = field_name;
    }

u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
}
})
.collect::<Vec<_>>();
Contributor

As above, I expect it will be straightforward to avoid materializing these iterators
(again below)

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch from 7a581d1 to 1f461da Compare July 8, 2025 02:39
Labels: parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues:
[Variant] test_json_to_variant_object_very_large takes over 20s

3 participants