[Variant] Speedup validation #7878
Conversation
(force-pushed from f2ff2a3 to 1a9ff6c)
Note: this is still a POC. There's some code movement to do, but I figured it would be best to leave the diff like this to make it more readable. The core concept is fully fleshed out, though.
(force-pushed from 1a9ff6c to b6af863)
BTW I ran the test from this ticket locally, before and after this PR. With this PR it takes only 6.5 seconds 🤗 -- nice work
Very interesting ideas. Made a partial pass with some ideas for improved control flow.
parquet-variant/src/variant/list.rs (outdated)
let offset_bytes = slice_from_slice(self.value, byte_range)?;

let offset_chunks = offset_bytes.chunks_exact(self.header.offset_size());
assert!(offset_chunks.remainder().is_empty()); // guaranteed by shallow validation
We're in a method that returns Result... why not use it instead of having to worry whether shallow validation covered the corner case?
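For illustration, a minimal sketch of what "using the Result" could look like here; the error variant and message are assumptions, not necessarily the crate's actual convention:

```rust
let offset_chunks = offset_bytes.chunks_exact(self.header.offset_size());
if !offset_chunks.remainder().is_empty() {
    // Hypothetical error path instead of an assert; exact error type/message assumed.
    return Err(ArrowError::InvalidArgumentError(
        "offset buffer length is not a multiple of offset_size".to_string(),
    ));
}
```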
I made this a debug_assert over an explicit branch. The way ranges are created, we can always guarantee the slice length is a multiple of self.header.offset_size(). Shallow validation already does the work of deriving these offsets (.checked_mul(self.header.offset_size())), so I want to avoid as much extraneous work as possible.
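For context, a rough sketch of the invariant being relied on (num_offsets, first_offset_byte, and overflow_error are illustrative names, not necessarily the actual code):

```rust
// Shallow validation derives the byte range from the element count, so the slice
// length is num_offsets * offset_size by construction, and chunks_exact(offset_size)
// can never leave a remainder.
let offset_size = self.header.offset_size();
let byte_len = num_offsets
    .checked_mul(offset_size)
    .ok_or_else(|| overflow_error("offset buffer length"))?;
let byte_range = first_offset_byte..first_offset_byte + byte_len;
```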
        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize
    }
    OffsetSizeBytes::Four => {
        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
Is this faster and/or simpler than try_into from slice to array?
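For reference, the try_into alternative would look roughly like this for the four-byte case (chunks_exact(4) guarantees the chunk length, so the conversion cannot fail):

```rust
let bytes: [u8; 4] = chunk.try_into().expect("chunks_exact(4) yields 4-byte chunks");
u32::from_le_bytes(bytes) as usize
```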
let end_offset = offsets[i + 1];

let value_bytes = slice_from_slice(value_buffer, start_offset..end_offset)?;
Variant::try_new_with_metadata(self.metadata, value_bytes)?;
At some point, copying that fat metadata over and over will be a meaningful contributor to the total cost.
Are we sure it's ~free?
If not sure, should we somehow allow creating variant instances with metadata references, at least internally for validation purposes?
Yeah, I've been thinking about this. Any time we need to recursively validate a variant, by way of object fields or list elements, we will eagerly validate the VariantMetadata again and again.
We can definitely avoid work here. I will think about this more and follow up in a later PR.
    .collect::<Vec<_>>();

// (3)
let monotonic_offsets = offsets.is_sorted_by(|a, b| a < b);
Why not Iterator::is_sorted_by?

let offsets_monotonically_increasing = offset_chunks
    .map(...)
    .is_sorted_by(...);

(again below)
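Spelled out for the four-byte case (the real code would still match on self.header.offset_size(); this is just a sketch of the shape):

```rust
let offsets_monotonically_increasing = offset_bytes
    .chunks_exact(4)
    .map(|chunk| u32::from_le_bytes(chunk.try_into().unwrap()) as usize)
    .is_sorted_by(|a, b| a < b);
```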
Update: The code below uses the materialized offsets.
But for large objects, wouldn't it be cheaper (and simpler) to combine the checks instead of materializing the offset array? Most likely inside a helper method so the code can leverage ?:
let mut offsets = offset_chunks.map(|chunk| match chunk {
    ...
});
let Some(mut prev_offset) = offsets.next().transpose()? else {
    return Ok(());
};
for offset in offsets {
    let offset = offset?;
    if prev_offset >= offset {
        return Err(... not monotonic ...);
    }
    let value_bytes = slice_from_slice(value_buffer, prev_offset..offset)?;
    prev_offset = offset;
    let _ = Variant::try_new_with_metadata(self.metadata, value_bytes)?;
}
Ok(())
    .collect::<Vec<usize>>();

let offsets_monotonically_increasing = offsets.is_sorted_by(|a, b| a < b);
As above, no need to materialize the iterator if we combine the checks into a single loop?
(but also see comment below)
if let Some(prev_field_name) = prev_field_name {
    is_sorted = is_sorted && prev_field_name < field_name;
}
By deferring the check to the next iteration, we risk missing the case where the last pair breaks ordering? Combined with the above, a better control flow might be (again, in a helper function so we can use ?):
let Some(prev_offset) = offsets.next().transpose()? else {
    return Ok(());
};
let mut field_names = offsets.scan(prev_offset, |prev_offset, offset| {
    let offset = match offset {
        Ok(offset) => offset,
        Err(e) => return Some(Err(e)),
    };
    if *prev_offset >= offset {
        return Some(Err(... offset not monotonic ...));
    }
    let offset_range = *prev_offset..offset;
    *prev_offset = offset;
    Some(
        value_str
            .get(offset_range)
            .ok_or_else(|| overflow_error("overflowed")),
    )
});
let Some(mut prev_field_name) = field_names.next().transpose()? else {
    return Ok(()); // empty dictionary, nothing to validate
};
for field_name in field_names {
    let field_name = field_name?;
    if self.header.is_sorted && prev_field_name >= field_name {
        return Err(... not sorted or has a duplicate key ...);
    }
    prev_field_name = field_name;
}
        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
    }
})
.collect::<Vec<_>>();
As above, I expect it will be straightforward to avoid materializing these iterators
(again below)
(force-pushed from 7a581d1 to 1f461da)
Rationale for this change
test_json_to_variant_object_very_large takes over 20s (#7872)

This PR contains algorithmic modifications to the validation logic and the associated benchmarks, specifically targeting complex object and list validation.
Previously, the approach involved iterating over each element and repeatedly fetching the same slice of the backing buffer, then slicing into that buffer again for each individual element. This led to redundant buffer access.
The new validation approach works in multiple passes that take advantage of the variant's memory layout. For example, dictionary field names are stored contiguously; instead of checking whether each field name is UTF-8 encoded separately, we now validate the entire field name buffer in a single pass.
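As a rough sketch of that single-pass idea (variable names and the error constructor are illustrative, not the exact code in this PR):

```rust
// Validate the entire contiguous field-name region once...
let all_field_names = std::str::from_utf8(field_name_bytes).map_err(|_| {
    ArrowError::InvalidArgumentError("metadata dictionary is not valid UTF-8".to_string())
})?;
// ...then each field name is a cheap subslice of the already-validated str.
let field_name = all_field_names
    .get(start..end)
    .ok_or_else(|| ArrowError::InvalidArgumentError("invalid field name range".to_string()))?;
```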
The benchmark cases were adapted from the test_json_to_variant_object_very_large, test_json_to_variant_object_complex, and test_json_to_variant_array_nested_large test cases.

Compared to #7871, we observe a significant improvement in performance:
@scovich @alamb