[Variant] Speedup validation #7878
Conversation
(force-pushed from f2ff2a3 to 1a9ff6c)
Note: this is still a POC. There's some code movement to do, but I figured it would be best to leave the diff like this to make it more readable. The core concept is fully fleshed out, though.
(force-pushed from 1a9ff6c to b6af863)
BTW I ran the test from this ticket locally, before and after this PR. With this PR it takes only 6.5 seconds 🤗 -- nice work
Very interesting ideas. Made a partial pass with some ideas for improved control flow.
parquet-variant/src/variant/list.rs (outdated)
let offset_bytes = slice_from_slice(self.value, byte_range)?;

let offset_chunks = offset_bytes.chunks_exact(self.header.offset_size());
assert!(offset_chunks.remainder().is_empty()); // guaranteed by shallow validation
We're in a method that returns Result... why not use it instead of having to worry whether shallow validation covered the corner case?
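For illustration, a minimal sketch of what "using the Result" could look like here; the error variant and message are assumptions, not necessarily the crate's actual convention:

```rust
let offset_chunks = offset_bytes.chunks_exact(self.header.offset_size());
if !offset_chunks.remainder().is_empty() {
    // Hypothetical error path instead of an assert; exact error type/message assumed.
    return Err(ArrowError::InvalidArgumentError(
        "offset buffer length is not a multiple of offset_size".to_string(),
    ));
}
```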
I made this a debug_assert over an explicit branch. The way ranges are created, we can always guarantee the slice length is a multiple of self.header.offset_size(). Shallow validation already does the work of deriving these offsets (.checked_mul(self.header.offset_size())), so I want to avoid as much extraneous work as possible.
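For context, a rough sketch of the invariant being relied on (num_offsets, first_offset_byte, and overflow_error are illustrative names, not necessarily the actual code):

```rust
// Shallow validation derives the byte range from the element count, so the slice
// length is num_offsets * offset_size by construction, and chunks_exact(offset_size)
// can never leave a remainder.
let offset_size = self.header.offset_size();
let byte_len = num_offsets
    .checked_mul(offset_size)
    .ok_or_else(|| overflow_error("offset buffer length"))?;
let byte_range = first_offset_byte..first_offset_byte + byte_len;
```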
        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize
    }
    OffsetSizeBytes::Four => {
        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
Is this faster and/or simpler than try_into from slice to array?
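For reference, the try_into alternative would look roughly like this for the four-byte case (chunks_exact(4) guarantees the chunk length, so the conversion cannot fail):

```rust
let bytes: [u8; 4] = chunk.try_into().expect("chunks_exact(4) yields 4-byte chunks");
u32::from_le_bytes(bytes) as usize
```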
let end_offset = offsets[i + 1];

let value_bytes = slice_from_slice(value_buffer, start_offset..end_offset)?;
Variant::try_new_with_metadata(self.metadata, value_bytes)?;
At some point, copying that fat metadata over and over will be a meaningful contributor to the total cost.
Are we sure it's ~free?
If not sure, should we somehow allow creating variant instances with metadata references, at least internally for validation purposes?
Yeah, I've been thinking about this. Any time we need to recursively validate a variant, by way of object fields or list elements, we will eagerly validate the VariantMetadata again and again.
We can definitely avoid work here. I will think about this more and follow up in a later PR.
    .collect::<Vec<_>>();

// (3)
let monotonic_offsets = offsets.is_sorted_by(|a, b| a < b);
Why not Iterator::is_sorted_by?

let offsets_monotonically_increasing = offset_chunks
    .map(...)
    .is_sorted_by(...);

(again below)
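Spelled out for the four-byte case (the real code would still match on self.header.offset_size(); this is just a sketch of the shape):

```rust
let offsets_monotonically_increasing = offset_bytes
    .chunks_exact(4)
    .map(|chunk| u32::from_le_bytes(chunk.try_into().unwrap()) as usize)
    .is_sorted_by(|a, b| a < b);
```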
Update: The code below uses the materialized offsets.
But for large objects, wouldn't it be cheaper (and simpler) to combine the checks instead of materializing the offset array? Most likely inside a helper method so the code can leverage ?:
let mut offsets = offset_chunks.map(|chunk| match chunk {
    ...
});
let Some(mut prev_offset) = offsets.next().transpose()? else {
    return Ok(());
};
for offset in offsets {
    let offset = offset?;
    if prev_offset >= offset {
        return Err(... not monotonic ...);
    }
    let value_bytes = slice_from_slice(value_buffer, prev_offset..offset)?;
    prev_offset = offset;
    let _ = Variant::try_new_with_metadata(self.metadata, value_bytes)?;
}
Ok(())
    .collect::<Vec<usize>>();

let offsets_monotonically_increasing = offsets.is_sorted_by(|a, b| a < b);
As above, no need to materialize the iterator if we combine the checks into a single loop?
(but also see comment below)
if let Some(prev_field_name) = prev_field_name {
    is_sorted = is_sorted && prev_field_name < field_name;
}
By deferring the check to the next iteration, we risk missing the case where the last pair breaks ordering? Combined with the above, a better control flow might be (again, in a helper function so we can use ?):
let Some(prev_offset) = offsets.next().transpose()? else {
    return Ok(());
};
let mut field_names = offsets.scan(prev_offset, |prev_offset, offset| {
    let offset = match offset {
        Ok(offset) => offset,
        Err(e) => return Some(Err(e)),
    };
    if *prev_offset >= offset {
        return Some(Err(... offset not monotonic ...));
    }
    let offset_range = *prev_offset..offset;
    *prev_offset = offset;
    Some(
        value_str
            .get(offset_range)
            .ok_or_else(|| overflow_error("overflowed")),
    )
});
let Some(mut prev_field_name) = field_names.next().transpose()? else {
    return Ok(()); // empty dictionary, nothing to validate
};
for field_name in field_names {
    let field_name = field_name?;
    if self.header.is_sorted && prev_field_name >= field_name {
        return Err(... not sorted or has a duplicate key ...);
    }
    prev_field_name = field_name;
}
        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
    }
})
.collect::<Vec<_>>();
As above, I expect it will be straightforward to avoid materializing these iterators
(again below)
(force-pushed from 7a581d1 to 1f461da)
Rationale for this change
test_json_to_variant_object_very_large takes over 20s (#7872)

This PR contains algorithmic modifications to the validation logic and the associated benchmarks, specifically targeting complex object and list validation.
Previously, the approach involved iterating over each element and repeatedly fetching the same slice of the backing buffer, then slicing into that buffer again for each individual element. This led to redundant buffer access.
The new validation approach works in multiple passes that take advantage of the variant's memory layout. For example, dictionary field names are stored contiguously; instead of checking whether each field name is UTF-8 encoded separately, we now validate the entire field name buffer in a single pass.
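As a rough sketch of that single-pass idea (variable names and the error constructor are illustrative, not the exact code in this PR):

```rust
// Validate the entire contiguous field-name region once...
let all_field_names = std::str::from_utf8(field_name_bytes).map_err(|_| {
    ArrowError::InvalidArgumentError("metadata dictionary is not valid UTF-8".to_string())
})?;
// ...then each field name is a cheap subslice of the already-validated str.
let field_name = all_field_names
    .get(start..end)
    .ok_or_else(|| ArrowError::InvalidArgumentError("invalid field name range".to_string()))?;
```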
The benchmark cases were adapted from the test_json_to_variant_object_very_large, test_json_to_variant_object_complex, and test_json_to_variant_array_nested_large test cases.

Compared to #7871, we observe a significant improvement in performance:
@scovich @alamb