feat: re-export name mapping #1116

Open
wants to merge 9 commits into
base: main

Contributor

@jdockerty jdockerty commented Mar 19, 2025

Which issue does this PR close?

Likely helps towards #919 and this was also discussed in Slack.

What changes are included in this PR?

This publicly re-exports the name_mapping module to iceberg::spec. Prior to this, it is private and inaccessible outside of this crate.

This also includes a MappedFields structure, which borrows heavily from the Java implementation.
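For readers less familiar with the pattern, here is a minimal sketch of what re-exporting a private module's items looks like. The crate layout is inlined as nested modules so the snippet stands alone; the glob `pub use`, the field names, and the `example_field` helper are assumptions for illustration, not the actual contents of this PR.

```rust
// Hypothetical stand-in for `iceberg::spec`, inlined so the sketch is self-contained.
pub mod spec {
    // The module itself stays private: `spec::name_mapping` cannot be named from outside.
    mod name_mapping {
        // Illustrative shape only; the real struct has more attributes and fields.
        #[derive(Debug, Clone, PartialEq, Eq)]
        pub struct MappedField {
            pub field_id: Option<i32>,
            pub names: Vec<String>,
        }
    }

    // The re-export makes the public items reachable as `spec::MappedField`.
    pub use name_mapping::*;
}

// Code outside `spec` can now construct and return the re-exported type.
pub fn example_field() -> spec::MappedField {
    spec::MappedField {
        field_id: Some(1),
        names: vec!["id".to_string()],
    }
}
```

Without the `pub use` line, `example_field` would fail to compile because `spec::MappedField` would not be visible outside `spec`.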

Are these changes tested?

The main changes here are not functional changes except to visibility.

The new MappedFields structure has basic test coverage.

@jdockerty jdockerty marked this pull request as ready for review March 19, 2025 13:34
@jdockerty jdockerty changed the title from "chore: re-export name mapping" to "feat: re-export name mapping" Mar 19, 2025
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @jdockerty for this PR, generally LGTM! I left some comments for improvement.

@@ -32,9 +36,12 @@ pub struct NameMapping {
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
#[serde(rename_all = "kebab-case")]
pub struct MappedField {
/// Iceberg field ID when a field's name is present within `names`.
Contributor

I think we should add a MappedFields type, like the one in the Java implementation. MappedFields is a list of fields with an index.

Contributor Author

@jdockerty jdockerty Mar 20, 2025

I've added this into mapped_fields.rs 🙇

Edit: it looks like my latest commits aren't showing up and are still being processed by GitHub after approx. 10 minutes. If we're running into a GitHub outage, the current diff is viewable here. Update: this resolved after about an hour, please ignore.

@jdockerty jdockerty force-pushed the chore/expose-name-mapping branch from a9ff2b9 to 85b024e Compare March 20, 2025 12:24
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @jdockerty for this PR, generally LGTM! I left some comments for improvement.

/// Utility mapping which contains field names to IDs and
/// field IDs to the underlying [`MappedField`].
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct MappedFields {
Contributor

Given that we are going to add a lot of visitors for NameMapping, how about we create a NameMapping module and put everything related there?

Contributor Author

I'm not quite sure what you mean; there is already a name_mapping.rs as a separate module. Or do you mean including everything in this file instead and using 👇?

mod name_mapping {
 // contents of name_mapping.rs here
}

Contributor

Yes, the mapped fields should also be included in name_mapping.rs. You can move everything from mapped_fields to name_mapping.

Contributor

I still don't understand why we need the MappedFields structure. The pyiceberg implementation doesn't use it (the initial implementation/review is in apache/iceberg-python#212). There isn't a use case (unless I'm missing one) where we index just the first layer with MappedFields without later having to create another index based on a full traversal. cc @liurenjie1024 @Fokko
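To make the "index based on a full traversal" point concrete, here is a minimal sketch of building a field-ID index over every nesting level in one recursive pass. The struct shape and the `index_by_id` helper are hypothetical and simplified; this is not the pyiceberg or Java implementation.

```rust
use std::collections::HashMap;

// Simplified, hypothetical shape: a mapped field with optional ID and nested children.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MappedField {
    pub field_id: Option<i32>,
    pub names: Vec<String>,
    pub fields: Vec<MappedField>,
}

// Walk the whole tree and record every field that carries an ID,
// so nested fields are reachable from a single flat index.
pub fn index_by_id<'a>(
    fields: &'a [MappedField],
    index: &mut HashMap<i32, &'a MappedField>,
) {
    for field in fields {
        if let Some(id) = field.field_id {
            index.insert(id, field);
        }
        // Recurse into children; a first-layer-only index would stop before this step.
        index_by_id(&field.fields, index);
    }
}
```

An index built only from the top-level list would miss nested children such as the members of a struct column, which is why a full traversal is needed for lookups by ID.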


impl MappedFields {
/// Create a new [`MappedFields`].
pub fn new(fields: Vec<MappedField>) -> Self {
Contributor

The return value should be a Result, since the value passed by the user may be invalid.


for field in &fields {
if let Some(id) = field.field_id() {
id_to_field.insert(id, field.clone());
Contributor

We need to check for duplicate IDs and names here.

Contributor Author

@jdockerty jdockerty Mar 21, 2025

Makes sense to me 👍

Does it matter that this differs from the Java impl here?

I modelled this based on the Java impl and it doesn't look like they have duplicate checks there, perhaps I'm missing something very obvious though from not doing much Java 😆

Edit: I've implemented this in dea509b for now, it is easy to change if there's something wrong with it 👍

/// Iceberg fallback field name to ID mapping.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
#[serde(transparent)]
pub struct NameMapping {
- pub root: Vec<MappedField>,
+ root: Vec<MappedField>,
Contributor

This should contain a MappedFields.

#[serde(default)]
#[serde(skip_serializing_if = "Vec::is_empty")]
#[serde_as(deserialize_as = "DefaultOnNull")]
- pub fields: Vec<MappedField>,
+ fields: Vec<MappedField>,
Contributor

This should be a MappedFields.

Contributor Author

@jdockerty jdockerty Mar 21, 2025

Changing this one to MappedFields alters all of the expected output JSON too.

I assume that is expected for now and I'll update the other tests 👍

Contributor

jonathanc-n commented Mar 21, 2025

@jdockerty @liurenjie1024 I believe #1072 contains a lot of the functionality in this PR; it got split up, with #1082 being the first part of it.

@liurenjie1024
Contributor

> @jdockerty @liurenjie1024 I believe #1072 contains a lot of the functionality in this pr, this got split into #1082 being the first part of it.

Hi @jonathanc-n, I think #1072 adds extra functionality on top of NameMapping, such as a visitor and indexing. I think this PR should land before #1072, since it is a refactoring that should happen before the module is exposed publicly.
