Skip to content

Set Formatted TableOptions Enum #16166

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

mertak-synnada
Copy link

@mertak-synnada mertak-synnada commented May 23, 2025

Which issue does this PR close?

In the current implementation, the TableOptions struct stores information for CSV, JSON, and Parquet file types, and its behavior changes depending on whether the file format is set before or after. This PR introduces a new struct for file-type-specific cases, while keeping TableOptions as the default state, since it is also used in the session before a file type is specified.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate datasource Changes to the datasource crate labels May 23, 2025
berkaysynnada
berkaysynnada previously approved these changes May 27, 2025
Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've worked with @mertak-synnada closely here, and the final state makes sense to me. If there are any points that we miss why TableOptions is used such, please let me know. Otherwise, I'll wait for a few days and merge this PR.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @berkaysynnada and @mertak-synnada

I am a little confused about the new structure and exactly what problem is being solved with this change.

The PR description says

In the current implementation, the TableOptions struct stores information for CSV, JSON, and Parquet file types, and its behavior changes depending on whether the file format is set before or after.

However, I didn't see any tests or examples that show a different behavior after this PR.

In my initial quick read through it I would say it has made the APIs harder to use (as now there are format specific options on TableOptions AND a new TableFormatOptions). Is there any way to consolidate the number of structures? If not, I think we should clearly document the difference between the two structures

/// Initializes a new `TableOptions` from a hash map of string settings.
#[derive(Debug, Clone)]
#[allow(clippy::large_enum_variant)]
pub enum TableFormatOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a new pub struct, can we please add documentation explaining

  1. what it is used for
  2. How it is different than TableOptions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will resolve these if I can relieve your concerns :)

Ok(self.default_from_output_format(options))
}

fn default_from_output_format(&self, options: OutputFormat) -> Arc<dyn FileFormat>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please document these new public APIs?

@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt + fmt::Debug {
&self,
state: &dyn Session,
format_options: &HashMap<String, String>,
) -> Result<Arc<dyn FileFormat>>;
) -> Result<Arc<dyn FileFormat>> {
let options = match self.options() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this method is part of a trait, I think it can be overridden by other implementations. Is that the intention?

In other words, do you expect anyone implementing FileFormatFactory to implement create() or should they now implement default_from_output_format and options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should implement default_from_output_format and options. This change is not directly related to the actual changes of the PR, but all create() methods show a similar pattern, so I remember we thought like why not avoid duplication


fn default_from_output_format(&self, options: OutputFormat) -> Arc<dyn FileFormat>;

fn options(&self) -> (Option<OutputFormat>, ConfigFileType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it strange that a FileFormatFactory returns an Option<OutputFormat> and then another method takes the output format as input

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this confuses me now too. Having ConfigFileType and OutputFormat at the same time is also a bit weird. I'll take a second look at this part

}
}

/// Sets a configuration value for a specific key within `TableOptions`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is in TableFormatOptions I think

}
}

pub fn csv_options_or_default(&self) -> CsvOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, these new public methods should have documentation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading this some more, I don't undertand the usecase for csv_options_or_default (and the other similar methods)

Since this is now an enum, the calling code should directly handle the format in the TableFormatOptions -- this code will silently drop any config options if it isn't CSV which seems like it would potentially mask a bug in the caller

I would expect calling code to do something like

match TableFormatOptions { 
  Self::Csv => {}
  Self::Json => {}
  Self::Parquet => {{
}

@berkaysynnada
Copy link
Contributor

Thank you for this contribution @berkaysynnada and @mertak-synnada

I am a little confused about the new structure and exactly what problem is being solved with this change.

The PR description says

In the current implementation, the TableOptions struct stores information for CSV, JSON, and Parquet file types, and its behavior changes depending on whether the file format is set before or after.

However, I didn't see any tests or examples that show a different behavior after this PR.

In my initial quick read through it I would say it has made the APIs harder to use (as now there are format specific options on TableOptions AND a new TableFormatOptions). Is there any way to consolidate the number of structures? If not, I think we should clearly document the difference between the two structures

The confusion we are trying to address is that the TableOptions struct is currently being used in two different ways:

  1. As a sub-configuration field in SessionState, similar to other configuration fields.
  2. For directly configuring the specific file currently being worked on, typically by cloning it from the SessionState and overriding fields as needed.,

In other words, it seems like we actually need two distinct objects for these two different use cases:

  • One for storing global TableOptions configuration across the session, applicable to all file types.
  • Another for holding format-specific options derived from the global configuration, intended for the currently active file.
    Another key point is that once you set a specific file format, you don’t need the configuration fields for the others in the existing TableOptions. So for the second use case, the appropriate design might be an enum rather than a struct.

In summary, this PR is more of a code cleanup than a functional change. It aims to clarify the separation of concerns and improve maintainability, in my opinion.

@berkaysynnada berkaysynnada dismissed their stale review May 29, 2025 06:30

I'd like to go over this PR one more time and dropping the approve to not cause someone merge this prematurely

/// # Returns
///
/// A result indicating success or failure in applying the settings.
pub fn alter_with_string_hash_map(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe update would be clearer as alter might imply some change other than updating the map

}
}

pub fn csv_options_or_default(&self) -> CsvOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading this some more, I don't undertand the usecase for csv_options_or_default (and the other similar methods)

Since this is now an enum, the calling code should directly handle the format in the TableFormatOptions -- this code will silently drop any config options if it isn't CSV which seems like it would potentially mask a bug in the caller

I would expect calling code to do something like

match TableFormatOptions { 
  Self::Csv => {}
  Self::Json => {}
  Self::Parquet => {{
}

@alamb
Copy link
Contributor

alamb commented May 30, 2025

However, I didn't see any tests or examples that show a different behavior after this PR.
In my initial quick read through it I would say it has made the APIs harder to use (as now there are format specific options on TableOptions AND a new TableFormatOptions). Is there any way to consolidate the number of structures? If not, I think we should clearly document the difference between the two structures

The confusion we are trying to address is that the TableOptions struct is currently being used in two different ways:

  1. As a sub-configuration field in SessionState, similar to other configuration fields.
  2. For directly configuring the specific file currently being worked on, typically by cloning it from the SessionState and overriding fields as needed.,

Yes I agree this is confusing

In other words, it seems like we actually need two distinct objects for these two different use cases:

  • One for storing global TableOptions configuration across the session, applicable to all file types.
  • Another for holding format-specific options derived from the global configuration, intended for the currently active file.
    Another key point is that once you set a specific file format, you don’t need the configuration fields for the others in the existing TableOptions. So for the second use case, the appropriate design might be an enum rather than a struct.

In summary, this PR is more of a code cleanup than a functional change. It aims to clarify the separation of concerns and improve maintainability, in my opinion.

I see. Thank you.

I think we could largely resolve the confusion with some better documentation (perhaps documentation that explained the differences clearly so that when someone stumbled on the very similarly named TableOptions and TableFormatOptions there would be some context to help them understand

Here are some other ideas:

Different Names

Use names that make the difference between the structs clearer.

Perhaps something like the following:

  • DefaultTableOptions for the structure on ConfigOptions (currently called TableOptions)
  • ResolvedTableOptions or SpecificTableOptions for configuring a file specific option (called TableFormatOptions in this PR)

Use the inner structures directly

Once we know the format, could we potentially then change the code to use the ParquetOptions or CsvOptions directly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants