-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Set Formatted TableOptions Enum #16166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
separate FileSpecificTableOptions as enum
…hore/table-options-enum
# Conflicts: # datafusion/core/src/datasource/file_format/arrow.rs # datafusion/datasource/src/file_format.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've worked with @mertak-synnada closely here, and the final state makes sense to me. If there are any points that we miss why TableOptions is used such, please let me know. Otherwise, I'll wait for a few days and merge this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @berkaysynnada and @mertak-synnada
I am a little confused about the new structure and exactly what problem is being solved with this change.
The PR description says
In the current implementation, the TableOptions struct stores information for CSV, JSON, and Parquet file types, and its behavior changes depending on whether the file format is set before or after.
However, I didn't see any tests or examples that show a different behavior after this PR.
In my initial quick read through it I would say it has made the APIs harder to use (as now there are format specific options on TableOptions
AND a new TableFormatOptions
). Is there any way to consolidate the number of structures? If not, I think we should clearly document the difference between the two structures
/// Initializes a new `TableOptions` from a hash map of string settings. | ||
#[derive(Debug, Clone)] | ||
#[allow(clippy::large_enum_variant)] | ||
pub enum TableFormatOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a new pub struct, can we please add documentation explaining
- what it is used for
- How it is different than
TableOptions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will resolve these if I can relieve your concerns :)
Ok(self.default_from_output_format(options)) | ||
} | ||
|
||
fn default_from_output_format(&self, options: OutputFormat) -> Arc<dyn FileFormat>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please document these new public APIs?
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt + fmt::Debug { | |||
&self, | |||
state: &dyn Session, | |||
format_options: &HashMap<String, String>, | |||
) -> Result<Arc<dyn FileFormat>>; | |||
) -> Result<Arc<dyn FileFormat>> { | |||
let options = match self.options() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this method is part of a trait, I think it can be overridden by other implementations. Is that the intention?
In other words, do you expect anyone implementing FileFormatFactory to implement create()
or should they now implement default_from_output_format
and options
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They should implement default_from_output_format
and options
. This change is not directly related to the actual changes of the PR, but all create() methods show a similar pattern, so I remember we thought like why not avoid duplication
|
||
fn default_from_output_format(&self, options: OutputFormat) -> Arc<dyn FileFormat>; | ||
|
||
fn options(&self) -> (Option<OutputFormat>, ConfigFileType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it strange that a FileFormatFactory returns an Option<OutputFormat>
and then another method takes the output format as input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this confuses me now too. Having ConfigFileType and OutputFormat at the same time is also a bit weird. I'll take a second look at this part
} | ||
} | ||
|
||
/// Sets a configuration value for a specific key within `TableOptions`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is in TableFormatOptions
I think
} | ||
} | ||
|
||
pub fn csv_options_or_default(&self) -> CsvOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, these new public methods should have documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading this some more, I don't undertand the usecase for csv_options_or_default
(and the other similar methods)
Since this is now an enum, the calling code should directly handle the format in the TableFormatOptions -- this code will silently drop any config options if it isn't CSV which seems like it would potentially mask a bug in the caller
I would expect calling code to do something like
match TableFormatOptions {
Self::Csv => {}
Self::Json => {}
Self::Parquet => {{
}
The confusion we are trying to address is that the TableOptions struct is currently being used in two different ways:
In other words, it seems like we actually need two distinct objects for these two different use cases:
In summary, this PR is more of a code cleanup than a functional change. It aims to clarify the separation of concerns and improve maintainability, in my opinion. |
I'd like to go over this PR one more time and dropping the approve to not cause someone merge this prematurely
/// # Returns | ||
/// | ||
/// A result indicating success or failure in applying the settings. | ||
pub fn alter_with_string_hash_map( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe update
would be clearer as alter
might imply some change other than updating the map
} | ||
} | ||
|
||
pub fn csv_options_or_default(&self) -> CsvOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading this some more, I don't undertand the usecase for csv_options_or_default
(and the other similar methods)
Since this is now an enum, the calling code should directly handle the format in the TableFormatOptions -- this code will silently drop any config options if it isn't CSV which seems like it would potentially mask a bug in the caller
I would expect calling code to do something like
match TableFormatOptions {
Self::Csv => {}
Self::Json => {}
Self::Parquet => {{
}
Yes I agree this is confusing
I see. Thank you. I think we could largely resolve the confusion with some better documentation (perhaps documentation that explained the differences clearly so that when someone stumbled on the very similarly named Here are some other ideas: Different NamesUse names that make the difference between the structs clearer. Perhaps something like the following:
Use the inner structures directlyOnce we know the format, could we potentially then change the code to use the |
Which issue does this PR close?
In the current implementation, the TableOptions struct stores information for CSV, JSON, and Parquet file types, and its behavior changes depending on whether the file format is set before or after. This PR introduces a new struct for file-type-specific cases, while keeping TableOptions as the default state, since it is also used in the session before a file type is specified.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?