Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1558347: Fix PARSE_HEADER bug in session.read.csv #2028

Merged
merged 5 commits into from
Aug 16, 2024

Conversation

sfc-gh-jrose
Copy link
Contributor

@sfc-gh-jrose sfc-gh-jrose commented Aug 3, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1558347

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

    Previously if a user specified PARSE_HEADER = True in an externally defined file format it would not be able to be used with session.read.csv without throwing an exception. This exception was caused due to how we infer csv schemas in the client. The client separates out csv loading into a schema inference step and a query step. Parsing the header is valid in the inference step, but should be skipped in the query step.

This change modifies the logic to the read step so that a temporary file format is always created. The external file format is queried and merged with the local options giving the local options precedence. The local options are then modified as needed such as for the parse header issue above.

I've added a test that provides a scenario that would have previously failed.

These parts may be contentious:

  • create_file_format_statement is changed to only set the format if it wasn't specified in the options
  • _merge_file_format_options does some parsing of the options itself. I didn't see any other logic that currently handles that format. It should work for all currently existing formats, but it could break for future options if their formatting is significantly different.
  • The new logic for read_file prefers the client specified options over the file_format options whereas previously the client format options would have been ignored. This is trivial to change, but I think the new style should make more sense to the user.

}
new_options = {**file_format_options}
file_format = self.session.sql(
f"DESCRIBE FILE FORMAT {options['FORMAT_NAME']}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may be careful about eagerly triggering such a query 1) it will be visible to users 2) maybe add a try/catch because it's still possible that it will fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I think I can do this better in a way that maintains lazy evaluation. I'll refactor a bit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a jira and add a todo here for doing it lazily in the future?

Copy link

github-actions bot commented Aug 8, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

1 similar comment
Copy link

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jrose sfc-gh-jrose requested a review from a team August 12, 2024 19:17
@sfc-gh-jrose sfc-gh-jrose force-pushed the jrose_snow_1558347_parse_header_bug branch from 1ee6a80 to ef02131 Compare August 13, 2024 19:03
@sfc-gh-jrose sfc-gh-jrose requested a review from a team as a code owner August 13, 2024 19:03
@sfc-gh-jrose sfc-gh-jrose force-pushed the jrose_snow_1558347_parse_header_bug branch from ef02131 to 1032856 Compare August 13, 2024 19:05
@sfc-gh-mvashishtha
Copy link
Contributor

@sfc-gh-jrose are there any changes for Snowpark pandas reviewers to review here?

@sfc-gh-jrose
Copy link
Contributor Author

@sfc-gh-jrose are there any changes for Snowpark pandas reviewers to review here?

No. I merged main incorrectly earlier which pulled in a snowpark pandas change. I've since removed that commit, but git still pulled in the group automatically.

Copy link

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jrose sfc-gh-jrose merged commit 1a66d79 into main Aug 16, 2024
34 checks passed
@sfc-gh-jrose sfc-gh-jrose deleted the jrose_snow_1558347_parse_header_bug branch August 16, 2024 15:56
@github-actions github-actions bot locked and limited conversation to collaborators Aug 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants