-
Notifications
You must be signed in to change notification settings - Fork 610
Feat: Enhance CSV parser for robust timestamp handling #3463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Feat: Enhance CSV parser for robust timestamp handling #3463
Conversation
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request enhances the CSV parser by adding support for microsecond epoch timestamps, flexible timestamp handling, robust mixed-precision timestamp parsing, and guaranteed timestamp consistency. The code changes include modifications to the read_and_validate_csv function to handle different timestamp formats and the addition of a new helper function _convert_timestamp_to_datetime. The code review identified opportunities to improve type hinting and avoid unnecessary type conversions.
@jaegeral can you please fix the merge conflicts? |
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
logger.warning( | ||
"{} rows skipped since they were missing datetime field " | ||
"or it was empty ".format(len(skipped_rows)) | ||
"Chunk %d skipped because it is missing a datetime field.", idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How useful is this error information? idx
- does this value represent the line in the CSV so people can go back and check?
Feat: Enhance CSV parser for robust timestamp handling
This pull request introduces a series of significant improvements to the CSV importer, making it more robust, flexible, and resilient to a wider variety of timestamp formats. The changes bring its capabilities in line with the JSONL parser and fix several underlying bugs.
Key Changes and Improvements
Support for Microsecond Epoch Timestamps
datetime
column contains microsecond-precision epoch timestamps and parses them correctly.Flexible Timestamp Handling (Parity with JSONL)
datetime
column, unlike the JSONL importer which could fall back to atimestamp
column.timestamp
column to be used in place of adatetime
column. The parser will automatically generate thedatetime
field from the numeric timestamp.Robust Mixed-Precision Timestamp Parsing
_convert_timestamp_to_datetime
, infers the correct unit for each timestamp individually based on its magnitude, allowing for mixed precisions in the same file.Guaranteed Timestamp Consistency
datetime
andtimestamp
columns, inconsistencies could arise if thetimestamp
was in an unexpected unit.timestamp
field (in microsecond epoch format) is now always recalculated from the parseddatetime
field, makingdatetime
the single source of truth for time.Code Quality and Testing
_convert_timestamp_to_datetime
) for improved clarity and maintainability.read_and_validate_csv
has been significantly expanded to detail its new capabilities and arguments.timestamp
column with mixed precisions (s, ms, µs) instead of adatetime
column._convert_timestamp_to_datetime
helper function to verify its precision-handling logic.These changes collectively make the CSV import process more powerful and user-friendly.
Chained branch after #3462 is merged