Feat: Enhance CSV parser for robust timestamp handling #3463

Open
wants to merge 5 commits into
base: master
from

Conversation

jaegeral
Collaborator

@jaegeral jaegeral commented Jul 9, 2025

Feat: Enhance CSV parser for robust timestamp handling

This pull request improves how the CSV importer handles timestamps so that it accepts a wider variety of timestamp formats. The changes bring its capabilities in line with the JSONL parser and fix several underlying bugs.

Key Changes and Improvements

  1. Support for Microsecond Epoch Timestamps

    • Solution: The parser now uses a heuristic to detect if a numeric datetime column contains microsecond-precision epoch timestamps and parses them correctly.
  2. Flexible Timestamp Handling (Parity with JSONL)

    • Problem: The CSV importer strictly required a datetime column, unlike the JSONL importer which could fall back to a timestamp column.
    • Solution: The CSV header validation now allows a timestamp column to be used in place of a datetime column. The parser will automatically generate the datetime field from the numeric timestamp.
  3. Robust Mixed-Precision Timestamp Parsing

    • Problem: The initial implementation for handling numeric timestamps used a chunk-based median to guess the time unit, which failed when a file contained a mix of seconds, milliseconds, and microseconds.
    • Solution: The logic has been refactored to operate on a per-row basis. A new helper function, _convert_timestamp_to_datetime, infers the correct unit for each timestamp individually based on its magnitude, allowing for mixed precisions in the same file.
  4. Guaranteed Timestamp Consistency

    • Problem: If a user provided both datetime and timestamp columns, inconsistencies could arise if the timestamp was in an unexpected unit.
    • Solution: To ensure data integrity, the final timestamp field (in microsecond epoch format) is now always recalculated from the parsed datetime field, making datetime the single source of truth for time.
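
The per-row unit inference (item 3) and the recalculation of timestamp from datetime (item 4) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `infer_epoch_datetime` and `to_microsecond_epoch` and the two magnitude cut-offs are assumptions chosen for the example, and the actual `_convert_timestamp_to_datetime` helper may use different thresholds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical magnitude cut-offs (assumptions, not taken from the PR):
# values below 1e11 are read as seconds, values below 1e14 as
# milliseconds, and anything larger as microseconds.
SECONDS_LIMIT = 10**11
MILLISECONDS_LIMIT = 10**14


def infer_epoch_datetime(value):
    """Convert one integer epoch value to an aware UTC datetime.

    The unit is guessed from the magnitude of this single value, so
    rows with different precisions can live in the same file.
    """
    number = int(value)
    magnitude = abs(number)
    if magnitude < SECONDS_LIMIT:
        microseconds = number * 1_000_000
    elif magnitude < MILLISECONDS_LIMIT:
        microseconds = number * 1_000
    else:
        microseconds = number
    return EPOCH + timedelta(microseconds=microseconds)


def to_microsecond_epoch(moment):
    """Recompute the microsecond epoch value from a parsed datetime."""
    return (moment - EPOCH) // timedelta(microseconds=1)


# The same instant written in seconds, milliseconds and microseconds.
mixed = [1752065640, 1752065640000, 1752065640000000]
parsed = [infer_epoch_datetime(v) for v in mixed]
normalized = [to_microsecond_epoch(p) for p in parsed]
```

With these cut-offs, any magnitude-based guess has blind spots: a millisecond value from before early 1973 is small enough to be read as seconds. Deriving the stored timestamp from the parsed datetime, as item 4 describes, at least keeps the two fields consistent with each other.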

Code Quality and Testing

  • Refactoring: The per-row timestamp conversion logic was extracted into a dedicated, well-documented helper method (_convert_timestamp_to_datetime) for improved clarity and maintainability.
  • Docstrings: The docstring for read_and_validate_csv has been significantly expanded to detail its new capabilities and arguments.
  • Unit Tests: To validate these changes and prevent regressions, three new unit tests have been added:
    • A test for parsing CSVs with microsecond epoch timestamps.
    • A test for parsing CSVs that use a timestamp column with mixed precisions (s, ms, µs) instead of a datetime column.
    • A dedicated test for the _convert_timestamp_to_datetime helper function to verify its precision-handling logic.
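
As a rough illustration of the input the second test targets, the sketch below checks the header of a CSV that has a timestamp column but no datetime column. It uses only the standard library; `check_header`, the `REQUIRED_FIELDS` set and the sample rows are hypothetical and are not taken from the PR or its test suite.

```python
import csv
import io

# Assumed mandatory columns besides the time column (not stated in this PR).
REQUIRED_FIELDS = {"message", "timestamp_desc"}
TIME_FIELDS = ("datetime", "timestamp")


def check_header(fieldnames):
    """Return the name of the time column to parse, or raise ValueError."""
    present = set(fieldnames or [])
    missing = REQUIRED_FIELDS - present
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    # Prefer datetime; fall back to timestamp when datetime is absent.
    for name in TIME_FIELDS:
        if name in present:
            return name
    raise ValueError("Need a datetime or a timestamp column")


# Sample file with seconds, milliseconds and microseconds in one column.
sample = io.StringIO(
    "message,timestamp,timestamp_desc\n"
    "login,1752065640,Event Logged\n"
    "logout,1752065640000,Event Logged\n"
    "reboot,1752065640000000,Event Logged\n"
)
reader = csv.DictReader(sample)
time_column = check_header(reader.fieldnames)
rows = list(reader)
```

A real test would go on to assert that every row yields the same datetime after conversion; that step is left out here to keep the sketch self-contained.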

These changes collectively make the CSV import process more powerful and user-friendly.

This is a chained branch, to be merged after #3462 is merged.

@jaegeral jaegeral requested a review from jkppr July 9, 2025 12:54
@jaegeral jaegeral self-assigned this Jul 9, 2025
@jaegeral jaegeral added the "Data import" label (All things that are with importing data) Jul 9, 2025
@jaegeral
Collaborator Author

jaegeral commented Jul 9, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request enhances the CSV parser by adding support for microsecond epoch timestamps, flexible timestamp handling, robust mixed-precision timestamp parsing, and guaranteed timestamp consistency. The code changes include modifications to the read_and_validate_csv function to handle different timestamp formats and the addition of a new helper function _convert_timestamp_to_datetime. The code review identified opportunities to improve type hinting and avoid unnecessary type conversions.

@jkppr
Collaborator

jkppr commented Jul 10, 2025

@jaegeral can you please fix the merge conflicts?

jaegeral and others added 2 commits July 10, 2025 21:58
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  logger.warning(
-     "{} rows skipped since they were missing datetime field "
-     "or it was empty ".format(len(skipped_rows))
+     "Chunk %d skipped because it is missing a datetime field.", idx
Collaborator

How useful is this error information? Does idx represent the line number in the CSV, so that people can go back and check?
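
The thread does not settle what idx refers to. If it is a zero-based chunk index and the file is read in fixed-size chunks (both assumptions), it could be translated into a range of CSV line numbers along these lines. The function name `chunk_line_range` and its parameters are hypothetical.

```python
def chunk_line_range(idx, chunk_size, header_lines=1):
    """Map a zero-based chunk index to 1-based (first, last) line numbers.

    Assumes every data row occupies exactly one physical line and that
    the file starts with `header_lines` header lines.
    """
    first = header_lines + idx * chunk_size + 1
    last = first + chunk_size - 1
    return first, last


# Line range covered by the third chunk when reading 10,000 rows at a time.
first_line, last_line = chunk_line_range(2, 10_000)
```

The mapping is only approximate in practice: the last chunk may be shorter than chunk_size, and quoted fields that contain newlines make a row span more than one line.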

Labels
Data import All things that are with importing data