Fix: Correctly parse microsecond epoch timestamps in CSV files #3462


Merged
merged 3 commits into google:master on Jul 10, 2025

Conversation

@jaegeral (Collaborator) commented Jul 9, 2025


This pull request resolves an issue where the CSV parser would misinterpret epoch timestamps provided in microsecond precision, leading to incorrect date ingestion.

The Problem

When a CSV file is uploaded with a datetime column containing a large integer (e.g., 1723474994349345), the Pandas parser defaults to interpreting it as nanoseconds. This causes timestamps that are actually in microseconds to be parsed as dates in the early 1970s, corrupting the event data upon import.
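To illustrate, the same integer yields wildly different dates depending on the unit pandas assumes (a minimal reproduction of the bug, not Timesketch code):

```python
# Minimal reproduction: pandas treats a bare integer as nanoseconds
# since the Unix epoch by default, so a microsecond epoch lands in 1970.
import pandas as pd

ts = 1723474994349345  # microseconds since epoch (mid-2024)

as_ns = pd.to_datetime(ts)             # default unit: nanoseconds
as_us = pd.to_datetime(ts, unit="us")  # correct unit: microseconds

print(as_ns.year)  # 1970 -- the corrupted date
print(as_us.year)  # 2024 -- the intended date
```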

The Solution

The read_and_validate_csv function in timesketch/lib/utils.py has been updated to intelligently handle this scenario. The new logic performs a check on numeric datetime columns:

  1. It uses a heuristic to determine if the timestamp is likely in microseconds (i.e., if the number is greater than 1e15).
  2. If it is, pandas.to_datetime is called with unit='us' to ensure correct conversion.
  3. If not, it falls back to the existing general-purpose date string parser.

This ensures that both standard date strings and high-precision epoch timestamps are handled correctly.
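The steps above can be sketched as follows (a hedged sketch of the described heuristic; the function and constant names here are illustrative, not the exact Timesketch implementation):

```python
# Sketch of the heuristic: numeric datetime values above 1e15 are treated
# as microsecond epochs; everything else goes through the general-purpose
# mixed-format parser. Names are illustrative, not the real utils.py code.
import pandas as pd

MICROSECOND_THRESHOLD = 1e15  # epoch values above this are likely microseconds

def parse_datetime_column(series: pd.Series) -> pd.Series:
    """Parse a CSV 'datetime' column that may hold epoch ints or date strings."""
    if pd.api.types.is_numeric_dtype(series) and (series > MICROSECOND_THRESHOLD).any():
        # Large integers: interpret as microseconds since the Unix epoch.
        return pd.to_datetime(series, unit="us")
    # Fallback: the existing general-purpose parser (format='mixed').
    return pd.to_datetime(series, format="mixed")
```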

Testing

To verify this fix and prevent regressions, a new unit test, test_csv_with_timestamp_in_datetime_field, has been added to timesketch/lib/utils_test.py. This test uses a new data file, tests/test_events/csv_timestamp_as_datetime.csv, which contains the problematic microsecond timestamps, and asserts that they are parsed into the correct ISO 8601 datetime strings.
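As a rough, self-contained sketch of what the regression check exercises (the inline CSV and column layout below are illustrative; the real test reads tests/test_events/csv_timestamp_as_datetime.csv through read_and_validate_csv):

```python
# Self-contained sketch: parse a microsecond epoch from an in-memory CSV
# and confirm it renders as a 2024 ISO 8601 string rather than a 1970 one.
import io
import pandas as pd

CSV = (
    "message,timestamp,datetime,timestamp_desc\n"
    "Test event,1723474994349345,1723474994349345,Time Logged\n"
)

df = pd.read_csv(io.StringIO(CSV))
df["datetime"] = pd.to_datetime(df["datetime"], unit="us")
iso = df["datetime"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f")
print(iso.iloc[0])  # a 2024-08-12 timestamp, not early 1970
```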

@jaegeral jaegeral requested a review from jkppr July 9, 2025 11:07
@jaegeral jaegeral self-assigned this Jul 9, 2025
@jaegeral jaegeral added the "Data import" label (All things that are with importing data) Jul 9, 2025
@jaegeral (Collaborator, Author) commented Jul 9, 2025:

/gemini summary

@jaegeral (Collaborator, Author) commented Jul 9, 2025:

/gemini review

@gemini-code-assist bot (Contributor) commented:

Summary of Changes

This pull request delivers a crucial fix to the CSV ingestion pipeline, enabling the correct parsing of high-precision microsecond epoch timestamps. By intelligently adapting the Pandas conversion unit based on the magnitude of the timestamp, it prevents data corruption and ensures accurate event data ingestion. The change is thoroughly validated with a new, specific unit test.

Highlights

  • Bug Fix: Microsecond Timestamp Parsing: This pull request resolves a critical issue where the CSV parser incorrectly interpreted microsecond epoch timestamps in the datetime column as nanoseconds, leading to corrupted date ingestion (e.g., dates appearing in the early 1970s).
  • Enhanced CSV Parsing Logic: The read_and_validate_csv function in timesketch/lib/utils.py has been updated. It now includes a heuristic to detect numeric datetime column values greater than 1e15, which are then explicitly parsed as microseconds using pandas.to_datetime(unit='us'). Otherwise, it falls back to general date string parsing.
  • New Unit Test for Validation: A new unit test, test_csv_with_timestamp_in_datetime_field, has been added to timesketch/lib/utils_test.py. This test uses a dedicated CSV file (tests/test_events/csv_timestamp_as_datetime.csv) to ensure that microsecond epoch timestamps are correctly converted to ISO 8601 format, preventing regressions.
Changelog
  • tests/test_events/csv_timestamp_as_datetime.csv
    • Added a new test data file containing sample CSV entries with microsecond epoch timestamps in the 'datetime' column. This file is used by the new unit test to validate the parsing fix.
  • timesketch/lib/utils.py
    • Modified the read_and_validate_csv function (lines 307-325) to introduce a conditional logic for the 'datetime' column. If the column is numeric and contains values greater than 1e15, pandas.to_datetime is now called with unit='us' to correctly interpret microsecond epoch timestamps. Otherwise, it defaults to format='mixed'.
  • timesketch/lib/utils_test.py
    • Added a new unit test test_csv_with_timestamp_in_datetime_field (lines 326-360). This test verifies that the read_and_validate_csv function correctly processes CSV files where the 'datetime' field contains microsecond epoch timestamps, asserting that the output matches the expected ISO 8601 datetime strings.
Activity
  • The pull request was opened by @jaegeral.
  • There has been no human review activity or comments from other users on this pull request yet.

@gemini-code-assist bot (Contributor) left a comment:

Code Review

The pull request introduces a fix to correctly parse microsecond epoch timestamps in CSV files. The code changes include a heuristic to determine if a timestamp is likely in microseconds and then uses pandas.to_datetime with unit='us' to ensure correct conversion. A new unit test has been added to verify the fix and prevent regressions. There is a suggestion to refactor the code to avoid redundant checks and to add a log message to indicate when the microsecond conversion is applied.

@jkppr jkppr merged commit a74606e into google:master Jul 10, 2025
13 checks passed
Labels
Data import All things that are with importing data
2 participants