Fix: Correctly parse microsecond epoch timestamps in CSV files #3462


Merged
merged 3 commits into google:master on Jul 10, 2025

Conversation

@jaegeral (Collaborator) commented Jul 9, 2025


This pull request resolves an issue where the CSV parser would misinterpret epoch timestamps provided in microsecond precision, leading to incorrect date ingestion.

The Problem

When a CSV file is uploaded with a datetime column containing a large integer (e.g., 1723474994349345), the Pandas parser defaults to interpreting it as nanoseconds. This causes timestamps that are actually in microseconds to be parsed as dates in the early 1970s, corrupting the event data upon import.
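To illustrate, the same integer yields wildly different dates depending on the unit pandas assumes (a minimal reproduction of the bug, not Timesketch code):

```python
# Minimal reproduction: pandas treats a bare integer as nanoseconds
# since the Unix epoch by default, so a microsecond epoch lands in 1970.
import pandas as pd

ts = 1723474994349345  # microseconds since epoch (mid-2024)

as_ns = pd.to_datetime(ts)             # default unit: nanoseconds
as_us = pd.to_datetime(ts, unit="us")  # correct unit: microseconds

print(as_ns.year)  # 1970 -- the corrupted date
print(as_us.year)  # 2024 -- the intended date
```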

The Solution

The read_and_validate_csv function in timesketch/lib/utils.py has been updated to intelligently handle this scenario. The new logic performs a check on numeric datetime columns:

  1. It uses a heuristic to determine if the timestamp is likely in microseconds (i.e., if the number is greater than 1e15).
  2. If it is, pandas.to_datetime is called with unit='us' to ensure correct conversion.
  3. If not, it falls back to the existing general-purpose date string parser.

This ensures that both standard date strings and high-precision epoch timestamps are handled correctly.
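The steps above can be sketched as follows (a hedged sketch of the described heuristic; the function and constant names here are illustrative, not the exact Timesketch implementation):

```python
# Sketch of the heuristic: numeric datetime values above 1e15 are treated
# as microsecond epochs; everything else goes through the general-purpose
# mixed-format parser. Names are illustrative, not the real utils.py code.
import pandas as pd

MICROSECOND_THRESHOLD = 1e15  # epoch values above this are likely microseconds

def parse_datetime_column(series: pd.Series) -> pd.Series:
    """Parse a CSV 'datetime' column that may hold epoch ints or date strings."""
    if pd.api.types.is_numeric_dtype(series) and (series > MICROSECOND_THRESHOLD).any():
        # Large integers: interpret as microseconds since the Unix epoch.
        return pd.to_datetime(series, unit="us")
    # Fallback: the existing general-purpose parser (format='mixed').
    return pd.to_datetime(series, format="mixed")
```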

Testing

To verify this fix and prevent regressions, a new unit test, test_csv_with_timestamp_in_datetime_field, has been added to timesketch/lib/utils_test.py. This test uses a new data file, tests/test_events/csv_timestamp_as_datetime.csv, which contains the problematic microsecond timestamps, and asserts that they are parsed into the correct ISO 8601 datetime strings.
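As a rough, self-contained sketch of what the regression check exercises (the inline CSV and column layout below are illustrative; the real test reads tests/test_events/csv_timestamp_as_datetime.csv through read_and_validate_csv):

```python
# Self-contained sketch: parse a microsecond epoch from an in-memory CSV
# and confirm it renders as a 2024 ISO 8601 string rather than a 1970 one.
import io
import pandas as pd

CSV = (
    "message,timestamp,datetime,timestamp_desc\n"
    "Test event,1723474994349345,1723474994349345,Time Logged\n"
)

df = pd.read_csv(io.StringIO(CSV))
df["datetime"] = pd.to_datetime(df["datetime"], unit="us")
iso = df["datetime"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f")
print(iso.iloc[0])  # a 2024-08-12 timestamp, not early 1970
```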

@jaegeral jaegeral requested a review from jkppr July 9, 2025 11:07
@jaegeral jaegeral self-assigned this Jul 9, 2025
@jaegeral jaegeral added the "Data import" label (All things that are with importing data) Jul 9, 2025
@jaegeral (Collaborator, Author) commented Jul 9, 2025:

/gemini summary

@jaegeral (Collaborator, Author) commented Jul 9, 2025:

/gemini review

@gemini-code-assist bot (Contributor) commented:

Summary of Changes

This pull request delivers a crucial fix to the CSV ingestion pipeline, enabling the correct parsing of high-precision microsecond epoch timestamps. By intelligently adapting the Pandas conversion unit based on the magnitude of the timestamp, it prevents data corruption and ensures accurate event data ingestion. The change is thoroughly validated with a new, specific unit test.

Highlights

  • Bug Fix: Microsecond Timestamp Parsing: This pull request resolves a critical issue where the CSV parser incorrectly interpreted microsecond epoch timestamps in the datetime column as nanoseconds, leading to corrupted date ingestion (e.g., dates appearing in the early 1970s).
  • Enhanced CSV Parsing Logic: The read_and_validate_csv function in timesketch/lib/utils.py has been updated. It now includes a heuristic to detect numeric datetime column values greater than 1e15, which are then explicitly parsed as microseconds using pandas.to_datetime(unit='us'). Otherwise, it falls back to general date string parsing.
  • New Unit Test for Validation: A new unit test, test_csv_with_timestamp_in_datetime_field, has been added to timesketch/lib/utils_test.py. This test uses a dedicated CSV file (tests/test_events/csv_timestamp_as_datetime.csv) to ensure that microsecond epoch timestamps are correctly converted to ISO 8601 format, preventing regressions.
Changelog
  • tests/test_events/csv_timestamp_as_datetime.csv
    • Added a new test data file containing sample CSV entries with microsecond epoch timestamps in the 'datetime' column. This file is used by the new unit test to validate the parsing fix.
  • timesketch/lib/utils.py
    • Modified the read_and_validate_csv function (lines 307-325) to introduce a conditional logic for the 'datetime' column. If the column is numeric and contains values greater than 1e15, pandas.to_datetime is now called with unit='us' to correctly interpret microsecond epoch timestamps. Otherwise, it defaults to format='mixed'.
  • timesketch/lib/utils_test.py
    • Added a new unit test test_csv_with_timestamp_in_datetime_field (lines 326-360). This test verifies that the read_and_validate_csv function correctly processes CSV files where the 'datetime' field contains microsecond epoch timestamps, asserting that the output matches the expected ISO 8601 datetime strings.
Activity
  • The pull request was opened by @jaegeral.
  • There has been no human review activity or comments from other users on this pull request yet.

@gemini-code-assist bot (Contributor) left a comment:

Code Review

The pull request introduces a fix to correctly parse microsecond epoch timestamps in CSV files. The code changes include a heuristic to determine if a timestamp is likely in microseconds and then uses pandas.to_datetime with unit='us' to ensure correct conversion. A new unit test has been added to verify the fix and prevent regressions. There is a suggestion to refactor the code to avoid redundant checks and to add a log message to indicate when the microsecond conversion is applied.

@jkppr jkppr merged commit a74606e into google:master Jul 10, 2025
13 checks passed
Labels
Data import All things that are with importing data
2 participants