Fix: Correctly parse microsecond epoch timestamps in CSV files #3462
Conversation
Summary of Changes

This pull request delivers a fix to the CSV ingestion pipeline, enabling correct parsing of high-precision microsecond epoch timestamps. By adapting the Pandas conversion unit to the magnitude of the timestamp, it prevents data corruption and ensures accurate ingestion of event data. The change is validated by a new, targeted unit test.
Code Review
This pull request fixes the parsing of microsecond epoch timestamps in CSV files. The change adds a heuristic to determine whether a timestamp is likely in microseconds and, if so, calls `pandas.to_datetime` with `unit='us'` for correct conversion. A new unit test verifies the fix and guards against regressions. Suggested improvements: refactor to avoid redundant checks, and add a log message indicating when the microsecond conversion is applied.
Fix: Correctly parse microsecond epoch timestamps in CSV files
This pull request resolves an issue where the CSV parser would misinterpret epoch timestamps provided in microsecond precision, leading to incorrect date ingestion.
The Problem

When a CSV file is uploaded with a `datetime` column containing a large integer (e.g., 1723474994349345), the Pandas parser defaults to interpreting it as nanoseconds. This causes timestamps that are actually in microseconds to be parsed as dates in the early 1970s, corrupting the event data upon import.

The Solution
The `read_and_validate_csv` function in `timesketch/lib/utils.py` has been updated to handle this scenario. The new logic inspects numeric `datetime` columns: when a value's magnitude indicates microsecond rather than nanosecond precision, `pandas.to_datetime` is called with `unit='us'` to ensure correct conversion. This ensures that both standard date strings and high-precision epoch timestamps are handled correctly.
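The approach can be sketched in plain pandas. This is an illustrative sketch, not the actual Timesketch code: the function name and threshold constant below are assumptions, chosen so that date strings and smaller epoch values still go through pandas' normal parsing.

```python
import pandas as pd

# Hypothetical threshold: epoch values this large are assumed to be
# microseconds (a seconds-since-epoch value stays far below 1e14 for
# any realistic date, while microsecond epochs from 2024 are ~1.7e15).
MICROSECONDS_THRESHOLD = 1e14


def parse_datetime_column(series: pd.Series) -> pd.Series:
    """Parse a CSV 'datetime' column, detecting microsecond epochs."""
    if pd.api.types.is_numeric_dtype(series) and (series.abs() > MICROSECONDS_THRESHOLD).all():
        # Large integers are treated as microsecond epoch timestamps;
        # pandas would otherwise default to nanoseconds for integers.
        return pd.to_datetime(series, unit="us")
    # Fall back to pandas' normal parsing for date strings etc.
    return pd.to_datetime(series)
```

With the example value from the description, `pd.to_datetime(1723474994349345)` defaults to nanoseconds and yields a date in early 1970, while the sketch above returns the intended August 2024 timestamp.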
Testing
To verify this fix and prevent regressions, a new unit test, `test_csv_with_timestamp_in_datetime_field`, has been added to `timesketch/lib/utils_test.py`. This test uses a new data file, `tests/test_events/csv_timestamp_as_datetime.csv`, which contains the problematic microsecond timestamps, and asserts that they are parsed into the correct ISO 8601 datetime strings.
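As a self-contained illustration of what such a regression test checks, here is a sketch that substitutes an in-memory CSV for the repository's data file; the CSV contents and exact assertions are assumptions based on the description above, and the real test presumably exercises `read_and_validate_csv` end-to-end rather than calling pandas directly.

```python
import io
import unittest

import pandas as pd


class TestMicrosecondTimestampParsing(unittest.TestCase):
    """Regression-style sketch for microsecond epoch parsing in CSVs."""

    def test_microsecond_epoch_parsed_correctly(self):
        # Stand-in for tests/test_events/csv_timestamp_as_datetime.csv.
        csv_data = io.StringIO("message,datetime\nTest event,1723474994349345\n")
        frame = pd.read_csv(csv_data)
        parsed = pd.to_datetime(frame["datetime"], unit="us")
        # A microsecond epoch from 2024 must not collapse into 1970.
        self.assertEqual(parsed.iloc[0].year, 2024)
        # ISO 8601 rendering should preserve the intended date.
        self.assertTrue(parsed.iloc[0].isoformat().startswith("2024-"))
```

Run with `python -m unittest` to execute the check.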