Safe handling of invalid timestamps #517

@Anjum48

In many Microsoft apps, the equivalent of a null timestamp is 1601/01/01 00:00:00. These values sometimes appear when a file has been recovered or something went wrong while saving.

This causes an OSError when trying to convert to a Unix timestamp: datetime(1601, 1, 1, 0, 0, 0).timestamp()
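A minimal reproduction sketch. Whether the call actually raises depends on the platform, so the result is printed rather than assumed. The helper name try_timestamp is hypothetical and exists only for this illustration.

```python
from datetime import datetime

# The "null" sentinel value described above.
sentinel = datetime(1601, 1, 1, 0, 0, 0)


def try_timestamp(dt):
    """Return the POSIX timestamp, or the exception class name if conversion fails.

    Hypothetical helper for illustration only; not part of unstructured-ingest.
    """
    try:
        return dt.timestamp()
    except (OSError, ValueError, OverflowError) as exc:
        return type(exc).__name__


# On platforms where the conversion fails this prints the exception name
# (for example "OSError"); elsewhere it prints a negative float.
result = try_timestamp(sentinel)
print(result)
```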

Suggested fix:

In https://github.com/Unstructured-IO/unstructured-ingest/blob/main/unstructured_ingest/utils/string_and_date_utils.py add:

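from datetime import datetime
from typing import Optional


def safe_timestamp(dt: Optional[datetime]) -> Optional[str]:
    """
    Converts a datetime object to a string representation of its timestamp.
    Returns None if dt is None or if the conversion raises.
    """
    if dt is None:
        # The existing call sites pass None when a date is missing.
        return None
    try:
        return str(dt.timestamp())
    except (OSError, ValueError, OverflowError):
        # e.g. the 1601-01-01 sentinel on platforms that reject it
        return None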
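A short sketch of how the helper behaves, assuming a variant that also tolerates None (the current OneDrive code passes None when a date is missing). Because the 1601 sentinel only fails on some platforms, the failure path is exercised with a stand-in class, FailingDatetime, which is hypothetical and exists only for this example.

```python
from datetime import datetime, timezone
from typing import Optional


def safe_timestamp(dt: Optional[datetime]) -> Optional[str]:
    """Return str(dt.timestamp()), or None if dt is None or conversion fails."""
    if dt is None:
        return None
    try:
        return str(dt.timestamp())
    except (OSError, ValueError, OverflowError):
        return None


class FailingDatetime(datetime):
    """Hypothetical stand-in that always fails, mimicking the reported OSError."""

    def timestamp(self) -> float:
        raise OSError(22, "Invalid argument")


# A valid, timezone-aware datetime converts normally.
ok = safe_timestamp(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(ok)  # 1704067200.0

# A missing date stays None instead of raising AttributeError.
missing = safe_timestamp(None)

# A datetime whose conversion raises is mapped to None.
failed = safe_timestamp(FailingDatetime(1601, 1, 1))

# The real sentinel yields either None or a string, depending on the platform.
sentinel = safe_timestamp(datetime(1601, 1, 1))
```

Returning None keeps date_modified and date_created consistent with how missing dates are already represented, so downstream consumers do not need a separate code path for invalid timestamps.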

In the OneDrive ingester:

        return FileData(
            identifier=drive_item.id,
            connector_type=self.connector_type,
            source_identifiers=SourceIdentifiers(
                fullpath=server_path, filename=drive_item.name, rel_path=rel_path
            ),
            metadata=FileDataSourceMetadata(
                url=drive_item.parent_reference.path + "/" + drive_item.name,
                version=drive_item.etag,
                date_modified=str(date_modified_dt.timestamp()) if date_modified_dt else None,
                date_created=str(date_created_at.timestamp()) if date_created_at else None,
                date_processed=str(time()),
                record_locator={
                    "user_pname": self.connection_config.user_pname,
                    "server_relative_path": server_path,
                },
            ),
            additional_metadata=self.get_properties_sync(drive_item=drive_item),
        )

use safe_timestamp() instead:

        return FileData(
            identifier=drive_item.id,
            connector_type=self.connector_type,
            source_identifiers=SourceIdentifiers(
                fullpath=server_path, filename=drive_item.name, rel_path=rel_path
            ),
            metadata=FileDataSourceMetadata(
                url=drive_item.parent_reference.path + "/" + drive_item.name,
                version=drive_item.etag,
                date_modified=safe_timestamp(date_modified_dt),
                date_created=safe_timestamp(date_created_at),
                date_processed=str(time()),
                record_locator={
                    "user_pname": self.connection_config.user_pname,
                    "server_relative_path": server_path,
                },
            ),
            additional_metadata=self.get_properties_sync(drive_item=drive_item),
        )

Happy to open a PR if needed.
