Skip to content

[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110

@austintlee

Description

@austintlee

OpenSearchReader (OSR) is the current default Sycamore reader implementation for reading OpenSearch data into DocSets. It uses Point-in-Time (PIT) snapshots and “slices” to achieve parallelism when reading from OpenSearch (https://opensearch.org/docs/latest/search-plugins/searching-data/point-in-time/).

OpenSearchDatasource (OSD) is an implementation of the Datasource “interface” as defined by Ray’s data API. Ray comes with readers and writers for a number of data sources and sinks and provides examples for writing your own reader or writer (https://docs.ray.io/en/latest/data/custom-datasource-example.html). The latest implementation of OSD is checked into a branch:

https://github.com/aryn-ai/sycamore/blob/opensearch-datasource/lib/sycamore/sycamore/connectors/opensearch/opensearch_datasource.py

Both of the implementations parallelize reads by creating a read task per slice or per parent doc (for document reconstruct). OSR does this by using a flat_map operator where each row corresponds to a slice (each row maps to all matching docs in a slice) and relies on Ray to spread the read tasks across available resources (workers).

OSD explicitly creates ReadTasks which are picked up by available workers. Each read task works on a slice or a parent doc.

When I compare the performance between OSR and OSD, we seem to be losing a lot of time converting between Pandas DataFrames and Sycamore’s DocSets. Overall, OSR outperforms OSD when it is not forced to prepare data as a PyArrow Table or a Pandas DataFrame.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions