[OpenSearchReader] Use DocSet instead of Dataset whenever possible

OpenSearchReader (OSR) is the current default Sycamore reader implementation for reading OpenSearch data into DocSets.  It uses Point-in-Time (PIT) snapshots and “slices” to achieve parallelism when reading from OpenSearch (https://opensearch.org/docs/latest/search-plugins/searching-data/point-in-time/).
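As a rough illustration of how slicing splits a PIT read, the sketch below builds one search body per slice. It assumes the documented OpenSearch request shape (a `pit` object with `id` and `keep_alive`, plus a `slice` object with `id` and `max`). The helper name `build_slice_queries` and the PIT id used with it are hypothetical and are not taken from OSR.

```python
from typing import Any, Optional


def build_slice_queries(
    pit_id: str,
    num_slices: int,
    query: Optional[dict[str, Any]] = None,
    keep_alive: str = "1m",
) -> list[dict[str, Any]]:
    """Return one search body per slice of a single PIT snapshot."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    base_query = query if query is not None else {"match_all": {}}
    bodies = []
    for slice_id in range(num_slices):
        body: dict[str, Any] = {
            "query": base_query,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        # A slice clause needs max > 1, so a single-slice read
        # is expressed by leaving the clause out entirely.
        if num_slices > 1:
            body["slice"] = {"id": slice_id, "max": num_slices}
        bodies.append(body)
    return bodies
```

Each body could then be handed to a separate worker. Because every body carries the same PIT id, all slices see the same snapshot of the data.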

OpenSearchDatasource (OSD) implements the Datasource “interface” defined by Ray’s data API.  Ray ships with readers and writers for a number of data sources and sinks, and provides an example of writing your own reader or writer (https://docs.ray.io/en/latest/data/custom-datasource-example.html).  The latest implementation of OSD is checked into a branch:

https://github.com/aryn-ai/sycamore/blob/opensearch-datasource/lib/sycamore/sycamore/connectors/opensearch/opensearch_datasource.py

Both implementations parallelize reads by creating one read task per slice, or per parent doc when reconstructing documents.  OSR does this with a flat_map operator in which each row corresponds to a slice and maps to all matching docs in that slice; it relies on Ray to spread the read tasks across available resources (workers).

OSD explicitly creates ReadTasks which are picked up by available workers.  Each read task works on a slice or a parent doc.
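The two scheduling shapes described above can be sketched in plain Python without Ray. This is only an analogy: `FAKE_SLICES`, `read_slice`, and the other names are hypothetical stand-ins, a thread pool plays the role of Ray workers, and neither function reflects the actual OSR or OSD code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain
from typing import Callable

# Hypothetical stand-in for an index that has already been split into slices.
FAKE_SLICES: dict[int, list[dict]] = {
    0: [{"doc_id": "a"}, {"doc_id": "b"}],
    1: [{"doc_id": "c"}],
    2: [{"doc_id": "d"}, {"doc_id": "e"}],
}


def read_slice(slice_id: int) -> list[dict]:
    """Pretend to fetch every matching document in one slice."""
    return list(FAKE_SLICES[slice_id])


def read_via_flat_map(slice_ids: list[int]) -> list[dict]:
    """OSR-like shape: one row per slice, each row expands to many docs."""
    rows = [{"slice_id": s} for s in slice_ids]
    return list(chain.from_iterable(read_slice(r["slice_id"]) for r in rows))


def make_read_tasks(slice_ids: list[int]) -> list[Callable[[], list[dict]]]:
    """OSD-like shape: one zero-argument task per slice, built up front."""
    # Bind s as a default argument so each task keeps its own slice id.
    return [lambda s=s: read_slice(s) for s in slice_ids]


def run_read_tasks(
    tasks: list[Callable[[], list[dict]]], max_workers: int = 2
) -> list[dict]:
    """Let a small worker pool pick up the tasks; results keep task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        blocks = list(pool.map(lambda task: task(), tasks))
    return list(chain.from_iterable(blocks))
```

Both paths produce the same documents. The difference is who decides the unit of work: the flat_map shape leaves the fan-out to the execution engine, while the explicit-task shape fixes the task list before any worker starts.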

When I compare the performance of OSR and OSD, we seem to lose a lot of time converting between Pandas DataFrames and Sycamore’s DocSets.  Overall, OSR outperforms OSD when it is not forced to prepare data as a PyArrow Table or a Pandas DataFrame, hence the proposal to use DocSet instead of Dataset whenever possible.
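One way to check where the time goes is to time the read and the conversion separately. The harness below is a minimal sketch using only the standard library; `time_read_path` and its parameters are hypothetical, and it makes no claim about what the real numbers look like.

```python
import time
from typing import Any, Callable, Iterable, Optional


def time_read_path(
    read_docs: Callable[[], Iterable[dict[str, Any]]],
    convert: Optional[Callable[[list[dict[str, Any]]], Any]] = None,
) -> dict[str, float]:
    """Time a read, then an optional conversion step, in seconds."""
    start = time.perf_counter()
    docs = list(read_docs())  # force the read to finish before timing convert
    read_done = time.perf_counter()
    if convert is not None:
        convert(docs)
    end = time.perf_counter()
    return {
        "read_s": read_done - start,
        "convert_s": end - read_done,
        "total_s": end - start,
    }
```

With pandas installed, `convert` could be something like `pandas.DataFrame.from_records`. Comparing `convert_s` against `read_s` would show how much of the total is spent preparing the tabular form rather than reading from OpenSearch.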

[OpenSearchReader] Use DocSet instead of Dataset whenever possible #1110
