Description
OpenSearchReader (OSR) is the current default Sycamore reader implementation for reading OpenSearch data into DocSets. It uses Point-in-Time (PIT) snapshots and “slices” to achieve parallelism when reading from OpenSearch (https://opensearch.org/docs/latest/search-plugins/searching-data/point-in-time/).
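To make the PIT-plus-slices idea concrete, here is a minimal sketch of how per-slice search bodies could be built. It is not OSR's actual code: the helper name `build_slice_queries`, the `keep_alive` default, and the PIT id are hypothetical, and the body layout assumes the `pit` and `slice` fields described in the OpenSearch search documentation. The sketch also assumes that a slice `max` of 1 is not accepted, so it omits the `slice` clause in that case.

```python
from typing import Any


def build_slice_queries(
    pit_id: str,
    query: dict[str, Any],
    num_slices: int,
    keep_alive: str = "1m",  # assumed default, not taken from Sycamore
) -> list[dict[str, Any]]:
    """Build one search body per slice, all pinned to the same PIT."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")

    bodies = []
    for slice_id in range(num_slices):
        body: dict[str, Any] = {
            "query": query,
            # Every slice reads from the same point-in-time snapshot.
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        # Assumption: slicing only makes sense with more than one slice.
        if num_slices > 1:
            body["slice"] = {"id": slice_id, "max": num_slices}
        bodies.append(body)
    return bodies


if __name__ == "__main__":
    for body in build_slice_queries("example-pit-id", {"match_all": {}}, 3):
        print(body)
```

Each body can then be handed to an independent reader, since the slices partition the matching docs of one consistent snapshot.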
OpenSearchDatasource (OSD) is an implementation of the Datasource “interface” as defined by Ray’s data API. Ray comes with readers and writers for a number of data sources and sinks and provides examples for writing your own reader or writer (https://docs.ray.io/en/latest/data/custom-datasource-example.html). The latest implementation of OSD is checked into a branch:
Both implementations parallelize reads by creating one read task per slice or per parent doc (for document reconstruction). OSR does this with a flat_map operator in which each input row corresponds to a slice and maps to all matching docs in that slice; it relies on Ray to spread the read tasks across available resources (workers).
OSD explicitly creates ReadTasks which are picked up by available workers. Each read task works on a slice or a parent doc.
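The task-per-slice pattern shared by both implementations can be sketched without Ray or OpenSearch. The example below swaps in a standard-library thread pool for Ray's scheduler and an in-memory list for the index, so every name in it (`fake_read_slice`, `make_read_tasks`, `run_read_tasks`) is hypothetical and only illustrates the shape of the approach: build one zero-argument task per slice up front, then let whichever worker is free pick it up.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Callable

Doc = dict
ReadTask = Callable[[], list[Doc]]


def fake_read_slice(slice_id: int, num_slices: int, corpus: list[Doc]) -> list[Doc]:
    """Stand-in for a sliced search: return the docs that fall in this slice."""
    return [doc for i, doc in enumerate(corpus) if i % num_slices == slice_id]


def make_read_tasks(num_slices: int, corpus: list[Doc]) -> list[ReadTask]:
    """Create one self-contained read task per slice."""
    return [
        partial(fake_read_slice, slice_id, num_slices, corpus)
        for slice_id in range(num_slices)
    ]


def run_read_tasks(tasks: list[ReadTask], max_workers: int = 4) -> list[Doc]:
    """Run the tasks on a thread pool (standing in for Ray workers)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect results in task order so the output is deterministic.
        return [doc for future in futures for doc in future.result()]


if __name__ == "__main__":
    corpus = [{"doc_id": i} for i in range(10)]
    docs = run_read_tasks(make_read_tasks(3, corpus))
    print(len(docs), "docs read across 3 slices")
```

In this framing, the difference between OSR and OSD is mainly who creates the tasks: OSR gets them implicitly from flat_map over one row per slice, while OSD constructs them explicitly.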
Comparing the performance of OSR and OSD, we seem to lose a lot of time converting between Pandas DataFrames and Sycamore’s DocSets. Overall, OSR outperforms OSD when it is not forced to prepare data as a PyArrow Table or a Pandas DataFrame.