[BUG] Slow S3 Upload Times on GPU Worker #66

@anntians

What is the bug?

While running E2E benchmarking on the 10m dataset, we're seeing slow S3 upload times for the graph file after the index build completes on each worker.

The metrics below were collected during a force merge operation on an OS cluster with 6 primary shards, so each build request contains about 10m/6 ≈ 1.67m vectors.

GPU Worker
- download time ~ 10s
- index build time ~ 83s
- upload time ~ 30s

OS Cluster
- Repository write took 6763 ms
- Submit vector build took 11 ms
- Await vector build took 131800 ms
- Repository read took 53778 ms
- Remote index build succeeded after 192354 ms
- Merge took 192355 ms
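
To see where the merge time goes, the OS cluster numbers above can be broken down with a few lines of Python. This is only a sketch over the figures quoted in this issue; the dictionary keys and the `share` helper are names made up for illustration, not identifiers from the k-NN plugin.

```python
# Timings copied from the OS cluster metrics in this issue, in milliseconds.
# The key names are illustrative labels, not real metric names.
phases_ms = {
    "repository_write": 6763,
    "submit_vector_build": 11,
    "await_vector_build": 131800,
    "repository_read": 53778,
}
reported_total_ms = 192354  # "Remote index build succeeded after 192354 ms"

# Time covered by the four logged phases, and whatever is left over.
accounted_ms = sum(phases_ms.values())
unaccounted_ms = reported_total_ms - accounted_ms


def share(phase: str) -> float:
    """Fraction of the reported total spent in the given phase."""
    return phases_ms[phase] / reported_total_ms


for name, ms in phases_ms.items():
    print(f"{name:>20}: {ms / 1000:9.3f}s  ({share(name):.1%})")
print(f"{'unaccounted':>20}: {unaccounted_ms / 1000:9.3f}s")
```

The four logged phases add up to 192352 ms, so only 2 ms of the reported total is unaccounted for. The await phase and the repository read together make up nearly all of the merge time.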

How can one reproduce the bug?

  1. Deploy OS Cluster
  2. Deploy GPU Remote Build Cluster
  3. Set up OS cluster to enable remote build
  4. Register Workers on Remote Build Cluster
  5. Run OSB on 10m dataset

What is the expected behavior?

S3 upload time on the GPU worker should be similar to the download time (~10s), so we should expect download + index build + upload to total ~100s.

What is your host/environment?

g5.2xlarge

Do you have any additional context?

On the OS cluster side, we also see some wasted time during await vector build. In the example above, the total build request time on the GPU worker is 10 + 83 + 30 = 123s, but the OS cluster awaits for 131800 ms = 131.8s. So roughly 9s (131.8 - 123 = 8.8s) is spent waiting after the build has already completed. Perhaps there is an issue with the initial polling delay or the polling interval on the KNN side?
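
As a rough way to reason about the polling hypothesis, here is a minimal model of a client that waits an initial delay and then polls at a fixed interval. The function name and the delay/interval values below are hypothetical and chosen only for illustration; they are not the actual k-NN settings or defaults, which would need to be checked separately.

```python
import math


def detection_time_s(build_done_s, initial_delay_s, poll_interval_s):
    """Time of the first poll at or after the build finishes.

    Assumes the client polls at initial_delay_s, then every
    poll_interval_s after that (a simplified, hypothetical model).
    """
    if build_done_s <= initial_delay_s:
        return initial_delay_s
    polls_after_first = math.ceil(
        (build_done_s - initial_delay_s) / poll_interval_s
    )
    return initial_delay_s + polls_after_first * poll_interval_s


# Numbers from this issue.
gpu_total_s = 10 + 83 + 30              # download + build + upload
await_s = 131800 / 1000                 # "Await vector build took 131800 ms"
observed_gap_s = await_s - gpu_total_s  # about 8.8s

# Hypothetical (initial delay, poll interval) pairs, in seconds.
for initial, interval in [(60, 5), (60, 30), (30, 30)]:
    detected = detection_time_s(gpu_total_s, initial, interval)
    print(
        f"initial={initial}s interval={interval}s -> "
        f"detected at {detected}s, overhead {detected - gpu_total_s}s"
    )
print(f"observed gap: {observed_gap_s:.1f}s")
```

In this model, once the initial delay has passed, the overhead is always less than one poll interval. Under that assumption, a gap of about 9s would point to an interval of at least that size, or to extra latency outside the polling loop (for example request round trips).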

Metadata

Labels: bug
Participants: @krisfreedain, @rchitale7, @anntians
Repository: opensearch-project/remote-vector-index-builder (Issue #66)