[BUG] Regression in cohere-10m force merge latency after switching to NativeEngines990KnnVectorsWriter #2134
To reproduce this easily, the cohere-1m dataset was used for benchmarking in the table below.
Estimated bottlenecks

KNNVectorValues creation
KNNVectorValues are currently created 3 times. We cannot reuse the same object as there is no way to reset the iterator, and putting effort into logic for resetting the iterator might not result in latency improvements. Currently we are also creating KNNVectorValues even when quantization is not needed; Exp 2 in the above table shows some improvement in force merge time.

TotalLiveDocs computation
There is a linear time complexity to compute total live docs. The TotalLiveDocs value is currently needed in the following cases:

Flush case
For flush we can avoid this calculation as there are no deleted docs involved, and we can rely on KNNVectorValues or the vectors in the field to give us the right result for totalLiveDocs.

Merge case
Merge involves removing deleted docs: while merging the segments, the deleted docs aren't considered. To do that, the current code path uses APIs in MergedVectorValues to get an iterator that skips the deleted docs. The APIs here do not give an iterator which accounts for deleted docs in its size count. As a result, even KNNVectorValues cannot return the right result, as it relies on the iterator provided by MergedVectorValues to compute total live docs. A minimal sketch of how the merge-case count could be cheapened when segments have no deletions follows below.
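For the merge case, here is a minimal sketch of that idea, assuming Lucene 9.x APIs (MergeState.knnVectorsReaders, MergeState.liveDocs, FloatVectorValues); the helper name totalLiveDocs below is illustrative and not the plugin's actual method:

// Illustrative helper, not the plugin's actual code: when a segment has no deletions,
// its live count is simply FloatVectorValues.size(); only segments with deletes need the
// linear pass over the iterator.
static int totalLiveDocs(MergeState mergeState, FieldInfo fieldInfo) throws IOException {
  int total = 0;
  for (int i = 0; i < mergeState.knnVectorsReaders.length; i++) {
    FloatVectorValues values = mergeState.knnVectorsReaders[i].getFloatVectorValues(fieldInfo.name);
    if (values == null) {
      continue; // field has no vectors in this segment
    }
    if (mergeState.liveDocs[i] == null) {
      total += values.size(); // no deleted docs, size() is already the live count
    } else {
      // fall back to iterating and checking liveDocs for segments with deletions
      for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
        if (mergeState.liveDocs[i].get(doc)) {
          total++;
        }
      }
    }
  }
  return total;
}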
@shatejas one way to avoid the linear complexity for totalLiveDocs when there are no deleted docs is to write our own custom FloatVectorValues merger. We already have something like this for BinaryDocValues, ref: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN80Codec/KNN80DocValuesReader.java#L38-L64 If we do this, we can remove the complexity for total live docs. We can also add some clone interfaces on those merged values, which will help us remove complexity from the code. A rough sketch of what such a wrapper could look like is below.
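As a rough illustration of that idea (hypothetical, not code from the plugin or Lucene), assuming a Lucene 9.x FloatVectorValues that extends DocIdSetIterator; newer 9.x versions may require delegating additional methods such as scorer():

// Hypothetical wrapper: delegates iteration to the merged values but reports a
// precomputed live-doc count, so size() no longer needs a pass over the iterator.
final class LiveSizeFloatVectorValues extends FloatVectorValues {
  private final FloatVectorValues delegate;
  private final int liveDocCount;

  LiveSizeFloatVectorValues(FloatVectorValues delegate, int liveDocCount) {
    this.delegate = delegate;
    this.liveDocCount = liveDocCount;
  }

  @Override public int size() { return liveDocCount; }
  @Override public int dimension() { return delegate.dimension(); }
  @Override public float[] vectorValue() throws IOException { return delegate.vectorValue(); }
  @Override public int docID() { return delegate.docID(); }
  @Override public int nextDoc() throws IOException { return delegate.nextDoc(); }
  @Override public int advance(int target) throws IOException { return delegate.advance(target); }
  @Override public long cost() { return delegate.cost(); }
}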
@shatejas The table doesn't show a repro for 2.17 vs 2.16, right?
@jmazanec15 So the first row is 2.17, but run on the main branch as the code path is the same. To mimic the 2.16 code path, this change was made while running the benchmark
@shatejas but isn't the 2.17 time the same as 2.16 - so can we not repro it with this setup?
Not exactly the same; the number of segments being merged is ~15% higher for 2.16 compared to 2.17, so there is some difference
I did an initial deep-dive into the issue and ran some benchmarks. Here are the deep-dive results. I used https://github.com/opensearch-project/opensearch-cluster-cdk to set up the cluster. This setup mimics the configuration for 10M workload testing in nightlies. Ref: https://opensearch.org/benchmarks/

Cluster Configuration
Attempt 1
In this attempt I added some logs around the code to know where the time is going in the merge. Also, during the merge I started doing hot_threads calls on the node to see where the threads are busy. Here is the output.
In the output I was able to see the logs which I added.
Hot threads when CPU starts to flatten
Logs when force merge on 1 shard is done
OSB results
Hypothesis
Attempt 2
This time I used the same cluster and did hot threads calls quite aggressively, and I was able to find out that there is a checksum validation being called before every merge starts. Here is the dump.
Observation
Will do some code deep-dive for this.
Continuing my deep-dive from above, what I found is that the file containing flat vectors is opened via IOContext as RANDOM, whereas dvd files are getting opened as READ. This could be a potential reason for the high latency. A simple check to confirm this is switching the IOContext from RANDOM to READ for the flat vector files. Ref this to learn about the RANDOM and READ IOContexts of Lucene, which are mapped to madvise arguments used to advise the operating system on how to mmap a file.
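A minimal sketch of that check, assuming a Lucene version where IOContext.READ is available, and using illustrative variable names for where the flat vector data file is opened:

// Illustrative only: open the .vec (flat vector data) file with the READ context instead
// of RANDOM, so the kernel keeps its normal read-ahead behavior during merges.
IndexInput vectorData =
    state.directory.openInput(vectorDataFileName, IOContext.READ);
// previously this input was opened with a RANDOM read advice, which disables read-ahead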
Attempt 3

Cluster Configuration
Now, to test that RANDOM is the culprit here, I built a fresh version of Lucene, OpenSearch and k-NN from source with my changes. Below are the steps I followed:
Once the tar was built and deployed, I ran the indexing and force merge and got these logs.
OSB results
Conclusion
After changing the IOContext from RANDOM to READ there is no longer a flat line for reads. Compared with older runs, where merging the flat vectors and doing the integrity checks took ~10min, the time has come down to ~2.5min. Ref this to learn about the RANDOM and READ IOContexts of Lucene, which are mapped to madvise arguments used to advise the operating system on how to mmap a file.
Code change I did: navneet1v/lucene@cd02e6f
The metrics for the node: with attempt 3, the total merge time has finally come down from 5.23hrs (deep-dive 1) to ~4.8hrs (deep-dive 3), a difference of > 20mins, which was the actual degradation.

Open Questions
Thanks for the deep dive @navneet1v
Based on the code deep dive, IOContext is used to advise the kernel about the access pattern of the data for memory-mapped files. This advice is helpful so the kernel can optimize disk reads. IOContext translates into the madvise system call which is used to advise the kernel (ref1 - lucene 9.11, ref2 - lucene 10). The default behavior of the kernel (without any madvise call) seems to be to preload pages in a non-aggressive way; this preloading of data is asynchronous.

The Lucene engine depends on FlatVectorFormat to do its HNSW searches, where it seeks to the position of the vector based on the closest entry point. This access pattern is not sequential, and opening the IndexInput with IOContext.RANDOM gives a random-access-pattern advice to the kernel; in response the kernel does not preload data into physical memory. This makes sense as it not only saves CPU resources, since the kernel is not spending extra cycles on preloading data, but has also been seen to improve search latency.

Merges are different compared to searches. While merging data, all segments and their pages are accessed sequentially from disk. In this case it is beneficial to preload pages of the segments since it reduces runtime reads from the disk, which decreases the total time taken to do the merge. Since the codec reader instance used by Lucene is the same as that of search, the access pattern advice given to the kernel remains random and is not changed before merges. This in turn slows down the merges since there are more runtime reads.

To see the behavior of the Lucene engine with IOContext.RANDOM vs using preloading, an experiment was done. The experiment uses the cohere-10m dataset, which has 10 million vectors, each with 768 dimensions. The OpenSearch cluster had 3 nodes, with a total of 6 shards and 1 replica. Each node had 4 vCPUs and 32Gb of memory, of which 16Gb was used for the Java heap.

Baseline - with IOContext.RANDOM
The total force merge time was ~9hr 30mins with read operations of ~120k/min before merge of each shard. The graph shows CPU utilization (blue line) vs read operations (orange line). Note that the initial bump in CPU is indexing.

Preload .vec and .vex files
To enforce preloading, OpenSearch provides an index setting for it. The total force merge time was ~8hrs 54 mins with read operations of ~60k/min before merge of each shard.

So it seems that with the change to IOContext.RANDOM, while search performance improved, it degraded the total force merge time.
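For intuition, here is a rough sketch (not Lucene's actual code) of how such read advice maps onto the standard POSIX madvise advice values that the kernel understands:

// Rough sketch of the advice mapping; the numeric values are the standard
// POSIX_MADV_* constants from <sys/mman.h>.
enum AccessAdvice { NORMAL, RANDOM, SEQUENTIAL, WILL_NEED }

static int toPosixMadvise(AccessAdvice advice) {
  switch (advice) {
    case RANDOM:     return 1; // POSIX_MADV_RANDOM: expect random seeks, don't read ahead (search)
    case SEQUENTIAL: return 2; // POSIX_MADV_SEQUENTIAL: read ahead aggressively (merge)
    case WILL_NEED:  return 3; // POSIX_MADV_WILLNEED: page this range in soon
    default:         return 0; // POSIX_MADV_NORMAL: kernel default, moderate read-ahead
  }
}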
In an attempt to fix the regression, a few solutions were explored.

1. Preloading .vec and .vex files [Ruled out]
Preloading data into physical memory for mmap directories helps with reducing latencies since it reduces read operations at runtime. To enable preloads, a default cluster setting was overridden which tells the OpenSearch process to preload .vec and .vex files on startup. The experiment with the FAISS engine showed that the preload approach decreased the total force merge time by ~20mins. The CPU and memory for the FAISS experiment were changed to 16 CPUs and 128GB (with 50% allocated to the JVM) per node. This was done to be able to compare it to nightly runs.

FAISS Baseline - with IOContext.RANDOM
Total force merge time: ~5hrs 10mins

Preloading .vec and .vex files
Total force merge time: ~4hrs 55mins

Why was the solution ruled out?
While this solution showed improvements in total force merge time compared to IOContext.RANDOM, it degraded the search latency by 50%. These are Lucene latencies from the experiment mentioned in #2134 (comment)
Why did the latency increase?
Preloading not only loads data into physical memory but also loads data into CPU caches. The code is structured in such a way that it ignores the explicit madvise() call if preloads are enabled (ref). So when the reader is opened during search, the random-access madvise() call is ignored and the data is preloaded. As the reader seeks to different pages, the CPU always seemed to have more cache misses compared to the random madvise call; preload seems to be evicting useful data from the CPU cache to make space for preloaded data. This is just a hypothesis and I was not able to find evidence for it.

2. Opening a new IndexInput by overriding getMergeInstance()
Before the merge starts, Lucene creates a merge state. MergeState holds all the references to the readers needed during the merge process. While creating the merge state, it calls getMergeInstance() for each reader. The solution leverages this method to open a new IndexInput with IOContext.READONCE. This ensures that the kernel aggressively loads pages and then discards them, due to the corresponding madvise(SEQUENTIAL) call that results from the read-once IOContext.

Pseudo code for Lucene99FlatVectorsReader:

@Override
public FlatVectorsReader getMergeInstance() {
try {
return new Lucene99FlatVectorsReader(this, IOContext.READONCE);
} catch (IOException e) {
// Throwing for testing purposes, we can return existing instance once the testing is done
throw new RuntimeException(e);
}
}
private Lucene99FlatVectorsReader(
final Lucene99FlatVectorsReader flatVectorsReader, final IOContext ioContext)
throws IOException {
super(flatVectorsReader.getFlatVectorScorer());
this.segmentReadState = flatVectorsReader.segmentReadState;
boolean success = false;
try {
this.vectorData =
openDataInput(
flatVectorsReader.segmentReadState,
readMetadata(flatVectorsReader.segmentReadState),
Lucene99FlatVectorsFormat.VECTOR_DATA_EXTENSION,
Lucene99FlatVectorsFormat.VECTOR_DATA_CODEC_NAME,
ioContext);
success = true;
} finally {
if (success == false) {
IOUtils.closeWhileHandlingException(this);
}
}
}

With the change in Lucene99FlatVectorsReader and the corresponding change in the k-NN plugin code, the total force merge time for the FAISS engine was ~4hr 46mins.

Why is the solution not favored?
Based on the javadoc, it is better to clone an IndexInput than to open a new one. While experiments did not reveal any issues with opening a new IndexInput, for multi-threaded use clone is more favorable according to the documentation.

3. Advise the kernel of a sequential access pattern before the start of the merge
This bypasses opening a new IndexInput. Instead, it clones the existing IndexInput and calls madvise(SEQUENTIAL) for the files that need to be merged. To do so we need access to each MemorySegment used, and these are only accessible inside the IndexInput. The solution relies on the madvise/prefetch support already available inside MemorySegmentIndexInput; we leverage it to create a prefetchSequential() method.

Code snippets

MemorySegmentIndexInput

public void prefetchSequential() throws IOException {
if (NATIVE_ACCESS.isEmpty()) {
return;
}
long offset = 0;
for (MemorySegment seg : segments) {
prefetch(offset, seg.byteSize(), ReadAdvice.SEQUENTIAL);
offset += seg.byteSize();
}
}
@Override
public void prefetch(long offset, long length) throws IOException {
prefetch(offset, length, ReadAdvice.WILL_NEED);
}

Lucene99FlatVectorsReader.java
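A hedged sketch of what the corresponding hook in Lucene99FlatVectorsReader could look like, assuming the prefetchSequential() shown above can be reached from the reader (for example by exposing it on the cloned vector data IndexInput):

// Hypothetical sketch, not the actual change: before the merge uses this reader, clone the
// already-open .vec IndexInput (per the IndexInput javadoc, clone is preferred over opening
// a new input) and advise the kernel that it will be read sequentially.
@Override
public FlatVectorsReader getMergeInstance() {
  try {
    IndexInput mergeInput = vectorData.clone();
    mergeInput.prefetchSequential(); // assumed to be exposed for this sketch
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
  return this;
}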
Benchmarks showed the total merge time was approximately 4hrs 40mins for the FAISS engine. @navneet1v reached out to the Lucene community to discuss the feasibility of the solution; here is the GitHub issue: apache/lucene#13920
What is the bug?
After the switch to NativeEngines990KnnVectorsWriter, we saw force merge latencies increase by approximately 20% in nightly runs. The increase has been consistent.
How can one reproduce the bug?
Running a force-merge benchmark against 2.17 vs 2.16 with the cohere-10m dataset takes 2000 seconds (~30 mins) more on 2.17.
What is the expected behavior?
It should take the same time or less
What is your host/environment?
Nightlies dashboard
Do you have any screenshots?
NA
Do you have any additional context?
NA