Fix significant GOMP barrier overhead in exhaustive_L2sqr_blas. #4663
The `#pragma omp parallel for` directive in `exhaustive_L2sqr_blas` introduces substantial GOMP barrier overhead, which considerably slows down computation. Consider the following clustering example:
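The benchmark source itself is not shown above; the following is a minimal sketch of what such a clustering benchmark could look like, assuming the `faiss::Clustering` and `faiss::IndexFlatL2` C++ API plus a hand-rolled `read_fvecs` helper. The number of centroids and iterations are placeholders, since the original values are not given in the text:

```cpp
// clustering.cpp -- hypothetical reconstruction of the benchmark; the original
// snippet is not reproduced in this description. Assumes the faiss C++ API
// (faiss::Clustering, faiss::IndexFlatL2) and a minimal .fvecs reader.
// Build (roughly): g++ -O2 -fopenmp clustering.cpp -lfaiss -lopenblas -o clustering
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

#include <faiss/Clustering.h>
#include <faiss/IndexFlat.h>

// Read an .fvecs file: each record is an int32 dimension followed by that
// many float32 components.
static std::vector<float> read_fvecs(const char* path, size_t& d_out, size_t& n_out) {
    FILE* f = fopen(path, "rb");
    if (!f) {
        perror(path);
        exit(1);
    }
    std::vector<float> data;
    int32_t d = 0;
    size_t n = 0;
    while (fread(&d, sizeof d, 1, f) == 1) {
        size_t off = data.size();
        data.resize(off + d);
        if (fread(data.data() + off, sizeof(float), d, f) != (size_t)d) {
            break;
        }
        n++;
    }
    fclose(f);
    d_out = (size_t)d;
    n_out = n;
    return data;
}

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s base.fvecs\n", argv[0]);
        return 1;
    }
    size_t d, n;
    std::vector<float> x = read_fvecs(argv[1], d, n); // e.g. sift1M/sift_base.fvecs

    // k-means with verbose output, so the per-iteration timings
    // ("Iteration 1 (... s, search ... s): ...") are printed.
    faiss::ClusteringParameters cp;
    cp.verbose = true;
    cp.niter = 10; // assumption: iteration count not stated in the PR text

    faiss::Clustering clus(d, 1024, cp); // assumption: 1024 centroids
    faiss::IndexFlatL2 index(d);         // assignment step runs the exhaustive L2 search
    clus.train(n, x.data(), index);
    return 0;
}
```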
Running the above with SIFT1M vectors (`./clustering sift1M/sift_base.fvecs`) results in significant GOMP overhead, where `gomp_barrier_wait_end` and `gomp_team_barrier_wait_end` consume more than 50% of the CPU cycles. Output (on a c7i.4xlarge instance):

```
Iteration 1 (149.74 s, search 149.55 s): objective=4.11771e+10 imbalance=1.378 nsplit=0
```

One way to address this issue is to move `#pragma omp parallel for` to the outer `j0` loop (which requires moving `ip_block` inside it). This change improves performance by approximately 2x:

```
Iteration 1 (68.46 s, search 68.23 s): objective=4.11769e+10 imbalance=1.379 nsplit=0
```
Another approach is to simply remove `#pragma omp parallel for` altogether, since the `sgemm_` call is already the dominant computation and is parallelized implicitly by the BLAS library. The additional `parallel for` inside the loop appears to be unnecessary overhead. This solution improves performance by roughly 5x:

```
Iteration 1 (27.98 s, search 27.85 s): objective=4.11771e+10 imbalance=1.378 nsplit=0
```

I also experimented with a few other approaches, but all of them introduced additional GOMP overhead. I propose that we take the second approach.
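For context, here is a simplified, self-contained sketch of the blocked L2sqr pattern being discussed. It is not the faiss kernel itself (the function name `blocked_l2sqr_sketch`, the block sizes, and the exact loop nesting are illustrative assumptions), but the comments mark where the current `#pragma omp parallel for` sits and how the two alternatives above change it:

```cpp
// Sketch of a blocked L2sqr kernel -- NOT the actual faiss code.
// Computes dis[i * ny + j] = ||x_i - y_j||^2 block by block via sgemm.
// Build (roughly): g++ -O2 -fopenmp blocked_l2sqr.cpp -lopenblas
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include <cblas.h>

void blocked_l2sqr_sketch(
        size_t d, size_t nx, size_t ny,
        const float* x, const float* y, float* dis) {
    const size_t bs_x = 4096, bs_y = 1024; // illustrative block sizes

    // Precompute squared norms of both sets.
    std::vector<float> x_norms(nx), y_norms(ny);
    for (size_t i = 0; i < nx; i++)
        x_norms[i] = cblas_sdot((int)d, x + i * d, 1, x + i * d, 1);
    for (size_t j = 0; j < ny; j++)
        y_norms[j] = cblas_sdot((int)d, y + j * d, 1, y + j * d, 1);

    // Shared scratch buffer for one block pair of inner products.
    // Approach 1 would instead put "#pragma omp parallel for" on the outer
    // j0 loop and give each thread its own ip_block inside it.
    std::vector<float> ip_block(bs_x * bs_y);

    for (size_t j0 = 0; j0 < ny; j0 += bs_y) {     // outer: blocks of y
        size_t j1 = std::min(ny, j0 + bs_y);
        for (size_t i0 = 0; i0 < nx; i0 += bs_x) { // inner: blocks of x
            size_t i1 = std::min(nx, i0 + bs_x);

            // ip_block[(i - i0) * (j1 - j0) + (j - j0)] = <x_i, y_j>.
            // The BLAS library typically parallelizes this call internally.
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                        (int)(i1 - i0), (int)(j1 - j0), (int)d,
                        1.0f, x + i0 * d, (int)d, y + j0 * d, (int)d,
                        0.0f, ip_block.data(), (int)(j1 - j0));

            // Current placement: an OpenMP team is forked and joined (with
            // its GOMP barrier) once per block pair.
            // Approach 2 deletes this pragma and relies on the threaded sgemm.
#pragma omp parallel for
            for (int64_t i = (int64_t)i0; i < (int64_t)i1; i++) {
                for (size_t j = j0; j < j1; j++) {
                    float ip = ip_block[(i - i0) * (j1 - j0) + (j - j0)];
                    dis[i * ny + j] = x_norms[i] + y_norms[j] - 2 * ip;
                }
            }
        }
    }
}
```

The sketch is only meant to make the placement concrete: with the pragma in the innermost position, a thread team is synchronized for every block pair, which is where the `gomp_*barrier_wait_end` cycles come from.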
---
`OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` were set to their default values in the experiments above. The behavior remains consistent as long as thread oversubscription does not occur.