Conversation

@mulugetam
Contributor

The #pragma omp parallel for directive in exhaustive_L2sqr_blas introduces substantial GOMP barrier overhead, which considerably slows down computation.

Consider the following clustering example:

#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

#include <faiss/Clustering.h>
#include <faiss/IndexFlat.h>

std::vector<float> read_fvecs(const std::string& filename, size_t& d, size_t& nb) {
    std::ifstream fin(filename, std::ios::binary);
    if (!fin) throw std::runtime_error("Cannot open file");

    // Read all vectors into memory
    fin.seekg(0, std::ios::end);
    size_t filesize = fin.tellg();
    fin.seekg(0, std::ios::beg);

    std::vector<float> xb;

    size_t offset = 0;
    while (offset < filesize) {
        int dim;
        if (!fin.read(reinterpret_cast<char*>(&dim), sizeof(int))) break;
        d = dim; // dimension (assumed identical for all vectors)

        std::vector<float> vec(dim);
        fin.read(reinterpret_cast<char*>(vec.data()), dim * sizeof(float));

        xb.insert(xb.end(), vec.begin(), vec.end());

        offset += sizeof(int) + dim * sizeof(float);
        nb++;
    }

    return xb;
}

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <fvecs file>" << std::endl;
        return 1;
    }
    std::string filename = argv[1];
    size_t d = 0;
    size_t nb = 0;

    std::cout << "Reading dataset..." << std::endl;
    std::vector<float> xb = read_fvecs(filename, d, nb);
    std::cout << "Loaded " << nb << " vectors of dimension " << d << std::endl;

    // Setup clustering
    size_t ncentroids = 16384;
    faiss::IndexFlatL2 index(d);
    faiss::Clustering clus(d, ncentroids);
    clus.verbose = true;
    clus.niter = 2;

    std::cout << "Performing clustering..." << std::endl;
    clus.train(nb, xb.data(), index);

    std::cout << "Clustering finished." << std::endl;
    return 0;
}

Running the above on the SIFT1M base vectors (./clustering sift1M/sift_base.fvecs) results in significant GOMP overhead: gomp_barrier_wait_end and gomp_team_barrier_wait_end consume more than 50% of the CPU cycles. Output (on a c7i.4xlarge instance):

Iteration 1 (149.74 s, search 149.55 s): objective=4.11771e+10 imbalance=1.378 nsplit=0

One way to address this issue is to move the #pragma omp parallel for to the outer j0 loop (which requires moving ip_block inside the loop so that each thread has its own scratch buffer; see the sketch below). This change improves performance by approximately 2x:

Iteration 1 (68.46 s, search 68.23 s): objective=4.11769e+10 imbalance=1.379 nsplit=0
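
For illustration, here is a minimal self-contained sketch of this restructuring (not the actual Faiss source; the function name l2sqr_blocked and the block size are invented for the example). The parallel region sits on the outer database-block loop, each thread allocates its own ip_block, and a plain triple loop stands in for the sgemm_ call:

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: block-wise L2-squared distances with the OpenMP parallel
// region on the outer database-block loop. In Faiss the inner-product
// block comes from sgemm_; a plain loop stands in for it here.
void l2sqr_blocked(
        const float* x,  // nx * d query vectors
        const float* y,  // ny * d database vectors
        size_t d, size_t nx, size_t ny,
        float* dis) {    // nx * ny output distances
    // precompute squared norms
    std::vector<float> x_norms(nx), y_norms(ny);
    for (size_t i = 0; i < nx; i++) {
        float s = 0;
        for (size_t k = 0; k < d; k++) s += x[i * d + k] * x[i * d + k];
        x_norms[i] = s;
    }
    for (size_t j = 0; j < ny; j++) {
        float s = 0;
        for (size_t k = 0; k < d; k++) s += y[j * d + k] * y[j * d + k];
        y_norms[j] = s;
    }

    const size_t bs_y = 4096; // database block size (illustrative)
#pragma omp parallel for
    for (size_t j0 = 0; j0 < ny; j0 += bs_y) {
        size_t j1 = std::min(j0 + bs_y, ny);
        // per-thread scratch buffer; in the current code a single
        // ip_block is shared and the parallel for sits on the inner loop
        std::vector<float> ip_block(nx * (j1 - j0));
        for (size_t i = 0; i < nx; i++) {
            for (size_t j = j0; j < j1; j++) {
                float ip = 0;
                for (size_t k = 0; k < d; k++)
                    ip += x[i * d + k] * y[j * d + k];
                ip_block[i * (j1 - j0) + (j - j0)] = ip;
            }
        }
        // ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>
        for (size_t i = 0; i < nx; i++)
            for (size_t j = j0; j < j1; j++)
                dis[i * ny + j] = x_norms[i] + y_norms[j] -
                        2 * ip_block[i * (j1 - j0) + (j - j0)];
    }
}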

Another approach is to simply remove the #pragma omp parallel for altogether: the sgemm_ call is already the dominant computation and is parallelized internally by the BLAS library, so an additional parallel for inside the loop is just unnecessary overhead. This solution improves performance by roughly 5x:

Iteration 1 (27.98 s, search 27.85 s): objective=4.11771e+10 imbalance=1.378 nsplit=0

I also experimented with a few other approaches, but all of them introduced additional GOMP overhead. I propose that we take the second approach.

--
OMP_NUM_THREADS and OPENBLAS_NUM_THREADS were set to their default values in the experiments above. The behavior remains consistent as long as thread oversubscription does not occur.
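
For completeness, the two thread pools can also be pinned programmatically instead of via environment variables. A minimal sketch, assuming OpenBLAS (openblas_set_num_threads is OpenBLAS-specific; MKL users would call mkl_set_num_threads instead):

#include <omp.h>

// OpenBLAS-specific entry point for resizing the BLAS thread pool.
extern "C" void openblas_set_num_threads(int num_threads);

int main() {
    omp_set_num_threads(1);        // keep OpenMP regions single-threaded
    openblas_set_num_threads(16);  // let the BLAS pool use the cores
    // ... run the clustering example above ...
    return 0;
}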

@trang-nm-nguyen
Contributor

I am curious whether you have tested setting #pragma omp parallel for if (i1 > [fixed_number]) here, and whether there is a gain in keeping this logic for larger loops?

@mulugetam
Contributor Author

> I am curious whether you have tested setting #pragma omp parallel for if (i1 > [fixed_number]) here, and whether there is a gain in keeping this logic for larger loops?

@trang-nm-nguyen I tried that, for example if (i1 > 12 * 4096), i.e. a cutoff of 49152, but the GOMP barriers still consume a significant amount of CPU cycles (though fewer). The question is: what should the cutoff value be?
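
For concreteness, the variant discussed above looks like this (a sketch; postprocess, the loop body, and the cutoff are all illustrative):

#include <cstdint>
#include <vector>

// Sketch of the threshold variant: parallelize only when the block is
// large enough. The cutoff (12 * 4096 here) is purely illustrative,
// and picking a good value is exactly the open question.
void postprocess(std::vector<float>& dis, int64_t i0, int64_t i1) {
    const int64_t cutoff = 12 * 4096;
#pragma omp parallel for if (i1 > cutoff)
    for (int64_t i = i0; i < i1; i++) {
        dis[i] = dis[i] < 0 ? 0 : dis[i]; // stand-in for the real work
    }
}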

One additional observation: when OMP_NUM_THREADS oversubscribes the number of cores in the system, the kernel seems to de-schedule the threads, and performance actually improves, with no significant GOMP barrier overhead.

@trang-nm-nguyen
Contributor

Sounds good to me. I just wanted to check if this has been verified.

In this case, can you remove these lines instead of commenting them out? That just makes the code look better overall.

@mulugetam
Contributor Author

@trang-nm-nguyen removed.

@meta-codesync
Contributor

meta-codesync bot commented Nov 7, 2025

@trang-nm-nguyen has imported this pull request. If you are a Meta employee, you can view this in D86557804.
