Conversation

@mulugetam
Contributor

The #pragma omp parallel for directive in exhaustive_L2sqr_blas introduces substantial GOMP barrier overhead, which considerably slows down computation.

Consider the following clustering example:

#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

#include <faiss/Clustering.h>
#include <faiss/IndexFlat.h>

std::vector<float> read_fvecs(const std::string& filename, size_t& d, size_t& nb) {
    std::ifstream fin(filename, std::ios::binary);
    if (!fin) throw std::runtime_error("Cannot open file");

    // Read all vectors into memory
    fin.seekg(0, std::ios::end);
    size_t filesize = fin.tellg();
    fin.seekg(0, std::ios::beg);

    std::vector<float> xb;

    size_t offset = 0;
    while (offset < filesize) {
        int dim;
        if (!fin.read(reinterpret_cast<char*>(&dim), sizeof(int))) break;
        d = dim; // dimension (assumed identical for all vectors)

        std::vector<float> vec(dim);
        fin.read(reinterpret_cast<char*>(vec.data()), dim * sizeof(float));

        xb.insert(xb.end(), vec.begin(), vec.end());

        offset += sizeof(int) + dim * sizeof(float);
        nb++;
    }

    return xb;
}

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <fvecs file>" << std::endl;
        return 1;
    }
    std::string filename = argv[1];
    size_t d = 0;
    size_t nb = 0;

    std::cout << "Reading dataset..." << std::endl;
    std::vector<float> xb = read_fvecs(filename, d, nb);
    std::cout << "Loaded " << nb << " vectors of dimension " << d << std::endl;

    // Setup clustering
    size_t ncentroids = 16384;
    faiss::IndexFlatL2 index(d);
    faiss::Clustering clus(d, ncentroids);
    clus.verbose = true;
    clus.niter = 2;

    std::cout << "Performing clustering..." << std::endl;
    clus.train(nb, xb.data(), index);

    std::cout << "Clustering finished." << std::endl;
    return 0;
}

Running the above on the SIFT1M base vectors (./clustering sift1M/sift_base.fvecs) results in significant GOMP overhead: gomp_barrier_wait_end and gomp_team_barrier_wait_end consume more than 50% of the CPU cycles. Output (on a c7i.4xlarge instance):

Iteration 1 (149.74 s, search 149.55 s): objective=4.11771e+10 imbalance=1.378 nsplit=0

One way to address this issue is to move the #pragma omp parallel for to the outer j0 loop (which requires moving ip_block inside the loop so that each thread has its own scratch buffer; see the sketch below). This change improves performance by approximately 2x:

Iteration 1 (68.46 s, search 68.23 s): objective=4.11769e+10 imbalance=1.379 nsplit=0
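
For illustration, here is a minimal self-contained sketch of this restructuring (not the actual Faiss source; the function name l2sqr_blocked and the block size are invented for the example). The parallel region sits on the outer database-block loop, each thread allocates its own ip_block, and a plain triple loop stands in for the sgemm_ call:

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: block-wise L2-squared distances with the OpenMP parallel
// region on the outer database-block loop. In Faiss the inner-product
// block comes from sgemm_; a plain loop stands in for it here.
void l2sqr_blocked(
        const float* x,  // nx * d query vectors
        const float* y,  // ny * d database vectors
        size_t d, size_t nx, size_t ny,
        float* dis) {    // nx * ny output distances
    // precompute squared norms
    std::vector<float> x_norms(nx), y_norms(ny);
    for (size_t i = 0; i < nx; i++) {
        float s = 0;
        for (size_t k = 0; k < d; k++) s += x[i * d + k] * x[i * d + k];
        x_norms[i] = s;
    }
    for (size_t j = 0; j < ny; j++) {
        float s = 0;
        for (size_t k = 0; k < d; k++) s += y[j * d + k] * y[j * d + k];
        y_norms[j] = s;
    }

    const size_t bs_y = 4096; // database block size (illustrative)
#pragma omp parallel for
    for (size_t j0 = 0; j0 < ny; j0 += bs_y) {
        size_t j1 = std::min(j0 + bs_y, ny);
        // per-thread scratch buffer; in the current code a single
        // ip_block is shared and the parallel for sits on the inner loop
        std::vector<float> ip_block(nx * (j1 - j0));
        for (size_t i = 0; i < nx; i++) {
            for (size_t j = j0; j < j1; j++) {
                float ip = 0;
                for (size_t k = 0; k < d; k++)
                    ip += x[i * d + k] * y[j * d + k];
                ip_block[i * (j1 - j0) + (j - j0)] = ip;
            }
        }
        // ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>
        for (size_t i = 0; i < nx; i++)
            for (size_t j = j0; j < j1; j++)
                dis[i * ny + j] = x_norms[i] + y_norms[j] -
                        2 * ip_block[i * (j1 - j0) + (j - j0)];
    }
}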

Another approach is to simply remove the #pragma omp parallel for altogether: the sgemm_ call is already the dominant computation and is parallelized internally by the BLAS library, so an additional parallel for inside the loop is just unnecessary overhead. This solution improves performance by roughly 5x:

Iteration 1 (27.98 s, search 27.85 s): objective=4.11771e+10 imbalance=1.378 nsplit=0

I also experimented with a few other approaches, but all of them introduced additional GOMP overhead. I propose that we take the second approach.

--
OMP_NUM_THREADS and OPENBLAS_NUM_THREADS were set to their default values in the experiments above. The behavior remains consistent as long as thread oversubscription does not occur.
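
For completeness, the two thread pools can also be pinned programmatically instead of via environment variables. A minimal sketch, assuming OpenBLAS (openblas_set_num_threads is OpenBLAS-specific; MKL users would call mkl_set_num_threads instead):

#include <omp.h>

// OpenBLAS-specific entry point for resizing the BLAS thread pool.
extern "C" void openblas_set_num_threads(int num_threads);

int main() {
    omp_set_num_threads(1);        // keep OpenMP regions single-threaded
    openblas_set_num_threads(16);  // let the BLAS pool use the cores
    // ... run the clustering example above ...
    return 0;
}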

@trang-nm-nguyen
Contributor

I am curious whether you have tested setting #pragma omp parallel for if (i1 > [fixed_number]) here, and whether there is a gain in keeping this logic for larger loops?

@mulugetam
Contributor Author

> I am curious whether you have tested setting #pragma omp parallel for if (i1 > [fixed_number]) here, and whether there is a gain in keeping this logic for larger loops?

@trang-nm-nguyen I tried that, for example if (i1 > 12 * 4096), i.e. a cutoff of 49152, but the GOMP barriers still consume a significant amount of CPU cycles (though fewer). The question is: what should the cutoff value be?
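
For concreteness, the variant discussed above looks like this (a sketch; postprocess, the loop body, and the cutoff are all illustrative):

#include <cstdint>
#include <vector>

// Sketch of the threshold variant: parallelize only when the block is
// large enough. The cutoff (12 * 4096 here) is purely illustrative,
// and picking a good value is exactly the open question.
void postprocess(std::vector<float>& dis, int64_t i0, int64_t i1) {
    const int64_t cutoff = 12 * 4096;
#pragma omp parallel for if (i1 > cutoff)
    for (int64_t i = i0; i < i1; i++) {
        dis[i] = dis[i] < 0 ? 0 : dis[i]; // stand-in for the real work
    }
}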

One additional observation: when OMP_NUM_THREADS oversubscribes the number of cores in the system, the kernel seems to de-schedule the threads, and performance actually improves, with no significant GOMP barrier overhead.

@trang-nm-nguyen
Contributor

Sounds good to me. I just wanted to check if this has been verified.

In this case, can you remove these lines instead of commenting them out? That just makes the code look better overall.

@mulugetam
Contributor Author

@trang-nm-nguyen removed.

@meta-codesync
Contributor

meta-codesync bot commented Nov 7, 2025

@trang-nm-nguyen has imported this pull request. If you are a Meta employee, you can view this in D86557804.
