Possible data races detected with ThreadSanitizer in ger_thread, ... #5153


Open · vfdev-5 opened this issue Feb 26, 2025 · 9 comments

vfdev-5 commented Feb 26, 2025

I built a TSAN-instrumented OpenBLAS with clang-18; here are the build commands:

cd /project/OpenBLAS

export CC=clang-18
export CXX=clang++-18
export CFLAGS="-fsanitize=thread -g"
export CXXFLAGS="-fsanitize=thread -g"
export LDFLAGS=-fsanitize=thread

apt-get update && apt-get install -y perl --no-install-recommends
make -j 32 shared
make -j 32 install

and then ran the utests with a few threads:

OMP_NUM_THREADS=2 make -C utest &> make_utest.nt-2.log

ThreadSanitizer reports various data races:
https://gist.github.com/vfdev-5/7e57d305f100003aa4b9ada11efbcf0e#file-make_utest-nt-2-log

Environment info:

  • Linux, Ubuntu 24.04.1 LTS
# Makefile.conf
OSNAME=Linux
ARCH=x86_64
C_COMPILER=CLANG
BINARY32=
BINARY64=1
CEXTRALIB= -L/usr/bin/../lib/gcc/x86_64-linux-gnu/14 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib  -lpthread -lrt -lm -ldl -lresolv -lc  /usr/lib/llvm-18/lib/clang/18/lib/linux/libclang_rt.tsan-x86_64.a
F_COMPILER=GFORTRAN
FC=gfortran
BU=_
NOFORTRAN=1
CORE=ZEN
LIBCORE=zen
NUM_CORES=128
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSSE3=1
HAVE_SSE4_1=1
HAVE_SSE4_2=1
HAVE_SSE4A=1
HAVE_AVX=1
HAVE_AVX2=1
HAVE_FMA3=1
MAKEFLAGS += -j 128
SBGEMM_UNROLL_M=8
SBGEMM_UNROLL_N=4
SGEMM_UNROLL_M=8
SGEMM_UNROLL_N=4
DGEMM_UNROLL_M=4
DGEMM_UNROLL_N=8
QGEMM_UNROLL_M=2
QGEMM_UNROLL_N=2
CGEMM_UNROLL_M=8
CGEMM_UNROLL_N=2
ZGEMM_UNROLL_M=4
ZGEMM_UNROLL_N=2
XGEMM_UNROLL_M=1
XGEMM_UNROLL_N=1
CGEMM3M_UNROLL_M=8
CGEMM3M_UNROLL_N=4
ZGEMM3M_UNROLL_M=4
ZGEMM3M_UNROLL_N=4
XGEMM3M_UNROLL_M=2
XGEMM3M_UNROLL_N=2

rgommers (Contributor) commented:

Thanks @vfdev-5, this is useful. I noticed that a number of the issues are present in ger_thread, and ger* is the only routine where I found issues when doing parallel testing of scipy.linalg: scipy/scipy#21936. That may be a coincidence, but I'll try to dig into it a bit this week.

rgommers (Contributor) commented:

@martin-frbg for context: in Python there's a significant effort underway to remove the GIL (Global Interpreter Lock), so a free-threading CPython interpreter now allows calling Python functions from parallel threads at the same time. This means numpy.linalg and scipy.linalg (and PyTorch, and other packages with direct BLAS/LAPACK bindings) are going to be calling OpenBLAS from parallel threads in the same process. TSAN is used to detect potential issues with that.

martin-frbg (Collaborator) commented:

Thanks - I'm looking into level 2 BLAS thread safety in conjunction with #5104; in the past, most attention had been on level 3 and the thread server code.

martin-frbg (Collaborator) commented:

The tsan log does not make much sense to me. Can't say I'm much happier with helgrind though, as it flags a couple of atomic operations in blas_server.c and "possible races" between Fortran and C accessing the work array.

hawkinsp (Contributor) commented:

> The tsan log does not make much sense to me. Can't say I'm much happier with helgrind though, as it flags a couple of atomic operations in blas_server.c and "possible races" between Fortran and C accessing the work array.

I'm not familiar with your code so my diagnosis may be wrong, but the tsan race looks right to me.

The read here:

void *sa = queue -> sa;

and the write here:

queue[num_cpu].sb = NULL;

are racy because I don't think we've done anything to order the reader and the writer.

The only synchronization is via the queue pointer, I think, but you use relaxed reads and writes, which, critically, do not imply any ordering of other reads or writes (even from the same thread) with respect to the atomic read/write. I think you need acquire/release ordering?
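
For illustration, here is the pattern I mean as a minimal C11 sketch; the entry_t struct and the publish/consume names are made up, not the actual blas_server.c code. With a relaxed store the consumer can observe the queue pointer while the writes to sa/sb are still invisible to it; a release store paired with an acquire load forbids exactly that.

#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a work-queue entry. */
typedef struct {
    void *sa;
    void *sb;
} entry_t;

static entry_t slot;
static _Atomic(entry_t *) queue = NULL;

/* Producer: fill the entry, then publish the pointer. */
void publish(void *sa) {
    slot.sa = sa;
    slot.sb = NULL;
    /* Release: the two writes above become visible to any thread that
       acquire-loads `queue` and sees &slot. With memory_order_relaxed
       here nothing orders them, which is the race TSAN reports. */
    atomic_store_explicit(&queue, &slot, memory_order_release);
}

/* Consumer: acquire-load the pointer, then read through it. */
void *consume(void) {
    entry_t *q;
    while ((q = atomic_load_explicit(&queue, memory_order_acquire)) == NULL)
        ; /* spin until published */
    return q->sa; /* race-free: ordered after the release store */
}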

vfdev-5 (Author) commented Feb 27, 2025

> The tsan log does not make much sense to me. Can't say I'm much happier with helgrind though, as it flags a couple of atomic operations in blas_server.c and "possible races" between Fortran and C accessing the work array.

My impression is that the race reported as

WARNING: ThreadSanitizer: data race (pid=1458024)
  Read of size 8 at 0x7fffffff7408 by thread T1:
    #0 ger_kernel /project/OpenBLAS/driver/level2/ger_thread.c:61:24 (openblas_utest+0x26e79d) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
    #1 exec_threads /project/OpenBLAS/driver/others/blas_server.c:1148:3 (openblas_utest+0x14a05a) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
    #2 blas_thread_server /project/OpenBLAS/driver/others/blas_server.c:463:5 (openblas_utest+0x1498f7) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)

  Previous write of size 8 at 0x7fffffff7408 by main thread:
    #0 dger_thread /project/OpenBLAS/driver/level2/ger_thread.c:148:10 (openblas_utest+0x26e56f) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
    #1 dger_ /project/OpenBLAS/interface/ger.c:196:5 (openblas_utest+0x26e499) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
    #2 dlarf_ /project/OpenBLAS/lapack-netlib/SRC/dlarf.c (openblas_utest+0x26e14d) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
    #3 dgebd2_ /project/OpenBLAS/lapack-netlib/SRC/dgebd2.c:779:3 (openblas_utest+0x26d59f) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
   ...

between:

x = (FLOAT *)args -> a;

and
args.a = (void *)x;

looks real. In the level-3 code I saw a static pthread_mutex_t level3_lock used for a similar situation:
pthread_mutex_lock(&level3_lock);

I'm also not familiar with OpenBLAS code, so may be wrong as well.
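
For illustration, a coarse-grained fix along the lines of that level-3 lock might look like the sketch below; level2_lock, dispatch_ger, and ger_thread_serialized are made-up names, not the actual OpenBLAS code.

#include <pthread.h>

/* Hypothetical dispatch routine standing in for the real worker setup. */
extern void dispatch_ger(void *args);

/* Mirrors the static level3_lock pattern quoted above. */
static pthread_mutex_t level2_lock = PTHREAD_MUTEX_INITIALIZER;

void ger_thread_serialized(void *args) {
    pthread_mutex_lock(&level2_lock);
    /* Only one caller at a time fills the shared args/queue structures
       and hands them to the worker threads. */
    dispatch_ger(args);
    pthread_mutex_unlock(&level2_lock);
}

The trade-off is that a mutex serializes whole calls, whereas an acquire/release fix keeps them concurrent.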

martin-frbg (Collaborator) commented:

@vfdev-5 thanks. Unfortunately I'm struggling to get anything at all OpenBLAS-wise done right now, so please allow for some delays and premature comments. Also most of that is legacy GotoBLAS code that maybe got more trust than it deserves.

@hawkinsp indeed I appear to have overlooked the ATOMIC_RELAXED qualifier - or trusted the submitter to know what he was doing - when I merged the relevant PR, which would seem to have (re)introduced a race wherever C11 semantics are available (kind of tracks with newer compiler versions exposing this, I guess). Changing to ATOMIC_ACQ_REL seems to have plugged the ARMV6 hole of #5104, not sure if there is a safe middle ground performance-wise.

hawkinsp (Contributor) commented:

> @hawkinsp indeed I appear to have overlooked the ATOMIC_RELAXED qualifier - or trusted the submitter to know what he was doing - when I merged the relevant PR, which would seem to have (re)introduced a race wherever C11 semantics are available (kind of tracks with newer compiler versions exposing this, I guess). Changing to ATOMIC_ACQ_REL seems to have plugged the ARMV6 hole of #5104, not sure if there is a safe middle ground performance-wise.

And indeed it would make perfect sense that such a problem only shows up on ARM, which has a weaker memory model than, say, x86.

hawkinsp (Contributor) commented:

> @hawkinsp indeed I appear to have overlooked the ATOMIC_RELAXED qualifier - or trusted the submitter to know what he was doing - when I merged the relevant PR, which would seem to have (re)introduced a race wherever C11 semantics are available (kind of tracks with newer compiler versions exposing this, I guess). Changing to ATOMIC_ACQ_REL seems to have plugged the ARMV6 hole of #5104, not sure if there is a safe middle ground performance-wise.

I suspect it's enough to use ATOMIC_ACQUIRE on the reads and ATOMIC_RELEASE on the writes, but this is hard to be sure of from a drive-by read of the code!
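
Concretely, in GCC/Clang __atomic builtin terms that split could look like the sketch below; work_t, flag, READY and the function names are made up, and the real blas_queue_t layout differs.

enum { IDLE = 0, READY = 1 };

/* Hypothetical work item. */
typedef struct {
    void *sa;
    void *sb;
    int flag;
} work_t;

/* Writer: the release store makes the sa/sb writes visible to any
   thread whose acquire load of flag observes READY. */
void post_work(work_t *w, void *sa, void *sb) {
    w->sa = sa;
    w->sb = sb;
    __atomic_store_n(&w->flag, READY, __ATOMIC_RELEASE);
}

/* Reader: pairs with the release store; afterwards w->sa and w->sb
   can be read race-free. */
void wait_work(work_t *w) {
    while (__atomic_load_n(&w->flag, __ATOMIC_ACQUIRE) != READY)
        ; /* spin */
}

On weakly ordered targets like ARM this is typically cheaper than making every access a full __ATOMIC_ACQ_REL read-modify-write.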
