
New ThreadPool + thread pool facade#6224

Open
mzient wants to merge 20 commits into NVIDIA:main from mzient:ThreadPoolFacade

Conversation

@mzient
Contributor

@mzient commented Feb 23, 2026

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

  • ThreadPool is now the name of an abstract interface.
  • The old implementation is renamed to OldThreadPool; it implements the ThreadPool interface.
  • A NewThreadPool is added, which inherits from ThreadPoolBase (the new implementation) and DOES NOT implement ThreadPool.
  • A ThreadPoolFacade wraps a pointer to a ThreadPoolBase and a list of Jobs.

In the executor, the environment variable DALI_USE_NEW_THREAD_POOL is checked; when it is set to 1, the new thread pool is used and each operator is given its own ThreadPoolFacade. This allows all operators to execute in parallel, because they now add tasks to separate Job objects (even though the jobs are executed in the same thread pool). DALI_USE_NEW_THREAD_POOL is also checked when restricting the parallelism policy (when not set, CPU operators are never parallelized).
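For illustration, the env-var gate could look like the following minimal sketch. The helper name UseNewThreadPool matches the one quoted in the review comments below, but the body here is an assumption, not the actual DALI code:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hedged sketch of the DALI_USE_NEW_THREAD_POOL gate described above.
// Gating on the exact value "1" mirrors the description ("when it's set
// to 1, the new thread pool is used").
inline bool UseNewThreadPool() {
  const char *v = std::getenv("DALI_USE_NEW_THREAD_POOL");
  return v != nullptr && std::strcmp(v, "1") == 0;
}
```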

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

New QA test script

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@dali-automaton
Collaborator

CI MESSAGE: [44656169]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44667089]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44667089]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44656169]: BUILD PASSED

@dali-automaton
Collaborator

CI MESSAGE: [44698015]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44698471]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44717682]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44717682]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44719838]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44698471]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44719838]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44719838]: BUILD PASSED

@mzient force-pushed the ThreadPoolFacade branch 4 times, most recently from 193e754 to e485702 (March 5, 2026 14:47)
@mzient changed the title from "Thread pool facade" to "New ThreadPool + thread pool facade" on Mar 5, 2026
@dali-automaton
Collaborator

CI MESSAGE: [45440403]: BUILD STARTED

@mzient marked this pull request as ready for review on March 5, 2026 16:34
@greptile-apps
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR refactors DALI's thread pool infrastructure: ThreadPool becomes an abstract interface, the original implementation is renamed to OldThreadPool, a new NewThreadPool (built on ThreadPoolBase) is added alongside a ThreadPoolFacade adapter, and the executor (exec2) is wired to use the new pool when DALI_USE_NEW_THREAD_POOL=1 is set. The key benefit is that each CPU operator gets its own ThreadPoolFacade wrapping a shared NewThreadPool, enabling all operators to run in parallel (each submitting to separate Job objects) rather than being serialized through a single OldThreadPool.
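The facade idea summarized above can be illustrated with a small, self-contained sketch. The class names mirror the PR, but the bodies are invented for illustration (the real Job/ThreadPoolBase API is richer): several facades share one underlying pool, and each facade accumulates work in its own job before submitting it.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-in for ThreadPoolBase: a shared task queue.
struct ThreadPoolBaseLike {
  void AddTask(std::function<void()> t) { tasks.push_back(std::move(t)); }
  void RunAllInline() {  // real pool runs tasks on worker threads
    for (auto &t : tasks) t();
    tasks.clear();
  }
  std::vector<std::function<void()>> tasks;
};

// Illustrative stand-in for ThreadPoolFacade: per-operator job list
// that submits into the shared pool.
struct FacadeLike {
  explicit FacadeLike(ThreadPoolBaseLike *tp) : tp_(tp) {}
  void AddWork(std::function<void()> w) { job_.push_back(std::move(w)); }
  void RunAll() {
    for (auto &w : job_) tp_->AddTask(std::move(w));
    job_.clear();
  }
  ThreadPoolBaseLike *tp_;               // non-owning, as in the PR
  std::vector<std::function<void()>> job_;
};
```

Because each operator talks to its own facade (its own job), two operators can submit concurrently without serializing on a shared work queue, which is the parallelism win described above.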

Key observations from this review cycle:

  • All previously identified correctness bugs (destruction order, device_id_ never stored, MultipleErrors not thrown, WaitCompleted() copy-paste, double-wait in multi-job path) have been fixed.
  • The multi-job RunAll/WaitForWork contract is well-defined: RunAll(false) must be paired with a subsequent WaitForWork(), and the destructor intentionally terminates if this contract is violated.
  • Member declaration order in Executor2::Impl (thread_pool_wrappers_ after new_tp_) correctly ensures facades are destroyed before the underlying pool both at explicit SetupThreadPool() reset and at Impl destruction.
  • The debug std::cerr on the new-pool path is acknowledged as intentional for CI validation and will be removed before the final merge.
  • NVML affinity initialization for CPU-only pools remains conditioned on device_id.has_value() rather than set_affinity, which is a behavioral difference from OldThreadPool that is worth tracking.
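The NVML guard asymmetry flagged in the last bullet can be reduced to two predicates (function names are invented; the conditions follow the review's description of OldThreadPool vs. the new pool):

```cpp
#include <cassert>
#include <optional>

// Per the review: OldThreadPool initializes NVML whenever affinity is
// requested, while the new pool additionally requires a known device id.
bool OldPoolInitsNvml(bool set_affinity, std::optional<int> /*device_id*/) {
  return set_affinity;
}
bool NewPoolInitsNvml(bool set_affinity, std::optional<int> device_id) {
  return set_affinity && device_id.has_value();
}
```

For a CPU-only pool (no device id) with set_affinity=true, the old pool initializes NVML and the new one does not; that is the behavioral difference worth tracking.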

Confidence Score: 4/5

  • Safe to merge with the new thread pool path remaining experimental (env-var gated); the default path is unchanged and all correctness issues from prior review rounds have been addressed.
  • All critical bugs identified in previous review rounds (destruction order, device_id_ assignment, MultipleErrors throw, WaitCompleted accessor, double-wait) are fixed. The new RunAll/WaitForWork contract is internally consistent and intentional. The one remaining point to watch before a public release is the NVML affinity initialization asymmetry for CPU-only pools with set_affinity=true (conditioned on device_id.has_value() rather than set_affinity, unlike OldThreadPool), and the debug std::cerr which is confirmed intentional for CI but must be removed before a stable release. Neither blocks merging as an experimental feature.
  • dali/pipeline/util/new_thread_pool.cc — NVML affinity guard condition; dali/pipeline/executor/executor2/exec2.cc — debug std::cerr to remove before public release

Important Files Changed

Filename Overview
include/dali/core/exec/thread_pool_base.h Minor visibility change: Started(), WaitStarted(), WaitCompleted(), and IsCooperative() moved from protected to public on JobBase. The WaitCompleted() copy-paste bug (previously returning wait_started_) is now corrected — it returns wait_completed_.
dali/pipeline/util/thread_pool_interface.h New file introducing the abstract ThreadPool interface with pure virtual AddWork, RunAll, WaitForWork, NumThreads, and GetThreadIds. Clean, minimal design inheriting from ThisThreadIdx.
dali/pipeline/util/new_thread_pool.h New file declaring NewThreadPool (extends ThreadPoolBase) and ThreadPoolFacade (extends ThreadPool interface). One lingering docstring typo on line 93 (systsm-specific). The raw pointer lifetime contract for tp_ is documented in the constructor Doxygen. Overall design is sound.
dali/pipeline/util/new_thread_pool.cc Full implementation of NewThreadPool and ThreadPoolFacade. device_id_ is now properly assigned (previous bug fixed). MultipleErrors is now thrown correctly. The multi-job RunAll/WaitForWork logic is correct. A debug std::cerr remains (acknowledged as intentional for CI). NVML initialization for CPU-only pools with set_affinity=true is conditioned on device_id.has_value() rather than set_affinity.
dali/pipeline/util/thread_pool.h Renamed ThreadPool → OldThreadPool : public ThreadPool. Work type changed to void() and WorkWithThreadIdx introduced for void(int). WaitForWork() public override now delegates to private WaitForWork(bool). Clean migration to the new interface hierarchy.
dali/pipeline/util/thread_pool.cc All OldThreadPool methods correctly updated. The AddWork(Work) overload wraps the no-arg callable with a thread-index-taking lambda (using std::move to avoid unnecessary copies). ThreadMain sets this_thread_idx_ for proper thread-index tracking.
dali/pipeline/executor/executor2/exec2.cc New SetupThreadPool() branches on UseNewThreadPool(): creates NewThreadPool + per-operator ThreadPoolFacade wrappers when enabled, falls back to OldThreadPool otherwise. Destruction order is correct (thread_pool_wrappers_ cleared before new_tp_ reset in SetupThreadPool, and member declaration order ensures correct implicit destruction). Debug std::cerr is acknowledged as intentional for CI. ApplyConcurrencyLimit correctly skips CPU serialization when the new thread pool is active.
dali/pipeline/util/thread_pool_test.cc Tests updated to use OldThreadPool directly; test names updated accordingly. No tests for NewThreadPool or ThreadPoolFacade — integration coverage provided by the new QA script.
qa/TL0_python-self-test-core-newtp/test.sh New QA test: sets DALI_USE_NEW_THREAD_POOL=1, changes into the core test directory, and runs the existing test suite. Correctly uses ./test.sh after pushd.
qa/TL1_decoder_perf/test.sh Adds new-thread-pool benchmark runs (LOG1_TP, LOG2_TP) and relative performance checks (PERF_RESULT1_TP, PERF_RESULT2_TP) within 2% of the old-pool baseline. Previous log-file naming bugs (GraceHopper branch) and redundant absolute-value assignments have been corrected.
dali/operators/video/input/video_input.h thread_pool_ changed from std::optional<ThreadPool> (non-constructible now that ThreadPool is abstract) to std::unique_ptr<ThreadPool> holding an OldThreadPool. Construction and assertion updated accordingly — semantically equivalent.
dali/test/python/auto_aug/test_auto_augment.py Adds OperatorConcurrency.FULL concurrency setting to auto-augmentation pipeline tests, ensuring the new parallel-operator capability is exercised in the Python test suite.
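The QA wrapper listed in the table above (qa/TL0_python-self-test-core-newtp/test.sh) can be sketched roughly as follows; the pushd target and delegation step are assumptions based on the review's description, shown as a comment so the sketch stays runnable:

```shell
# Sketch of the new QA wrapper: enable the experimental pool, then
# delegate to the existing core test suite.
export DALI_USE_NEW_THREAD_POOL=1
# Actual run step (directory name assumed from the review table):
#   pushd ../TL0_python-self-test-core && ./test.sh && popd
echo "DALI_USE_NEW_THREAD_POOL=$DALI_USE_NEW_THREAD_POOL"
```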

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class ThisThreadIdx {
        +this_thread_idx() int
    }

    class ThreadPool {
        <<abstract interface>>
        +AddWork(void(int), priority) void
        +AddWork(void(), priority) void
        +RunAll(wait) void
        +WaitForWork() void
        +NumThreads() int
        +GetThreadIds() vector
    }

    class OldThreadPool {
        +AddWork(WorkWithThreadIdx, priority) void
        +AddWork(Work, priority) void
        +RunAll(wait) void
        +WaitForWork() void
        -WaitForWork(checkForErrors) void
        -ThreadMain(thread_id, ...) void
    }

    class ThreadPoolBase {
        +Init(num_threads, on_thread_start) void
        +AddTask(TaskFunc) void
        +NumThreads() int
        +GetThreadIds() vector
        #Shutdown(join) void
    }

    class NewThreadPool {
        +NewThreadPool(num_threads, device_id, set_affinity, name)
        -OnThreadStart(thread_idx, set_affinity) any
        -device_id_ optional~int~
        -name_ string
        -nvml_handle_ NvmlInstance
    }

    class ThreadPoolFacade {
        +AddWork(void(int), priority) void
        +AddWork(void(), priority) void
        +RunAll(wait) void
        +WaitForWork() void
        +NumThreads() int
        +GetThreadIds() vector
        -tp_* ThreadPoolBase
        -jobs_ list~Job~
    }

    class Executor2Impl {
        -old_tp_ unique_ptr~OldThreadPool~
        -new_tp_ unique_ptr~NewThreadPool~
        -thread_pool_wrappers_ vector~unique_ptr~ThreadPool~~
        +SetupThreadPool() void
    }

    ThisThreadIdx <|-- ThreadPool
    ThisThreadIdx <|-- ThreadPoolBase
    ThreadPool <|-- OldThreadPool
    ThreadPool <|-- ThreadPoolFacade
    ThreadPoolBase <|-- NewThreadPool
    ThreadPoolFacade --> ThreadPoolBase : delegates via tp_*
    Executor2Impl --> NewThreadPool : owns (DALI_USE_NEW_THREAD_POOL=1)
    Executor2Impl --> OldThreadPool : owns (default)
    Executor2Impl --> ThreadPoolFacade : owns N facades (one per CPU op)

Last reviewed commit: 9f0e8d7

Comment on lines +353 to +355
if (UseNewThreadPool()) {
std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
if (graph_info_.num_cpu > 0) {
Contributor (greptile-apps):

Debug std::cerr left in production code

This diagnostic print was presumably added during development but should not be shipped:

Suggested change
- if (UseNewThreadPool()) {
-   std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
-   if (graph_info_.num_cpu > 0) {
+ if (UseNewThreadPool()) {
+   if (graph_info_.num_cpu > 0) {

Remove the std::cerr line entirely — it will spam every user's stderr whenever they set DALI_USE_NEW_THREAD_POOL=1.

Contributor Author (@mzient):

@greptileai This will be removed before merging, but is necessary to validate that the flag is properly set and used in CI.

Contributor (greptile-apps):

Got it! That makes sense for CI validation. Consider adding a TODO comment on that line so it's clear it needs to be removed:

Suggested change
- if (UseNewThreadPool()) {
-   std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
-   if (graph_info_.num_cpu > 0) {
+ // TODO: Remove before merge - only for CI validation
+ std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;

This way it won't accidentally slip through code review.

@dali-automaton
Collaborator

CI MESSAGE: [45440403]: BUILD FAILED

@mzient force-pushed the ThreadPoolFacade branch from 85e5c27 to a9ec7fa (March 9, 2026 10:17)
@dali-automaton
Collaborator

CI MESSAGE: [45700206]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [45700206]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [45719366]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [45791241]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [45791241]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [45785260]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [45785260]: BUILD PASSED

vid_files = ["sintel_trailer-720p_2.mp4"]
vid_filenames = [os.path.join(vid_dir, vid_file) for vid_file in vid_files]

concurrency = OperatorConcurrency.FULL
Contributor Author (@mzient):

This enables testing of the parallel execution of CPU operators.

Collaborator:

Is it intentional that only FULL concurrency is tested with the auto augment tests? I would expect to exercise at least default concurrency as well.


virtual void AddWork(std::function<void(int)> work, int64_t priority = 0) = 0;

virtual void AddWork(std::function<void()> work, int64_t priority = 0) = 0;
Contributor Author (@mzient):

This is for future refactoring: most of the time we don't need the thread index, and the new thread pool doesn't provide one to the callback. Supporting it adds one more layer of function wrapping, which is an avoidable cost. When the new thread pool becomes the default, we'll refactor the code to remove the thread_idx parameter from task functions.
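The wrapping cost mentioned above can be shown with a compilable sketch (not the actual DALI code; the class here runs work inline instead of on worker threads). The no-arg AddWork overload wraps the callable in a thread-index-taking lambda, which is the extra indirection the new pool avoids:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch of the two AddWork overloads discussed above.
class DemoPool {
 public:
  using WorkWithThreadIdx = std::function<void(int)>;
  using Work = std::function<void()>;

  void AddWork(WorkWithThreadIdx work) { tasks_.push_back(std::move(work)); }

  void AddWork(Work work) {
    // Extra wrapping layer: adapt void() to void(int) by discarding
    // the thread index.
    AddWork([w = std::move(work)](int /*thread_idx*/) { w(); });
  }

  void RunAllInline() {  // stand-in for running on worker threads
    for (auto &t : tasks_) t(/*thread_idx=*/0);
    tasks_.clear();
  }

 private:
  std::vector<WorkWithThreadIdx> tasks_;
};
```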

@dali-automaton
Collaborator

CI MESSAGE: [45791241]: BUILD PASSED

mzient and others added 20 commits March 11, 2026 15:38
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
----
Signed-off-by: Michał Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@mzient force-pushed the ThreadPoolFacade branch from 01a53c0 to 9f0e8d7 (March 11, 2026 14:40)

/// CPU operators have default Thread Pool inside Workspace. Mixed and GPU ops don't.
- std::optional<ThreadPool> thread_pool_ = std::nullopt;
+ std::unique_ptr<ThreadPool> thread_pool_;

Collaborator:

Suggested change
- std::unique_ptr<ThreadPool> thread_pool_;
+ std::unique_ptr<OldThreadPool> thread_pool_;
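The thread_pool_ member change in video_input.h can be condensed into a compilable sketch. Class names mirror the PR; the bodies are illustrative. The point is that once ThreadPool is abstract, std::optional<ThreadPool> no longer compiles, so the member becomes a unique_ptr holding a concrete implementation:

```cpp
#include <cassert>
#include <memory>

// Minimal stand-ins (not the real DALI classes).
struct ThreadPool {
  virtual ~ThreadPool() = default;
  virtual int NumThreads() const = 0;  // abstract: no direct construction
};

struct OldThreadPool : ThreadPool {
  explicit OldThreadPool(int n) : n_(n) {}
  int NumThreads() const override { return n_; }
  int n_;
};

struct VideoInputLike {
  // Before: std::optional<ThreadPool> thread_pool_;  // ill-formed now
  std::unique_ptr<ThreadPool> thread_pool_;

  void Setup(int num_threads) {
    thread_pool_ = std::make_unique<OldThreadPool>(num_threads);
  }
};
```

Holding the pointer as unique_ptr<OldThreadPool> (per the reviewer's suggestion) would make the concrete type explicit at the cost of losing the ability to swap implementations later.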
