
New ThreadPool + thread pool facade#6224

Open
mzient wants to merge 20 commits into NVIDIA:main from mzient:ThreadPoolFacade

Conversation

@mzient
Contributor

@mzient commented Feb 23, 2026

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

  • ThreadPool is now the name of an abstract interface.
  • The old implementation is renamed to OldThreadPool; it implements the ThreadPool interface.
  • A NewThreadPool is added, which inherits from ThreadPoolBase (the new implementation) and DOES NOT implement ThreadPool.
  • A ThreadPoolFacade wraps a pointer to a ThreadPoolBase and a list of Jobs.

In the executor, the environment variable DALI_USE_NEW_THREAD_POOL is checked; when it is set to 1, the new thread pool is used and each operator is given its own ThreadPoolFacade. This allows all operators to execute in parallel, because they now add tasks to separate Job objects (even though the jobs are executed in the same thread pool). DALI_USE_NEW_THREAD_POOL is also checked when restricting the parallelism policy (when not set, CPU operators are never parallelized).
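For illustration, the env-var gate could look like the following minimal sketch. The helper name UseNewThreadPool matches the one quoted in the review comments below, but the body here is an assumption, not the actual DALI code:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hedged sketch of the DALI_USE_NEW_THREAD_POOL gate described above.
// Gating on the exact value "1" mirrors the description ("when it's set
// to 1, the new thread pool is used").
inline bool UseNewThreadPool() {
  const char *v = std::getenv("DALI_USE_NEW_THREAD_POOL");
  return v != nullptr && std::strcmp(v, "1") == 0;
}
```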

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

New QA test script

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@dali-automaton
Collaborator

CI MESSAGE: [44656169]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44667089]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44667089]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44656169]: BUILD PASSED

@dali-automaton
Collaborator

CI MESSAGE: [44698015]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44698471]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44717682]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44717682]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44719838]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [44698471]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44719838]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [44719838]: BUILD PASSED

@mzient force-pushed the ThreadPoolFacade branch 4 times, most recently from 193e754 to e485702 (March 5, 2026 14:47)
@mzient changed the title from "Thread pool facade" to "New ThreadPool + thread pool facade" on Mar 5, 2026
@dali-automaton
Collaborator

CI MESSAGE: [45440403]: BUILD STARTED

@mzient marked this pull request as ready for review on March 5, 2026 16:34
@greptile-apps
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR refactors DALI's thread pool infrastructure: ThreadPool becomes an abstract interface, the original implementation is renamed to OldThreadPool, a new NewThreadPool (built on ThreadPoolBase) is added alongside a ThreadPoolFacade adapter, and the executor (exec2) is wired to use the new pool when DALI_USE_NEW_THREAD_POOL=1 is set. The key benefit is that each CPU operator gets its own ThreadPoolFacade wrapping a shared NewThreadPool, enabling all operators to run in parallel (each submitting to separate Job objects) rather than being serialized through a single OldThreadPool.
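The facade idea summarized above can be illustrated with a small, self-contained sketch. The class names mirror the PR, but the bodies are invented for illustration (the real Job/ThreadPoolBase API is richer): several facades share one underlying pool, and each facade accumulates work in its own job before submitting it.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-in for ThreadPoolBase: a shared task queue.
struct ThreadPoolBaseLike {
  void AddTask(std::function<void()> t) { tasks.push_back(std::move(t)); }
  void RunAllInline() {  // real pool runs tasks on worker threads
    for (auto &t : tasks) t();
    tasks.clear();
  }
  std::vector<std::function<void()>> tasks;
};

// Illustrative stand-in for ThreadPoolFacade: per-operator job list
// that submits into the shared pool.
struct FacadeLike {
  explicit FacadeLike(ThreadPoolBaseLike *tp) : tp_(tp) {}
  void AddWork(std::function<void()> w) { job_.push_back(std::move(w)); }
  void RunAll() {
    for (auto &w : job_) tp_->AddTask(std::move(w));
    job_.clear();
  }
  ThreadPoolBaseLike *tp_;               // non-owning, as in the PR
  std::vector<std::function<void()>> job_;
};
```

Because each operator talks to its own facade (its own job), two operators can submit concurrently without serializing on a shared work queue, which is the parallelism win described above.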

Key observations from this review cycle:

  • All previously identified correctness bugs (destruction order, device_id_ never stored, MultipleErrors not thrown, WaitCompleted() copy-paste, double-wait in multi-job path) have been fixed.
  • The multi-job RunAll/WaitForWork contract is well-defined: RunAll(false) must be paired with a subsequent WaitForWork(), and the destructor intentionally terminates if this contract is violated.
  • Member declaration order in Executor2::Impl (thread_pool_wrappers_ after new_tp_) correctly ensures facades are destroyed before the underlying pool both at explicit SetupThreadPool() reset and at Impl destruction.
  • The debug std::cerr on the new-pool path is acknowledged as intentional for CI validation and will be removed before the final merge.
  • NVML affinity initialization for CPU-only pools remains conditioned on device_id.has_value() rather than set_affinity, which is a behavioral difference from OldThreadPool that is worth tracking.
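The NVML guard asymmetry flagged in the last bullet can be reduced to two predicates (function names are invented; the conditions follow the review's description of OldThreadPool vs. the new pool):

```cpp
#include <cassert>
#include <optional>

// Per the review: OldThreadPool initializes NVML whenever affinity is
// requested, while the new pool additionally requires a known device id.
bool OldPoolInitsNvml(bool set_affinity, std::optional<int> /*device_id*/) {
  return set_affinity;
}
bool NewPoolInitsNvml(bool set_affinity, std::optional<int> device_id) {
  return set_affinity && device_id.has_value();
}
```

For a CPU-only pool (no device id) with set_affinity=true, the old pool initializes NVML and the new one does not; that is the behavioral difference worth tracking.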

Confidence Score: 4/5

  • Safe to merge with the new thread pool path remaining experimental (env-var gated); the default path is unchanged and all correctness issues from prior review rounds have been addressed.
  • All critical bugs identified in previous review rounds (destruction order, device_id_ assignment, MultipleErrors throw, WaitCompleted accessor, double-wait) are fixed. The new RunAll/WaitForWork contract is internally consistent and intentional. The one remaining point to watch before a public release is the NVML affinity initialization asymmetry for CPU-only pools with set_affinity=true (conditioned on device_id.has_value() rather than set_affinity, unlike OldThreadPool), and the debug std::cerr which is confirmed intentional for CI but must be removed before a stable release. Neither blocks merging as an experimental feature.
  • dali/pipeline/util/new_thread_pool.cc — NVML affinity guard condition; dali/pipeline/executor/executor2/exec2.cc — debug std::cerr to remove before public release

Important Files Changed

Filename Overview
include/dali/core/exec/thread_pool_base.h Minor visibility change: Started(), WaitStarted(), WaitCompleted(), and IsCooperative() moved from protected to public on JobBase. The WaitCompleted() copy-paste bug (previously returning wait_started_) is now corrected — it returns wait_completed_.
dali/pipeline/util/thread_pool_interface.h New file introducing the abstract ThreadPool interface with pure virtual AddWork, RunAll, WaitForWork, NumThreads, and GetThreadIds. Clean, minimal design inheriting from ThisThreadIdx.
dali/pipeline/util/new_thread_pool.h New file declaring NewThreadPool (extends ThreadPoolBase) and ThreadPoolFacade (extends ThreadPool interface). One lingering docstring typo on line 93 (systsm-specific). The raw pointer lifetime contract for tp_ is documented in the constructor Doxygen. Overall design is sound.
dali/pipeline/util/new_thread_pool.cc Full implementation of NewThreadPool and ThreadPoolFacade. device_id_ is now properly assigned (previous bug fixed). MultipleErrors is now thrown correctly. The multi-job RunAll/WaitForWork logic is correct. A debug std::cerr remains (acknowledged as intentional for CI). NVML initialization for CPU-only pools with set_affinity=true is conditioned on device_id.has_value() rather than set_affinity.
dali/pipeline/util/thread_pool.h Renamed ThreadPool → OldThreadPool : public ThreadPool. Work type changed to void() and WorkWithThreadIdx introduced for void(int). WaitForWork() public override now delegates to private WaitForWork(bool). Clean migration to the new interface hierarchy.
dali/pipeline/util/thread_pool.cc All OldThreadPool methods correctly updated. The AddWork(Work) overload wraps the no-arg callable with a thread-index-taking lambda (using std::move to avoid unnecessary copies). ThreadMain sets this_thread_idx_ for proper thread-index tracking.
dali/pipeline/executor/executor2/exec2.cc New SetupThreadPool() branches on UseNewThreadPool(): creates NewThreadPool + per-operator ThreadPoolFacade wrappers when enabled, falls back to OldThreadPool otherwise. Destruction order is correct (thread_pool_wrappers_ cleared before new_tp_ reset in SetupThreadPool, and member declaration order ensures correct implicit destruction). Debug std::cerr is acknowledged as intentional for CI. ApplyConcurrencyLimit correctly skips CPU serialization when the new thread pool is active.
dali/pipeline/util/thread_pool_test.cc Tests updated to use OldThreadPool directly; test names updated accordingly. No tests for NewThreadPool or ThreadPoolFacade — integration coverage provided by the new QA script.
qa/TL0_python-self-test-core-newtp/test.sh New QA test: sets DALI_USE_NEW_THREAD_POOL=1, changes into the core test directory, and runs the existing test suite. Correctly uses ./test.sh after pushd.
qa/TL1_decoder_perf/test.sh Adds new-thread-pool benchmark runs (LOG1_TP, LOG2_TP) and relative performance checks (PERF_RESULT1_TP, PERF_RESULT2_TP) within 2% of the old-pool baseline. Previous log-file naming bugs (GraceHopper branch) and redundant absolute-value assignments have been corrected.
dali/operators/video/input/video_input.h thread_pool_ changed from std::optional<ThreadPool> (non-constructible now that ThreadPool is abstract) to std::unique_ptr<ThreadPool> holding an OldThreadPool. Construction and assertion updated accordingly — semantically equivalent.
dali/test/python/auto_aug/test_auto_augment.py Adds OperatorConcurrency.FULL concurrency setting to auto-augmentation pipeline tests, ensuring the new parallel-operator capability is exercised in the Python test suite.
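The QA wrapper listed in the table above (qa/TL0_python-self-test-core-newtp/test.sh) can be sketched roughly as follows; the pushd target and delegation step are assumptions based on the review's description, shown as a comment so the sketch stays runnable:

```shell
# Sketch of the new QA wrapper: enable the experimental pool, then
# delegate to the existing core test suite.
export DALI_USE_NEW_THREAD_POOL=1
# Actual run step (directory name assumed from the review table):
#   pushd ../TL0_python-self-test-core && ./test.sh && popd
echo "DALI_USE_NEW_THREAD_POOL=$DALI_USE_NEW_THREAD_POOL"
```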

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class ThisThreadIdx {
        +this_thread_idx() int
    }

    class ThreadPool {
        <<abstract interface>>
        +AddWork(void(int), priority) void
        +AddWork(void(), priority) void
        +RunAll(wait) void
        +WaitForWork() void
        +NumThreads() int
        +GetThreadIds() vector
    }

    class OldThreadPool {
        +AddWork(WorkWithThreadIdx, priority) void
        +AddWork(Work, priority) void
        +RunAll(wait) void
        +WaitForWork() void
        -WaitForWork(checkForErrors) void
        -ThreadMain(thread_id, ...) void
    }

    class ThreadPoolBase {
        +Init(num_threads, on_thread_start) void
        +AddTask(TaskFunc) void
        +NumThreads() int
        +GetThreadIds() vector
        #Shutdown(join) void
    }

    class NewThreadPool {
        +NewThreadPool(num_threads, device_id, set_affinity, name)
        -OnThreadStart(thread_idx, set_affinity) any
        -device_id_ optional~int~
        -name_ string
        -nvml_handle_ NvmlInstance
    }

    class ThreadPoolFacade {
        +AddWork(void(int), priority) void
        +AddWork(void(), priority) void
        +RunAll(wait) void
        +WaitForWork() void
        +NumThreads() int
        +GetThreadIds() vector
        -tp_* ThreadPoolBase
        -jobs_ list~Job~
    }

    class Executor2Impl {
        -old_tp_ unique_ptr~OldThreadPool~
        -new_tp_ unique_ptr~NewThreadPool~
        -thread_pool_wrappers_ vector~unique_ptr~ThreadPool~~
        +SetupThreadPool() void
    }

    ThisThreadIdx <|-- ThreadPool
    ThisThreadIdx <|-- ThreadPoolBase
    ThreadPool <|-- OldThreadPool
    ThreadPool <|-- ThreadPoolFacade
    ThreadPoolBase <|-- NewThreadPool
    ThreadPoolFacade --> ThreadPoolBase : delegates via tp_*
    Executor2Impl --> NewThreadPool : owns (DALI_USE_NEW_THREAD_POOL=1)
    Executor2Impl --> OldThreadPool : owns (default)
    Executor2Impl --> ThreadPoolFacade : owns N facades (one per CPU op)

Last reviewed commit: 9f0e8d7

Comment on lines +353 to +355
if (UseNewThreadPool()) {
std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
if (graph_info_.num_cpu > 0) {
Contributor (greptile-apps):

Debug std::cerr left in production code

This diagnostic print was presumably added during development but should not be shipped:

Suggested change
- if (UseNewThreadPool()) {
-   std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
-   if (graph_info_.num_cpu > 0) {
+ if (UseNewThreadPool()) {
+   if (graph_info_.num_cpu > 0) {

Remove the std::cerr line entirely — it will spam every user's stderr whenever they set DALI_USE_NEW_THREAD_POOL=1.

Contributor Author (@mzient):

@greptileai This will be removed before merging, but is necessary to validate that the flag is properly set and used in CI.

Contributor (greptile-apps):

Got it! That makes sense for CI validation. Consider adding a TODO comment on that line so it's clear it needs to be removed:

Suggested change
- if (UseNewThreadPool()) {
-   std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
-   if (graph_info_.num_cpu > 0) {
+ // TODO: Remove before merge - only for CI validation
+ std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;

This way it won't accidentally slip through code review.

@dali-automaton
Collaborator

CI MESSAGE: [45440403]: BUILD FAILED

@mzient force-pushed the ThreadPoolFacade branch from 85e5c27 to a9ec7fa (March 9, 2026 10:17)
@dali-automaton
Collaborator

CI MESSAGE: [45700206]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [45700206]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [45719366]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [45791241]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [45791241]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [45785260]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [45785260]: BUILD PASSED

vid_files = ["sintel_trailer-720p_2.mp4"]
vid_filenames = [os.path.join(vid_dir, vid_file) for vid_file in vid_files]

concurrency = OperatorConcurrency.FULL
Contributor Author (@mzient):

This enables testing of the parallel execution of CPU operators.

Collaborator:

Is it intentional that only FULL concurrency is tested with the auto augment tests? I would expect to exercise at least default concurrency as well.


virtual void AddWork(std::function<void(int)> work, int64_t priority = 0) = 0;

virtual void AddWork(std::function<void()> work, int64_t priority = 0) = 0;
Contributor Author (@mzient):

This is for future refactoring: most of the time we don't need the thread index, and the new thread pool doesn't provide one to the callback. Supporting it adds one more layer of function wrapping, which is an avoidable cost. When the new thread pool becomes the default, we'll refactor the code to remove the thread_idx parameter from task functions.
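The wrapping cost mentioned above can be shown with a compilable sketch (not the actual DALI code; the class here runs work inline instead of on worker threads). The no-arg AddWork overload wraps the callable in a thread-index-taking lambda, which is the extra indirection the new pool avoids:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch of the two AddWork overloads discussed above.
class DemoPool {
 public:
  using WorkWithThreadIdx = std::function<void(int)>;
  using Work = std::function<void()>;

  void AddWork(WorkWithThreadIdx work) { tasks_.push_back(std::move(work)); }

  void AddWork(Work work) {
    // Extra wrapping layer: adapt void() to void(int) by discarding
    // the thread index.
    AddWork([w = std::move(work)](int /*thread_idx*/) { w(); });
  }

  void RunAllInline() {  // stand-in for running on worker threads
    for (auto &t : tasks_) t(/*thread_idx=*/0);
    tasks_.clear();
  }

 private:
  std::vector<WorkWithThreadIdx> tasks_;
};
```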

@dali-automaton
Collaborator

CI MESSAGE: [45791241]: BUILD PASSED

mzient and others added 20 commits March 11, 2026 15:38
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
----
Signed-off-by: Michał Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@mzient force-pushed the ThreadPoolFacade branch from 01a53c0 to 9f0e8d7 (March 11, 2026 14:40)

/// CPU operators have default Thread Pool inside Workspace. Mixed and GPU ops don't.
- std::optional<ThreadPool> thread_pool_ = std::nullopt;
+ std::unique_ptr<ThreadPool> thread_pool_;

Collaborator:

Suggested change
- std::unique_ptr<ThreadPool> thread_pool_;
+ std::unique_ptr<OldThreadPool> thread_pool_;
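The thread_pool_ member change in video_input.h can be condensed into a compilable sketch. Class names mirror the PR; the bodies are illustrative. The point is that once ThreadPool is abstract, std::optional<ThreadPool> no longer compiles, so the member becomes a unique_ptr holding a concrete implementation:

```cpp
#include <cassert>
#include <memory>

// Minimal stand-ins (not the real DALI classes).
struct ThreadPool {
  virtual ~ThreadPool() = default;
  virtual int NumThreads() const = 0;  // abstract: no direct construction
};

struct OldThreadPool : ThreadPool {
  explicit OldThreadPool(int n) : n_(n) {}
  int NumThreads() const override { return n_; }
  int n_;
};

struct VideoInputLike {
  // Before: std::optional<ThreadPool> thread_pool_;  // ill-formed now
  std::unique_ptr<ThreadPool> thread_pool_;

  void Setup(int num_threads) {
    thread_pool_ = std::make_unique<OldThreadPool>(num_threads);
  }
};
```

Holding the pointer as unique_ptr<OldThreadPool> (per the reviewer's suggestion) would make the concrete type explicit at the cost of losing the ability to swap implementations later.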
