Improve stage loading performance for onStageSet and stageChanged #4434
Hi,
Our production teams have recently reported performance issues when loading large scenes or activating/deactivating the parent prim of extensive environment sets.
While PR-4275 has provided significant improvement, the loading performance remains suboptimal from a user experience perspective, primarily because our production scenes often contain hundreds or thousands of instances and point instancers.
We believe that, in addition to PR-4275, further improvements can be made in two other places. This PR is offered as a starting point for review and discussion:
The example scene in this PR is moderately sized and typical of our production assets: it features a significant number of instances and point instancers, and has over 50k prims in the hierarchy. We are not able to share the actual production scene, but the issues described here should be reproducible with any USD file with instances, such as ALab.
1. Unnecessary UFE notifications are being triggered by root prototypes like "/__Prototype_xxxx".
The Problem:
01_BEFORE_root_prototypes_reloading_stage.mp4
01_BEFORE_root_prototypes_activate_prims.mp4
Since these prototypes are implicitly managed by USD, are not user-editable, and are invisible in the Maya Outliner, it seems reasonable to ignore them to reduce the number of UFE notifications (commit b7838fe):
01_AFTER_ignore_root_prototypes_reloading_stage.mp4
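To make the filtering idea concrete, here is a minimal, self-contained sketch. It is not the code in commit b7838fe: `isRootPrototypePath()` and `pathsToNotify()` are hypothetical helpers, paths are plain strings rather than `SdfPath`, and the `/__Prototype_` prefix check is an assumption based on the path naming shown above. In real code the USD API (for example `UsdPrim::IsPrototypePath()`, if available in the USD version in use) would likely be the more reliable test.

```cpp
#include <string>
#include <vector>

// Assumed prefix of USD-generated root prototypes, e.g. "/__Prototype_12".
static const std::string kPrototypePrefix = "/__Prototype_";

// Hypothetical helper: true only for a root prototype path itself,
// false for its descendants and for ordinary prims.
bool isRootPrototypePath(const std::string& path)
{
    if (path.compare(0, kPrototypePrefix.size(), kPrototypePrefix) != 0) {
        return false;
    }
    // A root-level path has no further separator after the leading '/'.
    return path.find('/', 1) == std::string::npos;
}

// Hypothetical helper: drop root prototypes from the list of changed
// paths before any notification is sent for them.
std::vector<std::string> pathsToNotify(const std::vector<std::string>& changed)
{
    std::vector<std::string> kept;
    for (const std::string& path : changed) {
        if (!isRootPrototypePath(path)) {
            kept.push_back(path);
        }
    }
    return kept;
}
```

With this filter, a change list such as `/World`, `/__Prototype_1`, `/World/Set` would only produce notifications for the two `/World` paths.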
2. Inefficient `UsdHierarchy::children()` performance
Closer examination of `sendObjectAdd()` and `sendSubtreeInvalidate()` reveals that `UsdHierarchy::children()` is called frequently to traverse the hierarchy and gather parent-child relationship information. Notably, trace data indicates that this process currently runs single-threaded:
02_BEFORE_children_call.mp4
We attempted to investigate optimization options, but found that the direct caller of `UsdHierarchy::children()` originates from `libShared.so`. This makes it difficult to control or modify within the `maya-usd` codebase. Furthermore, there is insufficient information regarding the expected operation and return behavior of `UsdHierarchy::children()`.
To address the performance bottleneck in traversing the USD hierarchy, particularly within the opaque call sequence between `Ufe_v4::Subject::notify()` and `MayaUsd_v0::ufe::UsdHierarchy::children()`, we implemented a two-pronged workaround: parallel pre-caching and refining traversal filters.
- Parallel Hierarchy Pre-caching
We introduced an upfront parallel caching mechanism for the USD hierarchy to replace the single-threaded implementation:
  - [...] `UsdHierarchy::children()`, the proxy shape prim path is retrieved (it is assumed to be stable after `onStageSet()` or `stageChanged()`).
  - [...] `UsdHierarchy::children()` serve the requested data directly from this cache.
- Tweak Traversal Filters
We modified two specific traversal filters to improve performance:
  - [...] `prim.IsA<PXR_NS::UsdGeomPointInstancer>()`, the original logic in `UsdSceneItem::isPointInstance()` included a check against _instanceIndex (first commit here). We suspect this check is irrelevant outside of a Hydra/USD imaging context, as the _instanceIndex (defaulting to `UsdImagingDelegate::ALL_INSTANCES`, or -1) would cause `UsdSceneItem::isPointInstance()` to always return false in a USD stage context.
  - [...] `UsdTraverseInstanceProxies()` from the traversal logic, unlike the original implementation in getUSDFilteredChildren(). `UsdTraverseInstanceProxies()` was initially added here to fix an outliner display issue 6 years ago, but we are not able to reproduce that specific problem after removing `UsdTraverseInstanceProxies()`, so we are skipping the traversal of instance proxies to improve traversal performance.
The end result, as shown in commit 9011124, demonstrates a significant performance improvement. The hierarchy traversal time dropped sharply from approximately 11 seconds to less than 100 ms, with the `cache::getChildren()` execution running in a multi-threaded environment:
02_AFTER_improve_children_onStageSet.mp4
02_AFTER_improve_children_stageChanged.mp4
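The caching scope used here (active only during `onStageSet()` and `stageChanged()`, with children served from the cache while it is active) can be pictured with the minimal sketch below. This is an illustration under stated assumptions, not the code in commit 9011124: `ChildrenCache`, `ScopedChildrenCache`, and the string-based `Path` type are all hypothetical, and the parallel construction of the parent-to-children map is deliberately left out, so the map is simply passed in prebuilt.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Path = std::string;            // stand-in for a real path type
using Children = std::vector<Path>;

// Hypothetical cache that is only consulted between begin() and end().
class ChildrenCache
{
public:
    // Start of a stage operation: install a prebuilt parent -> children map.
    // (Building this map in parallel is omitted from the sketch.)
    void begin(std::unordered_map<Path, Children> prebuilt)
    {
        _entries = std::move(prebuilt);
        _active = true;
    }

    // End of the stage operation: drop the cached data.
    void end()
    {
        _entries.clear();
        _active = false;
    }

    bool isActive() const { return _active; }

    // Serve from the cache while active; otherwise, or on a cache miss,
    // fall back to the regular (slow) traversal supplied by the caller.
    Children getChildren(
        const Path& parent,
        const std::function<Children(const Path&)>& traverse) const
    {
        if (_active) {
            auto it = _entries.find(parent);
            if (it != _entries.end()) {
                return it->second;
            }
        }
        return traverse(parent);
    }

private:
    std::unordered_map<Path, Children> _entries;
    bool _active = false;
};

// RAII guard so that end() runs even if the stage operation exits early.
class ScopedChildrenCache
{
public:
    ScopedChildrenCache(ChildrenCache& cache, std::unordered_map<Path, Children> prebuilt)
        : _cache(cache)
    {
        _cache.begin(std::move(prebuilt));
    }
    ~ScopedChildrenCache() { _cache.end(); }

    ScopedChildrenCache(const ScopedChildrenCache&) = delete;
    ScopedChildrenCache& operator=(const ScopedChildrenCache&) = delete;

private:
    ChildrenCache& _cache;
};
```

The guard reflects the limitation we describe: because we cannot tell from the callee side when a sequence of `UsdHierarchy::children()` calls ends, the sketch ties the cache lifetime to the enclosing operation and falls back to the normal traversal everywhere else.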
We are seeking clarification on several points regarding the use of `UsdHierarchy::children()`, as information about its direct callers is quite limited.
Our specific questions are:
- [...] `onStageSet()` and `stageChanged()` to limit the scope of children caching to these two operations. This was done because we lack a precise way to determine when a sequence of `UsdHierarchy::children()` calls concludes. Ideally, this control would come from the caller side (likely within `libShared.so`). We would appreciate feedback on whether the current begin/end caching design can be improved with more detailed knowledge of its usage. We are also seeking confirmation of the expected return value for `UsdHierarchy::children()`.
- [...] `UsdHierarchy::children()` should still execute as before. However, after moving the traversal into a multi-threaded context, we observed a significant reduction in the number of `UsdHierarchy::children()` calls.
- [...] `UsdHierarchy::children()`. This is our main question for Autodesk.
Thanks and looking forward to receiving feedback!