Skip to content

video_core/shader: Optimize fragment shader by skipping passthrough TEV stages #1029

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

jbm11208
Copy link
Contributor

This change adds a fast-path optimization in the fragment shader generator to detect and skip TEV stages that simply pass through their input unchanged. This reduces shader complexity and improves performance for common rendering cases where TEV stages are configured as passthrough.

The optimization checks for:

  • Replace operation for both color and alpha
  • Previous buffer as source
  • No color/alpha modifiers
  • Unity multipliers

This is a safe optimization as it preserves exact PICA behavior while reducing unnecessary shader instructions.

This change also increases performance in games like Luigi's Mansion: Dark Moon

@PabloMK7 PabloMK7 self-requested a review May 10, 2025 23:04
@jbm11208 jbm11208 marked this pull request as ready for review May 11, 2025 01:20
@PabloMK7
Copy link
Member

Looks like the change is breaking some visuals, notice the fountain:
azahar_2QkakE2WBw

Here is how it should look like:
azahar_HvMI3XcEx3

@Algiuxs
Copy link

Algiuxs commented May 11, 2025

Looks like the change is breaking some visuals, notice the fountain:
azahar_2QkakE2WBw

Here is how it should look like:
azahar_HvMI3XcEx3

Also the red bush

@jbm11208
Copy link
Contributor Author

I'll take a look and see if I can fix it

@jbm11208
Copy link
Contributor Author

All fixed

@jbm11208
Copy link
Contributor Author

Got a little carried away there, I think I'm going to stop here with optimizations

@jbm11208
Copy link
Contributor Author

Everything should be working now, I went and tested my library of games and there are no longer any graphical issues.

Copy link
Member

@PabloMK7 PabloMK7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's start the review, but please note that this area of the emulator is not my strong point, so you will need to provide info in some places.

// Compile the vertex shader.
shader_engine->SetupBatch(vs_setup, regs.internal.vs.main_offset);
// Compute current VS config hash
u64 vs_hash = vs_setup.GetProgramCodeHash() ^ vs_setup.GetSwizzleDataHash();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use an xor? Is the calculation of the hash less time consuming than just falling through the rest of the function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hash comparison acts as a cache check. if the shader configuration hasn't changed since last time, skip shader setup

@@ -291,6 +291,9 @@ class PicaCore {
PrimitiveAssembler primitive_assembler;
CommandList cmd_list;
std::unique_ptr<ShaderEngine> shader_engine;
u64 last_vs_hash = 0xDEADBEEFDEADBEEF; // Track last used VS hash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but why not just start at 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change it when I'm not busy if you want

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget why I didn't do that, probably just to track the value for debugging or something

WriteTevStage(index);
}
if (!tev_stage_processed) {
out += "combiner_output = rounded_primary_color;\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about this? If all stages are passthrough then is the primary color used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tev_stage_processed flag stays false if all stages are detected as passthrough by the IsPassThroughTevStage helper function, and that if statement is triggered when false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that I understood, I'm asking for justifying the use of rounded_primary_color

AppendAlphaCombiner(stage.alpha_op);
out += ");\n";
}
if (IsPassThroughTevStage(stage)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check it here again? Are there other codepaths than on line 164?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That check is pointless and can be removed. I must've left that there from testing.

"clamp(alpha_output_{} * {}.0, 0.0, 1.0));\n",
index, stage.GetColorMultiplier(), index, stage.GetAlphaMultiplier());
// Batch static appends for color_results
out += "color_results_1 = ";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered maybe using a more optimized string structure for concatenating? I agree that using fmt may not be the best idea, but is it so to just use string concat? (A "haven't researched, out of scope" response is a valid response :P)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look into that, but I will right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made that change and pushed to GitHub, I noticed a decent improvement in performance as a result

@@ -1610,4 +1606,51 @@ std::vector<u32> GenerateFragmentShader(const FSConfig& config, const Profile& p
return module.Assemble();
}

// Helper to detect passthrough TEV stages for optimization
static bool IsPassThroughTevStage(const TexturingRegs::TevStageConfig& stage) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function different than the previous IsPassThroughTevStage one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were the same, but including constant as a passthrough stage on glsl caused graphical glitches, and so I had to be more conservative on that optimization for glsl. I believe I made that change in one of the most recent commits


namespace Pica::Shader {

JitEngine::JitEngine() = default;
JitEngine::~JitEngine() = default;
JitEngine::JitEngine() {
Copy link
Member

@PabloMK7 PabloMK7 May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what this is all about and the reasoning behind it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it offloads shader compilation to a thread pool to reduce the load on the main thread

@PabloMK7
Copy link
Member

By the way, I have played the LM2 intro side by side on 2121.1 and the msys2 artifact from this build, and the vulkan shader stutter, with the cache cleaned up beforehand, seems to be LONGER (a few ms) in this PR than on 2121.1.

@jbm11208
Copy link
Contributor Author

By the way, I have played the LM2 intro side by side on 2121.1 and the msys2 artifact from this build, and the vulkan shader stutter, with the cache cleaned up beforehand, seems to be LONGER (a few ms) in this PR than on 2121.1.

do you have at least a 3-3.5 ms render thread delay? you still need a delay, just much less. On my hardware, level D-1 of LM2 went from requiring a 9.5 ms delay on 2121.1 just to get to where the stuttering infrequent enough to be playable, to only needing a 3-4 ms delay to eliminate stuttering altogether

@OpenSauce04
Copy link
Member

As per the project readme, don't repeatedly merge master into your branch. A maintainer will do it if/when necessary.

@jbm11208
Copy link
Contributor Author

As per the project readme, don't repeatedly merge master into your branch. A maintainer will do it if/when necessary.

I did that because the PR that was recently merged had modified files that may have an effect on this PR

@Dragios

This comment was marked as off-topic.

@OpenSauce04

This comment was marked as off-topic.

jbm11208 added 20 commits June 9, 2025 19:39
…EV stages

This change adds a fast-path optimization in the fragment shader generator to detect and skip TEV stages that simply pass through their input unchanged. This reduces shader complexity and improves performance for common rendering cases where TEV stages are configured as passthrough.

The optimization checks for:
- Replace operation for both color and alpha
- Previous buffer as source
- No color/alpha modifiers
- Unity multipliers

This is a safe optimization as it preserves exact PICA behavior while reducing unnecessary shader instructions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants