Getting better compile time performance on large shader libraries #9354

sandy-carter-unity · 2025-12-12T16:20:51Z

sandy-carter-unity
Dec 12, 2025

I'm interested in exploring the feature set from slang with cross-api, modules and specialization constants being key areas of focus.

I have found that for the same hlsl shader, outputting SPIR-V, there's quite more time taken by slang for validation than DXC. I've tested various vertex and fragement shaders and measured between 4 and 20x more time taken by slangc than dxc.

This is of concern to me because these time differences can compound in codebases with very large shader libraries and particularly ones with large permutations of shader variants.

Real world shader Benchmark

To start, I grabbed the hlsl output (after custom precompiler) from a Unity URP Shader graph called VegetationSSS. It is about 11 thousand lines long due to a couple dozen #includes of the SRP shader library. This is a very big shader but it also very common in URP.

I ended up converting all the SRP library hlsl (which are usually #include) to modules with specialization constants to give slang the best chance.

I ran these both on a windows desktop and a Linux laptop which ended up being faster due to better IO and better single core performance. I tested their output to SPIR-V.

Specs

OS: Arch Linux x86_64
Host: Precision 7680
Kernel: Linux 6.17.8-arch1-1
Shell: bash 5.3.3
Display (SDC4164): 3840x2400 @ 1.5x in 16", 60 Hz [Built-in, HDR]
DE: KDE Plasma 6.5.3
WM: KWin (Wayland)
CPU: 13th Gen Intel(R) Core(TM) i9-13950HX (32) @ 5.50 GHz
GPU 1: NVIDIA RTX 4000 Ada Generation Laptop GPU
GPU 2: Intel Raptor Lake-S UHD Graphics @ 1.65 GHz [Integrated]
Memory: 13.70 GiB / 62.48 GiB (22%)
Swap: 993.17 MiB / 70.00 GiB (1%)
Disk (/): 17.14 GiB / 97.86 GiB (18%) - ext4
Disk (/home): 111.03 GiB / 1.77 TiB (6%) - ext4
Locale: en_US.UTF-8

I ran a few different configurations:

VegetationSSS (frag): Normal slang on the preprocessed shader doing one entry point
deadcode pre-removed: I ran slang to output hlsl in order to remove unused functions and other dead code. This reduced the number of lines to 4.5 thousand.
VegetationSSS (vert+frag): Normal slang on the preprocessed shader doing both entry points in one pass. These are the best gains I've seen since it's effectively a 200% speed improvement.
SRP shader library (module): I moved all the Shader Library function to a precompiled slang modules to measure this step and linking time in a later step.
SRP shader library (module combined): I converted all the URP and SRP core shaders to modules, then imported those into another module.
SRP shader library (module combined, auto find path): Same as above but I don't give the paths to the modules when building the bigger one so it has to find them in the FS (huge perf loss)
VegetationSSS (vert+frag usingsinglemodule): The original shader with all the library functions imported from the module above (SRP shader library (module)). Only the base shader code remains this this shader and it imports the libraries with one import:

import unity_srp_core_all;

VegetationSSS (vert+frag usingmodulesdirectly): Same as above but imports the SRP Library file each converted to small modules which are individually imported:

import Varyings;
import Core;
import Lighting;
import PBRForwardPass;
import LODCrossFade;
import Input;
import SurfaceData;
import DBuffer;

	VegetationSSS (dxc)	VegetationSSS (frag)	deadcode pre-removed	VegetationSSS (vert+frag)	SRP shader library (module)	SRP shader library (module combined)	SRP shader library (module combined, auto find path)	VegetationSSS (vert+frag usingsinglemodule)	VegetationSSS (vert+frag usingmodulesdirectly)
loadBuiltinModule		81.480ms	82.470ms	82.180ms	75.080ms	73.640ms	80.620ms	76.730ms	79.540ms
readSerializedModuleAST		19.810ms	19.530ms	18.790ms	18.600ms	18.140ms	18.600ms	18.180ms	19.070ms
serialize_CapabilitySet		25.850ms	21.480ms	24.400ms	43.210ms	43.790ms	24.900ms	24.690ms	29.550ms
readSerializedModuleIR		42.190ms	47.120ms	41.410ms	40.670ms	46.180ms	41.400ms	39.770ms	46.860ms
compileInner		999.880ms	455.620ms	1075.590ms	854.320ms	87.430ms	1195.260ms	804.310ms	301.360ms
endToEndActions		999.880ms	455.620ms	1075.590ms	854.320ms	87.430ms	1195.260ms	804.310ms	301.360ms
frontEndExecute		793.610ms	257.950ms	799.760ms	785.630ms	2.330ms	1192.090ms	507.820ms	10.800ms
parseTranslationUnit		24.120ms	5.810ms	23.960ms	23.160ms	14.320ms	20.490ms	43.340ms	14.100ms
SemanticChecking		670.900ms	228.730ms	671.990ms	665.540ms	1.480ms	1186.900ms	498.290ms	3.810ms
checkAllTranslationUnits		670.900ms	228.730ms	671.990ms	665.540ms	1676.080ms	2252.090ms	1980.540ms	1676.240ms
generateIR		95.990ms	21.040ms	101.020ms	96.620ms	0.470ms	0.550ms	2.520ms	2.590ms
generateIRForTranslationUnit		95.990ms	21.040ms	101.020ms	96.610ms	131.760ms	189.110ms	287.230ms	133.420ms
performMandatoryEarlyInlining		4.620ms	0.780ms	4.600ms	4.220ms	2.250ms	3.290ms	7.680ms	2.860ms
generateOutput		203.970ms	195.470ms	273.330ms	68.680ms	85.090ms	3.160ms	290.100ms	285.380ms
writeSerializedModuleAST					48.690ms	42.560ms	0.620ms
linkAndOptimizeIR		124.490ms	118.130ms	174.090ms				190.020ms	188.710ms
linkIR		8.460ms	6.700ms	8.940ms				23.380ms	21.860ms
simplifyIR		65.480ms	57.410ms	100.550ms				101.480ms	101.220ms
specializeModule		9.450ms	9.860ms	11.280ms				11.040ms	11.760ms
unrollLoopsInModule		0.110ms	0.120ms	0.140ms				0.130ms	0.140ms
lowerGenerics		6.130ms	6.880ms	7.880ms				7.900ms	8.130ms
performForceInlining		2.550ms	4.070ms	2.700ms				2.670ms	2.710ms
legalizeExistentialTypeLayout		4.570ms	4.920ms	5.600ms				5.560ms	5.700ms
legalizeResourceTypes		4.290ms	4.690ms	5.490ms				5.390ms	5.570ms
InstantiateClass	0.541ms
ParseClass	1.084ms
Frontend	61.894ms
Compile	315.021ms
Total Compile	315.021ms
Total Frontend	61.894ms
Total ParseClass	2.619ms
Total InstantiateClass	2.690ms

Looking at these measurements, linkAndOptimizeIR and specializeModule fare much better than I expected having put a couple dozen specialization constants and with Unity SRP Shader libraries, there is a lot of potential for saving there.

DXC looks to be nearly 3-4x as fast as slang to compile the same fragment shader in the best case.

Moving code to modules greatly improves compileInner to the point of taking less time than dxc in the compilation step but shifts the time taken to checkAllTranslationUnits to the point of overally not saving much time at all.

Perf analysis

Here are some of the results of the compilation of the full shadergraph hlsl to spirv on linux.
The first flamegraph is slang and the second is dxc.
I've highlighted the same call both do to the spirv library (spvtools::Optimizer::Run) after most of the AST is parsed.
On slang, this call takes less long, presumably thanks to the optimization slang does. However, from looking at this graph it looks like slang spends more time optimizing than the time difference with dxc.
Another thing to note is that in the DXC graph, the majority of the time is this call to spvtools whereas in slang, the AST building dominates the time taken.

Observations

What this test has shown me is that:

Slang is input limited and time to compile scales linearly and very closely the number of input lines as opposed to the number output instructions. Which I guess is not very surprising.
Moving library code to modules saves time for the compilation but that time almost completely taken up by checkAllTranslationUnits scaling up.
Specialization constants don't increase total time in any measurable way for this many input lines.
Seems like slang's IR optimization saves time for the spv optimizer but the time saved is less than the time slang takes to do said optimization.
It also seems like any optimization I do to the source code such as dead code removal also improved dxc by an equal factor.

Compared to DXC

I redid this test at a smaller scale with 1 thousand lines of dead code (attached) and even with just a dummy pixel shader returning white and no other code. What I see is that, ignoring the time spent loading the builtin module before checkAllTranslationUnits, slang's runtime is very close to DXC's.

Slang seems to do parse for all Translation Units followed by a check for all translation units. DXC does a parse and a DiagnoseTranslationUnit together one translation unit at a time.

I'm not sure if DiagnoseTranslationUnit is equivalent to checkTranslationUnits and DiagnoseTranslationUnit is barely measurable in terms of performance.

How to repro

Dummy pixel shader

The most basic shader takes

float4 PSMain() : SV_TARGET
{
    return float4(1, 1, 1, 1);
}

Slang Results

$ slangc -v # commit 3a2969a32ec0fb0e25b3834cf604f9cc44c47f68 compiled with RelWithDebInfo. I added a profiling line
2025.21
$ time slangc -I. -Wno-30081 -report-perf-benchmark test.hlsl -o test.spirv -target spi
rv -entry PSMain -stage pixel
(0): note 103: compiler performance benchmark:
[*]              loadBuiltinModule      2          79.36ms
[*]        readSerializedModuleAST      2          17.85ms
[*]        serialize_CapabilitySet      4429       12.13ms
[*]         readSerializedModuleIR      2          38.37ms
[*]                   parseOptions      1           0.11ms
[*]                   compileInner      1           5.38ms
[*]                endToEndActions      1           5.38ms
[*]                frontEndExecute      1           3.41ms
[*]           parseTranslationUnit      1           0.05ms
[*]               SemanticChecking      1           3.10ms
[*]       checkAllTranslationUnits      1           3.10ms
[*]                     generateIR      1           0.15ms
[*]   generateIRForTranslationUnit      1           0.15ms
[*]  performMandatoryEarlyInlining      2           0.00ms
[*]                 generateOutput      1           1.91ms
[*]              linkAndOptimizeIR      1           1.10ms
[*]                         linkIR      1           0.88ms
[*]                     simplifyIR      6           0.05ms
[*]               specializeModule      2           0.01ms
[*]            unrollLoopsInModule      1           0.00ms
[*]           performForceInlining      3           0.00ms
[*]          legalizeResourceTypes      1           0.02ms

Type Dictionary Size: 107


real    0m0.104s
user    0m0.066s
sys     0m0.037s

DXC Results

$ dxc --version # linux version from https://github.com/microsoft/DirectXShaderCompiler/releases/tag/v1.8.2505.1
libdxcompiler.so: 1.9(dev;4950-b106a961); libdxil.so: 1.9
$ time dxc -I. -ftime-report -ftime-trace test.hlsl -Fo test.spirv -spirv -fvk-use-dx-positio
n-w -fvk-bind-globals 0 3 -E PSMain -T ps_6_9
; { "traceEvents": [
; { "pid":1, "tid":0, "ph":"X", "ts":3, "dur":3642, "name":"Compile", "args":{ "detail":""} },
; { "pid":1, "tid":1, "ph":"X", "ts":0, "dur":3642, "name":"Total Compile", "args":{ "count":1, "avg ms":3} },
; { "pid":1, "tid":2, "ph":"X", "ts":0, "dur":287, "name":"Total Frontend", "args":{ "count":1, "avg ms":0} },
; { "pid":1, "tid":3, "ph":"X", "ts":0, "dur":203, "name":"Total InstantiateClass", "args":{ "count":3, "avg ms":0} },
; { "pid":1, "tid":4, "ph":"X", "ts":0, "dur":0, "name":"Total PerformPendingInstantiations", "args":{ "count":1, "avg ms":0} },
; { "pid":1, "tid":5, "ph":"X", "ts":0, "dur":0, "name":"Total Frontend - Consumer", "args":{ "count":1, "avg ms":0} },
; { "cat":"", "pid":1, "tid":0, "ts":0, "ph":"M", "name":"process_name", "args":{ "name":"clang" } }
; ] }
; 

real    0m0.017s
user    0m0.009s
sys     0m0.008s

Pixel shader with dead code from Unity SRP Shader Library

This requires cloning https://github.com/Unity-Technologies/Graphics and compiling this shader at the root of that repo.

I've attached the output of the precompiler so you don't need to do the cloning.
test.preprocessed.hlsl.txt

// Required defines to get the include to work
#define SHADER_API_MOBILE 0
#define SHADER_API_GLES3 0
#define SHADER_API_SWITCH 0
#define SHADER_API_SWITCH2 0
#define SHADER_STAGE_RAY_TRACING 0
#define UNITY_RAY_TRACING_GLOBAL_RESOURCES 0
#define SHADER_TARGET 45
#define PREFER_HALF 0
#define min16float half
#define min16float2 half2
#define min16float3 half3
#define min16float4 half4

#include "Packages/com.unity.render-pipelines.core/ShaderLibrary/Common.hlsl"


float4 PSMain() : SV_TARGET
{
    return float4(1, 1, 1, 1);
}

Slang Results

$ slangc -v # commit 3a2969a32ec0fb0e25b3834cf604f9cc44c47f68 compiled with RelWithDebInfo. I added a profiling line
2025.21
$ time slangc -I. -Wno-30081 -report-perf-benchmark test.hlsl -o test.spirv -target spirv -entry PSMain -stage pixel
(0): note 103: compiler performance benchmark:
[*]              loadBuiltinModule      2          78.10ms
[*]        readSerializedModuleAST      2          20.48ms
[*]        serialize_CapabilitySet      12677      19.41ms
[*]         readSerializedModuleIR      2          37.53ms
[*]                   parseOptions      1           0.12ms
[*]                   compileInner      1         197.44ms
[*]                endToEndActions      1         197.44ms
[*]                frontEndExecute      1         194.81ms
[*]           parseTranslationUnit      1           3.12ms
[*]               SemanticChecking      1         177.11ms
[*]       checkAllTranslationUnits      1         177.11ms
[*]                     generateIR      1          14.40ms
[*]   generateIRForTranslationUnit      1          14.39ms
[*]  performMandatoryEarlyInlining      3           0.37ms
[*]                 generateOutput      1           2.49ms
[*]              linkAndOptimizeIR      1           1.56ms
[*]                         linkIR      1           1.33ms
[*]                     simplifyIR      6           0.05ms
[*]               specializeModule      2           0.01ms
[*]            unrollLoopsInModule      1           0.00ms
[*]           performForceInlining      3           0.00ms
[*]          legalizeResourceTypes      1           0.02ms

Type Dictionary Size: 5687


real    0m0.302s
user    0m0.248s
sys     0m0.053s

DXC Results

$ dxc --version # linux version from https://github.com/microsoft/DirectXShaderCompiler/releases/tag/v1.8.2505.1
libdxcompiler.so: 1.9(dev;4950-b106a961); libdxil.so: 1.9
$ time dxc  -I. -ftime-report -ftime-trace test.hlsl -Fo test.spirv -spirv -fvk-use-dx-position-w -fvk-bind-globals 0 3 -E PSMain -T ps_6_9
; { "traceEvents": [
; { "pid":1, "tid":0, "ph":"X", "ts":1314, "dur":7655, "name":"Frontend", "args":{ "detail":""} },
; { "pid":1, "tid":0, "ph":"X", "ts":0, "dur":9883, "name":"Compile", "args":{ "detail":""} },
; { "pid":1, "tid":1, "ph":"X", "ts":0, "dur":9883, "name":"Total Compile", "args":{ "count":1, "avg ms":9} },
; { "pid":1, "tid":2, "ph":"X", "ts":0, "dur":7655, "name":"Total Frontend", "args":{ "count":1, "avg ms":7} },
; { "pid":1, "tid":3, "ph":"X", "ts":0, "dur":731, "name":"Total InstantiateClass", "args":{ "count":20, "avg ms":0} },
; { "pid":1, "tid":4, "ph":"X", "ts":0, "dur":21, "name":"Total ParseClass", "args":{ "count":1, "avg ms":0} },
; { "pid":1, "tid":5, "ph":"X", "ts":0, "dur":5, "name":"Total PerformPendingInstantiations", "args":{ "count":1, "avg ms":0} },
; { "pid":1, "tid":6, "ph":"X", "ts":0, "dur":4, "name":"Total InstantiateFunction", "args":{ "count":2, "avg ms":0} },
; { "pid":1, "tid":7, "ph":"X", "ts":0, "dur":0, "name":"Total Frontend - Consumer", "args":{ "count":1, "avg ms":0} },
; { "cat":"", "pid":1, "tid":0, "ts":0, "ph":"M", "name":"process_name", "args":{ "name":"clang" } }
; ] }
; 

real    0m0.017s
user    0m0.014s
sys     0m0.003s

Questions

I'm posting here because I'm curious why there is such a large difference in performance between the two compilers.

Is there a reason in the design of slang that causes such a discrepancy? Can it be overcome?

Has anyone had luck tuning slang to improve the validation performance?

tangent-vector · 2025-12-18T22:50:22Z

tangent-vector
Dec 18, 2025
Maintainer

There's a lot of detail here, and it's hard for me (as a person who is not primarily a performance-oriented engineer) to tease apart what the take-away points are. That said, I'll try to lend my thoughts as somebody who has been involved with Slang since the start.

Aside: I want to note that I greatly appreciate you setting up the closest thing possible to an "apples to apples" comparison between slangc and dxc. For compilation flows where slangc generates output source for consumption by a downstream compiler, Slang compilation pays for both the front-end work done inside slangc plus a full invocation of the downstream compiler. In such cases it would be very difficult (but not technically impossible) for, say, a slangc->dxc compile from plain HLSL to DXIL to beat just using dxc for that same task.

Why use Slang instead of XYZ, if Slang is slower?

This isn't one of the questions you asked, but I think it's a question that looms around any comparison like this between tools.

If all you have is what we might call "vanilla" HLSL code, and all you want is a command-line tool that you can invoke on your .hlsl files to get output as SPIR-V or DXIL (and maybe you get code for other targets using spirv-cross), then both slangc and dxc are just two different tools you can apply to that same task. In that case, it's entirely reasonable to pick the one that has the best performance (or the best stability, diagnostics/UX, etc. - pick your metric).

My opinion is that the main reasons to use the Slang compiler instead of something else are because either:

You want to use the Slang language
You want to do things that the Slang compiler supports, but other tools don't

Ultimately those are just the same reason; they both amount to using the Slang compiler because of the unique things it enables you to do. Our main focus on the Slang project (since even back in the days when it was just a research effort) has been to show that GPU shading languages could be significantly better than what most developers were stuck using.

If the DXC team announced tomorrow that their compiler now supports the Slang language, cross-compiles to all the targets the Slang compiler supports, and also has better performance? I'd actually consider that a win. All along, my personal goal has been to help get better tools in the hands of developers, and we only designed and implemented a new language to show people just how much better things could be.

Why is the Slang compiler slower than DXC?

I believe there are two main factors here.

The Slang compiler supports the Slang language, not just HLSL

Even if you are just using slangc as a drop-in replacement for dxc, the Slang compiler still supports all of the Slang features, so you would be running a compiler for a much more powerful (and complicated) language on code written in a less powerful subset of that language.

An (entirely justified) argument can be made that the Slang compiler should only make you "pay for what you use," so that when handing the Slang compiler vanilla HLSL it wouldn't need to turn on any of its more advanced features. That's a good aspirational goal but, in many cases, being able to support some of the features of Slang at all requires particular approaches to the compiler's design and architecture, such that they aren't just things we can turn on/off as needed.

The Slang codebase is less mature than DXC

Those of us working on Slang need to be honest about the fact that there are many quality of implementation (QOI) differences between the Slang codebase/compiler and DXC. DXC is quite simply a more mature codebase, and there are times when it shows.

Slang has been in development for less time than DXC, and has had many fewer active contributors over the history of the project (on average). Moving the Slang project to open governance under Khronos has given us more attention and some new contributors, but catching up to other tools in terms of overall maturity will take time.

It is also important to note that the DXC project leveraged a lot of pre-existing mature code, in clang and LLVM. The use of clang/LLVM gave DXC a foundation with many person-years of effort put into it, which includes a lot of thoughtful engineering work to optimize the performance of the compiler framework. Such technology choices have benefited some aspects of DXC, such as compilation performance, but they also represent trade-offs.

At the start of the Slang project we considered whether to build on top of clang/LLVM, and made a conscious decision to follow a different path for our compiler architecture. The clang compiler architecture does not match well with the vision we had/have for the Slang language, and LLVM is (even now) not a good match for several of the compilation targets we wanted to support.

The DXC team took the trade-off that let them (more) easily build a mature compiler for a less ambitious language. The Slang team took the trade-off that made it possible for us to build a compiler for a much more ambitious language at all.

Can the performance gap be overcome?

The question is phrased as a binary yes/no, but the answer (whatever it is) is likely to be much more subtle.

One thing I can state with some confidence is that it would take less engineering effort to greatly improve the performance of the Slang compiler codebase than it would to make the DXC compiler codebase accept the Slang language. So if the goal is to close the gap by having a compiler for a language as powerful as Slang, but with compilation performance closer to DXC, we know which side of the gap to start from.

Another thing that isn't explicit in the question, but that I'd like to make sure sees discussion (I hope that other Slang contributors will follow up here...) is the question of whether the performance gap should be overcome by:

improving the performance of the slangc command-line tool for known/existing input code (e.g., vanilla HLSL), in direct comparison to command-line dxc, by making changes to the compiler implementation codebase
changing the ways that developers author shader codebases in Slang, and the ways their workflows drive the Slang compiler, so that they can achieve improvements to compilation time that could not be achieved by just optimizing slangc compiles

These amount to the typical question of whether one should optimize the code for an existing algorithm, or pick a different/better algorithm, when trying to optimize.

When code is coming from an existing engine/renderer, such as Unity's SRP system, the shaders will typically have been authored in many ways (large and small) around the design and limitations of the HLSL language and the DXC compiler. Sometimes it is possible to refactor such code to be closer to more idiomatic Slang-first designs, but it is not always easy to work around design choices that are deeply baked into the engine/renderer.

Some quick thoughts on things that might be worth exploring (I'm saying this as somebody who doesn't primarily focus on performance analysis/optimization, so I really hope others from the Slang project can help with deeper insights:

Command-Line `slangc` is the Wrong Tool

First and foremost, I want to note that the OP appears (AFAICT) to only be talking about use of the slangc command-line tool.

In general, any system complicated enough that shader compile times become a concern should not be using slangc on the command line (e.g., sequencing many slangc calls from shell scripts or a build system like CMake) and should instead take the time to write a custom application that uses the Slang compiler's C++ API to orchestrate compilation. Many opportunities for amortization become available when multiple compiles are done in the same process, instead of invoking slangc over and over.

Amortization across many entry points

This is kind of a specific example of the preceding point about slangc vs. a custom tool.

The OP mentions the benefits of compiling more than one entry point in a single invocation of slangc. When compiling via the C++ API, it is possible for a single process can be invoked to handle batch compilation of many entry points across many files. If the C++ code re-uses the same slang::ISession across multiple compiles, they can often amortize out almost all of the front-end work that goes into any modules that are imported multiple times, even without use of pre-compiled modules.

Even just amortizing out the overhead of loading up the Slang core module (by creating and re-using a slang::IGlobalSession) can give a performance boost to per-entry-point compile times.

Any performance evaluation that compares Slang vs. DXC for compiling a small number of entry points (e.g., a single vert+frag pair) is not going to be able to see the true potential of modules for amortizing compilation costs.

Are we actually seeing "precompiled" modules here?

The OP mentions attempts to use "precompiled modules," which is a good idea, but looking at the presented performance data, I am left wondering whether this experiment was actually successful in making use of precompiled modules at all.

The fact that the usingsinglemodule and usingmodulesdirectly cases somehow spent more time in checkAllTranslationUnits than code not using modules is a big flag (for me) that something is probably going wrong in the experimental setup. Similarly, the fact that cases using more fine-grained modules instead of a single monolithic module took significantly longer in checkAllTranslationUnits is a sign something is going wrong in terms of what is being fed to the Slang compiler.

One possibility is that there is a massive bug in the Slang compiler that is causing it to do way more work in those cases for no justifiable reason. If that's the case, then we should root-cause it and solve the problem there.

Another possibility is that these cases aren't actually using previously-compiled code (the checkAllTranslationUnit timings make this possibility seem plausible). I initially thought there was a smoking gun when the AST/IR deserialization times were basically flat across all the test cases, but the current Slang compiler uses an on-demand scheme for deserialization of AST/IR which means that much of the time spent on deserialization of precompiled modules would be spread out across the other steps.

Yet another possibility is that the experimental setup is using precompiled modules, but there is a large amount of "common" code that still gets #included into each module, and also each shader program that gets compiled, such that the overhead of re-compiling all the common code over and over outweighs the potential benefits of amortization. Understanding how many lines/bytes of code (after pre-processing) are being processed by Slang for each of the modules in each of the scenarios would help here. If the amount of code being #included in every translation unit is far greater than what is being amortized out using modules, then it makes sense that modules wouldn't be of much benefit (just like how a C/C++ codebase that #includes tons of code into every translation unit can't benefit as much from factoring the code into more translation units).

Summary

There is definitely plenty of work that we can and should do on the Slang project to improve compile times for all code that goes through the Slang compiler, including in the most naive case of compiling one entry point of vanilla HLSL at a time with command-line slangc. Anybody out there who wants to help contribute to Slang with a focus on compiler perf is welcome to reach out; there are plenty of opportunities to have a big impact.

On the flip side, there is also a lot that can be done by application developers, in terms of how they architect their shader code and how they orchestrate compilation using the Slang compiler. I would like to make sure that developers don't fixate on the constant-factor improvements when there can order-of-magnitude improvements to be had when engaging with Slang on the right terms.

Additionally, I'd like us to drill down on the results being presented here, and make sure we get to the bottom of some of these perf numbers that seems suspect. There's still plenty to talk about in terms of improving the performance of the Slang compiler, but I'd like to make sure that any strategy for improvement is based on the right data/analysis.

Has anybody had luck tuning Slang for compilation performance?

It's unclear if this question is asking if the Slang developers/contributors have done work to improve compilation performance, or if it is asking if other Slang users have done work to architect their shader codebases to get better perf from Slang.

The short answer to both questions is "yes." Hopefully other contributors to Slang can give more details (this hasn't been my primary focus).

On the implementation side, @csyonghe has done a lot of work to drastically improve the performance of the Slang compiler, and has previously wrung multiple orders of magnitude out of the codebase by implementing smarter data structures, caching in the right places, and plain old low-level optimization.

For my part, earlier this year I re-architected the approach that the Slang compiler uses for deserializing the core module AST, which resulted in up to 3x perf improvements for compiling to SPIR-V for extremely small shaders being compiled via slangc. Later, @expipiplus1 expanded the new deserialization scheme to also cover the IR for the core module, resulting in further gains, and I believe she's worked on improvements to serialize_CapabilitySet that should greatly reduce its presence in timing results like those in the OP.

Lots of other contributors have worked on performance improvements, and it's something that we continue to keep an eye on. We appreciate it greatly when users take the time to bring our attention to performance concerns (especially when they are reproducible and backed up by data), because that can help us identify where the low-hanging-fruit optimization opportunities might be.

More on the user side, we definitely have users of Slang who are extremely sensitive to compilation performance and who have taken time to experiment with how to change their code to optimize compile times. Many of the high-order-bit results are what I've already mentioned, around using the C++ API rather than command-line slangc to amortize out overheads, and then making good/thoughtful use of modules.

If other Slang users out there know of techniques for controlling/improving compile times, please do weigh in.

Wrapping Up

Thanks for starting a conversation around compilation performance; it's a good thing for the Slang community to be mindful of, and all of the contributors would like to make sure that we keep improving the experience of developers using Slang, which requires hearing about pain points that developers face.

Hopefully more folks can weigh in on this thread, but even if they don't I'd like to make sure that we can bottom out on some of the apparent performance anomalies in the data presented in the OP. No matter what turns out to be the cause, I'm sure that those of us working on slang will learn something.

0 replies

tangent-vector · 2025-12-18T23:20:09Z

tangent-vector
Dec 18, 2025
Maintainer

Of course it's only after writing that giant post that I realize I should cross reference this with a specific performance-related issue that, for all I know, could be related to some of the counter-intuitive perf results seen in the OP of this discussion when trying to use pre-compiled modules: #9400

It doesn't immediately seem like that issue is related to the performance issues being observed here, but it also isn't something I'd want to rule out. Either way, it is an example of a performance concern raised by a user that we believe we can (and will) address.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Getting better compile time performance on large shader libraries #9354

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Getting better compile time performance on large shader libraries #9354

Uh oh!

Uh oh!

sandy-carter-unity Dec 12, 2025

Real world shader Benchmark

Perf analysis

Observations

Compared to DXC

How to repro

Dummy pixel shader

Slang Results

DXC Results

Pixel shader with dead code from Unity SRP Shader Library

Slang Results

DXC Results

Questions

Replies: 2 comments

Uh oh!

Uh oh!

tangent-vector Dec 18, 2025 Maintainer

Why use Slang instead of XYZ, if Slang is slower?

Why is the Slang compiler slower than DXC?

The Slang compiler supports the Slang language, not just HLSL

The Slang codebase is less mature than DXC

Can the performance gap be overcome?

Command-Line slangc is the Wrong Tool

Amortization across many entry points

Are we actually seeing "precompiled" modules here?

Summary

Has anybody had luck tuning Slang for compilation performance?

Wrapping Up

Uh oh!

tangent-vector Dec 18, 2025 Maintainer

sandy-carter-unity
Dec 12, 2025

tangent-vector
Dec 18, 2025
Maintainer

Command-Line `slangc` is the Wrong Tool

tangent-vector
Dec 18, 2025
Maintainer