[Low level] Store SourceLanguage properties outside of the main structure use a new bit-set type for "sets" of languages #1355

d-ronnqvist · 2025-11-17T18:44:53Z

Bug/issue #, if applicable:

Summary

This is a series of low-level optimizations regarding the use of SourceLanguage and particularly Set<SourceLanguage>.

Note

These performance improvements (see below) add a limitation: a single docc convert call can't define more than 64 different SourceLanguage values. However, considering that C/C++/Objective-C is rolled into one, that more than 2 languages in the same project is very rare, and that more than 3 languages is practically unheard of, I don't think this limitation is going to be impactful in practice.

These ideas came from a realization that DocC uses SourceLanguage almost exclusively for equality checks in one of 3 forms:

==(_:_:) comparisons between full values
comparisons between only the string id properties
set-algebra operations on Set<SourceLanguage> (which calls hash(into:) and ==(_:_:) behind the scenes)

I confirmed this hypothesis by adding a local counter that was incremented whenever a SourceLanguage value was created, checked for equality, compared, or hashed, and whenever any of its properties were accessed. This showed that across a full build, DocC makes on average ~50 ==(_:_:) calls, ~140 hash(into:) calls, and ~100 id accesses per page.

Based on these numbers I hypothesized that it would be worthwhile to optimize SourceLanguage for quick comparisons at the cost of slower accesses of name and other properties.

At first I thought about only storing the id string in the structure and accessing the other properties through indirect storage. But then I realized that if the identifier was numeric, sets of languages could be represented as a bit set, because the most common use of SourceLanguage values is to put them in a Set and because, in practice, projects are expected to have very few different languages (low single digits). Using a private numeric ID would also mean that the simpler someLanguage == .swift would be faster than someLanguage.id == "swift", which a lot of existing code was doing.
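To illustrate the idea, here is a minimal sketch (not the actual DocC implementation) of a language value that stores only a one-byte numeric ID and looks up its other properties in a separate table. The names `TinyLanguage`, `LanguageInfo`, and `infoTable` are hypothetical, and the table contents are only illustrative.

```swift
// Hypothetical sketch: the properties of each language live in a side table.
struct LanguageInfo {
    let id: String
    let name: String
}

// The language value itself is just a one-byte index into that table.
struct TinyLanguage: Hashable {
    // Illustrative table. A real implementation would also need thread-safe
    // storage for languages that are first seen at runtime.
    private static let infoTable: [LanguageInfo] = [
        LanguageInfo(id: "swift", name: "Swift"),
        LanguageInfo(id: "occ", name: "Objective-C"),
    ]

    // The only stored property. The synthesized `==` and `hash(into:)`
    // therefore only look at this single byte.
    private let _id: UInt8

    private init(_id: UInt8) {
        self._id = _id
    }

    static let swift = TinyLanguage(_id: 0)
    static let objectiveC = TinyLanguage(_id: 1)

    // Property accesses are indirect (and slower) because they go through the table.
    var id: String { Self.infoTable[Int(_id)].id }
    var name: String { Self.infoTable[Int(_id)].name }
}
```

With this layout, `someLanguage == .swift` compares two bytes, whereas `someLanguage.id == "swift"` first does a table lookup and then a string comparison.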

I reimplemented the internals of SourceLanguage in 10dabb4 and 4fdb68c. Then in f27559e, 4caf109, 090cad1, 3e1396a, and b9a6030 I generalized and improved the existing fixed-width bit set type (which DocC uses for type signature disambiguation) to finally be able to add a bit-set backed type that represents a "set" of source languages in 6ac19ea.

After that, 5273e94, d6e63ae, and 3d3d580 each updated other existing code in DocC to favor full SourceLanguage comparisons and to favor the bit-set backed "set" type in internal implementation details.

Trading id accesses for ==(_:_:) checks like this increased the number of ==(_:_:) checks by ~3× (on average ~150 calls per page) but reduced the number of id accesses by ~~~3× (on average ~30 calls per page)~~ ~4.5× (on average ~20 calls per page) after 9c8a9c1. The remaining id calls are largely caused by the RenderJSON code, which uses string identifiers in an enum that would require source-breaking changes to update. Because existing API needs to surface Set<SourceLanguage> externally while using a bit set internally, the number of SourceLanguage.hash(into:) calls increased by ~1.3×, but the new hash(into:) implementation is ~10× faster, so that's still a net positive.

Additionally, the changes to use the new bit-set backed type for "sets" of languages resulted in a large number of method calls moving from Set<SourceLanguage> to the new SmallSourceLanguageSet. For example:

~25 insert(_:) calls per page, which is ~15× faster in micro benchmarks
~10 contains(_:) calls per page, which is ~100× faster in micro benchmarks
~10 intersection(_:) calls per page, which is >1000× faster in micro benchmarks
~10 min() calls per page, which is ~150× faster in micro benchmarks
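As a rough illustration of why these operations get so much cheaper, here is a minimal sketch of a 64-bit bit set. It is not the `_FixedSizeBitSet` or `SmallSourceLanguageSet` code from this PR. The type name `SmallBitSet` and its API are assumptions made for the example.

```swift
// Hypothetical sketch of a fixed-width bit set for values in 0..<64.
struct SmallBitSet: Equatable {
    // Bit N is set when the value N is a member of the set.
    private(set) var storage: UInt64 = 0

    init() {}

    init(_ values: [Int]) {
        for value in values {
            insert(value)
        }
    }

    // Inserting is a single bitwise OR. No hashing or allocation is involved.
    mutating func insert(_ value: Int) {
        precondition((0..<64).contains(value), "Only values in 0..<64 fit in the set")
        storage |= 1 << UInt64(value)
    }

    // Membership is a single bitwise AND and a comparison.
    func contains(_ value: Int) -> Bool {
        (0..<64).contains(value) && storage & (1 << UInt64(value)) != 0
    }

    // The intersection of two sets is the bitwise AND of their storage.
    func intersection(_ other: SmallBitSet) -> SmallBitSet {
        var result = SmallBitSet()
        result.storage = storage & other.storage
        return result
    }

    // The smallest member is the position of the lowest set bit.
    func min() -> Int? {
        storage == 0 ? nil : storage.trailingZeroBitCount
    }
}
```

The 64-bit storage word in this sketch is also where a limit of 64 distinct values comes from: each value needs its own bit.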

In aggregate, on the scale of an entire documentation build, these add up to a small but measurable improvement. In one large (~10k page) Swift-only framework I measured these time and memory improvements (on my machine):

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Metric                                   │ Change          │ main                 │ current              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Duration for 'convert-total-time'        │ -2.177 %¹       │ 5.457 sec            │ 5.338 sec            │
│ Duration for 'documentation-processing'  │ -3.754 %²       │ 2.712 sec            │ 2.61 sec             │
│ Duration for 'finalize-navigation-index' │ no change³      │ 0.041 sec            │ 0.041 sec            │
│ Peak memory footprint                    │ -3.423 %⁴       │ 929.8 MB             │ 898 MB               │
│ Data subdirectory size                   │ no change       │ 169.3 MB             │ 169.3 MB             │
│ Index subdirectory size                  │ no change       │ 1.5 MB               │ 1.5 MB               │
│ Total DocC archive size                  │ no change       │ 197.3 MB             │ 197.3 MB             │
│ Topic Anchor Checksum                    │ no change       │ 78abcd6aed9cbccaa983 │ 78abcd6aed9cbccaa983 │
│ Topic Graph Checksum                     │ no change       │ 38eaf266fdc697658430 │ 38eaf266fdc697658430 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

In another large (~10k pages) framework with both Swift and Objective-C project symbols I measured similar (~2%) improvements.

Dependencies

None.

Testing

Nothing in particular. This isn't a user-facing change.

Checklist

Make sure you check off the following items. If they cannot be completed, provide a reason.

Added tests
Ran the ./bin/test script and it succeeded
~~[ ] Updated documentation if necessary~~


d-ronnqvist · 2025-11-17T18:45:43Z

@swift-ci please test

Sources/DocCCommon/FixedSizeBitSet.swift


d-ronnqvist · 2025-11-17T20:20:23Z

@swift-ci please test

d-ronnqvist · 2025-11-17T20:23:42Z

Tests/DocCCommonTests/FixedSizeBitSetTests.swift

+    }
+
+    @Test()
+    func testCombinations() {


FYI: This is a moved test (and implementation) from before.

d-ronnqvist · 2025-11-17T20:23:59Z

Tests/DocCCommonTests/FixedSizeBitSetTests.swift

+
+struct FixedSizeBitSetTests {
+    @Test
+    func testBehavesSameAsSet() {


FYI: This is a moved test (and implementation) from before.

d-ronnqvist · 2025-11-20T10:11:14Z

@swift-ci please test

patshaughnessy

Amazing work! Just some naming suggestions and a few questions...

Sources/DocCCommon/Mutex.swift

patshaughnessy · 2025-11-20T18:48:52Z

Sources/DocCCommon/SourceLanguage.swift

+    private static func _accessInfo(id: UInt8) -> _SourceLanguageInformation {
+        let (unknownIndex, isKnownLanguage) = id.subtractingReportingOverflow(SourceLanguage._numberOfKnownLanguages)
+        return if isKnownLanguage {
+            _knownLanguages[Int(id)]


Why does Swift require a cast here? We can't index an array using UInt8?

Correct. The Index type for Array is plain Int so we have to cast the id value.
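As a small standalone illustration of the two details in this snippet, the sketch below uses hypothetical tables (`knownNames` and `unknownNames` are not names from the PR). It shows the explicit `Int(...)` conversion that `Array` subscripting needs, and how a single `subtractingReportingOverflow(_:)` call can yield both an "is known" flag and an index into a second table.

```swift
// Hypothetical tables, only for illustration.
let knownNames = ["Swift", "Objective-C", "JavaScript"]
let unknownNames = ["Banana"]

func languageName(forID id: UInt8) -> String {
    // Unsigned subtraction overflows exactly when `id < knownNames.count`.
    // So the overflow flag says "this is a known language", and otherwise
    // the partial value is the index into the table of unknown languages.
    let (unknownIndex, isKnown) = id.subtractingReportingOverflow(UInt8(knownNames.count))

    // `Array.Index` is `Int`, so the `UInt8` values need an explicit conversion.
    return isKnown ? knownNames[Int(id)] : unknownNames[Int(unknownIndex)]
}
```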

patshaughnessy · 2025-11-20T19:03:35Z

Sources/DocCCommon/SourceLanguage.swift

-                .map { $0.lowercased() }
-                .contains(id)
+    private static func knownLanguage(withName name: String) -> SourceLanguage? {
+        switch name.lowercased() {


It would be ideal not to have to repeat all of the language names like this. Is there a way to refactor this to iterate over the _knownLanguages array somehow?

Yes, this could be implemented as an iteration over _knownLanguages like this:

    let name = name.lowercased()
    let index = _knownLanguages.firstIndex(where: { $0.name == name })
    return index.map { SourceLanguage_new(_id: UInt8($0)) }

Very similarly, _knownLanguage(withIdentifier:) below could be implemented as an iteration like this:

    let id = id.lowercased()
    let index = _knownLanguages.firstIndex(where: { $0.id == id || $0.idAliases.contains(id) })
    return index.map { SourceLanguage_new(_id: UInt8($0)) }

I had speculated that the switch implementations would be faster—thinking that the compiler would have more information to go on to optimize the code—but I didn't actually try and measure anything until now.

With the switch cases in the current code, the Swift compiler—when compiling with optimizations (a release build)—creates assembly that corresponds to a series of if checks one after another whereas with the firstIndex(where:) implementation, it creates assembly that corresponds to a basic loop. Essentially you can think of the difference as being between a loop that's been unrolled and one that hasn't been.

In micro benchmarks, the switch implementation for knownLanguage(withName:) is ~2.5× faster than the iteration implementation for known values (e.g. "Swift") and ~2× faster for unknown values (e.g. "Banana"). For _knownLanguage(withIdentifier:) the switch implementation is ~4.5× faster for known values and ~4× faster for unknown values.

Because both initializers are being called quite frequently (>10 times per page), these differences could add up.

Also, because the list of known languages is unlikely to change frequently (possibly not for years), I find that the little bit of code duplication within this file is worth it for these initializers.
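For readers who want to compare the two shapes side by side, here is a self-contained sketch of both lookup styles. The table contents, the `KnownLanguage` type, and the function names are hypothetical stand-ins, not the code from this PR. The sketch only shows that the two variants return the same results. It doesn't measure anything.

```swift
// Hypothetical table. The real list of known languages lives in DocC.
struct KnownLanguage {
    let id: String
    let name: String
}

let knownLanguages = [
    KnownLanguage(id: "swift", name: "Swift"),
    KnownLanguage(id: "occ", name: "Objective-C"),
    KnownLanguage(id: "javascript", name: "JavaScript"),
]

// Variant 1: a switch that repeats every name. The compiler sees all
// the cases up front, similar to an unrolled loop.
func knownLanguageIndexUsingSwitch(name: String) -> Int? {
    switch name.lowercased() {
    case "swift":       return 0
    case "objective-c": return 1
    case "javascript":  return 2
    default:            return nil
    }
}

// Variant 2: no repeated names, but a plain loop over the table.
func knownLanguageIndexUsingIteration(name: String) -> Int? {
    let name = name.lowercased()
    return knownLanguages.firstIndex(where: { $0.name.lowercased() == name })
}
```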

patshaughnessy · 2025-11-20T19:06:50Z

Sources/DocCCommon/SourceLanguage.swift

+
+// MARK: SourceLanguage Set
+
+package struct SmallSourceLanguageSet: Sendable, Hashable, SetAlgebra, ExpressibleByArrayLiteral, Sequence, Collection {


Maybe use a separate Swift file for SmallSourceLanguageSet ?

It needs to be defined in the same file in order to be able to access _id and init(_id:), which have fileprivate access. The alternative would be to increase them to internal access so that (all) other files in this module can access them.

Sources/DocCCommon/FixedSizeBitSet.swift

...ces/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy+TypeSignatureDisambiguation.swift

patshaughnessy · 2025-11-20T19:50:36Z

Sources/DocCCommon/SourceLanguage.swift

+// MARK: SourceLanguage Set
+
+package struct SmallSourceLanguageSet: Sendable, Hashable, SetAlgebra, ExpressibleByArrayLiteral, Sequence, Collection {
+    // There are a few different valid ways that we could implement this, each with their own tradeoffs.


And could you just name this SourceLanguageSet ? Do we need "Small" in the name?

My intention was to have some kind of indication that it has a size limitation, unlike a regular Set<SourceLanguage>. I thought of prefixing it with "FixedWidth" but "Small" felt less technical.

patshaughnessy · 2025-11-20T19:55:38Z

Sources/SwiftDocC/Model/Identifier.swift

+        self.init(bundleID: bundleID, path: path, fragment: fragment, _smallSourceLanguages: .init(sourceLanguages))
+    }
+
+    init(bundleID: DocumentationBundle.Identifier, path: String, fragment: String? = nil, _smallSourceLanguages: SmallSourceLanguageSet) {


Can you just name this sourceLanguages ? It's a bit odd reading "small" at every call site, and also why the underscore?

I can rename the parameter for the initializer because overloads can be distinguished by their parameter types but the property needs to be named something other than sourceLanguages because that's what the public Set<SourceLanguage> property is already called and renaming that or changing its type would be an API breaking change.
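The distinction can be sketched with simplified stand-in types. Here `Language`, `CompactLanguageSet`, and `Reference` are hypothetical and much smaller than the real `SourceLanguage`, `SmallSourceLanguageSet`, and `ResolvedTopicReference`. Two initializers may share a parameter label when their parameter types differ, but the stored property needs a name that differs from the existing public one.

```swift
// Hypothetical stand-ins for the real DocC types.
enum Language: Int, CaseIterable, Hashable {
    case swift, objectiveC, javaScript
}

struct CompactLanguageSet: Equatable {
    var bits: UInt64 = 0

    init(_ languages: Set<Language>) {
        for language in languages {
            bits |= 1 << UInt64(language.rawValue)
        }
    }

    var asSet: Set<Language> {
        Set(Language.allCases.filter { bits & (1 << UInt64($0.rawValue)) != 0 })
    }
}

struct Reference {
    // Compact internal storage. It can't also be called `sourceLanguages`
    // because that name is taken by the existing property below.
    let _smallSourceLanguages: CompactLanguageSet

    // Existing API surface, unchanged in both name and type.
    var sourceLanguages: Set<Language> { _smallSourceLanguages.asSet }

    // Both initializers use the `sourceLanguages` label.
    // The compiler picks one based on the argument's type.
    init(sourceLanguages: Set<Language>) {
        self.init(sourceLanguages: CompactLanguageSet(sourceLanguages))
    }

    init(sourceLanguages: CompactLanguageSet) {
        _smallSourceLanguages = sourceLanguages
    }
}
```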

patshaughnessy · 2025-11-20T20:00:20Z

Tests/DocCCommonTests/SourceLanguageTests.swift

-        XCTAssertEqual(SourceLanguage(knownLanguageIdentifier: "objc"), .objectiveC)
-        XCTAssertEqual(SourceLanguage(knownLanguageIdentifier: "c"), .objectiveC)
+struct SourceLanguageTests {
+    @Test(arguments: SourceLanguage.knownLanguages)


Nice use of parameterized tests!


d-ronnqvist · 2025-11-25T10:36:15Z

@swift-ci please test

d-ronnqvist added 14 commits November 17, 2025 05:31

Implement SourceLanguage as a tiny identifier with properties stored …

10dabb4

…separately This idea came from after realizing that in practice accesses are _almost_ exclusively `id`, `==`, or `<`.

Deprecate mutating SourceLanguage properties

4fdb68c

Remove unnecessary custom language sorting

3cb5dd8

Avoid accessing the language ID where it's not necessary

d1fa784

Move "_TinySmallValueIntSet" to DocCCommon as "_FixedSizeBitSet"

f27559e

Support different sizes of _FixedSizeBitSet

4caf109

Add Collection conformance to _FixedSizeBitSet

090cad1

Specialize a few common collection methods

3e1396a

Use masking shifts and overflow adds in other _FixedSizeBitSet implem…

b9a6030

…entations

Add a SmallSourceLanguageSet type

6ac19ea

Rely only in _id comparison for known language sorting

d61d3a0

Use new small language set type in link resolution code

5273e94

Use new small language set type inside ResolvedTopicReference

d6e63ae

Use new small language set type inside DocumentationDataVariantsTrait

3d3d580


Fix implementation code comments about direction for layout of values…

bf5b3b5

…/bits


d-ronnqvist added 2 commits November 20, 2025 10:08

Merge branch 'main' into tiny-source-language

dc384b9

Add new source files to CMakeLists for Windows CI

793db44

patshaughnessy approved these changes Nov 20, 2025


d-ronnqvist added 4 commits November 24, 2025 19:07

Merge branch 'main' into tiny-source-language

2c71020

User simpler parameter name for ResolvedTopicReference initializer

577068b

Update additional callers to prefer passing source languages rather t…

9c8a9c1

…han their string IDs

Misc minor code comment fixes and clarifications

9dc59b6

