Skip to content

For use in workspace symbols (WIP)

License

Notifications You must be signed in to change notification settings

zigtools/trigram-bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

9ac9927 · May 4, 2024

History

2 Commits
May 4, 2024
May 4, 2024
May 4, 2024
May 4, 2024
May 4, 2024
May 4, 2024
May 3, 2024

Repository files navigation

Trigram Bench

For use in ZLS workspace symbols.

Problem Statement

Definitions:

  • Document: A Zig file.
  • Declaration: A u32 that represents a declaration. Only a name is exposed. The rest is considered a ZLS implementation detail for practical purposes.
  • Symbol / Declaration Name: A []const u8 that is the name of a declaration.
  • Index: Preprocessed state representing a document's symbols used to perform a search.
  • Query / Search: A search query that will be matched with a trigram search described below.
  • Trigram: A window over a string with window size 3 and stride 1. Our trigrams are Unicode-based and not byte-based and are case sensitive. Example: Counter is composed of the trigrams Cou, oun, unt, nte ter.

Indexing

We begin by obtaining 10,000 symbols extracted from zigwin32's everything.zig found in symbols.txt. Each symbol is given a declaration, from 0 to 9_999. A list declarations : Declaration (u32) -> Symbol ([]const u8) exists and is used for checking the correctness of a search.

Each symbol is then split into its constituent trigrams, and a mapping trigram_to_decls : Trigram -> []const Declaration is created. This maps trigrams to the declarations whose names contain the trigram.

Currently there is one indexing method in common.zig. If your query method requires a different kind of indexing, let me (Auguste) know.

Querying

We begin by splitting our query into its constituent trigrams. Then we access trigram_to_decls with each trigram and obtain the intersection of each decl list.

Let's walk through a made up example. We're searching for Alloc, which is composed of the trigrams All, llo, and loc.

We access the trigram_to_decls mapping for each trigram and obtain the following declaration lists for each trigram:

All -> { 0, 1, 3, 4, 7, 8 }
llo -> { 2, 3, 5, 6, 9 }
loc -> { 0, 2, 3, 10 }

We perform the intersection and obtain:

All ∩ llo ∩ loc -> { 2, 3 }

To check if this result makes sense, we can access declarations:

2 -> Allocator
3 -> ArenaAllocator

Success!

Challenge

Can you come up with the most effective way of indexing/querying this data? Anything is permitted as long as it replicates the test results with a reasonable balance of memory usage and performance. If a trade-off between indexing and query time appears, we'd rather make indexing faster to prevent delays for users not/rarely utilizing workspace symbols.

Please use a std.AutoArrayHashMapUnmanaged as it's required for the binary fuse filter (this bench only operates on a single document, but ZLS will operate on thousands, so the filter used to prevent unnecessary accesses).

The current solutions are:

  • merge.zig
  • hashmap.zig

Useful Stats

Our trigram_to_decls mapping has 10505 elements with an average of ~21 declarations (can be repeated) per trigram.

About

For use in workspace symbols (WIP)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages