StringZilla 4.0! #201

Open
wants to merge 457 commits into
base: main
Commits
244e605
Fix: Reverse order `std::search` offsets
ashvardanian Mar 11, 2025
3f1c723
Improve: New-style "container" benchmarks
ashvardanian Mar 12, 2025
aa7f275
Make: Upgrade to C++20 for benchmarks
ashvardanian Mar 12, 2025
b9794e5
Improve: Naming "vtable" entries
ashvardanian Mar 12, 2025
9676cdb
Fix: Naming byteset signature
ashvardanian Mar 12, 2025
af686dd
Docs: Describe trivial types
ashvardanian Mar 12, 2025
12e1edd
Improve: New token-level benchmarks
ashvardanian Mar 12, 2025
20f35c7
Improve: Faster equality checks on NEON/SVE
ashvardanian Mar 12, 2025
a5de795
Fix: Computing improvement percent
ashvardanian Mar 12, 2025
148b615
Add: SVE backend for sorting
ashvardanian Mar 12, 2025
7d534fb
Improve: Better sorting benchmarks
ashvardanian Mar 12, 2025
e860af0
Add: Intersection benchmarks
ashvardanian Mar 13, 2025
0074ad7
Improve: Naming benchmark names
ashvardanian Mar 13, 2025
991a78b
Make: Rename `bench_sort`
ashvardanian Mar 13, 2025
d1a3779
Improve: `do_not_optimize` token-level results
ashvardanian Mar 13, 2025
9fbdc9a
Add: SVE2 macros
ashvardanian Mar 13, 2025
d6f87d7
Add: SVE & SVE2 bytesum
ashvardanian Mar 13, 2025
a007c7c
Add: Short string hashing in SVE2
ashvardanian Mar 13, 2025
fafb8b0
Docs: SVE2 intersects TODO
ashvardanian Mar 13, 2025
c12e6c4
Add: New string similarity benchmarks
ashvardanian Mar 13, 2025
56a49c8
Add: Faster `find_byte` with SVE
ashvardanian Mar 13, 2025
d7ede5d
Add: Draft `sz_find_sve`
ashvardanian Mar 13, 2025
06ea5f7
Fix: Dispatching SVE kernels
ashvardanian Mar 13, 2025
efda23b
Improve: Use `SQINCP` in SVE for increments
ashvardanian Mar 13, 2025
3d77ec6
Fix: Drop double negation in logging
ashvardanian Mar 13, 2025
905749c
Fix: `bench_sequence` CMake target
ashvardanian Mar 13, 2025
68449fd
Improve: More simple substring search tests
ashvardanian Mar 13, 2025
a493ab8
Fix: `find_sve` mask update on long needles
ashvardanian Mar 13, 2025
77e482e
Improve: Unrolled serial hashing
ashvardanian Mar 13, 2025
a1604b7
Improve: Simpler SVE find nested loop
ashvardanian Mar 13, 2025
dfd6ddf
Add: New memory benchmarks
ashvardanian Mar 14, 2025
0bba772
Improve: `fill_random` checksums in benchmarks
ashvardanian Mar 14, 2025
2ab635e
Improve: Bold benchmark names in CLI
ashvardanian Mar 14, 2025
75ef77e
Fix: Expanding feature-detecting macros
ashvardanian Mar 14, 2025
45b15b0
Make: Patch passing `SZ_USE_SVE` definitions
ashvardanian Mar 14, 2025
343b858
Docs: Listing bench details
ashvardanian Mar 14, 2025
4cb096b
Fix: No return
ashvardanian Mar 14, 2025
928fd79
Docs: Formatting and references
ashvardanian Mar 14, 2025
44485fb
Make: CMake formatting
ashvardanian Mar 14, 2025
7e08180
Fix: Compiling Py bindings
ashvardanian Mar 14, 2025
e467649
Add: Sorting in Rust
ashvardanian Mar 14, 2025
cc71d1a
Merge branch 'main-dev' of https://github.com/ashvardanian/StringZill…
ashvardanian Mar 14, 2025
0c08564
Fix: Remove missing Ice-Lake benchs
ashvardanian Mar 14, 2025
06784fc
Add: Set intersections in Rust
ashvardanian Mar 14, 2025
34660f2
Improve: Construct `Byteset::from_bytes`
ashvardanian Mar 14, 2025
667ea91
Add: Expose `find_byteset` in Rust
ashvardanian Mar 14, 2025
811fc59
Improve: Use `GiB` over `GB`
ashvardanian Mar 14, 2025
763538e
Add: Stateful hashing in Rust
ashvardanian Mar 15, 2025
1b3cdd5
Improve: Align inner hash-states
ashvardanian Mar 15, 2025
7e65a1e
Fix: Unaligned `sz_hash_state_t` stores
ashvardanian Mar 15, 2025
9460fd4
Docs: `uv` instructions
ashvardanian Mar 16, 2025
dd57536
Docs: Formatting
ashvardanian Mar 16, 2025
6f8cdb9
Add: OpenMP C++ draft
ashvardanian Mar 16, 2025
1907d2b
Make: List `scripts/` deps for `uv`
ashvardanian Mar 16, 2025
3090472
Add: `lookup` transform in Rust
ashvardanian Mar 16, 2025
28282d2
Improve: Inline cheap calls
ashvardanian Mar 16, 2025
42270c8
Fix: Compiling SVE on MacOS
ashvardanian Mar 16, 2025
778d4f0
Fix: `sz::lookup` examples in Rs
ashvardanian Mar 16, 2025
b55c696
Docs: Showcase indexing diagonals
ashvardanian Mar 16, 2025
427d5b5
Add: OpenMP `score_diagonally`
ashvardanian Mar 16, 2025
ef53f75
Fix: Diagonals depth
ashvardanian Mar 17, 2025
2727a87
Improve: Expose rune-parsing headers
ashvardanian Mar 22, 2025
37863c9
Fix: Unaligned loads/stores of hash state
ashvardanian Mar 22, 2025
6b5ef98
Fix: Shifting Levenshtein diagonals in OpenMP
ashvardanian Mar 22, 2025
bc311b3
Improve: Cleaner API for OpenMP
ashvardanian Mar 23, 2025
d369170
Fix: Calling unused helper struct unit tests
ashvardanian Mar 28, 2025
bca734a
Fix: Overriding `SZ_DEBUG` macro
ashvardanian Mar 28, 2025
aac2e8f
Fix: Sign-cast warning in `_mm256_set1_epi64x`
ashvardanian Mar 28, 2025
64b40a9
Fix: Track capacity in fixed buffer alloc
ashvardanian Mar 28, 2025
a6c0fa2
Fix: NVCC warning for negative size field
ashvardanian Mar 28, 2025
e4e517f
Fix: Accounting for different gap costs
ashvardanian Mar 28, 2025
e82d045
Improve: Share C++ macros and typedefs
ashvardanian Mar 28, 2025
7524882
Fix: Hardening `malloc(0)` behavior
ashvardanian Mar 28, 2025
2861a85
Add: Parallel GPU kernels
ashvardanian Mar 28, 2025
1b96ef4
Make: Draft CUDA compilation
ashvardanian Mar 28, 2025
ea7647f
Make: NVCC can't handle `fsanitize`
ashvardanian Mar 28, 2025
27ad8f5
Fix: Shared memory requirements
ashvardanian Mar 28, 2025
3b85a00
Fix: Synchronizing CUDA kernel launch
ashvardanian Mar 28, 2025
668a386
Fix: Overwriting alignment scores
ashvardanian Mar 28, 2025
fa4b0f4
Fix: Arrow-like string array
ashvardanian Mar 28, 2025
0c0ff42
Make: NVCC kernel debugging symbols
ashvardanian Mar 28, 2025
efcadd1
Fix: Overflow `mean_token_length` calculation
ashvardanian Mar 29, 2025
247c6ec
Fix: Shuffle datasets with over 4B tokens
ashvardanian Mar 30, 2025
b1c9a74
Improve: Allow datasets in VRAM
ashvardanian Mar 30, 2025
1671b0f
Make: Compile with OpenMP
ashvardanian Mar 30, 2025
109d8de
Add: Local alignment adaptation
ashvardanian Mar 31, 2025
1c1582f
Improve: Move capabilities to `types.h`
ashvardanian Mar 31, 2025
c5fd4bc
Make: Separate StringCuZilla
ashvardanian Mar 31, 2025
02a2481
Add: Horizontal scoring in OpenMP
ashvardanian Apr 1, 2025
044c7cc
Add: Baseline NW and SW alignment with ~O(NM) space
ashvardanian Apr 1, 2025
f42aa85
Docs: StringCuZilla design choices
ashvardanian Apr 1, 2025
7ac0fdd
Add: CUDA scoring benchmarks
ashvardanian Apr 1, 2025
0c6ff1f
Make: Revert to C++ for core tests
ashvardanian Apr 1, 2025
3ef1d26
Improve: Shorter type aliases
ashvardanian Apr 1, 2025
e4b1bbd
Fix: Avoid similarity scoring references
ashvardanian Apr 2, 2025
238c86d
Fix: Missing `sz_i16_t` definition
ashvardanian Apr 2, 2025
2ef667c
Improve: Annotate throwing exceptions
ashvardanian Apr 3, 2025
6d7f221
Add: `arrow_strings_tape::try_append`
ashvardanian Apr 3, 2025
6751e79
Add: Thrust-like `constant_iterator`
ashvardanian Apr 3, 2025
85b9675
Add: Parallel SW & NW scoring in OpenMP
ashvardanian Apr 3, 2025
3cb8dd0
Add: Separate parallel & serial tests
ashvardanian Apr 3, 2025
c96c5ed
Improve: Differentiate min/max-imizers
ashvardanian Apr 3, 2025
fea39ed
Fix: NW/SW test correspondence
ashvardanian Apr 3, 2025
8a99373
Add: Blosum62 and NUC.4.4 matrices
ashvardanian Apr 5, 2025
ca0fece
Add: New NW and SW GPU kernels
ashvardanian Apr 5, 2025
4c75d81
Fix: Passing CUDA similarity tests
ashvardanian Apr 5, 2025
53d3e0d
Add: Mem consumption CUDA tests
ashvardanian Apr 8, 2025
4a62715
Add: Warp-shuffle optimizations
ashvardanian Apr 8, 2025
2eeab83
Make: Rename test files
ashvardanian Apr 8, 2025
b175103
Make: Separate Parallel C++ and CUDA tests
ashvardanian Apr 8, 2025
05f17a6
Fix: Report error codes in tests
ashvardanian Apr 8, 2025
aca3301
Add: New batch similarity benchmarks
ashvardanian Apr 8, 2025
90bfcb6
Make: Move similarity benchmarks
ashvardanian Apr 8, 2025
b3db596
Fix: Avoid `std::iterator` dependency
ashvardanian Apr 8, 2025
4c87404
Fix: `std::allocator::rebind` deprecated
ashvardanian Apr 8, 2025
bc59ee3
Fix: Compiling `constant_iterator` in CUDA
ashvardanian Apr 8, 2025
d0bdd17
Add: Fetching Nvidia GPU specs
ashvardanian Apr 8, 2025
bd2a21d
Improve: Allow custom validators in benchmarks
ashvardanian Apr 8, 2025
3c8d181
Add: Ice Lake similarity kernels
ashvardanian Apr 9, 2025
c36d2b8
Add: `gpu_specs` and `cuda_status_t`
ashvardanian Apr 9, 2025
f64fc56
Add: Repeat similarity benchmarks on CPU
ashvardanian Apr 9, 2025
35ba76c
Improve: Use tagged C `enum`s
ashvardanian Apr 10, 2025
91df0ce
Fix: Initializing horizontal aligner
ashvardanian Apr 11, 2025
10279f9
Improve: New template SFINAE in `similarity.hpp`
ashvardanian Apr 11, 2025
6edcf7f
Fix: Uniform costs for UTF-32 runes
ashvardanian Apr 11, 2025
ee38a22
Fix: Included filename
ashvardanian Apr 11, 2025
9db0287
Fix: Build issues
ashvardanian Apr 11, 2025
1ece547
Make: Parallel test launchers
ashvardanian Apr 11, 2025
b04e934
Improve: Better UTF8 tests for similarity
ashvardanian Apr 11, 2025
f76fa40
Add: Multi-flag `sz_caps`
ashvardanian Apr 11, 2025
cb739e2
Improve: Catch & log exceptions
ashvardanian Apr 11, 2025
8cc9794
Fix: `i16` Ice Lake NW/SW alignment
ashvardanian Apr 11, 2025
382c05d
Improve: Show signed integers in SIMD types
ashvardanian Apr 11, 2025
0a04960
Improve: Forward-declare substituters
ashvardanian Apr 12, 2025
cebd180
Improve: Use CUDA constant memory
ashvardanian Apr 12, 2025
f7d365a
Fix: Missing STL includes
ashvardanian Apr 12, 2025
7bd92c5
Improve: Generalize memory requirements estimates
ashvardanian Apr 12, 2025
9d86d4c
Fix: Forward GPU specs in CUDA tests
ashvardanian Apr 12, 2025
1879aeb
Add: NW benchmarks on GPU
ashvardanian Apr 12, 2025
41e1a6e
Fix: Using OpenMP directives
ashvardanian Apr 12, 2025
7a9a243
Fix: Underutilized 99% of the H100
ashvardanian Apr 12, 2025
eabe605
Fix: Propagate substitutions to benchmarks
ashvardanian Apr 12, 2025
6da5e1e
Fix: `sz_copy_skylake` tail handling on large input (#222)
njnes Apr 13, 2025
d4d55fa
Improve: Consistent shuffling behavior in benchmarks
ashvardanian Apr 13, 2025
a83cdb5
Add: Multi-pattern exact substring search
ashvardanian Apr 14, 2025
e27f86f
Fix: Aho Corasick construction
ashvardanian Apr 14, 2025
0c6dd00
Fix: Use smaller types in BFS queue
ashvardanian Apr 14, 2025
42fa08d
Fix: Indexing needle IDs
ashvardanian Apr 14, 2025
d0ebee8
Fix: Calculating `find_many_match_t` properties
ashvardanian Apr 14, 2025
faad971
Fix: Passing multi-needle tests
ashvardanian Apr 15, 2025
e7065f0
Merge branch 'main-dev' of https://github.com/ashvardanian/StringZill…
ashvardanian Apr 15, 2025
a5ca2ac
Add: `indexed_container_iterator`
ashvardanian Apr 15, 2025
b70096b
Add: Immutable iterators for Arrow tapes
ashvardanian Apr 15, 2025
ac6218a
Add: `safe_vector`, `safe_array`
ashvardanian Apr 15, 2025
0a26059
Add: Random access ranges
ashvardanian Apr 15, 2025
a0fd136
Improve: Pointer-constructible spans
ashvardanian Apr 15, 2025
783cfa8
Add: Overflow-risk error codes
ashvardanian Apr 15, 2025
cf53b9b
Add: Draft parallel substring search
ashvardanian Apr 15, 2025
c8dc314
Add: Parallel multi-needle search with OpenMP
ashvardanian Apr 16, 2025
83bc966
Fix: `bytes_per_core_optimal` estimate
ashvardanian Apr 16, 2025
6ebc7b0
Improve: Parallel baseline for substring search
ashvardanian Apr 16, 2025
f228264
Improve: Use `std::execution` for baseline tests
ashvardanian Apr 17, 2025
7bcc803
Fix: Slicing corner-cases in OpenMP
ashvardanian Apr 18, 2025
d47d849
Improve: Splitting jobs in baseline multi-search
ashvardanian Apr 18, 2025
0895d28
Improve: Faster multi-needle tests
ashvardanian Apr 18, 2025
b85d3ca
Fix: Over-estimating number of overlapping matches
ashvardanian Apr 18, 2025
938bf6f
Add: Levenshtein on Kepler
ashvardanian Apr 20, 2025
914e98d
Improve: Warp-shuffle reductions in SW
ashvardanian Apr 22, 2025
871d7bc
Add: Draft device-wide similarity kernels
ashvardanian Apr 22, 2025
83dd5fb
Improve: Fetch warp-size dynamically
ashvardanian Apr 23, 2025
9c5a56c
Improve: Alloc type-size check in `safe_vector`
ashvardanian Apr 23, 2025
945613c
Add: Affine gap extensions baseline
ashvardanian Apr 23, 2025
07196d0
Add: Draft global scoring in CUDA
ashvardanian Apr 23, 2025
25ab3b6
Add: Affine Levenshtein-Gotoh variants
ashvardanian Apr 24, 2025
e5d85f3
Fix: Horizontal Affine Walkers
ashvardanian Apr 24, 2025
8c15447
Improve: Differentiate unary & uniform costs
ashvardanian Apr 24, 2025
640853f
Fix: Initializing affine DP matrices
ashvardanian Apr 24, 2025
84397ae
Fix: All Gotoh baselines
ashvardanian Apr 24, 2025
2b4fa09
Add: Unmasked Ice Lake Levenshtein
ashvardanian Apr 24, 2025
9b2b4e5
Fix: Avoid horizontal walker overflow
ashvardanian Apr 24, 2025
b4eb6a4
Improve: Measure gap magnitude
ashvardanian Apr 24, 2025
3998c1f
Improve: Branchless K-mask calculation
ashvardanian Apr 24, 2025
10b5405
Improve: Aligned ZMM diagonal stores
ashvardanian Apr 25, 2025
0b33ac3
Docs: More datasets on HuggingFace
ashvardanian Apr 25, 2025
f7b091c
Add: C++20 concepts
ashvardanian Apr 25, 2025
e1a37de
Add: Draft executors for StringCuZilla
ashvardanian Apr 26, 2025
8e79fb7
Fix: Inclusion guard macro names
ashvardanian Apr 26, 2025
1975351
Add: Draft StringCuZilla C API
ashvardanian Apr 26, 2025
6be0da5
Add: Concepts & new multithreading scheme
ashvardanian Apr 27, 2025
29da524
Improve: Generalize Levenshtein in CUDA
ashvardanian Apr 30, 2025
7e778d9
Improve: Naming "executor" interfaces
ashvardanian Apr 30, 2025
22c691e
Improve: Use `requires` clause
ashvardanian Apr 30, 2025
23a7f58
Improve: Bounded methods not supported
ashvardanian Apr 30, 2025
60b99ab
Fix: Inconsistent timing in `bench_unary`
ashvardanian Apr 30, 2025
c6e907a
Improve: `bytes_per_cell_t` enum
ashvardanian Apr 30, 2025
6a61e1b
Improve: Avoid `views::group_by` with callback
ashvardanian Apr 30, 2025
77b1087
Improve: Scheduling Levenshtein in CUDA
ashvardanian Apr 30, 2025
563a73d
Fix: Correcting blends on Kepler
ashvardanian Apr 30, 2025
f59ba5e
Fix: Ice Lake calls for empty inputs
ashvardanian May 1, 2025
c709621
Make: Bump CUDA version
ashvardanian May 3, 2025
bfcd10e
Improve: Duplicate `bytesum` assignment
ashvardanian May 3, 2025
5b410b8
Add: `sz_similarity_gaps_t` enums
ashvardanian May 3, 2025
bd3d341
Make: Add `fork_union` dependency
ashvardanian May 4, 2025
8b62f2c
Docs: Inconsistent naming
ashvardanian May 4, 2025
81463d4
Improve: Use `fork_union` pools
ashvardanian May 4, 2025
1d6f58d
Improve: Match `fork_union` API in executors
ashvardanian May 4, 2025
3bbebc8
Improve: Test weird affine gaps
ashvardanian May 4, 2025
3e1df93
Add: Affine Levenshtein variants on GPU
ashvardanian May 4, 2025
076e58a
Fix: Levenshtein w. Affine costs on GPU for zero-length ins
ashvardanian May 5, 2025
b9e4160
Fix: Affine top row initialization
ashvardanian May 5, 2025
51afad5
Add: Ice Lake Affine Levenshtein kernels
ashvardanian May 5, 2025
79e7a2f
Improve: Shrink Affine Ice Lake kernels
ashvardanian May 5, 2025
d603f7d
Improve: Fuzzy test Ice Lake kernels
ashvardanian May 5, 2025
64d8f4d
Fix: Comparing Affine benchmark results
ashvardanian May 5, 2025
3b2f263
Improve: Run multiple warps per block
ashvardanian May 6, 2025
8a6a185
Improve: Scheduling speculative kernels
ashvardanian May 6, 2025
ece416d
Improve: More noticeable signaling in tests
ashvardanian May 6, 2025
f09bbf9
Add: Affine gaps Levenshtein on Kepler
ashvardanian May 6, 2025
cb7c48d
Add: Hopper Levenshtein kernels
ashvardanian May 6, 2025
e53c8d5
Add: NW and SW for Hopper
ashvardanian May 6, 2025
4134e44
Fix: NW & SW on Hopper
ashvardanian May 6, 2025
cd9ff1e
Fix: OOB impact on SW scoring
ashvardanian May 7, 2025
817b15c
Improve: Divergent branches on `i16` SW on Hopper
ashvardanian May 7, 2025
6aaf16c
Add: Parallel Ice Lake variants
ashvardanian May 7, 2025
d087afa
Add: CUDA Aho Corasick placeholder
ashvardanian May 7, 2025
35fad3d
Improve: Support executors in multi-pattern search
ashvardanian May 7, 2025
ade74f6
Fix: Checking `STRINGWARS_STRESS` env-var
ashvardanian May 8, 2025
df1f7ef
Improve: New caching primitives
ashvardanian May 8, 2025
496e55a
Docs: Aho-Corasick CUDA design
ashvardanian May 10, 2025
6f65624
Make: Rename `bench_search` -> `bench_find`
ashvardanian May 11, 2025
aaa6927
Improve: Benchmark early exit
ashvardanian May 11, 2025
0cf994a
Improve: Custom validators for nullary benchmarks
ashvardanian May 13, 2025
2d594e4
Add: `bench_find_many.cpp`
ashvardanian May 13, 2025
0c442d1
Improve: Propagate allocators in `safe_vector`
ashvardanian May 14, 2025
85c5bf8
Improve: Multi-byte characters support
ashvardanian May 14, 2025
853a0fa
Fix: Missing `std::generate` include
ashvardanian May 14, 2025
20c9135
Improve: New multi-pattern search APIs
ashvardanian May 14, 2025
5302a4d
Improve: Skip vocabulary duplicates
ashvardanian May 17, 2025
887c16b
Fix: Buffer overflow due to wrong thread count
ashvardanian May 17, 2025
3ee1676
Fix: Including the entire haystack into match
ashvardanian May 17, 2025
69c7b6a
Fix: Prefix length in parallel counting of short needles
ashvardanian May 17, 2025
16f5fa4
Fix: Compiler over-optimizing `bench_find_many`
ashvardanian May 17, 2025
3972cb3
Docs: Move segmenting & features drafts
ashvardanian May 18, 2025
13e0201
Fix: `unified_alloc` propagation
ashvardanian May 18, 2025
c45017d
Improve: `__reduce_max_sync` in SW on Hopper
ashvardanian May 18, 2025
0acbeeb
Fix: Revert to separate `find_many` algos for different length
ashvardanian May 18, 2025
3f5de03
Make: Bump `fork_union`
ashvardanian May 18, 2025
f5a94b1
Docs: Sections & links
ashvardanian Jun 3, 2025
966dd09
Fix: Match C++ class names
ashvardanian Jun 3, 2025
503e470
Docs: Target names
ashvardanian Jun 3, 2025
5 changes: 3 additions & 2 deletions .clang-format
@@ -6,6 +6,7 @@ NamespaceIndentation: None
 ColumnLimit: 120
 ReflowComments: true
 UseTab: Never
+IndentPPDirectives: None

 AlignConsecutiveAssignments: false
 AlignConsecutiveDeclarations: false
@@ -44,8 +45,8 @@ BraceWrapping:
 SplitEmptyNamespace: false
 IndentBraces: false

-SortIncludes: true
-SortUsingDeclarations: true
+SortIncludes: false
+SortUsingDeclarations: false

 SpaceAfterCStyleCast: false
 SpaceAfterLogicalNot: false
19 changes: 19 additions & 0 deletions .cmake-format.py
@@ -0,0 +1,19 @@
# -----------------------------
# Options affecting formatting.
# -----------------------------
with section("format"):
    # How wide to allow formatted cmake files
    line_width = 120

    # How many spaces to tab for indent
    tab_size = 4

    # If true, separate flow control names from their parentheses with a space
    separate_ctrl_name_with_space = True

    # If true, separate function names from parentheses with a space
    separate_fn_name_with_space = False

    # If a statement is wrapped to more than one line, then dangle the closing
    # parenthesis on its own line.
    dangle_parens = True
42 changes: 42 additions & 0 deletions .git-blame-ignore-revs
@@ -0,0 +1,42 @@
6512f1d129aeddc8601c9df7332c135038914b68
fc9e5d61e5fb1c5031f6f10920f6b50e2530de1e
ad2af78f8651870727c5b39e1fea2eff26d71d2f
49e8d9d240993bdf68715a9c87824a032752798d
fc408fa0a0f2d947c610568bd7a5c4a60ecca443
b835051c09a0ecfc420932de444f3c6839610764
1ba7982559111d4fc9b58caa7bc7aa1c6e64257c
5b55e19d1378c61da88309b30a38f9cf7c64bf79
be4c63d926c8628451726863e4d14dbd1ea374dd
8b401bd41e4bd9c29c8fad9a5b83d8232efa50c7
295d49a38d66b08075357ac829ad66d80b5edab0
2a1fcd113d217e3124f6501c38e93a318aca37f0
2f7652141bd8dc3c2c38ab34321567bfcdb91d93
9e3180019acffe5261f0a1713b4ea324dca79ea0
45e57eefd796841cbd14ee7f75ec42b42b5bde0c
66778d6b2b3aa0eed27e32fbdceef79b8c54eda5
c357c3ea756523d3bcc8d8f25068ad08aef5456d
9b1948b3771c21dd56954e5f43301ca8a0b8b1a9
cbfe5c7ac6371047eae88621b092297474d0b82a
085d2d3c8b99e0f90d320dd027040e554e410929
3464cb428ae9a8721ab82a8c4bff214aa9ce6254
5d0d2da422c7df96f9613ec843cd47c579a2edce
89c46810c2f9bfafa31f8592339f9a1b45dcc245
3f9c248fbf59add2246055462e8fc19dc9f1693b
e23c35ff2c2d4ccb752f4ffbf9b6f39a1677b532
7fdc58fd26e06c41052287d47a9c729c068a95ca
10d829efcb8ed4cfa5f2db4050f8403184484423
d74e5dca2e62eb0078cb2ebacc0dac2b8bb92d54
1f60e6d7c81f0e285e594eb63fee6119e05a3e69
a6768af38b40307fe66364403f141c285b3e164c
08d0a20d35d3b29a44b9c8a826d53435c3ef839c
9e9f2567d052d635722921a1d70ec63d69ec6669
974ed78822dc0b519dd61bc1c4dc18d59fe4ad15
b007ba571860e1d3737d1478c7f8d66ae1839e36
14ba3bf3c43408438a7de9ad57118c747c1347b1
9e577be71dcd2e20854bf55f08c54854b3e82989
8cb0742b2d1b31b61fac5272f17017953c6677e6
bd547453122e9f8565e5be15f137e7b0de37caca
22e3d1e34d62d68c1e89df7c8bdc201faa18a9de
ecb377541d0c706cf8997faff4f026b07e3f76f3
0d982a45f842287d7e344f0d8b360f52482017f5
467b4b81cb4bc0e9a64844748a417762378918c9
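Git reads a file like this through `git blame --ignore-revs-file` (or the `blame.ignoreRevsFile` setting) and expects one unabbreviated object name per line, with blank lines and `#` comments ignored. A minimal sketch of a sanity check for such a file follows; the helper name `invalid_entries` is hypothetical, and it assumes a SHA-1 repository where full names are 40 hexadecimal characters.

```python
import re

# Assumption: SHA-1 object names, i.e. exactly 40 lowercase hex characters.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def invalid_entries(text: str) -> list[str]:
    """Return entries that are not full 40-character hexadecimal names."""
    bad = []
    for raw in text.splitlines():
        # Drop trailing "#" comments and surrounding whitespace.
        entry = raw.split("#", 1)[0].strip()
        if entry and not FULL_SHA1.fullmatch(entry):
            bad.append(entry)
    return bad


sample = """\
6512f1d129aeddc8601c9df7332c135038914b68
fc9e5d61e5fb1c5031f6f10920f6b50e2530de1e
244e605  # abbreviated, so it is flagged
"""
print(invalid_entries(sample))  # prints ['244e605']
```

To make `git blame` pick the file up locally, one would typically run `git config blame.ignoreRevsFile .git-blame-ignore-revs` once per clone.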
17 changes: 9 additions & 8 deletions .github/workflows/prerelease.yml
@@ -37,11 +37,11 @@ jobs:
 package.json:"version": "(\d+\.\d+\.\d+)"
 CMakeLists.txt:VERSION (\d+\.\d+\.\d+)
 update-major-version-in: |
-include/stringzilla/stringzilla.h:^#define STRINGZILLA_VERSION_MAJOR (\d+)
+include/stringzilla/stringzilla.h:^#define STRINGZILLA_H_VERSION_MAJOR (\d+)
 update-minor-version-in: |
-include/stringzilla/stringzilla.h:^#define STRINGZILLA_VERSION_MINOR (\d+)
+include/stringzilla/stringzilla.h:^#define STRINGZILLA_H_VERSION_MINOR (\d+)
 update-patch-version-in: |
-include/stringzilla/stringzilla.h:^#define STRINGZILLA_VERSION_PATCH (\d+)
+include/stringzilla/stringzilla.h:^#define STRINGZILLA_H_VERSION_PATCH (\d+)
 dry-run: "true"

 test_ubuntu_gcc:
@@ -87,10 +87,11 @@ jobs:
 run: build_artifacts/stringzilla_test_cpp20
 - name: Test on Real World Data
 run: |
+build_artifacts/stringzilla_bench_memory ${DATASET_PATH} # for string copies and fills
 build_artifacts/stringzilla_bench_search ${DATASET_PATH} # for substring search
 build_artifacts/stringzilla_bench_token ${DATASET_PATH} # for hashing, equality comparisons, etc.
 build_artifacts/stringzilla_bench_similarity ${DATASET_PATH} # for edit distances and alignment scores
-build_artifacts/stringzilla_bench_sort ${DATASET_PATH} # for sorting arrays of strings
+build_artifacts/stringzilla_bench_sequence ${DATASET_PATH} # for sorting arrays of strings
 build_artifacts/stringzilla_bench_container ${DATASET_PATH} # for STL containers with string keys
 env:
 DATASET_PATH: ./README.md
@@ -174,7 +175,7 @@ jobs:
 build_artifacts/stringzilla_bench_search ${DATASET_PATH} # for substring search
 build_artifacts/stringzilla_bench_token ${DATASET_PATH} # for hashing, equality comparisons, etc.
 build_artifacts/stringzilla_bench_similarity ${DATASET_PATH} # for edit distances and alignment scores
-build_artifacts/stringzilla_bench_sort ${DATASET_PATH} # for sorting arrays of strings
+build_artifacts/stringzilla_bench_sequence ${DATASET_PATH} # for sorting arrays of strings
 build_artifacts/stringzilla_bench_container ${DATASET_PATH} # for STL containers with string keys
 env:
 DATASET_PATH: ./README.md
@@ -275,7 +276,7 @@ jobs:
 # We can't run the produced builds, but we can make sure they exist
 - name: Test artifacts presense
 run: |
-test -e build_artifacts/libstringzillite.so
+test -e build_artifacts/libstringzilla_bare.so
 test -e build_artifacts/libstringzilla_shared.so
 test -e build_artifacts/stringzilla_test_cpp20

@@ -306,7 +307,7 @@ jobs:
 build_artifacts/stringzilla_bench_search ${DATASET_PATH} # for substring search
 build_artifacts/stringzilla_bench_token ${DATASET_PATH} # for hashing, equality comparisons, etc.
 build_artifacts/stringzilla_bench_similarity ${DATASET_PATH} # for edit distances and alignment scores
-build_artifacts/stringzilla_bench_sort ${DATASET_PATH} # for sorting arrays of strings
+build_artifacts/stringzilla_bench_sequence ${DATASET_PATH} # for sorting arrays of strings
 build_artifacts/stringzilla_bench_container ${DATASET_PATH} # for STL containers with string keys
 env:
 DATASET_PATH: ./README.md
@@ -379,7 +380,7 @@ jobs:
 .\build_artifacts\stringzilla_bench_search.exe ${DATASET_PATH} # for substring search
 .\build_artifacts\stringzilla_bench_token.exe ${DATASET_PATH} # for hashing, equality comparisons, etc.
 .\build_artifacts\stringzilla_bench_similarity.exe ${DATASET_PATH} # for edit distances and alignment scores
-.\build_artifacts\stringzilla_bench_sort.exe ${DATASET_PATH} # for sorting arrays of strings
+.\build_artifacts\stringzilla_bench_sequence.exe ${DATASET_PATH} # for sorting arrays of strings
 .\build_artifacts\stringzilla_bench_container.exe ${DATASET_PATH} # for STL containers with string keys
 env:
 DATASET_PATH: ./README.md
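The version-bump entries in this workflow pair a file path with a regular expression whose single capture group holds the number to replace. How the release action applies them is not shown in this diff, so the following is only a sketch of the idea: the `bump_major` helper and the header excerpt are hypothetical, while the two patterns are copied from the `update-major-version-in` entries above. It also illustrates why the pattern had to be renamed together with the macro, since the old pattern no longer matches a header that uses the `STRINGZILLA_H_` prefix.

```python
import re

# Patterns copied from the workflow entries (old and new macro names).
OLD_MAJOR = re.compile(r"^#define STRINGZILLA_VERSION_MAJOR (\d+)", re.MULTILINE)
NEW_MAJOR = re.compile(r"^#define STRINGZILLA_H_VERSION_MAJOR (\d+)", re.MULTILINE)

# Hypothetical header excerpt; the version numbers are placeholders.
header = """\
#define STRINGZILLA_H_VERSION_MAJOR 4
#define STRINGZILLA_H_VERSION_MINOR 0
#define STRINGZILLA_H_VERSION_PATCH 0
"""


def bump_major(text: str, new_major: int) -> str:
    """Replace only the captured digits, keeping the rest of the line intact."""

    def replace(match: re.Match) -> str:
        prefix_len = match.start(1) - match.start(0)
        return match.group(0)[:prefix_len] + str(new_major)

    return NEW_MAJOR.sub(replace, text)


print(OLD_MAJOR.search(header))  # prints None: the old macro name is gone
print(bump_major(header, 5).splitlines()[0])
```

Anchoring the pattern with `^` under `re.MULTILINE` keeps it from matching a commented-out or indented copy of the macro elsewhere in the header.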
30 changes: 15 additions & 15 deletions .github/workflows/release.yml
@@ -37,11 +37,11 @@ jobs:
 package.json:"version": "(\d+\.\d+\.\d+)"
 CMakeLists.txt:VERSION (\d+\.\d+\.\d+)
 update-major-version-in: |
-include/stringzilla/stringzilla.h:^#define STRINGZILLA_VERSION_MAJOR (\d+)
+include/stringzilla/stringzilla.h:^#define STRINGZILLA_H_VERSION_MAJOR (\d+)
 update-minor-version-in: |
-include/stringzilla/stringzilla.h:^#define STRINGZILLA_VERSION_MINOR (\d+)
+include/stringzilla/stringzilla.h:^#define STRINGZILLA_H_VERSION_MINOR (\d+)
 update-patch-version-in: |
-include/stringzilla/stringzilla.h:^#define STRINGZILLA_VERSION_PATCH (\d+)
+include/stringzilla/stringzilla.h:^#define STRINGZILLA_H_VERSION_PATCH (\d+)
 dry-run: "false"
 push: "true"
 create-release: "true"
@@ -253,15 +253,15 @@ jobs:

 cmake --build build_release --config Release

-cp build_release/libstringzillite.so "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.so"
-mkdir -p "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/DEBIAN"
-touch "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/DEBIAN/control"
-mkdir -p "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/lib"
-mkdir "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/include"
-cp include/stringzilla/stringzilla.h "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/include/"
-cp build_release/libstringzillite.so "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/lib/"
-echo -e "Package: stringzilla\nVersion: ${{ steps.set_version.outputs.version }}\nMaintainer: Ash Vardanian\nArchitecture: ${{ matrix.arch }}\nDescription: SIMD-accelerated string search, sort, hashes, fingerprints, & edit distances" > "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/DEBIAN/control"
-dpkg-deb --build "stringzillite_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}"
+cp build_release/libstringzilla_bare.so "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.so"
+mkdir -p "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/DEBIAN"
+touch "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/DEBIAN/control"
+mkdir -p "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/lib"
+mkdir "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/include"
+cp include/stringzilla/stringzilla.h "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/include/"
+cp build_release/libstringzilla_bare.so "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/usr/local/lib/"
+echo -e "Package: stringzilla\nVersion: ${{ steps.set_version.outputs.version }}\nMaintainer: Ash Vardanian\nArchitecture: ${{ matrix.arch }}\nDescription: SIMD-accelerated string search, sort, hashes, fingerprints, & edit distances" > "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}/DEBIAN/control"
+dpkg-deb --build "stringzilla_bare_linux_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}"

 - name: Upload library
 uses: xresloader/upload-to-github-release@v1
@@ -314,14 +314,14 @@ jobs:
 run: |
 cmake -DCMAKE_BUILD_TYPE=Release -B build_release
 cmake --build build_release --config Release
-tar -cvf "stringzillite_windows_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.tar" "build_release/stringzillite.dll" "./include/stringzilla/stringzilla.h"
+tar -cvf "stringzilla_bare_windows_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.tar" "build_release/stringzilla_bare.dll" "./include/stringzilla/stringzilla.h"

 - name: Upload archive
 uses: xresloader/upload-to-github-release@v1
 env:
 GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 with:
-file: "stringzillite_windows_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.tar"
+file: "stringzilla_bare_windows_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.tar"
 update_latest_release: true

 create_macos_library:
@@ -349,7 +349,7 @@ jobs:
 run: |
 cmake -DCMAKE_BUILD_TYPE=Release -B build_release
 cmake --build build_release --config Release
-zip -r stringzillite_macos_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.zip build_release/libstringzillite.dylib include/stringzilla/stringzilla.h
+zip -r stringzilla_bare_macos_${{ matrix.arch }}_${{ steps.set_version.outputs.version }}.zip build_release/libstringzilla_bare.dylib include/stringzilla/stringzilla.h

 - name: Upload archive
 uses: xresloader/upload-to-github-release@v1
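The Linux packaging step in this workflow writes the `DEBIAN/control` file with a single `echo -e` call, which makes the field layout hard to read. As a rough sketch, the same text can be rendered as below; the function name `debian_control` is hypothetical, and the version and architecture passed in the example are placeholders for the `steps.set_version.outputs.version` and `matrix.arch` values the workflow substitutes. The field names and fixed values are taken from the `echo -e` line.

```python
def debian_control(version: str, arch: str) -> str:
    """Render the control fields written by the workflow's `echo -e` line."""
    fields = {
        "Package": "stringzilla",
        "Version": version,
        "Maintainer": "Ash Vardanian",
        "Architecture": arch,
        "Description": (
            "SIMD-accelerated string search, sort, hashes, "
            "fingerprints, & edit distances"
        ),
    }
    # One "Name: value" pair per line, with a trailing newline like `echo`.
    return "\n".join(f"{name}: {value}" for name, value in fields.items()) + "\n"


# Placeholder inputs, not values from the actual release.
print(debian_control("4.0.0", "amd64"), end="")
```

Note that the package name stays `stringzilla` in the control file even though the artifact directory is renamed from `stringzillite_linux_…` to `stringzilla_bare_linux_…`.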
11 changes: 9 additions & 2 deletions .gitignore
@@ -32,10 +32,17 @@ node_modules/
*.tar.gz

# Recommended datasets
utf8.txt
leipzig1M.txt
enwik9.txt
xlsum.csv
human_protein_1200row_800len.txt
acgt_100.txt
acgt_100k.txt
acgt_10k.txt
acgt_10m.txt
acgt_1k.txt
acgt_1m.txt

# StringZilla-specific log files
/failed_sz_*
/failed_sz_*
/.cache/
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "fork_union"]
path = fork_union
url = https://github.com/ashvardanian/fork_union.git