
Commit 086544d

Pr 33 (Feature/performance: This PR introduces a high number of performance improvements) (#36)
* feat(performance): add benchmark project
* feat(performance): reduce public API surface
* feat(performance): reduce allocations
* feat(performance): use compiled regex for better performance
* feat(performance): run benchmark
* chore: move reorganize files
* feat(performance): replace SpecialTokenPatternRegex with faster alloc free solution
* feat(performance): reduce string allocations
* feat(performance): run benchmark
* feat(performance): add BytePairIndex class to support faster implementation in net8.0
* feat(performance): cache model parameters to do params preparation only once
* feat(performance): run benchmark
* feat(performance): improve allowedSpecialTokens handling + bug fix
* feat(performance): add support for ReadOnlySpan<char> in net8.0
* feat(performance): run benchmark
* feat(performance): use compile time generated regex in net8.0
* feat: run tests for netstandard2.0 over net471
* feat(performance): run benchmark
* feat: remove unused lastTokenLength + refactoring
* feat(performance): implement fast MultiBytePairEncoder with almost zero allocations
* feat(performance): run benchmark
* feat(performance): reduce allocations
* feat(performance): backport some optimizations to net6.0 and netstandard
* feat(performance): run benchmark
* chore: cosmetics
* feat(refactor): extract Encoding.Encode() logic into priv EncodeCore to support broader use
* feat(token-count): implement low allocation token count public method
* feat(performance): run benchmark
* feat(benchmark): don't make allocations in benchmark methods
* feat(performance): reduce minor allocations
* feat(benchmark): add benchmark for large file token count
* feat(performance): run benchmark
* fix: add test for allowedSpecialTokens and fix code
* chore: fix naming
* feat(benchmark): add another benchmark to show in README.md
* feat(readme): add benchmark to README.md and add docs for TokenCount method
* feat(performance): use multibyte cpu instructions for FastPartitionList.RemoveAt
* feat(performance): re-add fast path - got lost in refactoring
* feat(benchmark): add comparison to other tokenizer
* feat(performance): improve ByteArrayEqualityComparer
* feat(performance): small improvement
* chore: update benchmarks in README.md
* fix
* improved pipleline
* switch (key)
* i += size - 1;
* CountTokens renamed

---------

Co-authored-by: René Larch <[email protected]>
1 parent e96811a commit 086544d

35 files changed: +2774 -595 lines changed

.github/workflows/build-test-and-publish.yml

Lines changed: 4 additions & 2 deletions
@@ -36,8 +36,10 @@ jobs:
     - name: Calculate Package Version
       id: calculate_version
       run: |
-        $version = "1.2.$env:GITHUB_RUN_NUMBER"
-        echo "Calculated package version: $version"
+        $GithubRunNumber = $env:GITHUB_RUN_NUMBER
+        $Patch = $GithubRunNumber - 33
+        $version = "2.0.$Patch"
+        echo "Calculated package version: $version; Patch: $Patch; GitHub Run Number: $GithubRunNumber"
         echo "::set-output name=version::$version"
 
     - name: Restore dependencies

.github/workflows/dotnet-build-test.yml

Lines changed: 1 addition & 2 deletions
@@ -9,7 +9,6 @@ jobs:
       fail-fast: false
       matrix:
         os: [windows-latest, ubuntu-latest, macos-latest]
-        dotnet: ['netcoreapp3.1', 'net6.0', 'net8.0']
     runs-on: ${{ matrix.os }}
     steps:
     - name: Checkout repository
@@ -30,4 +29,4 @@ jobs:
       run: dotnet build --configuration Release --no-restore
 
     - name: Test
-      run: dotnet test --no-restore --verbosity normal -f ${{ matrix.dotnet }}
+      run: dotnet test --no-restore --verbosity normal

README.md

Lines changed: 167 additions & 1 deletion
@@ -60,6 +60,13 @@ And use the Decode method to decode the encoded tokens:
6060
var decoded = encoding.Decode(encoded); // Output: "Hello, world!"
6161
```
6262

63+
SharpToken also provides a high-performance method for counting tokens.
64+
It is useful for checking the size of a prompt before sending it to an LLM, or for use in a TextSplitter/Chunker for RAG.
65+
66+
```csharp
67+
var count = encoding.CountTokens("Hello, world!"); // Output: 4
68+
```
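As a rough sketch of the chunker use case mentioned above, the helper below packs paragraphs into chunks that stay within a token budget. `TokenBudgetSplitter`, `Split`, and `maxTokens` are hypothetical names and not part of SharpToken; the token counter is passed in as a delegate so that `encoding.CountTokens` can be plugged in.

```csharp
using System;
using System.Collections.Generic;

public static class TokenBudgetSplitter
{
    // Greedily packs paragraphs into chunks whose summed per-paragraph token
    // counts stay within maxTokens. The "\n\n" separators are not counted, so
    // the real token count of a joined chunk can be slightly higher.
    // A single paragraph that exceeds the budget is emitted as its own chunk.
    public static List<string> Split(
        IEnumerable<string> paragraphs,
        int maxTokens,
        Func<string, int> countTokens)
    {
        var chunks = new List<string>();
        var current = new List<string>();
        var used = 0;

        foreach (var paragraph in paragraphs)
        {
            var cost = countTokens(paragraph);

            // Close the current chunk before it would overflow the budget.
            if (current.Count > 0 && used + cost > maxTokens)
            {
                chunks.Add(string.Join("\n\n", current));
                current.Clear();
                used = 0;
            }

            current.Add(paragraph);
            used += cost;
        }

        if (current.Count > 0)
        {
            chunks.Add(string.Join("\n\n", current));
        }

        return chunks;
    }
}
```

With SharpToken, the counter could be supplied as `text => encoding.CountTokens(text)`. Taking a delegate keeps the splitter independent of the tokenizer, which also makes it easy to test with a stand-in counter.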
69+
6370
## Supported Models
6471

6572
SharpToken currently supports the following models:
@@ -98,7 +105,7 @@ Examples of model names that fall under these prefixes include:
98105
To retrieve the encoding name based on a model name or its prefix, you can use the `GetEncodingNameForModel` method:
99106

100107
```csharp
101-
string encodingName = GetEncodingNameForModel("gpt-4-0314"); // This will return "cl100k_base"
108+
string encodingName = Model.GetEncodingNameForModel("gpt-4-0314"); // This will return "cl100k_base"
102109
```
103110

104111
If the provided model name doesn't match any direct model names or prefixes, the method will return `null`.
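The lookup order described above (direct model name first, then prefix, otherwise `null`) can be sketched as follows. This is an illustration only, not SharpToken's implementation: `EncodingLookupSketch`, `Resolve`, and both tables are hypothetical, and the tables contain just the `gpt-4` entries needed for the example.

```csharp
using System;
using System.Collections.Generic;

public static class EncodingLookupSketch
{
    // Hypothetical table of direct model names (illustration only).
    private static readonly Dictionary<string, string> ExactNames =
        new Dictionary<string, string>
        {
            ["gpt-4"] = "cl100k_base",
        };

    // Hypothetical table of model name prefixes (illustration only).
    private static readonly Dictionary<string, string> Prefixes =
        new Dictionary<string, string>
        {
            ["gpt-4-"] = "cl100k_base",
        };

    // Returns the encoding name for a direct match, then for a matching
    // prefix, and null when neither table has an entry.
    public static string Resolve(string modelName)
    {
        if (ExactNames.TryGetValue(modelName, out var exact))
        {
            return exact;
        }

        foreach (var pair in Prefixes)
        {
            if (modelName.StartsWith(pair.Key, StringComparison.Ordinal))
            {
                return pair.Value;
            }
        }

        return null;
    }
}
```

Because `null` is a valid result, callers of `Model.GetEncodingNameForModel` should check it before passing the name on, for example by throwing a descriptive exception or by choosing a default encoding explicitly.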
@@ -175,6 +182,165 @@ compatibility with the Python tiktoken library. These test cases validate the fu
175182
providing a reliable reference for developers. Running the unit tests and verifying the test cases helps maintain
176183
consistency between the C# SharpToken library and the original Python implementation.
177184

185+
## Performance Compared to TiktokenSharp and TokenizerLib
186+
187+
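In the benchmark below, SharpToken allocates the least memory on every runtime and is the fastest library on .NET 8.0 and .NET Framework 4.7.1; on .NET 6.0, TokenizerLib is marginally faster (165.5 ms vs. 169.9 ms).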
188+
189+
<details>
190+
<summary>Benchmark Code</summary>
191+
192+
```csharp
193+
[SimpleJob(RuntimeMoniker.Net60)]
194+
[SimpleJob(RuntimeMoniker.Net80)]
195+
[SimpleJob(RuntimeMoniker.Net471)]
196+
[RPlotExporter]
197+
[MemoryDiagnoser]
198+
public class CompareBenchmark
199+
{
200+
private GptEncoding _sharpToken;
201+
private TikToken _tikToken;
202+
private ITokenizer _tokenizer;
203+
private string _kLongText;
204+
205+
[GlobalSetup]
206+
public async Task Setup()
207+
{
208+
_sharpToken = GptEncoding.GetEncoding("cl100k_base");
209+
_tikToken = await TikToken.GetEncodingAsync("cl100k_base").ConfigureAwait(false);
210+
_tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-4").ConfigureAwait(false);
211+
_kLongText = "King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.";
212+
}
213+
214+
[Benchmark]
215+
public int SharpToken()
216+
{
217+
var sum = 0;
218+
for (var i = 0; i < 10000; i++)
219+
{
220+
var encoded = _sharpToken.Encode(_kLongText);
221+
var decoded = _sharpToken.Decode(encoded);
222+
sum += decoded.Length;
223+
}
224+
225+
return sum;
226+
}
227+
228+
[Benchmark]
229+
public int TiktokenSharp()
230+
{
231+
var sum = 0;
232+
for (var i = 0; i < 10000; i++)
233+
{
234+
var encoded = _tikToken.Encode(_kLongText);
235+
var decoded = _tikToken.Decode(encoded);
236+
sum += decoded.Length;
237+
}
238+
239+
return sum;
240+
}
241+
242+
[Benchmark]
243+
public int TokenizerLib()
244+
{
245+
var sum = 0;
246+
for (var i = 0; i < 10000; i++)
247+
{
248+
var encoded = _tokenizer.Encode(_kLongText);
249+
var decoded = _tokenizer.Decode(encoded.ToArray());
250+
sum += decoded.Length;
251+
}
252+
253+
return sum;
254+
}
255+
}
256+
```
257+
258+
</details>
259+
260+
```
261+
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
262+
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
263+
.NET SDK 8.0.200
264+
[Host] : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
265+
.NET 6.0 : .NET 6.0.16 (6.0.1623.17311), X64 RyuJIT AVX2
266+
.NET 8.0 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
267+
.NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
268+
```
269+
+| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
+|--------------- |--------------------- |--------------------- |---------:|---------:|---------:|-----------:|----------:|----------:|
+| **SharpToken** | .NET 8.0 | .NET 8.0 | 100.4 ms | 1.95 ms | 1.91 ms | 2000.0000 | - | 22.13 MB |
+| **SharpToken** | .NET 6.0 | .NET 6.0 | 169.9 ms | 2.42 ms | 2.15 ms | 24333.3333 | 1000.0000 | 196.3 MB |
+| **SharpToken** | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 455.3 ms | 8.34 ms | 6.97 ms | 34000.0000 | 1000.0000 | 204.39 MB |
+| | | | | | | | | |
+| *TiktokenSharp*| .NET 8.0 | .NET 8.0 | 211.4 ms | 1.83 ms | 1.53 ms | 42000.0000 | 1000.0000 | 338.98 MB |
+| *TiktokenSharp*| .NET 6.0 | .NET 6.0 | 258.6 ms | 5.09 ms | 6.25 ms | 39000.0000 | 1000.0000 | 313.26 MB |
+| *TiktokenSharp*| .NET Framework 4.7.1 | .NET Framework 4.7.1 | 638.3 ms | 12.47 ms | 16.21 ms | 63000.0000 | 1000.0000 | 378.31 MB |
+| | | | | | | | | |
+| *TokenizerLib* | .NET 8.0 | .NET 8.0 | 124.4 ms | 1.81 ms | 1.60 ms | 27250.0000 | 1000.0000 | 217.82 MB |
+| *TokenizerLib* | .NET 6.0 | .NET 6.0 | 165.5 ms | 1.38 ms | 1.16 ms | 27000.0000 | 1000.0000 | 217.82 MB |
+| *TokenizerLib* | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 499.7 ms | 9.81 ms | 14.07 ms | 40000.0000 | 1000.0000 | 243.79 MB |
283+
284+
285+
## Performance
286+
287+
SharpToken is extremely performance-optimized on net8.0.
288+
It uses modern multibyte CPU instructions and almost no heap allocations.
289+
290+
All core methods have been tested on a large and a small input text.
291+
292+
**Inputs:**
293+
- `SmallText`: 453 B (text/plain)
294+
- `LargeText`: 51 KB (text/html)
295+
296+
**Methods:**
297+
- `Encode`: text to tokens
298+
- `Decode`: tokens to text
299+
- `CountTokens`: high performance API to count tokens of text
300+
301+
302+
```
303+
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
304+
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
305+
.NET SDK 8.0.200
306+
[Host] : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
307+
.NET 6.0 : .NET 6.0.16 (6.0.1623.17311), X64 RyuJIT AVX2
308+
.NET 8.0 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
309+
.NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
310+
```
311+
+| Method | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
+|------------------------- |--------------:|------------:|------------:|------:|--------:|----------:|------------:|
+| **.NET 8.0** | | | | | | | |
+| Encode_SmallText | 22.649 us | 0.4244 us | 0.4359 us | 0.28 | 0.01 | 696 B | 0.02 |
+| Encode_LargeText | 4,542.505 us | 87.7988 us | 104.5182 us | 0.24 | 0.01 | 155547 B | 0.03 |
+| | | | | | | | |
+| Decode_SmallText | 1.623 us | 0.0324 us | 0.0373 us | 0.44 | 0.02 | 2320 B | 0.98 |
+| Decode_LargeText | 454.570 us | 6.8980 us | 6.4524 us | 0.80 | 0.02 | 286979 B | 1.00 |
+| | | | | | | | |
+| CountTokens_SmallText | 22.008 us | 0.1165 us | 0.0909 us | 0.28 | 0.00 | 184 B | 0.005 |
+| CountTokens_LargeText | 4,231.353 us | 14.5157 us | 11.3329 us | 0.23 | 0.00 | 195 B | 0.000 |
+| | | | | | | | |
+| **.NET 6.0** | | | | | | | |
+| Encode_SmallText | 36.370 us | 0.7178 us | 1.0962 us | 0.45 | 0.02 | 37344 B | 0.91 |
+| Encode_LargeText | 11,213.070 us | 219.6291 us | 269.7243 us | 0.59 | 0.02 | 5062574 B | 0.91 |
+| | | | | | | | |
+| Decode_SmallText | 2.588 us | 0.0394 us | 0.0350 us | 0.70 | 0.02 | 2320 B | 0.98 |
+| Decode_LargeText | 489.467 us | 8.9195 us | 8.3433 us | 0.86 | 0.02 | 286985 B | 1.00 |
+| | | | | | | | |
+| CountTokens_SmallText | 34.758 us | 0.2027 us | 0.1896 us | 0.45 | 0.01 | 36832 B | 0.907 |
+| CountTokens_LargeText | 11,252.083 us | 215.8912 us | 212.0340 us | 0.61 | 0.01 | 4907169 B | 0.907 |
+| | | | | | | | |
+| **.NET Framework 4.7.1** | | | | | | | |
+| Encode_SmallText | 79.947 us | 1.5621 us | 3.0097 us | 1.00 | 0.00 | 41138 B | 1.00 |
+| Encode_LargeText | 18,961.252 us | 253.1816 us | 236.8262 us | 1.00 | 0.00 | 5567685 B | 1.00 |
+| | | | | | | | |
+| Decode_SmallText | 3.723 us | 0.0728 us | 0.0997 us | 1.00 | 0.00 | 2375 B | 1.00 |
+| Decode_LargeText | 570.787 us | 11.0356 us | 11.8080 us | 1.00 | 0.00 | 287496 B | 1.00 |
+| | | | | | | | |
+| CountTokens_SmallText | 77.521 us | 1.0802 us | 0.9020 us | 1.00 | 0.00 | 40616 B | 1.000 |
+| CountTokens_LargeText | 18,485.392 us | 313.5834 us | 277.9836 us | 1.00 | 0.00 | 5413237 B | 1.000 |
343+
178344
## Contributions and Feedback
179345

180346
If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+using System.Threading.Tasks;
+using BenchmarkDotNet.Attributes;
+using BenchmarkDotNet.Jobs;
+using TiktokenSharp;
+using Microsoft.DeepDev;
+
+
+namespace SharpToken.Benchmark
+{
+    [SimpleJob(RuntimeMoniker.Net60)]
+    [SimpleJob(RuntimeMoniker.Net80)]
+    [SimpleJob(RuntimeMoniker.Net471)]
+    [RPlotExporter]
+    [MemoryDiagnoser]
+    public class CompareBenchmark
+    {
+        private GptEncoding _sharpToken;
+        private TikToken _tikToken;
+        private ITokenizer _tokenizer;
+        private string _kLongText;
+
+        [GlobalSetup]
+        public async Task Setup()
+        {
+            _sharpToken = GptEncoding.GetEncoding("cl100k_base");
+            _tikToken = await TikToken.GetEncodingAsync("cl100k_base").ConfigureAwait(false);
+            _tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-4").ConfigureAwait(false);
+            _kLongText = "King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.";
+        }
+
+        [Benchmark]
+        public int SharpToken()
+        {
+            var sum = 0;
+            for (var i = 0; i < 10000; i++)
+            {
+                var encoded = _sharpToken.Encode(_kLongText);
+                var decoded = _sharpToken.Decode(encoded);
+                sum += decoded.Length;
+            }
+
+            return sum;
+        }
+
+        [Benchmark]
+        public int TiktokenSharp()
+        {
+            var sum = 0;
+            for (var i = 0; i < 10000; i++)
+            {
+                var encoded = _tikToken.Encode(_kLongText);
+                var decoded = _tikToken.Decode(encoded);
+                sum += decoded.Length;
+            }
+
+            return sum;
+        }
+
+        [Benchmark]
+        public int TokenizerLib()
+        {
+            var sum = 0;
+            for (var i = 0; i < 10000; i++)
+            {
+                var encoded = _tokenizer.Encode(_kLongText);
+                var decoded = _tokenizer.Decode(encoded.ToArray());
+                sum += decoded.Length;
+            }
+
+            return sum;
+        }
+    }
+}

SharpToken.Benchmark/FileHelper.cs

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
+using System;
+using System.IO;
+
+
+namespace SharpToken.Benchmark
+{
+    internal sealed class FileHelper
+    {
+        public static string ReadFile(string path)
+        {
+            return File.ReadAllText(Path.Combine(AppContext.BaseDirectory, path));
+        }
+
+        public static T ReadJson<T>(string path)
+        {
+            return Newtonsoft.Json.JsonConvert.DeserializeObject<T>(ReadFile(path));
+        }
+
+        public static string[] ReadFileLines(string path)
+        {
+            return File.ReadAllLines(Path.Combine(AppContext.BaseDirectory, path));
+        }
+    }
+}
