75 changes: 75 additions & 0 deletions .github/workflows/cmake-multi-platform.yml
@@ -0,0 +1,75 @@
# This starter workflow is for a CMake project running on multiple platforms. There is a different starter workflow if you just want a single platform.
# See: https://github.com/actions/starter-workflows/blob/main/ci/cmake-single-platform.yml
name: CMake on multiple platforms

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:
    runs-on: ${{ matrix.os }}

    strategy:
      # Set fail-fast to false to ensure that feedback is delivered for all matrix combinations. Consider changing this to true when your workflow is stable.
      fail-fast: false

      # Set up a matrix to run the following 3 configurations:
      # 1. <Windows, Release, latest MSVC compiler toolchain on the default runner image, default generator>
      # 2. <Linux, Release, latest GCC compiler toolchain on the default runner image, default generator>
      # 3. <Linux, Release, latest Clang compiler toolchain on the default runner image, default generator>
      #
      # To add more build types (Release, Debug, RelWithDebInfo, etc.) customize the build_type list.
      matrix:
        os: [ubuntu-latest, windows-latest]
        build_type: [Release]
        c_compiler: [gcc, clang, cl]
        include:
          - os: windows-latest
            c_compiler: cl
            cpp_compiler: cl
          - os: ubuntu-latest
            c_compiler: gcc
            cpp_compiler: g++
          - os: ubuntu-latest
            c_compiler: clang
            cpp_compiler: clang++
        exclude:
          - os: windows-latest
            c_compiler: gcc
          - os: windows-latest
            c_compiler: clang
          - os: ubuntu-latest
            c_compiler: cl

    steps:
    - uses: actions/checkout@v4

    - name: Set reusable strings
      # Turn repeated input strings (such as the build output directory) into step outputs. These step outputs can be used throughout the workflow file.
      id: strings
      shell: bash
      run: |
        echo "build-output-dir=${{ github.workspace }}/build" >> "$GITHUB_OUTPUT"

    - name: Configure CMake
      # Configure CMake in a 'build' subdirectory. `CMAKE_BUILD_TYPE` is only required if you are using a single-configuration generator such as make.
      # See https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html?highlight=cmake_build_type
      run: >
        cmake -B ${{ steps.strings.outputs.build-output-dir }}
        -DCMAKE_CXX_COMPILER=${{ matrix.cpp_compiler }}
        -DCMAKE_C_COMPILER=${{ matrix.c_compiler }}
        -DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
        -S ${{ github.workspace }}

    - name: Build
      # Build your program with the given configuration. Note that --config is needed because the default Windows generator is a multi-config generator (Visual Studio generator).
      run: cmake --build ${{ steps.strings.outputs.build-output-dir }} --config ${{ matrix.build_type }}

    - name: Test
      working-directory: ${{ steps.strings.outputs.build-output-dir }}
      # Execute tests defined by the CMake configuration. Note that --build-config is needed because the default Windows generator is a multi-config generator (Visual Studio generator).
      # See https://cmake.org/cmake/help/latest/manual/ctest.1.html for more detail
      run: ctest --build-config ${{ matrix.build_type }}
54 changes: 27 additions & 27 deletions README.md
@@ -6,47 +6,47 @@ Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), we

## Bling Fire Tokenizer Overview

Bling Fire Tokenizer provides state of the art performance for Natural Language text tokenization. Bling Fire supports the following tokenization algorithms:
Bling Fire Tokenizer provides state-of-the-art performance for Natural Language text tokenization. Bling Fire supports the following tokenization algorithms:

1. Pattern-based tokenization
2. [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenization
3. [SentencePiece](https://github.com/google/sentencepiece) Unigram LM
4. [SentencePiece](https://github.com/google/sentencepiece) BPE
5. Induced/learned syllabification patterns (identifies possible hyphenation points within a token)

Bling Fire provides uniform interface for working with all four algorithms so there is no difference for the client whether to use tokenizer for XLNET, BERT or your own custom model.
Bling Fire provides a uniform interface for working with all of these algorithms, so there is no difference for the client between using a tokenizer for XLNET, BERT, or your own custom model.

Model files describe the algorithms they are built for and are loaded on demand from external file. There are also two default models for NLTK-style tokenization and sentence breaking, which does not need to be loaded. The default tokenization model follows logic of NLTK, except hyphenated words are split and a few "errors" are fixed.
Model files describe the algorithms they are built for and are loaded on demand from an external file. There are also two default models for NLTK-style tokenization and sentence breaking, which do not need to be loaded. The default tokenization model follows the logic of NLTK, except that hyphenated words are split and a few "errors" are fixed.
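
For example, loading one of the precompiled models and converting text to ids uses the same two calls regardless of the algorithm behind the file. A minimal sketch, assuming the `blingfire` package is installed and that `./bert_base_tok.bin` points to a downloaded copy of that model:

```python
from blingfire import load_model, text_to_ids, free_model

# Load a WordPiece model file on demand; the path below is an assumption
# for this sketch and should point to a downloaded bert_base_tok.bin.
h = load_model("./bert_base_tok.bin")

# 128 is the maximum sequence length (the output is padded/truncated to it),
# 100 is the id used for unknown pieces.
ids = text_to_ids(h, "Bling Fire is fast.", 128, 100)
print(ids)

free_model(h)
```

The same `load_model`/`text_to_ids` pair works for the Unigram LM and BPE models listed below; only the `.bin` file changes.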

Normalization can be added to each model, but is optional.
Normalization can be added to each model but is optional.

Differences between the algorithms are [summarized here](https://github.com/microsoft/BlingFire/blob/master/doc/Bling_Fire_Tokenizer_Algorithms.pdf).

Bling Fire Tokenizer high level API designed in a way that it requires minimal or no configuration, or initialization, or additional files and is friendly for use from languages like Python, Ruby, Rust, C#, JavaScript (via WASM), etc.
The Bling Fire Tokenizer high-level API is designed to require minimal or no configuration, initialization, or additional files, and it is friendly for use from languages like Python, Ruby, Rust, C#, JavaScript (via WASM), etc.
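
As a quick illustration of that zero-configuration path, the default models need nothing more than an import (a minimal sketch; the sample text is arbitrary and only `pip install -U blingfire` is assumed):

```python
from blingfire import text_to_words, text_to_sentences

# The default pattern-based models ship inside the library,
# so there is nothing to download, load, or configure.
text = "This is a sentence. And here is another one, with a hyphen-ated token."
print(text_to_sentences(text))  # one sentence per line
print(text_to_words(text))      # tokens separated by single spaces
```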

We have precompiled some popular models and listed them below with source code references:

| File Name | Models it should be used for | Algorithm | Source Code |
|------------|---------------------------------------|----|----|
| wbd.bin | Default Tokenization Model | Pattern-based | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/wbd) |
| sbd.bin | Default model for Sentence breaking | Pattern-based | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/sbd) |
| bert_base_tok.bin | BERT Base/Large | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_base_tok) |
| bert_base_cased_tok.bin | BERT Base/Large Cased | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_base_cased_tok) |
| bert_chinese.bin | BERT Chinese | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_chinese) |
| bert_multi_cased.bin | BERT Multi Lingual Cased | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_multi_cased) |
| xlnet.bin | XLNET Tokenization Model | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/xlnet) |
| xlnet_nonorm.bin | XLNET Tokenization Model /wo normalization | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/xlnet_nonorm) |
| bpe_example.bin | A model to test BPE tokenization | BPE | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bpe_example) |
| xlm_roberta_base.bin | XLM Roberta Tokenization | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/xlm_roberta_base) |
| laser(100k\|250k\|500k).bin | Trained on balanced by language WikiMatrix corpus of 80+ languages | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/laser100k) |
| uri(100k\|250k\|500k).bin | URL tokenization model trained on a large set of random URLs from the web | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/uri250k) |
| gpt2.bin | Byte-BPE tokenization model for GPT-2 | byte BPE | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/gpt2) |
| roberta.bin | Byte-BPE tokenization model for Roberta model | byte BPE | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta) |
| syllab.bin | Multi lingual model to identify allowed hyphenation points inside a word. | W2H | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/syllab) |

Oh yes, it is also the fastest! We did a comparison of Bling Fire with tokenizers from Hugging Face, [Bling Fire runs 4-5 times faster than Hugging Face Tokenizers](https://github.com/Microsoft/BlingFire/wiki/Comparing-performance-of-Bling-Fire-and-Hugging-Face-Tokenizers), see also [Bing Blog Post](https://blogs.bing.com/Developers-Blog/march-2020/Bling-FIRE-Tokenizer-for-BERT). We did comparison of Bling Fire Unigram LM and BPE implementaion to the same one in [SentencePiece](https://github.com/google/sentencepiece) library and our implementation is ~2x faster, see [XLNET benchmark](https://github.com/microsoft/BlingFire/blob/master/ldbsrc/xlnet/README.TXT) and [BPE benchmark](https://github.com/microsoft/BlingFire/blob/master/ldbsrc/bpe_example/README.TXT). Not to mention our default models are 10x faster than the same functionality from [SpaCy](https://github.com/explosion/spaCy), see [benchmark wiki](https://github.com/Microsoft/BlingFire/wiki/Benchmark-Guide) and this [Bing Blog Post](https://blogs.bing.com/Developers-Blog/2019-04/bling-fire-tokenizer-released-to-open-source).

So if low latency inference is what you need then you have to try Bling Fire!
| File Name | Models it should be used for | Algorithm | Source Code |
|-----------------------------|---------------------------------------------------------------------------|---------------|--------------------------------------------------------------------------------------|
| wbd.bin | Default Tokenization Model | Pattern-based | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/wbd) |
| sbd.bin | Default model for Sentence breaking | Pattern-based | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/sbd) |
| bert_base_tok.bin | BERT Base/Large | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_base_tok) |
| bert_base_cased_tok.bin | BERT Base/Large Cased | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_base_cased_tok) |
| bert_chinese.bin | BERT Chinese | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_chinese) |
| bert_multi_cased.bin | BERT Multi Lingual Cased | WordPiece | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bert_multi_cased) |
| xlnet.bin | XLNET Tokenization Model | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/xlnet) |
| xlnet_nonorm.bin | XLNET Tokenization Model /wo normalization | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/xlnet_nonorm) |
| bpe_example.bin | A model to test BPE tokenization | BPE | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/bpe_example) |
| xlm_roberta_base.bin | XLM Roberta Tokenization | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/xlm_roberta_base) |
| laser(100k\|250k\|500k).bin | Trained on balanced by language WikiMatrix corpus of 80+ languages | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/laser100k) |
| uri(100k\|250k\|500k).bin | URL tokenization model trained on a large set of random URLs from the web | Unigram LM | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/uri250k) |
| gpt2.bin | Byte-BPE tokenization model for GPT-2 | byte BPE | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/gpt2) |
| roberta.bin | Byte-BPE tokenization model for Roberta model | byte BPE | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta) |
| syllab.bin | Multi lingual model to identify allowed hyphenation points inside a word. | W2H | [src](https://github.com/microsoft/BlingFire/tree/master/ldbsrc/syllab) |

Oh yes, it is also the fastest! We did a comparison of Bling Fire with tokenizers from Hugging Face: [Bling Fire runs 4-5 times faster than Hugging Face Tokenizers](https://github.com/Microsoft/BlingFire/wiki/Comparing-performance-of-Bling-Fire-and-Hugging-Face-Tokenizers); see also this [Bing Blog Post](https://blogs.bing.com/Developers-Blog/march-2020/Bling-FIRE-Tokenizer-for-BERT). We also compared the Bling Fire Unigram LM and BPE implementations to the same ones in the [SentencePiece](https://github.com/google/sentencepiece) library, and our implementation is ~2x faster; see the [XLNET benchmark](https://github.com/microsoft/BlingFire/blob/master/ldbsrc/xlnet/README.TXT) and the [BPE benchmark](https://github.com/microsoft/BlingFire/blob/master/ldbsrc/bpe_example/README.TXT). Not to mention our default models are 10x faster than the same functionality from [SpaCy](https://github.com/explosion/spaCy); see the [benchmark wiki](https://github.com/Microsoft/BlingFire/wiki/Benchmark-Guide) and this [Bing Blog Post](https://blogs.bing.com/Developers-Blog/2019-04/bling-fire-tokenizer-released-to-open-source).

So if low-latency inference is what you need, then you have to try Bling Fire!
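
Any of the pattern-based files in the table can also be loaded explicitly rather than relying on the built-in defaults. A minimal sketch, assuming `wbd.bin` has been downloaded into the working directory:

```python
from blingfire import load_model, text_to_words_with_model, free_model

# Explicitly load the default word-breaking model listed in the table above.
# The relative path is an assumption for this sketch.
h = load_model("./wbd.bin")

print(text_to_words_with_model(h, "Explicitly loaded models use the same call shape as the defaults."))

free_model(h)
```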


## Python API Description
@@ -57,7 +57,7 @@ If you simply want to use it in Python, you can install the latest release using


## Examples
### 1. Python example, using default pattern-based tokenizer:
### 1. Python example, using the default pattern-based tokenizer:
```python
from blingfire import *
