# SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

**This is not an official Google product.**

## Technical highlights

- **Purely data driven**: SentencePiece trains tokenization and detokenization
models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
- **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
- **Multiple subword algorithms**: **BPE** [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) and [BPE-dropout](https://arxiv.org/abs/1910.13267) which help to improve the robustness and accuracy of NMT models.
- **Fast and lightweight**: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
- **Self-contained**: The same tokenization/detokenization is obtained as long as the same model file is used.

For those unfamiliar with SentencePiece as a software/algorithm, one can read [a gentle introduction here](https://medium.com/@jacky2wong/understanding-sentencepiece-under-standing-sentence-piece-ac8da59f6b08).


## Comparisons with other implementations

| Feature | SentencePiece | [subword-nmt](https://github.com/rsennrich/subword-nmt) | [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) |
| :-------------------------------------- | :--------------------------------------------: | :-----------------------------------------------------: | :-----------------------------------------------: |
| Supported algorithm | BPE, unigram, char, word | BPE | BPE\* |
| OSS? | Yes | Yes | Google internal |
| Subword regularization | [Yes](#subword-regularization-and-bpe-dropout) | No | No |
| Python Library (pip) | [Yes](python/README.md) | No | N/A |
| C++ Library | [Yes](doc/api.md) | No | N/A |
| Pre-segmentation required? | [No](#whitespace-is-treated-as-a-basic-symbol) | Yes | Yes |
| Customizable normalization (e.g., NFKC) | [Yes](doc/normalization.md) | No | N/A |
| Direct id generation | [Yes](#end-to-end-example) | No | N/A |

Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

## Overview

### What is SentencePiece?

SentencePiece is a re-implementation of **sub-word units**, an effective way to alleviate the open vocabulary
problems in neural machine translation. SentencePiece supports two segmentation algorithms, **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]. Here are the high-level differences from other implementations.

#### The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed
vocabulary. Unlike most unsupervised word segmentation algorithms, which
assume an infinite vocabulary, SentencePiece trains the segmentation model such
that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from [subword-nmt](https://github.com/rsennrich/subword-nmt) that uses the number of merge operations.
The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.

#### Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it makes preprocessing complicated, as we have to run language-dependent tokenizers in advance.
The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.

#### Whitespace is treated as a basic symbol

The first step of Natural Language Processing is text tokenization. For
example, a standard English tokenizer would segment the text "Hello world." into the
following three tokens.

- [Hello] [world] [.]

SentencePiece instead treats whitespace as a basic symbol: the input text is first escaped with the meta symbol "▁" (U+2581), and the whitespace is kept inside the resulting pieces, so the original sentence can always be reconstructed without ambiguity.

Note that we cannot apply the same lossless conversions when splitting the
sentence with standard word segmenters, since they treat the whitespace as a
special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

- (en) Hello world. → [Hello] [World] [.] \(A space between Hello and World\)
- (ja) こんにちは世界。 → [こんにちは] [世界] [。] \(No space between こんにちは and 世界\)
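
As a minimal sketch of this lossless behaviour with the [Python module](python/README.md) (assuming a recent version of the `sentencepiece` package and a placeholder model file `m.model` that has already been trained), decoding the pieces reproduces the exact input, including its whitespace:

```python
import sentencepiece as spm

# 'm.model' is a placeholder for any trained SentencePiece model.
sp = spm.SentencePieceProcessor(model_file='m.model')

text = 'Hello world.'
pieces = sp.encode(text, out_type=str)  # whitespace is kept as the meta symbol '▁' inside the pieces
assert sp.decode(pieces) == text        # lossless round trip: no external detokenizer needed
```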

#### Subword regularization and BPE-dropout

Subword regularization [[Kudo.](https://arxiv.org/abs/1804.10959)] and BPE-dropout [[Provilkov et al.](https://arxiv.org/abs/1910.13267)] are simple regularization methods
that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.

To enable subword regularization, you need to integrate the SentencePiece library
([C++](doc/api.md#sampling-subword-regularization)/[Python](python/README.md)) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparation. Here is an example with the [Python library](python/README.md). You can see that 'New York' is segmented differently on each `SampleEncode` (C++) or `encode with enable_sampling=True` (Python) call. The details of the sampling parameters are found in [sentencepiece_processor.h](src/sentencepiece_processor.h).

```
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
(five sampled segmentations of 'New York' are printed; the exact split varies from call to call)
```

## Installation

### Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation.
You can install the Python binary package of SentencePiece with:

```
% pip install sentencepiece
```
For more detail, see [Python module](python/README.md)
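
As a quick sanity check of the installed package, here is a minimal sketch that trains a small model and uses it for segmentation; `data/corpus.txt` and the vocabulary size of 8000 are placeholder choices, not fixed requirements, and the keyword-argument API assumes a recent version of the wrapper:

```python
import sentencepiece as spm

# Train a model from a raw, one-sentence-per-line corpus (placeholder path).
spm.SentencePieceTrainer.train(input='data/corpus.txt', model_prefix='m', vocab_size=8000)

# Load the trained model and segment/restore text.
sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.encode('This is a test', out_type=str))             # subword pieces
print(sp.decode(sp.encode('This is a test', out_type=int)))  # back to raw text
```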

### Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

- [cmake](https://cmake.org/)
- C++11 compiler
- [gperftools](https://github.com/gperftools/gperftools) library (optional; a 10-40% performance improvement can be obtained)

On Ubuntu, the build tools can be installed with apt-get:

```
% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```

Then, you can build and install command line tools as follows.

```
% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v
```

On OSX/macOS, replace the last command with `sudo update_dyld_shared_cache`

### Build and install using vcpkg
The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

### Download and install SentencePiece from signed released wheels

You can download the wheel from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest).
We generate [SLSA3 signatures](https://slsa.dev) using the OpenSSF's [slsa-framework/slsa-github-generator](https://github.com/slsa-framework/slsa-github-generator) during the release process. To verify a release binary:

1. Install the verification tool from [slsa-framework/slsa-verifier#installation](https://github.com/slsa-framework/slsa-verifier#installation).
2. Download the provenance file `attestation.intoto.jsonl` from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest).
3. Run the verifier:

```shell
slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
```

If verification succeeds, install the wheel:

```shell
pip install wheel_file.whl
```

## Usage instructions

### Train SentencePiece Model

```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```

- `--input`: one-sentence-per-line **raw** corpus file. No need to run
tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes
the input with Unicode NFKC. You can pass a comma-separated list of files.
- `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
- `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000
- `--character_coverage`: amount of characters covered by the model; good defaults are `0.9995` for languages with a rich character set like Japanese or Chinese and `1.0` for other languages with a small character set.
- `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using `word` type.

Use the `--help` flag to display all parameters for training, or see [here](doc/options.md) for an overview.
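
The same training options can also be passed through the Python wrapper, either as keyword arguments or as a single flag string; this is a sketch with a recent version of the `sentencepiece` package, and the corpus path below is a placeholder:

```python
import sentencepiece as spm

# Keyword-argument form: names map one-to-one to the flags above.
spm.SentencePieceTrainer.train(
    input='data/corpus.txt',      # placeholder one-sentence-per-line raw corpus
    model_prefix='m',
    vocab_size=8000,
    character_coverage=1.0,
    model_type='unigram',
)

# Equivalent single flag-string form.
spm.SentencePieceTrainer.train(
    '--input=data/corpus.txt --model_prefix=m --vocab_size=8000 '
    '--character_coverage=1.0 --model_type=unigram')
```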

### Encode raw text into sentence pieces/ids

```
% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output
```

Use the `--extra_options` flag to insert the BOS/EOS markers or reverse the input sequence.

```
% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```

SentencePiece supports nbest segmentation and segmentation sampling with `--output_format=(nbest|sample)_(piece|id)` flags.

```
% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
```
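
The Python wrapper offers the same output formats; a small sketch assuming a trained placeholder model `m.model` and a recent version of the package:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

text = 'I saw a girl with a telescope.'
print(sp.encode(text, out_type=str))   # pieces, like --output_format=piece
print(sp.encode(text, out_type=int))   # ids, like --output_format=id

# Sampling-based segmentation, the counterpart of --output_format=sample_piece.
print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.5, nbest_size=-1))
```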

### Decode sentence pieces/ids into raw text

```
% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output
```

Use the `--extra_options` flag to decode the text in reverse order.

```
% spm_decode --extra_options=reverse < input > output
```
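
In the Python wrapper, `decode` accepts either pieces or ids and restores the raw text; a minimal sketch with the same placeholder model:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

text = 'I saw a girl with a telescope.'
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)

# Both forms round-trip back to the original sentence.
assert sp.decode(pieces) == text
assert sp.decode(ids) == text
```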

### End-to-End Example

```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```

You can see that the original input sentence is restored from the vocabulary id sequence.

### Export vocabulary list

```
% spm_export_vocab --model=<model_file> --output=<output file>
```

`<output file>` stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
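
The same information can be read from the Python wrapper without exporting a file; a short sketch, assuming the snake_case accessors exposed by recent versions of the package:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# id -> piece and its score (the emission log probability for unigram models),
# mirroring the exported vocabulary file line by line.
for i in range(sp.get_piece_size()):
    print(i, sp.id_to_piece(i), sp.get_score(i))
```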

### Redefine special meta tokens

By default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;/s&gt;) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

```
% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
```
When setting -1 id e.g., ```bos_id=-1```, this special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (&lt;pad&gt;) as ```--pad_id=3```.  

When an id is set to -1, e.g., `bos_id=-1`, that special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (&lt;pad&gt;) as `--pad_id=3`.

If you want to assign other special tokens, please see [Use custom symbols](doc/special_symbols.md).
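
A sketch of the same remapping through the Python trainer (the corpus path is a placeholder), reading the resulting ids back from the processor:

```python
import sentencepiece as spm

# The trainer accepts the same flags as keyword arguments.
spm.SentencePieceTrainer.train(
    input='data/corpus.txt', model_prefix='m', vocab_size=8000,
    bos_id=0, eos_id=1, unk_id=5, pad_id=3)

sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.bos_id(), sp.eos_id(), sp.unk_id(), sp.pad_id())  # expected: 0 1 5 3
```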

### Vocabulary restriction

`spm_encode` accepts a `--vocabulary` and a `--vocabulary_threshold` option so that `spm_encode` will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).

The usage is basically the same as that of `subword-nmt`. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get resulting vocabulary for each:

```
% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
```

The `shuffle` command is used just in case, because `spm_train` loads only the first 10M lines of the corpus by default.

Then segment the train/test corpus with the `--vocabulary` option:

```
% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
```
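
If you prefer to do this from Python, the wrapper is expected to expose the vocabulary-restriction calls of the C++ `SentencePieceProcessor` API; the following is a sketch under that assumption, reusing the placeholder file names from above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm.model')

# Assumption: LoadVocabulary/ResetVocabulary are exposed as in the C++ API.
# Only pieces appearing at least 50 times in {vocab_file}.L1 will be produced.
sp.LoadVocabulary('{vocab_file}.L1', 50)
print(sp.encode('a sentence in language L1', out_type=str))

sp.ResetVocabulary()  # remove the restriction again
```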

## Advanced topics

- [SentencePiece Experiments](doc/experiments.md)
- [SentencePieceProcessor C++ API](doc/api.md)
- [Use custom text normalization rules](doc/normalization.md)
- [Use custom symbols](doc/special_symbols.md)
- [Python Module](python/README.md)
- [Segmentation and training algorithms in detail]