
Commit 4a7a62b

Merge branch 'master' into feature/chat
2 parents fbe324f + 5c6427e

4 files changed: 21 additions, 14 deletions

4 files changed

+21
-14
lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -8,7 +8,7 @@ Train the Llama 2 LLM architecture in PyTorch then inference it with one simple
 
 As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing.
 
-Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compred to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
+Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
 
 ## feel the magic

@@ -175,7 +175,7 @@ python tinystories.py train_vocab --vocab_size=4096
 python tinystories.py pretokenize --vocab_size=4096
 ```
 
-The `train_vocab` stage will call the `train_vocab.sh` script, which calls the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
+The `train_vocab` stage will call the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
 
 A quick note of interest is that vocab size of 4096 trained specifically on tinystories creates integer sequences with about the same sequence length per example as the default Llama 2 tokenizer of 32000 tokens! This means that our custom, tailored tokenizer is a lot better adapted to our specific text, and can compress it very effectively. So our trained models are smaller and faster.

@@ -339,6 +339,7 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
 - runq.c (int8 quantization) add
 - run.cu (CUDA) investigate and merge
 - add more tests inside [test.c](test.c)
+- add Engine class for use in sample.py that does efficient inference in PyTorch, e.g. KV cache keeping
 - make it easier to add a new dataset with not too much pain
 - (LoRA) finetuning and export of Llama 2 models

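The middle README hunk above notes that a 4096-token vocab trained on TinyStories compresses the text about as well as Llama 2's default 32000-token tokenizer. A quick way to eyeball that claim is to encode the same text with both models and compare lengths. This is only a sketch: the paths `data/tok4096.model` (output of the `train_vocab` stage) and `tokenizer.model` (the Llama 2 SentencePiece model) are assumptions about the local checkout, and the sample sentence is made up.

```python
# Sketch: count how many tokens each tokenizer needs for the same text.
# Model paths are assumptions about the local setup.
import sentencepiece as spm

sample = "Once upon a time, there was a little girl named Lily who loved to play outside."

tok4096 = spm.SentencePieceProcessor(model_file="data/tok4096.model")  # custom TinyStories vocab
tok_llama = spm.SentencePieceProcessor(model_file="tokenizer.model")   # Llama 2 32000-token vocab

print("custom 4096 vocab :", len(tok4096.encode(sample)), "tokens")
print("Llama 2 32000 vocab:", len(tok_llama.encode(sample)), "tokens")
```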
run.c

Lines changed: 1 addition & 1 deletion
@@ -949,7 +949,7 @@ int main(int argc, char *argv[]) {
     // build the Transformer via the model .bin file
     Transformer transformer;
     build_transformer(&transformer, checkpoint_path);
-    if (steps == 0) steps = transformer.config.seq_len; // ovrerride to ~max length
+    if (steps == 0 || steps > transformer.config.seq_len) steps = transformer.config.seq_len; // ovrerride to ~max length
 
     // build the Tokenizer via the tokenizer .bin file
     Tokenizer tokenizer;

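The change above loosens the steps override: previously only `steps == 0` was mapped to the model's maximum, so a `-n` value larger than `seq_len` would let generation run past the context length the checkpoint (and the KV cache, which run.c allocates for `seq_len` positions) was built for. A rough Python restatement of the new clamping logic, where the `clamp_steps` name is hypothetical and not a function in the repo:

```python
def clamp_steps(steps: int, seq_len: int) -> int:
    # Mirrors the updated C condition: 0 means "use the model's maximum",
    # and any request beyond seq_len is capped to seq_len.
    if steps == 0 or steps > seq_len:
        return seq_len
    return steps

assert clamp_steps(0, 256) == 256     # 0 -> full context length
assert clamp_steps(100, 256) == 100   # in-range requests pass through
assert clamp_steps(1024, 256) == 256  # oversized requests are capped
```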
tinystories.py

Lines changed: 16 additions & 10 deletions
@@ -13,6 +13,7 @@
 
 import numpy as np
 import requests
+import sentencepiece as spm
 import torch
 import torch.distributed as dist
 from tqdm import tqdm
@@ -97,16 +98,21 @@ def train_vocab(vocab_size):
                 of.write(text + "\n")
     print(f"Size is: {os.path.getsize(tiny_file) / 1024 / 1024:.2f} MB")
 
-    # 2) run the train_vocab.sh script that trains the sentencepiece model
-    print("Will now train the vocab with:")
-    cmd = f"bash train_vocab.sh {tiny_file} {prefix} {vocab_size}"
-    print(cmd)
-    print("OK? [y/N] ")
-    dec = input()
-    if dec.lower() != "y":
-        print("Exiting...")
-        return
-    os.system(cmd)
+    # 2) train the sentencepiece model
+    print("Will now train the vocab...")
+    spm.SentencePieceTrainer.train(input=tiny_file,
+                                   model_prefix=prefix,
+                                   model_type="bpe",
+                                   vocab_size=vocab_size,
+                                   self_test_sample_size=0,
+                                   input_format="text",
+                                   character_coverage=1.0,
+                                   num_threads=os.cpu_count(),
+                                   split_digits=True,
+                                   allow_whitespace_only_pieces=True,
+                                   byte_fallback=True,
+                                   unk_surface=r" \342\201\207 ",
+                                   normalization_rule_name="identity")
 
     # 3) optional cleanup, ask the user if they'd like to delete tiny.txt
     dec = input(f"Delete the temporary file {tiny_file}? [y/N] ")

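After the inlined `spm.SentencePieceTrainer.train(...)` call above has run, the trained model lands at the prefix passed via `model_prefix` (per the README, `data/tok4096.model` for `--vocab_size=4096`). A quick load-and-round-trip check along these lines can confirm the vocab size and the effect of `byte_fallback=True`; the path and sample string are assumptions for illustration.

```python
# Sketch: load the freshly trained tokenizer and inspect it.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/tok4096.model")
print("vocab size:", sp.vocab_size())  # expect 4096

text = "One day, Lily found a shiny red ball in the garden."
ids = sp.encode(text)
print("encoded length:", len(ids))
# byte_fallback=True makes characters outside the learned merges fall back to
# raw byte pieces instead of a single <unk>, so decoding should round-trip.
print("round-trips:", sp.decode(ids) == text)
```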
train.py

Lines changed: 1 addition & 1 deletion
@@ -271,7 +271,7 @@ def get_lr(it):
                         "loss/val": losses["val"],
                         "lr": lr,
                         "mfu": running_mfu * 100, # convert to percentage
-                    }
+                    }, step = iter_num
                 )
             except Exception as e:
                 print(f"logging to wandb failed: {e}")

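The train.py tweak passes an explicit `step` to `wandb.log`. Without it, wandb advances its own internal step counter on every `log()` call, so charts are indexed by the number of logging calls rather than by training iteration; `step=iter_num` pins the x-axis to the actual iteration. A minimal illustration, with a placeholder project name and dummy metric values rather than the repo's real logging code:

```python
import wandb

wandb.init(project="llama2c-demo")  # placeholder project name

for iter_num in range(0, 1000, 100):
    # dummy metrics standing in for the losses/lr computed by the training loop
    metrics = {"loss/train": 1.0 / (iter_num + 1), "lr": 5e-4}
    wandb.log(metrics, step=iter_num)  # x-axis pinned to the training iteration
```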