Add llama 3 tokenizer #850
base: main
Conversation
```diff
@@ -67,7 +67,7 @@ class Version(enum.Enum):
 VOCAB_SIZE = {
     Version.V1: 32 * 1024,
     Version.V2: 32 * 1024,
-    Version.V3: 128256,
+    Version.V3: 128 * 1024,
```
Llama 3 is supposed to be 128256. Are there any plans to stick to the standard Llama 3 vocab size?
I added -tiktoken configs, and they will use 128256. The configs without -tiktoken stay unchanged: vocab_size is still 128 * 1024, using the BPE tokenizer.
Ah, nice! I didn't catch that part.
Thanks! Could you confirm with @kelvin-zou that the new configs still work on TPUs and GPUs?
Thanks for adding this new tokenizer.
A few minor follow-up comments.
```diff
@@ -301,6 +305,10 @@ def model_config(
         lm_head=lm_head_cfg,
         dropout_rate=dropout_rate,
     )
+    if pad_token_id:
```
Since `pad_token_id=0` is a valid token:

```diff
-    if pad_token_id:
+    if pad_token_id is not None:
```
Done.
```diff
@@ -301,6 +305,10 @@ def model_config(
         lm_head=lm_head_cfg,
         dropout_rate=dropout_rate,
     )
+    if pad_token_id:
+        decoder_cfg.set(pad_token_id=pad_token_id)
+    if eos_token_id:
```
```diff
-    if eos_token_id:
+    if eos_token_id is not None:
```
Done.
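For context on why the `is not None` form matters: Python treats `0` as falsy, so a plain truthiness check silently ignores a caller who passes token id 0. A minimal standalone sketch (not axlearn code) showing the difference:

```python
def set_ids(pad_token_id=None, eos_token_id=None):
    cfg = {}
    if pad_token_id:  # buggy: also False for a valid pad id of 0
        cfg["pad_token_id"] = pad_token_id
    if eos_token_id is not None:  # correct: only skips a genuinely omitted argument
        cfg["eos_token_id"] = eos_token_id
    return cfg

print(set_ids(pad_token_id=0, eos_token_id=0))  # {'eos_token_id': 0} -- pad id 0 was dropped
```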
```python
    128256: "bpe_128k_c4.model",
}


def _vocab_cfg(size: int):
```
Nit:

```diff
-def _vocab_cfg(size: int):
+def _vocab_cfg(vocab_size: int):
```
Done.
```python
    if size == 128256:
        # TikToken.
        return config_for_class(FujiV3Vocabulary).set(filename="Llama-3-tokenizer.json")
    raise ValueError(f"size {size} tokenizer does not exist.")
```
```diff
-    raise ValueError(f"size {size} tokenizer does not exist.")
+    raise ValueError(f"Tokenizer with vocab size {size} does not exist.")
```
Done.
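Putting both suggestions together, the helper ends up roughly like this (a sketch that omits the BPE branch shown above; `FujiV3Vocabulary` is defined in the PR, and `config_for_class` comes from `axlearn.common.config`):

```python
from axlearn.common.config import config_for_class


def _vocab_cfg(vocab_size: int):
    if vocab_size == 128256:
        # TikToken.
        return config_for_class(FujiV3Vocabulary).set(filename="Llama-3-tokenizer.json")
    raise ValueError(f"Tokenizer with vocab size {vocab_size} does not exist.")
```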
```python
    pad_token_id: Optional[int] = None,
    eos_token_id: Optional[int] = None,
```
Please update docstring accordingly.
Done.
axlearn/experiments/text/gpt/fuji.py (Outdated)
```diff
@@ -423,6 +421,9 @@ def get_trainer_kwargs(
         raise NotImplementedError(f"Unknown model size {model_size}.")
     model_kwargs = trainer_kwargs.pop("model_kwargs")
     model_kwargs.setdefault("vocab_size", vocab_size)
+    if version == Version.V3 and vocab_size == 128256:  # tiktoken tokenizer
```
Can we use a private const for the tiktoken size?

```python
_TIKTOKEN_VOCAB_SIZE = 128256
```
I will add a new version called V3_TIKTOKEN. I think it will make things cleaner.
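The force-push at the end of the thread confirms this landed as `V3_TIKTOKEN`. A rough sketch of the resulting shape (the enum values are hypothetical; the vocab sizes come from the diff above):

```python
import enum


class Version(enum.Enum):
    V1 = "v1"
    V2 = "v2"
    V3 = "v3"
    V3_TIKTOKEN = "v3-tiktoken"  # hypothetical value


VOCAB_SIZE = {
    Version.V1: 32 * 1024,
    Version.V2: 32 * 1024,
    Version.V3: 128 * 1024,       # BPE tokenizer
    Version.V3_TIKTOKEN: 128256,  # tiktoken tokenizer
}
```

With a distinct version, the tokenizer choice follows from `version` alone, so the `vocab_size == 128256` magic-number check becomes unnecessary.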
axlearn/experiments/text/gpt/fuji.py (Outdated)
```python
    vocab_size = VOCAB_SIZE[version]
    if tiktoken:
        suffix += "-tiktoken"
        vocab_size = 128256
```
Use the const here?
Same as above.
```diff
@@ -27,6 +27,65 @@
 # Use cpu for the test.
 jax.config.update("jax_platform_name", "cpu")


+config_dict_1b = {
```
Add a link to the JSON config as a reference?
Done.
```python
filename = os.path.join(data_dir, "tokenizers", "hf", filename)
if filename.startswith("gs:") or filename.startswith("s3:"):
    # Create a different file for each usage.
    tmp = tempfile.mkdtemp()
```
Should we clean up `tmp` after the tokenizer has been initialized?
Done.
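One way the cleanup can look (a sketch assuming `tf.io.gfile` for remote reads, as used elsewhere in axlearn; the helper name and surrounding structure are hypothetical, not the exact PR code):

```python
import os
import shutil
import tempfile

import tensorflow as tf


def _read_tokenizer_file(filename: str) -> bytes:
    """Reads a tokenizer file, staging remote (gs:/s3:) files in a temp dir removed after use."""
    if filename.startswith("gs:") or filename.startswith("s3:"):
        tmp = tempfile.mkdtemp()  # a separate local copy for each usage
        try:
            local_path = os.path.join(tmp, os.path.basename(filename))
            tf.io.gfile.copy(filename, local_path)
            with open(local_path, "rb") as f:
                return f.read()
        finally:
            shutil.rmtree(tmp)  # clean up once the contents are in memory
    with open(filename, "rb") as f:
        return f.read()
```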
```
model.decoder.transformer.repeat.drop_output.fn: 'axlearn.common.repeat._drop_by_regex'
model.decoder.transformer.repeat.drop_output.rules[0]: 'module_outputs.*'
model.decoder.transformer.repeat.klass: 'axlearn.common.attention._TransformerRepeat'
model.decoder.vocab_size: 128256
```
@samos123 For example, here it is using tiktoken and vocab_size is 128256.
Force-pushed from 3e8d4f9 to 81f3e7b: "add a new version called V3_TIKTOKEN. other edits based on suggestions."