Hardcoded incorrect (and repeated) validation example #796

Closed
DavidGOrtega opened this issue Dec 6, 2023 · 20 comments · Fixed by #1804
Labels
enhancement New feature or request

Comments

@DavidGOrtega

DavidGOrtega commented Dec 6, 2023

👋 I have found this code to be surprisingly incorrect, as my dataset has nothing to do with this validation example. It's also repeated in every single finetuning script.

# produce an example:
instruction = "Recommend a movie for me to watch during the weekend and explain the reason."
fabric.print(instruction)
sample = {"instruction": instruction, "input": ""}
prompt = generate_prompt(sample)
encoded = tokenizer.encode(prompt, device=fabric.device)
with fabric.init_tensor():
    # do not set `max_seq_length=max_returned_tokens` because memory is not a concern here
    model.set_kv_cache(batch_size=1)
output = generate(model, encoded, max_returned_tokens=len(encoded) + eval_max_new_tokens, temperature=0.8)
model.clear_kv_cache()
output = tokenizer.decode(output)
fabric.print(output)
@DavidGOrtega changed the title from "Incorrect (and repeated) validation example" to "Hardcoded incorrect (and repeated) validation example" on Dec 6, 2023
@carmocca
Contributor

carmocca commented Dec 7, 2023

This piece of validation was useful to subjectively evaluate the output of a fine-tuned model on the alpaca dataset as training progressed. Reusing it is on purpose so that you can see how it evolves with more steps.

You are correct that it might be completely irrelevant for other datasets. Feel free to completely remove those lines from your script or change the instruction to be more relevant to your use case.

@carmocca added the "question" (Further information is requested) label on Dec 7, 2023
@DavidGOrtega
Author

Why not just take it from your test split?

@carmocca
Contributor

carmocca commented Dec 8, 2023

You could do that, but then it might be different if you shuffle or change your dataset splits

@awaelchli
Contributor

When I added that prompt to the scripts, I was legit actually looking for a movie to watch and that's why it is there.

@DavidGOrtega
Author

DavidGOrtega commented Dec 9, 2023

> You could do that, but then it might be different if you shuffle or change your dataset splits

We are indeed randomly shuffling the dataset when preparing it, but the seed is always the same in every prepare script 😉

> When I added that prompt to the scripts, I was legit actually looking for a movie to watch and that's why it is there.

Not questioning that, sir, but it took me a bit to check that I was not messing with the datasets 😁

@awaelchli
Contributor

awaelchli commented Dec 10, 2023

@DavidGOrtega What do you think we could do? The requirement is that the lit-gpt scripts stay as simple as possible. We want people to read the entire script and understand it inside and out. The scripts are more like templates that the user adapts to their needs, so we wouldn't want to add too much code here to select a sample from the dataset directly. As Carlos said, that would require instantiating the dataset separately.

Maybe we could add a comment above the code that selects the sample, something like "Fixed example prompt as a sanity check to compare across validation runs, please adapt to your data"?

I would also be fine with removing it if it leads to too much confusion. I just thought that printing some output as a sanity check could help.
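
For illustration, the suggested comment would sit directly above the hardcoded prompt in each finetuning script (only the first line below is new; the rest is the existing snippet):

# Fixed example prompt as a sanity check to compare across validation runs; please adapt it to your data.
instruction = "Recommend a movie for me to watch during the weekend and explain the reason."
fabric.print(instruction)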

@DavidGOrtega

This comment was marked as off-topic.

@DavidGOrtega
Author

Any thoughts here, @awaelchli?

@awaelchli
Contributor

awaelchli commented Dec 16, 2023

My thoughts are: feel free to make a concrete suggestion in the form of a PR regarding the sample, or remove it entirely. We appreciate and encourage active contributions as long as they are within the project's scope.

> Unfortunately, I find the scripts pretty difficult to understand by obscurity...

A human error of leaving one redundant line of code in there is nowhere near a deliberate act of obfuscation. Let's just remove this line

"input_ids_no_response": encoded_full_prompt,

and the world is in balance again. If you see dead code elsewhere, please call it out, but we need some concrete evidence first that there are real problems with the code being obscure and hard to understand (what else is there that you are hinting at?). Of course, lit-gpt is an NLP-focused template, so certain terminology is expected to be known. But by all means, if we think that terms like eos and bos are too abbreviated, let's add a comment to explain them at the appropriate places!

The scripts being versions of each other is intentional. Lit-GPT actively tries not to be a framework, and thus favors single-file templates over abstractions, even if that means certain blocks of code are repeated. Think of them as templates!

@DavidGOrtega

This comment was marked as off-topic.

@awaelchli
Contributor

What does that mean concretely in lit-gpt? I'm still missing concrete examples. I'd like to talk specifics to make progress on the issue.

@DavidGOrtega
Author

DavidGOrtega commented Dec 18, 2023

I can do a PR if you are willing to accept it.
My proposal for this issue is to take one example from the eval split.

@awaelchli
Contributor

Happy to take a look at your proposal. Since the data comes pre-tokenized, you'll have to change the code to decode it (to print it for the human).
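
A rough sketch of what that could look like, assuming val_data is the pre-tokenized list of dicts produced by the prepare scripts and that each sample still carries the encoded prompt without the response (the "input_ids_no_response" field discussed earlier):

# Sketch only: print a fixed sample taken from the pre-tokenized validation split
# instead of the hardcoded movie prompt.
sample = val_data[0]  # deterministic, since the prepare scripts shuffle with a fixed seed
encoded = sample["input_ids_no_response"].to(fabric.device)
fabric.print(tokenizer.decode(encoded))  # decode so a human-readable prompt is printed
with fabric.init_tensor():
    model.set_kv_cache(batch_size=1)
output = generate(model, encoded, max_returned_tokens=len(encoded) + eval_max_new_tokens, temperature=0.8)
model.clear_kv_cache()
fabric.print(tokenizer.decode(output))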

@DavidGOrtega
Author

Happy to hear! Let me prepare something.

@rhiever

rhiever commented Mar 31, 2024

Did anything ever come of this issue/PR? I notice that the hardcoded example is still in the code, and I also ran into it as a bug.

@carmocca
Contributor

carmocca commented Apr 2, 2024

This has come up several times already and people are often confused by it (see also #1179 (comment)). What if we simply remove this bit, @rasbt, @awaelchli? This is also a challenge for thunder because generation has prefill and next-token stages with different shapes.

@carmocca added the "enhancement" (New feature or request) label and removed the "question" (Further information is requested) label on Apr 2, 2024
@rasbt
Collaborator

rasbt commented Apr 2, 2024

How about we add a "validate_prompt: Your text" entry to the configs but leave it empty (disabled) by default? That way users can choose to use it, because it can sometimes be useful, but it will be off by default so it doesn't confuse people.
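
A rough sketch of the idea (the EvalConfig dataclass and field name below are hypothetical, not existing lit-gpt config API):

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalConfig:
    # Hypothetical field: a fixed prompt to generate from during validation, disabled by default.
    validate_prompt: Optional[str] = None

def maybe_print_validation_sample(eval_config: EvalConfig) -> None:
    if eval_config.validate_prompt is None:
        # Off by default, so users of unrelated datasets are not confused by a movie recommendation.
        return
    sample = {"instruction": eval_config.validate_prompt, "input": ""}
    # from here, build the prompt, generate, and print as in the snippet at the top of this issue
    print(f"Validation prompt: {sample['instruction']}")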

@carmocca
Contributor

carmocca commented Apr 2, 2024

That works, but the counterargument is to keep things simple. If it won't be on by default, then I wouldn't have it at all. But if we don't want anybody having to write any code, then we should have it.

@rasbt
Collaborator

rasbt commented Apr 2, 2024

> But if we don't want anybody having to write any code, then we should have it.

For my part, I actually like seeing the text generation for a fixed prompt during training because it's useful, so I'd like to have that option in the config files to (re)enable it, for selfish reasons 😅

@rhiever

rhiever commented Apr 2, 2024

IMO the proposed solution to use a fixed query from the test split makes a lot of sense.
