Hardcoded incorrect (and repeated) validation example #796
Comments
This piece of validation was useful to subjectively evaluate the output of a fine-tuned model on the Alpaca dataset as training progressed. Reusing it is intentional so that you can see how the output evolves with more steps. You are correct that it might be completely irrelevant for other datasets. Feel free to remove those lines from your script entirely or change the instruction to something more relevant to your use case.
Why not just take it from your test split?
You could do that, but then it might be different if you shuffle or change your dataset splits.
When I added that prompt to the scripts, I was genuinely looking for a movie to watch, and that's why it is there.
We are indeed randomly shuffling the dataset when preparing it, but the seed is always the same in every prepare script 😉
Not questioning that, sir, but it took me a bit to check that I was not messing with the datasets 😁
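For illustration, here is a minimal sketch of the test-split idea discussed above. It assumes the prepare script saved the split as `test.pt`, a list of dicts with an `input_ids` tensor; the path and keys are assumptions for this sketch, so check what your own prepare_*.py actually writes.

```python
import torch

# Hedged sketch: pick one fixed sample from the prepared test split instead of a
# hardcoded instruction. The path and the "input_ids" key are assumptions about
# the prepare script's output; adjust them to whatever your prepare_*.py saves.
test_data = torch.load("data/alpaca/test.pt")

# Deterministic because the prepare script shuffles with a fixed seed, so index 0
# is always the same example for a given dataset version.
sample = test_data[0]
print(sample["input_ids"][:20])  # still token ids; decoding is sketched further below
```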
@DavidGOrtega What do you think we could do? The requirement is that the lit-gpt scripts stay as simple as possible. We want people to read the entire script and understand it in and out. The scripts are more like templates that the user adapts to their needs. So we wouldn't want to add too much code here to select a sample from the dataset directly. As Carlos said, that would require instantiating the dataset separately. Maybe we could add a comment above the code that selects the sample. I would also be fine with removing it if it leads to too much confusion. I just thought that printing some output as a sanity check could help.
Any thoughts here, @awaelchli?
My thoughts are: feel free to make a concrete suggestion in the form of a PR regarding the sample, or remove it entirely. We appreciate and encourage active contributions if they are within the project's scope.
A human error of leaving one redundant line of code in there is nowhere near a deliberate act of obfuscation. Let's just remove this line and the world is in balance again. If you see dead code elsewhere, please call it out, but we need some concrete evidence first that there are real problems with the code being obscure and hard to understand (what else is there that you are hinting at?). Of course, lit-gpt is an NLP-focused template, so certain terminology is expected to be known. But by all means, if we think that terms like eos and bos are too abbreviated, let's add a comment to explain them at the appropriate places! The scripts being versions of each other is intentional. Lit-GPT actively tries not to be a framework, and thus favors single-file templates over abstractions, even if that means certain blocks of code are repeated. Think of them as templates!
What does that mean concretely in lit-gpt? I'm still missing concrete examples. I'd like to talk specifics to make progress on the issue.
I can do a PR if you are able to accept it.
Happy to take a look at your proposal. Since the data comes pre-tokenized, you'll have to change the code to decode it (to print it for the human).
Happy to hear! Let me prepare something.
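To make the point about decoding concrete, here is a hedged sketch: since the samples are pre-tokenized, printing a human-readable example means decoding the token ids first. It assumes the `Tokenizer` wrapper used by the generate scripts (with a `decode` method) and the same hypothetical `test.pt` layout as in the sketch above; the checkpoint path is also just an example.

```python
from pathlib import Path

import torch

from lit_gpt import Tokenizer  # tokenizer wrapper used by the generate scripts

# Hedged sketch: decode a pre-tokenized test sample so it can be printed for a human.
# The checkpoint path and the "input_ids" key are assumptions for illustration.
checkpoint_dir = Path("checkpoints/meta-llama/Llama-2-7b-hf")
tokenizer = Tokenizer(checkpoint_dir)

test_data = torch.load("data/alpaca/test.pt")
sample_ids = test_data[0]["input_ids"]
print(tokenizer.decode(sample_ids))
```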
Did anything ever come of this issue/PR? I notice that the hardcoded example is still in the code, and I also ran into it as a bug.
This has come up several times already and people are often confused by it (see also #1179 (comment)). What if we simply remove this bit, @rasbt, @awaelchli? This is also a challenge for Thunder, because generation has prefill and next-token stages with different shapes.
How about we add an option for this?
That works, but the counterargument is to keep things simple. If it's not enabled by default, then I wouldn't have it. But if we don't want anybody to have to write any code, then we should have it.
For my part, I actually like seeing the text generation for a fixed prompt during training because it's useful, so I'd like to have that option in the config files to (re)enable it, for selfish reasons 😅
IMO the proposed solution to use a fixed query from the test split makes a lot of sense.
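If the thread settles on making this opt-in, a sketch of the flag idea could look like the following. The dataclass and the `sample_prompt` field are made up for illustration and are not existing lit-gpt config fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalArgs:
    # Hypothetical field: when None, the periodic sanity-check generation is skipped;
    # when set (e.g. via a YAML config), the training loop prints a sample for it.
    sample_prompt: Optional[str] = None


def maybe_print_sample(eval_args: EvalArgs, generate_fn) -> None:
    # Only run the subjective check if the user explicitly opted in.
    if eval_args.sample_prompt is None:
        return
    print(generate_fn(eval_args.sample_prompt))
```

Defaulting to None keeps the scripts simple for everyone else, while a single config line re-enables the behavior for those who like watching a fixed prompt evolve during training.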
👋 I have found this code to be surprisingly incorrect, as my dataset has nothing to do with this validation example. It's also repeated in every single fine-tuning script.