-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve handling of dataset validation for Causal Language Models #865
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @us8945 , I like the improved visuals.
Though, I think the process is still too difficult to understand for first time users. And the error message might now even be misleading.
Please go back and verify whether the problem type and other settings were set properly.
This pops up when I am setting up parent_id
but let's say my id
column is called msg_id
. It isn't really clear for users how this can be fixed. I'd really go into detail explaining how the dataframe needs to be structured and set-up to be used in conversation chains.
It also doesn't have to be on second page for such expected issues, but can be just an overlay that can be clicked away.
Might as well give the option to select another column as id
column instead if it only is a naming issue. In that case, modify the dataframe before saving the dataset. And it will prevent that the user needs to go back to modify the dataframe outside of LLM Studio.
In the error message to the user, remove "copy" button. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc edits look LGTM. Suggested a minor nitpick change.
Co-authored-by: Sherene Mahanama <[email protected]>
Co-authored-by: Sherene Mahanama <[email protected]>
Completed with the latest changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this issue.
In the UI, changing the parent_id
resets the state for other fields. This should not happen. See video below:
reset-2024-10-09_10.10.05.mp4
If I change the ID column in the experiment settings now (while having a parent id set), the error is again too cryptic to understand:
ValueError: Found input variables with inconsistent numbers of samples: [13026, 12860]
Nitpick: the error is in a code field which is color coded and has a copy button. Neither is needed and adds distraction.
Hi @pascal-pfeiffer , I have corrected two out of three problems you have raised. I could not replicated the problem with the experiment parameters change that is resulted in this exception:
Could you please provide steps to recreate and the dataset? |
Hi @pascal-pfeiffer , the problem is with "Experiment" window is not handling properly changes in the "Parent Id Column" and "Id Column". I will add code to perform validation checks on those fields in a similar way done in the Dataset set up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes made to perform additional sanity checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! All working great!
Though, the reload brings another issue w.r.t. merging with existing datasets. The button reappears when parent_id is changed.
I think it has always been present when the validation failed. Now just gets more obvious when the UI reloads on parent_id change.
A minor request before merging: please remove the commented out code that you moved to step 5
Resolve issue #855
Changes:
Test: