Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of dataset validation for Causal Language Models #865

Merged
merged 16 commits into from
Oct 22, 2024

Conversation

us8945
Copy link
Contributor

@us8945 us8945 commented Sep 20, 2024

Resolve issue #855
Changes:

  • Add handling of Assert exception and improve user displayed message
  • Improve assert errors to add information that will be relied back to user on how to fix the error

Test:

  • When populating parent-id, select column that is populated for all records
  • Try to Import dataset with no "id" column

Copy link
Collaborator

@pascal-pfeiffer pascal-pfeiffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @us8945 , I like the improved visuals.

Though, I think the process is still too difficult to understand for first time users. And the error message might now even be misleading.

Please go back and verify whether the problem type and other settings were set properly.

This pops up when I am setting up parent_id but let's say my id column is called msg_id. It isn't really clear for users how this can be fixed. I'd really go into detail explaining how the dataframe needs to be structured and set-up to be used in conversation chains.

It also doesn't have to be on second page for such expected issues, but can be just an overlay that can be clicked away.

Might as well give the option to select another column as id column instead if it only is a naming issue. In that case, modify the dataframe before saving the dataset. And it will prevent that the user needs to go back to modify the dataframe outside of LLM Studio.

@us8945
Copy link
Contributor Author

us8945 commented Sep 24, 2024

In the error message to the user, remove "copy" button.

Copy link
Collaborator

@sherenem sherenem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc edits look LGTM. Suggested a minor nitpick change.

documentation/docs/tooltips/experiments/_id-column.mdx Outdated Show resolved Hide resolved
@us8945 us8945 dismissed pascal-pfeiffer’s stale review October 2, 2024 16:26

Completed with the latest changes

Copy link
Collaborator

@pascal-pfeiffer pascal-pfeiffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this issue.

In the UI, changing the parent_id resets the state for other fields. This should not happen. See video below:

reset-2024-10-09_10.10.05.mp4

If I change the ID column in the experiment settings now (while having a parent id set), the error is again too cryptic to understand:

ValueError: Found input variables with inconsistent numbers of samples: [13026, 12860]

Nitpick: the error is in a code field which is color coded and has a copy button. Neither is needed and adds distraction.
image

@us8945
Copy link
Contributor Author

us8945 commented Oct 10, 2024

Hi @pascal-pfeiffer , I have corrected two out of three problems you have raised. I could not replicated the problem with the experiment parameters change that is resulted in this exception:

ValueError: Found input variables with inconsistent numbers of samples: [13026, 12860]

Could you please provide steps to recreate and the dataset?

@us8945
Copy link
Contributor Author

us8945 commented Oct 11, 2024

Hi @pascal-pfeiffer , the problem is with "Experiment" window is not handling properly changes in the "Parent Id Column" and "Id Column". I will add code to perform validation checks on those fields in a similar way done in the Dataset set up.

Copy link
Collaborator

@pascal-pfeiffer pascal-pfeiffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still getting the same error. The latest changes don't seem to get triggered.

ValueError: Found input variables with inconsistent numbers of samples: [13026, 12860]

This is just oasst defaults with Parent_ID = parent_id and setting a wrong ID_Column = instruction

image

@us8945 us8945 dismissed pascal-pfeiffer’s stale review October 22, 2024 14:56

Changes made to perform additional sanity checks

Copy link
Collaborator

@pascal-pfeiffer pascal-pfeiffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! All working great!

Though, the reload brings another issue w.r.t. merging with existing datasets. The button reappears when parent_id is changed.
I think it has always been present when the validation failed. Now just gets more obvious when the UI reloads on parent_id change.

A minor request before merging: please remove the commented out code that you moved to step 5

llm_studio/app_utils/sections/dataset.py Outdated Show resolved Hide resolved
@us8945 us8945 merged commit 4d9d59e into h2oai:main Oct 22, 2024
4 checks passed
@us8945 us8945 deleted the uri/issue-855 branch October 22, 2024 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[UX] Missing ID column when a parent_id was specified, give a nice warning instead of a traceback
3 participants