Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RHEL AI / InstructLab CPT support #117

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

dbutenhof
Copy link
Collaborator

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization
  • Documentation Update

Description

Support CPT needs for the RHEL AI / InstructLab performance team.

This PR adds three components:

  1. A "Crucible" service to help project code interpret Crucible CDM data
  2. An "ilab" project providing APIs built on top of the Crucible service to discover InstructLab runs and provide configuration data and metrics.
  3. An "ilab" UI tab building on top of the API to discover and display InstructLab performance runs.

The crucible_svc is intended to be "reasonably general purpose" to support use by additional projects over time.

Related Tickets & Documents

Various Jira stories under Epic PANDA-496.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.

Testing

InstructLab CPT is using a persistent Crucible controller system in RDU3, tied to a 4-way L40S test system. The data store (a private OpenSearch instance) contains a set of Crucible runs capturing both training and SDG runs.

image
image

dbutenhof and others added 5 commits September 27, 2024 16:30
GET localhost:8000/api/v1/ilab/runs?benchmark=ilab will query the
ilab.crucible OpenSearch instance and return a list of ilab benchmark
runs.
Add Crucible readme file.

Cleanups and refactoring
Also added the option to override the default graph title generator using the
new `Graph.title` field.
Copy link
Collaborator Author

@dbutenhof dbutenhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial comments on my own re-re-re-review. 🤣

The backend comments are already fixed, but I'm posting them "for posterity" (whomever Mr Posterity may be); not quite ready to push yet. I haven't done anything about the UI comments ...

frontend/src/actions/ilabActions.js Outdated Show resolved Hide resolved
backend/app/api/v1/endpoints/ilab/ilab.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Show resolved Hide resolved
local-compose.sh Outdated Show resolved Hide resolved
Comment on lines +21 to +22
export const ILABS_JOBS_API_V1 = "/api/v1/ilab/runs";
export const ILAB_GRAPH_API_V1 = "/api/v1/ilab/runs/";
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these identical? Are both used? Could they be combined?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After loading the page many times, I'm starting to think that we should open the graph accordion with just the primary metric(s) graphed. That is, we shouldn't wait until someone makes a selection from the secondary metric pulldown -- but when they do, we can add the second graph.

frontend/src/actions/ilabActions.js Outdated Show resolved Hide resolved
This cleans up my direct API call to get the run's periods for graphing, to
use a separate action and a reducer.

I also experimented with trying to improve error diagnosis by looking at some
of the error responses to "toast" instead of just saying something went wrong.
Add a Crucible `close` method, and use a FastAPI yield dependency to ensure
every API connection is closed cleanly.
@dbutenhof dbutenhof marked this pull request as ready for review October 4, 2024 12:53
Copy link
Collaborator

@mfleader mfleader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I have right now. I'm working on ilab.py and crucible_svc.py.

backend/app/main.py Outdated Show resolved Hide resolved
backend/app/main.py Outdated Show resolved Hide resolved
backend/scripts/start-reload.sh Outdated Show resolved Hide resolved
@@ -1,5 +1,5 @@

# Openshift Performance Dashbaord
# Openshift Performance Dashboard
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be factored out into a tidying pr

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a one letter swap? That seems ... excessive? 😦

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's why you bundle the tidyings together, but yeah I'll let it slide

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real question is more from the other side -- at which point, finding small incidental fixes and helpful refactorings during a development task, does it become necessary (or even "advisable") to stop "making progress" in order to pull out all those changes and separate them into another PR?

Reformatting an entire codebase (which I'd love to do to cpt-dashboard, by the way, because it's wildly inconsistent and often ugly), for example, is clearly not something to even attempt in the middle of something else.

I backed out the fallback exception handler for you, which I kinda regret as I did some refactoring and had to guess at where a problem was because there's no traceback. (I thought I might try to investigate where FastAPI or some other part of the framework is dropping the ball, but it's awkward to even figure out where to start, and I've not had time.) This stuff could go in a separate PR, sure; but one needs it when writing significant new code, and we don't have a quick pipeline in cpt-dashboard so separating it means it might as well not be there at all. (And possibly never will be.) 🤷🏻

Minor stuff like fixing a one letter typo isn't worth tracking, and putting it off inevitably means it never gets done. It's isolated, it's trivial to review, and it gets done.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I realized later that I hadn't backed out the fallback handler ... but it somehow stopped working for me. After some investigation I realize that it was because I added the yield "dependency" mechanism to manage the Crucible connection, and my except block there was absconding with my "unhandled" exception. I think I fixed it. Again, while this is technically separate from the "InstructLab/Crucible support", it was an important development tool I can't believe people were living without before (though I still don't understand why the built-in FastAPI unhandled exception handler isn't logging traceback, even with FastAPI(debug=True,...), and some effort spent trying to figure that out was ultimately a complete waste of time... 😦 )

frontend/src/components/atoms/PlotGraph/index.jsx Outdated Show resolved Hide resolved
frontend/src/components/templates/ILab/index.jsx Outdated Show resolved Hide resolved
const onSelect = (_event, value) => {
console.log("selected", value);
const run = value.split("*");
//setSelected(run[1].trim());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this dead code?


export const setSelectedMetrics = (id, metrics) => (dispatch, getState) => {
const metrics_selected = cloneDeep(getState().ilab.metrics_selected);
// if (id in metrics_selected) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this dead code?

const { start_date, end_date, size, offset } = getState().ilab;
const response = await API.get(API_ROUTES.ILABS_JOBS_API_V1, {
params: {
...(start_date && { start_date }),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this line mean?

Copy link
Collaborator Author

@dbutenhof dbutenhof Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting. ... is the spread operator, which unrolls an array or object. I think start_date && { start_date } should result in ...{ start_date } (which will be spread to start_date: start_date in the surrounding object as an API query parameter) iff start_date is non-null.

But when I try a simplified version of this interactively in the Chrome console, it's unhappy about the paren, so I wonder if this'll actually work:

Uncaught SyntaxError: Unexpected token '('

@MVarshini ??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are still pretty confusing. I think a comment in the code may be worth it.

backend/app/services/crucible_readme.md Outdated Show resolved Hide resolved
frontend/src/actions/toastActions.js Outdated Show resolved Hide resolved
backend/app/main.py Outdated Show resolved Hide resolved
@dbutenhof dbutenhof self-assigned this Oct 8, 2024
@dbutenhof dbutenhof added the enhancement New feature or request label Oct 8, 2024
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
Copy link
Member

@jaredoconnell jaredoconnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments for the backend.

backend/app/api/v1/endpoints/ilab/ilab.py Show resolved Hide resolved
backend/app/services/crucible_readme.md Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
Copy link
Member

@jaredoconnell jaredoconnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any blockers. I'll leave the final approval up to the developers that maintain this repository.

const { start_date, end_date, size, offset } = getState().ilab;
const response = await API.get(API_ROUTES.ILABS_JOBS_API_V1, {
params: {
...(start_date && { start_date }),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are still pretty confusing. I think a comment in the code may be worth it.

frontend/src/actions/ilabActions.js Outdated Show resolved Hide resolved
Comment on lines +40 to +44
const checkAndFetch = (_evt, newPage) => {
if (props.type === "ilab") {
dispatch(fetchNextJobs(newPage));
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because it's only fetched per page for iLab, and the others do one fetch for all pages?

Comment on lines +28 to +29
data={getGraphData(item.id)[0]?.data}
layout={getGraphData(item.id)[0]?.layout}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason it's a list? And why the first item?

import { Title } from "@patternfly/react-core";
import { uid } from "@/utils/helper";

const MetaRow = (props) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be helpful to add a brief comment stating what a meta row is for.


return hasData;
};
/* Metrics select */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason a block comment is used for a single exclusive line?

status: "Status",
};

return (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for this thing to be split up? If so, do you think it would be worth it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it could be refactored into subcomponents; whether that'd be worthwhile is a different question. There isn't a lot of logic here above the external components it's already using.

backend/app/services/crucible_svc.py Outdated Show resolved Hide resolved
+ add some method documentation
+ misc review feedback
@dbutenhof
Copy link
Collaborator Author

I'll split this up along functional lines. That'll be a bit neater after #118 is resolved, as it factors out the Python dependency changes.

@dbutenhof dbutenhof mentioned this pull request Oct 16, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants