Support LLaVA #273
I'm working on adding LLaVA to Bumblebee as a learning exercise.
I need some guidance on a few things:
The transformers package has not added support for LLaVA yet; there's an ongoing PR that can be found here, but it has not been merged.
Thanks.
Comments
Hey @briankariuki! It looks like LLaVA is a composite model, so it will likely be closest to
Once it is officially implemented, they will probably update the HF repo, or have a separate one that reflects that implementation. It may be worth waiting for the HF PR to be finalized to see what decisions they make, but we can also prototype sooner and sync once they merge :)
Hey @jonatanklosko, thanks for the explanation above. I've been able to implement a few things so far, including the vision model and the text model. One problem I'm facing is how to convert the outputs of the vision model so that I can pass them to the text model. In the official implementation and the HF pull request there's a function that prepares the inputs for multimodal. I'm not sure how I would go about implementing that in Bumblebee. You can find the implementation of the function here. Thanks.
From a brief look, this function processes the output of the vision model before we feed it into the text model, and it actually uses some NN layers.
Thanks @jonatanklosko. Did you get a chance to look at the LLaVA model and code?
It looks like the upstream PR moved to this one and is closer to crossing the finish line. Looking again, I think this part is going to be really challenging, unfortunately. In a way the implementation is really stitching the models together: it embeds the image with the vision model, embeds the text with a specific layer from the text model, then combines these embeddings (separately for each batch entry) and passes the result through the text model.
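To make that combination step concrete, here is a rough Nx sketch, under the simplifying assumption that each prompt contains exactly one image placeholder token at a known position (the real implementation has to handle arbitrary placeholder positions and padding per batch entry; the module and function names are made up for illustration):

```elixir
defmodule Llava.StitchSketch do
  # text_embeds:  {batch, seq_len, hidden}      - text token embeddings
  # image_embeds: {batch, num_patches, hidden}  - vision output after the projector
  # image_pos:    index of the single image placeholder token in the prompt
  def merge_embeddings(text_embeds, image_embeds, image_pos) do
    seq_len = Nx.axis_size(text_embeds, 1)

    prefix = Nx.slice_along_axis(text_embeds, 0, image_pos, axis: 1)

    suffix =
      Nx.slice_along_axis(text_embeds, image_pos + 1, seq_len - image_pos - 1, axis: 1)

    # Drop the placeholder embedding and splice the image patch embeddings in its place
    Nx.concatenate([prefix, image_embeds, suffix], axis: 1)
  end
end
```

The combined sequence would then be fed to the text model as embeddings rather than token ids, which is also what the upstream implementation does (it passes `inputs_embeds` instead of `input_ids` to the language model).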
Any ideas or progress on this? I ran up against it again today and am wondering if there is anything I can do to help push this over the line.
Hello. I got stuck on how to implement the projector part that extracts the image features and embeds them into the LLM as tokens.
I was able to implement LLavaVision and LLavaText, which are very similar to ClipVision and LlamaText. The missing piece is the multimodal projector.
It looks like llama.cpp ran into some of the same types of issues pulling it into their interfaces; here is a PR that shows how they are working through them: ggerganov/llama.cpp#5267. I am still trying to map this back to Bumblebee to see if there are any parallels to be had.
The multimodal projector should just be an FFN between the image and the LLM. I can take a look at this.
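If it matches the reference implementation, that FFN is a small two-layer MLP with a GELU in between, mapping the vision hidden size to the text hidden size. A minimal Axon sketch (the module, layer names, and exact layout here are assumptions for illustration, not Bumblebee's final spec; the real sizes come from the checkpoint config):

```elixir
defmodule Llava.ProjectorSketch do
  # Projects vision features {batch, num_patches, vision_size} into the
  # text model's embedding space {batch, num_patches, text_size}.
  def multimodal_projector(vision_size, text_size) do
    Axon.input("image_features", shape: {nil, nil, vision_size})
    |> Axon.dense(text_size, name: "multimodal_projector.linear_1")
    |> Axon.activation(:gelu, name: "multimodal_projector.activation")
    |> Axon.dense(text_size, name: "multimodal_projector.linear_2")
  end
end
```

Wiring it in would then mostly be a matter of feeding the vision model's output hidden state through this block before the stitching step above, and mapping the projector parameters when converting the checkpoint.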