
Simplifying the quantization pipeline #9

Open
kamalojasv181 opened this issue Mar 20, 2023 · 6 comments

Comments

@kamalojasv181
Contributor

The quantization pipeline seems very hard to use. Rather than manually adding support for each popular model, I think it would be a good idea to further automate the quantization pipeline.

As far as I can tell, we only need a dict mapping the names of layers in the original model to the corresponding layers in the gpt model, and then the same script can handle any architecture.
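Something like this hypothetical sketch is what I have in mind (all layer names here are made up for illustration; the real mapping would be filled in per architecture):

```python
# Hypothetical example - actual parameter names differ per architecture.
# Maps parameter names in the original checkpoint to the names our
# conversion/quantization script expects.
LAYER_NAME_MAP = {
    "transformer.wte.weight":               "tok_embeddings.weight",
    "transformer.h.{i}.ln_1.weight":        "layers.{i}.attention_norm.weight",
    "transformer.h.{i}.attn.c_attn.weight": "layers.{i}.attention.qkv.weight",
    "transformer.h.{i}.mlp.c_fc.weight":    "layers.{i}.feed_forward.w1.weight",
}

def translate_name(orig_name: str, n_layers: int, mapping=LAYER_NAME_MAP) -> str:
    """Translate one original parameter name using the per-architecture mapping."""
    for i in range(n_layers):
        for src, dst in mapping.items():
            if orig_name == src.format(i=i):
                return dst.format(i=i)
    raise KeyError(f"no mapping defined for {orig_name}")
```

With that, adding a new architecture would mostly mean writing a new mapping dict rather than a new script.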

Thoughts @tejasvaidhyadev @Ayushk4?

@Ayushk4
Member

Ayushk4 commented Mar 20, 2023

This is true. A lot of the code needs refactoring. We need to make it easy to add new models.

The good thing is that once we add support for any major model (like GPT-J), it becomes very easy to add support for its derivatives (like GPT-JT).

I would greatly welcome suggestions on how we can improve on this.

@A2va
Contributor

A2va commented Mar 20, 2023

I'm wondering if it's possible to do the whole quantization process in the Python conversion script. I feel like this would be much simpler than a two-step process with two different programs.
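As a rough sketch of what that could look like (plain numpy, block-wise round-to-nearest 4-bit; the real ggml block layout and packing are different, so this is only illustrative):

```python
import numpy as np

def quantize_q4_blocks(w, block=32):
    """Toy round-to-nearest 4-bit quantization with one scale per block.
    Assumes w.size is a multiple of `block`; ggml's actual Q4 formats
    pack two values per byte and lay out blocks differently."""
    flat = w.astype(np.float32).reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0       # per-block scale
    scale[scale == 0] = 1.0                                     # avoid divide-by-zero
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)  # signed 4-bit range
    return q, scale.astype(np.float32)

def dequantize_q4_blocks(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)
```

The conversion script could then write the quantized blocks and scales directly, instead of writing full-precision weights and running the separate C++ quantizer afterwards.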

@Ayushk4
Member

Ayushk4 commented Mar 21, 2023

That's a good suggestion, @A2va.

Do you have any suggestions on how we can quantize and save in a fast manner in python?

@kamalojasv181
Contributor Author

I don't think we need Python for that. We already have all the weights saved in the ggml model; we just need the computation graph. ONNX does this by doing a forward pass and saving a static graph. We could potentially do something like that, or perhaps start with ONNX itself.
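For reference, capturing a static graph with ONNX looks roughly like this (the model here is just a placeholder to show the mechanics):

```python
import torch

# Placeholder model - stands in for whatever architecture we want to trace.
model = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.GELU())
model.eval()

dummy_input = torch.randn(1, 768)           # a forward pass on example inputs...
torch.onnx.export(                          # ...records a static computation graph
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},   # keep the batch dimension dynamic
)
```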

@Ayushk4
Member

Ayushk4 commented Mar 21, 2023

ONNX is general purpose - ggml does not support all of its operations, like slicing. If we go that route, we would have to add support for reading their computation graph, map it onto a GGML computation graph, and write graph-specific rewrite rules to substitute for the missing operations. If it is worth it in the long run, we could do it. But it will take a long time to get something tangible - even a minimal viable prototype that converts an ONNX computation graph to GGML for a single model.
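Just to make the first step concrete, walking the ONNX graph and flagging ops without a ggml equivalent would look something like this (the supported-op set below is invented for illustration, not an actual ggml capability list):

```python
import onnx

# Illustrative only: a made-up, incomplete set of ONNX ops that map 1:1 to ggml.
GGML_SUPPORTED = {"MatMul", "Add", "Mul", "Softmax", "Gelu", "LayerNormalization"}

model = onnx.load("model.onnx")
for node in model.graph.node:
    if node.op_type not in GGML_SUPPORTED:
        # Ops like Slice would need graph-rewrite rules (or new ggml ops)
        # before the graph can be lowered to ggml.
        print(f"no direct ggml equivalent for {node.op_type} (node {node.name})")
```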

@A2va
Contributor

A2va commented Mar 21, 2023

Do you have any suggestions on how we can quantize and save in a fast manner in python?

Not directly, but I found this script in llama.cpp which takes an already quantized PyTorch model and converts it to a ggml model.

Quantization of LLaMa and OPT models in Python: https://github.com/qwopqwop200/GPTQ-for-LLaMa

I have no idea how fast it is, but it's certainly slower than the C++ version.
I had no idea until I read the README that GPTQ and int4 quantization are different. So which of those methods do the cpp programs use to quantize?
