
Support for ggml #417

Closed
philwee opened this issue Apr 18, 2023 · 18 comments
Labels: good first issue (Good for newcomers), help wanted (Contributors and extra help welcome.)
@philwee
Contributor

philwee commented Apr 18, 2023

Could support for ggml be added to this soon? 4-bit quantized models are said to be pretty decent, but there is no reliable way to test this out. It would be nice if the harness could evaluate them.

Thank you!

@jon-tow
Member

jon-tow commented Apr 22, 2023

@philwee tagging the Python bindings you shared, which should make it much easier to add ggml support:

https://github.com/abetlen/llama-cpp-python

@haileyschoelkopf
Collaborator

If someone wants to work on this I’d be happy to give pointers! All that’s required is a new LM subclass akin to #395.

I may take a look at working on this integration on our end in ~1 month from now, if no one else has started a PR by then.

@philwee
Contributor Author

philwee commented Apr 29, 2023

I can try to work on this, could you give some pointers?

@haileyschoelkopf
Collaborator

haileyschoelkopf commented Apr 29, 2023

Of course! I’d recommend looking at the PR I linked to get a sense of what the scope might be.

The process would look something like:

  • Make a new file in lm_eval/models called “ggml_model.py” or similar (a rough skeleton is sketched below).
  • In that file, make a BaseLM subclass called GGMLLM or similar. This class should do the following:
      • In initialization, instantiate a model using the Python bindings @jon-tow linked.
      • Implement the loglikelihood_rolling(), loglikelihood(), and greedy_until() class methods to support all 3 completion types (see gpt3.py or BaseLM for a template to compare to).
      • Add any helper methods for those functions!
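For concreteness, here's an untested skeleton of what that file might look like. It assumes llama-cpp-python's Llama constructor; the constructor arguments are placeholders, and a real subclass would also need to satisfy BaseLM's other hooks (tokenization, max context length, etc.):

```python
# lm_eval/models/ggml_model.py -- a rough, untested sketch.
from lm_eval.base import BaseLM
from llama_cpp import Llama  # the bindings @jon-tow linked


class GGMLLM(BaseLM):
    def __init__(self, model_path, n_ctx=2048):
        super().__init__()
        # logits_all=True keeps logits for every position, which
        # loglikelihood scoring needs.
        self.model = Llama(model_path=model_path, n_ctx=n_ctx, logits_all=True)

    def loglikelihood(self, requests):
        # For each (context, continuation) pair: tokenize both, run the
        # model, and sum log-probs over the continuation tokens. Returns
        # a list of (logprob, is_greedy) tuples.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Log-likelihood of whole documents, computed in windows of at
        # most the model's context length (see BaseLM for the reference
        # rolling-window logic).
        raise NotImplementedError

    def greedy_until(self, requests):
        # For each (context, until) pair, generate greedily until a stop
        # sequence, e.g. self.model(context, temperature=0.0, stop=until).
        raise NotImplementedError
```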

Lmk if this makes sense!

@StellaAthena
Member

I asked about this in the ggml library and the response (ggerganov/ggml#120 (comment)) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

@StellaAthena StellaAthena added help wanted Contributors and extra help welcome. good first issue Good for newcomers labels Apr 30, 2023
@StellaAthena
Member

Carson Poole reports:

ggml is doing the compute in int4, rather than using int4 just for weight storage. That's how it can be so much faster than a typical CPU implementation, because CPUs are more compute-bound than GPUs for GEMMs.
It's also egregiously slow for long input context lengths. A very unoptimized WebGPU implementation will obliterate ggml's speed on something like 500-1000 tokens of input.

So it may be worth lowering the priority on this. Of course, implementing it would enable us to better evaluate these claims 🙃

@Green-Sky

Green-Sky commented Apr 30, 2023

A very unoptimized WebGPU implementation will obliterate ggml's speed on something like 500-1000 tokens of input.

BLAS support exists (OpenBLAS, cuBLAS, CLBlast), and at larger batch sizes it outperforms the plain SIMD-tuned code (OpenBLAS -> CPU; cuBLAS and CLBlast -> GPU).

The BLAS acceleration can already make a difference at single-digit batch sizes.

edit: also, since only the logits are of interest, eval can be done with very large batch sizes (even better for BLAS).
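A quick way to see the batch-size effect, assuming llama-cpp-python's Llama.eval interface (the model path and prompt below are placeholders):

```python
from time import perf_counter

from llama_cpp import Llama

# n_batch controls how many prompt tokens are fed through per step;
# BLAS-enabled builds benefit most when this is large.
llm = Llama(model_path="models/7B/ggml-model-q4_0.bin",  # placeholder path
            n_ctx=2048, n_batch=512, logits_all=True, verbose=False)

tokens = llm.tokenize(b"Some long evaluation prompt. " * 60)
t0 = perf_counter()
llm.eval(tokens)  # prompt ingestion only; we just need the logits
print(f"evaluated {len(tokens)} tokens in {perf_counter() - t0:.2f}s")
```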

@Green-Sky

I asked about this in the ggml library and the response (ggerganov/ggml#120 (comment)) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

Personally I think this one is better (no need to call that one a "starting point").

@StellaAthena
Member

I asked about this in the ggml library and the response (ggerganov/ggml#120 (comment)) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

Personally I think this one is better (no need to call that one a "starting point").

I saw that, but per the issue at abetlen/llama-cpp-python#71 it appears to be 5x slower than the underlying implementation.

@Green-Sky

It might be because it does not build llama.so/.dll properly (only in one configuration), so SIMD might be disabled. There is also the fact that there is no official BLAS-enabled build available anywhere (see abetlen/llama-cpp-python#117).

@Green-Sky

But these issues are "easy" to fix after the fact, since you can build llama.dll yourself with the build options you like and replace the one shipped with the bindings (recommended right now).

@StellaAthena
Member

@Green-Sky I have almost no experience with C, but if you can do that and demonstrate acceptable speed, that works for me.

@gjmulder

@StellaAthena If you want to give me a representative test prompt, I can compare llama-cpp-python to native llama.cpp. I also have both a 16-core CPU w/ 128GB of RAM and a shiny new 3090 Ti w/ 24GB if you need some test cycles.

Here are my (short-run, comparative) perplexity scores to date with the models I have on hand.

@StellaAthena
Member

@gjmulder I haven’t had the bandwidth to test it yet, but this PR supports saving the actual predictions to disk: #492

You can run Lambada, HellaSwag, and ANLI with a limit of 20. If the predictions end up identical, I think it's safe to assume that generalizes. Maybe throw in a math problem too.
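If it helps, a limited run like that can also be driven from Python; a sketch assuming the harness's evaluator.simple_evaluate entry point, with model and task names as examples:

```python
from lm_eval import evaluator

# Score only the first 20 examples per task -- enough to diff the
# saved predictions between two backends without a long run.
results = evaluator.simple_evaluate(
    model="gpt2",  # stand-in; swap in the ggml-backed model once it exists
    tasks=["lambada", "hellaswag", "anli_r1"],
    limit=20,
)
print(results["results"])
```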

@gjmulder

llama-cpp-python attempts to implement the OpenAI API, so I may look at simply pointing the harness at an instance of llama-cpp-python and running a few smoke tests.
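A smoke test along those lines might look like the following, assuming the server is running locally (e.g. via python -m llama_cpp.server) on its default port, with the endpoint shape taken from the OpenAI completions API it mimics:

```python
import requests

BASE_URL = "http://localhost:8000/v1"  # default llama_cpp.server address (assumption)

resp = requests.post(f"{BASE_URL}/completions", json={
    "prompt": "The quick brown fox",
    "max_tokens": 1,
    "logprobs": 1,
    "echo": True,  # echo the prompt so its per-token logprobs come back too
})
resp.raise_for_status()
choice = resp.json()["choices"][0]
print(choice["text"])
print(choice["logprobs"]["token_logprobs"])
```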

@StellaAthena
Member

Sounds great!

@matthoffner

Started adding support for a llama-cpp-python server here: #617

@haileyschoelkopf
Collaborator

Courtesy of @matthoffner, lm-eval now supports GGML Llama models via llama-cpp-python!
