
Towards a C++ library #36

Open
A2va opened this issue Mar 30, 2023 · 3 comments

Comments

A2va (Contributor) commented Mar 30, 2023

Development of roadmap ideas:

  • Restructure the codebase for reuse.

  • Switch to Pybind11 rather than subprocess - expected speedup: 3-4x.

I'm particularly interested in this project becoming a C++ library. This would allow multiple projects to reuse the code.
I feel it's worth mentioning CTranslate2, a C++ library mainly for translation transformers, which also does text generation with BLOOM or OPT.

Anyway, I find the current project structure not very practical for this task, so I propose moving all the C files into src and include directories at the root of the repo. This would simplify the usage/compilation of the C backend.
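
For illustration only, the layout could look roughly like this (a sketch; the exact names are up for discussion):

cformers/
    include/     <- public headers of the C backend
    src/         <- C/C++ implementation files
    quantize/    <- standalone tools such as quantize_bloom.cpp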

Currently, the model is loaded every time a prompt is submitted, which slows down the process. So instead of invoking an executable, an API could be exposed through pybind11 so the model stays loaded between prompts.
That API might look something like this:

bloom_ctx *bloom_load(const char *model_path);          // load the weights, return a pointer to the model context
char *bloom_eval(bloom_ctx *ctx, const char *prompt);   // run inference, return the output
void bloom_free(bloom_ctx *ctx);                        // free the model context
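
To sketch how this would remove the per-prompt reload, here is a minimal pybind11 binding over the hypothetical bloom_* API above (module and class names are made up for illustration, and buffer ownership is left aside):

#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Forward declarations of the hypothetical C API sketched above.
struct bloom_ctx;
bloom_ctx *bloom_load(const char *model_path);
char *bloom_eval(bloom_ctx *ctx, const char *prompt);
void bloom_free(bloom_ctx *ctx);

// RAII wrapper: Python owns the context, so the weights are
// loaded once and reused across eval() calls.
class BloomModel {
public:
    explicit BloomModel(const std::string &path)
        : ctx_(bloom_load(path.c_str())) {}
    ~BloomModel() { bloom_free(ctx_); }
    std::string eval(const std::string &prompt) {
        return bloom_eval(ctx_, prompt.c_str());
    }
private:
    bloom_ctx *ctx_;
};

PYBIND11_MODULE(cformers_cpp, m) {
    py::class_<BloomModel>(m, "BloomModel")
        .def(py::init<const std::string &>())
        .def("eval", &BloomModel::eval);
}

// From Python, usage would then be roughly:
//   from cformers_cpp import BloomModel
//   model = BloomModel("path/to/model.bin")  # loaded once
//   model.eval("first prompt")
//   model.eval("second prompt")              # no reload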
Ayushk4 (Member) commented Mar 30, 2023

I 100% agree with this. This is also what I intend for this project to be.

Loading once should be enough, and there should be an option to keep the key & value (KV) cache around to avoid re-computation in a multi-turn (chat-style) mode; one possible shape for this is sketched below.
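
A minimal sketch, assuming the hypothetical bloom_* API from the first comment (none of these names exist in the codebase yet):

struct bloom_kv_cache;   // opaque handle holding per-layer key/value tensors

// Create a cache bound to a loaded model context.
bloom_kv_cache *bloom_kv_cache_new(bloom_ctx *ctx);

// Evaluate n_new tokens appended after n_past already-cached tokens;
// only the new tokens are run through the model, earlier positions
// are served from the cache.
int bloom_eval_cached(bloom_ctx *ctx, bloom_kv_cache *cache,
                      const int *tokens, int n_new, int n_past,
                      float *logits_out);

void bloom_kv_cache_free(bloom_kv_cache *cache);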

A2va (Contributor, Author) commented Apr 3, 2023

A lot has happened in the llama.cpp repo:

  • They have introduced an API, so we could take some inspiration from there.

  • It now supports memory mapping (mmap), but sadly the model file format has changed.

  • The Python conversion script has been rewritten, and it seems the quantization process is now done in Python. The new script does GPTQ quantization, but the C++ counterpart seems to be the same as cformers', i.e. Int4 quantization. I'm not much into ML, so I could be wrong.

Currently, the cformers repo has only one Makefile for building, which is only supported on POSIX systems. We could add a CMakeLists.txt like in the llama.cpp repo, but maintaining these build files across different OSes is not very practical. I have used XMake for some time; it's an alternative to CMake with Lua scripting.

Small example (already working with cformers code):

-- build modes and language standards
add_rules("mode.debug", "mode.release")
set_languages("cxx11", "c11")

-- the core library, built as static or shared depending on $(kind)
target("cformers")
    set_kind("$(kind)")
    set_default(true)

    add_files("src/**.cpp")
    add_files("src/**.c")

    if is_plat("linux") then
        add_syslinks("pthread")
        add_cflags("-D_POSIX_C_SOURCE=199309L")
    end

    add_headerfiles("include/**.h")
    add_includedirs("include", {public = true})

-- the quantization tool, linked against the library above
target("quantize_bloom")
    set_kind("binary")
    add_files("quantize/quantize_bloom.cpp")
    add_deps("cformers")

The quantization program could then be run with: xmake run quantize_bloom arg1 arg2. You do not need to invoke it from the executable's location.

It has another advantage: it can use packages (700+ on xrepo). I noticed that llama.cpp supports OpenBLAS, so with xmake it could look like this:

add_requires("openblas")

target("ggml")
	set_kind("static")
	add_packages("openblas")

I know it can be difficult to start with a new tool, but I feel it's easier to get started with than CMake, and it's really a pleasure to work with. I have a 170-line setup script that downloads some models, converts them to their C++ version, installs Python, ...

What do you think?

mallorbc (Contributor) commented Apr 5, 2023

This project seems to use Python bindings to avoid loading the model into memory each time. Taking inspiration from the work there may be a good idea:
https://github.com/nomic-ai/pyllamacpp
