
Ollama: adding support for num_ctx parameter based on max_input_tokens #806

Closed
peat wants to merge 1 commit

Conversation

@peat peat commented Aug 27, 2024

It does what it says on the tin.

Ollama uses the num_ctx option to specify the maximum context length for a request. If it is absent, Ollama (v0.3.6) defaults to 2048 tokens, regardless of what an aichat user specifies as max_input_tokens in their local configuration.
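
For reference, the relevant part of an aichat configuration looks roughly like this (a minimal sketch; the model name, token count, and api_base are illustrative, and other client fields are omitted):

    clients:
      - type: ollama
        api_base: http://localhost:11434   # assumed local Ollama endpoint
        models:
          - name: llama3.1
            max_input_tokens: 131072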


peat commented Aug 27, 2024

In this case, both require_max_tokens and max_output_tokens need to be set in the config file. Would you like me to update the config generator as well? It currently skips require_max_tokens, so this is disabled out of the box.

Separate (but related): the num_predict option is determined by the same value, although it has a different meaning. num_predict tells Ollama the limit on how many tokens it can generate.

I recommend that if max_tokens_param returns Some, we should:

  • Set num_ctx to the value
  • Set num_predict to -1, which tells Ollama that it can use the remaining token space for its output.

Alternatively, we could use max_output_tokens as the num_predict value, since I think that's more closely aligned with the intended use (both options are sketched below).
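
A rough sketch of both options, following the pattern the ollama client already uses for num_predict (the max_input_tokens() accessor is an assumption here, not the actual implementation):

    // Sketch only: assumes a max_input_tokens() accessor analogous to max_tokens_param().
    if let Some(ctx) = model.max_input_tokens() {
        // Pass the configured context window so Ollama doesn't fall back to its 2048-token default.
        body["options"]["num_ctx"] = ctx.into();
        // -1 tells Ollama to use the remaining context space for output;
        // alternatively, max_output_tokens could be used here instead.
        body["options"]["num_predict"] = (-1).into();
    }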

Let me know your thoughts and I'll do the bit fiddling. Thanks!


sigoden commented Aug 27, 2024

My apologies. max_tokens_param is for max_output_tokens, not max_input_tokens.

The ollama client is already aware of max_tokens_param:

aichat/src/client/ollama.rs

Lines 236 to 238 in 11022f8

if let Some(v) = model.max_tokens_param() {
    body["options"]["num_predict"] = v.into();
}

The num_ctx parameter should derive from max_input_tokens; please revert your latest commit.


sigoden commented Aug 28, 2024

Usually, we don't need to pass max_input_tokens to the API server. Ollama is the only client that might use max_input_tokens, and we haven't yet decided whether to derive num_ctx directly from it.

The currently available solution is to use patch:

    patch:
      chat_completions:
        'llama3.1':
          body:
            options:
              num_ctx: 131072
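
With this in place, aichat merges options.num_ctx: 131072 into the chat request body it sends to Ollama for the llama3.1 model, which achieves the same end result as this PR, just configured per model by hand.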


peat commented Aug 28, 2024

OK, cool. I can use patch in the meantime while you're sorting out the details. There are quite a few other Ollama-specific parameters, and I'm happy to help if you decide to expand those capabilities. Cheers!


sigoden commented Aug 28, 2024

Per ollama/ollama#6504 (comment), the Ollama author says that num_ctx will be set automatically based on available VRAM and compute.

Also, we've decided to deprecate the ollama client in the future and instead use an openai-compatible client to access the ollama API.
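
For example, something along these lines (a sketch, not final configuration; it assumes aichat's openai-compatible client type pointed at Ollama's OpenAI-compatible endpoint):

    clients:
      - type: openai-compatible
        name: ollama
        api_base: http://localhost:11434/v1
        models:
          - name: llama3.1
            max_input_tokens: 131072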

So we have decided not to derive num_ctx directly from max_input_tokens; doing so would disrupt the current tacit understanding and introduce a breaking change in the future.

We suggest setting num_ctx using patch.

Thank you for your contribution.


peat commented Aug 28, 2024

Oh snap, that's great. Thank you!

@peat peat closed this Aug 28, 2024
@AllyourBaseBelongToUs

It would be nice if we could have a config file somewhere for server settings, without fiddling with environment variables.

Right now the server still overrides the context window to 2k unless the Modelfile specifies the right parameters (even if you ollama run MODEL and then /set the parameters).

It's especially painful when using a closed-source program that talks to the local Ollama server but has its own max-token settings, etc.
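
One workaround that sticks for every client of the local server is to bake the parameter into a derived model with a Modelfile (a sketch; the base model and value are illustrative):

    FROM llama3.1
    PARAMETER num_ctx 131072

and then build it with ollama create llama3.1-131k -f Modelfile, so anything that requests the derived model gets the larger context window without touching environment variables.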
