rewrote the arguments page after the litellm integration #3

Merged 1 commit on Sep 3, 2024.
`universal_api/arguments.mdx`: 156 changes (16 additions, 140 deletions)
---
title: 'Arguments'
---

### Introduction

With so many LLMs and providers constantly coming onto the scene, each of these is
increasingly striving to provide unique value to end users, and this means that there
are often diverging features offered behind the API.
Some models and providers support only basic chat completion, while others support function calling, tool use,
image processing, audio, structured output (such as JSON mode),
and many other increasingly complex modes of operation.

We *could* adopt a design for our universal API where we only support the lowest common
denominator across all of the APIs. However, this would necessarily leave out many of
the most exciting bleeding edge features, limiting the utility of
our API for more forward-thinking applications.

Similarly, we *could* try to create a universal interface to the full superset of features
across *all* providers, ensuring that the input-output behaviour is consistent regardless
of the backend provider selected. This would require a huge amount of ongoing
maintenance to keep pace with the fast-changing API specs, and the wrong choice of
abstraction for the unification effort could break compatibility across APIs.

### Supported Arguments

We have instead opted for a compromise with our API, where we support:

- [Platform Arguments](#platform-arguments): specific to the Unify platform
- [Unified Arguments](#unified-arguments): from the OpenAI Standard, unified across **all** endpoints
- [Partially Unified Arguments](#partially-unified-arguments): from the OpenAI Standard, unified across **some** endpoints
- [Passthrough Arguments](#passthrough-arguments): any extra model-specific or provider-specific arguments,
passed straight through to the backend HTTP request

To *simplify* the design, we have built our API on top of LiteLLM, so the unification logic for the passed
arguments is handled by LiteLLM. We recommend going through their chat completions
[docs](https://docs.litellm.ai/docs/completion) to see which arguments are supported.

## Platform Arguments

Alongside the arguments accepted by LiteLLM, the chat completions
[endpoint](/api-reference/querying_llms/get_completions)
accepts a few extra arguments that are solely related to the *Unify platform*:

- `signature`: specifies how the API was called (Unify Python client, NodeJS client, console, etc.)
- `use_custom_keys`: specifies whether to use your own custom keys or the unified keys with the provider.
- `tags`: marks a prompt with string metadata, which can be used for filtering later on.
- `drop_params`: drops any passed arguments that aren't supported by the selected provider, using [this](https://docs.litellm.ai/docs/completion/drop_params) LiteLLM feature.

We therefore refer to these as the *Platform Arguments*,
to distinguish them from those in the OpenAI Standard (see below).

There are some providers (e.g. Lepton AI) that aren't supported by LiteLLM but are supported under our
API. We've tried to maintain the same argument signature for those providers as well.
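
For illustration, here is a minimal sketch of a request that sets some of these Platform Arguments alongside the
standard arguments. The endpoint name and argument values are illustrative, and `tags` is assumed to accept a list
of strings:

```python
# A minimal sketch, assuming the Platform Arguments are passed in the request body
# alongside the standard chat completions arguments (values are illustrative).
import os
import requests

response = requests.post(
    "https://api.unify.ai/v0/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['UNIFY_KEY']}"},
    json={
        "model": "gpt-4o@openai",
        "messages": [{"role": "user", "content": "Hello!"}],
        # Platform Arguments, passed alongside the standard arguments:
        "tags": ["docs-example"],  # string metadata for later filtering
        "use_custom_keys": False,  # use the unified keys rather than your own provider keys
        "drop_params": True,       # drop arguments the selected provider doesn't support
    },
)
print(response.json())
```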

## Unified Arguments

The *unified* arguments of the chat completions
[endpoint](/api-reference/querying_llms/get_completions)
are as follows:

- `model` - The model@provider pair (the endpoint) to use in the backend.
- `messages` - A list of messages comprising the conversation so far.
- `temperature` - What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more
random, while lower values like 0.2 will make it more focused and deterministic.
- `stream` - If set, partial message deltas will be sent. Tokens will be sent as data-only server-sent events as they
become available, with the stream terminated by a `data: [DONE]` message (a streaming sketch follows below this list).
- `max_tokens` - The maximum number of tokens that can be generated in the chat completion. The total length of input
tokens and generated tokens is limited by the model's context length.
- `stop` - Up to 4 sequences where the API will stop generating further tokens.
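
As a rough sketch of how the `stream` flag behaves over raw HTTP, the following consumes the data-only server-sent
events described above and stops at the `data: [DONE]` message. The endpoint name is illustrative, and the chunk
layout is assumed to follow the OpenAI streaming format:

```python
# A sketch only: read streamed chat completion chunks as server-sent events,
# stopping at the terminating "data: [DONE]" message.
import json
import os
import requests

with requests.post(
    "https://api.unify.ai/v0/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['UNIFY_KEY']}"},
    json={
        "model": "claude-3-haiku@anthropic",
        "messages": [{"role": "user", "content": "Tell me a joke"}],
        "stream": True,
        "max_tokens": 256,
    },
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        payload = line.decode().removeprefix("data: ")
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)  # assumed OpenAI-style chunk with choices[0].delta
        print(chunk["choices"][0]["delta"].get("content", ""), end="")
```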

These are all taken directly from the
[OpenAI Standard](https://platform.openai.com/docs/api-reference/chat/create).
The only argument that deviates from the OpenAI Standard is `model`: for OpenAI this can only name an OpenAI model,
whereas our API supports all major models and providers in the format `model@provider`.

These arguments are all **fully supported by all models and providers in Unify**.
This means you can switch models and providers totally freely when making use of the *unified arguments*,
without changing the code in any way.

These *unified* arguments are also all mirrored in the
[generate](https://docs.unify.ai/python/clients#generate) function of the
[Unify](https://docs.unify.ai/python/clients#unify) client and
[AsyncUnify](https://docs.unify.ai/python/clients#asyncunify) client
in the Python SDK.
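
To make the point concrete, here is a minimal sketch using the Python SDK, reusing the same unified arguments across
several endpoints (the endpoint names are just examples of the `model@provider` format):

```python
# A minimal sketch: the same unified arguments work unchanged across endpoints.
import unify

prompt = "Summarise the plot of Hamlet in one sentence."

for endpoint in ("gpt-4o@openai", "claude-3-haiku@anthropic", "llama-3.1-8b-chat@fireworks-ai"):
    client = unify.Unify(endpoint)
    reply = client.generate(prompt, temperature=0.2, max_tokens=64)
    print(f"{endpoint}: {reply}")
```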

## Partially Unified Arguments

Most arguments in the [OpenAI Standard](https://platform.openai.com/docs/api-reference/chat/create) are only supported
by *some* models and providers, but *not all* of them. These arguments are referred to as *partially* unified, given
that they are unified to the OpenAI standard for the subset of models and providers which support these features (or
support features which are sufficiently similar to be unified into the standard).

These *partially unified* arguments are as follows:

- `frequency_penalty` - Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing
frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
- `logit_bias` - Modify the likelihood of specified tokens appearing in the completion.
Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from
-100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect
will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100
or 100 should result in a ban or exclusive selection of the relevant token.
- `logprobs` - Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities
of each output token returned in the `content` of `message`.
- `top_logprobs` - An integer between 0 and 20 specifying the number of most likely tokens to return at each token
position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used.
- `n` - How many chat completion choices to generate for each input message. Note that you will be charged based on the
number of generated tokens across all of the choices. Keep `n` as `1` to minimize costs.
- `presence_penalty` - Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in
the text so far, increasing the model's likelihood to talk about new topics.
- `response_format` - An object specifying the format that the model must output.
Setting to `{ "type": "json_schema", "json_schema": {...} }` enables Structured Outputs which ensures the model will
match your supplied JSON schema. Learn more in the Structured Outputs guide.
Setting to `{ "type": "json_object" }` enables JSON mode, which ensures the message the model generates is valid JSON.
- `seed` - This feature is in Beta. If specified, our system will make a best effort to sample deterministically, such
that repeated requests with the same `seed` and parameters should return the same result. Determinism is not guaranteed,
and you should refer to the `system_fingerprint` response parameter to monitor changes in the backend.
- `stream_options` - Options for streaming response. Only set this when you set `stream: true`.
- `top_p` - An alternative to sampling with temperature, called nucleus sampling, where the model considers the results
of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are
considered. Generally recommended to alter this *or* temperature, but not both.
- `tools` - A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a
list of functions the model may generate JSON inputs for. A max of 128 functions are supported.
- `tool_choice` - Controls which (if any) tool is called by the model. `none` means the model will not call any tool and
instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools.
`required` means the model must call one or more tools. Specifying a particular tool via
`{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool.
`none` is the default when no tools are present. `auto` is the default if tools are present.
- `parallel_tool_calls` - Whether to enable parallel function calling during tool use.

Most of these *partially unified* arguments are provider-specific, but others are model-specific.

You can see which models and providers support these partially unified arguments in this [live dashboard](),
which is determined directly based on the latest unit tests.
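
As an example of one of these partially unified arguments, here is a sketch of JSON mode via `response_format`,
assuming the chosen endpoint supports structured output (not all do):

```python
# A sketch only: JSON mode via the partially unified response_format argument.
# This will only work on endpoints that support structured output.
import unify

client = unify.Unify("gpt-4o@openai")
reply = client.generate(
    "List three primary colours as a JSON object under the key 'colours'.",
    response_format={"type": "json_object"},
)
print(reply)
```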

All of these arguments (those accepted by LiteLLM's API as well as the Platform Arguments), including the ones only
supported by some models and providers, are explicitly mirrored in the
[generate](https://docs.unify.ai/python/clients#generate) function of the
[Unify](https://docs.unify.ai/python/clients#unify) client and
[AsyncUnify](https://docs.unify.ai/python/clients#asyncunify) client
in the Python SDK.

If you believe one of these arguments *could* be supported by a certain model or provider, but is not currently
supported, then feel free to let us know [on discord](https://discord.com/invite/sXyFF8tDtm)
and we'll get it supported as soon as possible! ⚡

### Tool Use Example

OpenAI and Anthropic have different interfaces for tool use.
Since we adhere to the OpenAI standard, we accept tools as specified by OpenAI, and convert the format so that they
work with Anthropic models.

This is the default function calling example from OpenAI, working with an Anthropic model:
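
(A sketch through the Python SDK, using OpenAI's documented `get_current_weather` tool definition; the exact snippet
may differ.)

```python
# A sketch only: OpenAI's standard get_current_weather tool definition,
# sent unchanged to an Anthropic model via the Unify Python SDK.
import unify

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

client = unify.Unify("claude-3.5-sonnet@anthropic")
response = client.generate(
    "What's the weather like in Boston today?",
    tools=tools,
    tool_choice="auto",
)
print(response)
```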

These tools are passed as direct `**kwargs` of the [generate function](https://docs.unify.ai/python/clients#generate) in the Python SDK.

### Anthropic-Only Example

Anthropic exposes the `top_k` argument, which isn't provided by OpenAI.
If you include this argument, it will be sent straight to the model.
If you send this argument to a provider that does not support `top_k`, you will get an error.

```shell
curl --request POST \
  --url 'https://api.unify.ai/v0/chat/completions' \
  --header "Authorization: Bearer $UNIFY_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "claude-3.5-sonnet@anthropic",
    "messages": [
        {
            "content": "Tell me a joke",
            "role": "user"
        }
    ],
    "top_k": 5,
    "max_tokens": 1024
}'
```

This can also be done in the Unify Python SDK, as follows:

```python
import unify

client = unify.Unify("claude-3-haiku@anthropic")
client.generate("hello world!", top_k=5)  # top_k is passed straight through to Anthropic
```

The same is true for headers. Features supported by providers outside of the OpenAI standard are sometimes released
as beta features, which can be accessed via specific headers, as explained in
[this tweet](https://x.com/alexalbert__/status/1812921642143900036) from Anthropic.

These headers can be queried directly from the Unify API like so:
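
(A sketch over raw HTTP; the header name, value, and raised `max_tokens` limit are taken from Anthropic's announcement
and are assumptions that may have changed since.)

```python
# A sketch only: forward a provider-specific beta header through the Unify API.
import os
import requests

response = requests.post(
    "https://api.unify.ai/v0/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['UNIFY_KEY']}",
        # Provider beta header (assumed value from Anthropic's announcement):
        "anthropic-beta": "max-tokens-3-5-sonnet-2024-07-15",
    },
    json={
        "model": "claude-3.5-sonnet@anthropic",
        "messages": [{"role": "user", "content": "Write a long story."}],
        "max_tokens": 8192,  # higher output limit assumed to be enabled by the beta header
    },
)
print(response.json())
```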