Updates to docs (28/08)
tomatillos authored Aug 28, 2024
2 parents a949381 + ee31399 commit 086b292
Showing 17 changed files with 7,933 additions and 89 deletions.
34 changes: 30 additions & 4 deletions benchmarking/benchmarks.mdx
@@ -25,17 +25,43 @@ displayed on the [dashboard]() (see [next section]()). However, the evaluations will
still be displayed on the right-hand table.

If you would like to compare your custom endpoints in terms of speed and cost on the
dashboard, then you simply need to publish speed and cost values to the `X` endpoint,
dashboard, then you simply need to publish speed and cost values to the `benchmark` endpoint,
as follows:

CODE
```shell
curl -X POST 'https://api.unify.ai/v0/benchmark' \
    --header "Authorization: Bearer $UNIFY_KEY" \
    --header 'Content-Type: application/json' \
    --data '{
        "endpoint_name": "llama_3_8b_local_ollama",
        "metric_name": "time-to-first-token",
        "value": 132
    }'
```

or via Python:

```python
client.benchmark.upload(
    endpoint_name="llama_3_8b_local_ollama",
    metric_name="time-to-first-token",
    value=132
)
```

The timestamp of the submission is automatically detected, and the data can be streamed
to this endpoint on a recurring basis if so desired, similar to how we do it for the
public endpoints. If the time of submission does not align with the time of measurement,
then the timestamp can be provided explicitly via the `x` argument, as follows:
then the timestamp can be provided explicitly via the `measured_at` argument, as follows:

CODE
```python
client.benchmark.upload(
    endpoint_name="llama_3_8b_local_ollama",
    metric_name="time-to-first-token",
    value=132,
    measured_at="2024-08-12T04:20:32.808410"
)
```
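
For instance, a recurring submission can simply wrap the `upload` call above in a loop. The sketch below is illustrative only: replace the hard-coded reading with however you actually measure time-to-first-token, and the hourly interval is arbitrary:

```python
import time

while True:
    # Stand-in value: replace with a real time-to-first-token measurement (in ms)
    # taken against your local endpoint.
    ttft = 132

    client.benchmark.upload(
        endpoint_name="llama_3_8b_local_ollama",
        metric_name="time-to-first-token",
        value=ttft,
    )

    time.sleep(3600)  # submit a fresh measurement every hour
```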

In the next section, we explain how to use the [dashboard]() to view the benchmarking
results in a clear and intuitive way!
78 changes: 44 additions & 34 deletions benchmarking/datasets.mdx
Expand Up @@ -52,10 +52,12 @@ This data is especially important, as it represents the *true distribution* obse
before deployment.

It's easy to extract any prompt queries previously made to the API,
via the [X]() endpoint, as explained [here]().
For example, the last 100 prompts for subject `Y` can be extracted as follows:
via the [`prompt_history`](benchmarks/get_prompt_history) endpoint, as explained [here]().
For example, the last 100 prompts with the tag `physics` can be extracted as follows:

CODE
```python
physics_prompts = client.prompt_history(tag="physics", limit=100)
```

We can then add this to the local `.jsonl` file as follows:
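
A minimal sketch of that step (assuming `physics_prompts` comes back as a list of JSON-serialisable prompt dicts):

```python
import json

# Append each extracted prompt to the local dataset file, one JSON object per line.
with open("english_language.jsonl", "a") as f:
    for prompt in physics_prompts:
        f.write(json.dumps(prompt) + "\n")
```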

Expand All @@ -64,57 +66,65 @@ CODE
## Uploading Datasets

As shown above, the representation for prompt datasets is `.jsonl`,
which is effectively a list of json structures (or in Python, a list of dicts).
which is a file format where each line is a JSON object (so the whole file maps to a list of dicts in Python).
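
For instance, a two-line prompt dataset might look like this (the `prompt` key is simply an assumed shape for illustration):

```
{"prompt": "Explain the difference between a metaphor and a simile."}
{"prompt": "Summarise the main themes of Hamlet in two sentences."}
```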

Let's upload our `english_language.jsonl` dataset.

We can do this via the REST API as follows:

```
import requests
url = "https://api.unify.ai/v0/dataset"
headers = {"Authorization": "Bearer $UNIFY_API_KEY",}
data = {"name": "english_language"}
files = {"file": open('/path/to/english_language.jsonl' ,'rb')}
response = requests.post(url, data=data, files=files, headers=headers)
```shell
curl --request POST \
    --url 'https://api.unify.ai/v0/dataset' \
    --header 'Authorization: Bearer <UNIFY_KEY>' \
    --header 'Content-Type: multipart/form-data' \
    --form 'file=@english_language.jsonl' \
    --form 'name=english_language'
```

Or we can upload the dataset via the Python client,
calling `client.dataset.upload` as follows:

CODE
{/* ToDo: the Dataset instance isn't implemented */}

```python
client.dataset.upload(
    path="english_language.jsonl",
    name="english_language"
)
```

## Deleting Datasets

We can delete the dataset just as easily as we created it.

First, using the REST API:

```
import requests
url = "https://api.unify.ai/v0/dataset"
headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
data = {"name": "english_language"}
response = requests.delete(url, params=data, headers=headers)
```shell
curl --request DELETE \
    --url 'https://api.unify.ai/v0/dataset?name=english_language' \
    --header 'Authorization: Bearer <UNIFY_KEY>'
```

Or via Python:

CODE
```python
client.datasets.delete(name="english_language")
```

## Listing Datasets

We can retrieve a list of our uploaded datasets using the `/dataset/list` endpoint.

```
import requests
url = "https://api.unify.ai/v0/dataset/list"
headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
response = requests.get(url, headers=headers)
print(response.text)
```shell
curl --request GET \
    --url 'https://api.unify.ai/v0/dataset/list' \
    --header 'Authorization: Bearer <UNIFY_KEY>'
```


```python
datasets = client.datasets.list()
print(datasets)
```


## Renaming Datasets
Expand All @@ -126,23 +136,23 @@ and `english language`.
We can easily rename the dataset without deleting and re-uploading,
via the following REST API command:

```
import requests
url = "https://api.unify.ai/v0/dataset/rename"
headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
data = {"name": "english", "new_name": "english_literature"}
response = requests.post(url, params=data, headers=headers)
```shell
curl --request POST \
    --url 'https://api.unify.ai/v0/dataset/rename?name=english&new_name=english_literature' \
    --header 'Authorization: Bearer <UNIFY_KEY>'
```

Or via Python:

CODE
```python
client.datasets.rename(name="english", new_name="english_literature")
```

## Appending to Datasets

As explained above, we might want to add to an existing dataset, either because we have
[generated some synthetic examples](), or perhaps because we have some relevant
[production traffic]().
[production traffic](datasets#production-data).

In the examples above, we simply appended to these datasets locally,
before then uploading the full `.jsonl` file. However,
63 changes: 42 additions & 21 deletions benchmarking/evaluations.mdx
@@ -11,40 +11,51 @@ To trigger an LLM evaluation using a pre-configured [LLM evaluator](), you simply need
to specify the LLM endpoint, the dataset, and the pre-configured evaluator you would like to
use, as follows:

```
url = "https://api.unify.ai/v0/evals/trigger"
headers = {"Authorization": f"Bearer {UNIFY_API_KEY}"}
params = {
"dataset": "computer_science_homework_1",
"endpoint": "llama-3-70b-chat@aws-bedrock",
"eval_name": "computer_science_judge",
}
response = requests.post(url, params=params, headers=headers)
```python
client.evaluation(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    endpoint="llama-3-70b-chat@aws-bedrock",
)
```

You will receive an email once the evaluation is finished.
We will explain how to visualize the results of your evaluations in the next section.

## Checking Evaluations

You can check the status of an evaluation using the endpoint X, as follows:
You can check the status of an evaluation using the `evaluation/status` endpoint, as follows:

{/* TODO (in api): If the evaluation is still running, the status code returned will be this. */}
```python
status = client.evaluation_status(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    endpoint="llama-3-70b-chat@aws-bedrock",
)
print(status)
```

You can get the aggregated scores across the dataset as follows:

```
url = "https://api.unify.ai/v0/evals/get_scores"
headers = {"Authorization": f"Bearer {UNIFY_API_KEY}"}
params = {
"dataset": "computer_science_homework_1",
"eval_name": "computer_science_judge",
}
response = requests.get(url, params=params, headers=headers)
```python
scores = client.evaluation_scores(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
)
print(scores)
```

You can also get more granular, per-prompt scores by passing `per_prompt=True`:

```python
per_prompt_scores = client.evaluation_scores(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    per_prompt=True,
)
print(per_prompt_scores)
```
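
If, for example, the per-prompt scores come back as a list of `{"prompt", "score"}` records (an assumption about the return shape), you could surface the weakest prompts like so:

```python
# Sort by score and inspect the five lowest-scoring prompts.
worst = sorted(per_prompt_scores, key=lambda r: r["score"])[:5]
for record in worst:
    print(record["score"], record["prompt"])
```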

{/* ToDo in API */}
If the dataset has been updated since the evaluation was run, then the status `this`
will be shown when making the query (see [Partial Evaluations]() below).
Expand Down Expand Up @@ -78,16 +89,26 @@ in the dataset, the results will be uploaded via the X endpoint, using the Y arg

### Client side scores

If you want to submit evaluations that you obtained locally, you can via the `/evals/trigger` endpoint, by passing
If you want to submit evaluations that you obtained locally, you can do so via the `/evaluator` endpoint, by passing
`client_side_scores` as the file.

The file should be in JSONL format, with entries having `prompt` and `score` keys:

```
{"prompt": "Write Hello World in C", "score": 1.0}
{"prompt": "Write a travelling salesman algorithm in Rust", "score": 0.2}
```
The prompts must be the same prompts as the ones from the `dataset`.

The evaluator must be created with `client_side=True`.
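
Putting the pieces together, a rough end-to-end sketch might look like the following; the local scoring function, the file names and the exact use of the `client_side` flag are all assumptions, and the upload itself is then done with `client.evaluation` as shown below:

```python
import json

# 1) Create an evaluator that expects client-side scores.
client.evaluator(name="computer_science_judge", client_side=True)

# 2) Score each prompt locally and write one {"prompt", "score"} object per line.
def my_local_metric(prompt: str) -> float:
    """Placeholder for whatever local scoring logic you use."""
    return 1.0

with open("computer_science_challenges.jsonl") as f_in, open("scores.jsonl", "w") as f_out:
    for line in f_in:
        prompt = json.loads(line)["prompt"]
        score = my_local_metric(prompt)
        f_out.write(json.dumps({"prompt": prompt, "score": score}) + "\n")
```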

```python
client.evaluation(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    endpoint="llama-3-70b-chat@aws-bedrock",
    client_side_scores="/path/to/scores.jsonl"
)
```

## Partial Evaluations

57 changes: 27 additions & 30 deletions benchmarking/evaluators.mdx
Expand Up @@ -17,13 +17,10 @@ of datasets.

## LLM as a Judge

Evaluators are configured using the `/create_eval` endpoint, as follows:
Evaluators are configured using the `evaluator` endpoint, as follows:

```
url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {KEY}"}
params = {"eval_name": "my_first_eval"}
response = requests.post(url, json=params, headers=headers)
```python
client.evaluator(name="my_first_eval")
```

As per our [example](), let's assume we first want to choose an evaluator for
Expand All @@ -38,11 +35,11 @@ good choice for our English Literature, where creativity is important.

The judges can be configured via the `judge_models` parameter as follows:

```
url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {KEY}"}
params = {"eval_name": "computer_science_demo", "judge_models": "claude-3.5-sonnet@aws-bedrock"}
response = requests.post(url, json=params, headers=headers)
```python
client.evaluator(
    name="coding_demo",
    judge_models=["claude-3.5-sonnet@aws-bedrock"]
)
```

## LLM Jury
Expand All @@ -55,19 +52,16 @@ and A, B and C for English Literature, again as per the [Scale AI X Leaderboard]

The juries can be configured as follows:

```
url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {KEY}"}
params = {
"eval_name": "computer_science_jury",
"judge_models": ["claude-3.5-sonnet@aws-bedrock", "gpt-4o@openai"],
}
response = requests.post(url, json=params, headers=headers)
```python
client.evaluator(
    name="computer_science_jury",
    judge_models=["claude-3.5-sonnet@aws-bedrock", "gpt-4o@openai"]
)
```

## Custom System Prompt

The default system prompt is as follows:
The default judge system prompt is as follows:

```
Please act as an impartial judge and evaluate the quality of the response provided by an assistant to the user question displayed below.
Expand All @@ -85,7 +79,7 @@ nor is it optimized for English literature.
We can create unique system prompts for these two subjects as follows,
based on some simple best practices for these domain areas:

```
```python
computer_science_system_prompt = """
Please evaluate the quality of the student's code provided in response to the examination question below.
Your job is to evaluate how good the student's answer is.
@@ -98,14 +92,11 @@ Are there any edge cases that the code would break for? Is the code laid out neatly?
Be as objective as possible.
"""

url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {$UNIFY_API_KEY}"}
params = {
"eval_name": "computer_science_judge",
"judge_models": "claude-3.5-sonnet@aws-bedrock",
"system_prompt": computer_science_system_prompt,
}
response = requests.post(url, json=params, headers=headers)
client.evaluator(
    name="computer_science_judge",
    system_prompt=computer_science_system_prompt,
    judge_models="claude-3.5-sonnet@aws-bedrock",
)
```

{/* TODO: English Literature system prompt. */}
Expand All @@ -117,13 +108,19 @@ If you want to be really prescriptive about the criteria that responses are mark

For example:

```
```python
class_config = [
{"label": "Excellent", "score": 1.0, "description": "Correct code which is easy to read"},
{"label": "Good", "score": 0.75, "description": "Correct code but structured badly"},
{"label": "Good", "score": 0.5, "description": "Correct code but not using the most efficient method"},
{"label": "Bad", "score": 0.0, "description": "Incorrect code that does not solve the problem"}
]

client.evaluator(
    name="comp_sci_custom_class",
    judge_models="claude-3.5-sonnet@aws-bedrock",
    class_config=class_config
)
```

