Updates to docs (28/08)
tomatillos authored Aug 28, 2024
2 parents a949381 + ee31399 commit 086b292
Showing 17 changed files with 7,933 additions and 89 deletions.
34 changes: 30 additions & 4 deletions benchmarking/benchmarks.mdx
@@ -25,17 +25,43 @@ displayed on the [dashboard]() (see [next section]()). However, the evaluations will
still be displayed on the right-hand table.

If you would like to compare your custom endpoints in terms of speed and cost on the
dashboard, then you simply need to publish speed and cost values to the `X` endpoint,
dashboard, then you simply need to publish speed and cost values to the `benchmark` endpoint,
as follows:

CODE
```shell
curl -X POST 'https://api.unify.ai/v0/benchmark' \
    --header "Authorization: Bearer $UNIFY_KEY" \
    --header 'Content-Type: application/json' \
    --data '{
        "endpoint_name": "llama_3_8b_local_ollama",
        "metric_name": "time-to-first-token",
        "value": 132
    }'
```

or via Python:

```python
client.benchmark.upload(
    endpoint_name="llama_3_8b_local_ollama",
    metric_name="time-to-first-token",
    value=132
)
```

The timestamp of the submission is automatically detected, and the data can be streamed
to this endpoint on a recurring basis if so desired, similar to how we do it for the
public endpoints. If the time of submission does not align with the time of measurement,
then the timestamp can be provided explicitly via the `x` argument, as follows:
then the timestamp can be provided explicitly via the `measured_at` argument, as follows:

CODE
```python
client.benchmark.upload(
    endpoint_name="llama_3_8b_local_ollama",
    metric_name="time-to-first-token",
    value=132,
    measured_at="2024-08-12T04:20:32.808410"
)
```
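
For instance, a recurring submission can simply wrap the `upload` call above in a loop. The sketch below is illustrative only: replace the hard-coded reading with however you actually measure time-to-first-token, and the hourly interval is arbitrary:

```python
import time

while True:
    # Stand-in value: replace with a real time-to-first-token measurement (in ms)
    # taken against your local endpoint.
    ttft = 132

    client.benchmark.upload(
        endpoint_name="llama_3_8b_local_ollama",
        metric_name="time-to-first-token",
        value=ttft,
    )

    time.sleep(3600)  # submit a fresh measurement every hour
```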

In the next section, we explain how to use the [dashboard]() to view the benchmarking
results in a clear and intuitive way!
78 changes: 44 additions & 34 deletions benchmarking/datasets.mdx
Expand Up @@ -52,10 +52,12 @@ This data is especially important, as it represents the *true distribution* obse
before deployment.

It's easy to extract any prompt queries previously made to the API,
via the [X]() endpoint, as explained [here]().
For example, the last 100 prompts for subject `Y` can be extracted as follows:
via the [`prompt_history`](benchmarks/get_prompt_history) endpoint, as explained [here]().
For example, the last 100 prompts with the tag `physics` can be extracted as follows:

CODE
```python
physics_prompts = client.prompt_history(tag="physics", limit=100)
```

We can then add this to the local `.jsonl` file as follows:
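
A minimal sketch of that step (assuming `physics_prompts` comes back as a list of JSON-serialisable prompt dicts):

```python
import json

# Append each extracted prompt to the local dataset file, one JSON object per line.
with open("english_language.jsonl", "a") as f:
    for prompt in physics_prompts:
        f.write(json.dumps(prompt) + "\n")
```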

Expand All @@ -64,57 +66,65 @@ CODE
## Uploading Datasets

As shown above, the representation for prompt datasets is `.jsonl`,
which is effectively a list of json structures (or in Python, a list of dicts).
which is a file format where each line is a JSON object (so the whole file maps to a list of dicts in Python).
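
For instance, a two-line prompt dataset might look like this (the `prompt` key is simply an assumed shape for illustration):

```
{"prompt": "Explain the difference between a metaphor and a simile."}
{"prompt": "Summarise the main themes of Hamlet in two sentences."}
```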

Let's upload our `english_language.jsonl` dataset.

We can do this via the REST API as follows:

```
import requests
url = "https://api.unify.ai/v0/dataset"
headers = {"Authorization": "Bearer $UNIFY_API_KEY",}
data = {"name": "english_language"}
files = {"file": open('/path/to/english_language.jsonl' ,'rb')}
response = requests.post(url, data=data, files=files, headers=headers)
```shell
curl --request POST \
    --url 'https://api.unify.ai/v0/dataset' \
    --header 'Authorization: Bearer <UNIFY_KEY>' \
    --header 'Content-Type: multipart/form-data' \
    --form 'file=@english_language.jsonl' \
    --form 'name=english_language'
```

Or we can upload the dataset via the Python client,
calling `client.dataset.upload` as follows:

CODE
{/* ToDo: the Dataset instance isn't implemented */}

```python
client.dataset.upload(
    path="english_language.jsonl",
    name="english_language"
)
```

## Deleting Datasets

We can delete the dataset just as easily as we created it.

First, using the REST API:

```
import requests
url = "https://api.unify.ai/v0/dataset"
headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
data = {"name": "english_language"}
response = requests.delete(url, params=data, headers=headers)
```shell
curl --request DELETE \
    --url 'https://api.unify.ai/v0/dataset?name=english_language' \
    --header 'Authorization: Bearer <UNIFY_KEY>'
```

Or via Python:

CODE
```python
client.datasets.delete(name="english_language")
```

## Listing Datasets

We can retrieve a list of our uploaded datasets using the `/dataset/list` endpoint.

```
import requests
url = "https://api.unify.ai/v0/dataset/list"
headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
response = requests.get(url, headers=headers)
print(response.text)
```shell
curl --request GET \
    --url 'https://api.unify.ai/v0/dataset/list' \
    --header 'Authorization: Bearer <UNIFY_KEY>'
```


```python
datasets = client.datasets.list()
print(datasets)
```


## Renaming Datasets
Expand All @@ -126,23 +136,23 @@ and `english language`.
We can easily rename the dataset without deleting and re-uploading,
via the following REST API command:

```
import requests
url = "https://api.unify.ai/v0/dataset/rename"
headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
data = {"name": "english", "new_name": "english_literature"}
response = requests.post(url, params=data, headers=headers)
```shell
curl --request POST \
    --url 'https://api.unify.ai/v0/dataset/rename?name=english&new_name=english_literature' \
    --header 'Authorization: Bearer <UNIFY_KEY>'
```

Or via Python:

CODE
```python
client.datasets.rename(name="english", new_name="english_literature")
```

## Appending to Datasets

As explained above, we might want to add to an existing dataset, either because we have
[generated some synthetic examples](), or perhaps because we have some relevant
[production traffic]().
[production traffic](datasets#production-data).

In the examples above, we simply appended to these datasets locally,
before then uploading the full `.jsonl` file. However,
63 changes: 42 additions & 21 deletions benchmarking/evaluations.mdx
@@ -11,40 +11,51 @@ To trigger an LLM evaluation using a pre-configured [LLM evaluator](), you simply need
to specify the LLM endpoint, the dataset, and the pre-configured evaluator you would like to
use, as follows:

```
url = "https://api.unify.ai/v0/evals/trigger"
headers = {"Authorization": f"Bearer {UNIFY_API_KEY}"}
params = {
"dataset": "computer_science_homework_1",
"endpoint": "llama-3-70b-chat@aws-bedrock",
"eval_name": "computer_science_judge",
}
response = requests.post(url, params=params, headers=headers)
```python
client.evaluation(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    endpoint="llama-3-70b-chat@aws-bedrock",
)
```

You will receive an email once the evaluation is finished.
We will explain how to visualize the results of your evaluations in the next section.

## Checking Evaluations

You can check the status of an evaluation using the endpoint X, as follows:
You can check the status of an evaluation using the `evaluation/status` endpoint, as follows:

{/* TODO (in api): If the evaluation is still running, the status code returned will be this. */}
```python
status = client.evaluation_status(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    endpoint="llama-3-70b-chat@aws-bedrock",
)
print(status)
```

You can get the aggregated scores across the dataset as follows:

```
url = "https://api.unify.ai/v0/evals/get_scores"
headers = {"Authorization": f"Bearer {UNIFY_API_KEY}"}
params = {
"dataset": "computer_science_homework_1",
"eval_name": "computer_science_judge",
}
response = requests.get(url, params=params, headers=headers)
```python
scores = client.evaluation_scores(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
)
print(scores)
```

You can also get more granular, per-prompt scores by passing `per_prompt=True`:

```python
per_prompt_scores = client.evaluation_scores(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    per_prompt=True,
)
print(per_prompt_scores)
```
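
If, for example, the per-prompt scores come back as a list of `{"prompt", "score"}` records (an assumption about the return shape), you could surface the weakest prompts like so:

```python
# Sort by score and inspect the five lowest-scoring prompts.
worst = sorted(per_prompt_scores, key=lambda r: r["score"])[:5]
for record in worst:
    print(record["score"], record["prompt"])
```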

{/* ToDo in API */}
If the dataset has been updated since the evaluation was run, then the status `this`
will be shown when making the query (see [Partial Evaluations]() below).
Expand Down Expand Up @@ -78,16 +89,26 @@ in the dataset, the results will be uploaded via the X endpoint, using the Y arg

### Client side scores

If you want to submit evaluations that you obtained locally, you can via the `/evals/trigger` endpoint, by passing
If you want to submit evaluations that you obtained locally, you can do so via the `/evaluator` endpoint, by passing
`client_side_scores` as the file.

The file should be in JSONL format, with entries having `prompt` and `score` keys:

```
{"prompt": "Write Hello World in C", "score": 1.0}
{"prompt": "Write a travelling salesman algorithm in Rust", "score": 0.2}
```
The prompts must be the same prompts as the ones from the `dataset`.

The evaluator must be created with `client_side=True`.
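
Putting the pieces together, a rough end-to-end sketch might look like the following; the local scoring function, the file names and the exact use of the `client_side` flag are all assumptions, and the upload itself is then done with `client.evaluation` as shown below:

```python
import json

# 1) Create an evaluator that expects client-side scores.
client.evaluator(name="computer_science_judge", client_side=True)

# 2) Score each prompt locally and write one {"prompt", "score"} object per line.
def my_local_metric(prompt: str) -> float:
    """Placeholder for whatever local scoring logic you use."""
    return 1.0

with open("computer_science_challenges.jsonl") as f_in, open("scores.jsonl", "w") as f_out:
    for line in f_in:
        prompt = json.loads(line)["prompt"]
        score = my_local_metric(prompt)
        f_out.write(json.dumps({"prompt": prompt, "score": score}) + "\n")
```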

```python
client.evaluation(
    evaluator="computer_science_judge",
    dataset="computer_science_challenges",
    endpoint="llama-3-70b-chat@aws-bedrock",
    client_side_scores="/path/to/scores.jsonl"
)
```

## Partial Evaluations

57 changes: 27 additions & 30 deletions benchmarking/evaluators.mdx
Expand Up @@ -17,13 +17,10 @@ of datasets.

## LLM as a Judge

Evaluators are configured using the `/create_eval` endpoint, as follows:
Evaluators are configured using the `evaluator` endpoint, as follows:

```
url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {KEY}"}
params = {"eval_name": "my_first_eval"}
response = requests.post(url, json=params, headers=headers)
```python
client.evaluator(name="my_first_eval")
```

As per our [example](), let's assume we first want to choose an evaluator for
Expand All @@ -38,11 +35,11 @@ good choice for our English Literature, where creativity is important.

The judges can be configured via the `judge_models` parameter as follows:

```
url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {KEY}"}
params = {"eval_name": "computer_science_demo", "judge_models": "claude-3.5-sonnet@aws-bedrock"}
response = requests.post(url, json=params, headers=headers)
```python
client.evaluator(
    name="coding_demo",
    judge_models=["claude-3.5-sonnet@aws-bedrock"]
)
```

## LLM Jury
Expand All @@ -55,19 +52,16 @@ and A, B and C for English Literature, again as per the [Scale AI X Leaderboard]

The juries can be configured as follows:

```
url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {KEY}"}
params = {
"eval_name": "computer_science_jury",
"judge_models": ["claude-3.5-sonnet@aws-bedrock", "gpt-4o@openai"],
}
response = requests.post(url, json=params, headers=headers)
```python
client.evaluator(
    name="computer_science_jury",
    judge_models=["claude-3.5-sonnet@aws-bedrock", "gpt-4o@openai"]
)
```

## Custom System Prompt

The default system prompt is as follows:
The default judge system prompt is as follows:

```
Please act as an impartial judge and evaluate the quality of the response provided by an assistant to the user question displayed below.
Expand All @@ -85,7 +79,7 @@ nor is it optimized for English literature.
We can create unique system prompts for these two subjects as follows,
based on some simple best practices for these domain areas:

```
```python
computer_science_system_prompt = """
Please evaluate the quality of the student's code provided in response to the examination question below.
Your job is to evaluate how good the student's answer is.
@@ -98,14 +92,11 @@ Are there any edge cases that the code would break for? Is the code laid out neatly?
Be as objective as possible.
"""

url = "https://api.unify.ai/v0/evals/create"
headers = {"Authorization": f"Bearer {$UNIFY_API_KEY}"}
params = {
"eval_name": "computer_science_judge",
"judge_models": "claude-3.5-sonnet@aws-bedrock",
"system_prompt": computer_science_system_prompt,
}
response = requests.post(url, json=params, headers=headers)
client.evaluator(
    name="computer_science_judge",
    system_prompt=computer_science_system_prompt,
    judge_models="claude-3.5-sonnet@aws-bedrock",
)
```

{/* TODO: English Literature system prompt. */}
Expand All @@ -117,13 +108,19 @@ If you want to be really prescriptive about the criteria that responses are mark

For example:

```
```python
class_config = [
{"label": "Excellent", "score": 1.0, "description": "Correct code which is easy to read"},
{"label": "Good", "score": 0.75, "description": "Correct code but structured badly"},
{"label": "Good", "score": 0.5, "description": "Correct code but not using the most efficient method"},
{"label": "Bad", "score": 0.0, "description": "Incorrect code that does not solve the problem"}
]

client.evaluator(
    name="comp_sci_custom_class",
    judge_models="claude-3.5-sonnet@aws-bedrock",
    class_config=class_config
)
```

