Test your prompts, models, RAGs. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality. LLM evals for OpenAI/Azure GPT, Anthropic Claude, VertexAI Gemini, Ollama, Local & private models like Mistral/Mixtral/Llama with CI/CD

mileeyu/promptfoo

promptfoo: test your LLM app

npm GitHub Workflow Status MIT license Discord

promptfoo is a tool for testing and evaluating LLM output quality.

With promptfoo, you can:

  • Systematically test prompts, models, and RAGs with predefined test cases
  • Evaluate quality and catch regressions by comparing LLM outputs side-by-side
  • Speed up evaluations with caching and concurrency
  • Score outputs automatically by defining test cases
  • Use as a CLI, library, or in CI/CD
  • Use OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, or integrate custom API providers for any LLM API

The goal: test-driven LLM development instead of trial-and-error.

promptfoo produces matrix views that let you quickly evaluate outputs across many prompts.

Here's an example of a side-by-side comparison of multiple prompts and inputs:

[Image: prompt evaluation matrix - web viewer]

It works on the command line too:

[Image: prompt evaluation on the command line]

Why choose promptfoo?

There are many different ways to evaluate prompts. Here are some reasons to consider promptfoo:

  • Battle-tested: promptfoo was built to eval & improve LLM apps serving over 10 million users in production. The tooling is flexible and can be adapted to many setups.
  • Simple, declarative test cases: Define your evals without writing code or working with heavy notebooks.
  • Language agnostic: Use JavaScript, Python, or whatever else you're working in.
  • Share & collaborate: Built-in share functionality & web viewer for working with teammates.
  • Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
  • Private: This software runs completely locally. Your evals run on your machine and talk directly with the LLM.

Workflow

Start by establishing a handful of test cases - core use cases and failure cases that you want to ensure your prompt can handle.

As you explore modifications to the prompt, use promptfoo eval to rate all outputs against your test cases. This confirms that a change improves the prompt overall, not just on a single input.

As you collect more examples and establish a user feedback loop, continue to build the pool of test cases.

[Image: LLM ops]

Usage

To get started, run this command:

npx promptfoo@latest init

This will create some placeholders in your current directory: prompts.txt and promptfooconfig.yaml.

After editing the prompts and variables to your liking, run the eval command to kick off an evaluation:

npx promptfoo@latest eval

Configuration

The YAML configuration runs each prompt through a series of example inputs (aka "test cases") and checks whether the outputs meet requirements (aka "asserts").

See the Configuration docs for a detailed guide.

prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, ollama:llama2:70b]
tests:
  - description: 'Test translation to French'
    vars:
      language: French
      input: Hello world
    assert:
      - type: contains-json
      - type: javascript
        value: output.length < 100

  - description: 'Test translation to German'
    vars:
      language: German
      input: How's it going?
    assert:
      - type: model-graded-closedqa
        value: does not describe self as an AI, model, or chatbot
      - type: similar
        value: was geht
        threshold: 0.6 # cosine similarity
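
To make the shape of an evaluation concrete, here is a minimal sketch of how a config like the one above could expand into a matrix of prompt, provider, and test case combinations. The names renderPrompt and buildMatrix are hypothetical, and the regex-based {{var}} substitution is a simplification for illustration, not promptfoo's actual templating or internals.

```typescript
// Illustrative sketch only: renderPrompt and buildMatrix are hypothetical helpers,
// not part of promptfoo. Substitution here is a plain regex, which is an assumption.
type Vars = Record<string, string>;

function renderPrompt(template: string, vars: Vars): string {
  // Replace each {{key}} with its value; leave unknown keys untouched.
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    return value !== undefined ? value : match;
  });
}

interface MatrixCell {
  promptTemplate: string;
  provider: string;
  rendered: string;
}

function buildMatrix(prompts: string[], providers: string[], tests: Vars[]): MatrixCell[] {
  const cells: MatrixCell[] = [];
  // One cell per (prompt, provider, test case) combination.
  for (const promptTemplate of prompts) {
    for (const provider of providers) {
      for (const vars of tests) {
        cells.push({ promptTemplate, provider, rendered: renderPrompt(promptTemplate, vars) });
      }
    }
  }
  return cells;
}

const matrixCells = buildMatrix(
  ['Translate to {{language}}: {{input}}'],
  ['openai:gpt-3.5-turbo', 'ollama:llama2:70b'],
  [
    { language: 'French', input: 'Hello world' },
    { language: 'German', input: "How's it going?" },
  ],
);
console.log(matrixCells.length); // 1 prompt x 2 providers x 2 tests
```

Each cell would then be sent to its provider and the output checked against that test case's assertions.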

Supported assertion types

See Test assertions for full details.

Deterministic eval metrics

| Assertion Type | Returns true if... |
| --- | --- |
| equals | output matches exactly |
| contains | output contains substring |
| icontains | output contains substring, case insensitive |
| regex | output matches regex |
| starts-with | output starts with string |
| contains-any | output contains any of the listed substrings |
| contains-all | output contains all of the listed substrings |
| icontains-any | output contains any of the listed substrings, case insensitive |
| icontains-all | output contains all of the listed substrings, case insensitive |
| is-json | output is valid JSON (optional JSON schema validation) |
| contains-json | output contains valid JSON (optional JSON schema validation) |
| javascript | provided JavaScript function validates the output |
| python | provided Python function validates the output |
| webhook | provided webhook returns {pass: true} |
| rouge-n | Rouge-N score is above a given threshold |
| levenshtein | Levenshtein distance is below a threshold |
| latency | Latency is below a threshold (milliseconds) |
| perplexity | Perplexity is below a threshold |
| cost | Cost is below a threshold (for models with cost info such as GPT) |
| is-valid-openai-function-call | Ensure that the function call matches the function's JSON schema |
| is-valid-openai-tools-call | Ensure that all tool calls match the tools JSON schema |
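
As a rough illustration of what a few of these deterministic checks compute, the sketch below implements contains-any, icontains-all, and a Levenshtein distance check. The function names are hypothetical, and treating the distance threshold as inclusive is an assumption; promptfoo's own implementation may differ.

```typescript
// Illustrative sketch only: these helpers are hypothetical, not promptfoo's code.
function containsAny(output: string, needles: string[]): boolean {
  return needles.some((needle) => output.indexOf(needle) !== -1);
}

function icontainsAll(output: string, needles: string[]): boolean {
  const lower = output.toLowerCase();
  return needles.every((needle) => lower.indexOf(needle.toLowerCase()) !== -1);
}

function levenshtein(a: string, b: string): number {
  // Classic dynamic-programming edit distance, keeping one row at a time.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumption: the check passes when the distance is at or below the threshold.
function withinDistance(output: string, expected: string, threshold: number): boolean {
  return levenshtein(output, expected) <= threshold;
}

console.log(levenshtein('kitten', 'sitting')); // 3
```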

Model-assisted eval metrics

| Assertion Type | Method |
| --- | --- |
| similar | Embeddings and cosine similarity are above a threshold |
| classifier | Run LLM output through a classifier |
| llm-rubric | LLM output matches a given rubric, using a language model to grade output |
| answer-relevance | Ensure that LLM output is related to original query |
| context-faithfulness | Ensure that LLM output uses the context |
| context-recall | Ensure that ground truth appears in context |
| context-relevance | Ensure that context is relevant to original query |
| factuality | LLM output adheres to the given facts, using Factuality method from OpenAI eval |
| model-graded-closedqa | LLM output adheres to given criteria, using Closed QA method from OpenAI eval |
| select-best | Compare multiple outputs for a test case and pick the best one |
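
For the similar assertion, the comparison step can be sketched as a cosine similarity between two embedding vectors checked against a threshold. This sketch assumes the embeddings have already been obtained from some embedding provider; the function names are hypothetical and this is not promptfoo's actual code.

```typescript
// Illustrative sketch only: cosineSimilarity and passesSimilar are hypothetical helpers.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('Embeddings must be non-empty and have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumption: the assertion passes when similarity is at or above the threshold.
function passesSimilar(outputEmbedding: number[], expectedEmbedding: number[], threshold: number): boolean {
  return cosineSimilarity(outputEmbedding, expectedEmbedding) >= threshold;
}

console.log(passesSimilar([1, 1], [1, 0], 0.6)); // similarity is about 0.707
```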

Every test type can be negated by prepending not-. For example, not-equals or not-regex.
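
One way to picture the not- prefix is as a wrapper that strips the prefix, runs the underlying check, and inverts the result. The sketch below assumes that behavior and covers only three assertion types; baseCheck and runAssertion are hypothetical names, not promptfoo's API.

```typescript
// Illustrative sketch only: baseCheck and runAssertion are hypothetical helpers.
function baseCheck(baseType: string, output: string, value: string): boolean {
  switch (baseType) {
    case 'equals':
      return output === value;
    case 'contains':
      return output.indexOf(value) !== -1;
    case 'regex':
      return new RegExp(value).test(output);
    default:
      throw new Error('Unknown assertion type: ' + baseType);
  }
}

function runAssertion(assertType: string, output: string, value: string): boolean {
  // Strip a leading "not-" and invert the result of the underlying check.
  const negated = assertType.slice(0, 4) === 'not-';
  const baseType = negated ? assertType.slice(4) : assertType;
  const passed = baseCheck(baseType, output, value);
  return negated ? !passed : passed;
}

console.log(runAssertion('not-equals', 'Bonjour', 'Hello')); // true
console.log(runAssertion('not-regex', 'abc', '^a')); // false
```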

Tests from spreadsheet

Some people prefer to configure their LLM tests in a CSV. In that case, the config is pretty simple:

prompts: [prompts.txt]
providers: [openai:gpt-3.5-turbo]
tests: tests.csv

See example CSV.

Command-line

If you're looking to customize your usage, you have a wide set of parameters at your disposal.

| Option | Description |
| --- | --- |
| -p, --prompts <paths...> | Paths to prompt files, directory, or glob |
| -r, --providers <name or path...> | One of: openai:chat, openai:completion, openai:model-name, localai:chat:model-name, localai:completion:model-name. See API providers |
| -o, --output <path> | Path to output file (csv, json, yaml, html) |
| --tests <path> | Path to external test file |
| -c, --config <paths> | Path to one or more configuration files. promptfooconfig.js/json/yaml is automatically loaded if present |
| -j, --max-concurrency <number> | Maximum number of concurrent API calls |
| --table-cell-max-length <number> | Truncate console table cells to this length |
| --prompt-prefix <path> | This prefix is prepended to every prompt |
| --prompt-suffix <path> | This suffix is appended to every prompt |
| --grader | Provider that will conduct the evaluation, if you are using an LLM to grade your output |

After running an eval, you may optionally use the view command to open the web viewer:

npx promptfoo view

Examples

Prompt quality

In this example, we evaluate whether adding adjectives to the personality of an assistant bot affects the responses:

npx promptfoo eval -p prompts.txt -r openai:gpt-3.5-turbo -t tests.csv

This command will evaluate the prompts in prompts.txt, substituting the variable values from tests.csv, and output results in your terminal.

You can also output a nice spreadsheet, JSON, YAML, or an HTML file:

[Image: table output]

Model quality

In the next example, we evaluate the difference between GPT-3.5 and GPT-4 outputs for a given prompt:

npx promptfoo eval -p prompts.txt -r openai:gpt-3.5-turbo openai:gpt-4 -o output.html

Produces this HTML table:

[Image: side-by-side evaluation of LLM model quality, gpt3 vs gpt4, html output]

Usage (node package)

You can also use promptfoo as a library in your project by importing the evaluate function. The function takes the following parameters:

  • testSuite: the JavaScript equivalent of the promptfooconfig.yaml

    interface EvaluateTestSuite {
      providers: string[]; // Valid provider name (e.g. openai:gpt-3.5-turbo)
      prompts: string[]; // List of prompts
      tests: string | TestCase[]; // Path to a CSV file, or list of test cases
    
      defaultTest?: Omit<TestCase, 'description'>; // Optional: add default vars and assertions on test case
      outputPath?: string | string[]; // Optional: write results to file
    }
    
    interface TestCase {
      // Optional description of what you're testing
      description?: string;
    
      // Key-value pairs to substitute in the prompt
      vars?: Record<string, string | string[] | object>;
    
      // Optional list of automatic checks to run on the LLM output
      assert?: Assertion[];
    
      // Additional configuration settings for the prompt
      options?: PromptConfig & OutputConfig & GradingConfig;
    
      // The required score for this test case.  If not provided, the test case is graded pass/fail.
      threshold?: number;
    }
    
    interface Assertion {
      type: string;
      value?: string;
      threshold?: number; // Required score for pass
      weight?: number; // The weight of this assertion compared to other assertions in the test case. Defaults to 1.
      provider?: ApiProvider; // For assertions that require an LLM provider
    }
  • options: misc options related to how the tests are run

    interface EvaluateOptions {
      maxConcurrency?: number;
      showProgressBar?: boolean;
      generateSuggestions?: boolean;
    }
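
The weight and threshold fields above suggest that a test case score can be read as a weighted combination of its assertion scores. The sketch below assumes a weighted average compared against the test case threshold; GradedAssertion, weightedScore, and meetsThreshold are hypothetical names, and the exact aggregation promptfoo uses is not stated here.

```typescript
// Illustrative sketch only: a weighted average is an assumption about how
// assertion scores might combine, not a statement of promptfoo's behavior.
interface GradedAssertion {
  score: number; // score of a single assertion, e.g. between 0 and 1
  weight?: number; // defaults to 1, as in the Assertion interface
}

function weightedScore(results: GradedAssertion[]): number {
  let total = 0;
  let totalWeight = 0;
  for (const result of results) {
    const weight = result.weight === undefined ? 1 : result.weight;
    total += result.score * weight;
    totalWeight += weight;
  }
  // With no assertions (or zero total weight) there is nothing to average.
  return totalWeight === 0 ? 0 : total / totalWeight;
}

function meetsThreshold(results: GradedAssertion[], threshold: number): boolean {
  return weightedScore(results) >= threshold;
}

const graded: GradedAssertion[] = [{ score: 1, weight: 3 }, { score: 0 }];
console.log(weightedScore(graded)); // 0.75
console.log(meetsThreshold(graded, 0.8)); // false
```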

Example

promptfoo exports an evaluate function that you can use to run prompt evaluations.

import promptfoo from 'promptfoo';

const results = await promptfoo.evaluate({
  prompts: ['Rephrase this in French: {{body}}', 'Rephrase this like a pirate: {{body}}'],
  providers: ['openai:gpt-3.5-turbo'],
  tests: [
    {
      vars: {
        body: 'Hello world',
      },
    },
    {
      vars: {
        body: "I'm hungry",
      },
    },
  ],
});

This code imports the promptfoo library, defines a test suite with two prompts, one provider, and two test cases, and then passes that test suite to the evaluate function.

See the full example here, which includes an example results object.

Configuration

  • Main guide: Learn about how to configure your YAML file, setup prompt files, etc.
  • Configuring test cases: Learn more about how to configure expected outputs and test assertions.

Installation

See installation docs

API Providers

We support OpenAI's API as well as a number of open-source models. It's also possible to set up your own custom API provider. See Provider documentation for more details.

Development

Here's how to build and run locally:

git clone https://github.com/promptfoo/promptfoo.git
cd promptfoo

npm i
cd path/to/experiment-with-promptfoo   # contains your promptfooconfig.yaml
npx path/to/promptfoo-source eval

The web UI is located in src/web/nextui. To run it in dev mode, run npm run local:web. This will host the web UI at http://localhost:3000. The web UI expects promptfoo view to be running separately.

You may also have to set some placeholder environment variables (it is not necessary to sign up for a Supabase account):

NEXT_PUBLIC_SUPABASE_URL=http://
NEXT_PUBLIC_SUPABASE_ANON_KEY=abc

Contributions are welcome! Please feel free to submit a pull request or open an issue.

promptfoo includes several npm scripts to make development easier and more efficient. To use these scripts, run npm run <script_name> in the project directory.

Here are some of the available scripts:

  • build: Transpile TypeScript files to JavaScript
  • build:watch: Continuously watch and transpile TypeScript files on changes
  • test: Run test suite
  • test:watch: Continuously run test suite on changes
