The Phi 3 Vision model is supported in the Rust, Python, and HTTP APIs. It also supports ISQ for increased performance.

The Python and HTTP APIs support sending images as:

- URL
- Path to a local image
- Base64 encoded string
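For example, a local image can be encoded before it is placed in the `image_url` field. The following is a minimal sketch; the file name is hypothetical, and whether the server expects a raw base64 string or a `data:` URL is an assumption to verify against the API you are calling.

```python
import base64

# Hypothetical local file; replace with a real image path.
with open("mountain.jpg", "rb") as f:
    # Raw base64 string; some OpenAI-compatible endpoints instead expect
    # a data URL such as f"data:image/jpeg;base64,{encoded}".
    encoded = base64.b64encode(f.read()).decode("utf-8")
```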

The Rust API takes an image from the `image` crate.

Note: The Phi 3 Vision model works best with a single image, although sending multiple images is supported.

Note: when multiple images are sent, they are all resized to the smallest dimensions at which every image fits without cropping. Aspect ratio is not preserved in that case.

> Note: The Phi 3 Vision model does not automatically add the image tokens! They must be added to the messages manually, in the format `<|image_{N}|>`, where N starts from 1.
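For example, a prompt that references two images can be built by prefixing one token per image, in order. This is a sketch using plain Python string formatting; the surrounding request structure is the same as in the examples below.

```python
# Image tokens are 1-indexed and must match the order in which
# the images are attached to the message.
num_images = 2
prefix = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
prompt = prefix + "Compare these two images in detail."
# -> "<|image_1|>\n<|image_2|>\nCompare these two images in detail."
```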

## HTTP server

You can find this example here.

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

Note: The `image_url` may be either a path, URL, or a base64 encoded string.


Image: Mount Washington

Prompt:

```text
<|image_1|>\nWhat is shown in this image? Write a detailed response analyzing the scene.
```

Output:

The image captures a breathtaking view of a mountain peak, bathed in the soft glow of sunlight. The peak, dusted with a layer of snow, stands tall against the backdrop of a clear blue sky. A trail, etched into the mountain's side by countless hikers before it, winds its way up to the summit. The trail's white color contrasts sharply with the surrounding landscape, drawing attention to its path and inviting exploration.

The perspective from which this photo is taken offers an expansive view of the mountain and its surroundings. It seems as if one could look down from this vantage point and see miles upon miles of untouched wilderness stretching out into the distance. The colors in the image are predominantly blue and white, reflecting both sky and snow-covered mountains respectively. However, there are also hints of green from trees dotting lower parts of mountainsides or valleys below them - adding another layer to this picturesque scene. This serene landscape evokes feelings of tranquility and adventure at once - an invitation to explore nature's grandeur while respecting its majesty at all times!

  1. Start the server

> Note: You should replace `--features ...` with one of the features specified here, or remove it for pure CPU inference.

```bash
cargo run --release --features ... -- --port 1234 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v
```
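Here, `vision-plain` selects the plain vision model loader, `-m` gives the Hugging Face model ID, and `-a phi3v` selects the Phi 3 Vision architecture.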
2. Send a request
```python
import openai

# Point the OpenAI client at the local mistral.rs server.
client = openai.OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:1234/v1/",
)

completion = client.chat.completions.create(
    model="phi3v",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "<|image_1|>\nWhat is shown in this image? Write a detailed response analyzing the scene.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
```
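The same request can also be sent without the OpenAI SDK by posting the JSON payload directly to the server's OpenAI-compatible `/v1/chat/completions` endpoint. This is a sketch assuming the server from step 1 is listening on port 1234.

```python
import requests

payload = {
    "model": "phi3v",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "<|image_1|>\nWhat is shown in this image? Write a detailed response analyzing the scene.",
                },
            ],
        }
    ],
    "max_tokens": 256,
    "temperature": 0,
}

resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```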

## Rust

You can find this example here.

This is a minimal example of running the Phi 3 Vision model with an image downloaded at runtime.

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Build the vision model with in-situ quantization (Q4K) and logging.
    let model =
        VisionModelBuilder::new("microsoft/Phi-3.5-vision-instruct", VisionLoaderType::Phi3V)
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    // Fetch the image with the async reqwest API; the blocking API
    // must not be used inside an async runtime.
    let bytes = reqwest::get(
        "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg",
    )
    .await?
    .bytes()
    .await?
    .to_vec();
    let image = image::load_from_memory(&bytes)?;

    // Attach the image and the prompt to a user message.
    let messages = VisionMessages::new().add_phiv_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}
```
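In this example, `.with_isq(IsqType::Q4K)` requests in-situ quantization: the weights are quantized to the 4-bit Q4K format at load time, reducing memory use at a small cost in accuracy.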

## Python

You can find this example here.

This example demonstrates loading and sending a chat completion request with an image.

Note: the `image_url` may be either a path, URL, or a base64 encoded string.

```python
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="phi3v",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/e/e7/Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "<|image_1|>\nWhat is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
```
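ISQ can also be requested from the Python API. The sketch below assumes the `Runner` constructor accepts an `in_situ_quant` argument naming the quantization type; treat the parameter name as an assumption and check the mistralrs Python docs.

```python
from mistralrs import Runner, Which, VisionArchitecture

# Sketch: load Phi 3 Vision with in-situ quantization.
# `in_situ_quant` is assumed here; verify against the mistralrs docs.
runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
    in_situ_quant="Q4K",
)
```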