Allow input of image properties #1337

Open: wants to merge 2 commits into main

zhugeyicixin

Features
Allow image input as a `dict[str, str]` or a list of `dict[str, str]` so that image properties, such as `detail` (https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding), can be passed through. The `images` argument of `llm.aask()` now accepts any of the following:

  • None
  • str: an image URL or path (original feature)
  • list of such str (original feature)
  • dict: {'url': <image URL or path>, 'detail': 'high' | 'low' | 'auto'}, with 'detail' optional (new feature)
  • list of such dict (new feature)
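As a rough illustration of how these five forms can be reduced to a single shape, here is a minimal sketch that coerces `images` into a list of dicts. The helper `normalize_images` and the type aliases are hypothetical and not taken from the PR's diff; the sketch assumes Python 3.9+ for the built-in `dict[str, str]` generic.

```python
from typing import Optional, Union

# Hypothetical aliases for the accepted `images` inputs listed above.
ImageInput = Union[str, dict[str, str]]
ImagesArg = Optional[Union[ImageInput, list[ImageInput]]]


def normalize_images(images: ImagesArg) -> list[dict[str, str]]:
    """Coerce every accepted form of `images` into a list of {'url': ..., ...} dicts.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    if images is None:
        return []
    # Wrap a single str or dict into a one-element list.
    if isinstance(images, (str, dict)):
        images = [images]
    normalized = []
    for image in images:
        if isinstance(image, str):
            # Original form: a bare URL/path string carries no extra properties.
            normalized.append({"url": image})
        elif isinstance(image, dict):
            # New form: copy so the caller's dict is not mutated later.
            normalized.append(dict(image))
        else:
            raise TypeError(f"unsupported image input: {image!r}")
    return normalized
```

For example, `normalize_images(["a.png", {"url": "b.png", "detail": "high"}])` returns `[{"url": "a.png"}, {"url": "b.png", "detail": "high"}]`, so the old string forms and the new dict forms can share one downstream code path.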

Influence
Existing usage is unaffected; passing image properties such as `detail` is now supported as well.

Result

import asyncio

from metagpt.llm import LLM
from metagpt.logs import logger


async def ask_w_image(
    question,
    images,
    llm,
    system_prompt,
) -> str:
    logger.info(f"Q: {question}")
    logger.info(f"image: {images}")
    rsp = await llm.aask(
        question,
        system_msgs=[system_prompt],
        images=images,
    )
    logger.info(f"A: {rsp}")
    return rsp


async def main():
    llm = LLM()
    await ask_w_image(
        question="Describe this image",
        images="https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png",
        llm=llm,
        system_prompt="You are a helpful AI assistant.",
    )
    await ask_w_image(
        question="Describe this image",
        images={
            "url": "https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png"
        },
        llm=llm,
        system_prompt="You are a helpful AI assistant.",
    )
    await ask_w_image(
        question="Describe this image",
        images={
            "url": "https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png",
            "detail": "low",
        },
        llm=llm,
        system_prompt="You are a helpful AI assistant.",
    )
    await ask_w_image(
        question="Describe this image",
        images={
            "url": "https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png",
            "detail": "high",
        },
        llm=llm,
        system_prompt="You are a helpful AI assistant.",
    )


if __name__ == "__main__":
    asyncio.run(main())

# output

2024-06-11 20:34:26.244 | INFO     | __main__:ask_w_image:20 - Q: Describe this image
2024-06-11 20:34:26.245 | INFO     | __main__:ask_w_image:21 - image: https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png
The image depicts a black and white abstract logo consisting of four interconnected shapes that resemble stylized, curved arrows or loops. These shapes form a symmetrical pattern, creating a sense of movement and flow. The design is bold and modern, with clean lines and a balanced composition. The overall effect is dynamic and visually engaging.
2024-06-11 20:34:29.599 | INFO     | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.001 | Max budget: $10.000 | Current cost: $0.001, prompt_tokens: 21, completion_tokens: 63
2024-06-11 20:34:29.599 | INFO     | __main__:ask_w_image:27 - A: The image depicts a black and white abstract logo consisting of four interconnected shapes that resemble stylized, curved arrows or loops. These shapes form a symmetrical pattern, creating a sense of movement and flow. The design is bold and modern, with clean lines and a balanced composition. The overall effect is dynamic and visually engaging.
2024-06-11 20:34:29.599 | INFO     | __main__:ask_w_image:20 - Q: Describe this image
2024-06-11 20:34:29.599 | INFO     | __main__:ask_w_image:21 - image: {'url': 'https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png'}
The image depicts a black and white abstract logo or symbol. It consists of four identical, curved shapes that are arranged in a circular pattern, creating a sense of motion and symmetry. Each shape has a pointed end and a rounded end, and they are interconnected in a way that forms a continuous loop. The overall design is dynamic and balanced, with a modern and sleek appearance. The use of black and white adds to the simplicity and elegance of the symbol.
2024-06-11 20:34:34.029 | INFO     | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.003 | Max budget: $10.000 | Current cost: $0.001, prompt_tokens: 21, completion_tokens: 91
2024-06-11 20:34:34.030 | INFO     | __main__:ask_w_image:27 - A: The image depicts a black and white abstract logo or symbol. It consists of four identical, curved shapes that are arranged in a circular pattern, creating a sense of motion and symmetry. Each shape has a pointed end and a rounded end, and they are interconnected in a way that forms a continuous loop. The overall design is dynamic and balanced, with a modern and sleek appearance. The use of black and white adds to the simplicity and elegance of the symbol.
2024-06-11 20:34:34.030 | INFO     | __main__:ask_w_image:20 - Q: Describe this image
2024-06-11 20:34:34.030 | INFO     | __main__:ask_w_image:21 - image: {'url': 'https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png', 'detail': 'low'}
The image features a black and white abstract design composed of four interlocking shapes. Each shape resembles a curved, elongated teardrop or a stylized leaf, and they are arranged in a circular pattern, creating a symmetrical and balanced appearance. The shapes are connected at their tips, forming a continuous loop that gives the design a sense of unity and flow. The overall effect is visually striking and harmonious.
2024-06-11 20:34:37.175 | INFO     | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.004 | Max budget: $10.000 | Current cost: $0.001, prompt_tokens: 21, completion_tokens: 80
2024-06-11 20:34:37.175 | INFO     | __main__:ask_w_image:27 - A: The image features a black and white abstract design composed of four interlocking shapes. Each shape resembles a curved, elongated teardrop or a stylized leaf, and they are arranged in a circular pattern, creating a symmetrical and balanced appearance. The shapes are connected at their tips, forming a continuous loop that gives the design a sense of unity and flow. The overall effect is visually striking and harmonious.
2024-06-11 20:34:37.175 | INFO     | __main__:ask_w_image:20 - Q: Describe this image
2024-06-11 20:34:37.175 | INFO     | __main__:ask_w_image:21 - image: {'url': 'https://raw.githubusercontent.com/geekan/MetaGPT/main/docs/resources/MetaGPT-new-log.png', 'detail': 'high'}
The image depicts a black and white abstract logo or symbol. It consists of four identical, curved shapes that are arranged in a circular pattern, creating a sense of motion and symmetry. Each shape has a pointed end and a rounded end, and they are interconnected in a way that forms a continuous loop. The overall design is dynamic and balanced, with a modern and sleek appearance. The use of black and white adds to the simplicity and elegance of the symbol.
2024-06-11 20:34:40.956 | INFO     | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.005 | Max budget: $10.000 | Current cost: $0.001, prompt_tokens: 21, completion_tokens: 91
2024-06-11 20:34:40.957 | INFO     | __main__:ask_w_image:27 - A: The image depicts a black and white abstract logo or symbol. It consists of four identical, curved shapes that are arranged in a circular pattern, creating a sense of motion and symmetry. Each shape has a pointed end and a rounded end, and they are interconnected in a way that forms a continuous loop. The overall design is dynamic and balanced, with a modern and sleek appearance. The use of black and white adds to the simplicity and elegance of the symbol.

# image url or image base64
url = image if image.startswith("http") else f"data:image/jpeg;base64,{image}"
if not image_info["url"].startswith("http"):
Collaborator

This doesn't validate the keys of `image_info`, for example when a dict without `url` is passed. Maybe you should add validation.
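One way to act on this suggestion is sketched below: check the dict's keys before reading `image_info["url"]`, then build an OpenAI-style `image_url` content part. The function `to_image_content` and the sets `ALLOWED_DETAILS` and `SUPPORTED_KEYS` are hypothetical. Rejecting keys other than `url` and `detail` is an assumption based on the fields described in the OpenAI vision guide linked in the PR description, and the base64 fallback mirrors the `data:image/jpeg;base64,` convention in the snippet above.

```python
ALLOWED_DETAILS = {"high", "low", "auto"}
SUPPORTED_KEYS = {"url", "detail"}  # assumption: only the fields from the OpenAI vision guide


def to_image_content(image_info: dict[str, str]) -> dict:
    """Validate one image dict and return an OpenAI-style `image_url` content part.

    Hypothetical helper illustrating the review suggestion; not the PR's code.
    """
    # Fail early with a clear message instead of a bare KeyError on image_info["url"].
    if "url" not in image_info:
        raise ValueError(f"image dict needs a 'url' key, got keys {sorted(image_info)}")
    unknown = set(image_info) - SUPPORTED_KEYS
    if unknown:
        raise ValueError(f"unsupported image properties: {sorted(unknown)}")
    detail = image_info.get("detail")
    if detail is not None and detail not in ALLOWED_DETAILS:
        raise ValueError(f"'detail' must be one of {sorted(ALLOWED_DETAILS)}, got {detail!r}")

    url = image_info["url"]
    # Same convention as the snippet: anything not starting with "http" is treated as base64 JPEG data.
    if not url.startswith("http"):
        url = f"data:image/jpeg;base64,{url}"

    image_url = {"url": url}
    if detail is not None:
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}
```

With this check, `to_image_content({"detail": "low"})` raises a `ValueError` naming the missing key, while `to_image_content({"url": "https://example.com/a.png", "detail": "low"})` returns `{"type": "image_url", "image_url": {"url": "https://example.com/a.png", "detail": "low"}}`.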

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 18.18182% with 9 lines in your changes missing coverage. Please review.

Project coverage is 62.56%. Comparing base (ab846f6) to head (11221e9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| metagpt/provider/base_llm.py | 18.18% | 9 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1337      +/-   ##
==========================================
- Coverage   62.59%   62.56%   -0.03%     
==========================================
  Files         287      287              
  Lines       17589    17595       +6     
==========================================
  Hits        11009    11009              
- Misses       6580     6586       +6     
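The figures in this report are consistent with each other, which a few lines of arithmetic can confirm. This sketch assumes Codecov's default of truncating percentages to two decimals (`round: down`, `precision: 2`) and infers 11 patch lines from "9 lines ... missing" at 18.18% patch coverage; `coverage_pct` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_DOWN


def coverage_pct(hits: int, lines: int) -> Decimal:
    """Percentage of lines hit, truncated (not rounded) to two decimals.

    Assumes Codecov's default rounding mode `down` with precision 2.
    """
    return (Decimal(hits) / Decimal(lines) * 100).quantize(Decimal("0.01"), rounding=ROUND_DOWN)


base = coverage_pct(11009, 17589)  # main: 11009 hits out of 17589 lines
head = coverage_pct(11009, 17595)  # #1337: same hits, 6 more lines
patch = coverage_pct(11 - 9, 11)   # 11 changed lines, 9 of them missed
print(base, head, head - base, patch)
```

This prints `62.59 62.56 -0.03 18.18`. Truncation matters here: 11009 / 17595 is about 62.5689%, which would round to 62.57 but is reported as 62.56.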

