[RFC] 011 - Support image upload in conversations #427

arvinxx · 2023-11-08T03:44:40Z

arvinxx
Nov 8, 2023
Maintainer

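Background

GPT-4 vision was released recently. LobeChat needs to support image-based conversations to improve the everyday chat experience and extend what Agents can do.

Proposed approach

The official GPT-4 vision sample code: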

import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url:
              "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        ],
      },
    ],
  });
  console.log(response.choices[0]);
}

main();

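The biggest difference between a vision model and an ordinary chat model is the data structure of the messages being sent.

For an ordinary chat model, the message structure looks like this, and every content field is a string.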

{
  "messages": [
    { "role": "user", "content": "What's the capital of France?" },
    { "role": "assistant", "content": "Paris, as if everyone doesn't know that already." }
  ]
}

For a vision model, content is instead an array of content parts:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What’s in this image?" },
    {
      "type": "image_url",
      "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
  ]
}
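
The two shapes above can be modelled as a union type. Here is a minimal TypeScript sketch of that idea. The names `TextPart`, `ImagePart`, `ContentPart`, `UserContent`, and `contentToText` are illustrative assumptions, not names taken from the OpenAI SDK or from LobeChat, and `image_url` is kept as a plain string to match the JSON in this post.

```typescript
// Hypothetical types mirroring the two JSON shapes shown above.
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image_url'; image_url: string };
type ContentPart = TextPart | ImagePart;

// An ordinary chat model receives a string; a vision model receives an array of parts.
type UserContent = string | ContentPart[];

// Collapse either shape to plain text, e.g. for a message preview.
function contentToText(content: UserContent): string {
  if (typeof content === 'string') return content;

  return content
    .filter((part): part is TextPart => part.type === 'text')
    .map((part) => part.text)
    .join('\n');
}
```

Keeping `string` in the union means existing text-only messages would not need to be migrated.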

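This means the input box has to support image input and can no longer be the plain-text box we used before. In form, it will follow the DingTalk input box.

For a message like the one in the screenshot above, the data structure sent to the vision model is as follows: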

{
  "role": "user",
  "content": [
    { "type": "text", "text": "输入" },
    {
      "type": "image_url",
      "image_url": "https://image-1.png"
    },
    { "type": "text", "text": "大萨达撒" },
    {
      "type": "image_url",
      "image_url": "https://image-2.png"
    }
  ]
}

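Considering the need for enhanced / commands and @role mentions, the input box will eventually become a simple rich text editor. We have also been thinking about what the final interaction for LobeChat's Agent document mode should look like (https://github.com/lobehub/lobe-chat/discussions/323).

So we are also considering whether to implement this requirement directly with a rich text editor.

Progress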

arvinxx · 2023-11-10T04:54:13Z

arvinxx
Nov 10, 2023
Maintainer Author

Editor comparison

Plate VS tiptap

0 replies

arvinxx · 2023-11-10T04:54:49Z

arvinxx
Nov 10, 2023
Maintainer Author

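Plain input box + file upload

After some research, I found that the editor approach would require a major overhaul of the current data structures, so we will start with a relatively scaled-back approach.

Text input box + image manager

Approach

Interaction states

Initial demo: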

vision.mp4

System analysis breakdown:

Drag-and-drop upload

drag-to-upload.mp4

Click to upload
Paste a screenshot to upload
Uploading
Image preview
Image loading
Image details (lightbox): https://www.timellenberger.com/libraries/react-spring-lightbox
Delete image
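
The states in the list above (uploading, image loading, preview, delete) could be tracked per image with a small reducer. This is only a sketch under my own assumptions: `ImageItem`, `ImageAction`, and `imageListReducer` are hypothetical names, not part of LobeChat.

```typescript
// Hypothetical per-image state for the image manager.
type ImageItem = {
  key: string;
  status: 'uploading' | 'loading' | 'ready';
  previewUrl?: string;
};

type ImageAction =
  | { type: 'startUpload'; key: string }
  | { type: 'uploadDone'; key: string }
  | { type: 'previewLoaded'; key: string; previewUrl: string }
  | { type: 'remove'; key: string };

function imageListReducer(items: ImageItem[], action: ImageAction): ImageItem[] {
  switch (action.type) {
    case 'startUpload':
      // A new image enters the list in the "uploading" state.
      return [...items, { key: action.key, status: 'uploading' }];
    case 'uploadDone':
      // The file is stored; the preview is still loading.
      return items.map((item): ImageItem =>
        item.key === action.key ? { ...item, status: 'loading' } : item,
      );
    case 'previewLoaded':
      return items.map((item): ImageItem =>
        item.key === action.key
          ? { ...item, status: 'ready', previewUrl: action.previewUrl }
          : item,
      );
    case 'remove':
      return items.filter((item) => item.key !== action.key);
  }
}
```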

Data flow

FileStore

Upload file
Fetch image details
Delete image
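
As a rough illustration of the three FileStore actions listed above, here is an in-memory stand-in. All names (`StoredImage`, `InMemoryFileStore`, `uploadFile`, `getImageDetail`, `removeFile`) are assumptions made for this sketch. The methods are synchronous only to keep the example short; a store backed by a local database would presumably return Promises.

```typescript
// Hypothetical record shape, for illustration only.
interface StoredImage {
  name: string;
  fileType: string;
  data: ArrayBuffer;
}

class InMemoryFileStore {
  // Prototype-less object used as a dictionary keyed by file id.
  private files: Record<string, StoredImage> = Object.create(null);
  private counter = 0;

  // Upload: keep the file and return its generated id.
  uploadFile(file: StoredImage): string {
    this.counter += 1;
    const id = `file-${this.counter}`;
    this.files[id] = file;
    return id;
  }

  // Fetch details: undefined when the id is unknown.
  getImageDetail(id: string): StoredImage | undefined {
    return this.files[id];
  }

  // Delete: returns true if a file was actually removed.
  removeFile(id: string): boolean {
    const existed = id in this.files;
    delete this.files[id];
    return existed;
  }
}
```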

Wire the image data flow into conversation messages

Local database
Data model
Store image binary files
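
With a text input box plus an image manager, one possible way to assemble the outgoing message is to send the text first and append each attached image after it. The sketch below assumes exactly that. `ContentPart` and `buildUserContent` are hypothetical names, and `image_url` is kept as a plain string to match the JSON earlier in this RFC.

```typescript
// Hypothetical shapes matching the JSON examples earlier in this RFC.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: string };

// Build the `content` field from the text box value and the image manager's URLs.
// Without images the content stays a plain string, so ordinary chat models are unaffected.
function buildUserContent(text: string, imageUrls: string[]): string | ContentPart[] {
  if (imageUrls.length === 0) return text;

  const imageParts = imageUrls.map(
    (url): ContentPart => ({ type: 'image_url', image_url: url }),
  );
  return [{ type: 'text', text }, ...imageParts];
}
```

For example, `buildUserContent('hi', [])` returns the string unchanged, while passing two URLs yields one text part followed by two image parts.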

Support CDN configuration

S3
Alibaba Cloud
Tencent Cloud

0 replies

arvinxx · 2023-11-11T11:05:44Z

arvinxx
Nov 11, 2023
Maintainer Author

Development notes

Data model

Schema and type definitions:

import { z } from 'zod';

export const LocalFileSchema = z.object({
  /**
   * creation time
   */
  createdAt: z.number(),
  /**
   * file data array buffer
   */
  data: z.instanceof(ArrayBuffer),
  /**
   * file type
   * @example 'image/png'
   */
  fileType: z.string(),
  /**
   * file name
   * @example 'test.png'
   */
  name: z.string(),
  /**
   * the mode database save the file
   * local mean save the raw file into data
   * url mean upload the file to a cdn and then save the url
   */
  saveMode: z.enum(['local', 'url']),
  /**
   * file size
   */
  size: z.number(),
  /**
   * file url if saveMode is url
   */
  url: z.string().url().optional(),
});

export type LocalFile = z.infer<typeof LocalFileSchema>;

Database layer:

export const lobeDBSchema = {
  files: '&id, name, fileType, saveMode',
};

ORM layer:

import { LocalFile, LocalFileSchema } from '@/types/database/files';
import { nanoid } from '@/utils/uuid';

import { BaseModel } from './core';

class _FileModel extends BaseModel {
  constructor() {
    super('files', LocalFileSchema);
  }

  async create(file: LocalFile) {
    const id = nanoid();

    return this.add(file, `file-${id}`);
  }

  async findById(id: string) {
    return this.table.get(id);
  }

  async delete(id: string) {
    return this.table.delete(id);
  }
}

export const FileModel = new _FileModel();
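
A record stored with saveMode 'local' has no URL, so it has to be turned into something the vision model can read. The OpenAI vision documentation describes passing base64-encoded images as data URLs, so one option is sketched below. `LocalFileLike`, `toBase64`, and `toImageUrl` are hypothetical names, and the hand-written base64 encoder is only there to keep the example free of runtime-specific APIs.

```typescript
// Subset of the LocalFile fields defined above that this helper needs.
interface LocalFileLike {
  data: ArrayBuffer;
  fileType: string;
  saveMode: 'local' | 'url';
  url?: string;
}

const BASE64_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Dependency-free base64 encoder (3 input bytes become 4 output characters).
function toBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i] ?? 0;
    const b1 = bytes[i + 1] ?? 0; // 0 when past the end of the input
    const b2 = bytes[i + 2] ?? 0;
    out += BASE64_ALPHABET.charAt(b0 >> 2);
    out += BASE64_ALPHABET.charAt(((b0 & 3) << 4) | (b1 >> 4));
    out += i + 1 < bytes.length ? BASE64_ALPHABET.charAt(((b1 & 15) << 2) | (b2 >> 6)) : '=';
    out += i + 2 < bytes.length ? BASE64_ALPHABET.charAt(b2 & 63) : '=';
  }
  return out;
}

// Use the CDN url when there is one; otherwise inline the bytes as a data URL.
function toImageUrl(file: LocalFileLike): string {
  if (file.saveMode === 'url' && file.url) return file.url;
  return `data:${file.fileType};base64,${toBase64(file.data)}`;
}
```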

FilePreview

2 replies

arvinxx Nov 11, 2023
Maintainer Author

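Issue 1: for any image, the model only returns a very short answer.

Sending the request directly at the API layer gives the same result:

I checked the official example and tested it again, and it has the same problem.

However, the response clearly shows the stop reason:

"finish_details": {
  "type": "max_tokens"
}

That is, maxTokens was not set, whereas the official API example does pass max_tokens: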

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4-vision-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

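In GPT-3.5 and the earlier GPT-4, max_tokens did not seem to have much effect. For the vision model, however, the default max_tokens appears to be set too low, so we need to change this value. 300 is probably enough.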
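
One way to apply this would be to fill in a default max_tokens only when the model is a vision model and the caller has not set one. The sketch below assumes a simplified payload. `ChatPayload`, `VISION_MODELS`, `VISION_DEFAULT_MAX_TOKENS`, and `withVisionMaxTokens` are hypothetical names, and 300 is simply the value from the official curl example, not a tuned number.

```typescript
// Hypothetical, simplified request payload.
interface ChatPayload {
  model: string;
  max_tokens?: number;
}

// Assumed list of vision models; only the model named in this RFC is included.
const VISION_MODELS = ['gpt-4-vision-preview'];
const VISION_DEFAULT_MAX_TOKENS = 300;

// Return the payload with a default max_tokens for vision models,
// leaving other models and explicit values untouched.
function withVisionMaxTokens(payload: ChatPayload): ChatPayload {
  const isVisionModel = VISION_MODELS.some((name) => name === payload.model);
  if (!isVisionModel || payload.max_tokens !== undefined) return payload;

  return { ...payload, max_tokens: VISION_DEFAULT_MAX_TOKENS };
}
```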

arvinxx Nov 14, 2023
Maintainer Author

Issue 2: the vision model does not support functions

For now, the workaround is to not enable tools for vision models:

  // a model may use tools only when:
  // 1. the tool list is not empty
  // 2. the model is not in the vision white list, because vision models can't use tools
  // TODO: find a way to let vision models use tools
  const shouldUseTools = filterTools.length > 0 && !VISION_MODEL_WHITE_LIST.includes(payload.model);

  const functions = shouldUseTools ? filterTools : undefined;
