# Inference

## Preparation

Create the conda environment from the provided config file, then activate it before running inference (the environment name is defined in videosalmonn.yml):

```bash
conda env create -f videosalmonn.yml
```

Create the directories that will store the checkpoints (if you modify this structure or rename the directories, update the config files and model files accordingly):

```bash
mkdir -p ckpt/MultiResQFormer
mkdir -p ckpt/pretrained_ckpt
```

Then download the following model checkpoints (an optional sanity check is sketched after the list):

1. Main video-SALMONN model checkpoint, and put it under `ckpt/MultiResQFormer`
2. InstructBLIP checkpoint for the Vicuna-13B model, and put it under `ckpt/pretrained_ckpt`
3. EVA_VIT model checkpoint for InstructBLIP, and put it under `ckpt/pretrained_ckpt`
4. BEATs encoder checkpoint, and put it under `ckpt/pretrained_ckpt`

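Once all four checkpoints are in place, it can be worth confirming that the directories are actually populated before launching inference. The snippet below is a minimal, optional sketch (not part of the repository) that only checks whether the two checkpoint directories exist and contain files:

```python
# check_ckpt.py -- optional helper, not part of the repository.
# Confirms that the checkpoint directories created above contain files
# before inference is launched.
from pathlib import Path

REQUIRED_DIRS = ["ckpt/MultiResQFormer", "ckpt/pretrained_ckpt"]

for name in REQUIRED_DIRS:
    path = Path(name)
    files = [p for p in path.iterdir() if p.is_file()] if path.is_dir() else []
    status = f"{len(files)} file(s) found" if files else "MISSING OR EMPTY"
    print(f"{name}: {status}")
```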
## Run inference

```bash
python inference.py --cfg-path config/test.yaml
```

## Check the result

The result is saved to `./ckpt/MultiResQFormer/<DateTime>/eval_result.json`, where `<DateTime>` is the timestamp of the inference run.

The expected result is as follows:

```json
[
    {
        "id": "./dummy/4405327307.mp4_Describe the video and audio in detail",
        "conversation": [
            {
                "from": "human",
                "value": "Describe the video and audio in detail"
            },
            {
                "from": "gpt",
                "value": "None"
            }
        ],
        "task": "audiovisual_video_input",
        "ref_answer": "None",
        "gen_answer": "The video shows a group of musicians performing on stage, with a man singing into a microphone and playing the piano. There is also a drum set and a saxophone on stage. The audience is not visible in the video. The music is upbeat and energetic, and the performers seem to be enjoying themselves.</s>"
    }
]
```
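
The generated answers can also be inspected programmatically with the standard `json` module. A minimal sketch, assuming the output path shown above (substitute the actual `<DateTime>` directory created by the run):

```python
# Optional: print the id and generated answer for each item in eval_result.json.
import json

# Substitute the actual <DateTime> directory created by the inference run.
result_path = "./ckpt/MultiResQFormer/<DateTime>/eval_result.json"

with open(result_path) as f:
    results = json.load(f)

for item in results:
    print(item["id"])
    print(item["gen_answer"])
```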