ICLR 2024
Can we better anticipate an actor's future actions (e.g., mix eggs) by knowing what commonly happens after their current action (e.g., crack eggs)? What if we also know the actor's longer-term goal (e.g., making egg fried rice)? We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g., recipes, how-tos), can help long-term action anticipation (LTA) from both perspectives: they can provide prior knowledge about plausible next actions, and they can infer the goal from the observed part of a procedure.
AntGPT is the framework proposed in our paper for leveraging LLMs in video-based long-term action anticipation. At the time of publication, AntGPT achieves state-of-the-art performance on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+.
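As a toy illustration of this idea (not the actual code in this repo), one could verbalize the observed actions, ask an LLM for the likely goal, and then condition the anticipation request on that goal; the prompt wording and the query_llm helper below are hypothetical.

# Toy sketch of the prompting idea: infer a goal from observed actions, then
# ask for likely future actions conditioned on that goal. Illustrative only;
# the prompt wording and the `query_llm` helper are hypothetical.
observed_actions = ["crack eggs", "mix eggs", "heat pan"]

goal_prompt = (
    "A person performed these actions in order: "
    + ", ".join(observed_actions)
    + ". What is their most likely goal? Answer with a short phrase."
)
# goal = query_llm(goal_prompt)   # e.g. "making egg fried rice"
goal = "making egg fried rice"    # placeholder for the LLM's answer

anticipation_prompt = (
    f"The goal is {goal}. The person has already done: "
    + ", ".join(observed_actions)
    + ". List the next actions they are likely to perform."
)
# future_actions = query_llm(anticipation_prompt)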
Clone this repository.
git clone [email protected]:brown-palm/AntGPT.git
cd AntGPT
Set up a Python 3.9 virtual environment, then install PyTorch with the CUDA version matching your system.
python3 -m venv venv/forecasting
source venv/forecasting/bin/activate
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --extra-index-url https://download.pytorch.org/whl/cu117
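Optionally, verify that the install picked up the expected CUDA build (the version string below is simply what the command above targets):

# Optional sanity check that PyTorch sees the GPU.
import torch
print(torch.__version__)          # expected: 2.0.0+cu117
print(torch.cuda.is_available())  # should be True on a machine with a CUDA GPU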
Install CLIP.
pip install git+https://github.com/openai/CLIP.git
Install other packages.
pip install -r requirements.txt
Install the llama-recipes package following the instructions here.
In our experiments, we used data from Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+. For EPIC-Kitchens-55 and EGTEA GAZE+, we also used the data annotations and splits from EGO-TOPO. First, create a data folder in the root directory.
mkdir data
Download the Ego4D dataset, annotations, and pretrained models from here.
Download the EPIC-Kitchens-55 dataset and annotations.
Download the EGTEA GAZE+ dataset from here.
Download the data annotations from EGO-TOPO. Please refer to their instructions.
You can find our preprocessed files, including text prompts, goal features, etc., here.
Download and unzip both folders.
Place the goal_features folder under the data folder.
Place the dataset folder under the Llama2_models folder.
Make a symlink in the ICL subfolder of the Llama2_models folder.
ln -s {path_to_dataset} AntGPT/Llama2_models/ICL
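If you want to inspect the downloaded goal features, a minimal sketch like the one below works; the internal structure of the pickle (assumed here to be a dict of arrays) is an assumption, so adapt it once you see the actual contents.

# Peek at one of the preprocessed goal-feature files. The pickle's internal
# structure (a dict keyed by clip id) is an assumption.
import pickle

with open("data/goal_features/ego4d_feature_gt_val.pkl", "rb") as f:
    goal_features = pickle.load(f)

print(type(goal_features))
if isinstance(goal_features, dict):
    first_key = next(iter(goal_features))
    print(first_key, getattr(goal_features[first_key], "shape", None))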
We used CLIP to extract features from these datasets. You can use the feature extraction script under transformer_models to extract them.
python -m transformer_models.generate_clip_img_embedding
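The script handles feature extraction end to end; for reference, the core of CLIP image-feature extraction looks roughly like the sketch below (the model variant, frame path, and preprocessing are illustrative, not the script's exact settings).

# Rough sketch of CLIP image-feature extraction; see
# transformer_models/generate_clip_img_embedding.py for the actual pipeline.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # model variant is an assumption

image = preprocess(Image.open("frame_0001.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    feature = model.encode_image(image)  # shape: (1, embedding_dim)
print(feature.shape)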
Our data folder structure is illustrated below. Feel free to use your own setup, but remember to adjust the path configs accordingly.
data
├── ego4d
│   ├── annotations
│   │   ├── fho_lta_taxonomy.json
│   │   ├── fho_test_unannotated.json
│   │   ├── ...
│   │
│   └── clips
│       ├── 0a7a74bf-1564-41dc-a516-f5f1fa7f75d1.mp4
│       ├── 0a975e6e-4b13-426d-be5f-0ef99b123358.mp4
│       ├── ...
│
├── ek
│   ├── annotations
│   │   ├── EPIC_many_shot_verbs.csv
│   │   ├── ...
│   │
│   └── clips
│       ├── rgb
│       ├── obj
│       └── flow
│
├── gaze
│   ├── annotations
│   │   ├── action_list_t+v.csv
│   │   ├── ...
│   │
│   └── clips
│       ├── OP01-R01-PastaSalad.mp4
│       ├── ...
│
├── goal_features
│   ├── ego4d_feature_gt_val.pkl
│   ├── ...
│
├── output_CLIP_img_embedding_ego4d
│
...
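If you deviate from this layout, a quick check like the following (paths mirror the tree above; trim the list to the datasets you actually use) can confirm that the directories your configs point to exist:

# Optional check that the expected data directories are in place.
from pathlib import Path

expected = [
    "data/ego4d/annotations",
    "data/ego4d/clips",
    "data/ek/annotations",
    "data/gaze/annotations",
    "data/goal_features",
    "data/output_CLIP_img_embedding_ego4d",
]
for p in expected:
    print(("OK       " if Path(p).is_dir() else "MISSING  ") + p)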
Our codebase consists of three parts: the transformer experiments, the GPT experiments, and the Llama2 experiments. The implementation of each part is located in the transformer_models, GPT_models, and Llama2_models folders, respectively.
You can find our model checkpoints and output files for Ego4D LTA here.
Unzip both folders.
Place the ckpt folder under the llama_recipe subfolder of the Llama2_models folder.
Place the ego4d_outputs folder under the llama_recipe subfolder of the Llama2_models folder.
You can submit the output files to the leaderboard directly, or regenerate them by running inference:
cd Llama2_models/Finetune/llama-recipes
CUDA_VISIBLE_DEVICES=0 python inference/inference_lta.py --model_name {your llama checkpoint path} --peft_model {pretrained model path} --prompt_file ../dataset/test_nseg8_recog_egovlp.jsonl --response_path {output file path}
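To inspect the prompt file passed to inference, a small sketch like this prints the first record; the path assumes the dataset folder sits under Llama2_models as described above, and the per-line schema is not documented here, so print it to see the actual fields.

# Peek at the first entry of the inference prompt file (JSON Lines).
import json

# Adjust the path to wherever you placed the dataset folder.
prompt_file = "Llama2_models/dataset/test_nseg8_recog_egovlp.jsonl"
with open(prompt_file) as f:
    first = json.loads(next(f))
print(list(first) if isinstance(first, dict) else first)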
To run an experiment with the transformer models, use the following command:
python -m transformer_models.run --cfg transformer_models/configs/ego4d_image_pred_in8.yaml --exp_name ego4d_lta/clip_feature_in8
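To see which paths and hyperparameters a config exposes before editing it, you can load and print the YAML (PyYAML is assumed to be available, e.g. via requirements.txt):

# Print an experiment config to see its paths and hyperparameters.
import pprint
import yaml

with open("transformer_models/configs/ego4d_image_pred_in8.yaml") as f:
    pprint.pprint(yaml.safe_load(f))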
To run a GPT experiment, please use one of the workflow illustration notebooks.
To run a Llama2 experiment, please refer to the instructions in that folder.
Our paper is available on arXiv. If you find our work useful, please consider citing us:
@article{zhao2023antgpt,
  title   = {AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?},
  author  = {Qi Zhao and Shijie Wang and Ce Zhang and Changcheng Fu and Minh Quan Do and Nakul Agarwal and Kwonjoon Lee and Chen Sun},
  journal = {ICLR},
  year    = {2024}
}
This project is released under the MIT license.