ICLR 2024
Can we better anticipate an actor's future actions (e.g., mix eggs) by knowing what commonly happens after their current action (e.g., crack eggs)? What if we also know the actor's longer-term goal (e.g., making egg fried rice)? We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g., recipes, how-tos), can help long-term action anticipation (LTA) from both perspectives: they can provide prior knowledge about plausible next actions, and they can infer the goal from the observed part of a procedure.
AntGPT is the framework proposed in our paper for leveraging LLMs in video-based long-term action anticipation. At the time of publication, AntGPT achieves state-of-the-art performance on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+.
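As a toy illustration of this idea (not the actual code in this repo), one could verbalize the observed actions, ask an LLM for the likely goal, and then condition the anticipation request on that goal; the prompt wording and the query_llm helper below are hypothetical.

# Toy sketch of the prompting idea: infer a goal from observed actions, then
# ask for likely future actions conditioned on that goal. Illustrative only;
# the prompt wording and the `query_llm` helper are hypothetical.
observed_actions = ["crack eggs", "mix eggs", "heat pan"]

goal_prompt = (
    "A person performed these actions in order: "
    + ", ".join(observed_actions)
    + ". What is their most likely goal? Answer with a short phrase."
)
# goal = query_llm(goal_prompt)   # e.g. "making egg fried rice"
goal = "making egg fried rice"    # placeholder for the LLM's answer

anticipation_prompt = (
    f"The goal is {goal}. The person has already done: "
    + ", ".join(observed_actions)
    + ". List the next actions they are likely to perform."
)
# future_actions = query_llm(anticipation_prompt)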
Clone this repository.
git clone [email protected]:brown-palm/AntGPT.git
cd AntGPT
Set up a Python 3.9 virtual environment, then install PyTorch with the CUDA version matching your system.
python3 -m venv venv/forecasting
source venv/forecasting/bin/activate
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --extra-index-url https://download.pytorch.org/whl/cu117
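Optionally, verify that the install picked up the expected CUDA build (the version string below is simply what the command above targets):

# Optional sanity check that PyTorch sees the GPU.
import torch
print(torch.__version__)          # expected: 2.0.0+cu117
print(torch.cuda.is_available())  # should be True on a machine with a CUDA GPU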
Install CLIP.
pip install git+https://github.com/openai/CLIP.git
Install other packages.
pip install -r requirements.txt
Install the llama-recipes package following the instructions here.
In our experiments, we used data from Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+. For EPIC-Kitchens-55 and EGTEA GAZE+, we also used the data annotations and splits from EGO-TOPO. First, create a data folder in the root directory.
mkdir data
Download the Ego4D dataset, annotations, and pretrained models from here.
Download the EPIC-Kitchens-55 dataset and annotations.
Download the EGTEA GAZE+ dataset from here.
Download the data annotations from EGO-TOPO. Please refer to their instructions.
You can find our preprocessed files, including text prompts, goal features, etc., here.
Download and unzip both folders.
Place the goal_features folder under the data folder.
Place the dataset folder under the Llama2_models folder.
Make a symlink in the ICL subfolder of the Llama2_models folder.
ln -s {path_to_dataset} AntGPT/Llama2_models/ICL
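If you want to inspect the downloaded goal features, a minimal sketch like the one below works; the internal structure of the pickle (assumed here to be a dict of arrays) is an assumption, so adapt it once you see the actual contents.

# Peek at one of the preprocessed goal-feature files. The pickle's internal
# structure (a dict keyed by clip id) is an assumption.
import pickle

with open("data/goal_features/ego4d_feature_gt_val.pkl", "rb") as f:
    goal_features = pickle.load(f)

print(type(goal_features))
if isinstance(goal_features, dict):
    first_key = next(iter(goal_features))
    print(first_key, getattr(goal_features[first_key], "shape", None))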
We used CLIP to extract features from these datasets. You can use the feature extraction script under transformer_models to extract them.
python -m transformer_models.generate_clip_img_embedding
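The script handles feature extraction end to end; for reference, the core of CLIP image-feature extraction looks roughly like the sketch below (the model variant, frame path, and preprocessing are illustrative, not the script's exact settings).

# Rough sketch of CLIP image-feature extraction; see
# transformer_models/generate_clip_img_embedding.py for the actual pipeline.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # model variant is an assumption

image = preprocess(Image.open("frame_0001.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    feature = model.encode_image(image)  # shape: (1, embedding_dim)
print(feature.shape)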
Our data folder structure is illustrated below. Feel free to use your own setup, but remember to adjust the path configs accordingly.
data
├── ego4d
│   ├── annotations
│   │   ├── fho_lta_taxonomy.json
│   │   ├── fho_test_unannotated.json
│   │   ├── ...
│   │
│   └── clips
│       ├── 0a7a74bf-1564-41dc-a516-f5f1fa7f75d1.mp4
│       ├── 0a975e6e-4b13-426d-be5f-0ef99b123358.mp4
│       ├── ...
│
├── ek
│   ├── annotations
│   │   ├── EPIC_many_shot_verbs.csv
│   │   ├── ...
│   │
│   └── clips
│       ├── rgb
│       ├── obj
│       └── flow
│
├── gaze
│   ├── annotations
│   │   ├── action_list_t+v.csv
│   │   ├── ...
│   │
│   └── clips
│       ├── OP01-R01-PastaSalad.mp4
│       ├── ...
│
├── goal_features
│   ├── ego4d_feature_gt_val.pkl
│   ├── ...
│
├── output_CLIP_img_embedding_ego4d
│
...
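If you deviate from this layout, a quick check like the following (paths mirror the tree above; trim the list to the datasets you actually use) can confirm that the directories your configs point to exist:

# Optional check that the expected data directories are in place.
from pathlib import Path

expected = [
    "data/ego4d/annotations",
    "data/ego4d/clips",
    "data/ek/annotations",
    "data/gaze/annotations",
    "data/goal_features",
    "data/output_CLIP_img_embedding_ego4d",
]
for p in expected:
    print(("OK       " if Path(p).is_dir() else "MISSING  ") + p)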
Our codebase consists of three parts: the transformer experiments, the GPT experiments, and the Llama2 experiments. The implementation of each part is located in the transformer_models, GPT_models, and Llama2_models folders, respectively.
You can find our model checkpoints and output files for Ego4D LTA here.
Unzip both folders.
Place the ckpt folder under the llama_recipe subfolder of the Llama2_models folder.
Place the ego4d_outputs folder under the llama_recipe subfolder of the Llama2_models folder.
You can submit the output files to the leaderboard directly, or regenerate them by running inference:
cd Llama2_models/Finetune/llama-recipes
CUDA_VISIBLE_DEVICES=0 python inference/inference_lta.py --model_name {your llama checkpoint path} --peft_model {pretrained model path} --prompt_file ../dataset/test_nseg8_recog_egovlp.jsonl --response_path {output file path}
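To inspect the prompt file passed to inference, a small sketch like this prints the first record; the path assumes the dataset folder sits under Llama2_models as described above, and the per-line schema is not documented here, so print it to see the actual fields.

# Peek at the first entry of the inference prompt file (JSON Lines).
import json

# Adjust the path to wherever you placed the dataset folder.
prompt_file = "Llama2_models/dataset/test_nseg8_recog_egovlp.jsonl"
with open(prompt_file) as f:
    first = json.loads(next(f))
print(list(first) if isinstance(first, dict) else first)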
To run an experiment with the transformer models, use the following command:
python -m transformer_models.run --cfg transformer_models/configs/ego4d_image_pred_in8.yaml --exp_name ego4d_lta/clip_feature_in8
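To see which paths and hyperparameters a config exposes before editing it, you can load and print the YAML (PyYAML is assumed to be available, e.g. via requirements.txt):

# Print an experiment config to see its paths and hyperparameters.
import pprint
import yaml

with open("transformer_models/configs/ego4d_image_pred_in8.yaml") as f:
    pprint.pprint(yaml.safe_load(f))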
To run a GPT experiment, please use one of the workflow illustration notebooks.
To run a Llama2 experiment, please refer to the instructions in that folder.
Our paper is available on arXiv. If you find our work useful, please consider citing us:
@article{zhao2023antgpt,
  title   = {AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?},
  author  = {Qi Zhao and Shijie Wang and Ce Zhang and Changcheng Fu and Minh Quan Do and Nakul Agarwal and Kwonjoon Lee and Chen Sun},
  journal = {ICLR},
  year    = {2024}
}
This project is released under the MIT license.