[FEATURE REQUEST] Enable Video Training #305

simplaj · 2024-08-08T13:20:42Z

Is your feature request related to a problem? Please describe.
I have been actively using this repository for multimodal training involving images and text. It has been incredibly helpful for my research and development. However, I am interested in expanding the capabilities to include video-based multimodal training. Currently, the repository does not support video inputs, which limits the scope of applications that can be developed.

Describe the workflow you want to enable.
I would like to enable a workflow where video data can be seamlessly integrated into the existing multimodal training pipeline. This would involve handling video frames as sequential data and allowing the model to learn from both visual and textual information extracted from videos.

Describe your proposed solution.
To address this, I propose extending the current data handling pipeline so that it can process video frames.
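As a rough illustration of what "processing video frames" could mean at the data-loading stage, here is a minimal sketch of one common choice: picking a fixed number of evenly spaced frame indices from a clip before decoding. The function name `sample_frame_indices` and its parameters are hypothetical and not part of this repository; the actual decoding (with whichever video library is used) is deliberately left out.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to `num_samples` frame indices spread evenly over a clip.

    Hypothetical helper for illustration only; not an existing API.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Short clip: keep every frame rather than repeating any.
    if num_samples >= total_frames:
        return list(range(total_frames))
    # Split the clip into num_samples equal-width segments and
    # take the frame at the midpoint of each segment.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 100-frame clip reduced to 4 frames.
    print(sample_frame_indices(100, 4))
```

Midpoint sampling is only one option; random sampling within each segment is a common alternative during training.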

Describe alternatives you've considered
An alternative solution could be to preprocess videos externally into a sequence of images and then feed these images into the existing image-based pipeline. However, this approach may not fully leverage the temporal information present in videos, and the preprocessing step could introduce additional complexity.

Additional context
Supporting video inputs could significantly enhance the repository's utility for a wider range of applications, such as video captioning, action recognition, and video question answering.

Are you willing to help implement this feature?
Yes, I am very keen to contribute to this feature. I have experience in handling video data and training multimodal models. I expect it might take a few weeks to implement and test the feature, depending on the complexity. I would appreciate any guidance or support from the OpenFlamingo team to ensure seamless integration with the existing codebase.

anas-awadalla · 2024-08-09T02:30:53Z

I agree adding video would be great! While we aren't making major changes to the codebase at the moment, I think you will find this to be partially supported already. For instance, the resampler can already take in multiple frames, denoted by F. In the current implementation F is always 1, but if you pass in multiple frames (a video) it will also work. I think you will still need to work on the dataloader etc., though. Hope this is helpful! I am excited to see what you train :).
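To make the role of F concrete, here is a small shape-only sketch. It assumes, and this should be checked against the code, that the vision input is laid out as (batch, media items, F frames, channels, height, width) and that the resampler flattens the per-frame tokens of each media item and emits a fixed number of latents. The helper names, the 256 tokens per frame, the feature width of 1024, and the 64 latents are illustrative assumptions, not values read from the repository.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def video_vision_x_shape(batch_size: int, num_media: int, num_frames: int,
                         channels: int = 3, height: int = 224,
                         width: int = 224) -> Shape:
    """Assumed input layout: (B, T_img, F, C, H, W). F is 1 for still images."""
    return (batch_size, num_media, num_frames, channels, height, width)


def resampler_shapes(batch_size: int, num_media: int, num_frames: int,
                     tokens_per_frame: int, dim: int,
                     num_latents: int = 64) -> Tuple[Shape, Shape]:
    """Shapes before and after an (assumed) Perceiver-style resampler.

    The tokens of all F frames are concatenated per media item, and the
    output has num_latents tokens regardless of F.
    """
    flattened_in = (batch_size, num_media, num_frames * tokens_per_frame, dim)
    out = (batch_size, num_media, num_latents, dim)
    return flattened_in, out


if __name__ == "__main__":
    # One 8-frame clip per sample, batch of 2 (illustrative numbers).
    print(video_vision_x_shape(2, 1, 8))
    print(resampler_shapes(2, 1, 8, tokens_per_frame=256, dim=1024))
```

Under these assumptions the language model sees the same number of visual tokens per media item whether F is 1 or 8, which is why most of the remaining work would sit in the dataloader.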

simplaj · 2024-08-15T05:14:34Z

Thanks for the information! I’ll explore the resampler’s capability to handle multiple frames and will start working on integrating video support into the dataloader. I’ll keep you updated on my progress and let you know if I need any further assistance. Looking forward to contributing to this enhancement!

simplaj changed the title ~~[FEATURE REQUEST] YOUR DESCRIPTION HERE~~ [FEATURE REQUEST] Enable Video Training Aug 8, 2024
