In this project, we developed a basic Encoder-Decoder model and an S2VT model to generate video captions. In addition, we applied attention mechanisms to improve performance.
The following instructions will get a copy of the project up and running on your local machine for testing purposes.
The following are the toolkits, and the versions of them, you need to install to run this project:
- Python 3.6 - The Python version used
- PyTorch 0.3 - Deep learning framework for Python
- Pandas 0.21.0 - Data analysis library for Python
In addition, a GPU is required to run this project.
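Assuming a pip-based environment, the dependencies listed above can be installed roughly as follows (the exact command for an old PyTorch release may differ on your machine, since PyTorch 0.3 wheels are no longer distributed through the usual channels):

```shell
# Illustrative setup, assuming pip is available; the torch 0.3.x
# package source may need to be adjusted for your CUDA version.
pip install pandas==0.21.0
pip install torch==0.3.1
```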
The following are the model structures we implemented in PyTorch from scratch:
- [Baseline Model]
- [S2VT Model] - To improve performance, we also implemented Bahdanau attention [1] and Luong attention [2]
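As a rough illustration of the attention idea used here (not the project's exact implementation), a minimal Luong-style dot-product attention module scores the current decoder state against every encoder time step and returns a weighted context vector; all tensor names and shapes below are assumptions for the sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongDotAttention(nn.Module):
    """Minimal Luong-style (dot-product) attention sketch.

    Scores the decoder hidden state against every encoder output,
    normalizes the scores with softmax, and returns the weighted
    sum of encoder outputs as the context vector.
    """

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden:  (batch, hidden)
        # encoder_outputs: (batch, seq_len, hidden)
        scores = torch.bmm(encoder_outputs,
                           decoder_hidden.unsqueeze(2))   # (batch, seq_len, 1)
        weights = F.softmax(scores.squeeze(2), dim=1)     # (batch, seq_len)
        context = torch.bmm(weights.unsqueeze(1),
                            encoder_outputs)              # (batch, 1, hidden)
        return context.squeeze(1), weights                # (batch, hidden), (batch, seq_len)
```

Bahdanau attention differs mainly in the scoring function: instead of a raw dot product, it feeds the decoder state and each encoder output through a small feed-forward layer before scoring.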
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473
[2] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.04025
[3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. arXiv:1506.03099
[4] Natsuda Laokulrat, Sang Phan, and Noriki Nishida. 2016. Generating Video Description Using Sequence-to-sequence Model with Temporal Attention