This project presents a multimodal recurrent neural network (m-RNN) model for generating image captions.
The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model.
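To make the fusion concrete, here is a minimal NumPy sketch of a multimodal layer that projects the word embedding, the recurrent state, and the CNN image feature into a common space and sums them; the weight shapes, dimensions, and variable names are illustrative assumptions, not the released implementation.

```python
import numpy as np

def multimodal_layer(w_t, r_t, img_feat, V_w, V_r, V_i):
    """Fuse the word embedding w_t, the recurrent state r_t, and the CNN
    image feature img_feat by projecting each into a shared multimodal
    space and summing (a sketch of the fusion idea; shapes and the
    activation below are illustrative assumptions)."""
    m_t = V_w @ w_t + V_r @ r_t + V_i @ img_feat
    # Scaled tanh non-linearity applied to the fused representation.
    return 1.7159 * np.tanh(2.0 / 3.0 * m_t)

# Illustrative dimensions only: 256-d word embedding, 256-d recurrent
# state, 4096-d VggNet feature, 512-d multimodal layer.
rng = np.random.default_rng(0)
w_t = rng.standard_normal(256)
r_t = rng.standard_normal(256)
img = rng.standard_normal(4096)
V_w = rng.standard_normal((512, 256))
V_r = rng.standard_normal((512, 256))
V_i = rng.standard_normal((512, 4096))
print(multimodal_layer(w_t, r_t, img, V_w, V_r, V_i).shape)  # (512,)
```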
A previous version of this work appeared in the NIPS 2014 Deep Learning Workshop under the title "Explain Images with Multimodal Recurrent Neural Networks". We have noticed subsequent arXiv papers that also apply recurrent neural networks to this task and cite our work, and we gratefully acknowledge them.
We adopt a simple method to boost the performance of the m-RNN model (roughly 3 points in BLEU-4 and 10 points in CIDEr). The code is available on GitHub.
We also release the features refined by the m-RNN model, as well as the original VggNet features, for the MS COCO train, val, and test 2014 sets. You can download them here or via a bash script here. The details of this method are described in Section 8 of the updated version of the m-RNN paper.
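As a reference for using the released data, below is a minimal loading sketch. It assumes the features are stored as a NumPy array aligned with a plain-text list of image names (one name per line); the file names are placeholders, and the actual format is documented with the download.

```python
import numpy as np

# Placeholder file names; substitute the actual paths from the download above.
FEATURE_FILE = 'mrnn_refined_features_train2014.npy'
NAME_FILE = 'image_names_train2014.txt'

# Assumed layout: one feature vector per image, rows aligned with the name list.
features = np.load(FEATURE_FILE)          # shape: (num_images, feature_dim)
with open(NAME_FILE) as f:
    image_names = [line.strip() for line in f]

# Build a lookup from image name to its feature vector.
name_to_feat = dict(zip(image_names, features))
print(len(name_to_feat), features.shape[1])
```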
Below are some examples of sentences generated on the Microsoft COCO dataset. They are generated by greedily picking the maximum-likelihood word at each step. In this dataset, "a" is the most common first word, so all the generated sentences begin with "a". The results can be further improved by using beam search.
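For clarity, here is a short sketch of this greedy (maximum-likelihood) decoding loop. The names `step_fn`, `start_id`, and `end_id` are hypothetical: `step_fn` stands in for one forward step of the m-RNN and returns a probability distribution over the vocabulary plus the new recurrent state. Beam search would instead keep the top-k partial sentences at each step rather than only the single best word.

```python
import numpy as np

def generate_greedy(step_fn, start_id, end_id, max_len=20):
    """Greedy decoding: at every step, feed the previously chosen word
    and pick the single most probable next word until the end token."""
    words, state = [start_id], None
    for _ in range(max_len):
        probs, state = step_fn(words[-1], state)
        next_id = int(np.argmax(probs))   # maximum-likelihood word
        if next_id == end_id:
            break
        words.append(next_id)
    return words[1:]                      # drop the start token
```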
For the IAPR-TC12 and Flickr8K datasets, we use the standard dataset partitions.
For the Flickr30K and MS COCO datasets, the official dataset partitions had not been released when we started this project. We randomly split these datasets (details in the paper) and provide the image name lists of the train, validation, and test sets as follows:
We thank Andrew Ng, Kai Yu, Chang Huang, Duohao Qin, Haoyuan Gao, and Jason Eisner for useful discussions and technical support. We also thank the anonymous reviewers of the NIPS 2014 Deep Learning Workshop for their comments and suggestions.