NEW! [TensorFlow m-RNN Reimplementation Code]

[arXiv for ICLR 2015] [bibtex] [poster for NIPS DLW 2014] [arXiv for NIPS 2014 DLW]

This project presents a multimodal recurrent neural network (m-RNN) model for generating image captions.

The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model.
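
As a rough illustration of how the two sub-networks can meet in a multimodal layer, the sketch below projects a word embedding, a recurrent state, and an image feature vector into a shared space, sums them, and applies a nonlinearity. The additive fusion, the tanh activation, the function names, and all dimensions are assumptions made for this example; the paper gives the model's actual formulation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multimodal_layer(word_vec, recurrent_vec, image_vec, V_w, V_r, V_i):
    """Fuse the three inputs: project each into the shared space, add, squash.

    Assumption for this sketch: element-wise sum followed by tanh.
    """
    projected = [
        matvec(V_w, word_vec),       # sentence side: current word embedding
        matvec(V_r, recurrent_vec),  # sentence side: recurrent state
        matvec(V_i, image_vec),      # image side: convolutional network features
    ]
    return [math.tanh(sum(parts)) for parts in zip(*projected)]


# Hypothetical sizes chosen only to keep the example small.
rng = random.Random(0)
word_dim, recurrent_dim, image_dim, multimodal_dim = 3, 3, 5, 4
V_w = random_matrix(multimodal_dim, word_dim, rng)
V_r = random_matrix(multimodal_dim, recurrent_dim, rng)
V_i = random_matrix(multimodal_dim, image_dim, rng)

m = multimodal_layer(
    [0.2, -0.1, 0.4],            # word embedding
    [0.0, 0.3, -0.2],            # recurrent state
    [0.5, 0.1, 0.0, -0.3, 0.2],  # image features
    V_w, V_r, V_i,
)
print(len(m))
```

In a full captioning model, the fused vector would feed a softmax over the vocabulary to predict the next word.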

A previous version of this work appeared in the NIPS 2014 Deep Learning Workshop under the title "Explain Images with Multimodal Recurrent Neural Networks". Several subsequent arXiv papers also use recurrent neural networks for this task and cite our work; we gratefully acknowledge them.

NEW! [Refined Image Features] [code for boosting the results]

We adopt a simple method that boosts the performance of the m-RNN model by about 3 points in BLEU4 and about 10 points in CIDEr. The code is available on GitHub.

We also release the features refined by the m-RNN model, as well as the original VggNet features, for the MS COCO train, val and test 2014 sets. You can download them here or with a bash script here. The method is described in Section 8 of the updated version of the m-RNN paper.

Below are some examples of sentences generated for images from the Microsoft COCO dataset. Each sentence is generated by picking the maximum-likelihood word at every step. In this dataset, "a" appears to be the most common first word of a sentence, so all the generated sentences begin with "a". We expect beam search to improve the results further.


a cat sitting on a bench in front of a window

a man riding a snowboard down a snow covered slope

a train traveling down a train track next to a forest

a laptop computer sitting on top of a table


a man riding a wave on top of a surfboard

a clock tower with a clock on top of it

a person on skis is skiing down a hill

a pizza sitting on top of a table next to a box of pizza


a red fire hydrant sitting in the middle of a forest

a train traveling down a bridge over a river

a man riding a surfboard on top of a wave

a bus is driving down the street in front of a building


a person is holding a skateboard on a street

a close up of a bowl of food on a table

a giraffe standing next to a tree in a zoo

a bunch of oranges are on a table


a group of planes flying in the sky

a train is traveling down the tracks in a city

a baseball player is swinging at a ball

a cat laying on a bed with a stuffed animal


a giraffe standing in a field with trees in the background

a young girl brushing his teeth with a toothbrush

a group of people flying kites in a field

a man is doing a trick on a skateboard

a woman standing in front of a table with a cake

a kitchen with a stove , stove , and a refrigerator

a bird sitting on top of a tree branch

a zebra standing on a dirt road next to a tree


a man in a baseball uniform is holding a baseball bat

a man is playing tennis on a tennis court

a woman sitting at a table with a plate of food

a person holding a remote control in a hand
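
The captions above were produced by taking the most likely word at each step (greedy decoding), with beam search named as a possible improvement. The sketch below contrasts the two on a toy example. The bigram table, the `<start>` and `<end>` tokens, and the function names are hypothetical stand-ins; a real captioning model would condition on the image and on the whole sentence prefix.

```python
import math

# Hypothetical toy "model": next-word probabilities given only the previous word.
TOY_MODEL = {
    "<start>": {"a": 0.6, "two": 0.4},
    "a": {"cat": 0.55, "dog": 0.45},
    "two": {"cats": 0.9, "dogs": 0.1},
    "cat": {"<end>": 1.0},
    "dog": {"<end>": 1.0},
    "cats": {"<end>": 1.0},
    "dogs": {"<end>": 1.0},
}


def next_word_probs(prefix):
    """Return the next-word distribution for a sentence prefix."""
    return TOY_MODEL[prefix[-1]]


def greedy_decode(max_len=20):
    """Pick the single most likely word at every step."""
    words = ["<start>"]
    while len(words) <= max_len:
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)
        if best == "<end>":
            break
        words.append(best)
    return words[1:]


def beam_search(beam_width=2, max_len=20):
    """Keep the beam_width most likely partial sentences at every step."""
    beams = [(0.0, ["<start>"])]  # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            for word, p in next_word_probs(words).items():
                candidate = (logp + math.log(p), words + [word])
                if word == "<end>":
                    finished.append(candidate)
                else:
                    candidates.append(candidate)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    if not finished:  # nothing reached <end> within max_len
        finished = beams
    best = max(finished, key=lambda c: c[0])
    return [w for w in best[1] if w not in ("<start>", "<end>")]


print(" ".join(greedy_decode()))
print(" ".join(beam_search(beam_width=2)))
```

In this toy table, greedy decoding commits to "a" because it is the most likely first word, while a beam of width 2 keeps "two" alive long enough to find a sentence with higher overall probability (0.36 versus 0.33).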

For the IAPR-TC12 and Flickr8K datasets, we use the standard dataset partitions.

For the Flickr30K and MS COCO datasets, the official partitions had not been released when we started this project. We randomly split these datasets (details are given in the paper) and provide the image name lists of the train, validation and test sets as follows:
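
For readers who want to build a comparable random partition of their own image list, the sketch below shows one reproducible way to do it. The function name, the seed, the file names, and the split sizes are all hypothetical; it does not reproduce the released lists, whose details are in the paper.

```python
import random


def split_image_names(image_names, n_val, n_test, seed=0):
    """Randomly split image names into train, validation and test lists.

    Sorting before shuffling with a fixed seed makes the result independent
    of the order in which the names were collected.
    """
    if n_val + n_test > len(image_names):
        raise ValueError("validation and test sizes exceed the number of images")
    names = sorted(image_names)
    random.Random(seed).shuffle(names)
    val = names[:n_val]
    test = names[n_val:n_val + n_test]
    train = names[n_val + n_test:]
    return train, val, test


# Hypothetical file names and sizes, for illustration only.
all_names = ["img_%03d.jpg" % i for i in range(20)]
train, val, test = split_image_names(all_names, n_val=3, n_test=5, seed=1)
print(len(train), len(val), len(test))
```

Publishing the resulting name lists, as done here, lets others evaluate on exactly the same images even if they cannot reproduce the shuffle.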

[Flickr30K] [MS COCO]

Coming soon!

We thank Andrew Ng, Kai Yu, Chang Huang, Duohao Qin, Haoyuan Gao, and Jason Eisner for useful discussions and technical support. We also thank the anonymous reviewers from the NIPS 2014 Deep Learning Workshop for their comments and suggestions.
