Curriculum Vitae, Google Scholar

I am currently a Ph.D. candidate in computer vision at UCLA (since Fall 2013), working with Prof. Alan L. Yuille. I am interested in various topics at the intersection of vision and language, such as image captioning, visual question answering, and referring expressions, using deep learning methods.

During my Ph.D., I have been fortunate to intern at Baidu Research USA (working with Wei Xu, Yi Yang, and Jiang Wang), Google Research (with Kevin Murphy and Jonathan Huang), Pinterest (with Jiajing Xu and Kevin Jing), and the Google X self-driving car project (with Congcong Li and Ury Zhilinsky).

I obtained my bachelor's degree from the University of Science and Technology of China (USTC). I received the Guo Moruo Scholarship, the highest honor for undergraduate students at USTC, as well as the Best Bachelor's Thesis Award. My undergraduate research advisors were Prof. Houqiang Li and Prof. Qi Tian.

NEW! We have released part of the Pinterest 40M dataset, with 5 million images and their descriptions, proposed in this paper. Please find the dataset and a toolbox on GitHub.

NEW! I have re-implemented our m-RNN model in TensorFlow. Check out the code on GitHub!

NEW! We have released the Google Refexp dataset proposed in this paper. It contains unambiguous object descriptions (i.e., referring expressions) for the objects in the MS COCO dataset. Please find the dataset and a toolbox for visualization and evaluation on this GitHub page. The paper appears in CVPR 2016 as an oral presentation and was covered by a TechCrunch article.

We propose an LSTM-CNN method for Image Question Answering (IQA) and construct a freestyle multilingual IQA dataset [arXiv]. The dataset is released here! This work was featured in a Bloomberg article and will appear in NIPS 2015.

We adopt a simple method to boost the performance of the m-RNN model (~3 points for BLEU-4 and ~10 points for CIDEr). The code is available on GitHub. We also release the image features refined by the m-RNN model for the MS COCO train, val, and test 2014 sets. You can download them here or via a bash script here.

Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images, NIPS 2016
Junhua Mao, Jiajing Xu, Yushi Jing, Alan Yuille
NEW! [Dataset and Toolbox], [arXiv], [Spotlight], [Project Page], [Paper]
We focus on training and evaluating effective word embeddings with both text and visual information. Multimodal datasets for both training and testing are proposed and will be gradually released on the project page.

Attention Correctness in Neural Image Captioning, AAAI 2017
Chenxi Liu, Junhua Mao, Fei Sha, Alan Yuille
[arXiv 1605.09553], [Suppl]
We quantitatively evaluate the correctness of the attention maps produced by deep attentional models for image captioning, and improve performance by adding different levels of supervision on the attention during training.
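To illustrate the idea of supervising attention, here is a minimal sketch of an attention-supervision term: a KL divergence between a ground-truth alignment map and the model's predicted attention map. The function name, the use of KL divergence, and the toy 2x2 maps are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def attention_supervision_loss(pred_attn, true_attn, eps=1e-12):
    """KL divergence from the ground-truth alignment map to the
    model's predicted attention map (both normalized over regions).
    Returns 0 when the two distributions match exactly."""
    pred = pred_attn / pred_attn.sum()
    true = true_attn / true_attn.sum()
    return float(np.sum(true * (np.log(true + eps) - np.log(pred + eps))))

# Toy example: attention over a 2x2 grid of image regions.
aligned = np.array([[0.7, 0.1], [0.1, 0.1]])  # concentrated on one region
uniform = np.full((2, 2), 0.25)               # spread out evenly
print(attention_supervision_loss(aligned, aligned))  # ~0.0
print(attention_supervision_loss(uniform, aligned))  # > 0: penalized
```

In training, a term like this would be added to the usual caption cross-entropy loss, pushing the model to attend to the regions a word actually refers to.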

Generation and Comprehension of Unambiguous Object Descriptions, Oral Presentation, CVPR 2016; Media Coverage: [TechCrunch]
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy
NEW! [Dataset and Toolbox], [arXiv: 1511.02283], [bibtex], [Slides]
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We also present and release a new large-scale dataset for referring expressions, based on MS-COCO.
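The comprehension side of this task can be sketched very simply: score every candidate region by how likely the model finds the expression given that region, and pick the argmax. The scorer below is a purely hypothetical stand-in for the learned model.

```python
import numpy as np

def comprehend(expression_score, regions):
    """Comprehension as region selection: return the index of the
    region whose score (e.g., a caption model's log-likelihood of
    the expression given that region) is highest."""
    scores = [expression_score(r) for r in regions]
    return int(np.argmax(scores)), scores

# Toy stand-in scorer that just prefers the largest box (illustrative only).
boxes = [(10, 10, 30, 30), (0, 0, 100, 80), (50, 50, 60, 70)]
area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
best, scores = comprehend(area, boxes)
print(best)  # 1: the largest box under this toy scorer
```

Swapping the toy scorer for the trained generation model turns the same argmax into the actual comprehension procedure described above.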

CNN-RNN: A Unified Framework for Multi-label Image Classification, Oral Presentation, CVPR 2016
Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu
[arXiv: 1604.04573]
An end-to-end trainable CNN-RNN framework for multi-label image classification.

Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, NIPS 2015; Media Coverage: [Bloomberg]
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu
NEW! [dataset and project page], [arXiv: 1505.05612], [bibtex]
We propose the mQA model, which is able to answer questions about the content of an image. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate the model. It contains over 120,000 images and 250,000 freestyle Chinese question-answer pairs with their English translations.

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images, ICCV 2015
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang and Alan L. Yuille
[project page], [dataset], [arXiv: 1504.06692], [bibtex], [pdf]
We propose a novel method that allows the model to enlarge its word dictionary to describe novel concepts using only a few images with sentence descriptions. In particular, we do not need to retrain the whole model from scratch every time we add a few images with new concepts.

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), Oral Presentation, ICLR 2015
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang and Alan L. Yuille
NEW! [TensorFlow m-RNN Reimplementation Code]
[Refined Image Features] [code for boosting the results], [arXiv:1412.6632], [pdf], [Project Page], [Poster], [bibtex]
An earlier version of this paper [arXiv] appeared in the NIPS 2014 Deep Learning Workshop under the title "Explain Images with Multimodal Recurrent Neural Networks".
We present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions and for bi-directional image-sentence retrieval. The method is simple and achieves state-of-the-art results on both caption generation and image-sentence retrieval tasks.
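A single decoding step of this kind of model can be sketched in a few lines of numpy: a word embedding feeds a recurrent layer, and a multimodal layer fuses the embedding, the recurrent state, and a CNN image feature before the softmax over the vocabulary. The dimensions, random weights, and ReLU choices below are illustrative assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H, M, D = 100, 16, 32, 24, 48  # vocab, embedding, recurrent, multimodal, image dims

# Illustrative parameters; in practice these are learned end to end.
W_embed = rng.standard_normal((V, E)) * 0.1
W_rec_w = rng.standard_normal((E, H)) * 0.1   # word embedding -> recurrent layer
W_rec_h = rng.standard_normal((H, H)) * 0.1   # previous state -> recurrent layer
W_m_w   = rng.standard_normal((E, M)) * 0.1   # word embedding -> multimodal layer
W_m_h   = rng.standard_normal((H, M)) * 0.1   # recurrent state -> multimodal layer
W_m_i   = rng.standard_normal((D, M)) * 0.1   # image feature  -> multimodal layer
W_out   = rng.standard_normal((M, V)) * 0.1   # multimodal layer -> vocab logits

def mrnn_step(word_id, h_prev, image_feat):
    """One decoding step: fuse the current word, the recurrent state,
    and the CNN image feature, then return the next-word distribution."""
    w = W_embed[word_id]
    h = np.maximum(0.0, w @ W_rec_w + h_prev @ W_rec_h)            # recurrent layer
    m = np.maximum(0.0, w @ W_m_w + h @ W_m_h + image_feat @ W_m_i)  # multimodal layer
    logits = m @ W_out
    probs = np.exp(logits - logits.max())                           # stable softmax
    return h, probs / probs.sum()

h, p = mrnn_step(word_id=3, h_prev=np.zeros(H), image_feat=rng.standard_normal(D))
print(p.shape, round(float(p.sum()), 6))  # (100,) 1.0
```

Caption generation then amounts to repeating this step, feeding each sampled (or argmax) word back in until an end-of-sentence token is produced.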

Learning from Weakly Supervised Data by the Expectation Loss SVM (e-SVM), NIPS 2014
Jun Zhu, Junhua Mao, and Alan L. Yuille
[pdf], [bibtex]
We present an expectation loss SVM (e-SVM) method for learning strong classifiers in a weakly supervised manner. We apply this method to weakly supervised semantic segmentation using only bounding-box annotations.

An Active Patch Model for Real World Texture and Appearance Classification, ECCV 2014
Junhua Mao, Jun Zhu, and Alan L. Yuille
[pdf], [Project Page and Datasets], [bibtex]
We develop a simple and intuitive method (Active Patch) that achieves state-of-the-art results on datasets ranging from homogeneous texture (e.g., material textures) to less homogeneous texture (e.g., animal fur) and inhomogeneous texture (the appearance patterns of vehicles).

Scale-based Region Growing For Scene Text Detection, Oral Presentation, ACM Multimedia 2013
Junhua Mao, Houqiang Li, Wengang Zhou, Shuicheng Yan and Qi Tian
[Project Page], [pdf], [Short video demo], [bibtex]
We propose a scale-based region growing method built on SIFT descriptors and neural networks that achieves state-of-the-art performance on scene text detection.


I took part in the Mathematical Contest in Modeling (MCM) in 2011 and 2012 with Meihui Xie and Wei Wu; we were Meritorious Winners in both years. I also took part in the Robot Game Contest, and our team reached the final. You can find more information here.

I also did an interesting class project that uses methods from both computer graphics and computer vision. Here is the project page.
