I2T: Image Parsing to Text Generation

Fig.1 Diagram of the I2T framework. Four key components (as highlighted by bold fonts) are: (1) An image parsing engine that converts input images or video frames into parse graphs. (2) An And-or Graph visual knowledge representation that provides top-down hypotheses during image parsing and serves as an ontology when converting parse graphs into semantic representations in RDF format. (3) A general knowledge base embedded in the Semantic Web that enriches the semantic representations by interconnecting several domain specific ontologies. (4) A text generation engine that converts semantic representations into human readable and query-able natural language descriptions.

Overview of the I2T framework

The fast growth of public photo- and video-sharing websites such as Flickr and YouTube provides a huge corpus of unstructured image and video data on the Internet. Searching and retrieving visual information from the Web, however, has been mostly limited to the use of meta-data, user-annotated tags, captions and surrounding text (e.g. the image search engine used by Google).

In this paper, we present an image parsing to text description (I2T) framework that generates text descriptions in natural language based on understanding of image and video content. Fig. 1 shows the diagram of the proposed I2T framework with four key components:

  • An image parsing engine that parses input images into their constituent visual patterns, in a spirit similar to parsing sentences in natural language.
  • An And-or Graph (AoG) visual knowledge representation that embodies vocabularies of visual elements including primitives, parts, objects and scenes as well as a stochastic image grammar that specifies syntactic (compositional) relations and semantic relations (e.g. categorical, spatial, temporal and functional relations) between these visual elements. The categorical relationships are inherited from WordNet, a lexical semantic network of English. The AoG not only guides the image parsing engine with top-down hypotheses but also serves as an ontology for mapping parse graphs into semantic representation (formal and unambiguous knowledge representation).
  • A Semantic Web that interconnects different domain specific ontologies with the semantic representation of parse graphs. This step enriches parse graphs derived purely from visual cues with other sources of semantic information. For example, the input picture in Fig. 2 has a text tag "Oven's mouth river". With the help of a GIS database embedded in the Semantic Web, we are able to relate this picture to a geo-location: "Oven's mouth preserve of Maine state". Another benefit of using Semantic Web technology is that end users can not only access the semantic information of an image by reading the natural language text report but can also query the Semantic Web using standard semantic querying languages.
  • A text generation engine that converts semantic representations into human readable and query-able natural language descriptions.


The task of the I2T framework is illustrated in Fig. 2.

Fig. 2 Two major tasks of the I2T framework: (a) image parsing and (b) text description.

As simple as the task in Fig. 2 may seem for a human, it is by no means easy for any computer vision system today, especially when the input images are of great diversity in content (i.e. the number and category of objects) and structure (i.e. the spatial layout of objects), which is certainly the case for images from the Internet. In certain controlled domains, however, such as the two case-study systems presented below, automatic image parsing is practical. For this reason, our objective in this paper is twofold:

  • We use a semi-automatic method (interactive) to parse general images from the Internet in order to build a large-scale ground truth image dataset. Then we learn the AoG from this dataset for visual knowledge representation. Our goal is to make the parsing process more and more automatic using the learned AoG models.
  • We use automatic methods to parse image/video in specific domains. For example, in the surveillance system, the camera is static, so we only need to parse the background (interactively) once at the beginning, and all other components are done automatically. In the automatic driving scene parsing system, the camera is forward looking at roads and streets. Although the image parsing algorithm may produce some errors, it is fully automatic.

And-or Graph as a unified model for categorical and symbolic representations

The major challenge in realizing the I2T task on general images from the Internet is the so-called semantic gap, defined as the discrepancy between human interpretations of image information and those currently derivable by a computer.
From an artificial intelligence (AI) point of view, bridging the semantic gap is equivalent to solving a visual symbol grounding problem, which we further decompose into two levels:

  • Categorical representations, which are learned and innate feature-detectors that pick out the invariant features of object and event categories from their sensory projections (images). Each category corresponds to an elementary symbol (visual element).
  • Symbolic representations, which consist of symbol strings describing semantic relations between elementary symbols, such as category membership relations (e.g. a zebra is a horse that has stripes), functional relations (e.g. the man in Fig. 2 is the owner of the backpack), and so on. With these semantic relationships, basic elementary symbols grounded in categorical representations (e.g. horse and stripe) can be used to compose new grounded symbols via rules (e.g. zebra = horse + stripes).
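The two levels above can be made concrete with a toy sketch. Everything below (the detector names, the feature dictionary, the composition table) is illustrative, not part of the actual I2T system; it only shows how symbols grounded at the categorical level compose into new grounded symbols at the symbolic level:

```python
# Categorical level: elementary symbols grounded in feature detectors.
# Each "detector" is stubbed here as a predicate over a feature dict.
def detect_horse(features):
    return features.get("shape") == "horse-like"

def detect_stripes(features):
    return features.get("texture") == "striped"

# Symbolic level: composition rules build new grounded symbols from
# already-grounded ones (e.g. zebra = horse + stripes).
COMPOSITION_RULES = {
    "zebra": [detect_horse, detect_stripes],
}

def ground(symbol, features):
    """A composed symbol is grounded if all its constituent detectors fire."""
    detectors = COMPOSITION_RULES.get(symbol)
    if detectors is None:
        raise KeyError(f"no composition rule for {symbol!r}")
    return all(d(features) for d in detectors)

image_features = {"shape": "horse-like", "texture": "striped"}
print(ground("zebra", image_features))  # -> True
```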

Categorical representation has been the mainstream goal of the computer vision community for the past decades. Extensive previous work in image and video annotation (e.g. reported under the TREC Video Retrieval Evaluation program (TRECVID)) has mainly focused on categorical representation.
Symbolic representation has gained attention in recent years. Marszalek and Schmid used the semantic hierarchies (categorical relations) of WordNet to integrate prior knowledge about inter-class relationships into visual appearance learning. Following this line, Fei-Fei Li et al. launched the ImageNet project, which aims to populate a majority of the synsets in WordNet with an average of 500-1,000 images selected manually by humans.

The And-or Graph (AoG) model proposed in this paper is a unified model for categorical and symbolic representations.

Fig. 3 Categorical representation with the AoG. The left panel illustrates two parse graphs of clocks. The right panel shows an AoG of clocks generated by merging multiple parse graphs together. The dark arrows in the AoG illustrate one parse graph (a round clock). Some leaf nodes are omitted from the graph for clarity.
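The AND/OR structure described above can be sketched as a small data structure. The node names below are hypothetical stand-ins for the clock example; the point is only that AND nodes decompose into all of their children, OR nodes choose one child, and a parse graph is what remains after every OR node commits to one choice:

```python
# A minimal And-Or Graph sketch (node names are illustrative).
AOG = {
    "clock": ("AND", ["frame", "hands"]),
    "frame": ("OR",  ["round_frame", "square_frame"]),
    "hands": ("AND", ["hour_hand", "minute_hand"]),
}

def parse_graph(node, or_choices):
    """Expand `node`, resolving each OR node via the `or_choices` dict."""
    if node not in AOG:                      # leaf: a terminal visual element
        return node
    kind, children = AOG[node]
    if kind == "OR":
        return parse_graph(or_choices[node], or_choices)
    return {node: [parse_graph(c, or_choices) for c in children]}

# One parse graph: a round clock (cf. the dark arrows in Fig. 3).
print(parse_graph("clock", {"frame": "round_frame"}))
# -> {'clock': ['round_frame', {'hands': ['hour_hand', 'minute_hand']}]}
```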


(i) Relation between image primitives

(ii) Relation between object parts

(iii) Relation between objects

Fig. 4 With semantic relations learned from annotated data, the AoG is also a symbolic representation system.

LHI dataset

Building an image dataset with manually annotated parse graphs provides the training examples needed for learning the categorical image representation in the AoG model. Properly annotated datasets also provide the training examples needed for learning semantic relationships. This dataset must be large-scale in order to provide enough instances to cover possible variations of objects.
To meet this need, Song-Chun Zhu and his colleagues founded an independent non-profit research institute called the Lotus Hill Institute (LHI), which started to operate in the summer of 2005. It has a full-time annotation team for parsing image structures and a development team for the annotation tools and database construction. Each image or object is parsed interactively into a parse graph, in which objects are associated with WordNet synsets to inherit categorical relationships. Functional relationships such as carry and eat are also specified manually. Fig. 6 lists an inventory of the current ground truth dataset parsed at LHI. It now has over 1 million images (or video frames) parsed, covering about 300 object categories.

Fig. 6 Inventory of the LHI ground-truth image database. Annotation example of each category in this figure as well as a small publicly available dataset can be found on the Website www.imageparsing.com.

To cope with the need to label tens of thousands of images, an interactive image parsing tool, named Interactive ImageParser (IIP), was developed to facilitate the manual annotation task. As stated in a report, this dataset provides ground truth annotation for a range of vision tasks, from high-level scene classification and object segmentation to low-level edge detection and edge-attribute annotation. Compared with other public datasets collected by various groups, such as MIT LabelMe, ImageNet, the MSRC dataset, Caltech 101 and 256, and the Berkeley segmentation dataset, the LHI dataset not only provides finer segmentation but also provides extra information such as compositional hierarchies and functional relationships.

Semantic Representation, user queries and text generation

  • Semantic representation is a formal way of expressing inferred image and video content and provides a bridge to content knowledge management. This is needed for a variety of information exploitation tasks, including content-based image and video indexing and retrieval, forensic analysis, data fusion, and data mining. The representation should be unambiguous, well-formed, flexible, and extensible for representing different object classes, their properties, and relations. With an image ontology based on the AoG, we can convert the parse graph representation of an image into semantic representation using RDF format.
  • User queries. With visual content published in OWL, users can now perform content-based searches using SPARQL, the query language for the Semantic Web released by the World Wide Web Consortium (W3C). With SPARQL, users or autonomous data mining engines can perform searches by expressing queries based on semantics. This improves usability, as the details of the database models are hidden from the user. The versatile nature of SPARQL allows users to query multiple OWL documents collectively, which enhances data integration from multiple knowledge sources. For example, suppose that a car in an image is annotated as a "sedan" while the user performs a search using the term "automobile"; SPARQL is still able to retrieve the result because WordNet identifies the two words as synonyms.
  • Text generation. While OWL provides an unambiguous representation of image and video content, it is not easy for humans to read. Natural language text remains the best way to describe image and video content to humans and can be used for image captions, scene descriptions, and event alerts. The text generation process is usually designed as a pipeline of two distinct tasks: text planning and text realization.
    • The text planner selects the content to be expressed and decides how to organize it into sections, paragraphs, and sentences.
    • The text realizer generates each sentence using the correct grammatical structure.
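The three bullets above can be strung together in a toy end-to-end sketch: parse-graph content stored as RDF-style triples, a pattern query with synonym expansion standing in for SPARQL plus WordNet, and a two-stage planner/realizer. All vocabulary, synonyms, and templates here are made up for illustration:

```python
# Parse-graph content as subject-predicate-object triples (toy data).
TRIPLES = [
    ("car1", "isa", "sedan"),
    ("car1", "hasColor", "red"),
    ("scene1", "contains", "car1"),
]

SYNONYMS = {"automobile": {"automobile", "car", "sedan"}}  # stand-in for WordNet

def query(subject=None, predicate=None, obj=None):
    """Match triples; the object term is expanded through the synonym table."""
    objs = SYNONYMS.get(obj, {obj}) if obj else None
    return [(s, p, o) for (s, p, o) in TRIPLES
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (objs is None or o in objs)]

def plan(entity):
    """Text planner: select the facts worth reporting about `entity`."""
    return [(p, o) for (s, p, o) in TRIPLES if s == entity]

def realize(entity, facts):
    """Text realizer: render the planned facts with simple templates."""
    templates = {"isa": "{e} is a {o}.", "hasColor": "{e} is {o}."}
    return " ".join(templates[p].format(e=entity, o=o)
                    for p, o in facts if p in templates)

print(query(predicate="isa", obj="automobile"))   # the "sedan" still matches
print(realize("car1", plan("car1")))
```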

Case study 1: Video surveillance system

The overall architecture of the system is shown in Fig. 7 (a). It resembles the diagram for static images shown in Fig. 1 except for two extra steps for analyzing video content, namely object tracking and event inference.

  • 1. An interactive image parser generates a parse graph of the scene context from the first frame of the input video. The parse graph is further translated into a semantic representation using the techniques described previously. Since the camera in this system is static, the parsing result of the first frame can be used throughout the entire video.
  • 2. The system tracks moving objects in the video (e.g. vehicles and pedestrians) and generates their trajectories automatically.
  • 3. From the scene context and object trajectories, an event inference engine extracts descriptive information about video events, including semantic and contextual information as well as relationships between activities performed by different agents. The Video Event Markup Language (VEML) is adopted for the semantic representation of events.
  • 4. A text generation engine is used to convert semantic representation of scene context and video events into a text description.
Fig. 7 Diagram of the video surveillance system: (a) diagram of the video-to-text system; (b) parse graph of the first frame.
Fig. 8 Samples of generated text and corresponding video snapshots.
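Steps 3 and 4 of the pipeline can be illustrated with a minimal sketch: a region-crossing test over a tracked trajectory yields enter/exit events, which a template then renders as text. The region geometry, the trajectory, and the sentence template are all fabricated for illustration:

```python
# A labeled scene region (toy axis-aligned rectangle).
REGION = {"name": "parking lot", "x0": 5, "y0": 0, "x1": 10, "y1": 10}

def inside(pt, r):
    return r["x0"] <= pt[0] <= r["x1"] and r["y0"] <= pt[1] <= r["y1"]

def infer_events(agent, trajectory, region):
    """Emit (agent, verb, region) whenever the trajectory crosses the boundary."""
    events = []
    for prev, cur in zip(trajectory, trajectory[1:]):
        if not inside(prev, region) and inside(cur, region):
            events.append((agent, "enters", region["name"]))
        elif inside(prev, region) and not inside(cur, region):
            events.append((agent, "exits", region["name"]))
    return events

# A tracked object passes through the region from left to right.
trajectory = [(0, 2), (3, 2), (6, 2), (8, 2), (12, 2)]
for agent, verb, place in infer_events("vehicle-1", trajectory, REGION):
    print(f"The {agent} {verb} the {place}.")
```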

Case study 2: Automatic driving scene parsing

As illustrated in Fig. 9, we build a novel AoG for driving scenes using an X-shaped-rays model at low resolution to obtain an efficient estimate of the scene configuration. We then detect objects of interest in the "foreground", such as cars and pedestrians, and classify regions at high resolution under the scene context. We exploit several useful features from the X-shaped-rays model to classify different scenes, such as the four intersection angles, the area ratio of sky to the whole image, and the area ratio of building to the whole image. The final detection results for three different types of driving scenes are illustrated in Fig. 10.
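Two of the scene features listed above, the area ratios of sky and building, can be sketched directly from a low-resolution label map. The 8×8 map below is fabricated for illustration; the real system would operate on something closer to a 32×32 segmentation:

```python
from collections import Counter

# A toy low-resolution segmentation of a street scene (labels per pixel).
LABEL_MAP = [
    ["sky"] * 8,
    ["sky"] * 8,
    ["building"] * 3 + ["sky"] * 2 + ["building"] * 3,
    ["building"] * 3 + ["road"] * 2 + ["building"] * 3,
    ["building"] * 2 + ["road"] * 4 + ["building"] * 2,
    ["road"] * 8,
    ["road"] * 8,
    ["road"] * 8,
]

def area_ratio(label_map, label):
    """Fraction of pixels carrying `label` -- one scene-classification feature."""
    counts = Counter(l for row in label_map for l in row)
    return counts[label] / sum(counts.values())

print(round(area_ratio(LABEL_MAP, "sky"), 3))       # sky fraction
print(round(area_ratio(LABEL_MAP, "building"), 3))  # building fraction
```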

Fig. 9 (a) At low resolution (e.g. 32×32 pixels), a driving scene can be approximated by an X-shaped-rays model with four components (left, right, bottom and top). (b) The AoG used for parsing driving scenes.

Fig. 10 Parsing and text generation results for three example driving scenes.

Related Publication:

Benjamin Yao, Xiong Yang, Liang Lin, Mun Wai Lee and Song-Chun Zhu, "I2T: Image Parsing to Text Description", Proceedings of the IEEE (invited for the special issue on Internet Vision) [pdf].
