|
I2T:
Image Parsing to Text Generation
|
|
 |
|
Fig. 1 Diagram of the I2T framework. Four key components (as highlighted by bold
fonts) are: (1) An image parsing engine that converts input images or video frames
into parse graphs. (2) An And-or Graph visual knowledge representation that provides
top-down hypotheses during image parsing and serves as an ontology when converting
parse graphs into semantic representations in RDF format. (3) A general knowledge
base embedded in the Semantic Web that enriches the semantic representations by
interconnecting several domain specific ontologies. (4) A text generation engine
that converts semantic representations into human readable and query-able natural
language descriptions.
|
Overview of the I2T framework
The fast growth of public photo and video sharing websites such as "Flickr" and "YouTube"
provides a huge corpus of unstructured image and video data over the Internet. Searching
and retrieving visual information from the Web, however, has been mostly limited
to the use of meta-data, user-annotated tags, captions and surrounding text (e.g.
the image search engine used by Google).
In this paper, we present an image parsing to text description (I2T) framework that
generates text descriptions in natural language based on understanding of image
and video content. Fig. 1 shows the diagram of the proposed
I2T framework with four key components:
- An image parsing engine that parses input images
into their constituent visual patterns, in a spirit similar to parsing sentences
in natural language.
- An And-or Graph (AoG) visual knowledge representation
that embodies vocabularies of visual elements including primitives, parts, objects
and scenes as well as a stochastic image grammar that specifies syntactic (compositional)
relations and semantic relations (e.g. categorical, spatial, temporal and functional
relations) between these visual elements. The categorical relationships are inherited
from WordNet, a lexical semantic network
of English. The AoG not only guides the image parsing engine with top-down hypotheses
but also serves as an ontology for mapping parse graphs into semantic representation
(formal and unambiguous knowledge representation).
- A Semantic Web that interconnects different domain
specific ontologies with semantic representation of parse graphs. This step helps
to enrich parse graphs derived purely from visual cues with other sources of semantic
information. For example, the input picture in Fig. 2 has a
text tag "Oven's mouth river". With the help of a GIS database embedded
in the Semantic Web, we are able to relate this picture to a geo-location: "Oven's
mouth preserve of Maine state". Another benefit of using Semantic Web technology
is that end users can not only access the semantic information of an image by reading
the natural language text report but can also query the Semantic Web using standard
semantic querying languages.
- A text generation engine that converts semantic
representations into human readable and query-able natural language descriptions.
|
|
|
The task of the I2T framework can be illustrated in Fig. 2.
|
 |
Fig. 2 Two major tasks of the I2T framework: (a) image parsing
and (b) text description.
|
|
|
As simple as the task in Fig. 2 may seem to a human, it is by
no means an easy task for any computer vision system today, especially when input
images are of great diversity in content (i.e. number and category of objects)
and structure (i.e. spatial layout of objects), which is certainly the case for
images from the Internet. Within a controlled domain, however, such as the two
case study systems (video surveillance and driving scene parsing), automatic image
parsing is practical. For this reason, our objective in this paper is twofold:
- We use a semi-automatic method (interactive) to parse general images from the Internet
in order to build a large-scale ground truth image dataset. Then we learn the AoG
from this dataset for visual knowledge representation. Our goal is to make the parsing
process more and more automatic using the learned AoG models.
- We use automatic methods to parse images and video in specific domains. For example,
in the surveillance system, the camera is static, so we only need to parse the background
(interactively) once at the beginning, and all other components run automatically.
In the automatic driving scene parsing system, the camera looks forward at
roads and streets. Although the image parsing algorithm may produce some errors,
it is fully automatic.
|
|
|
The major challenge in realizing the I2T task on general images from the Internet
is the so-called semantic gap, which is defined as the discrepancy
between human interpretations of image information and those currently derivable
by a computer.
From an artificial intelligence (AI) point of view, bridging the semantic gap
is equivalent to solving a visual symbol grounding problem. We may further
decompose the symbol grounding problem in the visual domain into two levels:
- Categorical representations, which are learned and innate feature-detectors
that pick out the invariant features of object and event categories from their sensory
projections (images). Each category corresponds to an elementary symbol (visual
element).
- Symbolic representations, which consist of symbol strings describing
semantic relations between elementary symbols, such as category membership relations
(e.g. a zebra is a horse that has stripes), functional
relations (e.g. the man in Fig. 2 is the owner
of the backpack), and so on. With these semantic relationships, basic elementary
symbols grounded in categorical representations (e.g. horse and stripe)
can be used to compose new grounded symbols using rules (e.g. zebra = horse
+ stripes).
Categorical representation has been the mainstream goal of the computer
vision community for the past decades. Extensive previous work in image annotation
and video annotation (e.g. that reported under the TREC Video Retrieval Evaluation
program (TRECVID)) has mainly focused on categorical representation.
Symbolic representations have gained attention in recent years. Marszalek
and Schmid used semantic hierarchies (categorical relations) of WordNet to integrate
prior knowledge about inter-class relationships into visual appearance learning.
Following this line, Fei-Fei Li et al. launched the
ImageNet project, aiming to populate a majority of the synsets in WordNet
with an average of 500-1,000 images each, selected manually by humans.
The And-or Graph (AoG) model proposed in this paper is a unified model for categorical
and symbolic representations.
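As a rough illustration of how an AoG relates to individual parse graphs, the following minimal Python sketch enumerates the parse graphs of a toy clock model. It assumes a much simplified AoG: And-nodes require all of their children, Or-nodes select exactly one child, and there are no probabilities or spatial relations. The `Node` class, the `parse_graphs` function and the clock vocabulary are hypothetical names chosen for this example, not part of the I2T system.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List


@dataclass
class Node:
    """A node in a toy And-Or graph.

    kind is "and" (all children are required), "or" (exactly one child
    is chosen), or "leaf" (a terminal visual element).
    """
    name: str
    kind: str = "leaf"
    children: List["Node"] = field(default_factory=list)


def parse_graphs(node: Node) -> List[List[str]]:
    """Enumerate every parse graph below `node` as a list of leaf names."""
    if node.kind == "leaf":
        return [[node.name]]
    if node.kind == "or":
        # An Or-node contributes the alternatives of exactly one child.
        return [pg for child in node.children for pg in parse_graphs(child)]
    # An And-node combines one alternative from each of its children.
    combos = product(*(parse_graphs(child) for child in node.children))
    return [[leaf for part in combo for leaf in part] for combo in combos]


# Hypothetical clock vocabulary, loosely following the clock example of Fig. 3.
frame = Node("frame", "or", [Node("round frame"), Node("square frame")])
hands = Node("hands", "and", [Node("hour hand"), Node("minute hand")])
numbers = Node("numbers", "or", [Node("arabic numerals"), Node("no numerals")])
clock = Node("clock", "and", [frame, hands, numbers])

for pg in parse_graphs(clock):
    print(pg)
```

Each printed list corresponds to one parse graph, i.e. one choice at every Or-node. The stochastic grammar described in the paper would additionally attach probabilities to those choices and relations between the selected nodes, which this sketch omits.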
|
|
|
|
Fig. 3 Categorical representation with AoG. The left panel illustrates two parse
graphs of clocks. The right panel shows an AoG of a clock generated by merging multiple
parse graphs together. The dark arrows in the AoG illustrate one parse graph (a
round clock). Some leaf nodes are omitted from the graph for clarity.
|
(i) Relation between image primitives
(ii) Relation between object parts
(iii) Relation between objects
|
Fig. 4 With semantic relations learned from annotated data, the AoG is also a symbolic
representation system.
|
|
|
Building an image dataset with manually annotated parse graphs provides the training
examples needed for learning the categorical image representation in the AoG model.
Properly annotated datasets also provide the training examples needed for learning
semantic relationships. This dataset must be large-scale in order to provide enough
instances to cover possible variations of objects.
To meet this need, Song-Chun Zhu and his colleagues founded an independent non-profit
research institute called Lotus Hill Institute (LHI),
which started to operate in the summer of 2005. It has a full-time annotation team
for parsing the image structures and a development team for the annotation tools
and database construction. Each image or object is parsed interactively into a parse
graph where objects are associated with WordNet synsets to inherit categorical relationships.
Functional relationships such as "carry" and "eat" are also specified manually. Fig. 5
lists an inventory of the current ground truth dataset parsed at LHI. It now has
over 1 million images (or video frames) parsed, covering about 300 object categories.
|
|
|
|
Fig. 6 Inventory of the LHI ground-truth image database. An annotation example for each
category in this figure, as well as a small publicly available dataset, can be found
on the website www.imageparsing.com.
|
|
To cope with the need to label tens of thousands of images, an interactive image
parsing tool named Interactive ImageParser (IIP) was developed to facilitate
the manual annotation task. As stated in a report,
this dataset provides ground truth annotation for a range of vision tasks, from
high-level scene classification and object segmentation to low-level edge detection
and edge attribute annotation. Compared with other public datasets collected by various
groups, such as MIT LabelMe,
ImageNet, the MSRC dataset,
Caltech 101 and 256, and the
Berkeley segmentation dataset, the LHI dataset not only provides finer segmentation
but also provides extra information such as compositional hierarchies and functional
relationships.
|
|
- Semantic representation is a formal way of expressing inferred
image and video content and provides a bridge to content knowledge management. This
is needed for a variety of information exploitation tasks, including content-based
image and video indexing and retrieval, forensic analysis, data fusion, and data
mining. The representation should be unambiguous, well-formed, flexible, and extensible
for representing different object classes, their properties, and relations. With
an image ontology based on the AoG, we can convert the parse graph representation
of an image into semantic representation using RDF format.
- User queries. With visual content published in OWL, users can
now perform content-based searches using
SPARQL, the query language for the Semantic Web
released by the World Wide Web Consortium (W3C).
With SPARQL, users or autonomous data mining engines can perform searches by expressing
queries based on semantics. This improves usability as the details of database models
are hidden from the user. The versatile nature of SPARQL allows users to query multiple
OWL documents collectively, which enhances data integration from multiple knowledge
sources. For example, suppose that a car in an image is annotated as a "sedan"
while the user performs a search using the term "automobile"; SPARQL is still
able to retrieve the result because WordNet identifies the two words as
synonyms.
- Text generation. While OWL provides an unambiguous representation
for image and video content, it is not easy for humans to read. Natural language
text remains the best way of describing image and video content to humans and
can be used for image captions, scene descriptions, and event alerts. The text generation
process is usually designed as a pipeline of two distinct tasks: text planning and
text realization.
- The text planner selects the content to be expressed and decides how
to organize the content into sections, paragraphs, and sentences.
- The text realizer generates each sentence using the correct grammatical
structure.
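The three steps above (semantic representation, user queries, text generation) can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions: tuples stand in for RDF triples, a hand-written dictionary stands in for the lexical relations that WordNet would supply, a Python function stands in for a SPARQL query, and a single sentence template stands in for the text realizer. All identifiers (`triples`, `EQUIVALENT_TERMS`, `find_by_type`, `realize`) are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical parse-graph output, flattened into (subject, predicate, object)
# triples in the spirit of RDF. All identifiers here are illustrative.
triples: List[Triple] = [
    ("obj1", "type", "sedan"),
    ("obj2", "type", "person"),
    ("obj2", "stands_near", "obj1"),
]

# Hand-written stand-in for the lexical relations that WordNet would supply.
EQUIVALENT_TERMS = {
    "automobile": {"automobile", "car", "sedan"},
}


def find_by_type(store: List[Triple], term: str) -> List[str]:
    """Return subjects whose type matches `term` or a term treated as equivalent."""
    accepted: Set[str] = EQUIVALENT_TERMS.get(term, {term})
    return [s for (s, p, o) in store if p == "type" and o in accepted]


def realize(store: List[Triple]) -> List[str]:
    """Turn each non-type triple into one template sentence."""
    types = {s: o for (s, p, o) in store if p == "type"}
    sentences = []
    for s, p, o in store:
        if p == "type":
            continue
        verb = p.replace("_", " ")
        sentences.append(f"A {types[s]} {verb} a {types[o]}.")
    return sentences


print(find_by_type(triples, "automobile"))  # the sedan is found via the term table
print(realize(triples))
```

A search for "automobile" returns the object annotated as "sedan" because the lookup table treats the two terms as equivalent, which mirrors the query example in the text. A real deployment would express the same search as a SPARQL query over OWL documents rather than a Python function.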
|
|
|
The overall architecture of the system is shown in Fig. 7 (a), which resembles the
diagram for static images shown in Fig. 1 except for two extra steps for analyzing video
content, namely object tracking and event inference.
- 1. An interactive image parser generates a parse graph of scene context from the
first frame of the input video. The parse graph is further translated into a semantic
representation using techniques described previously. Since the camera in this system
is static, the parsing result of the first frame can be used throughout the entire
video.
- 2. The system tracks moving objects in the video (e.g. vehicles and pedestrians)
and generates their trajectories automatically.
- 3. From the scene context and object trajectories, an event inference engine extracts
descriptive information about video events, including semantic and contextual information
as well as relationships between activities performed by different agents. The
Video Event Markup Language (VEML) (Nevatia et al., 2004) is adopted for semantic
representation of the events.
- 4. A text generation engine is used to convert semantic representation of scene
context and video events into a text description.
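The four steps above can be illustrated with a minimal Python sketch. It assumes that the scene context is reduced to one axis-aligned region, that the tracker already delivers one trajectory per object, and that the only events of interest are entering and leaving that region. The names `Region`, `infer_events` and `describe`, the region coordinates and the sentence template are hypothetical; the actual system uses VEML and a full text generation engine.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Event = Tuple[int, str, str, str]


@dataclass
class Region:
    """Axis-aligned scene region taken from the first-frame parse (hypothetical)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        x, y = p
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def infer_events(label: str, track: List[Point], region: Region) -> List[Event]:
    """Emit (frame, agent, action, region) whenever the track crosses the region boundary."""
    events = []
    for frame in range(1, len(track)):
        before = region.contains(track[frame - 1])
        after = region.contains(track[frame])
        if not before and after:
            events.append((frame, label, "enters", region.name))
        elif before and not after:
            events.append((frame, label, "leaves", region.name))
    return events


def describe(event: Event) -> str:
    """Realize one event as a template sentence."""
    frame, agent, action, place = event
    return f"At frame {frame}, a {agent} {action} the {place}."


parking_lot = Region("parking lot", 10, 10, 20, 20)
track = [(0, 15), (5, 15), (12, 15), (18, 15), (25, 15)]
for event in infer_events("vehicle", track, parking_lot):
    print(describe(event))
```

Because the camera is static, the region only has to be defined once, while the event inference and text steps run on every new trajectory.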
|
(a) Diagram of the video-to-text system
(b) Parse graph of the first frame
|
Fig. 7 Diagram of the video surveillance system
|
 |
Fig. 8 Samples of generated text and corresponding video snapshots.
|
|
|
As illustrated by Fig. 9, we build a novel AoG for driving scenes using an X-shaped-rays
model at low resolution to obtain an efficient estimate of the scene configuration. Then
we further detect objects of interest in the "foreground", such as cars and pedestrians,
and classify regions at high resolution under the scene context. We exploit several
useful features from the X-shaped-rays model to classify different scenes, such
as the four intersection angles, the area ratio of sky to the whole image, and the
area ratio of buildings to the whole image. The final detection results for
three different types of driving scenes are illustrated in Fig. 10.
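The area-ratio features mentioned above are simple to compute once each low-resolution pixel carries a region label. The following Python sketch assumes such a label map is already available; the function name `area_ratios`, the label names and the tiny 4x4 map are hypothetical, and the intersection-angle features of the X-shaped-rays model are not covered.

```python
from collections import Counter
from typing import Dict, List


def area_ratios(label_map: List[List[str]]) -> Dict[str, float]:
    """Return the fraction of pixels assigned to each region label."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical 4x4 label map standing in for a low-resolution driving scene.
scene = [
    ["sky", "sky", "sky", "sky"],
    ["building", "sky", "sky", "building"],
    ["building", "road", "road", "building"],
    ["road", "road", "road", "road"],
]

ratios = area_ratios(scene)
print(ratios["sky"], ratios["building"])
```

Ratios such as these could then be fed, together with other features, into whatever scene classifier is used; the choice of classifier is outside the scope of this sketch.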
|
|
|
|
Fig. 9 (a) At a low resolution (e.g. 32×32 pixels), a driving scene can be approximated
by an X-shaped-rays model with four components (left, right, bottom and top). (b)
The AoG used for parsing driving scenes.
|
|
|
Fig. 10 Parsing and text generation results for three example driving scenes.
|
|
Benjamin Yao, Xiong Yang, Liang Lin, Mun Wai Lee and Song-Chun Zhu. I2T: Image Parsing
to Text Description. Proceedings of the IEEE (invited for the special issue
on Internet Vision).
|