Joint Parsing: Spatial, Temporal, and Causal Inference for
Understanding Images and Video
We demonstrate an end-to-end vision system for understanding images and video. The system integrates spatial, temporal, and causal (S/T/C) parsing tasks in a unified And-Or Graph (AOG) representation; outputs object, scene, event, intent, and causality information in S/T/C parse graphs; generates narrative text descriptions in RDF format; and answers natural language queries about who, what, where, when, and why.
An AOG is a hierarchical structure with alternating levels of And-nodes and Or-nodes. The And-nodes represent decompositions of elements, e.g., in the spatial dimension a “table” can be decomposed into “top” and “legs,” and in the causal dimension a fluent change “light turns on” can be caused by a precondition “light off” and an action “touch light switch.” The Or-nodes represent selections among alternative sub-elements, e.g., in the temporal dimension a “use computer” event can be realized as either “type on keyboard” or “move mouse.” In particular, there is a special type of Or-node, the set Or-node, with an arbitrary number of children, e.g., in the spatial dimension there could be one, two, or more “chair” nodes under the “furniture” node.
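The And/Or structure above can be sketched as a small recursive data type. The following is a minimal illustration, not the paper's actual implementation; the node kinds and the `expand` method are assumptions introduced for clarity, using the "table" and "use computer" examples from the text:

```python
# Minimal sketch of an And-Or Graph (AOG). Node kinds and the
# expand() method are hypothetical, for illustration only.

class Node:
    def __init__(self, name, kind, children=None):
        # kind: "and" (decomposition), "or" (selection among
        # alternatives), "set-or" (arbitrary number of children),
        # or "leaf" (terminal element)
        self.name = name
        self.kind = kind
        self.children = children or []

    def expand(self, choose):
        """Expand the node into a list of leaf names (a parse-tree
        fragment). `choose` picks one child index at each Or-node."""
        if self.kind == "leaf":
            return [self.name]
        if self.kind in ("and", "set-or"):
            # And-nodes (and set Or-nodes, once instantiated)
            # include all of their children.
            out = []
            for child in self.children:
                out.extend(child.expand(choose))
            return out
        # Or-node: follow exactly one selected branch.
        return self.children[choose(self)].expand(choose)

# Spatial example from the text: a "table" And-node decomposes
# into "top" and "legs".
table = Node("table", "and",
             [Node("top", "leaf"), Node("legs", "leaf")])

# Temporal example: a "use computer" Or-node selects one sub-event.
use_computer = Node("use computer", "or",
                    [Node("type on keyboard", "leaf"),
                     Node("move mouse", "leaf")])

print(table.expand(lambda n: 0))         # ['top', 'legs']
print(use_computer.expand(lambda n: 0))  # ['type on keyboard']
print(use_computer.expand(lambda n: 1))  # ['move mouse']
```

Selecting one branch at every Or-node turns the AOG into a single parse tree, which is how a parse graph in the sense above is derived from the grammar.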