My Research Interests
My research focuses on computer vision, often motivated by the tasks of building an explainable visual Turing test and achieving robot autonomy through lifelong communicative learning. To accomplish my research goals, I am interested in pursuing a unified framework for machines to ALTER (Ask, Learn, Test, Explain and Refine) recursively in a principled way.
• AOGNets: Deep AND-OR Grammar Networks for Interpretable, Adaptive and Context-Aware Deep Representation Learning
A deep AND-OR grammar network (AOGNet) consists of a number of stages, each composed of a number of And-Or grammar (AOG) building blocks. Here, a 3-stage network is shown in (a), with 1 building block in the first and third stages and 2 in the second stage. (b) illustrates the AOG building block. The input feature map is treated as a sentence of N words (e.g., N=4). The And-Or graph of the building block is constructed to explore all possible parses of the sentence w.r.t. a binary composition rule. Each node applies some basic operation (e.g., an example is shown in (d), adapted from the bottleneck operator in ResNets) to its input. The inputs are computed as shown in (c). There are three types of nodes: an And-node explores composition, and its input is computed by concatenating the features of its child nodes; an Or-node represents alternative ways of composition in the spirit of exploitation, and its input is the element-wise sum of the features of its child nodes; and a Terminal-node takes as input a channel-wise slice of the input feature map (i.e., a k-gram). Note that the output feature map usually has smaller spatial dimensionality, through the sub-sampling used in the Terminal-node operations, and a larger number of channels. Different stages can use different And-Or graphs (the same one is shown in (a) for simplicity), and before entering the first AOG stage we can apply multiple steps of Conv-BatchNorm-ReLU or front-end stages from other networks such as ResNets (thus AOGNets can be integrated with many other networks). On top of an AOGNet-based feature backbone, we can further integrate a 2-D image grammar to learn interpretability-driven models.
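The parse space the building block explores can be sketched in a few lines. The following is an illustrative enumeration of the graph's structure only (segments of the N-word "sentence" and their binary split points), not the actual network code; `build_aog` and `num_parses` are hypothetical helper names:

```python
from functools import lru_cache

def build_aog(n):
    """Sketch of the AOG skeleton over an n-word sentence.

    Each segment [i, i+k) corresponds to a Terminal-node (a k-gram slice)
    and, for k > 1, an Or-node whose And-node children are the binary
    split points j, composing [i, i+j) with [i+j, i+k).
    Returns {(i, k): [split points]}.
    """
    graph = {}
    for k in range(1, n + 1):
        for i in range(0, n - k + 1):
            graph[(i, k)] = [j for j in range(1, k)]  # empty for Terminals
    return graph

@lru_cache(maxsize=None)
def num_parses(k):
    """Number of distinct binary parse trees of a k-word segment (Catalan)."""
    if k == 1:
        return 1
    return sum(num_parses(j) * num_parses(k - j) for j in range(1, k))

# For the N = 4 example in (b), the block covers num_parses(4) = 5 full
# binary parses, all sharing sub-segments in the same And-Or graph.
```

The key point the sketch makes concrete is that the And-Or graph represents all binary parses compactly by sharing segments, rather than enumerating each parse tree separately.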
• Deep Perception of the Visible and Deep Understanding of the Dark Jointly
A picture is worth a thousand words. What are the words? They refer to concepts and models, both visible and invisible (including patterns, symbols and logics). What are the structures organizing the words? They refer to image/video and language grammars (hierarchical, compositional, reconfigurable, causal and explainable). In addition, "The more you look, the more you see" (quoted from Prof. Stuart Geman). My research focuses on (i) statistical learning of large-scale and highly expressive hierarchical and compositional models from heterogeneous (big) data including images, videos and text, (ii) statistical inference by learning near-optimal cost-sensitive decision policies, and (iii) statistical theory of performance-guaranteed learning algorithms and optimally scheduled inference procedures, i.e., maximizing the "gain" (accuracy) while minimizing the "pain" (computational costs).
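The gain-vs-pain trade-off can be made concrete with a minimal sketch. This is an illustrative toy, not the actual inference procedure: it assumes a hypothetical set of candidate models, each with a known accuracy (gain) and compute cost (pain), and picks the one maximizing accuracy minus a priced cost:

```python
def select_model(models, lam):
    """Pick the model maximizing gain - lam * pain.

    models: list of (name, accuracy, cost) tuples (all hypothetical here);
    lam: price per unit of computational cost.
    """
    return max(models, key=lambda m: m[1] - lam * m[2])

# Toy candidates: (name, accuracy, relative compute cost).
models = [("fast", 0.80, 1.0), ("medium", 0.90, 5.0), ("slow", 0.95, 20.0)]

cheap_compute = select_model(models, lam=0.001)  # accuracy dominates
costly_compute = select_model(models, lam=0.1)   # cost dominates
```

With a low price on computation the accurate-but-slow model wins; as the price rises, the selection shifts toward cheaper models, which is the scheduling behavior a cost-sensitive decision policy learns per input rather than globally.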
• Lifelong Learning through ALTER Recursively (Demo)
Here, I articulate what I mean by machines that can ALTER recursively, using online object tracking as an example. Tracking is one of the innate capabilities animals and humans use for learning concepts (Susan Carey, The Origin of Concepts, Oxford Univ. Press, 2011).
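One turn of the ALTER loop on a frame stream can be sketched as follows. Everything here is a hypothetical stand-in (the `ToyTracker`, the running-mean "tracking", the confidence score and the oracle), meant only to show the control flow of Ask, Learn, Test, Explain and Refine, not the actual tracking system:

```python
class ToyTracker:
    """Toy stand-in for a learnable tracker: predicts the running mean."""
    def __init__(self):
        self.labels = []  # answers gathered by Asking
        self.notes = []   # stored explanations used for Refining

    def test(self, x):
        """Test: predict for frame x and report a confidence in [0, 1]."""
        if not self.labels:
            return None, 0.0  # no knowledge yet: zero confidence
        pred = sum(self.labels) / len(self.labels)
        return pred, 1.0 / (1.0 + abs(x - pred))

    def learn(self, x, label):
        """Learn: absorb the oracle's answer for frame x."""
        self.labels.append(label)

    def explain(self, x, pred):
        """Explain: placeholder explanation, the (input, prediction) pair."""
        return (x, pred)

    def refine(self, explanation):
        """Refine: placeholder refinement, archive the explanation."""
        self.notes.append(explanation)

def alter_step(model, frame, oracle, threshold=0.5):
    pred, conf = model.test(frame)             # Test
    if conf < threshold:                       # Ask only when uncertain
        model.learn(frame, oracle(frame))      # Learn from the answer
    model.refine(model.explain(frame, pred))   # Explain, then Refine
    return pred, conf

tracker = ToyTracker()
for frame in [1.0, 1.0, 1.0, 9.0]:
    alter_step(tracker, frame, oracle=lambda f: f)
```

Note the loop queries the oracle only on the first frame and on the surprising jump to 9.0; the point is that questions (and hence supervision) are triggered by the machine's own uncertainty, recursively, rather than scheduled in advance.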