
Tianfu (Matt) Wu

天地有大美而不言,四时有明法而不议,万物有成理而不说。圣人者,原天地之美而达万物之理。--《庄子.知北游》~~ "Heaven and Earth possess great beauty yet say nothing; the four seasons follow clear laws yet debate nothing; the myriad things hold complete principles yet speak nothing. The sage traces the beauty of Heaven and Earth and penetrates the principles of the myriad things." (Zhuangzi, "Knowledge Wanders North") One ultimate goal of my work is to learn interpretable models and explainable algorithms.


My Re-se-Arch Interests

My research focuses on computer Vis-Eye-on, often motivated by the tasks of building an explainable visual Turing test and achieving robot autonomy through lifelong communicative learning. To accomplish these goals, I am interested in pursuing a unified framework in which machines ALTER (Ask, Learn, Test, Explain and Refine) recursively in a principled way.

• AOGNets: Deep AND-OR Grammar Networks for Interpretable, Adaptive and Context-Aware Deep Representation Learning


A deep AND-OR grammar network (AOGNet) consists of a number of stages, each composed of a number of And-Or grammar (AOG) building blocks. A 3-stage network is shown in (a), with one building block in the first and third stages and two in the second. (b) illustrates the AOG building block. The input feature map is treated as a sentence of N words (e.g., N=4). The And-Or graph of the building block is constructed to explore all possible parses of the sentence w.r.t. a binary composition rule. Each node applies a basic operation to its input (e.g., the operator shown in (d), adapted from the bottleneck operator in ResNets); the inputs are computed as shown in (c). There are three types of nodes: an And-node explores composition, and its input is computed by concatenating the features of its child nodes; an Or-node represents alternative ways of composition in the spirit of exploitation, and its input is the element-wise sum of the features of its child nodes; and a Terminal-node takes as input a channel-wise slice of the input feature map (i.e., a k-gram). Note that the output feature map usually has smaller spatial dimensionality, due to the sub-sampling used in the Terminal-node operations, and a larger number of channels. Different stages can use different And-Or graphs (the same one is shown throughout (a) for simplicity), and before entering the first AOG stage we can apply multiple Conv-BatchNorm-ReLU steps or front-end stages from other networks such as ResNets (thus AOGNets can be integrated with many other networks). On top of an AOGNet feature backbone, we can further integrate a 2-D image grammar to learn interpretability-driven models.
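The node construction above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the AOGNet implementation: the hypothetical `aog_outputs` helper uses the identity in place of the learned bottleneck operation, and it enumerates sub-sentences bottom-up so that every binary split (And-node) is available when its Or-node sums the alternatives.

```python
import numpy as np

def aog_outputs(x, N):
    """Compute node outputs of one AOG building block (identity node op).

    x: input feature map of shape (C, H, W), with C divisible by N.
    Each sub-sentence of j consecutive words (out of N) has:
      - a Terminal-node: a channel-wise slice of x (a j-gram),
      - And-nodes: one per binary split, concatenating child features,
      - an Or-node: the element-wise sum over all alternatives.
    """
    C = x.shape[0]
    k = C // N  # channels per word
    out = {}
    for j in range(1, N + 1):          # sub-sentence length, bottom-up
        for i in range(0, N - j + 1):  # sub-sentence start position
            # Terminal-node: channel-wise slice covering words i..i+j-1
            alternatives = [x[i * k:(i + j) * k]]
            # And-nodes: every binary split (left, right) of the sub-sentence
            for s in range(1, j):
                left, right = out[(i, s)], out[(i + s, j - s)]
                alternatives.append(np.concatenate([left, right], axis=0))
            # Or-node: element-wise sum of all alternative compositions
            out[(i, j)] = sum(alternatives)
    return out[(0, N)]  # root node covers the whole sentence
```

Note that a left child with s*k channels concatenated with a right child with (j-s)*k channels matches the j*k channels of the Terminal slice, so the Or-node's element-wise sum is well defined for every alternative.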


• Deep Perception of the Visible and Deep Understanding of the Dark Jointly


A picture is worth a thousand words. What are the words? They refer to concepts and models, both visible and invisible (including patterns, symbols and logics). What are the structures organizing the words? They refer to image/video and language grammars (hierarchical, compositional, reconfigurable, causal and explainable). In addition, "The more you look, the more you see" (quoted from Prof. Stuart Geman). My research focuses on (i) statistical learning of large-scale, highly expressive hierarchical and compositional models from heterogeneous (big) data, including images, videos and text; (ii) statistical inference by learning near-optimal cost-sensitive decision policies; and (iii) statistical theory of performance-guaranteed learning algorithms and optimally scheduled inference procedures, i.e., maximizing the "gain" (accuracy) while minimizing the "pain" (computational cost).

• Lifelong Learning through ALTER Recursively (Demo)


Here, I articulate what I view as machines that can ALTER recursively, using online object tracking as an example. Tracking is one of the innate capabilities animals and humans use for learning concepts (Susan Carey, The Origin of Concepts. Oxford Univ. Press, 2011).

  • Ask: to predict object states (bounding boxes) in subsequent frames after only the first one is given.
  • Learn: to explore/unfold the space of latent object structures with the optimal models learned at different stages (e.g., Model1 and Model172).
  • Test: to perform tracking-by-parsing using learned models in a new frame with intrackability measured.
  • Explain: to account for structural and appearance variations explicitly based on intrackability (e.g., transition from Model1 to Model172 due to partial occlusion).
  • Refine: to re-learn the structure and/or parameters based on tracking results with accuracy and efficiency balanced.
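The five steps above can be sketched as a single loop. This is a hedged sketch under stated assumptions, not the actual system: the callbacks (`learn`, `test`, `explain`, `refine`) are hypothetical stand-ins for the learned model components, and `explain` returning a non-None reason (e.g., partial occlusion signaled by high intrackability) is what triggers refinement.

```python
def alter_track(frames, init_state, learn, test, explain, refine):
    """Run the Ask-Learn-Test-Explain-Refine cycle over a frame sequence.

    Ask: only the object state in the first frame is given; the goal is
    to predict states (bounding boxes) in all subsequent frames.
    """
    model = learn([init_state])   # Learn: initial model from frame 1 only
    history = [init_state]
    for frame in frames[1:]:
        # Test: tracking-by-parsing in a new frame, with intrackability measured
        state, intrackability = test(model, frame)
        history.append(state)
        # Explain: account for structural/appearance variations explicitly
        reason = explain(model, intrackability)
        if reason is not None:    # e.g., partial occlusion detected
            # Refine: re-learn structure and/or parameters from results so far
            model = refine(model, history)
    return model, history
```

With this skeleton, the transition from one model to another (e.g., Model1 to Model172 under partial occlusion) corresponds to `refine` being invoked whenever `explain` reports a reason for the measured intrackability.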