Abstract: This paper presents a method, called AOGTracker, for simultaneous tracking, learning and parsing (TLP) of unknown objects in video sequences with a hierarchical and compositional And-Or graph (AOG) representation. The TLP method is formulated in the Bayesian framework, with a spatial and a temporal dynamic programming (DP) algorithm inferring object bounding boxes on the fly. During online learning, the AOG is discriminatively learned using latent SVM to account for appearance variations (e.g., lighting and partial occlusion) and structural variations (e.g., different poses and viewpoints) of a tracked object, as well as distractors (e.g., similar objects) in the background. Three key issues in online inference and learning are addressed: (i) maintaining the purity of positive and negative examples collected online, (ii) controlling model complexity in latent structure learning, and (iii) identifying critical moments to re-learn the structure of the AOG based on its intrackability. Intrackability measures the uncertainty of an AOG based on its score maps in a frame. In experiments, our AOGTracker is tested on two popular tracking benchmarks with the same parameter setting: the TB-100/50/CVPR2013 benchmarks, and the VOT benchmarks (VOT2013, 2014, 2015 and TIR2015, the last for thermal imagery tracking). In the former, our AOGTracker outperforms state-of-the-art tracking algorithms, including two trackers based on deep convolutional networks. In the latter, our AOGTracker outperforms all other trackers in VOT2013 and is comparable to the state-of-the-art methods in VOT2014, 2015 and TIR2015.
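The abstract describes intrackability only as an uncertainty measure computed from score maps; it does not give a formula here. As a rough illustration, one plausible reading is the entropy of a score map after it has been normalized into a probability distribution. The sketch below assumes a softmax normalization and uses a hypothetical function name, `intrackability`; neither is taken from the text above.

```python
import math

def intrackability(score_map):
    """Entropy (in nats) of a softmax-normalized 2-D score map.

    Assumption: the score map is turned into a distribution with a softmax.
    A flat map (many equally good locations) gives high entropy; a single
    sharp peak gives low entropy.
    """
    scores = [s for row in score_map for s in row]
    peak = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A sharply peaked map (confident localization) versus a flat one (ambiguous).
peaked = [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]
flat = [[1.0, 1.0, 1.0] for _ in range(3)]
```

Under this reading, a tracker could compare the value against a threshold to decide when the AOG structure should be re-learned; the choice of threshold would be an additional assumption.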
Typical Issues in Online Object Tracking
Illustration of some typical issues in online object tracking, using the “skating1” video in the TB100 benchmark. Starting from the object specified in the first frame, a tracker needs to handle many variations in subsequent frames, including illumination variation, scale variation, occlusion, deformation, fast motion, in-plane and out-of-plane rotation, background clutter, etc.
Overview of the Proposed AOGTracker
(a) Illustration of the tracking, learning and parsing (TLP) framework. It consists of four components. (b) Examples of capturing structural and appearance variations of a tracked object by a series of object configurations inferred on-the-fly over key frames #1, #173, #282, etc. (c) Illustration of an object AOG, a parse tree and an object configuration in frame #282. A parse tree is an instantiation of an AOG. A configuration is a layout of latent parts represented by terminal-nodes in a parse tree. An object AOG preserves ambiguities by capturing multiple parse trees.
Performance gain (in %) of our AOGTracker in terms of success rate and precision rate in the TB100 benchmark.
Performance comparison in TB-100 (1st row), TB-50 (2nd row) and TB-CVPR2013 (3rd row) in terms of success plots of OPE (1st column), SRE (2nd column) and TRE (3rd column). For clarity, only the top 10 trackers are shown as colored curves and listed in the legend. Two deep learning based trackers, CNT and SO-DLT, are evaluated in TB-CVPR2013 using OPE (with their performance plots manually added to the bottom-left figure).
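For readers unfamiliar with success plots, the following minimal sketch shows how such a curve is commonly computed for a single sequence: the success rate at a threshold is the fraction of frames whose bounding-box overlap (intersection over union) with the ground truth exceeds that threshold, and trackers are usually ranked by the area under the curve. The function names (`iou`, `success_curve`, `auc`), the `(x, y, w, h)` box format, and the number of thresholds are assumptions made for illustration, not the benchmark toolkit's API.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(predicted, ground_truth, num_thresholds=21):
    """Return (thresholds, rates): fraction of frames with IoU above each threshold."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return thresholds, rates

def auc(rates):
    """Area under the success curve, approximated by the mean success rate."""
    return sum(rates) / len(rates)

# Toy two-frame sequence: one perfect prediction, one complete miss.
pred = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt = [(0, 0, 10, 10), (20, 20, 5, 5)]
thresholds, rates = success_curve(pred, gt, num_thresholds=5)
```

In this toy case half of the frames succeed at every threshold below 1.0. OPE runs this evaluation once from the first frame, while SRE and TRE repeat it with perturbed initial boxes and different starting frames, respectively.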
We have presented a tracking, learning and parsing (TLP) framework and derived a spatial dynamic programming (DP) algorithm and a temporal DP algorithm for online object tracking with AOGs. We have also presented a method for online learning of object AOGs, including their structure and parameters. In experiments, we test our method on two main public benchmark datasets, and the results show better or comparable performance relative to state-of-the-art trackers.
In our ongoing work, we are studying more flexible computing schemes for tracking with AOGs. The compositional property embedded in an AOG naturally leads to different bottom-up/top-down computing schemes, such as the three computing processes studied by Wu and Zhu. We can track an object by matching the object template directly (i.e., α-process), by computing some discriminative parts first and then combining them into the object (β-process), or by doing both (α + β-process, as done in this paper). In tracking, as time evolves, the object AOG might grow through online learning, especially for objects with large variations in long-term tracking. Thus, faster inference is needed for real-time applications. We are trying to learn near-optimal decision policies for tracking using the framework proposed by Wu and Zhu.
In our future work, we will extend the TLP framework by incorporating generic category-level AOGs to scale it up. The generic AOGs will be pre-trained offline (e.g., using PASCAL VOC or ImageNet) and will help the online learning of specific AOGs for a target object (e.g., by helping to maintain the purity of the positive and negative datasets collected online). The generic AOGs will also be updated online together with the specific AOGs. By integrating generic and specific AOGs, we aim at life-long learning of objects in videos without annotations. Furthermore, we are also interested in integrating scene grammar and event grammar to leverage more top-down information.