iFRAME (inhomogeneous Filters Random Field And Maximum Entropy)

Experiment 5.3: Codebook Learning for Object Classification (Binary Classification)

Code and dataset

Experiment description

We evaluate the above "bag-of-words" representation, extracted by a codebook of sparse FRAME templates, on binary classification tasks. We test it on the mixed dataset published in [1], which collects 16 categories from Caltech-101 [2], all 5 categories from the ETHZ Shape dataset [3], and all 3 categories from the Graz-02 dataset [4]. The task is to separate each category from a negative category. We resize all images so that the smaller of height and width is 150 pixels, keeping the aspect ratios, and convert them to grey-scale. We randomly select 30 positive and 30 negative images as training data and keep the rest as testing data. For Caltech-101 and Graz-02, negative images are drawn from the background category, while for ETHZ, negative examples are drawn from images outside the target category.

For each category, we learn a codebook of T = 10 sparse FRAME templates. Each template is of size 100 x 100 pixels and has n = 40 wavelets. Templates are matched at scales S in {0.8, 1, 1.2} and orientations A in {-1, 0, +1} x pi/16. Binary classification is performed with linear logistic regression regularized by the L2 norm [5]. We compare our results with those obtained by SIFT [6] features and SVM classifiers, where the SIFT features are quantized into "words" by K-means clustering (K = 50, 100, 500) and fed into a linear or kernel SVM; the best result among these six combinations (3 codebook sizes x 2 SVM types) is reported. A sketch of this classification stage is given below.

Table 1 shows the results of the binary classification experiments. All experiments are repeated five times with different randomly selected training and testing images, and we report the average accuracies with 95% confidence intervals. Our method generally outperforms SIFT + SVM, even though it uses much smaller codebooks (10 "words" versus 50, 100, or 500 "words").
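To make the classification stage concrete, the following is a minimal sketch, assuming the sparse FRAME template score maps (and, for the baseline, per-image SIFT descriptors) have already been computed. It uses scikit-learn as a stand-in for LIBLINEAR [5]; all function names are illustrative and not part of the released code.

# Minimal sketch (not the released code): bag-of-word features from template
# score maps, L2-regularized linear logistic regression, and the SIFT + SVM
# baseline with K-means quantization. scikit-learn stands in for LIBLINEAR [5].
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_feature(score_maps):
    """Assumed feature: max response per template over all locations, scales
    and orientations, giving a T-dimensional vector (T = 10 here)."""
    return np.array([m.max() for m in score_maps])

def classify_bow(X_train, y_train, X_test, y_test):
    """L2-regularized linear logistic regression on the bag-of-word vectors."""
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

def sift_svm_baseline(train_descs, y_train, test_descs, y_test, k=50):
    """Baseline: quantize SIFT descriptors into k visual words with K-means,
    build normalized word histograms per image, then train an SVM
    (linear kernel shown; a kernel SVM is the other compared variant)."""
    vocab = KMeans(n_clusters=k, n_init=10).fit(np.vstack(train_descs))
    def hist(descs):
        words = vocab.predict(descs)
        return np.bincount(words, minlength=k) / max(len(words), 1)
    X_train = np.array([hist(d) for d in train_descs])
    X_test = np.array([hist(d) for d in test_descs])
    svm = SVC(kernel="linear").fit(X_train, y_train)
    return svm.score(X_test, y_test)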

Parameter settings

General Parameters: nOrient = 16; sizeTemplatex = 100; sizeTemplatey = 100; GaborScaleList = [0.7]; DoGScaleList = []; sigsq = 10; locationShiftLimit = 2; orientShiftLimit = 1; numSketch = 40; isGlobalNormalization = true; isLocalNormalize = true; minHeightOrWidth = 150;
HMC Parameters: lambdaLearningRate = 0.1/sqrt(sigsq); epsilon = 0.03; L = 10; nIteration = 40; 12x12 parallel chains;
Codebook Parameters: flipOrNot = false; rotateShiftLimit = 1; allResolution = [0.8, 1, 1.2]; #EM iteration = 12; numCluster = 10; maxNumClusterMember = 50; LocationPerturbationFraction = 0.4;
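For orientation only: epsilon and L above play the standard roles of leapfrog step size and number of leapfrog steps in Hamiltonian Monte Carlo, with the parallel chains used to synthesize images from the current model. The sketch below is a generic single-chain leapfrog HMC update under that assumption; energy and grad_energy are placeholders for the model's negative log density, and this is not the project's actual sampler.

# Generic leapfrog HMC update (illustrative only), showing how epsilon = 0.03
# (step size) and L = 10 (leapfrog steps) are typically used; energy and
# grad_energy are placeholders for the model's U(x) = -log p(x) + const.
import numpy as np

def hmc_step(x, energy, grad_energy, epsilon=0.03, L=10, rng=np.random):
    p0 = rng.standard_normal(x.shape)            # sample momentum
    x_new, p = x.copy(), p0.copy()
    p -= 0.5 * epsilon * grad_energy(x_new)      # initial half step for momentum
    for _ in range(L):
        x_new += epsilon * p                     # full step for position
        p -= epsilon * grad_energy(x_new)        # full step for momentum
    p += 0.5 * epsilon * grad_energy(x_new)      # turn the last full step into a half step
    # Metropolis accept/reject on the total energy change (potential + kinetic)
    dH = energy(x_new) - energy(x) + 0.5 * (np.sum(p ** 2) - np.sum(p0 ** 2))
    return x_new if np.log(rng.random()) < -dH else x

In the learning loop, such updates would be run over the 12x12 parallel chains for nIteration = 40 iterations, with lambdaLearningRate scaling the gradient step on the model parameters.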

Table 1. Accuracies (%) on binary classification tasks for 24 categories from Caltech-101, ETHZ Shape and Graz-02 data

Dataset              SIFT+SVM       Our method
Caltech-Watch        90.1 ± 1.0     89.1 ± 1.6
Caltech-Sunflower    76.0 ± 2.5     89.6 ± 3.7
Caltech-Laptop       73.5 ± 5.3     89.8 ± 2.7
Caltech-Chair        62.5 ± 5.0     82.9 ± 4.7
Caltech-Piano        84.5 ± 4.2     93.8 ± 2.6
Caltech-Lamp         61.5 ± 4.5     86.6 ± 4.3
Caltech-Ketch        82.2 ± 0.8     83.3 ± 6.5
Caltech-Dragonfly    66.0 ± 4.0     89.9 ± 5.7
Caltech-Motorbike    93.9 ± 1.2     92.2 ± 2.9
Caltech-Umbrella     73.4 ± 4.4     90.0 ± 0.7
Caltech-Guitar       70.0 ± 2.4     77.3 ± 6.3
Caltech-Cellphone    68.7 ± 5.1     95.7 ± 1.8
Caltech-Schooner     64.3 ± 2.2     87.7 ± 2.8
Caltech-Face         91.8 ± 2.3     94.4 ± 2.3
Caltech-Ibis         67.8 ± 6.0     85.3 ± 2.7
Caltech-Starfish     73.1 ± 6.7     90.0 ± 2.3
ETHZ-Bottle          68.6 ± 3.2     77.5 ± 5.6
ETHZ-Cup             66.0 ± 3.3     62.5 ± 3.0
ETHZ-Swans           64.2 ± 1.5     74.2 ± 7.5
ETHZ-Giraffes        61.5 ± 6.4     73.3 ± 4.8
ETHZ-Apple           55.0 ± 1.8     65.8 ± 6.1
Graz02-Person        70.4 ± 1.2     68.2 ± 3.8
Graz02-Car           64.0 ± 6.7     59.6 ± 5.5
Graz02-Bike          68.5 ± 2.8     71.3 ± 5.1

References

[1] Y. Hong, Z. Si, W. Hu, S. C. Zhu, and Y. N. Wu. Unsupervised learning of compositional sparse code for natural image representation. Quarterly of Applied Mathematics, 72, 373-406, 2013.
[2] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR Workshop, 2004.
[3] V. Ferrari, F. Jurie, and C. Schmid. From images to shape models for object detection. IJCV, 87, 284-303, 2010.
[4] M. Marszalek and C. Schmid. Accurate object localization with shape masks. CVPR, 2007.
[5] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871-1874, 2008.
[6] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 91-110, 2004.