We evaluate the above bag-of-words representation, extracted by a codebook of sparse FRAME templates, on binary classification tasks. We test it on the mixed dataset published in [1], which is a collection of 16 categories from the Caltech-101 dataset [2], all 5 categories from the ETHZ Shape dataset [3], and all 3 categories from the Graz-02 dataset [4]. The task is to separate each category from a negative category. We resize all images so that the minimum of their height and width is 150 pixels, without changing their aspect ratios, and convert them to grey-level images. We randomly choose 30 positive and 30 negative images as training data and keep the rest as testing data. For Caltech-101 and Graz-02, negative images are chosen from the background category, while for ETHZ, negative examples are chosen from images of categories other than the target category. For each category, we learn a codebook of T = 10 sparse FRAME templates. Each template is of size 100 × 100 pixels and has n = 40 wavelets. We allow the templates to appear at scales S ∈ {0.8, 1, 1.2} and orientations A ∈ {-1, 0, +1} × π/16. Binary classification is done with linear logistic regression regularized by the L2 norm [5].

We compare our results with those obtained by SIFT [6] features and SVM classifiers, where the SIFT features are quantized into "words" by K-means clustering (K = 50, 100, 500) and fed into a linear or kernel SVM. The best result among these six combinations (3 numbers of words × 2 types of SVM) is then reported. Table 2 shows the results of the binary classification experiments. All experiments are repeated five times with different randomly selected training and testing images, and the average accuracies and 95% confidence intervals are reported. Our method generally outperforms the SIFT + SVM baseline, despite using much smaller codebooks (10 "words" versus 50, 100, or 500 "words").
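As a concrete illustration of the quantization step shared by both pipelines, a bag-of-words histogram assigns each local descriptor to its nearest codeword and counts occurrences. The sketch below uses random toy descriptors and a random codebook (placeholders, not the actual SIFT features or learned templates):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook and return an
    L1-normalized bag-of-words histogram (one bin per codeword)."""
    # Squared Euclidean distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                            # hard assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)                   # normalize counts

# Toy example: 200 random 128-d "SIFT-like" descriptors, K = 50 words.
rng = np.random.default_rng(0)
desc = rng.normal(size=(200, 128))
codebook = rng.normal(size=(50, 128))
h = bow_histogram(desc, codebook)                        # 50-d feature vector
```

The resulting fixed-length histogram is what gets fed to the linear classifier, regardless of how many descriptors the image produced.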
General parameters: nOrient = 16; sizeTemplatex = 100; sizeTemplatey = 100; GaborScaleList = [0.7]; DoGScaleList = []; sigsq = 10; locationShiftLimit = 2; orientShiftLimit = 1; numSketch = 40; isGlobalNormalization = true; isLocalNormalize = true; minHeightOrWidth = 150.

HMC parameters: lambdaLearningRate = 0.1/sqrt(sigsq); epsilon = 0.03; L = 10; nIteration = 40; 12 × 12 parallel chains.

Codebook parameters: flipOrNot = false; rotateShiftLimit = 1; allResolution = [0.8, 1, 1.2]; #EM iterations = 12; numCluster = 10; maxNumClusterMember = 50; locationPerturbationFraction = 0.4.
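The HMC parameters above (step size epsilon = 0.03 and L = 10 leapfrog steps) plug into the standard leapfrog update. A generic single-step sketch follows, with the target log-density supplied by the caller; the sparse FRAME potential itself is not reproduced here, so the standard-normal target in the comments is only an assumption for illustration:

```python
import numpy as np

def hmc_step(x, log_p_grad, log_p, epsilon=0.03, L=10, rng=None):
    """One generic HMC update: L leapfrog steps of size epsilon,
    followed by a Metropolis accept/reject on the Hamiltonian."""
    rng = rng or np.random.default_rng()
    p = rng.normal(size=x.shape)                  # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * epsilon * log_p_grad(x_new)    # initial half momentum step
    for _ in range(L):
        x_new += epsilon * p_new                  # full position step
        p_new += epsilon * log_p_grad(x_new)      # full momentum step
    p_new -= 0.5 * epsilon * log_p_grad(x_new)    # trim back to a half step
    # Accept with probability min(1, exp(H_old - H_new)).
    h_old = -log_p(x) + 0.5 * (p ** 2).sum()
    h_new = -log_p(x_new) + 0.5 * (p_new ** 2).sum()
    if np.log(rng.uniform()) < h_old - h_new:
        return x_new
    return x
```

Running many such chains in parallel (here, a 12 × 12 batch) amortizes the gradient computation across samples.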
Table 2: Classification accuracies (%, mean ± 95% confidence interval over five repetitions).

| Datasets | SIFT+SVM | Our method |
|---|---|---|
| Caltech-Watch | 90.1 ± 1.0 | 89.1 ± 1.6 |
| Caltech-Sunflower | 76.0 ± 2.5 | 89.6 ± 3.7 |
| Caltech-Laptop | 73.5 ± 5.3 | 89.8 ± 2.7 |
| Caltech-Chair | 62.5 ± 5.0 | 82.9 ± 4.7 |
| Caltech-Piano | 84.5 ± 4.2 | 93.8 ± 2.6 |
| Caltech-Lamp | 61.5 ± 4.5 | 86.6 ± 4.3 |
| Caltech-Ketch | 82.2 ± 0.8 | 83.3 ± 6.5 |
| Caltech-Dragonfly | 66.0 ± 4.0 | 89.9 ± 5.7 |
| Caltech-Motorbike | 93.9 ± 1.2 | 92.2 ± 2.9 |
| Caltech-Umbrella | 73.4 ± 4.4 | 90.0 ± 0.7 |
| Caltech-Guitar | 70.0 ± 2.4 | 77.3 ± 6.3 |
| Caltech-Cellphone | 68.7 ± 5.1 | 95.7 ± 1.8 |
| Caltech-Schooner | 64.3 ± 2.2 | 87.7 ± 2.8 |
| Caltech-Face | 91.8 ± 2.3 | 94.4 ± 2.3 |
| Caltech-Ibis | 67.8 ± 6.0 | 85.3 ± 2.7 |
| Caltech-Starfish | 73.1 ± 6.7 | 90.0 ± 2.3 |
| ETHZ-Bottle | 68.6 ± 3.2 | 77.5 ± 5.6 |
| ETHZ-Cup | 66.0 ± 3.3 | 62.5 ± 3.0 |
| ETHZ-Swans | 64.2 ± 1.5 | 74.2 ± 7.5 |
| ETHZ-Giraffes | 61.5 ± 6.4 | 73.3 ± 4.8 |
| ETHZ-Apple | 55.0 ± 1.8 | 65.8 ± 6.1 |
| Graz02-Person | 70.4 ± 1.2 | 68.2 ± 3.8 |
| Graz02-Car | 64.0 ± 6.7 | 59.6 ± 5.5 |
| Graz02-Bike | 68.5 ± 2.8 | 71.3 ± 5.1 |
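The reported intervals can be reproduced from the five per-run accuracies. One common convention (assumed here, since the exact formula is not stated in the text) is the Student-t interval with n − 1 degrees of freedom; the five accuracies below are illustrative placeholders, not values from the experiments:

```python
import numpy as np

def mean_ci95(accuracies):
    """Mean and 95% confidence half-width over a small number of
    repeated runs, via the Student-t interval (n must be 2..5 here)."""
    a = np.asarray(accuracies, dtype=float)
    n = a.size
    # Two-sided t critical values (0.975 quantile) for df = n - 1.
    t975 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}[n]
    half = t975 * a.std(ddof=1) / np.sqrt(n)   # sample std, not population
    return a.mean(), half

# Hypothetical example: five repeated accuracies for one category.
m, hw = mean_ci95([0.90, 0.92, 0.88, 0.91, 0.89])
```

With only five repetitions, the t critical value (2.776) is noticeably larger than the normal-approximation 1.96, which widens the reported intervals accordingly.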
[1] Y. Hong, Z. Si, W. Hu, S.-C. Zhu, and Y. N. Wu. Unsupervised learning of compositional sparse code for natural image representation. Quarterly of Applied Mathematics, 72, 373-406, 2013.
[2] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR Workshop, 2004.
[3] V. Ferrari, F. Jurie, and C. Schmid. From images to shape models for object detection. IJCV, 87, 284-303, 2010.
[4] M. Marszalek and C. Schmid. Accurate object localization with shape masks. CVPR, 2007.
[5] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871-1874, 2008.
[6] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 91-110, 2004.