In real-world visual recognition, many factors (such as illumination or pose) can cause a significant mismatch between the source domain on which classifiers are trained and the target domain to which they are applied. Without explicitly using domain adaptation techniques, the objects represented by codewords learned with our sparse-FRAME codebook learning framework can be transferred across domains automatically. In this experiment, we test the domain adaptability of our representation through image classification on four datasets and compare directly with published results [1] [2] [3] [4] [5] [6] [7]. The four datasets are: Amazon (images downloaded from online merchants), Webcam (low-resolution images taken by a web camera), DSLR (high-resolution images taken by a digital SLR camera), and Caltech-256 [8]. Each dataset is regarded as a domain. For the experiment with single-source training, the 10 classes common to all four datasets are used: BACKPACK, TOURING-BIKE, CALCULATOR, HEAD-PHONES, COMPUTER-KEYBOARD, LAPTOP-101, COMPUTER-MONITOR, COMPUTER-MOUSE, COFFEE-MUG, and VIDEO-PROJECTOR. For the experiment with multiple-source training, all 31 classes in Amazon, Webcam, and DSLR are used.

We follow the evaluation protocol of [3]: labeled data are randomly sampled from the source domain as training examples, and unlabeled data from the target domain serve as testing examples. We then learn 3 codewords (templates) for each object category to construct a codebook. The codebook responses extracted from each image are fed into a spatial pyramid matching scheme, which divides the image evenly into 1, 4, and 16 regions and concatenates the maximum codeword responses within each region into a single feature vector (a sketch of this step follows this paragraph). We train image classifiers on these feature vectors with a multi-class SVM and evaluate their classification accuracies on the target domain. For each pair of source and target domains, we run 10 random trials (seeds 1 to 10 in the tables below) and report the averaged accuracies on the target domain together with the standard deviations over the trials.

Table 1 and Table 2 compare the recognition accuracies on the target domains for single-source and multiple-source training, respectively. Our method performs significantly better than the other methods on 7 out of 11 subtasks, and is on par with the best-performing method on the remaining tasks.
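To make the feature construction concrete, the following is a minimal Python/NumPy sketch of the three-level pyramid pooling (1 + 4 + 16 = 21 regions, maximum codeword response per region), together with hypothetical multi-class SVM usage via scikit-learn's LinearSVC (one-vs-rest). Array shapes and all names such as `pyramid_max_pool` and `source_response_maps` are illustrative assumptions, not our actual code:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pyramid_max_pool(responses):
    """responses: (K, H, W) codeword response maps for one image.
    Returns a feature vector of length 21 * K: the maximum response
    of each codeword within each cell of 1x1, 2x2, and 4x4 grids."""
    K, H, W = responses.shape
    feats = []
    for n in (1, 2, 4):                       # 1, 4, 16 image regions
        hs, ws = H // n, W // n
        for i in range(n):
            for j in range(n):
                cell = responses[:, i * hs:(i + 1) * hs,
                                    j * ws:(j + 1) * ws]
                feats.append(cell.max(axis=(1, 2)))
    return np.concatenate(feats)

# Hypothetical usage: train on source-domain features, test on target.
# X_src = np.stack([pyramid_max_pool(r) for r in source_response_maps])
# clf = LinearSVC().fit(X_src, y_src)     # one-vs-rest multi-class SVM
# X_tgt = np.stack([pyramid_max_pool(r) for r in target_response_maps])
# print(clf.score(X_tgt, y_tgt))          # target-domain accuracy
```

The exact multi-class SVM variant used in the experiments is not specified beyond "multi-class SVM"; LinearSVC is one plausible stand-in.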
General Parameters: nOrient = 16; sizeTemplatex = 108; sizeTemplatey = 108; GaborScaleList = [0.7]; DoGScaleList = []; sigsq = 10; locationShiftLimit = 2; orientShiftLimit = 1; numSketch = 40; isGlobalNormalization = true; isLocalNormalize = true; minHeightOrWidth = 150.

HMC Parameters: lambdaLearningRate = 0.1/sqrt(sigsq); epsilon = 0.03; L = 10; nIteration = 40; 12×12 parallel chains.

Codebook Parameters: flipOrNot = false; rotateShiftLimit = 1; allResolution = [0.8, 1, 1.2]; #EM iterations = 12; numCluster = 3; maxNumClusterMember = 50; LocationPerturbationFraction = 0.4.
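In the sparse-FRAME learning loop, HMC is used to sample synthesized images from the current model, with epsilon and L above being, by the standard HMC convention, the leapfrog step size and the number of leapfrog steps, and the 12×12 chains run in parallel. For reference, here is a minimal generic HMC update in Python/NumPy; it is a sketch under those assumptions, not our actual implementation, and `U`/`grad_U` stand for the potential energy (negative log-density) and its gradient:

```python
import numpy as np

def hmc_step(x, U, grad_U, epsilon=0.03, L=10, rng=np.random):
    """One generic HMC update: L leapfrog steps of size epsilon,
    followed by a Metropolis accept/reject test."""
    p = rng.normal(size=x.shape)               # fresh auxiliary momentum
    x_new, p_new = x.copy(), p.copy()

    # Leapfrog integration of the Hamiltonian dynamics.
    p_new -= 0.5 * epsilon * grad_U(x_new)     # half step for momentum
    for i in range(L):
        x_new += epsilon * p_new               # full step for position
        if i < L - 1:
            p_new -= epsilon * grad_U(x_new)   # full step for momentum
    p_new -= 0.5 * epsilon * grad_U(x_new)     # final half step

    # Metropolis correction based on the change in total energy.
    h_old = U(x) + 0.5 * np.sum(p ** 2)
    h_new = U(x_new) + 0.5 * np.sum(p_new ** 2)
    return x_new if rng.uniform() < np.exp(h_old - h_new) else x
```

The lambdaLearningRate above presumably scales the model-parameter updates between sampling sweeps rather than the sampler itself.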
Per-trial classification accuracies (%) of our method for single-source training (C = Caltech-256, A = Amazon, W = Webcam, D = DSLR; source → target):

Trial | C→A | C→D | A→C | A→W | W→C | W→A | D→A | D→W
1 (seed=1) | 60.1293 | 51.1811 | 45.1052 | 56.9811 | 39.8902 | 55.4957 | 50.2155 | 76.9811
2 (seed=2) | 63.3621 | 58.2677 | 44.9222 | 56.6038 | 35.4986 | 49.0302 | 55.4957 | 69.4340
3 (seed=3) | 61.9612 | 48.8189 | 48.8564 | 60.0000 | 37.6944 | 43.7500 | 56.5733 | 72.8302
4 (seed=4) | 61.5302 | 52.7559 | 47.1180 | 55.4717 | 39.5242 | 51.5086 | 58.1897 | 75.8491
5 (seed=5) | 62.9310 | 55.1181 | 47.8500 | 57.3585 | 36.7795 | 55.1724 | 57.2198 | 70.9434
6 (seed=6) | 61.9612 | 48.8189 | 44.2818 | 46.7925 | 38.7008 | 55.8190 | 59.5905 | 69.8113
7 (seed=7) | 60.5603 | 54.3307 | 48.9478 | 45.6604 | 35.9561 | 55.7112 | 54.5259 | 71.6981
8 (seed=8) | 61.0991 | 48.8189 | 42.0860 | 50.9434 | 43.0924 | 59.1595 | 51.8319 | 75.8491
9 (seed=9) | 65.8405 | 57.4803 | 47.3010 | 48.6792 | 38.8838 | 54.6336 | 53.1250 | 73.2075
10 (seed=10) | 62.3922 | 46.4567 | 50.4117 | 53.9623 | 45.1052 | 51.9397 | 56.0345 | 67.5472
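The "Our method" row of Table 1 below summarizes these trials; the ± values match the sample standard deviation over the 10 runs. A minimal check in Python/NumPy for the C→A column (values copied from the table above):

```python
import numpy as np

# C->A accuracies of the 10 trials above (seeds 1-10).
acc = np.array([60.1293, 63.3621, 61.9612, 61.5302, 62.9310,
                61.9612, 60.5603, 61.0991, 65.8405, 62.3922])
print(f"{acc.mean():.1f} +/- {acc.std(ddof=1):.1f}")  # -> 62.2 +/- 1.6
```

The other columns aggregate in the same way.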
Table 1. Recognition accuracies (%) on target domains with single-source training:

Method | C→A | C→D | A→C | A→W | W→C | W→A | D→A | D→W
Metric [1] | 33.7 ± 0.8 | 35.0 ± 1.1 | 27.3 ± 0.7 | 36.0 ± 1.0 | 21.7 ± 0.5 | 32.3 ± 0.8 | 30.3 ± 0.8 | 55.6 ± 0.7
SGF [2] | 40.2 ± 0.7 | 36.6 ± 0.8 | 37.7 ± 0.5 | 37.9 ± 0.7 | 29.2 ± 0.7 | 38.2 ± 0.6 | 39.2 ± 0.7 | 69.5 ± 0.9
GFK [3] | 46.1 ± 0.6 | 55.0 ± 0.9 | 39.6 ± 0.4 | 56.9 ± 1.0 | 32.8 ± 0.7 | 46.2 ± 0.7 | 46.2 ± 0.6 | 80.2 ± 0.4
FDDL [4] | 39.3 ± 2.9 | 55.0 ± 2.8 | 24.3 ± 2.2 | 50.4 ± 3.5 | 22.9 ± 2.6 | 41.1 ± 2.6 | 36.7 ± 2.5 | 65.9 ± 4.9
MMDT [5] | 49.4 ± 0.8 | 56.5 ± 0.9 | 36.4 ± 0.8 | 64.6 ± 1.2 | 32.2 ± 0.8 | 47.7 ± 0.9 | 46.9 ± 1.0 | 74.1 ± 0.8
SDDL [6] | 49.5 ± 2.6 | 76.7 ± 3.9 | 27.4 ± 2.4 | 72.0 ± 4.8 | 29.7 ± 1.9 | 49.4 ± 2.1 | 48.9 ± 3.8 | 72.6 ± 2.1
Our method | 62.2 ± 1.6 | 52.2 ± 4.0 | 46.7 ± 2.5 | 53.2 ± 4.9 | 39.1 ± 3.0 | 53.2 ± 4.4 | 55.3 ± 2.9 | 72.4 ± 3.1
Per-trial classification accuracies (%) of our method for multiple-source training (sources → target):

Trial | DSLR, Amazon → Webcam | Amazon, Webcam → DSLR | Webcam, DSLR → Amazon
1 (seed=1) | 54.1311 | 50.8642 | 30.1395
2 (seed=2) | 50.9972 | 52.8395 | 33.9574
3 (seed=3) | 50.2849 | 56.5432 | 30.6902
4 (seed=4) | 52.8490 | 60.9877 | 31.2775
5 (seed=5) | 53.1339 | 52.3457 | 32.7093
6 (seed=6) | 51.5670 | 54.3210 | 30.4699
7 (seed=7) | 51.8519 | 51.8519 | 31.5345
8 (seed=8) | 53.9886 | 53.3333 | 32.4890
9 (seed=9) | 52.2792 | 53.0864 | 32.5624
10 (seed=10) | 50.5698 | 59.0123 | 35.0220
Table 2. Recognition accuracies (%) on target domains with multiple-source training:

Source | Target | SGF [2] | RDALR [7] | FDDL [4] | Our method
DSLR, Amazon | Webcam | 52 ± 2.5 | 36.9 ± 1.1 | 41.0 ± 2.4 | 52.2 ± 1.4
Amazon, Webcam | DSLR | 39 ± 1.1 | 31.2 ± 1.3 | 38.4 ± 3.4 | 54.5 ± 3.3
Webcam, DSLR | Amazon | 28 ± 0.8 | 20.9 ± 0.9 | 19.0 ± 1.2 | 32.1 ± 1.6