A Hidden Obstacle to Image Retrieval

Theo Pavlidis ©2009
(First posted May 22, 2009, revision of June 5, 2009)

1. Introduction

I have argued in the past that the key to progress in image retrieval is a narrow focus, because in such a case one can link features and semantics (see, for example, Limitations of CBIR or the ICPR08 slides). But it seems that the success of narrow applications and the poor performance of general image retrieval may have an additional, more subtle cause. Computed similarity can be close to human similarity when the images compared are very close to each other (as is often the case in narrow applications), but it is hard to make the two agree when we are looking for looser connections, as is the case in general searches. Next, I will try to formalize the problem, then point to published studies, and finally present examples from web search engines that support this view.

An aside: The new viewpoint does not negate the fact that narrowing the domain allows the mapping of images to sets of features that capture the semantics of the image. If we know how humans interpret images of a particular type, we can apply transformations that map images that humans find very similar into objects that are close to each other (under some properly defined metric). For example, in Optical Character Recognition (OCR) some methods apply thinning because of the assumption that the thickness of strokes does not affect the meaning of a symbol.

2. Computed versus Human Similarity

Let S(I,J) be a similarity measure between images I and J, normalized to the [0,1] range; thus S(I,I) equals 1. Let H(P,I,J) be the human similarity measure for person P. I use this additional variable to emphasize the subjectivity of human judgment. It is certainly true that H(P,I,I) equals 1 for all P, so for identical images any reasonable measure should give the same answer. When S(I,J) is close to 1, one may argue that it is also close to H(P,I,J). This seems to be the case with everything I have come across in the literature. In other words, when we deal with "very similar" images it is not hard to replicate human impressions with a numerical computation. But how about similarity measures for images that need not be close? It seems that the two measures diverge in such cases, and H(P,I,J) may also vary a lot depending on P.
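As an illustration of such a normalized measure (my own toy example, not a measure used in any of the cited papers), here is a minimal Python sketch of histogram-intersection similarity. It satisfies S(I,I) = 1 by construction, although, as argued in the Conclusions, global histograms are a weak basis for retrieval.

```python
import numpy as np

def similarity(img_i, img_j, bins=32):
    """Histogram-intersection similarity S(I, J), normalized to [0, 1].

    Each histogram is normalized to sum to 1, so the intersection of a
    histogram with itself is 1, i.e. S(I, I) == 1 for any image I.
    """
    h_i, _ = np.histogram(img_i, bins=bins, range=(0, 256))
    h_j, _ = np.histogram(img_j, bins=bins, range=(0, 256))
    h_i = h_i / h_i.sum()
    h_j = h_j / h_j.sum()
    return float(np.minimum(h_i, h_j).sum())

rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(64, 64))
img_b = rng.integers(0, 256, size=(64, 64))
s_self = similarity(img_a, img_a)   # 1.0 for identical images
s_pair = similarity(img_a, img_b)   # somewhere in [0, 1]
```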

I have found only one paper that focuses on that issue. Zhang, Samaras, and Zelinsky [1] used 142 human subjects and directly compared H(I,J) (averaged over the subjects) with S(I,J). (They do not use this notation, but it is easy to establish the connection. Their computed similarity measure used SIFT as well as other features.) They found good agreement between the two values at the extremes, that is, at high similarity or no similarity at all (Fig. 3 in [1]). However, for intermediate values the two measures did not agree, partly because the humans did not agree among themselves. Fig. 2 of the paper displays the U-shaped curves showing agreement only at the extremes.

The paper by Zhao, Reyes, Pappas, and Neuhoff [2] compares algorithms for texture discrimination, using four subjects to estimate human similarity. Figs. 2 and 3 in [2] show that all computed similarity measures exhibit a big drop when going from humanly similar to humanly dissimilar textures, but otherwise the measures fluctuate, and they are more in agreement with each other than with the human rankings.

While there has been considerable research on image similarity, with the exception of [1] and [2] such work has not investigated how the responses of humans or algorithms differ between nearly identical images and loosely similar images. For example, the paper by Rogowitz et al. [3] uses multi-dimensional scaling [4] to analyze pairwise similarity matrices obtained from a set of 97 images, and it found a certain overall agreement among human subjects, both with each other and with some computed similarity measures. However, because the set did not include any nearly identical images, it is hard to tell how different human agreement is between nearly identical and loosely similar images.
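For readers unfamiliar with multi-dimensional scaling, the following sketch (my own illustration, not the procedure of [3]) implements classical (Torgerson) MDS with NumPy: given a matrix of pairwise dissimilarities, it recovers low-dimensional coordinates whose Euclidean distances approximate the input.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed n points in k dimensions so that
    their Euclidean distances approximate the dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered Gram matrix
    w, v = np.linalg.eigh(B)                # eigendecomposition of B
    idx = np.argsort(w)[::-1][:k]           # keep the k largest eigenvalues
    scale = np.sqrt(np.clip(w[idx], 0, None))
    return v[:, idx] * scale                # n x k coordinate matrix

# Toy example: four points on a line; the recovered 1-D distances
# reproduce the input dissimilarities exactly.
pts = np.array([0.0, 1.0, 3.0, 6.0])
D = np.abs(np.subtract.outer(pts, pts))
X = classical_mds(D, k=1)
recon = np.abs(np.subtract.outer(X[:, 0], X[:, 0]))
```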

3. SIFT Based Methods

There is strong evidence that methods based on SIFT (such as that used in [1]) are likely to do well only on close matches and not in cases where similarity can be defined in a subjective way (Analysis of SIFT). Therefore SIFT-based approaches are good candidates for biometrics, where one is interested only in very close matches. This is exemplified in the work of Lee, Jain, and Jin [5].

Another example of successful retrieval based on close matches using SIFT-based features is the work of Sivic and Zisserman [6], which provides online demonstrations. Because the method requires the user to select objects in an image, it by itself encourages queries that look for close matches. The demo for Oxford buildings illustrates the effect of closeness. If one asks for a view of a building, the results are quite impressive because the system tries to match landmarks (Figures 1a and 1b). However, for other searches (for example, for people or objects) the returned results are nowhere near as good as those for the buildings. Because the system also displays measures of the quality of the matching, one can see the difference in similarity strength between the two types of searches.

Figure 1: Examples from the Oxford buildings demonstration

One can argue that the system has been "trained" only for buildings but the examples of Figures 1c and 1d indicate that something more subtle is going on. A round light blue object in Figure 1c has been matched with round blue objects in Figure 1d but the context is quite different. We have a good match on the basis of low level local features (as expected from SIFT) but we cannot match the human interpretation that relies on object recognition.

SIFT relies on the matching of local details while being invariant to changes in brightness and scale, as well as to other geometric transformations. Thus it seems well suited to the task of matching what, from a human perspective, are nearly identical images. (Humans do not seem to be affected in their judgment by uniform moderate changes in brightness or scale.)
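To make the brightness-invariance claim concrete, here is a toy SIFT-flavoured descriptor (my own simplification, not Lowe's actual algorithm): a magnitude-weighted histogram of gradient orientations with L2 normalization. A uniform brightness shift leaves the gradients unchanged, and a uniform contrast scaling cancels out in the normalization, so the descriptor is unaffected by either.

```python
import numpy as np

def orientation_descriptor(patch, bins=8):
    """Toy SIFT-like descriptor: a histogram of gradient orientations,
    weighted by gradient magnitude and L2-normalized."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.arctan2(gy, gx)                     # gradient orientation
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

rng = np.random.default_rng(1)
patch = rng.random((16, 16))
shifted = patch + 0.3    # uniform brightness shift: gradients unchanged
scaled = patch * 1.5     # uniform contrast change: cancelled by normalization
```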

4. Results from Google Similar Images

Figure 2a shows a thumbnail of the top results returned by Google Similar Images when I asked for "labrador dog", and Figures 2b-2d show the returns for "Similar Images".

2a: Return based on metadata
2b: Similar-image return for the second original result
2c: Similar-image return for the third original result
2d: Similar-image return for the fourth original result
Figure 2: Results from Google Similar Images (Click on any figure to see it full size)

Clearly, a global similarity measure is used when one clicks on "Similar Images", because the background is also included in the comparison. The results of Figure 2c are impressive because, apparently, the same picture (albeit at different resolutions) has been found on several web sites. (No effort has been made to remove duplicates.) The results of Figure 2d are poor because the injuries on the face are a detail that is not captured by the measure used. Interestingly, these diverse results were obtained on my first test of the system.

There are several blogs discussing Google Similar Images, for example Richard Marr's, so I will not dwell further on that topic. From my viewpoint it offers a striking demonstration that computational similarity measures work only for nearly identical images.

5. Results from GazoPa

GazoPa is a content-based image search engine currently being developed by Hitachi. One can apply for an account at http://www.gazopa.com/sign_in and upload pictures of one's choice to test the BETA version of the system. I tried several images and obtained mixed results, which I described in my ICPR08 talk. Eventually I realized that the best results were obtained for images that could be matched closely by images in the database. Figures 3 and 4 show the best result that I obtained, and Figures 5 and 6 another good result.

Figure 3: Palm Tree
Figure 4: Screen dump of the GazoPa screen for the query of Figure 3. Note the close match of the 9th entry.
Figure 5: Mosque
Figure 6: Screen dump of the GazoPa screen for Figure 5.

On the other hand Figures 7 and 8 show a failure of the search.

Figure 7: Shoe
Figure 8: Screen dump of the GazoPa screen for Figure 7.

Why the poor performance? Apparently there was no picture of a black shoe in the database. But I find it quite interesting that many of the returns contained a dark elongated object, and the same characterization can be applied to the shoe. The matchings seem absurd only because we apply high-level recognition processes to the objects shown. This case seems similar to that of Figures 1c and 1d, where we also have agreement in low-level image descriptions. Calling the phenomenon the "semantic gap" is evading the issue. According to Professor Greg Zelinsky (Dept. of Psychology, Stony Brook University): "This is what a normal human might perceive very early in their visual processing. For example, if you are looking for a black shoe in a scene, your eye might be drawn to a black sports car based on low level visual metrics. This is a kind of human image retrieval error (your gaze "retrieved" the wrong pattern) that happens hundreds of times each day."

6. Conclusions

Unless we attempt object recognition, image comparisons rely on low-level features. If two images are "very close", then their low-level features are likely to match and we can do successful retrieval without worrying about the semantic gap. Of course this comment applies only to features (such as SIFT) that capture the image details; histograms and other global statistics tend to be quite unreliable (ICPR08 slides). At another extreme we have images that humans may call similar because they have similar interpretations, even though their details are quite different. A third extreme is the case of images where most of the details may be similar but the human interpretation is different (such as the examples of Figures 1c and 1d, 2d, and 7 and 8). In such extremes we have to deal with a semantic abyss (ICPR08 slides).
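A minimal demonstration of why global histograms are unreliable: randomly shuffling the pixels of an image destroys all of its spatial structure, yet leaves its gray-level histogram exactly unchanged, so any purely histogram-based measure declares the two images identical.

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(32, 32))
# Shuffle the pixels: same gray-level population, no spatial structure left.
scrambled = rng.permutation(img.ravel()).reshape(img.shape)

h_img, _ = np.histogram(img, bins=32, range=(0, 256))
h_scr, _ = np.histogram(scrambled, bins=32, range=(0, 256))
# h_img and h_scr are identical, although the images look nothing alike.
```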

More research is needed to find out to what extent humans decide that two images are similar based on appearance, as opposed to evoked high-level associations. The first type of decision is clearly made in the case of nearly identical images, and it is the only kind that we can hope to replicate by computational methods. The open question is how far we can push the distinction between a pair of images and still maintain a "low level" decision. One significant indicator may be whether different people agree on the similarity of pairs of images, especially if the people have diverse backgrounds and thus different cultural interpretations.

Cited Publications

[1] W. Zhang, D. Samaras, and G. Zelinsky "Classifying objects based on their visual similarity to target categories," Proceedings of the 30th Annual Conference of the Cognitive Science Society, 2008, pp. 1856-1861. On Line.

[2] X. Zhao, M. G. Reyes, T. N. Pappas, and D. L. Neuhoff, "Structural Texture Similarity Metrics for Retrieval Applications," Proceedings of the Int. Conference on Image Processing, 2008, pp. 1196-1199.

[3] B. E. Rogowitz, T. Frese, J. R. Smith, C. A. Bouman, and E. Kalin, "Perceptual Image Similarity Experiments," Human Vision and Electronic Imaging, Proc. of SPIE, 3299, 1998. On Line

[4] http://www.analytictech.com/networks/mds.htm

[5] J-E. Lee, A. K. Jain, R. Jin "Scars, Marks, and Tattoos (SMT): Soft Biometric for Suspect and Victim Identification" Biometrics Symposium, Tampa, Sept. 2008. On Line.

[6] J. Sivic and A. Zisserman, "Efficient Visual Search for Objects in Videos," Proceedings of the IEEE, Special Issue on Multimedia Information Retrieval, 96(4), 2008, pp. 548-566.



