Note: This material is from a seminar course given in the Fall of 2003 at Stony Brook University.
Lossy image compression needs a measure of image distortion in order to decide how much information can be discarded without significantly affecting the appearance of an image. It is generally agreed that subjective distortion does not correspond to numerical distortion as measured by a function of differences in pixel values unless the differences are very small. For example, human viewers cannot discriminate intensity differences below about 1%, so it is safe to allow approximation errors of that order. However, by insisting on small numerical errors we limit ourselves to relatively small compression ratios. We would like to exploit properties of the human visual system and allow large numerical errors that may not be noticeable, in order to achieve much higher compression ratios.
The common compression standard JPEG relies primarily on keeping numerical errors small, but it does take into account some properties of human vision. For example, color images are converted from RGB into a luminance component and two chrominance components. Luminance is approximated closely while chrominance is approximated coarsely, because human vision relies mainly on luminance. You can confirm this property of JPEG by comparing the sizes of the JPEG files for a color image and its gray version. The color JPEG file is only slightly larger (usually 10-20%) than the gray JPEG file, rather than three times the size. You can run a simple experiment by converting a color image into a gray one using the equation
Gray = 0.299*Red + 0.587*Green + 0.114*Blue        (1)
and then storing both versions in JPEG format. (The above is the luminance equation.) We see that taking into account a property of the human visual system results in an increase in compression by a factor of nearly 3.
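Equation (1) is easy to try programmatically before running the JPEG experiment. A minimal sketch using NumPy; the H x W x 3 array layout and the [0, 1] value range are assumptions:

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an H x W x 3 RGB array (values in [0, 1]) to grayscale
    using the luminance weights of Equation (1)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights  # weighted sum along the color axis

# Example: a single pure-green pixel maps to 0.587 of full intensity.
pixel = np.array([[[0.0, 1.0, 0.0]]])
print(rgb_to_gray(pixel))  # [[0.587]]
```

Saving the original and the converted array as JPEG (with any image library) and comparing file sizes reproduces the 10-20% observation above.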
The challenge is to do better than that by taking advantage of other properties of human vision. This is particularly true for images displayed on the web that are meant to provide information, rather than make an artistic impression or be enhanced later for diagnostic or forensic reasons.
It is best to focus on preprocessors for the standard compression schemes rather than try to modify the standards; otherwise acceptance of our work will be delayed or will never occur. Thus, given an image J, we look for a transformed image T(J) that looks similar to J to human viewers but compresses much better than J. To do such work we must have not only good image distortion measures but also an understanding of how the standard compression schemes work.
Here is an example. The original JPEG standard subdivides an image into 8 by 8 blocks. In areas where the luminance varies slowly, each block is approximated by the average luminance value in the block. These averages differ from block to block, and this produces a blocking artifact that is quite annoying to human viewers, to the point that many prefer a lossless encoding of such images using GIF or PNG, with much bigger files than JPEG. If we had a preprocessor that converted areas of slowly varying luminance into areas of constant luminance, we should be able to use JPEG for the image. Of course this assumes that human viewers cannot tell apart constant from slowly varying luminance. In order to perform such transformations automatically we need a quantitative measure that tells us whether the change in the image is significant or not.
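A rough sketch of such a preprocessor is shown below. The 8-pixel block size matches JPEG; the luminance-range threshold of 2 gray levels is an assumption, roughly the 1% discrimination limit on a 0-255 scale:

```python
import numpy as np

def flatten_smooth_blocks(img, block=8, threshold=2.0):
    """Replace each block of a 2-D grayscale array whose luminance range
    is below `threshold` (i.e. slowly varying) with its mean value.
    The threshold is an assumed perceptual limit, not a JPEG parameter."""
    out = img.astype(float).copy()
    h, w = out.shape
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            b = out[i:i+block, j:j+block]   # view into `out`
            if b.max() - b.min() < threshold:
                b[:] = b.mean()             # flatten the block in place
    return out
```

Note that adjacent flattened blocks may still differ slightly in their means; a complete preprocessor would also smooth across block boundaries, which is exactly the kind of decision that needs the quantitative measure discussed above.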
The most popular image distortion measures operate by passing each image through a series of band-pass filters, then computing the differences between all the filtered versions, and finally summing them up. To illustrate the concept we consider a simplified version of a distortion measure with only two filters. One is a smoothing (blurring) filter that produces a low-pass version L(J), and the other is a differentiating (gradient) filter that produces a high-pass version H(J). Then the distortion between images J1 and J2 (assumed to be of exactly the same size) is
dsimple(J1,J2) = √( ( ∑[L(J1)-L(J2)]² + ∑[H(J1)-H(J2)]² ) / (2*A) )        (2)
where ∑ stands for a summation over all pixels and A is the area of the image. Such a measure might catch the blocking effect of JPEG, because the high-pass version of the reconstructed image will be quite different from the high-pass version of the original. Note that both filtered images are weighted equally in the above equation, which need not be so in general.
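Equation (2) can be sketched in a few lines. The specific filters below (a 3 by 3 box blur for L and a gradient magnitude for H) are illustrative choices of mine, not prescribed by the measure:

```python
import numpy as np

def dsimple(j1, j2):
    """Simplified two-filter distortion of Equation (2) for two
    same-size 2-D grayscale arrays."""
    assert j1.shape == j2.shape, "images must have identical size"
    area = j1.size

    def lowpass(img):
        # 3x3 box blur computed from shifted sums; edges are padded
        # by replication so the output keeps the input shape.
        p = np.pad(img.astype(float), 1, mode='edge')
        h, w = img.shape
        return sum(p[i:i+h, j:j+w] for i in range(3) for j in range(3)) / 9.0

    def highpass(img):
        # Gradient magnitude as a simple differentiating filter.
        gy, gx = np.gradient(img.astype(float))
        return np.hypot(gx, gy)

    dl = np.sum((lowpass(j1) - lowpass(j2)) ** 2)
    dh = np.sum((highpass(j1) - highpass(j2)) ** 2)
    return np.sqrt((dl + dh) / (2 * area))
```

Changing the relative weights of `dl` and `dh` gives the unequal weighting mentioned above.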
If you have trouble reading Equation (2), click here for an old-fashioned version. While this may be more understandable than the HTML version, it uses about the same number of bytes as the full document. To see a version about half that size, click here. While the smaller version may not look as good, it conveys the same information as the original. This lossy compression is a simple example of using prior knowledge about the nature of the image to perform pre-processing: we decided that this is really a two-tone image and mapped some gray values to black and white. (In effect, we reversed anti-aliasing.) This example also points to the need for a different type of image distortion measure, namely one that takes into account the "nature" of the image.
The motivation for looking at the nature of the image goes beyond aesthetics (or the desire to ignore them). Sometimes a small detail is critical, and its loss may not be captured by a measure that sums over the whole image, as Equation (2) and its more sophisticated counterparts do.
Consider for example the image of a clock; call it A. Construct an image B by replicating all details faithfully except for replacing one of the hands with the color of the face of the clock. Construct an image C by replacing the RGB values of each pixel with others that are, say, 5% higher. The difference will be barely noticeable to a human viewer, yet d(A,C) will be greater than d(A,B) for the measure of Equation (2). We can distort C even more by blurring the parts outside the face of the clock and changing the RGB values by more than 5%. While C will become visibly different from A, to a human observer it is still a better approximation to A than B is. Of course, if the purpose of displaying the clock has nothing to do with showing the time, the situation changes, although a missing hand will still be noticeable.
There have also been claims, based on tests, that one of the most sophisticated versions of Equation (2) (the Sarnoff model metric) does not perform significantly better than much simpler metrics. (See below.)
The two figures below illustrate this point. Both are approximations of an original (found on the web site of the work on lossy PNG). The rms error of (a) is 7.54 while the rms error of (b) is 6.49, and yet the letters appear much clearer in (a). (b) does a better job of approximating the clouds, something that a human viewer barely notices. The rms of the gradient (an extreme high-pass filter) is 9.69 for (a) and 9.25 for (b). Because the letters occupy a relatively small area and have low contrast with the cloudy sky background, their blurring contributes very little to the error. Image (a) is slightly bigger than (b) in size (66.2KB versus 65.7KB), while the original is 96KB.
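Figures like these are easy to reproduce for any pair of same-size gray images. A small sketch; reading "rms of the gradient" as the rms difference of the two gradient magnitudes is my assumption:

```python
import numpy as np

def rms_error(original, approx):
    """Root-mean-square pixel error between two same-size 2-D gray arrays."""
    d = original.astype(float) - approx.astype(float)
    return np.sqrt(np.mean(d ** 2))

def gradient_rms(original, approx):
    """RMS difference of gradient magnitudes: an extreme high-pass
    comparison of the kind quoted for images (a) and (b)."""
    def grad_mag(img):
        gy, gx = np.gradient(img.astype(float))
        return np.hypot(gx, gy)
    return rms_error(grad_mag(original), grad_mag(approx))
```

As the clock example suggests, both numbers can rank two approximations opposite to how a human viewer would.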
A question related to the nature of the image is how we decide which compression scheme to use. The choice is usually between transform coding, such as JPEG, and spatial-domain coding, such as PNG or GIF. Heuristic suggestions abound on several web sites, but one can easily construct counter-examples. We can try both methods and select the one that gives the better result, but we will need an image distortion measure if we are going to automate the process. We could do even better if we classified an image in a way that guides the selection of the distortion measure as well as the selection of possible transformations. For an image we may identify objects in it and insist that they be present in the transformed version as well. (The hands of the clock, or the letters in the "Beautiful Devon" image.) Interestingly enough, such measures have been developed in connection with MPEG-4, which relies on image segmentation. (See below.) Of interest are also similarity measures developed for information retrieval. (See [301-303] below.)
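The try-both-and-select strategy can be sketched with the Pillow library. Here file size alone decides between the two candidates, which is only part of the story: a complete selector would also apply a distortion measure to the JPEG reconstruction. The helper name and the quality setting are assumptions of this sketch:

```python
import io
from PIL import Image  # Pillow

def smaller_encoding(img, jpeg_quality=85):
    """Encode a PIL image both as PNG (lossless, spatial-domain) and
    as JPEG (lossy, transform coding) and return the smaller payload
    together with its format name."""
    candidates = {}
    for fmt, kwargs in (("PNG", {}), ("JPEG", {"quality": jpeg_quality})):
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format=fmt, **kwargs)
        candidates[fmt] = buf.getvalue()
    fmt = min(candidates, key=lambda f: len(candidates[f]))
    return fmt, candidates[fmt]
```

On a flat-color or line-art image PNG usually wins; on a photograph JPEG usually wins, which is exactly why the heuristics (and their counter-examples) exist.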
There are several references; the following is a relatively recent concise treatise (on reserve in the CS Library)
For a coverage of the recent JPEG2000 format see
The magazine is available through the electronic journal database of the SUNY library. For those interested in digging deeper into the mathematics underlying JPEG2000, see in the same magazine
There are several books on wavelets. A relatively compact treatment is provided by
An older but comprehensive reference on image and video compression is (on reserve in the CS Library)
A good place to start on image distortion measures is a special issue of the journal Signal Processing, published by Elsevier and available through the electronic journal database of the SUNY library. It is part of volume 70 (1998) starting with page 153. It contains the following papers that are relevant to the topic.
The following paper is motivated by MPEG-4 but it describes a method that can be extended to still images.
The term "Similarity Measures" is used in information retrieval applications. The following are three relatively recent papers available through the electronic journal database of the SUNY library.
The following book has a wide collection of papers on such topics.
The following papers describe ideas for improving compression by pre- and post-processing of images while relying on standard compression methods.