COMPUTER VISION
|
Figure 1: For
text the path from the internal computer representation to human
interpretation and understanding is short and simple, while for
pictures the path is quite long and tortuous. |
We may end with an observation from evolutionary biology. The human visual system has evolved from animal visual systems over a period of more than 100 million years (dinosaurs had a good visual system). In contrast, speech is barely over 100 thousand years old and written text no more than 10 thousand years old.
Computers must do a lot of work to make any sense of pictures. Here is an example. Suppose we want to make the computer "read". Instead of typing a document we scan it and then ask the computer to go from the pixels of the picture to the character codes. (This is what Google is doing with old books.) The first task would be to decide which pixels are black and which are white. Black means that all colors are 0 and white means all colors are 255 but that is an ideal situation. In practice we have shadows and illumination variations from the scanner so the task is not simple, but let us assume we have marked all pixels as either black or white. We may identify parts of each row where all pixels are black and replace them by short line segments. Then we can group stacks of such segments and replace them by straight lines (vectors).
(a) | (b) | (c) |
Figure 2: The thin lines represent scan lines of the original bit image and the heavy lines the vectors produced by the computation. — Click on each image to see it in full scale. |
Figure 2a shows an example of the results of such a process. You can see that we have been able to represent each character by elements that are close to what a human may use to describe the shape of the character. But it has taken a lot of computation to do that and our results are not foolproof. The example of Figure 2b shows that such a representation can be effected significantly by slight changes in the shape of the original. Finally, the example of Figure 2c shows the results of another method that also produces straight line representations of numerals.
The straight lines computed in the examples of Figure 2 are close to the strokes that a person may use to describe the shape of the letter. However, this is only a first step. We have to create decision rules for determining the name of a letter from these strokes. Research in this area started in the early days of computers (in the 1940's) but high quality commercial systems did not become available until half a century later, in the mid-1990's. Today you can buy software (for under $100) that lets you scan a page and then convert it into a text file. However, because most documents are already available in electronic form the need for such conversions is limited. The technology is used by postal services to read typed or handwritten address but the need for such processing becomes increasing less frequent because paper mail is replaced by e-mail and FAX.
Before we leave the subject of computer reading we should say a few words about bar codes. Right after World War II several people identified the need to tag items with a computer readable code so that they could be tracked in shipping and in commercial transactions. Because reading digits was hard for computers engineers search for alternative marking systems and the eventual result was bar codes. Bar codes encode information in the width of black and white stripes thus the computer needs to do only the first step of a reading program: finding lines of pixels with the same color. Because only the horizontal dimension is used bar codes can be read by linear scanners that capture only one line across a label. Measuring the width of a stripe may be hard for humans but it is quite easy for computers. Bar codes offer a way for marking items with symbols that match the special abilities of machines.
Many modern cameras have the ability to detect faces in a picture. There are several ways to do that: one is to look for light color oval areas with four dark spots (eyes, nose, and mouth). Another is to look for oval shapes with flesh color, and, of course, a combination of the two. However, such methods have trouble dealing with the faces of bearded people or those wearing dark glasses. Figure 3 illustrates the challenge.
Figure 3: The web site http://vasc.ri.cmu.edu/cgi-bin/demos/findface.cgi allows users to submit images and returns the results of a program for face detection, i.e. locating the face or faces in an image. Detected faces are marked with a green rectangle. A major miss is evident on the example shown. |
Recognizing faces is even harder. There has been a lot of publicity about computers looking at pictures of crowds and finding a particular individual (say, a suspected terrorist) there but the example of Figure 3 should make one skeptical about such claims. If it took nearly 50 years to develop machines that can read how long we should expect to take to make machines that recognize people? It appears that our brains have special areas dedicated to face recognition. This possibility can be inferred from the examples of Figure 4.
The task of finding the differences amongst each pair of pictures becomes much easier if we look at them right side up. From a mathematical viewpoint there should not be any difference. It is the circuitry of the human brain that is tuned to the way we usually see faces. You may also find that the difference in the ease of the task is bigger for the human face than the cat face. Why such a difference? Scrutinizing human faces is more valuable to people than scrutinizing cat faces.
Figure 4: Can you find in what way the pictures in each of the above pairs differ? Click here to see the pictures right side up. (Or if you are reading this in a hard copy turn the page around. |
Not surprisingly, the results of installed face recognition systems after 9/11 have been dismal. An ACLU press release of May 14, 2002 stated that "interim results of a test of face- recognition surveillance technology … from Palm Beach International Airport confirm previous results showing that the technology is ineffective." The release went on to say that: "Even with recent, high quality photographs and subjects who were not trying to fool the system, the face-recognition technology was less accurate than a coin toss. Under real world conditions, Osama Bin Laden himself could easily evade a face recognition system. … It hardly takes a genius of disguise to trick this system. All a terrorist would have to do, it seems, is put on eyeglasses or turn his head a little to the side." Similar conclusions appeared in a Boston Globe article of August 5, 2002. It quotes the director of security consulting firm saying that the "technology was not ready for prime time yet.'' He added that the " systems produced a high level of false positives, requiring an airport worker to visually examine each passenger and clear him for boarding." The article goes on to say: "One of the biggest deployments of the technology has occurred in England, in the London borough of Newham. Officials there claim that the installation of 300 facial-recognition cameras in public areas has led to a reduction in crime. However, they admit that the system has yet to result in a single arrest." A criticism of mechanical face recognition also appeared in the October 26, 2002 issue of the Economist. Ignoring the scientific evidence has resulted in a curious situation. The suppliers of the face recognition systems insist that the testers need to prove beyond reasonable doubt that their systems are faulty, instead of themselves having to prove that they are selling a valid product.
One obstacle to making computers match human visual understanding is that it is very hard to model the human perception of similarity in mathematical terms (the only language computers understand). Consider, for example, the three shapes in Figure 5.
Figure 5: Illustration used to explain the difference between humanly perceived and mathematical similarity. |
If you ask a person which one of the three shapes does not belong with the other two, most likely the answer will be the middle shape (circle with a notch). But computers must use mathematical formulas to compare the position of the pixels. It turns that the first two shapes are identical, except for the notch. If we overlay them on top of another (as the computer, effectively, does) they match everywhere, except at the notch, so their matching gets a high score. But the shape on the right is an ellipse, so it cannot be overlayed on the leftmost circle, so the matching will get a poor score. The computer answer will be that the shape to the right does not belong with the rest!
It is possible to write a program using different matching criteria (based on differential geometry) that will provide the same answer as people, but one can construct more complex examples where that program also fails to match human results.
Sources Cited | |
---|---|
[1] | V. S. Ramachandran and S. Blakeslee Phantoms in the Brain, William Morrow and Company Inc., New York, 1998. |
[2] | http://www.michaelbach.de/ot/ Web site of Professor Michael Bach of the University of Freiburg. |
First Posted: June 13, 2010 — Latest Update: June 13, 2010