1.5. THE SEMANTIC GAP

Numbers and Text versus Everything Else

We saw that computers were able to perform complex tasks, such as breaking an encryption code, when they were still in their infancy. On the other hand, more than 60 years later, they are unable to do something as simple as reading "wiggly" text. This might look puzzling, but there is a simple explanation.

Computers can deal only with numbers. (Actually, only with binary numbers, but it is easy to convert other numerical representations into binary.) Therefore any other type of data (text, speech, music, pictures) must be converted into numbers before it can be processed by a computer. Then we are faced with the question of how closely the human interpretation of the data is connected to the numbers fed into a computer.

For numbers there is no issue. A 5 is a 5, always. For text, the numbers represent symbols of the alphabet (or other writing system) according to some correspondence scheme, as we saw in Section 1.1. In one such scheme (ASCII), 65 stands for A, 66 for B, 97 for a, 98 for b, etc. Punctuation marks and spaces are also assigned codes, so that Brown fox becomes 66 114 111 119 110 32 102 111 120. If you are searching for documents with the word fox, your query is converted into the numbers 102 111 120, and searching for matches is reduced to matching strings of numbers, something that computers can do very well.
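
To make this concrete, here is a small sketch (in Python; the code is illustrative, not part of the original text) that converts both the text and the query into their numeric codes and finds the match by comparing sequences of numbers:

    # Convert text and query to their ASCII codes with Python's ord(),
    # then search for the query by matching sequences of numbers.
    text = "Brown fox"
    query = "fox"

    text_codes = [ord(c) for c in text]    # [66, 114, 111, 119, 110, 32, 102, 111, 120]
    query_codes = [ord(c) for c in query]  # [102, 111, 120]

    # Slide the query over the text and compare the number sequences.
    matches = [i for i in range(len(text_codes) - len(query_codes) + 1)
               if text_codes[i:i + len(query_codes)] == query_codes]

    print(matches)  # [6] -- the query starts at position 6 of the text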

Let us look now at pictures. They are also stored as strings of numbers, the colors at each pixel. The word pixel stands for picture element, and here we come to the first fundamental difference between text and pictures. The characters in a text are natural elements and they describe the text whether it is stored in computer memory or chiseled in a stone tablet. When a person looks at a picture, he/she does not perceive any pixels. When a picture is taken with a digital camera, it is subdivided into tiny areas, and information about their color is stored in the camera memory and, eventually, in computer memory. These tiny elements are the pixels. We know from physics that any color can be expressed as a combination of red, green, and blue, and the standard is to express the strength of each color in the 0-255 range (so that it can be stored in a byte). That gives a total of 256 × 256 × 256 colors, or more than 16 million, far more colors than a person can tell apart. (When the three colors are the same the picture appears gray, what we call a "black-and-white" photograph.) A typical camera may subdivide the scene into 1024 rows and 1024 columns, so that we have about a million pixels, and since each pixel takes three bytes, the picture requires about three million bytes, or three megabytes.
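
The arithmetic behind these figures is easy to check; the short sketch below simply recomputes the numbers quoted in the paragraph:

    # Recompute the figures quoted above.
    levels_per_channel = 256                 # red, green, blue each fit in one byte
    total_colors = levels_per_channel ** 3
    print(total_colors)                      # 16777216 -- more than 16 million colors

    rows, cols = 1024, 1024                  # the illustrative sensor grid
    bytes_per_pixel = 3                      # one byte per color channel
    print(rows * cols)                       # 1048576 -- about a million pixels
    print(rows * cols * bytes_per_pixel)     # 3145728 -- about three megabytes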

The trouble is that pixels have no meaning for humans. One must create from them other entities that capture properties of a picture that are meaningful to people, and that is not an easy task. According to neuroscientists [1], the brain has models of the world and 30 distinct areas, each performing a different type of information processing, such as detecting color, texture, etc. Perceptions emerge as a result of reverberations of signals between different levels of the sensory hierarchy, indeed across different senses. The complexity of the system may also be inferred from the existence of visual illusions. People have tried to describe pictures using statistics of the pixel values, but they have not been successful. The difference between the human perception of pictures and pixel statistics is called the semantic gap in the literature. A more appropriate name might be the semantic abyss.
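
One way to see why such statistics fall short is that they ignore the arrangement of the pixels entirely. The sketch below (Python with NumPy, using synthetic data as a stand-in for a real photograph) scrambles the pixels of an image, producing something a person would find meaningless, yet every pixel statistic, from the histogram to the mean, stays exactly the same:

    # An image and a randomly scrambled copy of it have identical pixel
    # statistics, even though only one of them could look meaningful to a person.
    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(1024, 1024), dtype=np.uint8)  # stand-in picture

    scrambled = image.flatten().copy()
    rng.shuffle(scrambled)                     # destroy all spatial arrangement
    scrambled = scrambled.reshape(image.shape)

    hist_image, _ = np.histogram(image, bins=256, range=(0, 256))
    hist_scrambled, _ = np.histogram(scrambled, bins=256, range=(0, 256))
    print(np.array_equal(hist_image, hist_scrambled))   # True: same histogram
    print(image.mean() == scrambled.mean())              # True: same mean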

In contrast, text consists of words that are well-defined models of concepts, which is what makes communication among humans possible. While words may be ambiguous, the ambiguity is usually resolved by context. Thus "dog" may mean either an animal or an unattractive person, but if we look at the words surrounding it, the meaning usually becomes clear. In turn, words have a formal representation, either as single symbols (in Chinese, for example) or as sequences of symbols (in most languages). Whether one or several, the symbols have a well-defined representation in terms of bits, so going from digital storage to a human-understandable representation is a matter of table lookup. Figure 1.5.1 illustrates the paths from computer representation to human understanding for text and for pictures.

Figure 1.5.1: For text the path from the internal computer representation to human interpretation and understanding is short and simple, while for pictures the path is quite long and tortuous.
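
As a small illustration of how short the text path is, the sketch below recovers readable text from the stored numbers used earlier; Python's chr() function plays the role of the lookup table:

    # Decoding stored text is a single table lookup per symbol.
    codes = [66, 114, 111, 119, 110, 32, 102, 111, 120]
    print("".join(chr(code) for code in codes))   # Brown fox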

The semantic gap exists not only for pictures but also for sounds (speech, music, etc.). A sound recording is converted into numerical form by measuring the strength of the signal at an instant and storing that number; the process is repeated thousands of times a second. This is, in fact, how digital recordings are made.
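
A minimal sketch of that process appears below; the 440 Hz tone and the rate of 44,100 measurements per second are illustrative choices (the latter is the common CD sampling rate), not figures from the text:

    # Sound becomes numbers by measuring the signal strength many thousands
    # of times per second and storing each measurement as an integer.
    import math

    sample_rate = 44_100          # measurements per second
    frequency = 440.0             # pitch of the tone, in hertz
    duration = 0.001              # record just one millisecond for display

    samples = []
    for n in range(int(sample_rate * duration)):
        t = n / sample_rate                        # time of this measurement
        amplitude = math.sin(2 * math.pi * frequency * t)
        samples.append(round(amplitude * 32767))   # store as a 16-bit integer

    print(samples)   # the sound is now just a list of numbers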

We may conclude with an observation from evolutionary biology. The human visual system has evolved from animal visual systems over a period of more than 100 million years (dinosaurs had a good visual system). In contrast, speech is barely over 100 thousand years old and written text no more than 10 thousand years old. On that basis, it seems that pictures would represent a much more difficult challenge for computers than speech, and speech in turn would be more challenging than text.

Notes

[1] The discussion in this paragraph relies on Chapter 4 of Phantoms in the Brain, by V. S. Ramachandran and S. Blakeslee (William Morrow and Company, New York, 1998). The text in italics is a quote from p. 56.