If you like or dislike the ideas put forward on this page, I am very open to feedback.
Developing the technology to fulfill these requirements is not trivial. In my PhD project, I investigate whether this problem can be solved using a new combination of existing information technology.
The resulting artifact after four years of research would be a multimedia database system that can be used for storage and retrieval of all the information that you would now keep in a dusty shoebox put away somewhere in your attic.
Textual descriptions of multimedia objects alone are not sufficient. First, it is not very likely that people can describe objects with keywords in a standardized manner. One person may describe a picture as "dark" while another person describes the same picture as "somber". Moreover, substantial evidence exists that many semantic properties of multimedia objects cannot unambiguously be expressed verbally. The explanation of this phenomenon may be found in the differences between the left and the right brain. Therefore, we need other representations of these objects.
An exact match on two digitized objects will only retrieve another object if it is bit-for-bit identical to the query object. Thus, multimedia retrieval uses approximate match techniques. We use a distance measure that estimates how similar two objects are. The retrieval process returns a list of interesting objects with their ranking.
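As a minimal sketch of approximate matching, the following Python fragment ranks objects by their distance to a query vector. The Euclidean distance, the file names and the three-dimensional feature vectors are assumptions made up for illustration; they are not the measure or the data used in this project.

```python
import math

def euclidean(a, b):
    """Distance between two equal-length feature vectors (smaller = more similar)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query, collection):
    """Return (object_id, distance) pairs, closest object first."""
    scored = [(obj_id, euclidean(query, vec)) for obj_id, vec in collection.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical feature vectors, invented for this example.
collection = {
    "forest.jpg": [0.1, 0.8, 0.2],
    "car.jpg": [0.9, 0.1, 0.1],
    "building.jpg": [0.8, 0.2, 0.1],
}
ranking = rank_by_distance([0.9, 0.1, 0.1], collection)
```

Unlike an exact match, every object receives a score, so the result is a ranked list rather than a set of hits.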
Many techniques exist to extract features characterizing some syntactic properties of the digitized version of the original object. A good example of the kind of information that can be extracted from images is provided by the Virage image datablade for the Illustra database system. Recently, the same approach has been applied to content-based retrieval of audio objects.
Some particular retrieval task may be better addressed by one specific representation. For example, when we want to retrieve pictures of people, we can identify some very specialized features that work well for the identification of faces. However, the same system would be useless for searching pictures of dogs or cars. The features that describe faces are not defined for car bumpers.
We should not restrict ourselves to one main representation for all retrieval tasks. Therefore, the multimedia object has to be represented in many different ways. Database management systems are good at handling objects and relations between these objects. Therefore, we can use the traditional functionality of a database management system for the management of these representations. In the next section, we will argue why the query processor has to learn about the different representations as well.
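To illustrate how a relational database can manage several representations of one object, here is a small sketch using Python's built-in `sqlite3` module. The table names, column names and the JSON encoding of the vectors are assumptions chosen for the example; they do not describe the schema of the actual system.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object (id INTEGER PRIMARY KEY, uri TEXT NOT NULL);
CREATE TABLE representation (
    object_id INTEGER NOT NULL REFERENCES object(id),
    kind      TEXT NOT NULL,   -- e.g. 'color' or 'texture' (hypothetical kinds)
    vector    TEXT NOT NULL,   -- feature vector stored as a JSON array
    PRIMARY KEY (object_id, kind)
);
""")

# One object with two different representations.
conn.execute("INSERT INTO object (id, uri) VALUES (1, 'car.jpg')")
conn.executemany(
    "INSERT INTO representation (object_id, kind, vector) VALUES (?, ?, ?)",
    [
        (1, "color", json.dumps([0.9, 0.1, 0.1])),
        (1, "texture", json.dumps([0.3, 0.7])),
    ],
)

# Fetch all representations of the object.
rows = conn.execute(
    "SELECT kind, vector FROM representation WHERE object_id = ? ORDER BY kind",
    (1,),
).fetchall()
stored = {kind: json.loads(vector) for kind, vector in rows}
```

Keeping one row per (object, kind) pair means a new representation can be added without changing the schema.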
The syntactic representation of an image as a vector in color space helps us retrieve pictures of "red cars" when we provide the database with an example picture. However, we also retrieve images of buildings or waterfalls, which are semantically completely different. Syntactically though, the picture of a car is very similar to the picture of a building, if we search in color space alone.
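The effect can be reproduced with a toy example. The sketch below represents an image as a coarse color histogram and compares histograms with an L1 distance; both choices, as well as the pixel values, are assumptions for illustration and not the features computed by any of the systems mentioned on this page.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Map a list of (r, g, b) pixels (values 0-255) to a normalized histogram."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Quantize each channel into bins_per_channel buckets.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = len(pixels)
    return [count / total for count in hist] if total else hist

def l1_distance(a, b):
    """Sum of absolute differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical four-pixel "images": mostly red car, mostly red building, green forest.
car = color_histogram([(200, 20, 20)] * 3 + [(90, 90, 90)])
building = color_histogram([(210, 30, 30)] * 3 + [(40, 40, 200)])
forest = color_histogram([(30, 180, 40)] * 4)

car_to_building = l1_distance(car, building)
car_to_forest = l1_distance(car, forest)
```

In this toy data the red building ends up much closer to the red car than the forest does, although a car and a building have nothing in common semantically.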
For example, if we just use color in the aforementioned demo, providing the leftmost picture as a query object, we find that the following objects are similar with respect to the color features:
| | query | rank 1 | ... | rank 5 | rank 6 | rank 7 |
| --- | --- | --- | --- | --- | --- | --- |
| image | (image) | (image) | ... | (image) | (image) | (image) |
| similarity | | 12.2 | ... | 21.2 | 22.0 | 22.0 |
Although we would definitely like to retrieve the cars ranked at the 7th position, the buildings and waterfalls have little to do with the input picture.
We think the combination of many different syntactic properties of the multimedia objects is necessary to improve the retrieval process. The representations used should explicitly span several media if the original objects do. The hypothesis underlying this suggestion is that the combination of many syntactic representations will encompass more of the semantic properties of the object. Since the user searches using semantic properties of the multimedia objects, the combination of representations may be a big step forward in improving content-based retrieval processes.
Similar to the content-based query process originally proposed in the QBIC project, the image retrieval process implemented in the Virage datablade uses more information than just the position in color space. The vector representing the images has been extended with other features, representing texture, structure and composition information. However, the kind of features that can be used is restricted to the combinations the datablade designer thought of. Moreover, the intuition behind the calculation of the overall similarity using the values for color similarity and texture similarity is not clear.
In the QBIC demo, it is possible to use a textual attribute containing keywords in combination with the image attributes. However, if you choose a picture that was not labelled with the keyword provided, the system does not find the query object similar to itself. Apparently, the keywords are simply used as a boolean filter on the retrieved objects.
We think that the combination of syntactic properties has to be traceable. Moreover, the user rarely wants to use a boolean filter. The power of the Query by Example paradigm is that the user can ask for objects similar to the ones he or she finds useful. If the combination with a keyword results in boolean retrieval, we return to the old problems with textual descriptions.
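One possible way to make the combination traceable, and to treat keywords as soft evidence instead of a boolean filter, is sketched below. The weighted sum, the weights and the keyword distance are assumptions for the sake of the example; how the evidence should really be combined is exactly the open question of this project.

```python
def keyword_distance(query_keywords, object_keywords):
    """Fraction of query keywords missing from the object (0.0 = all present)."""
    wanted = set(query_keywords)
    if not wanted:
        return 0.0
    return len(wanted - set(object_keywords)) / len(wanted)

def combine(distances, weights):
    """Weighted sum of per-representation distances, plus each term's share."""
    contributions = {name: weights[name] * d for name, d in distances.items()}
    return sum(contributions.values()), contributions

# Hypothetical weights and objects; the query picture itself lacks the keyword.
WEIGHTS = {"color": 1.0, "keyword": 0.5}
objects = {
    "query.jpg": {"color": 0.0, "keywords": []},
    "other.jpg": {"color": 0.6, "keywords": ["car"]},
}

results = {}
for name, obj in objects.items():
    distances = {
        "color": obj["color"],
        "keyword": keyword_distance(["car"], obj["keywords"]),
    }
    results[name] = combine(distances, WEIGHTS)

# Smallest combined distance first.
ranking = sorted(results, key=lambda name: results[name][0])
```

Here the unlabelled query picture is penalized for the missing keyword but still ranks first, and the returned contributions show the user how much each representation added to the final score.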

The term agent is used to indicate a unit that produces a representation of the input data. Each agent maps an input object to a point in some multidimensional space.
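In code, an agent in this sense could look like a plain function from an object to a vector. The sketch below is only an illustration: the two agents, the dictionary-based object and all names are hypothetical and much simpler than real feature extractors.

```python
from typing import Callable, Dict, List

Vector = List[float]
Agent = Callable[[dict], Vector]

def mean_color_agent(obj: dict) -> Vector:
    """Map the object's pixels to a point in 3-dimensional RGB space."""
    pixels = obj["pixels"]
    n = len(pixels)
    return [sum(p[channel] for p in pixels) / n for channel in range(3)]

def caption_length_agent(obj: dict) -> Vector:
    """Map the object's caption to a point in a 1-dimensional space."""
    return [float(len(obj["caption"].split()))]

def represent(obj: dict, agents: Dict[str, Agent]) -> Dict[str, Vector]:
    """Run every agent on the object; each yields one representation."""
    return {name: agent(obj) for name, agent in agents.items()}

AGENTS: Dict[str, Agent] = {
    "mean_color": mean_color_agent,
    "caption_length": caption_length_agent,
}

# A hypothetical two-pixel photo with a caption.
photo = {"pixels": [(200, 0, 0), (100, 0, 0)], "caption": "red car"}
representations = represent(photo, AGENTS)
```

Because every agent has the same signature, adding a new representation only requires registering another function.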
The black box contains the magic used to combine the representations. The database is used by this black box to search in the representations and give a final judgement of relevance for each object in the database. A major requirement for the black box is that it should be possible to explain the process of combination to the user.
The current architecture cannot yet process social information. However, this may be realized by adding an extra feature space that represents the user judgements on the objects in the database.
The exact formulation in this architecture is still an open question. If we try to build it on top of an extensible database system like Illustra, we can integrate the agents in the database, leading to the following database structure:

The open research task is to develop the database technology needed to implement the middle layer. We will use the insights from information retrieval, image retrieval and machine learning scientists to build the higher-level multimedia query processor.

In the figure above, we outlined how the iterative query process takes place in this architecture. The user first poses an initial query, expressing some properties of the objects searched for. Next, the multimedia database (as described above) retrieves a ranked list of objects by combining the evidence on some of the representations.
The user investigates the ranked list, and returns some judgements on the retrieved objects. This relevance feedback is used to refine the query. Some of the information will be discarded from the initial query because no correlation is found with the judgements from the user. Other information will be added to the query.
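A classic way to refine a query vector from such judgements is Rocchio-style feedback from information retrieval. The sketch below uses it as a stand-in; it is not necessarily the refinement method of this architecture, and the parameter values and example vectors are arbitrary assumptions.

```python
def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def refine_query(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.5):
    """Move the query toward relevant objects and away from non-relevant ones."""
    dim = len(query)
    pos = centroid(relevant, dim)
    neg = centroid(non_relevant, dim)
    refined = [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]
    # Negative weights are clipped to zero: that feature is dropped from the query.
    return [max(0.0, x) for x in refined]

# Hypothetical three-feature query with one relevant and one non-relevant judgement.
initial_query = [1.0, 0.0, 1.0]
relevant = [[1.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 2.0]]
refined_query = refine_query(initial_query, relevant, non_relevant)
```

In this example the third feature, which only occurs in the rejected object, disappears from the query, while the second feature, which occurs in the accepted object, is added to it.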
Last updated: $Id: arch.html,v 1.8 1998/07/21 15:52:25 arjen Exp $