MMDB ARCHITECTURE

A database architecture for multimedia retrieval

MIRROR

WARNING!

This is just draft information. It has no official status! It is also explicitly excluded from indexing by robots. These pages represent the intuition behind my research program.

If you like or dislike the ideas put forward on this page, I am very open to feedback.


MULTIMEDIA DATABASE REQUIREMENTS

In the paper (submitted for publication) Let's talk about it: Multimedia Databases, we introduce three extra requirements on multimedia database systems:

Developing the technology to fulfill these requirements is not trivial. In my PhD project, I investigate whether this problem can be solved using a new combination of existing information technology.

The resulting artifact after four years of research would be a multimedia database system that can be used for storage and retrieval of all the information that you would now keep in a dusty shoebox put away somewhere in your attic.


RESEARCH FRAMEWORK

Multiple representations

In this section, we will define what we mean by a representation. First, a television program consists of a video track, a sound track and maybe some textual information like subtitles and start and finish time. Moreover, each medium that the object consists of can be represented in several ways. For example, we can label an audio track with the output of a speech recognizer, or simply with whether it contains speech or music. We use the term representation for each version of the object that we can generate in our system. Thus, the speech recognizer's output and the speech/music labeling are two different representations of a video object.

Textual descriptions of multimedia objects alone are not sufficient. First, it is not very likely that people can describe objects with keywords in a standardized manner. One person may describe a picture as "dark" while another person describes the same picture as "somber". Moreover, substantial evidence exists that many semantic properties of multimedia objects cannot unambiguously be expressed verbally. The explanation of this phenomenon may be found in the differences between the left and the right brain. Therefore, we need other representations of these objects.

Applying an exact match on two digitized objects will only retrieve another object if it is bit-for-bit exactly the same. Thus, multimedia retrieval uses approximate match techniques. We use a distance measure that estimates how similar two objects are. The retrieval process returns a list of interesting objects with their ranking.
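As a minimal sketch of approximate matching, the following Python fragment ranks a small collection by distance to a query point. The Euclidean distance, the function names and the three-dimensional toy vectors are assumptions made for illustration; the text does not prescribe a particular distance measure.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query, collection, distance=euclidean):
    """Return (object_id, distance) pairs, most similar object first."""
    scored = [(obj_id, distance(query, vec)) for obj_id, vec in collection.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical feature vectors; in practice an extraction step produces them.
collection = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.2, 0.7, 0.1],
    "img_c": [0.8, 0.2, 0.0],
}
ranking = rank_by_distance([0.9, 0.1, 0.0], collection)
```

Unlike an exact match, every object receives a score, so the result is a ranked list rather than a set of hits.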

Many techniques exist to extract features characterizing some syntactic properties of the digitized version of the original object. A good example of the kind of information that can be extracted from images is given by the Virage image datablade for the Illustra database system. Recently, the same approach has been applied to content-based retrieval of audio objects.

Some particular retrieval task may be better addressed by one specific representation. For example, when we want to retrieve pictures of people, we can identify some very specialized features that work well for the identification of faces. However, the same system would be useless for searching pictures of dogs or cars. The features that describe faces are not defined for car bumpers.

We should not restrict ourselves to one main representation for all retrieval tasks. Therefore, the multimedia object has to be represented in many different ways. Database management systems are good at handling objects and relations between these objects. Therefore, we can use the traditional functionality of database management systems for the management of these representations. In the next section, we will argue why the query processor has to learn about the different representations as well.
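To make the idea of managing representations with ordinary database functionality concrete, here is a small sketch using SQLite from Python. The schema, the table and column names, and the JSON encoding of the payload are all assumptions for illustration; they are not the design of the system described here.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE media_object (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE representation (
        object_id INTEGER NOT NULL REFERENCES media_object(id),
        kind      TEXT NOT NULL,   -- hypothetical label, e.g. 'speech_transcript'
        payload   TEXT NOT NULL,   -- JSON-encoded representation
        PRIMARY KEY (object_id, kind)
    );
""")

def add_representation(conn, object_id, kind, payload):
    """Store one representation of an object."""
    conn.execute(
        "INSERT INTO representation (object_id, kind, payload) VALUES (?, ?, ?)",
        (object_id, kind, json.dumps(payload)),
    )

def representations_of(conn, object_id):
    """Return all stored representations of an object, keyed by kind."""
    rows = conn.execute(
        "SELECT kind, payload FROM representation WHERE object_id = ? ORDER BY kind",
        (object_id,),
    )
    return {kind: json.loads(payload) for kind, payload in rows}

conn.execute("INSERT INTO media_object (id, title) VALUES (1, 'evening news')")
add_representation(conn, 1, "speech_music_labels", ["speech", "music", "speech"])
add_representation(conn, 1, "speech_transcript", "good evening")
reps = representations_of(conn, 1)
```

One object row is related to any number of representation rows, so a new kind of representation can be added without changing the schema.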

Query processing

In the previous subsection, we saw how one object may have many different representations, each using some syntactic properties of the object.

The syntactic representation of an image as a vector in color space helps us retrieve pictures of "red cars" when we provide the database with an example picture. However, we also retrieve images of buildings or waterfalls, which are semantically completely different. Syntactically though, the picture of a car is very similar to the picture of a building, if we search in color space alone.

For example, if we just use color in the aforementioned demo, providing the leftmost picture as a query object, we find that the following objects are similar with respect to the color features:

...
image        query    rank 1    rank 5    rank 6    rank 7
similarity   -        12.2      21.2      22.0      22.0

Although we would definitely like to retrieve the cars ranked at the 7th position, the buildings and waterfalls have little to do with the input picture.
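The effect can be reproduced with a toy example. The sketch below builds a coarse color histogram and compares images by L1 distance; the histogram layout, the pixel values and the image names are invented for illustration and are unrelated to the demo results above.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized coarse RGB histogram of (r, g, b) pixels with 0-255 channels."""
    if not pixels:
        raise ValueError("an image needs at least one pixel")
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[index] += 1.0
    return [count / len(pixels) for count in hist]

def l1_distance(h1, h2):
    """Sum of absolute differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

RED, GREY, BLUE = (220, 30, 30), (120, 120, 120), (30, 60, 200)
red_car = [RED] * 6 + [GREY] * 4
red_building = [GREY] * 4 + [RED] * 6   # different content, same colors
blue_car = [BLUE] * 6 + [GREY] * 4

d_building = l1_distance(color_histogram(red_car), color_histogram(red_building))
d_blue_car = l1_distance(color_histogram(red_car), color_histogram(blue_car))
```

In color space alone the red building is indistinguishable from the red car, while the blue car, which is semantically much closer, ends up far away.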

We think the combination of many different syntactic properties of the multimedia objects is necessary to improve the retrieval process. The representations used should explicitly span several media if the original objects do. The hypothesis underlying this suggestion is that the combination of many syntactic representations will encompass more of the semantic properties of the object. Since the user searches using semantic properties of the multimedia objects, the combination of representations may be a big step forward in improving content-based retrieval processes.

Similar to the content-based query process originally proposed in the QBIC project, the image retrieval process implemented in the Virage datablade uses more information than just the position in color space. The vector representing the images has been extended with other features, representing texture, structure and composition information. However, the kind of features that can be used is restricted to the combinations the datablade designer thought of. Moreover, the intuition behind the calculation of the overall similarity using the values for color similarity and texture similarity is not clear.

In the QBIC demo, it is possible to use a textual attribute containing keywords in combination with the image attributes. However, if you choose a picture that was not labelled with the keyword provided, the system does not find the query object similar to itself. Apparently, the keywords are simply used as a boolean filter on the retrieved objects.

We think that the combination of syntactic properties has to be traceable. Moreover, the user rarely wants to use a boolean filter. The power of the Query by Example paradigm is that the user can ask for objects similar to the ones he or she finds useful. If the combination with a keyword results in boolean retrieval, we return to the old problems with textual descriptions.
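The contrast between a boolean keyword filter and a traceable combination can be sketched as follows. The weighted sum, the weights, the 0/1 keyword distance and the toy objects are assumptions chosen for illustration; how the evidence should really be combined is exactly the open question raised here.

```python
def keyword_distance(keywords, query_keyword):
    """Soft keyword evidence: 0.0 when the keyword is present, 1.0 otherwise."""
    return 0.0 if query_keyword in keywords else 1.0

def combine(distances, weights):
    """Weighted sum of per-representation distances, with a traceable breakdown."""
    contributions = {name: weights[name] * d for name, d in distances.items()}
    return sum(contributions.values()), contributions

def soft_ranking(objects, query_keyword, weights):
    """Rank all objects; the keyword only shifts the score."""
    results = []
    for obj_id, obj in objects.items():
        distances = {
            "color": obj["color"],
            "keyword": keyword_distance(obj["keywords"], query_keyword),
        }
        total, contributions = combine(distances, weights)
        results.append((obj_id, total, contributions))
    return sorted(results, key=lambda r: r[1])

def boolean_filter_ranking(objects, query_keyword):
    """Rank by color only, after discarding objects without the keyword."""
    kept = [(o, obj["color"]) for o, obj in objects.items() if query_keyword in obj["keywords"]]
    return sorted(kept, key=lambda r: r[1])

# Hypothetical color distances to the query image; the query image itself is unlabelled.
objects = {
    "query_image": {"color": 0.0, "keywords": set()},
    "car_photo": {"color": 0.3, "keywords": {"car"}},
    "waterfall": {"color": 0.2, "keywords": {"water"}},
}
weights = {"color": 1.0, "keyword": 0.25}
soft = soft_ranking(objects, "car", weights)
hard = boolean_filter_ranking(objects, "car")
```

With the boolean filter the unlabelled query image disappears from its own result list; with the soft combination it stays on top, and the per-representation contributions show the user why each object was ranked where it was.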

Social information filtering

This section ought to describe how social information filtering can contribute to the multimedia retrieval process. The intuition is that the social information filtering process can be modeled as a point query in user space. Because we already have to deal with multiple point queries in feature space, we hope we can realize this social information filtering relatively easily.
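One possible reading of a point query in user space is sketched below: each object becomes a vector with one coordinate per user, holding that user's judgement of the object, and objects are ranked by distance to the query object's vector. This interpretation, the judgement scale from -1.0 to 1.0, the use of 0.0 for missing judgements and all names are assumptions, not a description of a finished design.

```python
import math

def judgement_vector(object_id, judgements, users):
    """Coordinates of an object in user space: one judgement per user, 0.0 if unknown."""
    return [judgements.get((user, object_id), 0.0) for user in users]

def nearest_in_user_space(query_id, object_ids, judgements, users):
    """Rank the other objects by distance to the query object in user space."""
    query = judgement_vector(query_id, judgements, users)
    scored = []
    for obj_id in object_ids:
        if obj_id == query_id:
            continue
        vec = judgement_vector(obj_id, judgements, users)
        scored.append((obj_id, math.dist(query, vec)))
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical judgements: (user, object) -> score.
users = ["ann", "bob", "eve"]
judgements = {
    ("ann", "x"): 1.0, ("bob", "x"): 1.0, ("eve", "x"): -1.0,
    ("ann", "y"): 1.0, ("bob", "y"): 1.0,
    ("ann", "z"): -1.0, ("eve", "z"): 1.0,
}
neighbours = nearest_in_user_space("x", ["x", "y", "z"], judgements, users)
```

Because the query has the same shape as a point query in any feature space, the same ranking machinery could in principle be reused.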

...More to follow...

Architecture

The following figure is a very rudimentary sketch of the architecture we are developing.

The term agent is used to indicate a unit that produces a representation of the input data. Each agent maps an input object to a point in some multidimensional space.
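The notion of an agent can be sketched as a small interface. The `Agent` protocol, the `represent` method and the two byte-level toy agents below are hypothetical; real agents would compute features such as color or texture, but the shape of the mapping (object in, point out) is the same.

```python
from typing import Dict, List, Protocol

class Agent(Protocol):
    """Anything that maps raw object data to a point in some feature space."""
    name: str
    def represent(self, data: bytes) -> List[float]: ...

class MeanByteAgent:
    """Toy agent: a one-dimensional point, the mean byte value scaled to [0, 1]."""
    name = "mean_byte"
    def represent(self, data: bytes) -> List[float]:
        return [sum(data) / len(data) / 255.0] if data else [0.0]

class ByteHistogramAgent:
    """Toy agent: a normalized histogram of byte values."""
    name = "byte_histogram"
    def __init__(self, bins: int = 4) -> None:
        self.bins = bins
    def represent(self, data: bytes) -> List[float]:
        hist = [0.0] * self.bins
        for byte in data:
            hist[byte * self.bins // 256] += 1.0
        total = len(data) or 1
        return [count / total for count in hist]

def run_agents(agents: List[Agent], data: bytes) -> Dict[str, List[float]]:
    """Produce one representation per agent for the same input object."""
    return {agent.name: agent.represent(data) for agent in agents}

points = run_agents([MeanByteAgent(), ByteHistogramAgent()], bytes([0, 64, 128, 255]))
```

Each agent may use a space of a different dimension, which is why the points are kept per agent instead of being concatenated into one vector.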

The black box contains the magic used to combine the representations. The database is used by this black box to search in the representations and give a final judgement of relevance for each object in the database. A major requirement for the black box is that it should be possible to explain the process of combination to the user.

The current architecture cannot yet process social information. However, this may be realized by adding an extra feature space that represents the user judgements on the objects in the database.

The exact formulation in this architecture is still an open question. If we try to build it on top of an extensible database system like Illustra, we can integrate the agents in the database, leading to the following database structure:

The open research task is to develop the database technology needed to implement the middle layer. We will use the insights from information retrieval, image retrieval and machine learning scientists to build the higher-level multimedia query processor.

In the figure above, we outlined how the iterative query process takes place in this architecture. The user first poses an initial query, expressing some properties of the objects searched for. Next, the multimedia database (as described above) retrieves a ranked list of objects by combining the evidence on some of the representations.

The user investigates the ranked list, and returns some judgements on the retrieved objects. This relevance feedback is used to refine the query. Some of the information will be discarded from the initial query because no correlation is found with the judgements from the user. Other information will be added to the query.
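One simple way to picture this refinement step is to re-weight the representations according to how well each one separates the objects the user accepted from those the user rejected. The heuristic below (difference of mean distances, clipped at zero and normalized) is an assumption chosen for illustration, not the method of this project, and the distances are invented.

```python
def refine_weights(distances, judgements):
    """Derive representation weights from relevance feedback.

    distances:  {representation: {object_id: distance to the query}}
    judgements: {object_id: True if the user found it relevant, else False}
    A representation in which rejected objects are not farther away than
    relevant ones gets weight 0.0, i.e. it is dropped from the query.
    """
    gaps = {}
    for name, per_object in distances.items():
        relevant = [per_object[o] for o, ok in judgements.items() if ok]
        rejected = [per_object[o] for o, ok in judgements.items() if not ok]
        if not relevant or not rejected:
            gaps[name] = 0.0
            continue
        gap = sum(rejected) / len(rejected) - sum(relevant) / len(relevant)
        gaps[name] = max(gap, 0.0)
    total = sum(gaps.values())
    if total == 0.0:
        # No usable evidence: fall back to equal weights.
        return {name: 1.0 / len(gaps) for name in gaps}
    return {name: gap / total for name, gap in gaps.items()}

# Hypothetical distances for four retrieved objects in two feature spaces.
distances = {
    "color": {"a": 0.1, "b": 0.2, "c": 0.15, "d": 0.12},
    "texture": {"a": 0.1, "b": 0.2, "c": 0.8, "d": 0.9},
}
judgements = {"a": True, "b": True, "c": False, "d": False}
weights = refine_weights(distances, judgements)
```

In this toy run, color does not correlate with the user's judgements and is discarded, while texture, which the initial query may not have used at all, is added with full weight.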


RESEARCH QUESTIONS


Last updated: $Id: arch.html,v 1.8 1998/07/21 15:52:25 arjen Exp $
Maintained by: arjen@cs.utwente.nl