PROJECT OVERVIEW

Arjen P. de Vries

PROBLEM STATEMENT

Too much data floats around in the world surrounding us. The media hype information superhighways, worldwide computer networks that are really nothing more than data highways. Television and radio channels blast news reports, documentaries and talk shows through the air day and night, and thousands of magazines and papers are printed all over the world. The amount of data is so huge that it has become impossible to deal with in an efficient manner. People call this the problem of information overload. Even in strictly scientific environments it has been shown that researchers always miss relevant information, even though they have been trained to find as much information as possible.

How nice it would be to have a computer keeping track of everything that is broadcast and notifying you whenever you should listen to something interesting! This approach to dealing with information overload is taken in the field of Information Access. My project has focused on the application of information access to multimedia data. Ideally, a final system would cover hundreds of television and radio broadcasts.


DESIGN OF A MULTIMEDIA INFORMATION ACCESS SYSTEM


This figure shows a general framework that could address information access to multimedia data. The original raw data will be stored separately from a collection of higher level representations of these data. The raw data will be analyzed using automatic techniques like speech recognition and scene change detection. The connection between the raw data and the higher level representation can be made by storing tuples containing channel, date and time. Available technology to create meta-data that can be applied in real time is:

The user can present a query to the system. The query will be analyzed to produce the higher level representation of the query and this representation will be used to retrieve the position of the audio and video in the raw data repository. The retrieval system is based on the probabilistic information retrieval model. After retrieval, the audio and video will be played.

It is important to notice that the probabilistic information retrieval model is representation independent. Although it has only been applied to text data, I suggest applying it to phoneme sequences. A problem that still has to be solved is how to deal with the imprecise data that a speech recognizer would produce. Intuitively, I think it is possible to extend the model with the probability that a word was recognized correctly. However, this still has to be proven.

The incoming data is a continuous stream of multimedia data. It is very important that this stream is split into reasonable segments, which I will call documents from now on. I have defined two types of segmentation:

Syntactic segmentation segments the data by data type. For audio this could be the segmentation of the data into segments of silence, segments of music and segments of speech. For video this can be a segmentation based on fade-ins and fade-outs. This segmentation is important for deciding which recognizer to use (for example, a speech recognizer does not know what is speech and what is not). It could also provide a basis for semantic segmentation.

Semantic segmentation is the segmentation according to human perception of document boundaries. The purpose is to distinguish between, for example, a news report and a commercial. This segmentation is important because the user will expect the system to return a news report as a document and not half of a news report plus two commercials. Also, because the system has to measure the relevance of a document with respect to a query, the document boundaries have to be correct.


A PROTOTYPE DEMONSTRATING THESE IDEAS

In the prototype I developed, the raw data storage is a ring buffer of 24 hours of audio from the commercial broadcaster CNN Headline News. The video data is not stored. The captions of the television data (captions are descriptive subtitles) are decoded by a special hardware box and stored as the higher level representation. The INQUERY system is used to store these captions. The applied segmentation is far too simple for a real application but works reasonably well on CNN: I simply start a new document whenever it takes longer than a threshold time for captions to come in. The captions are parsed into an SGML-like representation that INQUERY can deal with. To jump to found words within a document it is necessary to store an index of (timestamp, caption) tuples. This index is stored as a table in the TIMES field. I chose a (line, word, seconds) format, which could also be applied to complete sentences analyzed by a speech recognizer.
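
The caption-gap segmentation described above can be sketched as follows. This is a minimal illustration, not the prototype's code: the function name `segment_by_caption_gap`, the (arrival time, text) input format and the 10-second threshold are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical input format: (arrival time in seconds, caption text).
Caption = Tuple[float, str]


def segment_by_caption_gap(captions: List[Caption],
                           threshold: float) -> List[List[Caption]]:
    """Start a new document whenever the gap between two consecutive
    captions is longer than `threshold` seconds."""
    documents: List[List[Caption]] = []
    current: List[Caption] = []
    previous_time = None
    for arrival, text in captions:
        if previous_time is not None and arrival - previous_time > threshold:
            documents.append(current)
            current = []
        current.append((arrival, text))
        previous_time = arrival
    if current:
        documents.append(current)
    return documents


# Illustrative stream; the 28-second gap separates two documents.
stream = [
    (0.0, "live from atlanta"),
    (2.0, "headline news"),
    (30.0, "texas authorities are trying"),
    (32.0, "to figure out what caused"),
]
docs = segment_by_caption_gap(stream, threshold=10.0)
```

A single fixed threshold is the simplest possible choice, which matches the remark that this segmentation is too simple for a real application.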

An example of a meta version of a news report is:

< DOC >
< DOCNO > CNN-04/04/95-02:00:03 < /DOCNO >
< DATE > 04/04/95 < /DATE >
< TIME > 02:00:03 < /TIME >
< SOURCE > CNN < /SOURCE >
< TIMES >
{1 1 2} {2 1 3} {3 1 13} {4 1 14} {5 1 17} {6 1 19} {7 1 20}
....
< /TIMES >
< TEXT >
captions paid for by
the us department of education
live from atlanta
headline news
david goodnow reporting
texas authorities are trying
to figure out what caused
....
< /TEXT >
< /DOC >

If the user queries for "david goodnow", the caption documents stored in the INQUERY database collection are searched and INQUERY returns a list of documents ordered by decreasing probability of usefulness for the user. The system then reads the subindex for line 5 and adds the 17 seconds to the document timestamp. Store-24 is tuned to CNN and starts playing at 2:00:20, 17 seconds after the start of the document.
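
The subindex lookup can be sketched like this. It is a minimal sketch under assumptions: the helper names `parse_times` and `playback_time` are hypothetical, the DOCNO layout (SOURCE-MM/DD/YY-HH:MM:SS) is inferred from the example document, and each line is assumed to have a single entry in the TIMES table.

```python
import re
from datetime import datetime, timedelta


def parse_times(times_field):
    """Parse '{line word seconds}' entries into {line: (word, seconds)}.

    Assumes one entry per line, as in the example document."""
    entries = re.findall(r"\{(\d+)\s+(\d+)\s+(\d+)\}", times_field)
    return {int(line): (int(word), int(seconds))
            for line, word, seconds in entries}


def playback_time(docno, times_field, line):
    """Return (source, moment to start playing) for a line of a document."""
    # Layout inferred from the example: SOURCE-MM/DD/YY-HH:MM:SS
    source, date, clock = docno.split("-")
    start = datetime.strptime(f"{date} {clock}", "%m/%d/%y %H:%M:%S")
    _, seconds = parse_times(times_field)[line]
    return source, start + timedelta(seconds=seconds)


times = "{1 1 2} {2 1 3} {3 1 13} {4 1 14} {5 1 17} {6 1 19} {7 1 20}"
source, when = playback_time("CNN-04/04/95-02:00:03", times, line=5)
print(source, when.strftime("%H:%M:%S"))  # CNN 02:00:20
```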

For the user interface, I implemented a WWW interface to the data. After typing the query, the user gets back a page containing the following:

3 documents retrieved:
  1. CNN-04/04/95-02:00:03 Captions Audio
  2. CNN-03/28/95-20:29:54 Captions Audio
  3. CNN-03/28/95-23:58:57 Captions Audio

If a user checks Audio, the running store-24 application is positioned to the timestamp of the requested document. Searching the subindex was not implemented in the prototype, but this is a trivial task. A better approach than directly positioning store-24 would be to return a list of links that position store-24 when they are followed. If the ring buffer contained a longer period of audio (e.g. a full month), these links could be included in other documents as well. The HTTP protocol does not allow server-initiated updates yet, so this interface could not be used for a filtering application that updates a user's homepage automatically. However, Netscape will extend its browsers with server push and client pull capability, which will make this interface suitable for a filtering application as well.

SYNTACTICAL SEGMENTATION OF AUDIO USING NEURAL NETWORKS


One of the problems I solved when I worked on the overall project is the segmentation of audio into segments of at least tens of milliseconds of the following data types:

  Silence: pauses, breath noise, white noise
  Speech:  speech with little background noise
  Music:   music
  Mix:     speech with music or high background noise

The motivation for segmenting into these data types is two-fold. The first goal is to avoid feeding speech recognizers with music. They would not know what to do with it and would start outputting nonsense. The other motive is that it is possible to guess the semantic structure of the audio document by using the syntactic structure. An analogous approach is commonplace in automatic document structure analysis of scanned documents. To show how this would work, I implemented the segmentation in the store-24 application so that people can jump to the next and previous long silence and skip musical fragments. It runs in real time on the CNN channel in the lab.
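
Jumping to the next long silence can be sketched on top of per-frame labels. This is an illustration only and not the store-24 code: the function names, the 10 ms frame length and the 300 ms minimum duration are assumptions.

```python
from itertools import groupby


def label_runs(labels, frame_ms):
    """Collapse per-frame labels into (label, start_ms, duration_ms) runs."""
    runs = []
    position = 0
    for label, group in groupby(labels):
        count = len(list(group))
        runs.append((label, position * frame_ms, count * frame_ms))
        position += count
    return runs


def next_long_silence(runs, now_ms, min_ms):
    """Return the start of the first silence after `now_ms` that lasts
    at least `min_ms`, or None if there is none."""
    for label, start, duration in runs:
        if label == "silence" and start > now_ms and duration >= min_ms:
            return start
    return None


# Hypothetical classifier output, one label per 10 ms frame.
labels = (["speech"] * 50 + ["silence"] * 5 + ["speech"] * 30
          + ["silence"] * 40 + ["music"] * 20)
runs = label_runs(labels, frame_ms=10)
print(next_long_silence(runs, now_ms=0, min_ms=300))  # 850
```

The short 50 ms silence at 500 ms is skipped because it is below the minimum duration. Skipping musical fragments would work the same way with the label "music".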

The neural network is fed with seventh-order MFCCs (mel-frequency cepstral coefficients, a spectral representation of the speech signal) and has been trained on thirty minutes of CNN and WBUR. The architecture of the neural network was inspired by the research group at OGI working on speech recognition using neural networks. It is important to mention that MFCCs are also used by most speech recognition engines. They are time-consuming to calculate, so it is good that the same input vectors can be used for both the segmentation and the content analysis.

The performance of the neural network segmenter on a full hour of CNN is given below. I have ignored Music-Mix, Mix-Music and Speech-Silence errors in this test because these errors are not important for what I want to accomplish. The distinction I really wanted to achieve was between Music, Speech and Silence. The results of this test are summarized in the following table:

               Correct  Wrong
     Silence    99.0%    1.0%
     Speech     92.1%    7.9%
     Mix        71.3%   28.7%
     Music     100.0%    0.0%

All segments classified as music really contained music, although sometimes with speech in the background. Almost 30% of the segments labeled as mix actually contained plain speech. This implies that you will miss some information that could have been recognized if you had known that it was speech and not mix. The 8% of speech that contains music in the background may be a larger problem though. This could really slow down the speech recognition engine of the second recognition step.
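
A per-label score of this kind can be computed as in the sketch below. It rests on assumptions that the text does not settle: the pairs are taken to be (assigned label, actual label), the three ignored confusions are taken literally in that order, and an ignored confusion is counted as correct rather than dropped. The data is made up for illustration and is not the CNN test set.

```python
from collections import defaultdict

# Assumption: (assigned, actual) confusions that are not counted as errors.
IGNORED = {("music", "mix"), ("mix", "music"), ("speech", "silence")}


def per_label_accuracy(pairs, ignored=IGNORED):
    """Return {assigned label: fraction of its segments counted as correct}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for assigned, actual in pairs:
        total[assigned] += 1
        if assigned == actual or (assigned, actual) in ignored:
            correct[assigned] += 1
    return {label: correct[label] / total[label] for label in total}


# Made-up (assigned, actual) pairs.
pairs = [
    ("mix", "mix"), ("mix", "speech"), ("mix", "music"), ("mix", "mix"),
    ("music", "music"),
    ("speech", "speech"), ("speech", "mix"),
]
scores = per_label_accuracy(pairs)
```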

Although the performance of the algorithm is not too bad, I sincerely believe a lot of improvements can still be made. The output smoothing is not very advanced, I did not use zero-mean normalization of the inputs, and the amount of training data is very small. However, the approach of using a neural network for this low level speech analysis seems to be a good one.


CONCLUSIONS

The inference network model can be applied to deal with the meta versions of the raw audio and video data we want to provide information access to. The Swiss federal institute found promising results using phoneme sequences as indexing features for ordinary text files. The only problem that has to be investigated is how the error probabilities of the recognition can be added to the inference network model without introducing inconsistencies in the model.

Another problem that will need to be solved is finding the document boundaries. I have started to implement syntactical segmentation and I believe that this can provide a basis to do semantic segmentation. I am convinced that adding speaker change detection and scene change detection will take semantic segmentation based on the syntax to a reasonable level.

Trying to apply speech recognition to sources like CNN audio will open a lot of new problems to be solved. For example, the traditionally used start and end point detection of utterances, typically based on energy and zero-crossing rates, will not work because there is almost no low-energy information on television. By using a neural network trained on noises that we recognize as pauses I solved this to a certain extent. Other problems are the number of times people speak at the same time, the amount of background noise, and the unlimited vocabulary. Speech recognizers have been trained on a limited, pre-known vocabulary, which is not suitable for information access on live audio and television feeds.
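
For reference, the traditional energy and zero-crossing measure mentioned above can be sketched as follows. This is a simplified illustration with hypothetical function names and thresholds, not a complete endpoint detector.

```python
def frame_features(samples, frame_length):
    """Return (mean energy, zero-crossing rate) per non-overlapping frame."""
    features = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        energy = sum(x * x for x in frame) / frame_length
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        features.append((energy, crossings / (frame_length - 1)))
    return features


def is_pause(energy, zcr, energy_threshold, zcr_threshold):
    """A frame counts as a pause only if both measures are low."""
    return energy < energy_threshold and zcr < zcr_threshold


# Synthetic signal: a near-silent frame followed by a loud square wave.
quiet = [0.001] * 80
loud = [0.5 if (i // 4) % 2 == 0 else -0.5 for i in range(80)]
features = frame_features(quiet + loud, frame_length=80)
pauses = [is_pause(e, z, energy_threshold=0.01, zcr_threshold=0.1)
          for e, z in features]
```

On broadcast audio hardly any frame would fall below the energy threshold, which is the failure described above.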


EXPLANATION OF USED TERMS

Captions are the American equivalent of subtitles, especially designed for use by hearing impaired people. Therefore, captions may contain extra information like Music... or Knock knock.

Scene change detection can be done in real time by looking at the frame size of the compressed frames. Doing this, it is possible to create storyboards. For a typical news report you would see one frame showing the announcer, one frame showing some houses under water in the Netherlands and one final frame showing the announcer again.
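
One way to use compressed frame sizes is to flag a frame that is much larger than the frames just before it. The sketch below is an assumption about how such a detector could look, not a description of an existing implementation. The function name, the ratio of 3.0 and the window of 5 frames are hypothetical.

```python
def scene_changes(frame_sizes, ratio=3.0, window=5):
    """Return indices of frames whose compressed size exceeds `ratio`
    times the average size of the preceding `window` frames."""
    changes = []
    for i in range(window, len(frame_sizes)):
        recent = frame_sizes[i - window:i]
        average = sum(recent) / window
        if frame_sizes[i] > ratio * average:
            changes.append(i)
    return changes


# Made-up frame sizes (in kilobytes) with two large frames.
sizes = [10, 11, 9, 10, 10, 10, 60, 11, 10, 9, 10, 10, 55, 10]
print(scene_changes(sizes))  # [6, 12]
```

The frames at the returned indices are candidates for a storyboard.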

My interpretation of the concept document is very flexible. It could even be applied to a coke can because the can contains information about the manufacturer and about the chemicals used.

Store-24 is a ring buffer containing 24 hours of audio from several channels. The people in the lab use it to listen to the radio. Now and then, mail is sent to everybody referring to an interesting interview that somebody heard, and then you can position the radio to that timestamp to listen to the interview. You never have to miss a news report using this application!
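
The ring buffer idea can be sketched as below. This is not the store-24 implementation: the class name, the one-chunk-per-second layout and the single channel are simplifying assumptions, and chunks are assumed to be written in order with consecutive timestamps.

```python
class RingBuffer:
    """Keeps the most recent `capacity_seconds` of audio, one chunk
    per second, overwriting the oldest chunk when full."""

    def __init__(self, capacity_seconds):
        self.capacity = capacity_seconds
        self.slots = [None] * capacity_seconds
        self.newest = None  # timestamp of the last chunk written

    def write(self, timestamp, chunk):
        self.slots[timestamp % self.capacity] = chunk
        self.newest = timestamp

    def read(self, timestamp):
        """Return the chunk for `timestamp`, or None if it is not
        (or no longer) in the buffer."""
        if self.newest is None:
            return None
        if not (self.newest - self.capacity < timestamp <= self.newest):
            return None
        return self.slots[timestamp % self.capacity]


# A 5-second buffer for illustration; a 24-hour buffer would use 86400.
buffer = RingBuffer(capacity_seconds=5)
for second in range(8):
    buffer.write(second, f"chunk-{second}")
```

Positioning the radio to a timestamp then amounts to a `read` at that timestamp, which only succeeds while the requested moment is less than one buffer length old.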


Last updated: May 10th, 1996
Maintained by: arjen@cs.utwente.nl