How nice it would be to have a computer keeping track of everything that is broadcast and notifying you whenever there is something interesting to listen to! This approach to dealing with information overload is taken in the field of Information Access. My project has focused on the application of information access to multimedia data. Ideally, a final system would cover hundreds of television and radio broadcasts.
The user can present a query to the system. The query is analyzed to produce a higher-level representation of the query, and this representation is used to retrieve the position of the audio and video in the raw data repository. The retrieval system is based on the probabilistic information retrieval model. After retrieval, the audio and video are played.
It is important to note that the probabilistic information retrieval model is representation independent. Although it has only been applied to text data, I suggest applying it to phoneme sequences. A problem that still has to be solved is how to deal with the imprecise data that a speech recognizer would produce. Intuitively, I think it is possible to extend the model with the probability that a word was recognized correctly. However, this still has to be proven.
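One way to picture that intuition is to let each recognized word contribute its recognition probability to the term count, rather than a full count of 1. The sketch below is only an illustration of the idea under that assumption, not a proven extension of the model; the function name, the input format and the confidence values are hypothetical.

```python
from collections import defaultdict


def expected_term_counts(recognized):
    """Sum recognition probabilities per word instead of counting each hit as 1."""
    counts = defaultdict(float)
    for word, p_correct in recognized:
        counts[word] += p_correct
    return dict(counts)


# Hypothetical recognizer output: (word, probability it was recognized correctly).
hypothesis = [("texas", 0.9), ("authorities", 0.6), ("texas", 0.5)]
counts = expected_term_counts(hypothesis)
print(counts)
```

A word that was heard twice with low confidence then weighs less than a word heard twice with high confidence, which is the behaviour the extension would need.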
The incoming data is a continuous stream of multimedia data. It is very important that this stream is split into reasonable segments, which I will call documents from now on. I have defined two types of segmentation:
Semantic segmentation is segmentation according to the human perception of document boundaries. The purpose is to distinguish between, for example, a news report and a commercial. This segmentation is important because the user will expect the system to return a news report as a document, and not half of a news report plus two commercials. Also, because the system has to measure the relevance of a document with respect to a query, the document boundaries have to be correct.
An example of a metaversion of a news report is:
< DOC >
< DOCNO > CNN-04/04/95-02:00:03 < /DOCNO >
< DATE > 04/04/95 < /DATE >
< TIME > 02:00:03 < /TIME >
< SOURCE > CNN < /SOURCE >
< TIMES >
{1 1 2} {2 1 3} {3 1 13} {4 1 14} {5 1 17} {6 1 19} {7 1 20}
....
< /TIMES >
< TEXT >
captions paid for by
the us department of education
live from atlanta
headline news
david goodnow reporting
texas authorities are trying
to figure out what caused
....
< /TEXT >
< /DOC >
If the user queries for "david goodnow", the caption documents stored in the INQUERY database collection are searched, and INQUERY returns a list of documents ordered by decreasing probability of usefulness to the user. The system then reads the subindex entry for line 5 and adds its 17 seconds to the document timestamp. Store-24 is tuned to CNN and starts playing at 2:00:20, 17 seconds after the start of the document.
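The lookup and the timestamp arithmetic in this example can be sketched as follows. This is a minimal sketch, assuming each entry in the TIMES field reads {line word seconds}; the text only confirms the line number and the seconds, so the meaning of the middle field is a guess, and all function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Values copied from the example document above.
TIMES = "{1 1 2} {2 1 3} {3 1 13} {4 1 14} {5 1 17} {6 1 19} {7 1 20}"
TEXT = """captions paid for by
the us department of education
live from atlanta
headline news
david goodnow reporting"""


def parse_subindex(times_text):
    """Map caption line number -> offset in seconds from the document start."""
    offsets = {}
    for line_no, _word_no, seconds in re.findall(r"\{(\d+) (\d+) (\d+)\}", times_text):
        offsets.setdefault(int(line_no), int(seconds))
    return offsets


def first_matching_line(text, query):
    """Return the 1-based number of the first caption line containing the query."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        if query in line:
            return line_no
    return None


def playback_time(date, time, offset_seconds):
    """Add the subindex offset to the document's DATE and TIME fields."""
    start = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
    return start + timedelta(seconds=offset_seconds)


offsets = parse_subindex(TIMES)
line_no = first_matching_line(TEXT, "david goodnow")
start_at = playback_time("04/04/95", "02:00:03", offsets[line_no])
print(line_no, start_at.strftime("%H:%M:%S"))
```

The resulting time is the position that would be handed to Store-24.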
For the user interface, I implemented a WWW interface to the data. After typing the query, the user will be returned a page containing the following:
If a user checks audio, the running Store-24 application is positioned at the timestamp of the requested document. Searching the subindex was not implemented in the prototype, but this is a straightforward task. A better approach than positioning Store-24 directly would be to return a list of links that each position Store-24 when followed. If the ring buffer held a longer stretch of audio (e.g. a full month), these links could be included in other documents as well. HTTP does not yet allow server-initiated updates, so this interface could not be used for a filtering application that updates a user's home page automatically. However, Netscape will extend its browsers with server push and client pull capability, which will make this interface suitable for a filtering application as well.
The neural network is fed with seventh-order MFCCs (mel-frequency cepstral coefficients) and has been trained on thirty minutes of CNN and WBUR. MFCCs are a spectral representation of the speech signal. The architecture of the neural network was inspired by the research group at OGI working on speech recognition using neural networks. It is worth mentioning that MFCCs are also used by most speech recognition engines. They are time-consuming to calculate, so it is useful that the same input vectors can be used for both the segmentation and the content analysis.
The performance of the neural network segmenter on a full hour of CNN is given below. I have ignored Music-Mix, Mix-Music and Speech-Silence errors in this test because these errors are not important for what I want to accomplish. The distinction I really wanted to achieve was between Music, Speech and Silence. The results of this test are summarized in the following table:
           Correct    Wrong
Silence     99.0%      1.0%
Speech      92.1%      7.9%
Mix         71.3%     28.7%
Music      100.0%      0.0%
All segments classified as music really contained music, though sometimes with speech in the background. Almost 30% of the segments labeled as mix actually contained plain speech. This implies that you will miss some information that could have been recognized if you had known that it was speech and not mix. The 8% of speech that contains music in the background may be a larger problem, though. This could really slow down the speech recognition engine of the second recognition step.
Although the performance of the algorithm is not too bad, I sincerely believe a lot of improvements can still be made: the output smoothing is not very advanced, I did not use zero-mean normalization of the inputs, and the amount of training data is very small. However, using a neural network for this low-level speech analysis seems to be a good approach.
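Two of these improvements are easy to picture in code. The sketch below shows zero-mean normalization of the input vectors and a simple majority-vote smoothing of the per-frame output labels. Both are generic textbook versions chosen for illustration, not the code used in the project; the function names and the window size are assumptions.

```python
from collections import Counter


def zero_mean(frames):
    """Subtract the per-coefficient mean from every feature vector."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def smooth_labels(labels, window=5):
    """Replace each frame label with the majority label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


normalized = zero_mean([[1.0, 10.0], [3.0, 14.0]])

# A single stray "music" frame inside a run of speech is voted away.
raw = ["speech"] * 4 + ["music"] + ["speech"] * 4
smoothed = smooth_labels(raw)
print(normalized, smoothed)
```

Majority voting removes isolated misclassified frames at the cost of blurring segment boundaries by up to half a window.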
Another problem that will need to be solved is finding the document boundaries. I have started to implement syntactical segmentation and I believe that this can provide a basis to do semantic segmentation. I am convinced that adding speaker change detection and scene change detection will take semantic segmentation based on the syntax to a reasonable level.
Applying speech recognition to sources like CNN audio opens up a lot of new problems. For example, the traditional start and end point detection of utterances, typically based on energy and zero-crossing rates, will not work because there are almost no low-energy stretches in television audio. By using a neural network trained on the noises that we recognize as pauses, I solved this to a certain extent. Other problems are how often people speak at the same time, the amount of background noise, and the unlimited vocabulary. Speech recognizers have been trained on a limited, pre-known vocabulary, which is not suitable for information access on live audio and television feeds.
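To make the endpoint problem concrete, here is a much simplified sketch of energy and zero-crossing based endpoint detection. The frame length, the thresholds and the rule that a frame is active when either measure exceeds its threshold are assumptions for illustration; practical detectors are more elaborate.

```python
def frame_features(samples, frame_len):
    """Return (short-time energy, zero-crossing rate) per non-overlapping frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((energy, crossings / (frame_len - 1)))
    return features


def find_endpoints(features, energy_threshold, zcr_threshold):
    """Return (first, last) index of the frames judged active, or None."""
    active = [i for i, (energy, zcr) in enumerate(features)
              if energy > energy_threshold or zcr > zcr_threshold]
    return (active[0], active[-1]) if active else None


def burst(amplitude, length):
    """Synthetic signal that alternates between +amplitude and -amplitude."""
    return [amplitude if i % 2 == 0 else -amplitude for i in range(length)]


# Clean recording: true silence around the utterance.
clean = [0.0] * 80 + burst(0.5, 80) + [0.0] * 80
# Broadcast-like recording: a background bed fills the pauses.
broadcast = burst(0.3, 80) + burst(0.5, 80) + burst(0.3, 80)

clean_endpoints = find_endpoints(frame_features(clean, 40), 0.01, 0.5)
broadcast_endpoints = find_endpoints(frame_features(broadcast, 40), 0.01, 0.5)
print(clean_endpoints, broadcast_endpoints)
```

On the clean signal the detector isolates the utterance, while on the broadcast-like signal every frame counts as active, which is the failure described above.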
Scene change detection can be done in real time by looking at the sizes of the compressed frames. This makes it possible to create storyboards. For a typical news report you would see one frame showing the announcer, one frame showing some houses under water in the Netherlands, and one final frame showing the announcer again.
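A minimal sketch of this idea, assuming that a scene change shows up as a frame that is much larger than the recent average: the window length, the ratio and the frame sizes below are made-up illustration values, and the detection rule itself is an assumption, since the text does not specify one.

```python
def detect_scene_changes(frame_sizes, window=5, ratio=2.0):
    """Flag frames whose compressed size exceeds ratio times the recent average."""
    changes = []
    for i in range(window, len(frame_sizes)):
        recent = frame_sizes[i - window:i]
        baseline = sum(recent) / window
        if frame_sizes[i] > ratio * baseline:
            changes.append(i)
    return changes


# Hypothetical compressed frame sizes in bytes, with two abrupt jumps.
sizes = [1000] * 10 + [5000] + [1000] * 10 + [4800] + [1000] * 5
changes = detect_scene_changes(sizes)

# Storyboard: the first frame plus the first frame of every new scene.
storyboard = [0] + changes
print(storyboard)
```

With these values the storyboard has three entries, matching the announcer, flood footage, announcer pattern described above.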
My interpretation of the concept of a document is very flexible. It could even be applied to a coke can, because the can contains information about the manufacturer and about the chemicals used.
Store-24 is a ring buffer containing 24 hours of audio from several channels. The people in the lab use it to listen to the radio. Now and then, mail is sent to everybody referring to an interesting interview that somebody heard, and then you can position the radio at that timestamp to listen to the interview. You never have to miss a news report using this application!
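The ring buffer principle can be sketched in a few lines. This is not the Store-24 implementation; the class name, the one-chunk-per-second layout and the integer timestamps are assumptions made for illustration, and the sketch expects chunks to be written in order without gaps.

```python
class RingBuffer:
    """Fixed-capacity buffer holding one chunk of audio per second."""

    def __init__(self, capacity_seconds):
        self.capacity = capacity_seconds
        self.slots = [None] * capacity_seconds
        self.newest = None  # timestamp (in seconds) of the latest chunk written

    def write(self, timestamp, chunk):
        """Store a chunk, overwriting whatever was recorded one capacity ago."""
        self.slots[timestamp % self.capacity] = chunk
        self.newest = timestamp

    def read(self, timestamp):
        """Return the chunk for a timestamp, or None if it is no longer (or not yet) held."""
        if self.newest is None:
            return None
        if not (self.newest - self.capacity < timestamp <= self.newest):
            return None
        return self.slots[timestamp % self.capacity]


# A tiny buffer of 5 seconds stands in for the 24 hours of the real system.
buf = RingBuffer(5)
for t in range(8):
    buf.write(t, f"chunk-{t}")
print(buf.read(7), buf.read(3), buf.read(2))
```

Positioning the radio at a timestamp then amounts to a read at that timestamp, which only succeeds while the requested moment is still inside the buffer.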