6th SIKS/Twente Seminar on Searching and Ranking

Searching Speech: Evaluation of Speech Recognition in Context


The goal of this seminar is to bring together researchers from companies and academia working on the evaluation of speech recognition techniques. Invited speakers are:

The symposium will take place at the campus of the University of Twente at the Citadel (building 9), lecture hall H327.
See Travel information

The event is part of the Advanced Components Stage of the SIKS educational program. PhD students working in the field of Web-based Systems and Data Management, Storage and Retrieval are especially encouraged to participate.

The symposium is organized by Franciska de Jong, Laurens van der Werff and Thijs Verschoor, members of the CHoral project team.


13:00 Coffee and Welcome
An overview of the construction and analysis of participant results for the MediaEval 2011 Rich Speech Retrieval task

The MediaEval 2011 Rich Speech Retrieval (RSR) task was an exploratory study of retrieval from an archive of semiprofessional user-generated Internet video, where user information needs were associated with specific types of speech acts. The video dataset was taken from the Internet sharing platform, and search queries were associated with specific speech acts occurring in the video. A crowdsourcing approach was used to identify segments in the video data that contain speech acts, to create a description of the video containing the act, and to generate search queries designed to refind this speech act. I will first describe the construction of the dataset and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform. I will highlight the challenges we encountered in constructing this dataset, including the selection of the data source, the design of the crowdsourcing task, and the specification of queries and relevant items. The completed MediaEval 2011 RSR test collection created using these methods was a known-item search for a single manually identified ideal jump-in point in the video where playback should begin for each query.

I will then provide a summary comparison of the results from three participant groups in the MediaEval 2011 RSR task, based on automatic speech recognition (ASR) system transcripts, metadata manually assigned to each video by the user who uploaded it, and their combination. This analysis shows how the participants sought to use different transcript segmentation methods to maximize the rank of the relevant item and to locate the nearest match to the ideal jump-in point. The results indicate that the best overall results are obtained for topically homogeneous segments that have a strong overlap with the relevant region associated with the jump-in point, and that the use of metadata can be beneficial when segments are unfocused or cover more than one topic.

Gareth J. F. Jones (Centre for Next Generation Localisation, School of Computing, Dublin City University, Ireland)
The importance of proper evaluation measures in technology evaluations

In several engineering disciplines, research is driven to a large extent by existing evaluation data and protocols. As part of the protocol, the evaluation measure plays an important role, as it will be the main objective researchers use to select algorithms and choose the parameters of their algorithms. I will review some of the history of evaluation measures in my own field (that of speech) and try to draw parallels to that of information retrieval.

David van Leeuwen (Radboud University Nijmegen, Netherlands Forensic Institute)
14:40 Break
Speech Processing Technologies in Quaero

This talk will give an overview of the speech processing technologies developed in the Quaero program, with a focus on research at LIMSI. The aim of the Quaero program is wide: to develop technologies to automatically analyse and organize multimedia and multilingual content, with the aim of quick uptake of results in application projects, most of which rely to some extent on speech technologies. The main research areas concerning speech are speech-to-text transcription, within-show and cross-show speaker diarization, and language identification. All technologies are evaluated annually to assess progress, either within Quaero or in international benchmarks.

Lori Lamel (LIMSI-CNRS, France)
15:40 Closing
Evaluation of Noisy Transcripts for Spoken Document Retrieval

Spoken Document Retrieval (SDR) is usually implemented by using an Information Retrieval (IR) engine on speech transcripts that are produced by an Automatic Speech Recognition (ASR) system. These transcripts generally contain a substantial amount of transcription errors (noise) and are mostly unstructured. This thesis addresses two challenges that arise when doing IR on this type of source material: i. segmentation of speech transcripts into suitable retrieval units, and ii. evaluation of the impact of transcript noise on the results of an IR task.

It is shown that intrinsic evaluation leads to different conclusions about the quality of automatic story boundaries than extrinsic evaluation using Mean Average Precision (MAP). This indicates that for automatic story segmentation for search applications, the traditionally used (intrinsic) segmentation cost may not be a good performance target. The best performance in an SDR context was achieved using lexical cohesion-based approaches, rather than the statistical approaches that were most popular in story segmentation benchmarks.
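As an illustration of the extrinsic measure mentioned above, the following is a minimal sketch of how Mean Average Precision is commonly computed from ranked result lists and relevance judgments. The function names and the toy data are assumptions made for this example; they are not taken from the thesis.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    scores = [average_precision(runs[q], qrels.get(q, set())) for q in runs]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries with hand-made relevance judgments.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(runs, qrels)
```

In a segmentation experiment of the kind described, the same queries would be run against collections built with different story boundaries, and the resulting MAP values compared.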

For the evaluation of speech transcript noise in an SDR context a novel framework is introduced, in which evaluation is done in an extrinsic and query-dependent manner, but without depending on relevance judgments. This is achieved by making a direct comparison between the ranked result lists of IR tasks on a reference and an ASR-derived transcription. The resulting measures are highly correlated with MAP, making it possible to do extrinsic evaluation of ASR transcripts for ad-hoc collections, while using a similar amount of reference material as the popular intrinsic metric Word Error Rate.
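The idea of comparing ranked lists directly, without relevance judgments, can be sketched as follows. This example uses simple set overlap at a cutoff and its average over all depths as stand-in measures. These are assumptions chosen for illustration and are not necessarily the measures proposed in the thesis; the document identifiers are hypothetical.

```python
from typing import List


def overlap_at_k(reference: List[str], asr: List[str], k: int) -> float:
    """Fraction of the top-k positions shared by the two ranked lists."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = set(reference[:k]) & set(asr[:k])
    return len(shared) / k


def average_overlap(reference: List[str], asr: List[str], k: int) -> float:
    """Mean of overlap_at_k over depths 1..k, which rewards agreement near the top."""
    return sum(overlap_at_k(reference, asr, d) for d in range(1, k + 1)) / k


# Hypothetical results for one query, retrieved from a collection indexed
# with a reference transcript and with an ASR transcript.
reference_ranking = ["d1", "d2", "d3", "d4"]
asr_ranking = ["d1", "d3", "d2", "d5"]

top4_overlap = overlap_at_k(reference_ranking, asr_ranking, 4)
avg_overlap = average_overlap(reference_ranking, asr_ranking, 4)
```

Averaging such a score over a set of queries gives a query-dependent indication of how much transcript noise changes what a user would see, and it needs only a reference transcript rather than relevance judgments.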

The proposed evaluation methods are expected to be helpful for the task of optimizing the configuration of ASR systems for the transcription of (large) speech collections for use in Spoken Document Retrieval, rather than the more traditional dictation tasks.

PhD Defense by Laurens van der Werff (Human Media Interaction, University of Twente)


CTIT: Centre for Telematics and Information Technology
NWO: Netherlands Organisation for Scientific Research
SIKS: Netherlands research school for Information and Knowledge Systems


Please send your name and affiliation to if you plan to attend the symposium, and help us estimate the required catering.