Archive for the 'PF/Tijah' Category

Towards Affordable Disclosure of Spoken Word Archives

Thursday, October 30th, 2008, posted by Djoerd Hiemstra

by Roeland Ordelman, Willemijn Heeren, Marijn Huijbregts, Djoerd Hiemstra, and Franciska de Jong

This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition – supporting, e.g., within-document search – are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory, and requires additional research.

[download pdf]

Search for the Future

Friday, October 17th, 2008, posted by Djoerd Hiemstra

Information Retrieval is the discipline that studies computer-based search tools. Many applications that handle information on the internet would be completely inadequate without the support of information retrieval technology. How would we manage our email without spam filtering? How would we find information on the world wide web if there were no web search engines? The rise of web search engines has been one of the major success stories in computer science of the last decade: Internet and search companies like Google and Yahoo are now among the world’s most influential information technology companies.

Today, search technology is provided and developed by major search providers like Google and Yahoo, and by small specialized companies with specialized staff. But as search technology matures, it will have to be available to non-expert application developers as well. A major obstacle to achieving this is the lack of theories and high-level abstractions of search systems and the lack of declarative query languages. Another obstacle is the lack of methods to handle non-textual data, such as images, audio and video. Several projects of the Database Group of the University of Twente try to solve these problems for application areas such as Entity Search, Expert Search, Video Search, and Distributed Search. The models and approaches that are developed in these projects are evaluated on large-scale, realistic testbeds, and implemented in the group’s open source search system PF/Tijah, a search system that combines keyword queries with structured queries on XML databases. The research contributes to several courses in the university’s graduate programs, for instance Information Retrieval, XML & Databases 1, and XML & Databases 2.

Scientific programmer and post-doctoral positions

Thursday, July 3rd, 2008, posted by Djoerd Hiemstra

We have two job positions in the MultimediaN project.

Position 1: Speech Technology
SHoUT is an open source speech recognition toolkit developed at the University of Twente. SHoUT is a Dutch acronym for “Spraak Herkennings Onderzoek Universiteit Twente”, or in English: “Speech Recognition Research at the University of Twente”. SHoUT is used to aid research on large vocabulary continuous speech recognition, including research into the application of statistical language models, audio segmentation and classification, speaker diarization, and machine learning hyperparameter estimation for speech recognition.

Position 2: Search Engine Technology
PF/Tijah (Pathfinder/Tijah, pronounced “Pee Ef Teeja”) is a flexible open source text search system developed at the University of Twente in cooperation with CWI Amsterdam and TU München. The system is integrated in the Pathfinder XQuery compiler and can be downloaded as part of the MonetDB/XQuery database system. PF/Tijah is used to aid research in information retrieval at the University of Twente, including the application of language models to search, entity retrieval, and implementation of the W3C candidate recommendation XQuery Full-Text.

[Official Job Advertisement] (deadline: August 1, 2008)

Henning Rode defends Ph.D. thesis on Entity Ranking

Monday, June 30th, 2008, posted by Djoerd Hiemstra

From Document to Entity Retrieval: Improving Precision and Performance of Focused Text Search

by Henning Rode

Text retrieval has been an active area of research for decades. Finding the best index terms, developing statistical models for the estimation of relevance, using relevance feedback, and the challenge of keeping retrieval tasks efficient with ever-growing text collections have been important issues over the entire period. Especially in the last decade, we have also seen a diversification of retrieval tasks. Instead of searching for entire documents only, passage or XML retrieval systems allow users to formulate a more focused search for finer-grained text units. Question answering systems even try to pinpoint the part of a sentence that contains the answer to a user question, and expert search systems return a list of persons with expertise on the topic. The sketched situation forms the starting point of this thesis, which presents a number of task-specific search solutions and tries to set them into more generic frameworks. In particular, we take a look at three areas: (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.

In the first case, we show how different types of context information can be incorporated in the retrieval of documents. When users are searching for information, the search task is typically part of a wider working process. This search context, however, is often not reflected by the few search keywords given to the retrieval system, though it can contain valuable information for query refinement. With this work we address two research questions related to the aim of developing context-aware retrieval systems. Firstly, we show how already available information about the user’s context can be employed effectively to obtain highly precise search results. Secondly, we investigate how such metadata about the search context can be gathered. The proposed “query profiles” have a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context-dependent search constraints. The effectiveness of the approach is tested with experiments on selected dimensions of the user’s context.

When documents are not regarded as a simple sequence of words, but their content is structured in a machine readable form, it is straightforward to develop retrieval systems that make use of the additional structure information. Structured retrieval first asks for the design of a suitable language that enables the user to express queries on content and structure. We investigate here existing query languages, whether and how they support the basic needs of structured querying. However, our main focus lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing has to be concerned with the optimization of query plans as well. We investigate low level query plans of physical database operators for simple query patterns, and demonstrate the benefits of higher level query optimization for complex queries.

New search tasks and interfaces for the presentation of search results, like faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities, such as persons, organizations or dates, instead of documents or text passages. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities which are based on the relevance of texts that mention the entity. For this purpose, a graph-based relevance propagation framework is introduced that makes it possible to derive the relevance of entities. Several options for the modeling of entity containment graphs and different relevance propagation approaches are tested, demonstrating the usefulness of the graph-based ranking framework.
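
To make the idea of deriving entity relevance from the texts that mention an entity more concrete, here is a minimal sketch. It assumes the simplest possible propagation, a single step in which each entity receives the sum of the relevance scores of the texts mentioning it. The thesis tests several graph models and propagation approaches; this sketch is not claimed to be one of them, and the function name, document identifiers, and scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def rank_entities(
    doc_scores: Dict[str, float],
    mentions: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Propagate text relevance to entities in one step (assumed scheme).

    doc_scores maps a text id to its relevance for the query.
    mentions maps a text id to the entities detected in that text.
    """
    entity_scores: Dict[str, float] = defaultdict(float)
    for doc, score in doc_scores.items():
        for entity in mentions.get(doc, set()):
            entity_scores[entity] += score
    # Best entity first; ties are broken by name to keep the output stable.
    return sorted(entity_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical retrieval scores and entity containment edges.
DOC_SCORES = {"d1": 0.5, "d2": 0.25, "d3": 0.125}
MENTIONS = {"d1": {"Alice", "Bob"}, "d2": {"Bob"}, "d3": {"Carol"}}

if __name__ == "__main__":
    for entity, score in rank_entities(DOC_SCORES, MENTIONS):
        print(entity, score)
```

In this toy graph "Bob" ranks first because two relevant texts mention him, even though neither is the only top-scoring text.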

Download Henning’s thesis from EPrints.

Sound ranking algorithms for XML search

Wednesday, June 18th, 2008, posted by Djoerd Hiemstra

by Djoerd Hiemstra, Stefan Klinger, Henning Rode, Jan Flokstra, and Peter Apers

Ranking algorithms for XML should reflect the actual combined content and structure constraints of queries, while at the same time producing equal rankings for queries that are semantically equal. Ranking algorithms that produce different rankings for queries that are semantically equal are easily detected by tests on large databases: We call such algorithms not sound. We report the behavior of different approaches to ranking content-and-structure queries on pairs of queries for which we expect equal ranking results from the query semantics. We show that most of these approaches are not sound. Of the remaining approaches, only 3 adhere to the W3C XQuery Full-Text standard.
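
The soundness test described above can be sketched in a few lines. The sketch assumes a toy setting that is not taken from the paper: a query is a tuple of terms, two queries with the same terms in a different order are treated as semantically equal, and the "database" is three small elements. Both rankers and all names are hypothetical; the point is only the check itself, which compares the rankings of each equivalent pair.

```python
from typing import Callable, Dict, List, Tuple

Query = Tuple[str, ...]
Ranking = List[Tuple[str, float]]  # (element id, score), best first

# Hypothetical element contents.
DOCS: Dict[str, List[str]] = {
    "e1": ["xml", "ranking", "xml"],
    "e2": ["ranking", "database"],
    "e3": ["xml", "database", "database"],
}


def _sorted(scores: Dict[str, float]) -> Ranking:
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def rank_sum(query: Query) -> Ranking:
    """Score = total term frequency; independent of term order."""
    return _sorted({d: float(sum(w.count(t) for t in query)) for d, w in DOCS.items()})


def rank_position_weighted(query: Query) -> Ranking:
    """Score weights earlier query terms more; depends on term order."""
    return _sorted(
        {d: sum(w.count(t) / (i + 1) for i, t in enumerate(query)) for d, w in DOCS.items()}
    )


def same_ranking(a: Ranking, b: Ranking, tol: float = 1e-9) -> bool:
    """True when both rankings have the same order and (near) equal scores."""
    return len(a) == len(b) and all(
        ia == ib and abs(sa - sb) <= tol for (ia, sa), (ib, sb) in zip(a, b)
    )


def unsound_pairs(
    rank: Callable[[Query], Ranking],
    pairs: List[Tuple[Query, Query]],
) -> List[Tuple[Query, Query]]:
    """Return the semantically equal query pairs that are ranked differently."""
    return [(q1, q2) for q1, q2 in pairs if not same_ranking(rank(q1), rank(q2))]


# Pairs assumed to be semantically equal in this toy setting.
PAIRS = [(("xml", "database"), ("database", "xml"))]

if __name__ == "__main__":
    print("rank_sum:", unsound_pairs(rank_sum, PAIRS))
    print("rank_position_weighted:", unsound_pairs(rank_position_weighted, PAIRS))
```

A ranker for which `unsound_pairs` returns a non-empty list is not sound in the sense used above; an empty list only means no counterexample was found among the tested pairs.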

The paper will be presented at the SIGIR 2008 Workshop on Focused Retrieval in Singapore.

[download pdf]

PF/Tijah at INEX

Tuesday, June 17th, 2008, posted by Djoerd Hiemstra

To facilitate topic development for the INEX Entity Ranking track, we developed a simple but effective INEX entity ranking demo. The demo searches in about 4.5 GB of English Wikipedia articles. It is not that fast, but it was coded in less than two days: just insert the data and write a few XQuery statements, done!

Web portal about the Buchenwald camp

Tuesday, April 15th, 2008, posted by Djoerd Hiemstra

On Friday, April 11, 2008, the Nederlands Instituut voor Oorlogsdocumentatie (NIOD, the Netherlands Institute for War Documentation) published a web portal about the Buchenwald camp. The portal offers general information about the history of the camp, lets visitors watch the documentary about the camp, and puts 37 interviews (more than 60 hours) with former Buchenwald prisoners online. What is new about this portal is that the interviews can not only be watched in full, but can also be searched. The latter was made possible by the Human Media Interaction (HMI) and Databases (DB) groups, both part of the CTIT research institute of the University of Twente. HMI disclosed the spoken word of the survivors using speech technology, and this has been made searchable, together with the conventional metadata of the collection (descriptions and personal profiles), through the PF/Tijah search engine, co-developed by the UT DB group. As a result, the interviews are accessible online both through the traditional metadata and through the literal, spoken word of the survivors.

The Buchenwald interview collection consists of 38 interviews with survivors and contains about 60 hours of video in total. A textual rendering of what was literally said during each interview, the so-called speech transcripts, makes it possible to search the spoken word of the survivors of the Buchenwald camp. Because it is known which word was spoken at which moment in which interview, a result can point to the exact place in the interview where the conversation was about, for example, “work in the factories”. For users this has several advantages: they can find out whether certain words were said or not without listening to the full interview; they can listen to the retrieved fragments directly without having to play the whole interview; and they can “do arithmetic” on the interviews (‘how often were certain words used, and by whom’). Although the speech recognizer will regularly slip up, certainly for a heritage collection such as the Buchenwald interviews, the recognition output is quite usable for searching the interviews. For users of spoken collections, such as documentary makers and researchers, much may therefore change with the arrival of search engines like the system described here (see, among others, the PF/Tijah site). Thanks to the digitization and disclosure of audio and video collections over the Internet, it is no longer necessary to visit an archive in person; spoken heritage material can be accessed from one’s own desk. Moreover, such collections no longer have to be listened to from beginning to end: by searching the spoken word, the user can request very specific fragments and listen to them directly.
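
The idea that a time-aligned transcript lets a search result point to the exact moment in an interview can be illustrated with a small sketch. It assumes a transcript stored as a flat list of words, each with an interview identifier and a start time in seconds; this layout, the names, and the sample data are hypothetical and do not describe how the portal or PF/Tijah actually stores transcripts.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    interview: str  # interview identifier
    start: float    # start time of the word in seconds
    text: str       # recognized word


def find_phrase(transcript: List[Word], phrase: str) -> List[Tuple[str, float]]:
    """Return (interview, start time) for every occurrence of the phrase."""
    target = phrase.lower().split()
    n = len(target)
    hits: List[Tuple[str, float]] = []
    for i in range(len(transcript) - n + 1):
        window = transcript[i:i + n]
        same_interview = len({w.interview for w in window}) == 1
        if same_interview and [w.text.lower() for w in window] == target:
            hits.append((window[0].interview, window[0].start))
    return hits


# Hypothetical recognizer output for two interviews.
TRANSCRIPT = [
    Word("interview-01", 12.0, "het"),
    Word("interview-01", 12.5, "werk"),
    Word("interview-01", 13.0, "in"),
    Word("interview-01", 13.25, "de"),
    Word("interview-01", 13.5, "fabrieken"),
    Word("interview-02", 40.0, "werk"),
    Word("interview-02", 40.5, "in"),
    Word("interview-02", 41.0, "de"),
    Word("interview-02", 41.5, "steengroeve"),
]

if __name__ == "__main__":
    print(find_phrase(TRANSCRIPT, "werk in de fabrieken"))
```

Each hit gives a player enough information to jump straight to the fragment, and counting hits per interview is the kind of “arithmetic” mentioned above.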

The spoken word of the survivors of the Buchenwald camp can be searched via the web portal. The search functionality was developed within the projects CHoral, an NWO-CATCH project, and MultimediaN.

Ranked XML Querying Seminar

Monday, March 17th, 2008, posted by Djoerd Hiemstra

Participants of the Dagstuhl seminar on Ranked XML Querying

The goal of the Dagstuhl seminar on Ranked XML Querying is to bring together researchers and practitioners from the database (DB), the information retrieval (IR) and the web/applications communities, and create an environment where the distinct communities collaboratively work on understanding the similarities and differences between their various approaches for querying XML data with heterogeneous structure and content, and benefit from each other’s experiences.

The workshop was attended by 27 people from three different research communities: database systems (DB), information retrieval (IR), and Web. The seminar title was interpreted in an IR-style „andish“ sense (it covered also subsets of {Ranking, XML, Querying}, with larger sets being favored) rather than the DB-style strictly conjunctive manner. So in essence, the seminar really addressed the integration of DB and IR technologies with Web 2.0 being an important target area.

[download report]

Structured Text Retrieval Models

Monday, February 25th, 2008, posted by Djoerd Hiemstra

by Djoerd Hiemstra and Ricardo Baeza-Yates

Structured text retrieval models provide a formal definition or mathematical framework for querying semi-structured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked up text, for instance as XML, where the XML elements define the structure on the text content. Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language. The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called element, region, or segment, which is defined on top of the text model’s word tokens. The query language typically defines a number of operators on content and structure such as set operators and operators like “containing” and “contained-by” to model relations between content and structure, as well as relations between the structural elements themselves. Using such a query language, the (expert) user can for instance formulate requests like “I want a paragraph discussing formal models near to a table discussing the differences between databases and information retrieval”. Here, “formal models” and “differences between databases and information retrieval” should match the content that needs to be retrieved from the database, whereas “paragraph” and “table” refer to structural constraints on the units to retrieve. The features, structuring power, and the expressiveness of the query languages of several models for structured text retrieval are discussed in this entry.
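
The three parts named above can be sketched in miniature. The sketch assumes one simple choice for each part: whitespace tokenization as the text model, regions given as inclusive token ranges as the structure model, and two operators, `containing` and `contained_by`, as the query language. It is an illustration under those assumptions, not a rendering of any specific model discussed in the entry; all names and the sample text are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    name: str   # structural label, or the term itself for a word occurrence
    start: int  # index of the first token
    end: int    # index of the last token (inclusive)


def tokenize(text: str) -> List[str]:
    """Text model: lowercase, whitespace-separated word tokens."""
    return text.lower().split()


def term_regions(tokens: List[str], term: str) -> List[Region]:
    """Content as regions: one single-token region per occurrence of term."""
    return [Region(term, i, i) for i, tok in enumerate(tokens) if tok == term]


def encloses(outer: Region, inner: Region) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Regions of outer that contain at least one region of inner."""
    return [o for o in outer if any(encloses(o, i) for i in inner)]


def contained_by(inner: List[Region], outer: List[Region]) -> List[Region]:
    """Regions of inner that lie inside at least one region of outer."""
    return [i for i in inner if any(encloses(o, i) for o in outer)]


# Hypothetical database: eight tokens, one paragraph and one table.
TOKENS = tokenize("Formal models rank text databases differ from retrieval")
PARAGRAPHS = [Region("paragraph", 0, 3)]
TABLES = [Region("table", 4, 7)]

if __name__ == "__main__":
    # "A paragraph discussing models" as a structural plus content constraint.
    print(containing(PARAGRAPHS, term_regions(TOKENS, "models")))
    # Occurrences of "databases" that fall inside a table.
    print(contained_by(term_regions(TOKENS, "databases"), TABLES))
```

Because both content and structure are expressed as regions over the same token positions, the operators can be nested freely, which is what gives such query languages their expressiveness.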

This entry will soon be published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, under the editorial guidance of Ling Liu and M. Tamer Özsu, will be a multiple volume, comprehensive, and authoritative reference on databases, data management, and database systems. Since it will be available in both print and online formats, researchers, students, and practitioners will benefit from advanced search functionality and convenient interlinking possibilities with related online content. The Encyclopedia’s online version will be accessible on the SpringerLink platform. Click here for more information about the Encyclopedia of Database Systems.


New PF/Tijah release

Monday, February 18th, 2008, posted by Djoerd Hiemstra
With the new stable release of MonetDB/XQuery (version 0.22) comes a new version (version 0.5) of PF/Tijah. We improved the main indexing data structure in this version, which is smaller and more efficient on most queries. Go to the PF/Tijah web site at: