PhD thesis by Henning Rode (University of Twente)
Text retrieval has been an active area of research for decades. Finding the best index terms, developing statistical models for estimating relevance, using relevance feedback, and keeping retrieval efficient as text collections keep growing have been important issues throughout this period. Especially in the last decade, retrieval tasks have also diversified. Instead of searching for entire documents only, passage and XML retrieval systems allow users to formulate more focused searches for finer-grained text units. Question answering systems even try to pinpoint the part of a sentence that contains the answer to a user's question, and expert search systems return a list of persons with expertise on a topic. This situation forms the starting point of this thesis, which presents a number of task-specific search solutions and tries to place them in more generic frameworks. In particular, we look at three areas: (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.
In the first area, we show how different types of context information can be incorporated into document retrieval. When users search for information, the search task is typically part of a wider working process. This search context, however, is often not reflected in the few keywords submitted to the retrieval system, although it can contain valuable information for query refinement. This work addresses two research questions related to the aim of developing context-aware retrieval systems. First, we show how already available information about the user's context can be employed effectively to obtain highly precise search results. Second, we investigate how such metadata about the search context can be gathered. The proposed "query profiles" play a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context-dependent search constraints. The effectiveness of the approach is tested in experiments on selected dimensions of the user's context.
When documents are not regarded as simple sequences of words, but their content is structured in a machine-readable form, it is straightforward to develop retrieval systems that make use of the additional structural information. Structured retrieval first calls for a suitable language that enables the user to express queries on content and structure. We examine whether and how existing query languages support the basic needs of structured querying. Our main focus, however, lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify the base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing also has to be concerned with the optimization of query plans. We investigate low-level query plans of physical database operators for simple query patterns, and demonstrate the benefits of higher-level query optimization for complex queries.
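To make the notion of a base operation for structured queries more concrete, the following is a minimal sketch of a containment join, which finds every element of one kind nested inside an element of another kind. It is a generic, stack-based textbook technique and not necessarily the index or algorithm developed in the thesis. The (start, end) region encoding, the function name containment_join, and the sample positions are all assumptions made for illustration.

```python
def containment_join(ancestors, descendants):
    """Return all (ancestor, descendant) pairs where the descendant
    region lies strictly inside the ancestor region.

    Assumptions (illustrative only): each region is a (start, end)
    tuple of token positions, regions are properly nested as in
    well-formed XML, and both input lists are sorted by start.
    """
    result = []
    stack = []  # chain of nested ancestor regions that are still open
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            # Drop regions that were closed before a starts.
            while stack and stack[-1][1] < a[0]:
                stack.pop()
            stack.append(a)
            i += 1
        # Drop regions that were closed before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Every region left on the stack encloses d.
        for a in stack:
            if d[1] < a[1]:
                result.append((a, d))
    return result


# Hypothetical positions: two nested "section" regions and a later one,
# joined with three "paragraph" regions.
sections = [(1, 10), (2, 5), (12, 20)]
paragraphs = [(3, 4), (6, 7), (13, 14)]
pairs = containment_join(sections, paragraphs)
print(pairs)
```

Because both inputs are read once in document order, the work is proportional to the input sizes plus the number of result pairs, which is the kind of property that makes such an operation attractive as a physical database operator.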
New search tasks and interfaces for the presentation of search results, such as faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities, such as persons, organizations, or dates, instead of documents or text passages. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities that are based on the relevance of the texts that mention the entity. A graph-based relevance propagation framework is introduced for this purpose, which makes it possible to derive the relevance of entities. Several options for modeling entity containment graphs and different relevance propagation approaches are tested, demonstrating the usefulness of the graph-based ranking framework.
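The idea of deriving entity relevance from the texts that mention an entity can be illustrated with a minimal sketch. It assumes the simplest possible variant: a bipartite containment graph between documents and entities, and a single propagation step that sums weighted document scores. The function name rank_entities, the edge weights, and the sample scores are hypothetical, and the thesis tests several graph models and propagation approaches that this sketch does not cover.

```python
from collections import defaultdict


def rank_entities(doc_scores, mentions):
    """One-step relevance propagation from documents to entities.

    doc_scores: document id -> relevance score for the query.
    mentions:   document id -> {entity: edge weight}, i.e. the edges
                of a bipartite entity containment graph (assumed form).
    Returns (entity, score) pairs sorted by descending score.
    """
    entity_scores = defaultdict(float)
    for doc_id, entities in mentions.items():
        relevance = doc_scores.get(doc_id, 0.0)
        for entity, weight in entities.items():
            # Each mentioning document passes on a share of its relevance.
            entity_scores[entity] += relevance * weight
    return sorted(entity_scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical retrieval scores and entity mentions.
doc_scores = {"d1": 0.5, "d2": 0.25, "d3": 0.25}
mentions = {
    "d1": {"Alice": 1.0},
    "d2": {"Alice": 1.0, "Bob": 1.0},
    "d3": {"Bob": 1.0},
}
ranking = rank_entities(doc_scores, mentions)
print(ranking)  # [('Alice', 0.75), ('Bob', 0.5)]
```

In this sketch an entity benefits both from being mentioned often and from being mentioned in highly relevant texts. Other choices, such as normalizing by the number of mentions or propagating over several steps, would be alternative design options within the same kind of framework.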