Structured Text Retrieval Models

by Djoerd Hiemstra and Ricardo Baeza-Yates

Structured text retrieval models provide a formal definition or mathematical framework for querying semi-structured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked-up text, for instance as XML, where the XML elements define the structure on the text content. Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language. The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called an element, region, or segment, which is defined on top of the text model’s word tokens. The query language typically defines a number of operators on content and structure, such as set operators and operators like “containing” and “contained-by”, to model relations between content and structure, as well as relations between the structural elements themselves. Using such a query language, the (expert) user can for instance formulate requests like “I want a paragraph discussing formal models near a table discussing the differences between databases and information retrieval”. Here, “formal models” and “differences between databases and information retrieval” should match the content that needs to be retrieved from the database, whereas “paragraph” and “table” refer to structural constraints on the units to retrieve. The features, structuring power, and the expressiveness of the query languages of several models for structured text retrieval are discussed in this entry.
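To make the three parts concrete, here is a minimal sketch in Python. It is not one of the models discussed in the entry: the names `Region`, `tokenize`, `matches`, `containing`, and `contained_by` are hypothetical, the text model is plain lowercased whitespace tokenization, and regions are assumed to be contiguous token spans given by inclusive start and end positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous span of tokens (assumed representation of an element)."""
    name: str   # structural label, e.g. "paragraph" or "table"
    start: int  # position of the first token (inclusive)
    end: int    # position of the last token (inclusive)


def tokenize(text):
    """Text model: lowercase whitespace tokenization (no stemming or stop words)."""
    return text.lower().split()


def matches(tokens, phrase):
    """Content operator: one region per exact occurrence of the phrase."""
    words = tokenize(phrase)
    n = len(words)
    return [Region("match", k, k + n - 1)
            for k in range(len(tokens) - n + 1)
            if tokens[k:k + n] == words]


def _encloses(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(outer_set, inner_set):
    """Regions of outer_set that contain at least one region of inner_set."""
    return [o for o in outer_set if any(_encloses(o, i) for i in inner_set)]


def contained_by(inner_set, outer_set):
    """Regions of inner_set that lie inside at least one region of outer_set."""
    return [i for i in inner_set if any(_encloses(o, i) for o in outer_set)]


# Toy database: eight tokens, one paragraph (tokens 0-3), one table (tokens 4-7).
tokens = tokenize("Formal models of retrieval databases versus information retrieval")
paragraphs = [Region("paragraph", 0, 3)]
tables = [Region("table", 4, 7)]

# "A paragraph discussing formal models"
hits = containing(paragraphs, matches(tokens, "formal models"))

# Occurrences of "retrieval" that fall inside a table
in_table = contained_by(matches(tokens, "retrieval"), tables)
```

In this toy database, `hits` holds the single paragraph, while `in_table` keeps only the occurrence of “retrieval” at position 7, because the one at position 3 lies in the paragraph. A proximity operator such as “near” and ranking of results are left out of this sketch.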

This entry will soon be published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, under the editorial guidance of Ling Liu and M. Tamer Özsu, will be a multi-volume, comprehensive, and authoritative reference on databases, data management, and database systems. Since it will be available in both print and online formats, researchers, students, and practitioners will benefit from advanced search functionality and convenient interlinking with related online content. The Encyclopedia’s online version will be accessible on the SpringerLink platform.

[draft]

One Response to “Structured Text Retrieval Models”

  1. Chan Says:
    Respected Sir, I am Chan from Pakistan, currently doing an MPhil in Computer Science. I read your article on Structured Document Retrieval and found it very good for beginners like me to start with the basic concepts. I appreciate your contribution to the IRS field. Sir, I want to know about current research issues and problems, or areas where work is currently in progress. I would especially be delighted to know about the most current problems and research areas in Structured Document Retrieval Models. In your article you talk about explicit/implicit and static/dynamic structure. Are these current research areas, or are they already done with? Any other kind guidance from you would be welcome. With regards, Chan, Pakistan