UNIVERSITY OF TWENTE.

9th SIKS/Twente Seminar on Searching and Ranking

Understanding the Web

About

The goal of this seminar is to bring together researchers from academia and companies working on the development and evaluation of information systems, in particular retrieval, filtering, and recommending systems. Invited speakers are Weiyi Meng (State University of New York at Binghamton) and Gertjan van Noord (University of Groningen).

The symposium will take place at the campus of the University of Twente in building Carré, room 1333.
See Travel information. The event is part of the SIKS educational program. PhD students working in the field of (interactive) information filtering, recommending, and retrieval are especially encouraged to participate.

Program

13:30 Coffee and Welcome
13:45
Semantic Annotation of Search Results from the Deep Web

An increasing number of databases have become Web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep Web data collection and Internet comparison shopping, they need to be extracted and assigned correct semantic labels. In this talk, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then we annotate each group from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same Web database. Our experiments indicate that the proposed approach is highly effective.

Biography

Weiyi Meng is currently a professor in the Department of Computer Science of the State University of New York at Binghamton. He received his Ph.D. from the Department of Computer Science at the University of Illinois at Chicago in 1992. His research interests include metasearch engines, Web database integration systems, Internet-based information retrieval, information trustworthiness analysis, Web data quality, Web information extraction, sentiment analysis, and database management systems. He is the co-author of three books: "Deep Web Query Interface Understanding and Integration", "Advanced Metasearch Engine Technology", and "Principles of Database Query Processing for Advanced Applications". He has also published over 120 research papers. He is active in organizing conferences and serving on editorial boards of journals. He was a PC chair of the 2013 DASFAA conference and is currently the chair of the steering committee of the WAIM conference series.

Weiyi Meng (Department of Computer Science, State University of New York at Binghamton)
14:45
Large Scale Syntactic Annotation of Written Dutch: Lassy

Prof. van Noord will present the Lassy Small and Lassy Large treebanks, as well as related tools and applications. Lassy Small is a corpus of written Dutch texts (1,000,000 words) which has been syntactically annotated with manual verification and correction. Lassy Large is a much larger corpus (over 500,000,000 words) which has been syntactically annotated fully automatically. In addition, various browse and search tools for syntactically annotated corpora have been developed and made available. Their potential for applications in corpus linguistics and information extraction has been illustrated and evaluated in a series of case studies.

Biography

Gertjan van Noord (1961) was born in Culemborg. After obtaining a master's degree (cum laude) in Linguistics at the University of Utrecht, he obtained his PhD degree on "reversibility in natural language processing" at the same university with Jan Landsbergen and Jan van Eijck. Since 1992 he has been affiliated with the Faculty of Arts of the University of Groningen, where he holds a professorship. He is a member of the Executive Board of the Association for Computational Linguistics and co-founder of the Computational Linguistics in the Netherlands (CLIN) working group. Van Noord is well known for his open source natural language processing tools, including Alpino, FSA Utilities, Hdrug, and Textcat.
Gertjan van Noord (Faculty of Arts, University of Groningen, The Netherlands)
15:45 Closing
16:30
Distributed Deep Web Search

The World Wide Web contains billions of documents (and counting); hence, it is likely that some document will contain the answer or content you are searching for. While major search engines like Bing and Google often manage to return more or less relevant results to your query, there are plenty of situations in which they are less capable of doing so. Specifically, there is a noticeable shortcoming in situations that involve the retrieval of data from the deep web. Deep web data is difficult for today's web search engines to crawl and index, largely because the data must be accessed via complex web forms. However, deep web data can be highly relevant to the information need of the end-user. This thesis gives an overview of the problems, solutions, and paradigms for deep web search. Moreover, it proposes a new paradigm to overcome the apparent limitations in the current state of deep web search, and makes the following scientific contributions:

  1. A more specific classification scheme for deep web search systems, to better illustrate the differences and variation between these systems.
  2. Virtual surfacing, a new, and in our opinion better, deep web search paradigm which tries to combine the benefits of the two already existing paradigms, surfacing and virtual integration, and which also raises new research opportunities.
  3. A stack decoding approach which combines rules and statistical usage information to interpret the end-user's free-text query and subsequently derive filled-out web forms based on that interpretation.
  4. A practical comparison of the developed approach against a well-established text-processing toolkit.
  5. Empirical evidence that, for a single site, end-users would rather use the proposed free-text search interface than a complex web form.
Analysis of data obtained from user studies shows that the stack decoding approach works as well as, or better than, today's top-performing alternatives.

PhD Defense by Kien Tjin-Kam-Jet (University of Twente)

Sponsors

CTIT Centre for Telematics and Information Technology
SIKS Netherlands research school for Information and Knowledge Systems

Registration

Please send your name and affiliation to if you plan to attend the symposium, and help us estimate the required catering.

Organizers: Djoerd Hiemstra and Franciska de Jong.