2013 – Page 2 – Djoerd Hiemstra

Frans van der Sluis defends his PhD thesis on Information Experience

When Complexity becomes Interesting: An Inquiry into the Information eXperience

by Frans van der Sluis

To date, most research in information retrieval and related fields has been concerned primarily with efficiency and effectiveness of either the information system or the interaction of the user with the information system. At the same time, understanding the experience of a user during information interaction is recognized as a grand challenge for the development of information systems. There is a widely shared intuition that the value of the retrieved information is dependent on more than system characteristics such as the topical overlap between a query and a document. As it is not obvious how to embrace this intuition, this challenge has mostly been ignored. This dissertation embarked upon the challenge of describing and developing an operational model of the Information eXperience (IX) – the experience during the interaction with information. This task was decomposed into three sub-challenges:

Transform the fuzzy concept of the IX into a formalized one.
Develop a model of textual complexity that enables an information system to influence a user's IX.
Identify and influence the causes of the experience of interest in text.

Debasis Ganguly successfully defends PhD thesis on Topical Relevance Models

Today, Debasis Ganguly successfully defended his PhD thesis at Dublin City University.

Topical Relevance Models

by Debasis Ganguly

An inherent characteristic of information retrieval (IR) is that the query expressing a user's information need is often multi-faceted, that is, it encapsulates more than one specific potential sub-information need. This multi-facetedness of queries manifests itself as a topic distribution in the retrieved set of documents, where each document can be considered as a mixture of topics, one or more of which may correspond to the sub-information needs expressed in the query. In some specific domains of IR, such as patent prior art search, where the queries are full patent articles and the objective is to (in)validate the claims contained therein, the queries themselves are multi-topical in addition to the retrieved set of documents. The overall objective of the research described in this thesis involves investigating techniques to recognize and exploit these multi-topical characteristics of the retrieved documents and the queries in IR and in relevance feedback.

Vertical Selection in the Information Domain of Children

Sergio Duarte Torres' paper on vertical selection for search for children is nominated for the JCDL Best Student Paper Award.

Vertical Selection in the Information Domain of Children

by Sergio Duarte Torres, Djoerd Hiemstra and Theo Huibers

In this paper we explore the vertical selection methods in aggregated search in the specific domain of topics for children between 7 and 12 years old. A test collection consisting of 25 verticals, 3.8K queries and relevance assessments for a large sample of these queries, mapping relevant verticals to queries, was built. We gather relevance assessments by envisaging two aggregated search systems: one in which the Web vertical is always displayed and one in which each vertical is assessed independently from the Web vertical. We show that both approaches lead to a different set of relevant verticals and that the former is prone to bias of visually oriented verticals. In the second part of this paper we estimate the size of the verticals for the target domain. We show that employing the global size and domain-specific size estimation of the verticals leads to significant improvements when using state-of-the-art methods of vertical selection. We also introduce a novel vertical and query representation based on tags from social media and we show that its use leads to significant performance gains.

Presented on 23 July at the joint ACM/IEEE conference on Digital Libraries JCDL 2013 in Indianapolis, USA.


MIREX 0.3 for ClueWeb12

We released a new version, MIREX 0.3, for TREC Web track participants who work on the new ClueWeb12 dataset. The code now uses the new Hadoop API. The code was tested on Cloudera's cdh3u5 Hadoop distribution, Hadoop version 0.20.2, and, with some minor tweaks of the build.xml file, also on Cloudera cdh4 versions. Download MIREX at:
http://mirex.sourceforge.net.

Anchor text for ClueWeb12

We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:

ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)

The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. To keep the files manageable, the text is cut once more than 10MB of anchors have been collected for one page. Web pages were truncated at 50KB before extracting the anchors. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files containing (TREC-ID, URL, ANCHOR TEXT) records. The anchor text extraction is described in (please cite the report if you use the data in your research):

Djoerd Hiemstra and Claudia Hauff. “MIREX: MapReduce Information Retrieval Experiments” CTIT Technical Report TR-CTIT-10-15, Centre for Telematics and Information Technology, University of Twente, ISSN 1381-3625, 2010 (arXiv preprint 1004.4489)

The source code is available from: http://mirex.sourceforge.net.
(See also Anchor Text for ClueWeb09.)
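
The tab-separated (TREC-ID, URL, ANCHOR TEXT) layout described above could be read along the following lines. This is only a minimal sketch, not part of MIREX: the names `AnchorRecord`, `parse_anchor_line` and `read_anchors` are hypothetical, and it assumes one record per line, UTF-8 text, and that any tabs after the second field belong to the anchor text. Check these assumptions against the actual files before relying on them.

```python
import gzip
from typing import Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One (TREC-ID, URL, ANCHOR TEXT) record; the field names are illustrative."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_line(line: str) -> AnchorRecord:
    """Split one tab-separated line into its three fields."""
    # maxsplit=2 keeps any further tabs inside the anchor-text field
    # (an assumption about the file layout, see the text above).
    parts = line.rstrip("\n").split("\t", 2)
    if len(parts) < 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}")
    return AnchorRecord(*parts)


def read_anchors(path: str) -> Iterator[AnchorRecord]:
    """Stream records from one gzipped file without loading it into memory."""
    # UTF-8 is assumed; undecodable bytes are replaced rather than raising.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_anchor_line(line)
```

Because `read_anchors` is a generator, a loop such as `for record in read_anchors(path): ...` processes one record at a time, which matters for files of this size.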

STW Valorization Grant for Q-Able

The University of Twente spin-off Q-Able receives a Valorization Grant Phase 1 from the Dutch Technology Foundation STW to further develop and market their OneBox search technology.

When it comes to web applications, users love the “single text box” interface because it is extremely easy to use. However, much information on the web is stored in structured databases and can only be accessed by filling out a web form with multiple input fields. Examples include planning a trip, booking a hotel room, looking for a second-hand car, etc.

The approach of web search engines – to crawl sites and make a central index of the pages – does not suffice in many cases. First, some sites are hard to crawl because the pages can only be accessed via the web form. Second, some sites provide information that changes quickly, like available hotel rooms, and crawled pages would be almost immediately outdated. Third, some sites provide information that is generated dynamically, like planning a trip from one address to another on a certain date, and it is impossible to crawl all combinations of addresses and dates. Finally, a simple text index that search engines provide does not easily allow structured queries on arbitrary fields. In all these cases, the sites that provide such information can be found using a search engine like Google, but the information itself can only be retrieved after filling in a web form. Filling in one or more web forms with many fields can be a tedious job.

Q-Able replaces a site's web forms by OneBox, a simple text field, giving complex sites the look and feel of Google: a single field for asking questions and performing simple transactions. OneBox allows users to plan a trip by typing for instance “Next Wednesday from Enschede to Amsterdam arriving at 9am”, or to search for second-hand cars by typing “Ford C-max 4-doors less than 200,000 kilometres from before 2008”. OneBox can be configured to operate on any web site that provides complex web forms. Furthermore, OneBox can be configured to operate on multiple web sites using a single simple text field. This way, to search for instance for a second-hand car, users enter a single query, and search multiple second-hand car sites with a single click. OneBox only replaces the user interface of a web database: It does not copy, crawl or otherwise index the data itself.

OneBox is the result of the Ph.D. research project done by Kien Tjin-Kam-Jet at the University of Twente. His research identified several successful novel approaches to query understanding by combining rule-based approaches with probabilistic approaches that rank query interpretations. Furthermore, the research resulted in an efficient implementation of OneBox that needs only a fraction of a second to interpret queries, even in complex configurations for accessing multiple web databases. Treinplanner.info, Q-Able's first public demonstration of OneBox, demonstrates natural search for the Dutch Railways (Nederlandse Spoorwegen) travel planner, and was well-received in user questionnaires, on Twitter, and on the Dutch national public radio and television. Q-Able will use the STW Valorization Grant to investigate the technical and commercial feasibility of OneBox.

Using a Stack Decoder for Structured Search

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

We describe a novel and flexible method that translates free-text queries to structured queries for filling out web forms. This can benefit searching in web databases which only allow access to their information through complex web forms. We introduce boosting and discounting heuristics, and use the constraints imposed by a web form to find a solution both efficiently and effectively. Our method is more efficient and shows improved performance over a baseline system.

To be presented at the 10th International Conference on Flexible Query Answering Systems (FQAS 2013) in Granada, Spain on 18-20 September.


TREC Federated Web Search track

http://sites.google.com/site/trecfedweb/

First submission due on August 11, 2013

The Federated Web Search track is part of NIST's Text REtrieval Conference (TREC 2013). The track investigates techniques for the selection and combination of search results from a large number of real on-line web search services. The data set, consisting of search results from 157 search engines, is now available. The search engines cover a broad range of categories, including news, books, academic, travel, etc. We have included one big general web search engine, which is based on a combination of existing web search engines.

Federated search is the approach of querying multiple search engines simultaneously, and combining their results into one coherent search engine result page. The goal of the Federated Web Search (FedWeb) track is to evaluate approaches to federated search at very large scale in a realistic setting, by combining the search results of existing web search engines. This year the track focuses on resource selection (selecting the search engines that should be queried), and results merging (combining the results into a single ranked list). You may submit up to 3 runs for each task. All runs will be judged. Upon submission, you will be asked for each run to indicate whether you used result snippets and/or pages, and whether any external data was used. Precise guidelines can be found at:
http://sites.google.com/site/trecfedweb/
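
To make the two tasks concrete, the sketch below shows one very simple baseline for each. It is an illustration only, not the track's method or its evaluation setup: `select_resources` scores engines by counting query terms in sampled snippets, and `merge_round_robin` interleaves the ranked lists of the selected engines. Both function names, the `samples` structure and the scoring rule are assumptions made for this example.

```python
from collections import Counter
from itertools import zip_longest
from typing import Dict, List


def select_resources(query: str, samples: Dict[str, List[str]], k: int) -> List[str]:
    """Resource selection: return the k engines whose sampled snippets
    contain the query terms most often (a deliberately naive score)."""
    terms = query.lower().split()
    scores = {}
    for engine, snippets in samples.items():
        counts = Counter(word for s in snippets for word in s.lower().split())
        scores[engine] = sum(counts[t] for t in terms)
    # Highest score first; ties are broken by engine name for determinism.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:k]


def merge_round_robin(result_lists: List[List[str]]) -> List[str]:
    """Results merging: take rank 1 from every list, then rank 2, and so on,
    dropping results that were already emitted."""
    merged: List[str] = []
    seen = set()
    for tier in zip_longest(*result_lists):
        for result in tier:
            if result is not None and result not in seen:
                seen.add(result)
                merged.append(result)
    return merged
```

Round-robin merging ignores the engines' own scores, which are generally not comparable across engines; that is what makes it a common starting point rather than a strong method.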

Track coordinators

Djoerd Hiemstra – University of Twente, The Netherlands
Thomas Demeester – Ghent University, Belgium
Dolf Trieschnigg – University of Twente, The Netherlands
Dong Nguyen – University of Twente, The Netherlands

Keynote by Ravi Kumar

We are very proud that Ravi Kumar from Google agreed to give a keynote speech at the CTIT Symposium on Big Data and the Emergence of Data Science. Kumar, who is well-known for his work on web and data mining and algorithms for large data sets, has been a senior staff research scientist at Google since June 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! Research. He obtained his Ph.D. in Computer Science from Cornell University in 1998.
Ravi Kumar's talk will cover two non-conventional computational models for analyzing big data. The first is data streams: in this model, data arrives in a stream and the algorithm is tasked with computing a function of the data without explicitly storing it. The second is map-reduce: in this model, data is distributed across many machines and computation is done as a sequence of map and reduce operations. Kumar will present a few algorithms in these models and discuss their scalability.
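
For readers new to these two models, here is a toy illustration of each. These are textbook examples chosen for this post, not the algorithms from the talk: `streaming_mean` computes a function of a stream in one pass with constant memory, and `map_reduce_word_count` simulates the map, shuffle and reduce phases on a single machine. All function names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def streaming_mean(stream: Iterable[float]) -> float:
    """Data-stream model: one pass, keeping only a count and a running mean."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no values stored
    return mean  # 0.0 for an empty stream


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def map_reduce_word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle (group by key) and reduce in a single process.
    In a real cluster each phase would be spread over many machines."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for document in documents:
        for key, value in map_words(document):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

The streaming function never holds more than two numbers regardless of stream length, while the map and reduce functions only ever see one document or one key at a time, which is what allows the work to be distributed.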

The workshop takes place on Tuesday 4 June at the University of Twente. Other invited speakers at the CTIT symposium are Maarten de Rijke (U. Amsterdam) and Milan Petkovic (Philips).