Archive for 2013

Debasis Ganguly successfully defends PhD thesis on Topical Relevance Models

Wednesday, August 28th, 2013, posted by Djoerd Hiemstra

Today, Debasis Ganguly successfully defended his PhD thesis at Dublin City University.

Topical Relevance Models

by Debasis Ganguly

An inherent characteristic of information retrieval (IR) is that the query expressing a user’s information need is often multi-faceted, that is, it encapsulates more than one specific potential sub-information need. This multi-facetedness of queries manifests itself as a topic distribution in the retrieved set of documents, where each document can be considered as a mixture of topics, one or more of which may correspond to the sub-information needs expressed in the query. In some specific domains of IR, such as patent prior art search, where the queries are full patent articles and the objective is to (in)validate the claims contained therein, the queries themselves are multi-topical in addition to the retrieved set of documents. The overall objective of the research described in this thesis is to investigate techniques to recognize and exploit these multi-topical characteristics of the retrieved documents and the queries, both in IR and in relevance feedback.

Vertical Selection in the Information Domain of Children

Monday, August 5th, 2013, posted by Djoerd Hiemstra

Sergio Duarte Torres’ paper on vertical selection for search for children is nominated for the JCDL Best Student Paper Award.

Vertical Selection in the Information Domain of Children

by Sergio Duarte Torres, Djoerd Hiemstra and Theo Huibers

In this paper we explore vertical selection methods in aggregated search in the specific domain of topics for children between 7 and 12 years old. A test collection was built consisting of 25 verticals, 3.8K queries, and relevance assessments for a large sample of these queries, mapping relevant verticals to queries. We gather relevance assessments by envisaging two aggregated search systems: one in which the Web vertical is always displayed, and one in which each vertical is assessed independently from the Web vertical. We show that both approaches lead to a different set of relevant verticals and that the former is prone to bias towards visually oriented verticals. In the second part of this paper we estimate the size of the verticals for the target domain. We show that employing the global size and domain-specific size estimation of the verticals leads to significant improvements when using state-of-the-art methods of vertical selection. We also introduce a novel vertical and query representation based on tags from social media and we show that its use leads to significant performance gains.

Presented on 23 July at the joint ACM/IEEE conference on Digital Libraries JCDL 2013 in Indianapolis, USA.

[download pdf]

SemEval’s Sentiment Analysis in Twitter

Friday, July 5th, 2013, posted by Djoerd Hiemstra

UT-DB: An Experimental Study on Sentiment Analysis in Twitter

Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher

This paper describes our system for participating in SemEval 2013 Task 2-B: Sentiment Analysis in Twitter. Given a message, our system classifies whether the message has positive, negative or neutral sentiment. It uses a co-occurrence rate model. The training data are constrained to the data provided by the task organizers (no other tweet data are used). We consider 9 types of features and use a subset of them in our submitted system. To see the contribution of each type of feature, we conduct an experimental study in which we leave one type of feature out each time. Results suggest that unigrams are the most important features, that bigrams and POS tags do not seem helpful, and that stopwords should be retained to achieve the best results. The overall results of our system are promising given the constrained features and data we use.
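The leave-one-out study described in the abstract can be sketched roughly as follows. This is only an illustration of the ablation loop: the feature type names and the `evaluate` callback are hypothetical, and `toy_evaluate` is a stand-in that does not reproduce the paper's co-occurrence rate model or its results.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def leave_one_out(
    feature_types: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Return, per feature type, the score drop caused by leaving it out."""
    all_types = frozenset(feature_types)
    baseline = evaluate(all_types)
    return {t: baseline - evaluate(all_types - {t}) for t in sorted(all_types)}


def toy_evaluate(active: FrozenSet[str]) -> float:
    """Hypothetical scorer: pretends that only unigrams carry any signal.

    In a real study this would train and score a classifier that uses
    exactly the feature types in `active`.
    """
    return float(len(active & {"unigrams"}))


drops = leave_one_out(["unigrams", "bigrams", "pos_tags"], toy_evaluate)
for feature_type, drop in drops.items():
    print(f"without {feature_type}: score drops by {drop}")
```

A large drop marks a feature type the system depends on; a drop near zero (or a negative one) suggests the feature type can be left out.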

[download pdf]

MIREX 0.3 for ClueWeb12

Monday, July 1st, 2013, posted by Djoerd Hiemstra

We released a new version 0.3 of MIREX for TREC Web track participants who work on the new ClueWeb12 dataset. The code now uses the new Hadoop API. The code was tested on Cloudera’s cdh3u5 Hadoop distribution, Hadoop version 0.20.2, and, with some minor tweaks of the build.xml file, also on Cloudera cdh4 versions. Download MIREX at:

Anchor text for ClueWeb12

Thursday, June 27th, 2013, posted by Djoerd Hiemstra

We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:

  • ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)
The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. The text is cut after more than 10MB of anchors have been collected for one page to keep the file manageable. Web pages were truncated at 50KB before extracting the anchors. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files with the fields (TREC-ID, URL, ANCHOR TEXT).

The anchor text extraction is described in (please cite the report if you use the data in your research):

The source code is available from:
(See also Anchor Text for ClueWeb09.)
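
Reading the tab-separated layout described above could look roughly like the sketch below. It assumes one record per line, UTF-8 encoding, and exactly three fields in the order (TREC-ID, URL, ANCHOR TEXT); the sample record is made up, and the actual files may differ in detail.

```python
import gzip
import io
from typing import Iterable, Iterator, Tuple


def read_anchors(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (trec_id, url, anchor_text) from tab-separated lines."""
    for line in lines:
        # Split on the first two tabs only, so tabs inside the anchor text survive.
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3:  # skip malformed lines
            yield parts[0], parts[1], parts[2]


# Made-up sample record, gzipped in memory to mimic the distributed files.
sample = "clueweb12-0000tw-00-00000\thttp://example.org/\tExample home page\n"
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as stream:
    records = list(read_anchors(stream))

print(records)
```

For the real data, `gzip.open` can be given a file path instead of the in-memory buffer, and the records can be consumed lazily rather than collected into a list.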

STW Valorization Grant for Q-Able

Friday, June 21st, 2013, posted by Djoerd Hiemstra

The University of Twente spin-off Q-Able receives a Valorization Grant Phase 1 from the Dutch Technology Foundation STW to further develop and market their OneBox search technology.

When it comes to web applications, users love the “single text box” interface because it is extremely easy to use. However, much information on the web is stored in structured databases and can only be accessed by filling out a web form with multiple input fields. Examples include planning a trip, booking a hotel room, looking for a second-hand car, etc.

The approach of web search engines – to crawl sites and make a central index of the pages – does not suffice in many cases. First, some sites are hard to crawl because the pages can only be accessed via the web form. Second, some sites provide information that changes quickly, like available hotel rooms, and crawled pages would be almost immediately outdated. Third, some sites provide information that is generated dynamically, like planning a trip from one address to another on a certain date, and it is impossible to crawl all combinations of addresses and dates. Finally, a simple text index that search engines provide does not easily allow structured queries on arbitrary fields. In all these cases, the sites that provide such information can be found using a search engine like Google, but the information itself can only be retrieved after filling in a web form. Filling in one or more web forms with many fields can be a tedious job.

Q-Able replaces a site’s web forms by OneBox, a simple text field, giving complex sites the look and feel of Google: a single field for asking questions and performing simple transactions. OneBox allows users to plan a trip by typing for instance “Next Wednesday from Enschede to Amsterdam arriving at 9am”, or to search for second-hand cars by typing “Ford C-max 4-doors less than 200,000 kilometres from before 2008”. OneBox can be configured to operate on any web site that provides complex web forms. Furthermore, OneBox can be configured to operate on multiple web sites using a single simple text field. This way, to search for instance for a second-hand car, users enter a single query, and search multiple second-hand car sites with a single click. OneBox only replaces the user interface of a web database: It does not copy, crawl or otherwise index the data itself.

OneBox is the result of the Ph.D. research project done by Kien Tjin-Kam-Jet at the University of Twente. His research identified several successful novel approaches to query understanding by combining rule-based approaches with probabilistic approaches that rank query interpretations. Furthermore, the research resulted in an efficient implementation of OneBox that needs only a fraction of a second to interpret queries, even in complex configurations for accessing multiple web databases. Q-Able’s first public demonstration of OneBox provides natural search for the Dutch Railways (Nederlandse Spoorwegen) travel planner, and was well received in user questionnaires, on Twitter, and on Dutch national public radio and television. Q-Able will use the STW Valorization Grant to investigate the technical and commercial feasibility of OneBox.

Using a Stack Decoder for Structured Search

Monday, June 10th, 2013, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

We describe a novel and flexible method that translates free-text queries to structured queries for filling out web forms. This can benefit searching in web databases which only allow access to their information through complex web forms. We introduce boosting and discounting heuristics, and use the constraints imposed by a web form to find a solution both efficiently and effectively. Our method is more efficient and shows improved performance over a baseline system.
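
To give a feel for the general idea of a stack decoder, here is a minimal best-first search that assigns query tokens to form fields. It is a sketch under strong assumptions, not the method from the paper: the cost function `toy_cost`, the field names, and the "each field at most once" rule are all hypothetical, and the boosting and discounting heuristics are left out.

```python
import heapq
from typing import Callable, List, Optional, Tuple

Assignment = Tuple[Optional[str], ...]  # one field name (or None) per token


def stack_decode(
    tokens: List[str],
    fields: List[str],
    cost: Callable[[str, Optional[str]], float],
) -> Tuple[float, Assignment]:
    """Best-first search over partial token-to-field assignments.

    `cost(token, field)` must be >= 0 (lower is better); field is None when
    the token stays unassigned. Each field may be used at most once, which
    stands in for the constraints that a web form imposes.
    """
    heap: List[Tuple[float, int, Assignment]] = [(0.0, 0, ())]
    tie = 0  # unique tie-breaker so the heap never compares assignments
    while heap:
        total, _, partial = heapq.heappop(heap)
        if len(partial) == len(tokens):
            return total, partial  # cheapest complete hypothesis
        token = tokens[len(partial)]
        used = {f for f in partial if f is not None}
        for field in [None] + [f for f in fields if f not in used]:
            tie += 1
            heapq.heappush(heap, (total + cost(token, field), tie, partial + (field,)))
    raise ValueError("no hypothesis found")


def toy_cost(token: str, field: Optional[str]) -> float:
    """Hypothetical costs: a cheap match, a small penalty for skipping a token."""
    if field is None:
        return 1.0
    if field == "brand" and token.isalpha():
        return 0.0
    if field == "year" and token.isdigit() and len(token) == 4:
        return 0.0
    return 5.0


best_cost, best = stack_decode(["ford", "4-doors", "2008"], ["brand", "year"], toy_cost)
print(best_cost, best)
```

Because all costs are non-negative, the first complete hypothesis popped from the priority queue is also the cheapest one, so the search can stop there without enumerating every assignment.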

To be presented at the 10th international conference on Flexible Query Answering Systems (FQAS 2013) in Granada, Spain on 18-20 September.

[download preprint]

TREC Federated Web Search track

Thursday, June 6th, 2013, posted by Djoerd Hiemstra

First submission due on August 11, 2013

The Federated Web Search track is part of NIST’s Text REtrieval Conference TREC 2013. The track investigates techniques for the selection and combination of search results from a large number of real on-line web search services. The data set, consisting of search results from 157 search engines, is now available. The search engines cover a broad range of categories, including news, books, academic, travel, etc. We have included one big general web search engine, which is based on a combination of existing web search engines.

Federated search is the approach of querying multiple search engines simultaneously, and combining their results into one coherent search engine result page. The goal of the Federated Web Search (FedWeb) track is to evaluate approaches to federated search at very large scale in a realistic setting, by combining the search results of existing web search engines. This year the track focuses on resource selection (selecting the search engines that should be queried), and results merging (combining the results into a single ranked list). You may submit up to 3 runs for each task. All runs will be judged. Upon submission, you will be asked for each run to indicate whether you used result snippets and/or pages, and whether any external data was used. Precise guidelines can be found at:


Track coordinators

  • Djoerd Hiemstra - University of Twente, The Netherlands
  • Thomas Demeester - Ghent University, Belgium
  • Dolf Trieschnigg - University of Twente, The Netherlands
  • Dong Nguyen - University of Twente, The Netherlands
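
As a rough illustration of the two tasks described above, the sketch below selects resources by a per-engine score and merges their results by simple round-robin interleaving. The engine names, scores and document identifiers are made up, and round-robin is only one naive baseline, not a method prescribed by the track.

```python
from itertools import zip_longest
from typing import Dict, List


def select_resources(scores: Dict[str, float], k: int) -> List[str]:
    """Resource selection: keep the k engines with the highest score."""
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


def round_robin_merge(result_lists: List[List[str]]) -> List[str]:
    """Results merging: interleave ranked lists, dropping duplicates."""
    merged: List[str] = []
    seen = set()
    for tier in zip_longest(*result_lists):  # rank 1 of each engine, then rank 2, ...
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical per-query engine scores and result lists.
scores = {"news": 0.9, "books": 0.2, "travel": 0.5}
results = {"news": ["n1", "n2"], "books": ["b1"], "travel": ["t1", "n1"]}

chosen = select_resources(scores, k=2)
merged = round_robin_merge([results[name] for name in chosen])
print(chosen, merged)
```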

Keynote by Ravi Kumar

Thursday, May 23rd, 2013, posted by Djoerd Hiemstra

We are very proud that Ravi Kumar from Google agreed to give a keynote speech at the CTIT Symposium on Big Data and the Emergence of Data Science. Kumar, who is well-known for his work on web and data mining and algorithms for large data sets, has been a senior staff research scientist at Google since June 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! Research. He obtained his Ph.D. in Computer Science from Cornell University in 1998.

Ravi Kumar’s talk will cover two non-conventional computational models for analyzing big data. The first is data streams: in this model, data arrives in a stream and the algorithm is tasked with computing a function of the data without explicitly storing it. The second is map-reduce: in this model, data is distributed across many machines and computation is done as a sequence of map and reduce operations. Kumar will present a few algorithms in these models and discuss their scalability.
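
The two models can be illustrated with a pair of textbook examples. These are generic sketches chosen for brevity and are not taken from the talk: a one-pass running mean for the streaming model, and an in-process word count that mimics the map, shuffle and reduce steps on a single machine.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def streaming_mean(stream: Iterable[float]) -> float:
    """Data stream model: one pass, constant memory, items are never stored."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update of the running mean
    return mean


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values that share the same key."""
    return word, sum(counts)


def map_reduce(documents: Iterable[str]) -> Dict[str, int]:
    groups: Dict[str, List[int]] = defaultdict(list)
    for doc in documents:  # on a cluster, each machine would map its own shard
        for key, value in map_phase(doc):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reduce_phase(key, values) for key, values in groups.items())


print(streaming_mean(iter([2.0, 4.0, 6.0])))
print(map_reduce(["big data", "data science"]))
```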

The workshop takes place on Tuesday 4 June at the University of Twente. Other invited speakers at the CTIT symposium are Maarten de Rijke (U. Amsterdam) and Milan Petkovic (Philips).

Federated Search Made Easy

Tuesday, May 21st, 2013, posted by Djoerd Hiemstra

by Dolf Trieschnigg, Kien Tjin-Kam-Jet, and Djoerd Hiemstra

Building a federated search engine based on a large number of existing web search engines is a challenge: implementing the programming interface (API) for each search engine is an exacting and time-consuming job. In this demonstration we present SearchResultFinder, a browser plugin which speeds up determining reusable XPaths for extracting search result items from HTML search result pages. Based on a single search result page, the tool presents a ranked list of candidate extraction XPaths and allows highlighting to view the extraction result. An evaluation with 148 web search engines shows that in 90% of the cases a correct XPath is suggested.
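
Once an extraction XPath is known, applying it is straightforward. The sketch below is not part of SearchResultFinder: it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` on a made-up, well-formed page, whereas real HTML result pages would usually need a forgiving HTML parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Made-up, well-formed result page; real pages are rarely this tidy.
PAGE = """<html><body>
<div id="nav"><a href="/about">About</a></div>
<div class="results">
  <div class="result"><a href="http://example.org/1">First hit</a><p>Snippet one</p></div>
  <div class="result"><a href="http://example.org/2">Second hit</a><p>Snippet two</p></div>
</div>
</body></html>"""

# Hypothetical extraction XPath, of the kind a tool might suggest for this page.
RESULT_XPATH = ".//div[@class='result']"


def extract_results(page: str, xpath: str) -> List[Dict[str, Optional[str]]]:
    """Apply one XPath to a page and return one record per matched result item."""
    root = ET.fromstring(page)
    items = []
    for node in root.findall(xpath):
        link = node.find("a")
        snippet = node.find("p")
        items.append({
            "url": link.get("href") if link is not None else None,
            "title": link.text if link is not None else None,
            "snippet": snippet.text if snippet is not None else None,
        })
    return items


results = extract_results(PAGE, RESULT_XPATH)
for item in results:
    print(item)
```

The same XPath can be reused on every result page of that search engine, which is what makes a single suggested expression valuable.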

The software can be downloaded as a Firefox plugin.

SIGIR 2013 demonstration

The tool was demonstrated at the ACM SIGIR Conference in Dublin.

[download pdf]