Archive for 2013

Taily: Shard Selection Using the Tail of Score Distributions

Wednesday, May 15th, 2013, posted by Djoerd Hiemstra

by Robin Aly, Djoerd Hiemstra, and Thomas Demeester

Search engines can improve their efficiency by selecting only a few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query’s score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of the score distributions from the mean and variance of the score function’s features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
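The core idea, fitting a Gamma distribution to per-shard score statistics by the method of moments and selecting shards whose tails promise highly scored documents, can be sketched as follows. The shard statistics, the threshold, and all function names here are illustrative assumptions, not the paper's notation or data:

```python
import math

def fit_gamma(mean, variance):
    """Method-of-moments fit: shape k and scale theta from mean and variance."""
    k = mean * mean / variance
    theta = variance / mean
    return k, theta

def reg_lower_gamma(k, x, eps=1e-12, max_iter=500):
    """Regularized lower incomplete gamma P(k, x) via its series expansion."""
    if x <= 0:
        return 0.0
    term = 1.0 / k
    total = term
    for n in range(1, max_iter):
        term *= x / (k + n)
        total += term
        if term < total * eps:
            break
    return total * math.exp(k * math.log(x) - x - math.lgamma(k))

def docs_above(n_docs, mean, variance, threshold):
    """Estimated number of documents in a shard scoring above the threshold."""
    k, theta = fit_gamma(mean, variance)
    tail = 1.0 - reg_lower_gamma(k, threshold / theta)
    return n_docs * tail

# Toy per-shard score statistics (illustrative numbers, not real data).
shards = {
    "shard_a": dict(n_docs=100_000, mean=2.0, variance=1.5),
    "shard_b": dict(n_docs=100_000, mean=1.2, variance=0.8),
}
estimates = {name: docs_above(s["n_docs"], s["mean"], s["variance"], threshold=6.0)
             for name, s in shards.items()}
# Select shards expected to contribute at least one highly scored document.
selected = [name for name, est in sorted(estimates.items(), key=lambda kv: -kv[1])
            if est >= 1.0]
```

Because only means and variances are needed, they can be aggregated from per-term statistics at query time without consulting a sampled central index, which is where the efficiency gain comes from.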

SIGIR 2013 presentation

Presented at the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in Dublin, Ireland, 28 July - 1 August.

[download pdf]

Twente at NTCIR 2013

Thursday, May 2nd, 2013, posted by Djoerd Hiemstra

An API-based Search System for One Click Access to Information

by Dan Ionita, Niek Tax, and Djoerd Hiemstra

This paper proposes a prototype One Click access system, based on previous work in the field and the related 1CLICK-2@NTCIR10 task. The proposed solution integrates methods from previous such attempts into a three-tier algorithm (query categorization, information extraction, and output generation) and offers suggestions on how each tier can be implemented. Finally, a thorough user-based evaluation concludes that such an information retrieval system outperforms the textual previews collected from Google search results, based on a paired sign test. Based on the validation results, suggestions for future improvements are proposed.

To be presented at the Japanese National Institute of Informatics (NII) Testbeds and Community for Information access Research (NTCIR-10) Conference at the National Center of Sciences, Tokyo, Japan on June 18-21.

[download pdf]

SIGIR Doctoral Consortium

Wednesday, May 1st, 2013, posted by Djoerd Hiemstra

I will be participating in the SIGIR Doctoral Consortium this year in Dublin, Ireland, organized by Jaime Arguello, Mounia Lalmas, and Grace Hui Yang.

Update (August 5)

SIGIR 2013 Doctoral Consortium

Everyone concentrating at the DC meeting on 28 July in Dublin.

Mattijs Ugen graduates on scalable performance for digital forensics

Wednesday, April 24th, 2013, posted by Djoerd Hiemstra

Scalable performance for a forensic database application

by Mattijs Ugen

As digital forensic investigations deal with more and more data, the Netherlands Forensic Institute, NFI, foresees scalability issues with the current solution in the near future. Following the global trend towards distributed solutions for ‘Big data’ problems, the NFI wants to find a suitable architecture to replace the currently used XIRAF system. Using experimental implementations on top of a selection of distributed data stores, we present query performance timings in three different scaling dimensions: cluster size, working set size and the number of parallel clients. We show that the scaling characteristics for parallel clients follow a linear trend, but that scaling proved hard to measure in the other dimensions. A distributed search engine architecture proves the best candidate for the NFI, warranting closer investigation in that area for a real-world deployment.

[download pdf]

Traitor: Associating Concepts using the WWW

Wednesday, April 17th, 2013, posted by Djoerd Hiemstra

by Wanno Drijfhout, Oliver Jundt, and Lesley Wevers

Traitor uses Common Crawl’s 25TB data set of web pages to construct a database of associated concepts using Hadoop. The database can be queried through a web application with two query interfaces. A textual interface allows searching for similarities and differences between multiple concepts using a query language similar to set notation, and a graphical interface allows users to visualize similarity relationships of concepts in a force-directed graph.
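The set-notation style of querying described above can be sketched in miniature, assuming the Hadoop-built database reduces to a mapping from each concept to its set of associated concepts. The concepts and association sets below are made up for illustration:

```python
# Toy association database: concept -> set of associated concepts.
associations = {
    "coffee": {"caffeine", "cup", "morning", "espresso"},
    "tea":    {"caffeine", "cup", "leaves", "morning"},
}

def similarities(*concepts):
    """Concepts associated with all query concepts (set intersection)."""
    return set.intersection(*(associations[c] for c in concepts))

def differences(a, b):
    """Concepts associated with a but not with b (set difference)."""
    return associations[a] - associations[b]

similarities("coffee", "tea")   # -> {"caffeine", "cup", "morning"}
differences("coffee", "tea")    # -> {"espresso"}
```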

To be presented at the 13th Dutch-Belgian Information Retrieval Workshop DIR 2013 on 26 April in Delft, The Netherlands

[download pdf]

Try Traitor at

Readability of the Web

Monday, April 15th, 2013, posted by Djoerd Hiemstra

A study on 1 billion web pages.

by Marije de Heus

Automated Readability Index for the Web

We have performed a readability study on more than 1 billion web pages. The Automated Readability Index was used to determine the average grade level required to easily comprehend a website. Some of the results are that a 16-year-old can easily understand 50% of the web and an 18-year-old can easily understand 77% of the web. This information can be used in a search engine to filter websites that are likely to be incomprehensible for younger users.
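The Automated Readability Index used in the study has a standard closed form: ARI = 4.71 · (characters/words) + 0.5 · (words/sentences) − 21.43, approximating the US grade level needed to understand a text. A minimal sketch, with deliberately simple tokenization that is an assumption of this example rather than the study's actual preprocessing:

```python
import re

def automated_readability_index(text):
    """ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z0-9']+", text)
    n_words = max(1, len(words))
    characters = sum(len(w) for w in words)  # letters and digits only
    return 4.71 * (characters / n_words) + 0.5 * (n_words / sentences) - 21.43
```

Short, common words drive the score toward low (even negative) grade levels, while long words and long sentences raise it.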

To be presented at the 13th Dutch-Belgian Information Retrieval Workshop DIR 2013 on 26 April in Delft, The Netherlands

[download pdf]

Deep Web Entity Monitoring

Wednesday, March 27th, 2013, posted by Djoerd Hiemstra

by Mohammad Khelghati

Search engines do not cover all the data available on the Web. Besides missing many existing web pages, they miss the data behind web search forms. This data, known as the hidden web or deep web, is not accessible through search engines, and is estimated to be several times larger than the surface web, the data that search engines do reach. Although deep web data can be accessed through each source’s own interface, finding and querying all the potentially useful sources of information is a difficult and time-consuming task. Considering the huge amount of information that might be relevant to one’s information needs, it may even be impossible for a person to cover all the deep web sources of interest. There is therefore a great demand for applications that facilitate access to the large amount of data locked behind web search forms, and realizing approaches to meet this demand is one of the main issues targeted in this PhD project. Once access to deep web data is provided, different techniques can be applied to offer users additional value from this data, such as analyzing the data and finding patterns and relationships among data items and data sources. This research, however, targets monitoring entities that exist in deep web sources.

To be presented at the World Wide Web Conference Doctoral Consortium on 13 May in Rio de Janeiro, Brazil.

Robert Neumayer defends thesis on distributed entity search

Friday, March 8th, 2013, posted by Djoerd Hiemstra

Today, Robert Neumayer defended his Ph.D. thesis Semantic and Distributed Entity Search in the Web of Data at the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway.

by Robert Neumayer

Both the growth and the ubiquitous character of the Internet have had a profound effect on how we access and consume data and information. More recently, the Semantic Web, an extension of the current Web, has become increasingly relevant due to its widespread adoption.
The Web of Data (WoD) is an extension of the current web, where not only documents are interlinked by means of hyperlinks but also data in terms of predicates. Specifically, it describes objects, entities or “things” in terms of their attributes and their relationships, using RDF data (and is often used equivalently to Linked Data). Given its growth, there is a strong need for making this wealth of knowledge accessible by keyword search (the de-facto standard paradigm for accessing information online).
The overall goal of this thesis is to provide new techniques for accessing this data, i.e., to leverage its full potential for end users. We therefore address the following four main issues: a) how can the Web of Data be searched by means of keyword search?, b) what sets search in the WoD apart from traditional web search?, c) how can these elements be used in a theoretically sound and effective way?, and d) how can the techniques be adapted to a distributed environment? To this end, we develop techniques for effectively searching WoD sources. We build upon and formalise existing entity modelling approaches within a generative language modelling framework, and compare them experimentally using standard test collections. We show that these models outperform the current state of the art in terms of retrieval effectiveness; however, this comes at the cost of abandoning a large part of the semantics behind the data. We propose a novel entity model capable of preserving the semantics associated with entities, without sacrificing retrieval effectiveness. We further show how these approaches can be applied in the distributed context, with both low (federated search) and high (peer-to-peer, P2P) numbers of independent repositories, collections, or nodes.
The main contributions are as follows:

  • We develop a hybrid approach to search in the Web of Data, using elements from traditional information retrieval and structured retrieval alike.
  • We formalise our approaches in a language model setting.
  • Our extensions are successfully evaluated with respect to their applicability in different distributed environments such as federated search and P2P.
  • We discuss and analyse our empirical evaluation and provide insights into the entity search problem.
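The generative language modelling framework mentioned above can be sketched as query-likelihood ranking over entity descriptions: terms pooled from an entity's RDF attribute values form its term distribution, smoothed with a collection model. This is a minimal sketch under that assumption; the entities, the Jelinek-Mercer smoothing choice, and all names are illustrative, and the thesis's semantics-preserving model and field weighting are omitted:

```python
import math

def score(query_terms, entity_terms, collection_terms, lam=0.8):
    """log P(q | entity) under Jelinek-Mercer smoothing with the collection model."""
    s = 0.0
    for t in query_terms:
        p_e = entity_terms.count(t) / len(entity_terms)
        p_c = collection_terms.count(t) / len(collection_terms)
        s += math.log(lam * p_e + (1 - lam) * p_c)
    return s

# Two toy entities described by terms pooled from their RDF attribute values.
ntnu = ["trondheim", "university", "science", "technology"]
dcu = ["dublin", "city", "university"]
collection = ntnu + dcu  # background model over all entity descriptions

query = ["trondheim", "university"]
ranking = sorted([("ntnu", score(query, ntnu, collection)),
                  ("dcu", score(query, dcu, collection))],
                 key=lambda p: -p[1])
```

The collection model keeps the log defined for entities missing a query term, so entities matching only part of the query still receive a (lower) score.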

18 March: Norvig Award Ceremony

Wednesday, February 27th, 2013, posted by Djoerd Hiemstra

Update (19 March): See the photos of the event.

On 18 March, from 15.45 h. until 17.30 h., the Norvig Web Data Science Award Ceremony takes place in the SmartXP lab in building Zilverling of the University of Twente. During the ceremony, Peter Norvig, Director of Research at Google, will award the prize (funds to attend the 2013 edition of SIGIR in Dublin, Ireland, a tablet, and a lightning talk at the Hadoop Summit in Amsterdam) to the winners via a live video connection from California, USA. Participation in the event is free of charge. Please register by sending your name and affiliation to: Students and researchers will get the opportunity to ask Peter Norvig questions during the event. If you have a good question, please send it to the email address above too: maybe your question will be selected to be asked at the event.

Peter Norvig
Announcement at Inter-Actief

More information at U. Twente Activities

Empirical Training for Conditional Random Fields

Tuesday, February 26th, 2013, posted by Djoerd Hiemstra

A Closed Form Maximum Likelihood Estimator Of Conditional Random Fields

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers and Andreas Wombacher

Training Conditional Random Fields (CRFs) can be very slow for big data. In this paper, we present a new training method for CRFs called Empirical Training, motivated by the concept of co-occurrence rate. We show that the standard (unregularized) training can have many maximum likelihood estimations (MLEs). Empirical training has a unique closed-form MLE which is also an MLE of the standard training. We are the first to identify the Test Time Problem of the standard training, which may lead to low accuracy. Empirical training is immune to this problem. Empirical training is also unaffected by the label bias problem even though it is locally normalized. All of these claims have been verified by experiments. Experiments also show that empirical training reduces the training time from weeks to seconds, and obtains results competitive with standard and piecewise training on linear-chain CRFs, especially when data are insufficient.
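The closed-form flavour of the estimator can be hinted at with empirical relative frequencies: unigram label probabilities and the co-occurrence rate of adjacent labels, CR(y1, y2) = p(y1, y2) / (p(y1) p(y2)), are read directly off the counts with no iterative optimization. This is only a hedged sketch of the idea in the abstract; the paper's actual factorization and conditioning on observations are more general:

```python
from collections import Counter

def empirical_estimates(label_sequences):
    """Closed-form relative-frequency estimates from labelled sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in label_sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_uni = {y: c / n_uni for y, c in unigrams.items()}
    p_bi = {yy: c / n_bi for yy, c in bigrams.items()}
    # Co-occurrence rate of adjacent labels, estimated in closed form.
    cr = {(a, b): p / (p_uni[a] * p_uni[b]) for (a, b), p in p_bi.items()}
    return p_uni, cr
```

A single pass over the training data suffices, which is consistent with the reported drop in training time from weeks to seconds.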

[download pdf]