Archive for the 'Distributed Search' Category

Emiel Mols graduates on sharding Spotify search

Wednesday, August 15th, 2012, posted by Djoerd Hiemstra

Today, Emiel Mols graduated by presenting the master's thesis project he carried out at Spotify in Stockholm, Sweden. Emiel attracted quite some attention last year when he launched SpotifyOnTheWeb, leaving Spotify “no choice but to hire him”.

In his master's thesis, Emiel describes a prototype implementation of a term-sharded full-text search architecture. The system’s requirements are based on the use case of searching for music in the Spotify catalogue. He benchmarked the system using non-synthetic data gathered from Spotify’s infrastructure.
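For readers unfamiliar with the idea, here is a minimal sketch of what term sharding means in general. It is not Emiel's prototype: the class name `TermShardedIndex`, the CRC32 hash used to assign terms to shards, and the whitespace tokenizer are all assumptions made for illustration. The point it shows is that each term's posting list lives on exactly one shard, so a query only contacts the shards that own its terms.

```python
import zlib
from collections import defaultdict


class TermShardedIndex:
    """Toy in-memory index; every term's posting list lives on one shard."""

    def __init__(self, num_shards):
        # Each shard maps a term to the set of documents containing it.
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def _shard_for(self, term):
        # Assumption: a stable hash of the term decides its shard.
        return zlib.crc32(term.encode("utf-8")) % len(self.shards)

    def add(self, doc_id, text):
        # Naive whitespace tokenization, for illustration only.
        for term in text.lower().split():
            self.shards[self._shard_for(term)][term].add(doc_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return set()
        # Fetch one posting list per term, each from the shard owning it.
        postings = [
            self.shards[self._shard_for(term)].get(term, set())
            for term in terms
        ]
        # Conjunctive query: a document must contain every term.
        return set.intersection(*postings)
```

In a real deployment the shards would be separate machines and the intersection would involve shipping posting lists between them, which is where most of the design trade-offs of term sharding come from.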

The thesis will be available from ePrints.

Kien Tjin-Kam-Jet wins CTIT PhD Carousel

Tuesday, June 19th, 2012, posted by Djoerd Hiemstra

Kien Tjin-Kam-Jet was awarded the first prize in the PhD Carousel of the Centre for Telematics and Information Technology Symposium: ICT The Innovation Highway. The prize was handed over by Stefano Stramigioli, professor of Advanced Robotics and chair holder of the Control Engineering group at the University of Twente.

Query log analysis for Treinplanner

Friday, June 15th, 2012, posted by Djoerd Hiemstra

An analysis of free-text queries for a multi-field web form

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

We report how users interact with an experimental system that transforms single-field textual input into a multi-field query for an existing travel planner system. The experimental system was made publicly available and we collected over 30,000 queries from almost 12,000 users. From the free-text query log, we examined how users formulated structured information needs into free-text queries. The query log analysis shows that there is great variety in query formulation: over 400 query templates were found that occurred at least 4 times. Furthermore, with over 100 respondents to our questionnaire, we provide both quantitative and qualitative evidence indicating that end-users significantly prefer a single-field interface over a multi-field interface when performing structured search.

The paper will be presented at the fourth Information Interaction in Context Symposium, IIiX 2012 on August 21-24, 2012 in Nijmegen, the Netherlands.
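As a rough illustration of turning single-field input into a multi-field query, the sketch below matches a free-text query against one hand-written template. This is not the approach from the paper: the English template "from X to Y at HH:MM", the field names `origin`, `destination` and `time`, and the function `parse_query` are all hypothetical, and a real system would have to cope with the hundreds of templates mentioned in the abstract.

```python
import re

# Hypothetical template: "[from] <origin> to <destination> [at HH:MM]".
TEMPLATE = re.compile(
    r"^(?:from\s+)?(?P<origin>.+?)\s+to\s+(?P<destination>.+?)"
    r"(?:\s+at\s+(?P<time>\d{1,2}:\d{2}))?$",
    re.IGNORECASE,
)


def parse_query(text):
    """Map a free-text query onto form fields, or return None if no match."""
    match = TEMPLATE.match(text.strip())
    if match is None:
        return None
    # Keep only the fields the user actually filled in.
    return {name: value for name, value in match.groupdict().items()
            if value is not None}
```

For example, `parse_query("from Enschede to Nijmegen at 9:30")` fills all three fields, while `parse_query("Enschede to Nijmegen")` leaves the time field out.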

[download pdf]

Treinplanner nominated best ICT project

Monday, May 14th, 2012, posted by Djoerd Hiemstra

Each year, the Dutch journal Computable gives awards to companies, projects and people in five categories. Treinplanner has been nominated for the best ICT project award 2012 in the category Industry. If you like our project, please vote for Treinplanner at Computable (in Dutch).

See also: Treinplanner on Dutch television.

MIREX in ERCIM News Big Data Special

Wednesday, April 11th, 2012, posted by Djoerd Hiemstra

by Djoerd Hiemstra and Claudia Hauff

MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of uncompressed data. MIREX shows that executing test queries by a brute-force linear scan of pages is a viable alternative to running the test queries on a search engine’s inverted index. MIREX is open source and available at SourceForge.
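The brute-force idea can be sketched in a few lines of plain Python. This is not MIREX code and does not use its API: the function names, the in-memory "map" and "reduce" steps, and the scoring (a simple sum of query term counts) are assumptions chosen to keep the example small. The sketch shows the shape of the computation: every document is scanned once, scored against all test queries, and only the top results per query are kept.

```python
import heapq
from collections import Counter, defaultdict


def map_document(doc_id, text, queries):
    """Map step: score one document against every test query."""
    counts = Counter(text.lower().split())
    for query_id, query in queries.items():
        # Assumed scoring: total occurrences of the query terms.
        score = sum(counts[term] for term in query.lower().split())
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_top_k(pairs, k):
    """Reduce step: keep the k best-scoring documents for each query."""
    grouped = defaultdict(list)
    for query_id, scored in pairs:
        grouped[query_id].append(scored)
    return {query_id: heapq.nlargest(k, items)
            for query_id, items in grouped.items()}


def linear_scan(docs, queries, k=10):
    """Scan all documents once; no inverted index is built."""
    pairs = (pair
             for doc_id, text in docs.items()
             for pair in map_document(doc_id, text, queries))
    return reduce_top_k(pairs, k)
```

Because the map step treats each document independently, the scan parallelizes naturally over a cluster, which is what makes the approach practical for a fixed set of test queries.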

More information in ERCIM News 89.

Wrapper induction for search results

Sunday, March 18th, 2012, posted by Djoerd Hiemstra

Ranking XPaths for extracting search result records

by Dolf Trieschnigg, Kien Tjin-Kam-Jet and Djoerd Hiemstra

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
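To make the reuse step concrete, the sketch below applies one fixed XPath to a result page and collects the records. It does not implement the ranking algorithm from the paper: the expression in `SRR_XPATH`, the `result` class name and the function `extract_srrs` are hypothetical, and Python's standard `xml.etree.ElementTree` only supports a subset of XPath and expects well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath, assumed to have been determined once for a template.
SRR_XPATH = ".//div[@class='result']"


def extract_srrs(page, xpath=SRR_XPATH):
    """Apply a previously determined XPath to a page of the same template."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(xpath):
        # Assumption: each record carries its title and URL in its first link.
        link = node.find(".//a")
        if link is None:
            records.append({"title": "", "url": ""})
            continue
        records.append({
            "title": "".join(link.itertext()).strip(),
            "url": link.get("href", ""),
        })
    return records
```

Once such an expression is known, extracting results from another page built on the same template is a single call, with no need to rerun the induction process.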

[download pdf]

Search Result Finder XPaths

Download Search Result Finder Firefox plugin.

Searching the deep web

Thursday, February 16th, 2012, posted by Djoerd Hiemstra

Today on Radio 1: An interview by Deborah Blekkenhorst on our attempts to search the deep web. And… no, the deep web is not the part of the web where terrorists hang out. (in Dutch)

Peer-to-Peer Information Retrieval: An Overview

Friday, February 10th, 2012, posted by Djoerd Hiemstra

by Almer Tigelaar, Djoerd Hiemstra, Dolf Trieschnigg

Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these have seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this paper we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom.

The paper will appear in ACM Transactions on Information Systems.

[download pdf]

Treinplanner on Dutch television

Sunday, January 29th, 2012, posted by Djoerd Hiemstra

Dutch broadcaster BNN tests the intuitive train planner developed at the Database Group. Their verdict: “ingenious”, and “approved for elderly”. Picture of Kien Tjin-Kam-Jet proudly in the back (in Dutch). See the treinplanner in action at:


Monday, January 23rd, 2012, posted by Djoerd Hiemstra

We released a demo today: the Treinplanner, built by Kien, which allows you to search the Dutch Railways journey planner with a single search box. (in Dutch)