Archive for 2012

Treinplanner nominated best ICT project

Monday, May 14th, 2012, posted by Djoerd Hiemstra

Treinplanner.info nominated best ICT project by Computable

Each year, the Dutch journal Computable awards companies, projects and persons in five categories. Treinplanner has been nominated for the best ICT project award 2012 in the category Industry. If you like our project, please vote for Treinplanner at Computable (in Dutch).

See also: Treinplanner on Dutch television.

Exploring Language Identification Techniques for Dutch Folktales

Friday, April 27th, 2012, posted by Djoerd Hiemstra

by Dolf Trieschnigg, Djoerd Hiemstra, Mariët Theune, Franciska de Jong, and Theo Meder

The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages including (Middle and 17th century) Dutch, Frisian and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low. The studied dataset, consisting of over 39,000 documents in 16 languages and dialects, is available on request for follow-up research.

The paper will be presented at the LREC Workshop Adaptation of Language Resources and Tools for Processing Cultural Heritage Objects on 26 May 2012 in Istanbul, Turkey.

[download preprint]

Saving the Old IR Literature

Friday, April 20th, 2012, posted by Djoerd Hiemstra

The SIGIR project Saving the Old IR Literature has scanned and released a new batch of historic IR (Information Retrieval) papers, including early papers on the SMART system and papers on the development of test collections. The papers are written by, amongst others, Gerard Salton, Karen Sparck Jones, William Cooper, Keith van Rijsbergen, Stephen Robertson, Martin Kay, Michael Lesk, and Nicholas Belkin. The new batch is listed below and available from the SIGIR web site.

The collection contains some unique documents, for instance Karen Sparck Jones’ and Keith van Rijsbergen’s Report on the Need for and Provision for an ‘IDEAL’ Information Retrieval Test Collection, written in 1975, which I anxiously searched for when doing my Ph.D. research. The document is an important milestone towards the current TREC conferences, work that already started in 1960 with Cyril Cleverdon’s Cranfield experiments, one of Computer Science’s earliest examples of empirical testing in a laboratory setting.

It’s all there, enjoy!

Study tour to South Korea and China

Wednesday, April 11th, 2012, posted by Djoerd Hiemstra

Noodle is the name of the 2012 study tour organized by study association Inter-Actief from the University of Twente. In September and October 2012 we will visit companies and universities in South Korea and China. Before the students depart they research the countries they will be visiting. All participants conduct research in one of the six research tracks defined within the tour’s theme IT Integrated Lifestyle: how IT affects and enriches our daily lives.

The Study Tour Committee: David Huistra, Lex Utama, Marijn Mensinga, Mark Oude Veldhuis, Nils van Kleef, and Yme Joustra

Follow the Noodle study tour preparations at http://noodle2012.nl.

MIREX in ERCIM News Big Data Special

Wednesday, April 11th, 2012, posted by Djoerd Hiemstra

by Djoerd Hiemstra and Claudia Hauff

MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of uncompressed data. MIREX shows that executing test queries by a brute-force linear scan of pages is a viable alternative to running them on a search engine’s inverted index. MIREX is open source and available at SourceForge.
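To give a feel for the linear-scan idea, here is a minimal single-machine sketch in Python. It is not MIREX code: the function names (map_document, reduce_query, linear_scan), the toy documents, and the scoring (query term count divided by document length) are all assumptions made for illustration. The map step scans every document once against all test queries; the reduce step keeps the best documents per query. No inverted index is built.

```python
from collections import Counter, defaultdict


def map_document(doc_id, text, queries):
    """Map step: scan one document and emit (query_id, (doc_id, score)) pairs.

    The score is an assumed toy measure: matching term count / document length.
    """
    terms = text.lower().split()
    if not terms:
        return
    counts = Counter(terms)
    for query_id, query in queries.items():
        matches = sum(counts[term] for term in query.lower().split())
        if matches > 0:
            yield query_id, (doc_id, matches / len(terms))


def reduce_query(pairs, k=2):
    """Reduce step: keep the k highest scoring documents for one query."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))[:k]


def linear_scan(documents, queries, k=2):
    """Run all queries with one pass over all documents (no index)."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return {query_id: reduce_query(pairs, k) for query_id, pairs in grouped.items()}


# Hypothetical toy collection and test queries.
DOCUMENTS = {
    "d1": "information retrieval on clusters",
    "d2": "map reduce on clusters of machines",
    "d3": "retrieval retrieval experiments",
}
QUERIES = {"q1": "retrieval", "q2": "clusters machines"}

if __name__ == "__main__":
    for query_id, ranking in sorted(linear_scan(DOCUMENTS, QUERIES).items()):
        print(query_id, [doc_id for doc_id, _ in ranking])
```

Because each document is scored independently of the others, the map step can be spread over many machines, which is what makes a full scan practical for batches of test queries.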

More information in ERCIM News 89.

Emma Search Service

Tuesday, March 27th, 2012, posted by Djoerd Hiemstra

This demonstrator showcases the PuppyIR framework by incorporating numerous child-specific components developed as part of the PuppyIR project. The demonstrator is for Emma’s Children’s Hospital in Amsterdam and provides children with a novel and exciting interface to help support their information needs while in hospital or visiting the hospital.

EmSe will be demonstrated at the 34th European Conference on Information Retrieval (ECIR) in Barcelona on 1-5 April 2012.

Wrapper induction for search results

Sunday, March 18th, 2012, posted by Djoerd Hiemstra

Ranking XPaths for extracting search result records

by Dolf Trieschnigg, Kien Tjin-Kam-Jet and Djoerd Hiemstra

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application which are available as open source software.
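The reuse step can be illustrated with a small Python sketch. It does not implement the ranking algorithm from the paper; it only shows how one XPath expression, once determined, extracts records from several pages that share a template. The expression RESULT_XPATH, the function extract_records, and both toy pages are assumptions for illustration, and the pages are written as well-formed XML so that the standard library parser accepts them (real result pages usually need an HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath, standing in for one determined from a single result page.
RESULT_XPATH = ".//div[@class='result']"

# Two toy pages built from the same (assumed) template.
PAGE_ONE = (
    "<html><body>"
    "<div class='nav'><a href='/home'>Home</a></div>"
    "<div class='result'><a href='http://example.org/a'>Result A</a>"
    "<p>First snippet</p></div>"
    "<div class='result'><a href='http://example.org/b'>Result B</a>"
    "<p>Second snippet</p></div>"
    "</body></html>"
)
PAGE_TWO = (
    "<html><body>"
    "<div class='nav'><a href='/home'>Home</a></div>"
    "<div class='result'><a href='http://example.org/c'>Result C</a>"
    "<p>Third snippet</p></div>"
    "</body></html>"
)


def extract_records(page_source, xpath=RESULT_XPATH):
    """Apply one XPath to a page and return its search result records."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(xpath):
        link = node.find("a")      # first direct child link, if any
        snippet = node.find("p")   # first direct child paragraph, if any
        records.append({
            "title": link.text if link is not None else None,
            "url": link.get("href") if link is not None else None,
            "snippet": snippet.text if snippet is not None else None,
        })
    return records


if __name__ == "__main__":
    for page in (PAGE_ONE, PAGE_TWO):
        for record in extract_records(page):
            print(record["title"], record["url"])
```

Note that the navigation link is skipped on both pages: the expression selects only the result containers, so nothing has to be recomputed for the second page.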

[download pdf]

Download Search Result Finder Firefox plugin.

MapReduce grades and evaluation

Friday, February 17th, 2012, posted by Djoerd Hiemstra

The MapReduce, Pig Latin and Cloud Computing assignments have been graded. The final grades can be found in Blackboard’s grade center. Please join the course evaluation session on 21 February in Hal B 2C from 12:30 to 13:30 (including a free lunch).

Searching the deep web

Thursday, February 16th, 2012, posted by Djoerd Hiemstra

Today on Radio 1: An interview by Deborah Blekkenhorst on our attempts to search the deep web. And… no, the deep web is not the part of the web where terrorists hang out. (in Dutch)

Peer-to-Peer Information Retrieval: An Overview

Friday, February 10th, 2012, posted by Djoerd Hiemstra

by Almer Tigelaar, Djoerd Hiemstra and Dolf Trieschnigg

Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these have seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this paper we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom.

The paper will appear in ACM Transactions on Information Systems.