Archive for the 'Photos' Category

The Importance of Prior Probabilities for Entry Page Search

Thursday, July 10th, 2014, posted by Djoerd Hiemstra

by Wessel Kraaij, Thijs Westerveld, and Djoerd Hiemstra

An important class of searches on the World Wide Web aims to find the entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed, a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links, and URL form. The URL form in particular proved to be a good predictor. Using URL form priors, we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability.
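As a rough illustration of how a non-content feature can enter a language model as a prior, the sketch below ranks a document by log P(D) + Σ log P(q|D). It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `score`, the linear interpolation smoothing with weight `lam`, and the prior values in the usage lines are all hypothetical.

```python
import math
from collections import Counter

def score(query_terms, doc_terms, collection_counts, collection_len, prior, lam=0.5):
    """Return log P(D) + sum over query terms of log P(q|D).

    P(q|D) is smoothed by linear interpolation with the collection model
    (an assumption of this sketch); `prior` is the document prior P(D),
    for example a value looked up from the document's URL form.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_score = math.log(prior)
    for q in query_terms:
        p_doc = tf[q] / doc_len if doc_len else 0.0
        p_col = collection_counts.get(q, 0) / collection_len
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            # Query term unseen in both document and collection.
            return float("-inf")
        log_score += math.log(p)
    return log_score

# Two documents with identical text but different (made-up) priors.
doc = ["acme", "home"]
collection = Counter({"acme": 2, "home": 2})
s_high = score(["acme"], doc, collection, 4, prior=0.6)
s_low = score(["acme"], doc, collection, 4, prior=0.1)
```

With identical content, the two scores differ only by the log ratio of the priors, so the document with the more entry-page-like prior ranks higher.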

[download pdf]

SIGIR 2014 Test of Time Honourable Mention

The paper was published at SIGIR 2002 and received an Honourable Mention for the ACM SIGIR Test of Time award at the 37th Annual ACM SIGIR Conference on Research and Development in Information Retrieval in Gold Coast, Australia, on 9 July 2014.

Overview of TREC FedWeb 2013

Monday, March 10th, 2014, posted by Djoerd Hiemstra

Overview of the TREC 2013 Federated Web Search Track

by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Djoerd Hiemstra

The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and to that end provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on basic challenges in federated search: (1) resource selection, and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants’ individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.

TREC FedWeb
Ellen Voorhees presenting FedWeb at TREC 2013

The FedWeb task is organized as part of the Text REtrieval Conference (TREC)

[download pdf]

Cum laude PhD degree for Sergio Duarte Torres

Tuesday, February 18th, 2014, posted by Djoerd Hiemstra

Sergio Duarte Torres and defense committee

Sergio Duarte Torres’ PhD defense last Friday, February 14th, resulted in an exceptional PhD degree cum laude. His PhD thesis, “Information Retrieval for Children: Search Behavior and Solutions”, was written at the Database Group as part of the European project PuppyIR, a joint project with, amongst others, Human Media Interaction. Sergio’s research shows an extraordinary diversity and heterogeneity, touching many areas of computer science, including Information Retrieval, Big Data analysis, and Machine Learning. Sergio sought cooperation with leading search engine companies in the field: Yahoo and Yandex. He did a three-month internship at Yahoo Research in Barcelona. Sergio’s work is well received. His paper on vertical selection for search for children was nominated for the Best Student Paper Award at the joint ACM/IEEE conference on Digital Libraries in Indianapolis, USA. His work has been accepted by two important journals in the field: the ACM Transactions on the Web, and the Journal of the American Society for Information Science and Technology. Specifically worth mentioning is the user study with children aged 8 to 10 that Sergio carried out to evaluate the child-friendly search approaches he developed. We are proud of the achievements of Sergio Duarte Torres. He will be an excellent ambassador of the University of Twente.

[download pdf]

Kien Tjin-Kam-Jet presents Q-Able at the Young Technology Award

Monday, February 3rd, 2014, posted by Djoerd Hiemstra

Kien at YTA

On Thursday 30 January, the final of the Young Technology Award was held in Atak, with an excellent performance by Kien Tjin-Kam-Jet of Q-Able.
See the photo impression.

Merry Christmas

Friday, December 20th, 2013, posted by Djoerd Hiemstra

Database Group

Merry Christmas from the Database Group!

Federated Search Made Easy

Tuesday, May 21st, 2013, posted by Djoerd Hiemstra

by Dolf Trieschnigg, Kien Tjin-Kam-Jet, and Djoerd Hiemstra

Building a federated search engine based on a large number of existing web search engines is a challenge: implementing the programming interface (API) for each search engine is an exacting and time-consuming job. In this demonstration we present SearchResultFinder, a browser plugin which speeds up determining reusable XPaths for extracting search result items from HTML search result pages. Based on a single search result page, the tool presents a ranked list of candidate extraction XPaths and allows highlighting to view the extraction result. An evaluation with 148 web search engines shows that in 90% of the cases a correct XPath is suggested.
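To make the idea of a reusable extraction XPath concrete, the sketch below applies one XPath to a small result page and pulls out title, URL, and snippet for each item. It illustrates how such an XPath might be used once found; it is not SearchResultFinder's code. The page markup, the XPath `.//div[@class='result']`, and the function `extract_results` are hypothetical, and the standard-library parser used here assumes well-formed markup, which real HTML pages often are not.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a search result page (hypothetical markup).
PAGE = """<html><body>
<div id="nav"><a href="/about">About</a></div>
<div class="result"><a href="http://example.org/1">First hit</a><p>Snippet one</p></div>
<div class="result"><a href="http://example.org/2">Second hit</a><p>Snippet two</p></div>
</body></html>"""

def extract_results(page, item_xpath):
    """Apply one item-level XPath and read title, URL and snippet per item."""
    root = ET.fromstring(page)
    results = []
    for item in root.findall(item_xpath):
        link = item.find(".//a")
        if link is None:
            # Skip matched elements that carry no link.
            continue
        results.append({
            "title": (link.text or "").strip(),
            "url": link.get("href"),
            "snippet": (item.findtext(".//p") or "").strip(),
        })
    return results

RESULTS = extract_results(PAGE, ".//div[@class='result']")
```

The same XPath can then be reused on every result page of that search engine, which is what makes a single correct suggestion valuable.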

The software can be downloaded as a Firefox plugin.

SIGIR 2013 demonstration

The tool was demonstrated at the ACM SIGIR Conference in Dublin.

[download pdf]

Taily: Shard Selection Using the Tail of Score Distributions

Wednesday, May 15th, 2013, posted by Djoerd Hiemstra

by Robin Aly, Djoerd Hiemstra, and Thomas Demeester

Search engines can improve their efficiency by selecting only a few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query’s score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function’s features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
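The tail-based selection idea can be sketched as follows. This is a simplified illustration, not the algorithm as published: it assumes each shard's score mean and variance for the query are already known, fits the Gamma distribution by the method of moments, and keeps shards whose expected number of documents above a score threshold reaches a cut-off. The function names, the `threshold` and `min_docs` parameters, and the shard statistics in the usage lines are hypothetical.

```python
import math

def gamma_params(mean, variance):
    """Method-of-moments fit of a Gamma distribution: shape k, scale theta."""
    k = mean * mean / variance
    theta = variance / mean
    return k, theta

def gamma_tail(k, theta, s, max_terms=1000, eps=1e-12):
    """P(score > s) for Gamma(k, theta), using the power series of the
    regularized lower incomplete gamma function."""
    if s <= 0:
        return 1.0
    x = s / theta
    term = 1.0 / k
    total = term
    for n in range(1, max_terms):
        term *= x / (k + n)
        total += term
        if term < eps * total:
            break
    cdf = total * math.exp(-x + k * math.log(x) - math.lgamma(k))
    return max(0.0, 1.0 - cdf)

def select_shards(shard_stats, threshold, min_docs):
    """Keep shards expected to hold at least `min_docs` documents that
    score above `threshold`, ordered by that expected count."""
    selected = []
    for name, (n_docs, mean, var) in shard_stats.items():
        k, theta = gamma_params(mean, var)
        expected = n_docs * gamma_tail(k, theta, threshold)
        if expected >= min_docs:
            selected.append((name, expected))
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Hypothetical per-shard statistics: (documents, score mean, score variance).
STATS = {"a": (1000, 2.0, 4.0), "b": (1000, 0.5, 0.25)}
CHOSEN = select_shards(STATS, threshold=4.0, min_docs=1.0)
```

Because the estimate depends only on per-shard statistics, the same query always selects the same shards, which matches the deterministic behaviour the abstract describes.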

SIGIR 2013 presentation

Presented at the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in Dublin, Ireland, 28 July - 1 August.

[download pdf]

SIGIR Doctoral Consortium

Wednesday, May 1st, 2013, posted by Djoerd Hiemstra

I will be participating in the SIGIR Doctoral Consortium this year in Dublin, Ireland, organized by Jaime Arguello, Mounia Lalmas, and Grace Hui Yang.

Update (August 5)

SIGIR 2013 Doctoral Consortium

Everyone concentrated at the DC meeting on 28 July in Dublin

18 March: Norvig Award Ceremony

Wednesday, February 27th, 2013, posted by Djoerd Hiemstra

Update (19 March): See the photos of the event.

On 18 March, from 15:45 until 17:30, the Norvig Web Data Science Award Ceremony takes place in the SmartXP lab in building Zilverling of the University of Twente. During the ceremony, Peter Norvig, Director of Research at Google, will award the prize (funds to attend the 2013 edition of SIGIR in Dublin, Ireland, a tablet, and a lightning talk at Hadoop Summit in Amsterdam) to the winners via a live video connection from California, USA. Participation in the event is free of charge. Please register by sending your name and affiliation to: challenges@inter-actief.net. Students and researchers will get the opportunity to ask Peter Norvig questions during the event. If you have a good question, please send it to the email address above too; maybe your question will be selected to be asked at the event.

Peter Norvig
Announcement at Inter-Actief

More information at U. Twente Activities

The Winners of The Norvig Web Data Science Award

Tuesday, February 26th, 2013, posted by Lisa Green

by Lisa Green (Common Crawl)

We are very excited to announce the winners of the Norvig Web Data Science Award: Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente! The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science and named in honor of distinguished computer scientist Peter Norvig.

There were many excellent submissions that demonstrated how you can extract valuable insight and knowledge from web crawl data. Be sure to check out the work of the winning team, Traitor – Associating Concepts Using The World Wide Web, and the other finalists on the award website. You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus. All code is open source and we are looking forward to seeing it reused and adapted for other projects.

A huge thank you to our distinguished panel of judges: Peter Norvig, Ricardo Baeza-Yates, Hilary Mason, Jimmy Lin, and Evert Lammerts!


Added on 18 March: Award winners Oliver Jundt, Wanno Drijfhout, and Lesley Wevers with their prize: a high-end Android tablet!