Archive for the 'Paper abstracts' Category

SIGIR Test of Time Awardees 1978-2001

Wednesday, June 21st, 2017, posted by Djoerd Hiemstra

Overview of Special Issue

by Donna Harman, Diane Kelly (Editors), James Allan, Nicholas J. Belkin, Paul Bennett, Jamie Callan, Charles Clarke, Fernando Diaz, Susan Dumais, Nicola Ferro, Donna Harman, Djoerd Hiemstra, Ian Ruthven, Tetsuya Sakai, Mark D. Smucker, Justin Zobel (Authors)

This special issue of SIGIR Forum marks the 40th anniversary of the ACM SIGIR Conference by showcasing papers selected for the ACM SIGIR Test of Time Award from the years 1978-2001. These papers document the history and evolution of IR research and practice, and illustrate the intellectual impact the SIGIR Conference has had over time.
The ACM SIGIR Test of Time Award recognizes conference papers that have had a long-lasting influence on information retrieval research. When the award guidelines were created, eligible papers were identified as those that were published in a window of time 10 to 12 years prior to the year of the award. This meant that the first year this award was given, 2014, eligible papers came from the years 2002-2004. To identify papers published during the period 1978-2001 that might also be recognized with the Test of Time Award, a committee was created, which was led by Keith van Rijsbergen. Members of the committee were: Nicholas Belkin, Charlie Clarke, Susan Dumais, Norbert Fuhr, Donna Harman, Diane Kelly, Stephen Robertson, Stefan Rueger, Ian Ruthven, Tetsuya Sakai, Mark Sanderson, Ryen White, and Chengxiang Zhai.
The committee used citation counts and other techniques to build a nomination pool. Nominations were also solicited from the community. In addition, a sub-committee was formed of people active in the 1980s to identify papers from the period 1978-1989 that should be recognized with the award. As a result of these processes, a nomination pool of papers was created and each paper in the pool was reviewed by a team of three committee members and assigned a grade. The 30 papers with the highest grades were selected to be recognized with an award.
To commemorate the 1978-2001 ACM SIGIR Test of Time awardees, we invited a number of people from the SIGIR community to contribute write-ups of each paper. Each write-up consists of a summary of the paper, a description of the main contributions of the paper and commentary on why the paper is still useful. This special issue contains reprints of all the papers, with the exception of a few whose copyrights are not held by ACM (members of ACM can access these papers at the ACM Digital Library as part of the original conference proceedings).
As members of the selection committee, we really enjoyed reading the older papers. The style was very different from today's SIGIR paper: the writing was simple and unpretentious, with an equal mix of creativity, rigor and openness. We encourage everyone to read at least a handful of these papers and to consider how things have changed, and if, and how, we might bring some of the positive qualities of these older papers back to the SIGIR program.

To be published in SIGIR Forum 51(2), Association for Computing Machinery, July 2017

[download pdf]

Exploring the Query Halo Effect in Site Search

Friday, May 19th, 2017, posted by Djoerd Hiemstra

Leading People to Longer Queries

by Djoerd Hiemstra, Claudia Hauff, and Leif Azzopardi

People tend to type short queries; however, the belief is that longer queries are more effective. Consequently, a number of attempts have been made to encourage and motivate people to enter longer queries. While most have failed, a recent attempt, conducted in a laboratory setup, in which the query box has a halo or glow effect that changes as the query becomes longer, has been shown to increase query length by one term, on average. In this paper, we test whether a similar increase is observed when the same component is deployed in a production system for site search and used by real end users. To this end, we conducted two separate experiments, in which the rate at which the color of the halo changes was varied. In both experiments, users were assigned to one of two conditions: halo and no-halo. The experiments were run over a fifty-day period with 3,506 unique users submitting over six thousand queries. In both experiments, however, we observed no significant difference in query length. We also did not find longer queries to result in greater retrieval performance. While we did not reproduce the previous findings, our results indicate that the query halo effect appears to be sensitive to performance and task, limiting its applicability to other contexts.

To be presented at SIGIR 2017, the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval in Tokyo, Japan on August 7-11, 2017

Also to be presented at DIR2017, the 16th Dutch-Belgian Information Retrieval Workshop in Hilversum, The Netherlands, on November 24, 2017

[download pdf]

Inoculating Relevance Feedback Against Poison Pills

Friday, November 4th, 2016, posted by Djoerd Hiemstra

by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx

Relevance Feedback (RF) is a common approach for enriching queries, given a set of explicitly or implicitly judged documents, to improve the performance of the retrieval. Although it has been shown that, on average, the overall performance of retrieval improves after relevance feedback, for some topics, employing some relevant documents may decrease the average precision of the initial run. This is mostly because the feedback document is only partially relevant and contains off-topic terms; adding these to the query as expansion terms hurts retrieval performance. These relevant documents that hurt the performance of retrieval after feedback are called “poison pills”. In this paper, we discuss the effect of poison pills on relevance feedback and present significant words language models (SWLM) as an approach for estimating the feedback model to tackle this problem.

To be presented at the 15th Dutch-Belgian Information Retrieval Workshop, DIR 2016 on 25 November in Delft.

[download pdf]

Evaluation and analysis of term scoring methods for term extraction

Tuesday, August 30th, 2016, posted by Djoerd Hiemstra

by Suzan Verberne, Maya Sappelli, Djoerd Hiemstra, and Wessel Kraaij

We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query expansion. The methods for term scoring that have been proposed in the literature were designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different than what they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of six term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback-Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
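As a rough illustration of the pointwise Kullback-Leibler scoring mentioned above, the sketch below scores bigrams by adding an informativeness term (foreground collection versus a background collection) and a phraseness term (the bigram versus its words treated as independent). It is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the restriction to bigrams, the add-alpha smoothing of the background model, and all function names are choices made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pointwise_kl(p, q):
    """Pointwise KL divergence: the contribution p * log(p / q) of one event."""
    return p * math.log(p / q)


def score_bigrams(fg_tokens, bg_tokens, alpha=1.0):
    """Score each foreground bigram by informativeness + phraseness.

    fg_tokens: tokens of the collection we extract terms from.
    bg_tokens: tokens of a general background collection.
    alpha: add-alpha smoothing for the background model (an assumption).
    """
    fg_uni = Counter(fg_tokens)
    fg_bi = Counter(ngrams(fg_tokens, 2))
    bg_bi = Counter(ngrams(bg_tokens, 2))
    n_uni = sum(fg_uni.values())
    n_fg = sum(fg_bi.values())
    n_bg = sum(bg_bi.values())
    vocab = len(set(fg_bi) | set(bg_bi))

    scores = {}
    for bigram, count in fg_bi.items():
        p_fg = count / n_fg
        # Smoothed background probability, so unseen bigrams do not divide by zero.
        p_bg = (bg_bi[bigram] + alpha) / (n_bg + alpha * vocab)
        # Probability of the bigram if its two words were independent.
        p_indep = (fg_uni[bigram[0]] / n_uni) * (fg_uni[bigram[1]] / n_uni)
        informativeness = pointwise_kl(p_fg, p_bg)
        phraseness = pointwise_kl(p_fg, p_indep)
        scores[bigram] = informativeness + phraseness
    return scores
```

Sorting the returned dictionary by value gives a ranked list of candidate terms. With very small foreground collections the probability estimates become unreliable, which fits the observation above that all methods are hindered by small collection sizes.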

To appear in Information Retrieval.

[download pdf]

Solving the Continuous Cold Start Problem in E-commerce Recommendations

Wednesday, August 3rd, 2016, posted by Djoerd Hiemstra

Beyond Movie Recommendations: Solving the Continuous Cold Start Problem in E-commerce Recommendations

by Julia Kiseleva, Alexander Tuzhilin, Jaap Kamps, Melanie Mueller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, Djoerd Hiemstra

Many e-commerce websites use recommender systems or personalized rankers to personalize search results based on users' previous interactions. However, a large fraction of users has no prior interactions, making it impossible to use collaborative filtering or rely on user history for personalization. Even the most active users may visit only a few times a year and may have volatile needs or different personas, making their personal history a sparse and noisy signal at best. This paper investigates how, when we cannot rely on the user history, the large scale availability of other user interactions still allows us to build meaningful profiles from the contextual data and whether such contextual profiles are useful to customize the ranking, exemplified by data from a major online travel agent, Booking.com.
Our main findings are threefold: First, we characterize the Continuous Cold Start Problem (CoCoS) from the viewpoint of typical e-commerce applications. Second, as explicit situational context is not available in typical real world applications, implicit cues from transaction logs used at scale can capture essential features of situational context. Third, contextual user profiles can be created offline, resulting in a set of smaller models compared to a single huge non-contextual model, making contextual ranking available with negligible CPU and memory footprint. Finally, we conclude that, in an online A/B test on live users, our contextual ranker increased user engagement substantially over a non-contextual baseline, with click-through-rate (CTR) increased by 20%. This clearly demonstrates the value of contextual user profiles in a real world application.

[download pdf]

The Importance of Prior Probabilities for Entry Page Search

Thursday, July 10th, 2014, posted by Djoerd Hiemstra

by Wessel Kraaij, Thijs Westerveld, and Djoerd Hiemstra

An important class of searches on the world-wide-web aims to find an entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed, a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links and URL form. The URL form in particular proved to be a good predictor. Using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability.
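To make the idea of a URL form prior concrete, the sketch below ranks a page by log P(D) + sum of log P(q|D), where P(D) depends only on the shape of the URL. It is an illustrative sketch, not the paper's implementation: the four URL classes (`root`, `subroot`, `path`, `file`), the prior values in `URL_PRIOR`, the linear interpolation weight `lam`, and all function names are assumptions made for this example.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical priors P(entry page | URL form); values are made up for illustration.
URL_PRIOR = {"root": 0.60, "subroot": 0.25, "path": 0.10, "file": 0.05}


def url_form(url):
    """Classify a URL by its shape (an assumed four-way classification)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    is_directory = path.endswith("/") or path == ""
    # Treat a trailing index page as the directory that contains it.
    if segments and segments[-1].lower() in ("index.html", "index.htm"):
        segments = segments[:-1]
        is_directory = True
    if not segments:
        return "root"      # domain only
    if not is_directory and "." in segments[-1]:
        return "file"      # ends in a file name
    if len(segments) == 1:
        return "subroot"   # domain plus a single directory
    return "path"          # deeper directory path


def log_score(query, doc_terms, url, coll_counts, coll_size, lam=0.5):
    """log P(D) + sum over query terms of log(lam * P(t|D) + (1 - lam) * P(t|C))."""
    doc_counts = Counter(doc_terms)
    total = math.log(URL_PRIOR[url_form(url)])
    for term in query:
        p_doc = doc_counts[term] / len(doc_terms) if doc_terms else 0.0
        p_coll = coll_counts.get(term, 0) / coll_size
        p = lam * p_doc + (1 - lam) * p_coll
        if p > 0:  # skip terms that occur nowhere, to avoid log(0)
            total += math.log(p)
    return total
```

Because the prior is added in log space, two pages with identical content are separated purely by URL form; in practice the prior for each class would be estimated from training data rather than set by hand.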

[download pdf]

SIGIR 2014 Test of Time Honourable Mention

The paper was published at SIGIR 2002 and received an Honourable Mention for the ACM SIGIR Test of Time award at the 37th Annual ACM SIGIR conference on Research & development in information retrieval in Gold Coast, Australia on 9 July 2014.

Using a Stack Decoder for Structured Search

Monday, June 10th, 2013, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

We describe a novel and flexible method that translates free-text queries to structured queries for filling out web forms. This can benefit searching in web databases which only allow access to their information through complex web forms. We introduce boosting and discounting heuristics, and use the constraints imposed by a web form to find a solution both efficiently and effectively. Our method is more efficient and shows improved performance over a baseline system.

To be presented at the 10th international conference on Flexible Query Answering Systems (FQAS 2013) in Granada, Spain on 18-20 September.

[download preprint]

Traitor: Associating Concepts using the WWW

Wednesday, April 17th, 2013, posted by Djoerd Hiemstra

by Wanno Drijfhout, Oliver Jundt, and Lesley Wevers

Traitor uses Common Crawl’s 25TB data set of web pages to construct a database of associated concepts using Hadoop. The database can be queried through a web application with two query interfaces. A textual interface allows searching for similarities and differences between multiple concepts using a query language similar to set notation, and a graphical interface allows users to visualize similarity relationships of concepts in a force directed graph.

To be presented at the 13th Dutch-Belgian Information Retrieval Workshop DIR 2013 on 26 April in Delft, The Netherlands

[download pdf]

Try Traitor at http://traitor.imperamus.eu.

Readability of the Web

Monday, April 15th, 2013, posted by Djoerd Hiemstra

A study on 1 billion web pages.

by Marije de Heus

Automated Readability Index for the Web

We have performed a readability study on more than 1 billion web pages. The Automated Readability Index was used to determine the average grade level required to easily comprehend a website. Some of the results are that a 16-year-old can easily understand 50% of the web and an 18-year-old can easily understand 77% of the web. This information can be used in a search engine to filter websites that are likely to be incomprehensible for younger users.
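For readers unfamiliar with the measure, the Automated Readability Index is computed as 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The sketch below applies that formula to a plain-text string. The tokenization is an assumption made for this example (words are runs of letters and digits, sentences end at `.`, `!` or `?`) and is not necessarily how the study processed web pages.

```python
import re


def automated_readability_index(text):
    """Return the ARI grade level of a plain-text string.

    Assumed tokenization: words are runs of letters or digits;
    sentences are separated by '.', '!' or '?'.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(w) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)
```

Very short, simple texts can yield scores below zero, so a filter built on this score would typically apply it to a reasonably sized sample of a page's text.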

To be presented at the 13th Dutch-Belgian Information Retrieval Workshop DIR 2013 on 26 April in Delft, The Netherlands

[download pdf]

Assigning reviewers to papers

Monday, November 12th, 2012, posted by Djoerd Hiemstra

Multi-Aspect Group Formation using Facility Location Analysis

by Mahmood Neshati, Hamid Beigy, and Djoerd Hiemstra

In this paper, we propose an optimization framework to retrieve an optimal group of experts to perform a given multi-aspect task/project. Each task needs a diverse set of skills and the group of assigned experts should be able to collectively cover all required aspects of the task. We consider three types of multi-aspect team formation problems and propose a unified framework to solve these problems accurately and efficiently. Our proposed framework is based on Facility Location Analysis, which is a well-known branch of Operations Research. Our experiments on a real dataset show significant improvement in comparison with the state-of-the-art approaches for the team formation problem.

The paper will be presented at the 17th Australasian Document Computing Symposium ADCS 2012 at the University of Otago, Dunedin, New Zealand on the 5th and 6th December, 2012.

[download pdf]