Archive for the 'Distributed Search' Category

Federated Search for Sheet Music

Friday, July 7th, 2017, posted by Djoerd Hiemstra

After running the UT search engine for about a year, there is now a new search engine that uses Searsia: Dr. Sheet Music, a federated search engine for sheet music. Give it a try at http://drsheetmusic.com.

Exploring the Query Halo Effect in Site Search

Friday, May 19th, 2017, posted by Djoerd Hiemstra

Leading People to Longer Queries

by Djoerd Hiemstra, Claudia Hauff, and Leif Azzopardi

People tend to type short queries; however, the belief is that longer queries are more effective. Consequently, a number of attempts have been made to encourage and motivate people to enter longer queries. While most have failed, a recent attempt, conducted in a laboratory setup, in which the query box has a halo or glow effect that changes as the query becomes longer, has been shown to increase query length by one term on average. In this paper, we test whether a similar increase is observed when the same component is deployed in a production system for site search and used by real end users. To this end, we conducted two separate experiments in which the rate at which the color of the halo changes was varied. In both experiments, users were assigned to one of two conditions: halo and no-halo. The experiments were run over a fifty-day period, with 3,506 unique users submitting over six thousand queries. In both experiments, however, we observed no significant difference in query length. We also did not find longer queries to result in greater retrieval performance. While we did not reproduce the previous findings, our results indicate that the query halo effect appears to be sensitive to performance and task, limiting its applicability to other contexts.
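To make the halo idea concrete, here is a minimal sketch of how a halo color could depend on query length. The function name, the red-to-blue color range, and the `terms_for_full_effect` parameter (standing in for the "rate" that was varied between the experiments) are illustrative assumptions, not the component used in the study.

```python
def halo_color(query: str, terms_for_full_effect: int = 4) -> str:
    """Return a CSS rgb() color that shifts from red to blue as the query grows.

    Hypothetical sketch: the colors and the default of 4 terms are assumptions.
    A smaller `terms_for_full_effect` makes the color change faster.
    """
    n_terms = len(query.split())
    # Fraction of the way to the final color, clamped to [0, 1].
    progress = min(n_terms / terms_for_full_effect, 1.0)
    red = round(255 * (1.0 - progress))
    blue = round(255 * progress)
    return f"rgb({red}, 0, {blue})"


if __name__ == "__main__":
    for q in ["", "sheet", "sheet music piano", "sheet music piano beginners"]:
        print(repr(q), halo_color(q))
```

In an A/B setup like the one described, users in the no-halo condition would simply never have this color applied to the query box.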

To be presented at SIGIR 2017, the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, in Tokyo, Japan, on August 7-11, 2017.

Also to be presented at DIR2017, the 16th Dutch-Belgian Information Retrieval Workshop, in Hilversum, The Netherlands, on November 24, 2017.

[download pdf]

Searsia nominated by ISOC NL

Monday, January 2nd, 2017, posted by Djoerd Hiemstra

The Dutch chapter of the Internet Society (ISOC) nominated Searsia for its 2017 Innovation Award.

Read more on the Searsia blog

Resource Selection for Federated Search on the Web

Thursday, September 22nd, 2016, posted by Djoerd Hiemstra

by Dong Nguyen, Thomas Demeester, Dolf Trieschnigg, and Djoerd Hiemstra

A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines.
First, we experiment with estimating the size of uncooperative search engines on the web using query-based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.
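As an illustration of why size estimates matter for resource selection, the sketch below scores engines in the style of ReDDE, one well-known resource selection method. It is a simplified stand-in and not the method proposed in the paper; the function name, the input structures, and the `top_n` cutoff are assumptions made for this example.

```python
from collections import Counter


def redde_style_scores(ranked_sample_docs, doc_to_engine, est_size, sample_size, top_n=100):
    """Score engines from a ranking over a centralized index of sampled documents.

    ranked_sample_docs: sampled document ids, best match for the query first
    doc_to_engine:      sampled document id -> engine it was sampled from
    est_size:           engine -> estimated collection size
    sample_size:        engine -> number of documents sampled from that engine
    """
    scores = Counter()
    for doc_id in ranked_sample_docs[:top_n]:
        engine = doc_to_engine[doc_id]
        # Each sampled document stands in for est_size / sample_size
        # documents in the engine's full (unseen) collection.
        scores[engine] += est_size[engine] / sample_size[engine]
    return scores.most_common()


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3"]
    doc_to_engine = {"d1": "big", "d2": "small", "d3": "small"}
    est_size = {"big": 1000, "small": 100}
    sample_size = {"big": 10, "small": 10}
    print(redde_style_scores(ranking, doc_to_engine, est_size, sample_size))
```

With extremely skewed collection sizes, the size ratio dominates such a score, which hints at why both good size estimates and the skew itself matter in the web setting.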

[download pdf]

CLEF keynote slides

Wednesday, September 14th, 2016, posted by Djoerd Hiemstra

The slides of the CLEF keynote can be downloaded below.

A case for search specialization and search delegation

Evaluation conferences like CLEF, TREC and NTCIR are important for the field, and remain important because there is no “one-size-fits-all” for search engines. Different domains need different ranking approaches: for instance, Web search benefits from analyzing the link graph; Twitter search benefits from retweets and likes; restaurant search benefits from geo-location and reviews; advertisement search needs bids and click-through, etc. Researching many domains will teach us more about the need for and the value of specializing search engines, and about approaches that can quickly learn rankings for new domains using, for instance, learning-to-rank and clever feature selection.
A search engine that provides results from multiple domains is therefore better off delegating its queries to specialized search engines. This brings up unique research questions on how to best select a specialized search engine. The TREC Federated Web Search track, which ran in 2013 and 2014, studied these questions in two tasks. The resource selection task studied how to select the top specialized search engines for a query, given the query but before seeing its results. The vertical selection task studied how to select the top domains from a predefined set of domains such as news, video, Q&A, etc.
I will present the lessons that we learned from running the Federated Web Search track, focusing on successful approaches to resource selection and vertical selection. I will conclude the talk by discussing our steps to take this work to full practice by running the University of Twente’s search engine as a federation of more than 30 smaller search engines, including local databases with news, courses, publications, as well as results from social media like Twitter and YouTube. The engine that runs U. Twente search is called Searsia and is available as open source software at: http://searsia.org.

[download slides]

A new search engine for the university

Thursday, March 24th, 2016, posted by Djoerd Hiemstra

As of today, the university is using our Distributed Search approach as its main search engine on: http://utwente.nl/search (and also stand-alone on https://search.utwente.nl). The UT search engine offers its users not only the results from a large web crawl, but also live results from many sources that were previously invisible, such as courses, timetables, staff contact information, publications, the local photo database “Beeldbank”, vacancies, etc. The search engine combines about 30 such sources, and learns over time which sources should be included for a query, even if it has never seen that query or the results for it.


Read more in the official announcement (in Dutch).

Predicting relevance based on assessor disagreement

Wednesday, November 18th, 2015, posted by Djoerd Hiemstra

Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation

by Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, and Chris Develder

Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users is the possible disagreement on relevance, assuming that a single gold truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which allows predicting a particular result’s relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust and effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain, which nowadays often use heuristic and data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems on different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
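The general idea of turning assessor disagreement into gain values can be sketched as follows. This is a rough illustration under simplifying assumptions, not the paper's exact formulation of the PRM: it assumes a set of results judged by two assessors each, and the function names and label strings are made up for the example.

```python
import math
from collections import defaultdict


def estimate_user_relevance(double_judgments, relevant_labels):
    """For each label one assessor can give, estimate the probability that
    another assessor considers the same result relevant.

    double_judgments: iterable of (label_a, label_b) pairs for the same result
    relevant_labels:  set of labels that count as "relevant"
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [other says relevant, total]
    for label_a, label_b in double_judgments:
        # Use each pair in both directions so neither assessor is privileged.
        for seen, other in ((label_a, label_b), (label_b, label_a)):
            counts[seen][0] += int(other in relevant_labels)
            counts[seen][1] += 1
    return {label: rel / total for label, (rel, total) in counts.items()}


def dcg(labels, gain):
    """Discounted cumulative gain of a ranked list of labels, using `gain` values."""
    return sum(gain[label] / math.log2(rank + 1)
               for rank, label in enumerate(labels, start=1))


if __name__ == "__main__":
    pairs = [("rel", "rel"), ("rel", "non"), ("non", "non"), ("non", "non")]
    gains = estimate_user_relevance(pairs, relevant_labels={"rel"})
    print(gains)
    print(dcg(["rel", "non"], gains))
```

The estimated probabilities replace heuristic gain values: a result judged non-relevant still receives a small gain, reflecting the chance that another user would find it relevant.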

To be published in Information Retrieval Journal by Springer

[download pdf]

FedWeb Greatest Hits

Wednesday, March 11th, 2015, posted by Djoerd Hiemstra

Presenting the New Test Collection for Federated Web Search

by Thomas Demeester (Ghent University), Dolf Trieschnigg, Ke Zhou (Yahoo!), Dong Nguyen, and Djoerd Hiemstra

This paper presents FedWeb Greatest Hits, a large new test collection for research in web information retrieval. As a combination and extension of the datasets used in the TREC Federated Web Search Track, this collection opens up new research possibilities on federated web search challenges, as well as on various other problems.

The paper will be presented at the 24th International World Wide Web Conference (WWW 2015) in Florence, Italy on 18-22 May 2015.

[download pdf]

To obtain the dataset go to: http://fedwebgh.intec.ugent.be.

Sebastiaan Vercammen graduates on displaying intermediate results for on-going searches

Wednesday, December 17th, 2014, posted by Djoerd Hiemstra

by Sebastiaan Vercammen

Distributed search introduces problems with resources that require time to process queries and produce results, and users waiting to get an answer to their query. The system could wait a maximum amount of time for every resource to produce its results or start displaying results the very moment they are retrieved by the distributed search engine. This thesis introduces a number of alternative display strategies and describes a method to research their effectiveness in providing the most relevant results, as quickly and as high in the combined results as possible, while maintaining a user-friendly search experience. It then continues by describing the performed research and its results. For each experiment, test participants are asked a number of questions, to describe their experience operating the search engine using the specific display strategy. Also recorded are statistics concerning test participants’ clicks. These metrics are combined with the answers to the user questions and also used for determining the best display strategy. Observations were made of aspects that seemed to have influenced the experiment, such as the red color of the notifications used for one of the display strategies. The precise influence of these aspects should be further studied, by using A/B testing, as proposed in section 7.2. Finally, the conclusion is drawn that the Screen fill with “endless” scrolling display strategy (section 3.3.4) performed best when taking the test participants’ answers into account.
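The two baseline behaviours mentioned at the start of the abstract (wait up to a maximum time, or display results the moment they arrive) can be sketched with Python's asyncio. The resource names, delays, and function names below are hypothetical stand-ins, and the thesis's own display strategies are more elaborate than this.

```python
import asyncio


async def query_resource(name, delay, results):
    """Stand-in for a remote search engine that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return name, results


async def wait_then_display(resources, timeout):
    """Strategy 1: wait up to `timeout` seconds, then show whatever has arrived."""
    tasks = [asyncio.create_task(query_resource(*r)) for r in resources]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # too slow: leave this resource off the result page
    # Keep the original resource order for the tasks that finished in time.
    return [task.result() for task in tasks if task in done]


async def display_as_available(resources, show):
    """Strategy 2: show each resource's results the moment they arrive."""
    tasks = [asyncio.create_task(query_resource(*r)) for r in resources]
    for finished in asyncio.as_completed(tasks):
        name, results = await finished
        show(name, results)


if __name__ == "__main__":
    resources = [("slow", 0.3, ["s1"]), ("fast", 0.01, ["f1"])]
    print(asyncio.run(wait_then_display(resources, timeout=0.1)))
    asyncio.run(display_as_available(resources, lambda n, r: print(n, r)))
```

The first strategy gives a stable page but makes every user wait for the timeout; the second shows something quickly but lets results shift as slower resources respond, which is the trade-off the display strategies in the thesis try to balance.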

[download pdf]

Andres Marenco graduates on federated search

Friday, December 12th, 2014, posted by Djoerd Hiemstra

Federated Aggregated Search

by Andrés Marenco Zúñiga

The traditional search engine paradigm has changed from retrieving simple text documents to selecting a broader combination of diverse document types (e.g. images, videos, maps…) that could satisfy the user’s information need. Each type of document, stored in specialized databases known as ‘verticals’ and found in either local or federated locations, is nowadays integrated into ‘aggregated search engines’. Because each vertical covers a specific domain, only the verticals most likely to contain the desired information should be selected when a query enters the system. To perform this selection, a text representation of each vertical is created by directly sampling a set of documents from the vertical’s search engine. However, the vertical representation is often not descriptive enough. Reasons such as the heterogeneous nature of the documents or the lack of cooperation of the vertical could negatively affect the generation of the representation. Thus, we focus on the problem of creating an aggregated search engine which integrates federated collections in an uncooperative environment. With the help of Wikipedia as a complementary external source of information, we investigate the use of three techniques found in the literature aimed at enriching the vertical representation: a) using only Wikipedia articles as representation; b) using a combination of Wikipedia articles and the sample obtained from the vertical; and c) expanding the contents of each sampled document. We discovered that, by applying latent Dirichlet allocation to model the hidden topics of documents directly sampled from each vertical, it is possible to identify Wikipedia articles with the same theme coverage as the vertical. We then demonstrate that using only Wikipedia articles to represent some particular verticals improves the selection task.
As a second point, we explored the use of the modelled topics together with Wikipedia categories to boost the score of the verticals that could be associated with the query string. Although in this case our results are inconclusive, the experiments suggest that, by applying query classification and then matching the obtained categories with the verticals’ categories, it is possible to increase the effectiveness of the vertical selection task.
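A much simplified sketch of representation-based vertical selection is given below. It corresponds loosely to technique b), combining sampled documents with Wikipedia text, and ranks verticals by a smoothed unigram query likelihood. It does not use latent Dirichlet allocation or Wikipedia categories, and all function names, the smoothing weight `lam`, and the toy texts are assumptions made for illustration.

```python
import math
from collections import Counter


def build_representation(sampled_docs, wikipedia_docs=()):
    """Bag-of-words representation of one vertical from sampled documents,
    optionally enriched with the text of related Wikipedia articles."""
    counts = Counter()
    for text in list(sampled_docs) + list(wikipedia_docs):
        counts.update(text.lower().split())
    return counts


def select_verticals(query, representations, lam=0.5):
    """Rank vertical names by smoothed query log-likelihood, best first."""
    background = Counter()
    for counts in representations.values():
        background.update(counts)
    bg_total = sum(background.values())
    scores = {}
    for name, counts in representations.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            p_vertical = counts[term] / total if total else 0.0
            # Add-one smoothing keeps the background probability above zero.
            p_background = (background[term] + 1) / (bg_total + len(background) + 1)
            score += math.log((1 - lam) * p_vertical + lam * p_background)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    reps = {
        "video": build_representation(["funny cat clip"], ["video film clip"]),
        "news": build_representation(["election results today"]),
    }
    print(select_verticals("film", reps))
```

In this toy example the query term "film" occurs only in the Wikipedia text added to the video vertical, so the enrichment is what allows that vertical to be ranked first.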

[download pdf]