Rutger Varkevisser graduates on Large Scale Online Readability Assessment

November 30th, 2016, posted by Djoerd Hiemstra

by Rutger Varkevisser

The internet is an incredible resource for information and learning. With search engines like Google, information is usually just a click away. Unless you are a child, that is: in that case most of the information on the web is either (way) too difficult to read or understand, or impossible to find. This research aims to combine the areas of readability assessment and gamification in order to provide a technical and theoretical foundation for the creation of an automatic, large-scale readability assessment system based on child feedback, in which correctly assessing the readability level of online (textual) content for children is the central focus. Correct readability scores for online content are important because they give children a guideline on the difficulty level of textual content on the web. They also allow external programs, e.g. search engines, to take readability scores into account based on the known age or proficiency of the user. Having children actively participate in the process of determining readability levels should improve on current systems, which usually rely on fully automated algorithms or human (adult) perception.
The first step in the creation of the aforementioned tool is to make sure the underlying process is scientifically valid. This research has adapted the Cloze-test as a method of determining the readability of a text. The Cloze-test is an established and well-researched method of readability assessment, which works by omitting certain words from a text and tasking the user with filling in the open spots with the correct words. The resulting overall score determines the readability level. For this research we want to digitize and automate this process. However, while the validity of the Cloze-test and its results in an offline (paper) environment has been proven, this is not the case for any digital adaptation. Therefore the first part of this research focusses on this central issue. This validity was examined and tested by combining the areas of readability assessment (the Cloze-test), gamification (the creation of a digital online adaptation of the Cloze-test) and child-computer interaction (a user-test of the developed tool with the target audience). In the user-test the participants completed several different Cloze-test texts, half of them offline (on paper) and the other half in a recreated online environment. This was done to measure the correlation between the online scores and the offline scores, which we already know are valid. Results of the user-test confirmed the validity of the online version by showing significant correlations between the offline and online versions via both a Pearson correlation coefficient and Spearman’s rank-order analysis.
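
The correlation check described above can be illustrated with a small sketch. The scores below are made-up placeholder values, not data from the study, and the function names are hypothetical. The sketch only shows how a Pearson coefficient and a Spearman rank-order coefficient could be computed for paired offline and online Cloze-test scores; it does not include the significance test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank-order correlation: Pearson computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical paired scores: fraction of gaps filled in correctly,
# one offline and one online value per participant.
offline_scores = [0.40, 0.55, 0.60, 0.75, 0.90]
online_scores = [0.35, 0.50, 0.65, 0.70, 0.95]

print("Pearson: ", round(pearson(offline_scores, online_scores), 3))
print("Spearman:", round(spearman(offline_scores, online_scores), 3))
```

Pearson measures how linear the relation between the two score lists is, while Spearman only looks at whether participants are ordered the same way in both settings, which is why reporting both gives a more robust picture.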
With the knowledge that the online adaptation of the Cloze-test is valid for determining readability scores, the next step was to automate the process of creating Cloze-tests from texts. After all, the goal of the project was to provide the basis of a scalable gamified approach, and scalable in this context means automated. Several methods were developed to mimic the human process of creating a Cloze-test (i.e. looking at the text and selecting which words to omit given a set of general guidelines). These methods included TF.IDF and NLP approaches for finding suitable words to extract for a Cloze-test. They were tested by comparing the classification performance of each method with a baseline of manually marked texts. The final versions of the aforementioned methods were tested and resulted in performance scores of around 50%, where the score expresses how well a method emulated human performance in the creation of Cloze-tests. A combination of automated methods resulted in an even higher performance score of 63%. The best performing individual method was put to the test in a small Turing-test style user-test, which showed promising results. Presented with 2 manually created and 1 automatically created Cloze-test, participants attained similar scores across all tests. Participants also gave contradicting responses when asked which of the 3 Cloze-tests was automated. This research concludes the following:

  1. Results of offline- and online Cloze-tests are highly correlated.
  2. Automated methods are able to correctly identify 63% of suitable Cloze-test words as marked by humans.
  3. Users gave conflicting reports when asked to identify the automated test in a mix of both automated- and human-made Cloze-tests.
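
The idea of selecting Cloze-test words with TF.IDF and comparing the selection to human markings can be sketched as follows. This is a minimal illustration under stated assumptions, not the method developed in the thesis: the tokenizer, the background texts, the number of gaps and all function names are hypothetical, and the example texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_scores(target_tokens, background_docs):
    """TF.IDF score for every distinct word of the target text."""
    docs = [set(tokenize(d)) for d in background_docs] + [set(target_tokens)]
    n_docs = len(docs)
    scores = {}
    for word, count in Counter(target_tokens).items():
        df = sum(1 for d in docs if word in d)  # at least 1: the target itself
        scores[word] = (count / len(target_tokens)) * math.log(n_docs / df)
    return scores


def select_gap_words(text, background_docs, n_gaps):
    """Pick the n_gaps highest scoring words (ties broken alphabetically)."""
    scores = tfidf_scores(tokenize(text), background_docs)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:n_gaps])


def make_cloze(text, gap_words):
    """Replace every occurrence of a selected word with a blank."""
    def blank(match):
        word = match.group(0)
        return "_____" if word.lower() in gap_words else word
    return re.sub(r"[A-Za-z']+", blank, text)


def agreement(automatic, manual):
    """Fraction of the human-marked words that were also selected automatically."""
    return len(automatic & manual) / len(manual)


# Invented example texts; the "manual" set stands in for a human marking.
text = "The fox hides in the forest. The fox eats berries."
background = ["The dog sleeps in the house.", "The cat eats in the garden."]
manual = {"fox", "forest", "eats"}

automatic = select_gap_words(text, background, n_gaps=3)
print(make_cloze(text, automatic))
print("agreement with human marking:", agreement(automatic, manual))
```

Words that occur in every background text, such as "the" and "in", receive a score of zero and are never selected, which roughly mirrors the guideline of not omitting function words.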

[download pdf]

IP&M Best Paper Award for A cross-benchmark comparison of 87 learning to rank methods

November 7th, 2016, posted by Djoerd Hiemstra

We are proud of the Information Processing & Management Best Paper Award 2015 for our paper: A cross-benchmark comparison of 87 learning to rank methods.

IPM Best Paper Award Certificate

Published in Information Processing and Management 51(6), pages 757–772

[download preprint]

Inoculating Relevance Feedback Against Poison Pills

November 4th, 2016, posted by Djoerd Hiemstra

by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx

Relevance Feedback (RF) is a common approach for enriching queries, given a set of explicitly or implicitly judged documents, to improve the performance of the retrieval. Although it has been shown that, on average, the overall performance of retrieval improves after relevance feedback, for some topics employing some relevant documents may decrease the average precision of the initial run. This is mostly because the feedback document is only partially relevant and contains off-topic terms; adding these terms to the query as expansion terms reduces retrieval performance. These relevant documents that hurt the performance of retrieval after feedback are called “poison pills”. In this paper, we discuss the effect of poison pills on relevance feedback and present significant words language models (SWLM) as an approach for estimating the feedback model to tackle this problem.
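
The poison pill effect described above can be made concrete with a toy sketch. It uses a naive frequency-based query expansion, not the SWLM estimation from the paper; the stopword list, the example documents and the `min_docs` guard are assumptions introduced purely for illustration.

```python
from collections import Counter

# Tiny hypothetical stopword list, just enough for the toy documents below.
STOPWORDS = {"the", "a", "of", "and", "in", "for", "to", "is"}


def terms(text):
    """Whitespace tokens in lowercase, without stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def expansion_terms(query, feedback_docs, k, min_docs=1):
    """Pick the k most frequent non-query terms from the feedback documents.

    min_docs is a hypothetical guard: a term must occur in at least that
    many feedback documents to be eligible as an expansion term.
    """
    query_terms = set(terms(query))
    term_freq = Counter()
    doc_freq = Counter()
    for doc in feedback_docs:
        doc_terms = terms(doc)
        term_freq.update(doc_terms)
        doc_freq.update(set(doc_terms))
    candidates = [
        t for t in term_freq
        if t not in query_terms and doc_freq[t] >= min_docs
    ]
    candidates.sort(key=lambda t: (-term_freq[t], t))
    return candidates[:k]


query = "jaguar speed"
feedback_docs = [
    "jaguar cat top speed sprint",
    "jaguar cat speed hunting sprint",
    # Partially relevant document with many off-topic terms.
    "jaguar speed cat and car engine car engine car dealer",
]

print(expansion_terms(query, feedback_docs, k=3))
print(expansion_terms(query, feedback_docs, k=3, min_docs=2))
```

In the first call the off-topic terms of the third document dominate the expansion, which is the kind of drift that can lower average precision. Requiring a term to occur in several feedback documents removes them in this toy case, but it is only a crude stand-in for a properly estimated feedback model.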

To be presented at the 15th Dutch-Belgian Information Retrieval Workshop, DIR 2016 on 25 November in Delft.

[download pdf]

Dutch-Belgian Information Retrieval workshop in Delft

November 2nd, 2016, posted by Djoerd Hiemstra

The Dutch-Belgian Information Retrieval workshop DIR 2016 will be held in Delft on 25 November. The preliminary workshop program contains 2 keynotes, 12 oral presentations and 7 poster presentations. Max Wilson from the University of Nottingham will provide a Human-Computer Interaction perspective on Information Retrieval. Carlos Castillo from Eurecat will talk about the detection of algorithmic discrimination.

DIR 2016

Register at

Data Science Platform Netherlands

October 7th, 2016, posted by Djoerd Hiemstra

The Data Science Platform Netherlands (DSPN) is the national platform for ICT research within the Data Science domain. Data Science is the collection and analysis of so-called ‘Big Data’ according to academic methodology. DSPN unites all Dutch academic research institutions where Data Science is carried out from an ICT perspective, specifically the computer science or applied mathematics perspectives. The objectives of DSPN are to:

  • Highlight the importance of ICT research in Big Data and Data Science, especially in national discussions about research and education.
  • Exchange and disseminate information about Data Science research and education.
  • Build and maintain a network of ICT researchers active in the field of Data Science.

DSPN is launched as part of the ICT Research Platform Netherlands (IPN) to give a voice to the Data Science initiatives of the Dutch ICT research organisations. For more information, see the website at:

#WhoAmI in 160 Characters?

October 5th, 2016, posted by Djoerd Hiemstra

Classifying Social Identities Based on Twitter

by Anna Priante, Djoerd Hiemstra, Tijs van den Broek, Aaqib Saeed, Michel Ehrenhard, and Ariana Need

We combine social theory and NLP methods to classify English-speaking Twitter users’ online social identity in profile descriptions. We conduct two text classification experiments. In Experiment 1 we use a 5-category online social identity classification based on identity and self-categorization theories. While we are able to automatically classify two identity categories (Relational and Occupational), automatic classification of the other three identities (Political, Ethnic/religious and Stigmatized) is challenging. In Experiment 2 we test a merger of such identities based on theoretical arguments. We find that by combining these identities we can improve the predictive performance of the classifiers in the experiment. Our study shows how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.

To be presented at the EMNLP Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) on November 5 in Austin, Texas, USA.

[download pdf]

Download the code book and classifier source code from github.

Data Science guest lectures

September 26th, 2016, posted by Djoerd Hiemstra

On 12 October we organize another Data Science Day in the Design Lab with guest lectures by Thijs Westerveld (Chief Science Officer at WizeNoze, Amsterdam), and Iadh Ounis (Professor of Information Retrieval in the School of Computing Science at the University of Glasgow). For more information and registration, see:

Resource Selection for Federated Search on the Web

September 22nd, 2016, posted by Djoerd Hiemstra

by Dong Nguyen, Thomas Demeester, Dolf Trieschnigg, and Djoerd Hiemstra

A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines.
First, we experiment with estimating the size of uncooperative search engines on the web using query based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.
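
How size estimates can feed into resource selection may be illustrated with a short sketch. It uses two textbook ingredients, a capture-recapture size estimate from two document samples and a size-scaled voting scheme over sampled documents; these are assumptions chosen for illustration and do not reproduce the ClueWeb09-based method or the experiments of the paper. All names and numbers are hypothetical.

```python
from collections import defaultdict


def capture_recapture_size(sample_a, sample_b):
    """Lincoln-Petersen estimate of a collection's size.

    sample_a and sample_b are document ids obtained from the same engine
    in two separate sampling runs (for example via query based sampling).
    """
    ids_a, ids_b = set(sample_a), set(sample_b)
    overlap = len(ids_a & ids_b)
    if overlap == 0:
        raise ValueError("samples do not overlap; no estimate possible")
    return len(ids_a) * len(ids_b) / overlap


def rank_resources(top_sample_hits, sample_sizes, estimated_sizes):
    """Rank engines by size-scaled votes.

    top_sample_hits lists, for every sampled document that ranked high for
    the query in a central index of samples, the engine it came from. Each
    hit counts for estimated_size / sample_size documents of that engine.
    """
    votes = defaultdict(float)
    for engine in top_sample_hits:
        votes[engine] += estimated_sizes[engine] / sample_sizes[engine]
    return sorted(votes, key=lambda e: (-votes[e], e))


# Two hypothetical samples of document ids from one engine.
run_1 = range(1, 51)    # ids 1..50
run_2 = range(41, 91)   # ids 41..90, so 10 ids overlap
print(capture_recapture_size(run_1, run_2))

# Hypothetical engines, sample sizes and size estimates.
hits = ["small_engine", "small_engine", "large_engine"]
sample_sizes = {"small_engine": 100, "large_engine": 100}
estimated_sizes = {"small_engine": 1000, "large_engine": 50000}
print(rank_resources(hits, sample_sizes, estimated_sizes))
```

The example also hints at the challenge of skewed collection sizes mentioned above: a single sampled hit from a very large engine can outweigh several hits from a small one.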

[download pdf]

CLEF keynote slides

September 14th, 2016, posted by Djoerd Hiemstra

The slides of the CLEF keynote can be downloaded below.

A case for search specialization and search delegation

Evaluation conferences like CLEF, TREC and NTCIR are important for the field, and keep being important because there is no “one-size-fits-all” for search engines. Different domains need different ranking approaches: for instance, Web search benefits from analyzing the link graph; Twitter search benefits from retweets and likes; restaurant search benefits from geo-location and reviews; advertisement search needs bids and click-through, etc. Researching many domains will teach us more about the need for and the value of specialized search engines, and about approaches that can quickly learn rankings for new domains using, for instance, learning-to-rank and clever feature selection.
A search engine that provides results from multiple domains therefore does better to delegate its queries to specialized search engines. This brings up unique research questions on how to best select a specialized search engine. The TREC Federated Web Search track, which ran in 2013 and 2014, studied these questions in two tasks. The resource selection task studied how to select the top specialized search engines for a query, given the query but before seeing its results. The vertical selection task studied how to select the top domains from a predefined set of domains such as news, video, Q&A, etc.
I will present the lessons that we learned from running the Federated Web Search track, focusing on successful approaches to resource selection and vertical selection. I will conclude the talk by discussing our steps to take this work to full practice by running the University of Twente’s search engine as a federation of more than 30 smaller search engines, including local databases with news, courses, publications, as well as results from social media like Twitter and YouTube. The engine that runs U. Twente search is called Searsia and is available as open source software at:

[download slides]

SIKS/CBS Data Camp & Advanced Course on Managing Big Data

September 12th, 2016, posted by Djoerd Hiemstra

On 6 and 7 December 2016, the Netherlands School for Information and Knowledge Systems (SIKS) and Statistics Netherlands (CBS) organize a two-day tutorial on the management of Big Data, the Data Camp, hosted at the University of Twente.
The Data Camp’s objective is to use big data sets to produce valuable and innovative answers to research questions with societal relevance. SIKS PhD students and CBS data analysts will learn about big data technologies and create, in small groups, feasibility studies for a research question of their choice.
Participants get access to predefined CBS research questions and massive datasets, including a large collection of Dutch Tweets, traffic data from Dutch highways, and AIS data from ships. Participants will get access to the Twente Hadoop cluster, a 56-node cluster with almost 1 petabyte of storage space. The tutorial focuses on hands-on experience. The Data Camp participants will work in small, mixed teams in an informal setting, which stimulates intense contact with technologies and research questions. Experienced data scientists will support the teams with short lectures and hands-on support. The short lectures will introduce technologies to manage and visualize big data that were first adopted by Google and are now used by many companies that manage large datasets. The tutorial teaches how to process terabytes of data on large clusters of commodity machines using new programming styles like MapReduce and Spark. The tutorial will be given in English and is part of the educational program for SIKS PhD students.
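
For readers unfamiliar with the MapReduce programming style mentioned above, the following is a minimal single-process sketch of its three phases, counting hashtags in a handful of invented tweets. It is not Hadoop or Spark code and the function names are hypothetical; on a real cluster the map and reduce functions would run in parallel on many machines and the framework would take care of the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(tweet):
    """Map: emit a (hashtag, 1) pair for every hashtag in one tweet."""
    return [(word.lower(), 1) for word in tweet.split() if word.startswith("#")]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def count_hashtags(tweets):
    """Run the three phases one after the other in a single process."""
    pairs = chain.from_iterable(map_phase(t) for t in tweets)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Invented example tweets, not taken from the CBS collection.
tweets = [
    "Stuck on the A1 again #file #A1",
    "Rain in Enschede #weer",
    "#File near Deventer",
]
print(count_hashtags(tweets))
```

Because the map function looks at one tweet at a time and the reduce function at one key at a time, the same logic scales to large datasets by splitting the input over the nodes of a cluster.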

Also see the SIKS announcement.