Djoerd Hiemstra – Page 2 – Research, Teaching and More

Artificial intelligence: there are problems we need to address right now, the rest is science fiction

by Frederik Zuiderveen Borgesius, Marvin van Bekkum, and Djoerd Hiemstra

Everywhere you read warnings of ‘existential risks’ from artificial intelligence (AI). Some even warn that AI could wipe out humanity. The tech company OpenAI is predicting the emergence of artificial general intelligence and superintelligence, and of future AI systems that will be more intelligent than humans. Some policymakers also fear this kind of scenario.

But things are not moving that fast. ‘Artificial general intelligence’ means an AI system that, like humans, can perform a variety of different tasks. There is no such general AI at present, and even if it does come one day, creating it will take a very long time.

Many AI systems are useful. Search engines, for example, are indispensable to internet users, and are a good example of specific AI. A specific AI system can perform one task well, such as pointing people to the right website. Modern spam filters, translation software, and speech recognition software also work well thanks to specific AI.

But these are still examples of specific AI – far removed from general AI, let alone ‘superintelligence’. Humans can learn new things. AI systems cannot. What computer scientists are getting better and better at is creating general large language models that can be used for all kinds of specific AI. The same language model can be used for translation software, spam filters, and search engines. Does this mean that such a language model has general intelligence? Could it develop consciousness? Absolutely not! There is therefore no real risk of a science fiction scenario in which an AI system wipes out humanity.

This focus on existential risks distracts us from the real risks at hand, which require our attention right now. Little remains of our privacy, for example. AI systems are trained using data, lots of data. That is why AI developers, mostly big tech companies, are collecting massive amounts of data. For instance, OpenAI presumably gobbled up large sections of the web to develop ChatGPT, including personal data. Incidentally, OpenAI is quite secretive about what data it uses.

Secondly, the use of AI can lead to unfair discrimination. For example, many facial recognition systems do not work well for people with darker skin tones. In the US, the police have repeatedly arrested the wrong person because a facial recognition system wrongly identified a dark-skinned man as a criminal.

Thirdly, AI systems consume incredible amounts of electricity. Training and using language models like GPT require a lot of computing power from large data centres, which guzzle energy. Finally, the power of big tech companies is only growing with the use of AI systems. Developing AI systems costs a lot of money, so as the use of AI increases, we become even more dependent on big tech companies. These kinds of risks are already here now. Let’s focus on those, and not let ourselves be distracted by the ghost of sentient AI.

Published by Radboud Recharge.

SIGIR 2023 live at Radboud

On 24, 25 and 26 July we will follow the 46th International ACM SIGIR Conference online from lecture hall 0.28 in the Mercator building. We will start each morning at 8:30h. for the live stream from Taipei, Taiwan and watch recorded sessions and keynotes in the afternoon. There will be presentations from well-known Radboud researchers such as Harrie Oosterhuis, Chris Kamphuis and Negin Ghasemi! 😄

More information at: https://sigir.org/sigir2023/

Fausto de Lang graduates on tokenization for information retrieval

An empirical study of the effect of vocabulary size for various tokenization strategies in passage retrieval performance.

by Fausto de Lang

Many interactions between the fields of lexical retrieval and large language models remain underexplored. In particular, there is little research into the use of advanced language model tokenizers in combination with classical information retrieval mechanisms. This research looks into the effect of vocabulary size for various tokenization strategies on passage retrieval performance. It also provides an overview of the impact of the WordPiece, Byte-Pair Encoding and Unigram tokenization techniques on the MSMARCO passage retrieval task. These techniques are explored both in re-trained tokenizers and in tokenizers trained from scratch. Based on three metrics, this research has found that WordPiece is the best performing tokenization technique on the MSMARCO passage retrieval task. It has also found that a training vocabulary size of around 10,000 tokens is best with regard to Recall, while around 320,000 tokens gives the optimal Mean Reciprocal Rank and Normalized Discounted Cumulative Gain scores. Most importantly, the optimum at a relatively small vocabulary size suggests that shorter subwords can benefit the indexing and searching process (up to a certain point). This is a meaningful result, since it means that many applications in which (re-)trained tokenizers are used for information retrieval might be improved by tweaking the vocabulary size during training. This research has mainly focused on building a bridge between (re-)trainable tokenizers and information retrieval software, while reporting on interesting tunable parameters. Finally, this research recommends that researchers build their own tokenizer from scratch, since doing so forces one to look at the configuration of the underlying processing steps.
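For readers unfamiliar with the three metrics named in the abstract, the following is a minimal sketch of how Recall, reciprocal rank and nDCG are commonly computed for a single query with binary relevance judgements. The function names, the cutoff k and the toy passage ids are assumptions made for illustration; this is not the evaluation code of the thesis.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant passage in the top k, or 0 if there is none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, pid in enumerate(ranked_ids[:k], start=1)
              if pid in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical toy run: one query, passage ids invented for illustration.
ranking = ["p7", "p3", "p9", "p1"]
relevant = {"p3", "p1"}
print(recall_at_k(ranking, relevant, k=3))
print(reciprocal_rank(ranking, relevant, k=3))
print(ndcg_at_k(ranking, relevant, k=3))
```

Mean Reciprocal Rank is the average of the reciprocal rank over all queries in the test set. Comparing tokenizers then amounts to rebuilding the index with each vocabulary size and averaging these per-query scores.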

Defended on 27 June 2023

Git repository at: gitlab.com/tokenization/Lucene

Vacancy: professor of software technology

We’re hiring an assistant or associate professor in our programming languages and compiler group. Apply online at: https://www.ru.nl/en/working-at/job-opportunities/assistant-or-associate-professor-of-software-technology

UNFair: Search Engine Manipulation, Undetectable by Amortized Inequity

by Tim de Jonge and Djoerd Hiemstra

Modern society increasingly relies on Information Retrieval systems to answer various information needs. Since this impacts society in many ways, there has been a great deal of work to ensure the fairness of these systems, and to prevent societal harms. There is a prevalent risk of failing to model the entire system, where nefarious actors can produce harm outside the scope of fairness metrics. We demonstrate the practical possibility of this risk through UNFair, a ranking system that achieves performance and measured fairness competitive with current state-of-the-art, while simultaneously being manipulative in setup. UNFair demonstrates how adhering to a fairness metric, Amortized Equity, can be insufficient to prevent Search Engine Manipulation. This possibility of manipulation bypassing a fairness metric discourages imposing a fairness metric ahead of time, and motivates instead a more holistic approach to fairness assessments.
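To make the notion of an amortized fairness metric more concrete, here is a minimal sketch under stated assumptions. It assumes one common formulation, in which each item's share of the attention accumulated over a series of rankings is compared with its share of relevance. The geometric position-bias model, the function names and the toy data are all hypothetical; this is neither the UNFair system nor necessarily the exact metric used in the paper.

```python
def attention_weights(n_positions, decay=0.5):
    """Assumed geometric position bias: each rank receives `decay` times
    the attention of the rank directly above it."""
    return [decay ** rank for rank in range(n_positions)]

def amortized_unfairness(rankings, relevance, decay=0.5):
    """Sum over items of |attention share - relevance share|,
    with attention accumulated over all rankings in the series."""
    attention = {item: 0.0 for item in relevance}
    for ranking in rankings:
        for item, weight in zip(ranking, attention_weights(len(ranking), decay)):
            attention[item] += weight
    total_attention = sum(attention.values())
    total_relevance = sum(relevance.values())
    return sum(abs(attention[item] / total_attention
                   - relevance[item] / total_relevance)
               for item in relevance)

# Hypothetical toy data: two equally relevant items.
relevance = {"a": 0.5, "b": 0.5}
always_a_first = [["a", "b"], ["a", "b"]]
alternating = [["a", "b"], ["b", "a"]]
print(amortized_unfairness(always_a_first, relevance))
print(amortized_unfairness(alternating, relevance))
```

In this toy setting, alternating the two items equalizes their accumulated attention, so the measured unfairness drops to zero even though every individual ranking still favours one item.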

To be presented at the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2023) on 12-15 June in Chicago, USA.


Cross-Market Product-Related Question Answering

by Negin Ghasemi, Mohammad Aliannejadi, Hamed Bonab, Evangelos Kanoulas, Arjen de Vries, James Allan, and Djoerd Hiemstra

Online shops such as Amazon, eBay, and Etsy continue to expand their presence in multiple countries, creating new resource-scarce marketplaces with thousands of items. We consider a marketplace to be resource-scarce when only limited user-generated data is available about the products (e.g., ratings, reviews, and product-related questions). In such a marketplace, an information retrieval system is less likely to help users find answers to their questions about the products. As a result, questions posted online may go unanswered for extended periods. This study investigates the impact of using available data in a resource-rich marketplace to answer new questions in a resource-scarce marketplace, a new problem we call cross-market question answering. To study this problem’s potential impact, we collect and annotate a new dataset, XMarket-QA, from Amazon’s UK (resource-scarce) and US (resource-rich) local marketplaces. We conduct a data analysis to understand the scope of the cross-market question-answering task. This analysis shows a temporal gap of almost one year between the first question answered in the UK marketplace and the US marketplace. Also, it shows that the first question about a product is posted in the UK marketplace only when 28 questions, on average, have already been answered about the same product in the US marketplace. Human annotations demonstrate that, on average, 65% of the questions in the UK marketplace can be answered within the US marketplace, supporting the concept of cross-market question answering. Inspired by these findings, we develop a new method, CMJim, which utilizes product similarities across marketplaces in the training phase for retrieving answers from the resource-rich marketplace that can be used to answer a question in the resource-scarce marketplace. Our evaluations show CMJim’s significant improvement compared to competitive baselines.

To be presented at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023) on July 23-27 in Taipei, Taiwan.


Towards a Generic Model for Classifying Software into Correctness Levels and its Application to SQL

by Benard Wanjiru, Patrick van Bommel, and Djoerd Hiemstra

Automated grading systems can save a lot of time when grading software exercises. In this paper, we present our ongoing work on a generic model for generating software correctness levels. These correctness levels enable partial grades of students’ software exercises. The generic model can be used as a foundation for correctness of SQL queries and can be generalized to different programming languages.

To be presented at the SEENG 2023 Workshop on Software Engineering for the Next Generation of the 45th International Conference on Software Engineering on Tuesday 16 May in Melbourne, Australia.


#OSSYM2023 at CERN

The Open Search Symposium #OSSYM2023 brings together the Open Internet Search community in Europe for the fifth time this year. The interactive conference provides a forum to discuss and further develop the ideas and concepts of open internet search. Participants include researchers, data centres, libraries, policy makers, legal and ethical experts, and society.

#OSSYM2023 takes place at CERN, Geneva, Switzerland on 4-6 October 2023 organized by the Open Search Foundation. The Call for Papers ends 31 May 2023.

More info at: https://opensearchfoundation.org/5th-international-open-search-symposium-ossym2023/

Was Fairness in IR Discussed by Cooper and Robertson in the 1970s?

by Djoerd Hiemstra

I discuss fairness in Information Retrieval (IR) through the eyes of Cooper and Robertson’s probability ranking principle. I argue that unfair rankings may arise from blindly applying the principle without checking whether its preconditions are met. Following this argument, unfair rankings originate from the application of learning-to-rank approaches in cases where they should not be applied according to the probability ranking principle. I use two examples to show that fairer rankings may also be more relevant than rankings that are based on the probability ranking principle.

Published in ACM SIGIR Forum 56(2), 2022


Guest lecture by Hannes Mühleisen

We are proud to announce that Hannes Mühleisen will give a guest lecture on Tuesday 13 December at 13:30h. in LIN-2 for the course Information Modelling and Databases. Hannes Mühleisen is the creator of DuckDB and co-founder and CEO of DuckDB Labs. He is also a senior researcher of the Database Architectures group at the Centrum Wiskunde & Informatica (CWI) in Amsterdam. Students of the course use DuckDB to practice their SQL skills.

Analytical Query Processing and the DuckDB System

by Hannes Mühleisen

DBMSs have historically been created to support transactional (OLTP) workloads. However, a second use case, analytical data analysis (OLAP), quickly appeared. These workloads are characterised by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables. It is rather impossible for an OLTP-focused DBMS to perform well in OLAP scenarios, which is why specialised systems have been developed. In this lecture, I will introduce analytical query processing, give an overview of the state of the art in research and industry, and describe our own analytical DBMS, DuckDB.
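The difference between the two workload types can be illustrated with a small sketch. To keep it self-contained it uses Python's built-in sqlite3 module as a stand-in, not DuckDB itself; the table names and rows are invented for illustration. The first query is a typical OLTP-style point lookup, the second an OLAP-style join with an aggregation over all rows.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "NL"), (2, "NL"), (3, "DE")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0), (4, 3, 7.5)])

# OLTP-style query: touch a single row through its primary key.
point_lookup = conn.execute(
    "SELECT amount FROM orders WHERE id = ?", (3,)).fetchone()

# OLAP-style query: join both tables and aggregate over every order.
revenue_per_country = conn.execute("""
    SELECT c.country, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

print(point_lookup)
print(revenue_per_country)
```

On four rows both queries return instantly. The contrast only becomes visible at scale, where the second query has to read a large part of the stored data while the first still touches a single row.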