Welcome Arnaud van Harmelen

May 16th, 2017, posted by Djoerd Hiemstra

Welcome to the Database Group Arnaud van Harmelen!
Arnaud will work on SEQUOIA.

Greetings from CuriousU Search Engine Technology

May 11th, 2017, posted by Djoerd Hiemstra

CuriousU Search Engine Technology will explore the world of search engines. You will learn how search engines work, what challenges they deal with, and how their performance can be measured. And even better: you will be guided in building, evaluating, and improving your own search engine on a real-world dataset.

To be presented at CuriousU Summer School 2017, 13-22 August 2017, at the University of Twente.

Private search in the browser

April 24th, 2017, posted by Djoerd Hiemstra

Even our smart phones are now powerful enough to search serious-sized document collections, such as personal blogs, sites with software documentation, sites of small and medium-sized enterprises, and even the famous Cranfield collection. In-browser search comes with interesting privacy benefits.

Read more at the Searsia Blog.

Slavica Zivanovic graduates on capturing and mapping QOL using Twitter data

February 27th, 2017, posted by Djoerd Hiemstra

by Slavica Zivanovic

There is an ongoing discussion about the applicability of social media data in scientific research. Moreover, little is known about the feasibility of using these data to capture the Quality of Life (QoL). This study explores the use of social media in QoL research by capturing and analysing people’s perceptions about their QoL using Twitter messages. The methodology is based on a mixed method approach, combining manual coding of the messages, automated classification, and spatial analysis. The city of Bristol is used as a case study, with a dataset containing 1,374,706 geotagged Tweets sent within the city boundaries in 2013. Based on the manual coding results, health, transport, and environment domains were selected to be further analysed. Results show the difference between Bristol wards in number and type of QoL perceptions in every domain, spatial distribution of positive and negative perceptions, and differences between the domains. Furthermore, results from this study are compared to the official QoL survey results from Bristol, statistically and spatially. Overall, three main conclusions are underlined. First, Twitter data can be used to evaluate QoL. Second, based on people’s opinions, there is a difference in QoL between Bristol neighbourhoods. And, third, Twitter messages can be used to complement QoL surveys but not as a proxy. The main contribution of this study is in recognising the potential Twitter data have in QoL research. This potential lies in producing additional knowledge about QoL that can be placed in a planning context and effectively used to improve the decision-making process and enhance quality of life of residents.

[download pdf]

Cum laude degree for Masrour Zoghi

February 24th, 2017, posted by Djoerd Hiemstra

Dueling bandits for online ranker evaluation

by Masrour Zoghi

In every domain where a service or a product is provided, an important question is that of evaluation: given a set of possible choices for deployment, what is the best one? An important example, which is considered in this work, is that of ranker evaluation from the field of information retrieval (IR). The goal of IR is to satisfy the information need of a user in response to a query issued by them, where this information need is typically satisfied by a document (or a small set of documents) contained in what is often a much larger collection. This goal is often attained by ranking the documents according to their usefulness to the issued query using an algorithm, called a ranker, a procedure that takes as input a query and a set of documents and specifies how the documents need to be ordered.
This thesis is concerned with ranker evaluation. The goal of ranker evaluation is to determine the quality of the rankers under consideration to allow us to choose the best option: given a finite set of possible rankers, which one of them leads to the highest level of user satisfaction? There are two main methods for carrying this out: absolute metrics and relative comparisons. This thesis is concerned with the second, relative form of ranker evaluation because it is more efficient at distinguishing between rankers of different quality: for instance interleaved comparisons take a fraction of the time required by A/B testing, but they produce the same outcome. More precisely, the problem of online ranker evaluation from relative feedback can be described as follows: given a finite set of rankers, choose the best using only pairwise comparisons between the rankers under consideration, while minimizing the number of comparisons involving sub-optimal rankers. This problem is an instance of what is referred to as the dueling bandit problem in the literature.
The main contribution of this thesis is devising a dueling bandit algorithm, called Copeland Confidence Bounds (CCB), that solves this problem under practically general assumptions and providing theoretical guarantees for its proper functioning. In addition to that, the thesis contains a number of other algorithms that are better suited for dueling bandit problems with particular properties.
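As a small illustration of the selection criterion behind the algorithm's name, the sketch below computes Copeland scores from a matrix of pairwise win probabilities. It is not the CCB algorithm from the thesis: CCB has to estimate these probabilities online from noisy comparisons, whereas here the matrix is assumed to be known. The matrix values and the function names are hypothetical and chosen only for the example.

```python
from typing import List


def copeland_scores(pref: List[List[float]]) -> List[int]:
    """Count, for each ranker i, how many other rankers it beats.

    pref[i][j] is the (assumed known) probability that ranker i wins a
    pairwise comparison against ranker j; i beats j if pref[i][j] > 0.5.
    """
    n = len(pref)
    return [
        sum(1 for j in range(n) if j != i and pref[i][j] > 0.5)
        for i in range(n)
    ]


def copeland_winners(pref: List[List[float]]) -> List[int]:
    """Return the indices of all rankers with the maximal Copeland score."""
    scores = copeland_scores(pref)
    best = max(scores)
    return [i for i, score in enumerate(scores) if score == best]


# Hypothetical preference matrix for three rankers (made-up numbers).
PREF = [
    [0.50, 0.60, 0.70],
    [0.40, 0.50, 0.55],
    [0.30, 0.45, 0.50],
]

if __name__ == "__main__":
    print(copeland_scores(PREF))
    print(copeland_winners(PREF))
```

A Copeland winner always exists, even when no single ranker beats all others, which is one reason it is a convenient target for a dueling bandit algorithm.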

[download pdf]

Wim van der Zijden graduates on Multi-Tenant Customizable Databases

February 16th, 2017, posted by Djoerd Hiemstra

by Wim van der Zijden

A good practice in business is to focus on key activities. For some companies this may be branding, while other businesses may focus on areas such as consultancy, production or distribution. Focusing on key activities means outsourcing as many other activities as possible. These other activities merely distract from the main goals of the company and the company will not be able to excel in them.
Many companies are in need of reliable software to persistently process live data transactions and enable reporting on this data. To fulfil this need, they often have large IT departments in-house. Those departments are costly and distract from the company’s main goals. The emergence of cloud computing should make this no longer necessary. All they need is an internet connection and a service contract with an external provider.
However, most businesses are in need of highly customizable software, because each company has slightly different business processes, even those in the same industry. So even if they outsource their IT need, they will still have to pay expensive developers and business analysts to customize some existing application.
These issues are addressed by Multi-Tenant Customizable (MTC) applications. We define such an application as follows:

A single software solution that can be used by multiple organizations at the same time and which is highly customizable for each organization and user within that organization, by domain experts without a technical background.

A key challenge in designing such a system is to develop a proper persistent data storage, because mainstream databases are optimized for single tenant usage. To this end this Master’s thesis consists of two papers: the first paper proposes an MTC-DB Benchmark, MTCB. This Benchmark allows for objective comparison and evaluation of MTC-DB implementations, as well as providing a framework for the definition of MTC-DB. The second paper describes a number of MTC-DB implementations and uses the benchmark to evaluate those implementations.

[download pdf]

Query autocompletions considered harmful

February 9th, 2017, posted by Djoerd Hiemstra

Does Google doubt whether the holocaust happened?

did the holocaust happen

Query autocompletion algorithms that are based on query logs are problematic in two important ways: 1) They return offensive and damaging results; 2) They suffer from a destructive feedback loop.
Read more

Two PhD positions on Maintenance Optimization for the Dutch railroads

January 3rd, 2017, posted by Djoerd Hiemstra

We are hiring two PhD candidates to work on Maintenance Optimization for the Dutch railroads.

The Database and Formal Methods & Tools groups at the University of Twente seek two PhD candidates for SEQUOIA: Smart maintenance optimization via big data & fault tree analysis, a project funded by the Dutch Technology Foundation STW, and the companies ProRail and NS. ProRail is responsible for the Dutch railway network, including its construction, management, maintenance, and safety; NS has the same responsibility for the Dutch train fleet.

Predictive maintenance explained

SEQUOIA aims to improve the reliability of the Dutch railroads by deploying big data analytics to predict and prevent failures. Its scientific core is a novel combination of machine learning, fault tree analysis and stochastic model checking. The key idea is that big data analytics provide the statistics on failures, their correlations, dependencies, etc., while fault trees provide the domain knowledge needed to interpret these data. The project aims at fewer train disruptions and delays, lower maintenance cost and more passenger comfort. The project involves an intense cooperation with the RWTH Aachen University and with various engineers from ProRail and NS. The PhD candidates will spend a portion of their time at the ProRail / NS sites in Utrecht.
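To give a feel for how a fault tree combines failure statistics, here is a minimal sketch that computes the probability of a top-level failure from basic event probabilities. It assumes the basic events are statistically independent, which is a simplification, and it does not describe the methods used in SEQUOIA. The event names and probabilities are hypothetical and made up for the example.

```python
from math import prod
from typing import Iterable


def and_gate(probs: Iterable[float]) -> float:
    """Probability that all independent input events occur."""
    return prod(probs)


def or_gate(probs: Iterable[float]) -> float:
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical basic event probabilities (made-up numbers).
P_PRIMARY_SENSOR_FAILS = 0.1
P_BACKUP_SENSOR_FAILS = 0.2
P_POWER_SUPPLY_FAILS = 0.05

# Hypothetical tree: the top event occurs if both sensors fail,
# or if the power supply fails.
p_both_sensors_fail = and_gate([P_PRIMARY_SENSOR_FAILS, P_BACKUP_SENSOR_FAILS])
p_top_event = or_gate([p_both_sensors_fail, P_POWER_SUPPLY_FAILS])

if __name__ == "__main__":
    print(f"P(top event) = {p_top_event:.3f}")
```

In this picture, the statistics would supply the basic event probabilities, and the tree structure encodes the domain knowledge about how component failures combine.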

Apply on-line.

Searsia nominated by ISOC NL

January 2nd, 2017, posted by Djoerd Hiemstra

The Dutch chapter of the Internet Society (ISOC) nominated Searsia for its 2017 Innovation Award.

Read more on the Searsia blog

SIKS/CBS DataCamp Spark tutorial notebook

December 22nd, 2016, posted by Djoerd Hiemstra

by Djoerd Hiemstra and Robin Aly

SIKS/CBS DataCamp participants can download the answers for the Jupyter Scala/Spark notebook exercises below.