Archive for the 'Data Science' Category

Christel Geurts graduates on Cross-Domain Authorship Attribution

Friday, January 12th, 2018, posted by Djoerd Hiemstra

Cross-Domain Authorship Attribution as a Tool for Digital Investigations

by Christel Geurts

On the dark web, sites promoting illegal content are abundant and new sites are constantly created. At the same time, law enforcement is working hard to take these sites down and track down the persons involved. Often, after a site is taken down, users change their name and move to a different site. But what if law enforcement could track users across sites? Each site or source of information is called a domain. As the domain changes, the context of a message often changes as well, making it challenging to track users simply by the words they use. The aim of this thesis is to develop a system that can link written text of authors in a cross-domain setting. The system was tested on a blog corpus and verified on police data. Tests show that multinomial logistic regression and Support Vector Machines with a linear kernel perform well. Character 3-grams work well as features, and combining multiple feature sets increases performance. Logistic Regression models with a combined feature set performed best: on the blog corpus with 1000 authors, accuracy was 0.717 and mean reciprocal rank (MRR) was 0.7785. On the police data, the Logistic Regression model had an accuracy of 0.612 and an MRR of 0.6883 for 521 authors.
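
To make the features and metrics in this abstract concrete, here is a minimal sketch in plain Python. It is not the thesis code: the function names and the toy rankings are assumptions for illustration only. It shows how character 3-grams can be counted, and how accuracy and MRR follow from a ranked list of candidate authors per text.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def accuracy(rankings, true_authors):
    """Fraction of texts whose top-ranked candidate is the true author."""
    hits = sum(1 for ranked, truth in zip(rankings, true_authors)
               if ranked[0] == truth)
    return hits / len(true_authors)


def mean_reciprocal_rank(rankings, true_authors):
    """Average of 1/rank of the true author (0 if the author is not ranked)."""
    total = 0.0
    for ranked, truth in zip(rankings, true_authors):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_authors)


# Toy example (hypothetical data): three texts, all written by author "a".
# Each inner list is a classifier's candidate ranking, best guess first.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
true_authors = ["a", "a", "a"]

trigrams = char_ngrams("banana")
acc = accuracy(rankings, true_authors)
mrr = mean_reciprocal_rank(rankings, true_authors)
```

In this toy example the true author is ranked first, second and third respectively, so accuracy is 1/3 while MRR is (1 + 1/2 + 1/3) / 3. This is why MRR is higher than accuracy in the reported results: it also rewards near misses.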

Wim van der Zijden graduates on Multi-Tenant Customizable Databases

Thursday, February 16th, 2017, posted by Djoerd Hiemstra

by Wim van der Zijden

A good practice in business is to focus on key activities. For some companies this may be branding, while other businesses may focus on areas such as consultancy, production or distribution. Focusing on key activities means outsourcing as many other activities as possible. These other activities merely distract from the main goals of the company, and the company will not be able to excel in them.
Many companies are in need of reliable software to persistently process live data transactions and enable reporting on this data. To fulfil this need, they often have large IT departments in-house. Those departments are costly and distract from the company’s main goals. The emergence of cloud computing should make this no longer necessary. All they need is an internet connection and a service contract with an external provider.
However, most businesses need highly customizable software, because each company has slightly different business processes, even companies in the same industry. So even if they outsource their IT needs, they will still have to pay expensive developers and business analysts to customize an existing application.
These issues are addressed by Multi-Tenant Customizable (MTC) applications. We define such an application as follows:

A single software solution that can be used by multiple organizations at the same time and which is highly customizable for each organization and user within that organization, by domain experts without a technical background.

A key challenge in designing such a system is to develop proper persistent data storage, because mainstream databases are optimized for single-tenant usage. To this end, this Master’s thesis consists of two papers. The first paper proposes an MTC-DB Benchmark, MTCB. This benchmark allows for objective comparison and evaluation of MTC-DB implementations, and provides a framework for the definition of MTC-DB. The second paper describes a number of MTC-DB implementations and uses the benchmark to evaluate those implementations.

[download pdf]

SIKS/CBS DataCamp Spark tutorial notebook

Thursday, December 22nd, 2016, posted by Djoerd Hiemstra

by Djoerd Hiemstra and Robin Aly

SIKS/CBS DataCamp participants can download the answers for the Jupyter Scala/Spark notebook exercises below.

Inoculating Relevance Feedback Against Poison Pills

Friday, November 4th, 2016, posted by Djoerd Hiemstra

by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx

Relevance Feedback (RF) is a common approach for enriching queries, given a set of explicitly or implicitly judged documents, to improve retrieval performance. Although it has been shown that, on average, overall retrieval performance improves after relevance feedback, for some topics, employing some relevant documents may decrease the average precision of the initial run. This is mostly because the feedback document is only partially relevant and contains off-topic terms; adding these terms to the query as expansion terms hurts retrieval performance. Relevant documents that hurt retrieval performance after feedback are called “poison pills”. In this paper, we discuss the effect of poison pills on relevance feedback and present significant words language models (SWLM) as an approach for estimating the feedback model to tackle this problem.

To be presented at the 15th Dutch-Belgian Information Retrieval Workshop, DIR 2016 on 25 November in Delft.

[download pdf]

Data Science Platform Netherlands

Friday, October 7th, 2016, posted by Djoerd Hiemstra

The Data Science Platform Netherlands (DSPN) is the national platform for ICT research within the Data Science domain. Data Science is the collection and analysis of so-called ‘Big Data’ according to academic methodology. DSPN unites all Dutch academic research institutions where Data Science is carried out from an ICT perspective, specifically the computer science or applied mathematics perspectives. The objectives of DSPN are to:

  • Highlight the importance of ICT research in Big Data and Data Science, especially in national discussions about research and education.
  • Exchange and disseminate information about Data Science research and education.
  • Build and maintain a network of ICT researchers active in the field of Data Science.

DSPN is launched as part of the ICT Research Platform Netherlands (IPN) to give a voice to the Data Science initiatives of the Dutch ICT research organisations. For more information, see the website at:

#WhoAmI in 160 Characters?

Wednesday, October 5th, 2016, posted by Djoerd Hiemstra

Classifying Social Identities Based on Twitter

by Anna Priante, Djoerd Hiemstra, Tijs van den Broek, Aaqib Saeed, Michel Ehrenhard, and Ariana Need

We combine social theory and NLP methods to classify English-speaking Twitter users’ online social identity in profile descriptions. We conduct two text classification experiments. In Experiment 1 we use a 5-category online social identity classification based on identity and self-categorization theories. While we are able to automatically classify two identity categories (Relational and Occupational), automatic classification of the other three identities (Political, Ethnic/religious and Stigmatized) is challenging. In Experiment 2 we test a merger of these three identities based on theoretical arguments. We find that by combining these identities we can improve the predictive performance of the classifiers in the experiment. Our study shows how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.

To be presented at the EMNLP Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) on November 5 in Austin, Texas, USA.

[download pdf]

Download the code book and classifier source code from GitHub.

Data Science guest lectures

Monday, September 26th, 2016, posted by Djoerd Hiemstra

On 12 October we organize another Data Science Day in the Design Lab with guest lectures by Thijs Westerveld (Chief Science Officer at WizeNoze, Amsterdam), and Iadh Ounis (Professor of Information Retrieval in the School of Computing Science at the University of Glasgow). For more information and registration, see:

Luhn Revisited: Significant Words Language Models

Friday, September 2nd, 2016, posted by Djoerd Hiemstra

by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx

Users tend to articulate their complex information needs in only a few keywords, making underspecified statements of request the main bottleneck for retrieval effectiveness. Taking advantage of feedback information is one of the best ways to enrich the query representation, but it can also lead to loss of query focus and harm performance by overfitting to accidental features of the particular observed feedback documents, in particular when the initial query retrieves little relevant information. Inspired by the early work of Hans Peter Luhn, we propose significant words language models of feedback documents that capture all, and only, the significant shared terms from feedback documents. We adjust the weights of common terms that are already well explained by the document collection as well as the weights of rare terms that are only explained by specific feedback documents, which eventually results in having only the significant terms left in the feedback model.

Establishing a set of 'Significant Words'

Our main contributions are the following. First, we present significant words language models as effective models that capture the essential terms and their probabilities. Second, we apply the resulting models to the relevance feedback task, and observe better performance than the state-of-the-art methods. Third, we see that the estimation method is remarkably robust, making the models insensitive to noisy non-relevant terms in feedback documents. Our general observation is that the significant words language models capture relevance more accurately by excluding general terms and feedback-document-specific terms.
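
The intuition of keeping only terms that are neither general nor specific to a single feedback document can be illustrated with a small sketch. This is not the estimation method of the paper: it is a simplified threshold filter, and the function name, the two thresholds and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def significant_terms(feedback_docs, collection_freq,
                      max_collection_prob=0.1, min_doc_fraction=0.5):
    """Return a normalized term distribution over 'significant' terms.

    A term is dropped as *general* if its collection probability exceeds
    max_collection_prob, and as *specific* if it occurs in fewer than
    min_doc_fraction of the feedback documents. Both thresholds are
    hypothetical knobs, not values from the paper.
    """
    collection_total = sum(collection_freq.values())
    doc_sets = [set(doc) for doc in feedback_docs]
    counts = Counter(term for doc in feedback_docs for term in doc)

    kept = {}
    for term, count in counts.items():
        general = collection_freq.get(term, 0) / collection_total > max_collection_prob
        doc_fraction = sum(term in s for s in doc_sets) / len(doc_sets)
        specific = doc_fraction < min_doc_fraction
        if not general and not specific:
            kept[term] = count

    if not kept:
        return {}
    norm = sum(kept.values())
    return {term: count / norm for term, count in kept.items()}


# Toy data (hypothetical): three tokenized feedback documents and
# term counts for the whole collection.
feedback_docs = [
    ["the", "retrieval", "feedback", "query"],
    ["the", "retrieval", "feedback", "pizza"],
    ["the", "retrieval", "query", "feedback"],
]
collection_freq = {"the": 500, "retrieval": 2, "feedback": 2,
                   "query": 3, "pizza": 1, "other": 492}

model = significant_terms(feedback_docs, collection_freq)
```

Here "the" is removed as a general term and "pizza" as a term specific to one document, leaving a distribution over "retrieval", "feedback" and "query". An off-topic term such as "pizza" is the kind of expansion term that would turn a partially relevant document into a poison pill.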

To be presented at the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016) on October 24-28, 2016 in Indianapolis, United States.

[download pdf]

Solving the Continuous Cold Start Problem in E-commerce Recommendations

Wednesday, August 3rd, 2016, posted by Djoerd Hiemstra

Beyond Movie Recommendations: Solving the Continuous Cold Start Problem in E-commerce Recommendations

by Julia Kiseleva, Alexander Tuzhilin, Jaap Kamps, Melanie Mueller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, Djoerd Hiemstra

Many e-commerce websites use recommender systems or personalized rankers to personalize search results based on users' previous interactions. However, a large fraction of users has no prior interactions, making it impossible to use collaborative filtering or rely on user history for personalization. Even the most active users may visit only a few times a year and may have volatile needs or different personas, making their personal history a sparse and noisy signal at best. This paper investigates how, when we cannot rely on the user history, the large-scale availability of other user interactions still allows us to build meaningful profiles from the contextual data, and whether such contextual profiles are useful to customize the ranking, exemplified by data from a major online travel agent.
Our main findings are threefold. First, we characterize the Continuous Cold Start Problem (CoCoS) from the viewpoint of typical e-commerce applications. Second, as explicit situational context is not available in typical real-world applications, implicit cues from transaction logs used at scale can capture essential features of situational context. Third, contextual user profiles can be created offline, resulting in a set of smaller models compared to a single huge non-contextual model, making contextual ranking available with negligible CPU and memory footprint. Finally, we conclude that, in an online A/B test on live users, our contextual ranker increased user engagement substantially over a non-contextual baseline, with click-through rate (CTR) increased by 20%. This clearly demonstrates the value of contextual user profiles in a real-world application.
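
As a rough illustration of building contextual profiles offline from a transaction log, consider the sketch below. It is not the ranker from the paper: the context keys (device, country), the item names and the popularity-based scoring are hypothetical choices that only show the idea of one small model per context with a non-contextual fallback.

```python
from collections import Counter, defaultdict


def build_profiles(log):
    """Aggregate click counts per context and overall.

    `log` is an iterable of (context, item) events, where a context is any
    hashable description of the situation, e.g. (device, country).
    """
    per_context = defaultdict(Counter)
    overall = Counter()
    for context, item in log:
        per_context[context][item] += 1
        overall[item] += 1
    return per_context, overall


def rank(items, context, per_context, overall):
    """Rank items by popularity in this context.

    Falls back to the overall (non-contextual) counts for unseen contexts.
    Ties are broken alphabetically to keep the output deterministic.
    """
    profile = per_context.get(context, overall)
    return sorted(items, key=lambda item: (-profile[item], item))


# Toy transaction log (hypothetical data).
log = [
    (("mobile", "NL"), "hostel"),
    (("mobile", "NL"), "hostel"),
    (("mobile", "NL"), "hotel"),
    (("desktop", "US"), "resort"),
    (("desktop", "US"), "resort"),
    (("desktop", "US"), "resort"),
    (("desktop", "US"), "hotel"),
]
per_context, overall = build_profiles(log)
items = ["hotel", "hostel", "resort"]

mobile_ranking = rank(items, ("mobile", "NL"), per_context, overall)
unseen_ranking = rank(items, ("tablet", "DE"), per_context, overall)
```

A new visitor without any personal history still gets a context-dependent ranking, because the profile is keyed on the situation rather than on the user. Each per-context profile is small, which is in the spirit of the third finding above.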

[download pdf]

3TU NIRICT theme Data Science

Tuesday, January 12th, 2016, posted by Djoerd Hiemstra

The main objective of the NIRICT research in Data Science is to study the science and technology to unlock the intelligence that is hidden inside Big Data.
The amounts of data that information systems are working with are rapidly increasing. The explosion of data is happening at an unprecedented pace, and in today's networked world the trend is even accelerating. Companies have transactional data with trillions of bytes of information about their customers, suppliers and operations. Sensors in smart devices generate unparalleled amounts of sensor data. Social media sites and mobile phones have allowed billions of individuals globally to create their own enormous trails of data.
The driving force behind this data explosion is the networked world we live in, where information systems, organizations that employ them, people that use them, and processes that they support are connected and integrated, together with the data contained in those systems.

What happens in an internet minute in 2016?

Unlocking the Hidden Intelligence

Data alone is just a commodity; it is Data Science that converts big data into knowledge and insights. Intelligence is hidden in all sorts of data and data systems.
Data in information systems is usually created and generated for specific purposes: it is mostly designed to support operational processes within organizations. However, as a by-product, such event data provide an enormous source of hidden intelligence about what is happening, but organizations can only capitalize on that intelligence if they are able to extract it and transform the intelligence into novel services.
Analyzing the data provides opportunities for organizations to gather intelligence, to capitalize on the historic and current performance of their processes, and to exploit future chances for performance improvement.
Another rich source of information and insights is data from the Social Web. Analyzing Social Web data provides governments, society and companies with a better understanding of their community and with knowledge about human behavior and preferences.
Each 3TU institute has its own Data Science program, where local data science expertise is bundled and connected to real-world challenges.

Delft Data Science (DDS) – TU Delft
Scientific director: Prof. Geert-Jan Houben

Data Science Center Eindhoven (DSC/e) – TU/e
Scientific director: Prof. Wil van der Aalst

Data Science Center UTwente (DSC UT) – UT
Scientific director: Dr. Djoerd Hiemstra

More information at: