November 30th, 2015, posted by Djoerd Hiemstra
Niek Tax received an award today for his master's thesis Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank from the Dutch association of and for ICT professionals and managers (Nederlandse beroepsvereniging van en voor ICT-professionals en -managers, Ngi-NGN). More information at Ngi-NGN and UT Nieuws. Congratulations, Niek!

Posted in Photos, Machine Learning | Comments Off

November 18th, 2015, posted by Djoerd Hiemstra
**Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation**

by Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, and Chris Develder

Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor who provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users is the possible disagreement on relevance, assuming that a single gold-truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which allows predicting a particular result’s relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust and effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain (nDCG), which nowadays often use heuristic and data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems on different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
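To illustrate the role of gain values in nDCG, here is a minimal sketch comparing heuristic, data-independent gains with hypothetical PRM-style gains (the gain numbers below are purely illustrative and not taken from the paper):

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the standard log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """Normalize DCG by the DCG of the ideally reordered gain list."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Heuristic gains per relevance label (e.g., 0/1/3 for three levels)
heuristic_gain = {0: 0.0, 1: 1.0, 2: 3.0}
# Hypothetical PRM-style gains: the probability that a random user deems
# a result relevant, given the observed label and assessor disagreement
prm_gain = {0: 0.1, 1: 0.55, 2: 0.9}

ranking = [2, 0, 1, 1, 0]  # observed assessor labels, by rank position
print(ndcg([heuristic_gain[l] for l in ranking]))
print(ndcg([prm_gain[l] for l in ranking]))
```

The metric machinery is identical in both cases; only the mapping from relevance labels to gains changes, which is exactly where the PRM substitutes data-derived values for heuristic ones.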

*To be published in Information Retrieval Journal by Springer*

[download pdf]

Posted in Distributed Search | Comments Off

November 6th, 2015, posted by Djoerd Hiemstra
Check out the Jupyter IPython Notebook Exercises made for the module Web Science. The exercises closely follow those in Chapters 13 and 14 of the wonderful book Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. Download the notebooks here:

**Update** (February 2016). The notebooks with answers are now available below:
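Chapter 14 of the book covers link analysis. As a flavor of the kind of material the notebooks exercise, here is a minimal PageRank power-iteration sketch (illustrative only; not taken from the notebooks):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank on a dict of node -> list of out-links."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Teleportation mass distributed uniformly
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if not out:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                for v in out:
                    new[v] += damping * rank[u] / len(out)
        rank = new
    return rank

# Tiny example graph
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print({u: round(r, 3) for u, r in ranks.items()})
```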

Posted in Social Data, Data Science, Course Web Science | Comments Off

November 3rd, 2015, posted by Djoerd Hiemstra
Machine Learning Research at the University of Twente focuses on the application of Machine Learning in Social Signal Processing, Biometric Pattern Recognition, and Text Mining. Have a look at our new web site at: http://ml.ewi.utwente.nl.

Posted in Machine Learning | Comments Off

October 16th, 2015, posted by Djoerd Hiemstra
**Co-occurrence Rate Networks: Towards separate training for undirected graphical models**

by Zhemin Zhu

Dependence is a universal phenomenon which can be observed everywhere.
In machine learning, probabilistic graphical models (PGMs) represent dependence
relations with graphs. PGMs find wide applications in natural language
processing (NLP), speech processing, computer vision, biomedicine, information
retrieval, etc. Many traditional models, such as hidden Markov models
(HMMs) and Kalman filters, can be put under the umbrella of PGMs. The central
idea of PGMs is to decompose (factorize) a joint probability into a product of
local factors. Learning, inference and storage can be conducted efficiently over
the factorization representation.

Two major types of PGMs can be distinguished: (i) Bayesian networks
(directed graphs), and (ii) Markov networks (undirected graphs). Bayesian networks
represent *directed dependence* with *directed edges*. Local factors of Bayesian
networks are *conditional probabilities*. Directed dependence, directed edges
and conditional probabilities are all *asymmetric* notions. In contrast, Markov
networks represent *mutual dependence* with *undirected edges*. Both mutual
dependence and undirected edges are *symmetric* notions. For general Markov
networks, based on the Hammersley–Clifford theorem, local factors are *positive
functions* over maximal cliques. These local factors are explained using
intuitive notions like ‘compatibility’ or ‘affinity’. In particular, if a graph forms
a clique tree, the joint probability can be reparameterized into a junction-tree
factorization.

In this thesis, we propose a novel framework motivated by the Minimum
Shared Information Principle (MSIP):
*We try to find a factorization in which the information shared between factors is
minimum. In other words, we try to make factors as independent as possible.*

The benefit of doing this is that we can train factors separately without spending
a lot of effort to guarantee consistency between them. To achieve this goal,
we develop a theoretical framework called co-occurrence rate networks (CRNs)
to obtain such a factorization. Briefly, given a joint probability, the CRN
factorization is obtained as follows. We first strip off singleton probabilities from
the joint probability. The quantity left is called co-occurrence rate (CR). CR is
a symmetric quantity which measures mutual dependence among variables
involved. Then we further decompose the joint CR into smaller and independent
CRs. Finally, we obtain a CRN factorization whose factors consist of all
singleton probabilities and CR factors. There exist two kinds of independencies
between these factors (here, independent means that two factors do not share
information): (i) a singleton probability is independent of other singleton
probabilities; (ii) a CR factor is independent of other CR factors conditioned on
singleton probabilities. Based on a CRN factorization, we propose an efficient
two-step separate training method: (i) in the first step, we train a separate
model for each singleton probability; (ii) given singleton probabilities, we train
a separate model for each CR factor. Experimental results on three important
natural language processing tasks show that our separate training method is
two orders of magnitude faster than conditional random fields, while achieving
competitive quality (often better on the overall quality metric F1).
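The pairwise co-occurrence rate is CR(x, y) = P(x, y) / (P(x) P(y)), i.e. the joint probability with the singleton probabilities stripped off. For a three-variable chain, the joint then factorizes into singleton probabilities times pairwise CR factors. A minimal numerical sketch, assuming this pairwise definition and using hypothetical probabilities:

```python
from itertools import product

# Joint distribution of a chain X -> Y -> Z (hypothetical numbers)
pX = {0: 0.6, 1: 0.4}
pY_X = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
pZ_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

joint = {(x, y, z): pX[x] * pY_X[x][y] * pZ_Y[y][z]
         for x, y, z in product([0, 1], repeat=3)}

def marginal(axes):
    """Marginal probability table over the given variable positions."""
    m = {}
    for xyz, p in joint.items():
        key = tuple(xyz[i] for i in axes)
        m[key] = m.get(key, 0.0) + p
    return m

pXm, pYm, pZm = marginal([0]), marginal([1]), marginal([2])
pXY, pYZ = marginal([0, 1]), marginal([1, 2])

def cr2(pab, pa, pb, a, b):
    """Pairwise co-occurrence rate CR(a, b) = P(a, b) / (P(a) P(b))."""
    return pab[(a, b)] / (pa[(a,)] * pb[(b,)])

# Reconstruct the joint from singleton probabilities and CR factors:
# P(x, y, z) = P(x) P(y) P(z) * CR(x, y) * CR(y, z)
for (x, y, z), p in joint.items():
    recon = (pXm[(x,)] * pYm[(y,)] * pZm[(z,)]
             * cr2(pXY, pXm, pYm, x, y)
             * cr2(pYZ, pYm, pZm, y, z))
    assert abs(recon - p) < 1e-12
```

Because the CR factors share no information beyond the singleton probabilities, each could be fitted by a separate model, which is the intuition behind the two-step separate training described above.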

The second contribution of this thesis is applying PGMs to a real-world NLP
application: open relation extraction (ORE). In open relation extraction, two
entities in a sentence are given, and the goal is to automatically extract their
relation expression. ORE is a core technique, especially in the age of big data,
for transforming unstructured information into structured data. We propose
our model SimpleIE for this task. The basic idea is to decompose an extraction
pattern into a sequence of simplification operations (components). The benefit
by doing this is that these components can be re-combined in a new way to
generate new extraction patterns. Hence SimpleIE can represent and capture
diverse extraction patterns. This model is essentially a sequence labeling model.
Experimental results on three benchmark data sets show that SimpleIE boosts
recall and F1 by at least 15% compared with seven ORE systems.

As tangible outputs of this thesis, we contribute open source implementations
of our research results as well as an annotated data set.

[download pdf]

Posted in Graphical Models, PhD defense, Machine Learning | Comments Off

October 15th, 2015, posted by Djoerd Hiemstra
**Estimating Creditworthiness using Uncertain Online Data**

by Maurice Bolhuis

The rules for credit lenders have become stricter since the financial crisis of 2007-2008.
As a consequence, it has become more difficult for companies to obtain a loan. Many
people and companies leave a trail of information about themselves on the Internet.
Searching and extracting this information is accompanied by uncertainty. In this
research, we study whether this uncertain online information can be used as an alternative
or additional indicator for estimating a company’s creditworthiness, and how accounting for
information uncertainty impacts the prediction performance.

A data set consisting of 3,579 corporate ratings was constructed using the data
of an external data provider. Based on the results of a survey, a literature study and
information availability tests, LinkedIn accounts of company owners, corporate Twitter
accounts and corporate Facebook accounts were chosen as an information source for
extracting indicators. In total, the Twitter and Facebook accounts of 387 companies and
436 corresponding LinkedIn owner accounts of this data set were manually searched.
Information was harvested from these sources, and several indicators were derived
from the harvested information.

Two experiments were performed with this data. In the first experiment, Naive
Bayes, J48, Random Forest and Support Vector Machine classifiers were trained and tested
using solely these Internet features. A comparison of their accuracy to the 31% accuracy
of the ZeroR classifier, which always predicts the most frequent target class,
showed that none of the models performed statistically better. In a second experiment,
it was tested whether combining Internet features with financial data increases the
accuracy. A financial data mining model was created that approximates the rating model
of the ratings in our data set and that uses the same financial data as the rating model.
The two best-performing financial models were built using the Random Forest and J48
classifiers, with accuracies of 68% and 63%, respectively. Adding Internet features to
these models gave mixed results: a significant decrease and an insignificant increase,
respectively.
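The ZeroR baseline used for comparison simply predicts the training set's majority class for every instance; a minimal sketch with hypothetical rating labels (not the thesis data):

```python
from collections import Counter

def zeror_accuracy(train_labels, test_labels):
    """ZeroR baseline: always predict the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

# Hypothetical credit-rating classes; the 31% in the text came from the
# actual class distribution of the thesis data set, not from these numbers
train = ["B"] * 31 + ["A"] * 25 + ["C"] * 24 + ["D"] * 20
test = ["B"] * 6 + ["A"] * 5 + ["C"] * 5 + ["D"] * 4
print(zeror_accuracy(train, test))
```

A learned model only demonstrates predictive value if it beats this trivial baseline by a statistically significant margin, which is the test the first experiment applies.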

An experimental setup for testing how incorporating uncertainty affects the prediction
accuracy of our model is explained. As part of this setup, a search system is
described that finds candidate results of online information related to a subject and
classifies the degree of uncertainty of this information. It is illustrated how uncertainty
can be incorporated into the data mining process.

[download pdf]

Posted in Social Data, Data Science | Comments Off

October 6th, 2015, posted by Djoerd Hiemstra
We are proud to announce the 12th Seminar on Searching and Ranking, with guest presentations by Ingo Frommholz from the University of Bedfordshire, UK, and Tom Heskes from Radboud University Nijmegen, the Netherlands.

More information at: SSR 12.

Posted in SIKS, Graphical Models, Social Data, Data Science | Comments Off

September 15th, 2015, posted by Djoerd Hiemstra
On 23-27 November 2015, the Central Bureau for Statistics of the Netherlands (CBS) and the University of Twente (UT) will jointly organize the Data Camp. During the camp, CBS data analysts and UT researchers will answer research questions about statistics using big data technologies. On Monday, the participants will be presented with overview presentations about the research questions and technologies. The data camp participants will work in small, mixed teams in an informal setting. Experienced data scientists will support the teams with short mini-workshops and hands-on support. The hope is that the intense contact with the research questions in an informal and spontaneous environment will produce valuable and innovative answers to the posed questions.

Guest speakers are Erik Tjong Kim Sang (Meertens Institute, Amsterdam) and David González (Vizzuality, Madrid).

[download report]

Posted in Photos, Course Big Data, Social Data, Data Science | 1 Comment »

September 1st, 2015, posted by Djoerd Hiemstra
Welcome to the M.Sc. course Advanced Database Systems. In this course you will learn what it takes to be a Database Administrator, analysing and improving the performance of databases. You will also learn in detail how transactions are handled by database management systems. The course has a little bit of everything: ordinary lectures, a little practicum, and a small project about handling sensor data. We hope to see you Thursday September 3rd, at 10.45h. in CR-3D.

Mena Badieh Habib Morgan, Maurice van Keulen, and Djoerd Hiemstra.

More info at: http://blackboard.utwente.nl (access still restricted)

Posted in Course Advanced DB | Comments Off

September 1st, 2015, posted by Djoerd Hiemstra
Welcome to the course Information Retrieval. We will introduce some exciting new things in the course: This year’s practical assignments are motivated by use cases of MyDataFactory, a company specialized in product data. The course uses the book “Introduction to Information Retrieval” by Christopher Manning, Prabhakar Raghavan and Hinrich Schütze. Have a look at the schedule on Blackboard under “Course Information” for an overview of the first quarter of the course. In the second quarter, students will research a specific topic in depth. We hope to see you at the first lecture on Wednesday 2 September at 13.45h. in RA4334.

Theo Huibers, Dolf Trieschnigg and Djoerd Hiemstra.

More info at: http://blackboard.utwente.nl (access restricted)

Posted in Course IR | Comments Off