What is information?

Met computers kun je informatie opslaan en versturen, maar wat is informatie eigenlijk? Hoeveel informatie staat er in een boek van 100 pagina's? En welke boekenserie bevat meer informatie: “De wereld van Darren Shan” of “De griezelbus van Paul van Loon”? Hoe meet je dat?

Computers are used to store and send information, but what is information anyway? How much information does a book of 100 pages contains? What book series contain more information: “The Saga of Darren Shan” or “The Horror Bus of Paul van Loon”? How to measure this?

This lecture for the Museum Jeugduniversiteit for children aged 8 to 12 is based on the wonderful Computer Science Unplugged activities by Tim Bell, Ian Witten and Mike Fellows. In the lecture I explain the theories of Claude Shannon, talk about statistical language models, and we play the Twenty Guesses quiz.

Celebrating Stephen Robertson’s Retirement

by Djoerd Hiemstra, John Tait, Andrew MacFarlane, and Nick Belkin

Stephen Robertson at SIGIR 2013 Stephen Robertson was named fellow of the Association for Computing Machinery (ACM) last week. Robertson retired from the Microsoft Research Lab in Cambridge this year after a long career as one of the most influential, well liked and eminent researchers in Information Retrieval throughout the world. His successful career was celibrated in the latest BCS IRSG Informer. Stephen Robertson continues to be active in Information Retrieval in his retirement at University College London.

[download pdf]

Kien Tjin-Kam-Jet defends PhD thesis on Distributed Deep Web Search

Distributed Deep Web Search

by Kien Tjin-Kam-Jet

The World Wide Web contains billions of documents (and counting); hence, it is likely that some document will contain the answer or content you are searching for. While major search engines like Bing and Google often manage to return relevant results to your query, there are plenty of situations in which they are less capable of doing so. Specifically, there is a noticeable shortcoming in situations that involve the retrieval of data from the deep web. Deep web data is difficult to crawl and index for today's web search engines, and this is largely due to the fact that the data must be accessed via complex web forms. However, deep web data can be highly relevant to the information-need of the end-user. This thesis overviews the problems, solutions, and paradigms for deep web search. Moreover, it proposes a new paradigm to overcome the apparent limitations in the current state of deep web search, and makes the following scientific contributions:

  1. A more specific classification scheme for deep web search systems, to better illustrate the differences and variation between these systems.
  2. Virtual surfacing, a new, and in our opinion better, deep web search paradigm which tries to combine the benefits of the two already existing paradigms, surfacing and virtual integration, and which also raises new research opportunities.
  3. A stack decoding approach which combines rules and statistical usage information for interpreting the end-user's free-text query, and to subsequently derive filled-out web forms based on that interpretation.
  4. A practical comparison of the developed approach against a well-established text-processing toolkit.
  5. Empirical evidence that, for a single site, end-users would rather use the proposed free-text search interface instead of a complex web form.

Analysis of data obtained from user studies shows that the stack decoding approach works as well as, or better than, today’s top-performing alternatives.

[download pdf]

Marije de Heus graduates on Recommender Systems for High School Courses

Design and Evaluation of a Recommender System for High School Courses in The Netherlands

by Marije de Heus

This thesis presents a newly developed recommender system for recommending high school courses in The Netherlands. The recommender system recommends a complete set of courses to a student, based on the choices of similar students that have already completed high school. A large historical database containing information of more than 20% of all new Dutch high school students was used for this recommender. The methodologies used are a structured literature review, interviews for requirements, design of the system and offline (with a historical dataset containing grades from tens of thousands students) and online (with on-site experiments at 4 high schools) experiments. The main findings of this report are the following:

  • There is a definite need for an objective recommendation of high school courses by students and school counselors;
  • The recommendations are not accurate;
  • The recommendations received good reviews in the online experiment;
  • The recommendations did not outperform the random recommendation in the online experiment;
  • A serendipitous result: the offline tests have shown that recommenders can predict future exam grades with high accuracy.

Our recommendation to Topicus, based on these findings, is not to implement the recommender system. Instead, a broader search could be started, to find other possible solutions for the need for objective recommendations. One technique that could be explored further, is the prediction of grades for single courses. We expect that school counselors will find such a tool helpful in advicing students which courses to take.

[download pdf]

Adele Lu Jia defends her PhD thesis on incentives in p2p networks

Adele Lu Jia successfully defended her PhD thesis at Delft University of Technology,

Online Networks as Societies: User Behaviors and Contribution Incentives

by Adele Lu Jia

Online networks like Facebook and BitTorrent have become popular and powerful infrastructures for users to communicate, to interact, and to share social lives with each other. These networks often rely on the cooperation and the contribution of their users. Nevertheless, users in online networks are often found to be selfish, lazy, or even ma- licious, rather than cooperative, and therefore need to be incentivized for contributions. To date, great effort has been put into designing effective contribution incentive policies, which range from barter schemes to monetary schemes. In this thesis, we conduct an analysis of user behaviors and contribution incentives in online networks. We approach online networks as both computer systems and societies, hoping that this approach will, on the one hand, motivate computer scientists to think about the similarities between their artificial computer systems and the natural world, and on the other hand, help people outside the field understand online networks more smoothly.

To summarize, in this thesis we provide theoretical and practical insights into the correlation between user behaviors and contribution incentives in online networks. We demonstrate user behaviors and their consequences at both the system and the individual level, we analyze barter schemes and their limitations in incentivizing users to contribute, we evaluate monetary schemes and their risks in causing the collapse of the entire system, and we examine user interactions and their implications in inferring user relationships. Above all, unlike the offline human society that has evolved for thousands of years, online networks only emerged two decades ago and are still in a primitive state. Yet with their ever-improving technologies we have already obtained many exciting results. This points the way to a promising future for the study of online networks, not only in analyzing online behaviors, but also in cross reference with offline societies.

[more info]

Sabbatical at Q-Able

Starting today, I am on sabbatical at Q-Able, an exciting new internet startup and spinoff of the University of Twente. Q-Able will bring new search capabilities to internet web shops, hotel and travel booking sites, online banking, etc. by replacing multi-field web forms by free text querying. Instead of meticulously filling in one field at a time of a web form, users of your web site get a simple, single search field. Q-Able's solutions provide a better user experience for the visitors of web sites, and it gives the company running the web site the opportunity to find out what their customers really want (you'd be surprised of the things people will enter in single search fields).

More information shortly at: q-able.com.

2013’s DB colloquium

Below you find a list of last year's DB colloquia, usually Tuesday's from 13:45h. – 14:30h. in ZI-3126.

15 January 2013
Zhemin Zhu
Separate Training for Conditional Random Fields Using Co-occurrence Rate Factorization
The standard training method of Conditional Random Fields (CRFs) is very slow for large-scale applications. In this paper, we present separate training for undirected models based on the novel Co-occurrence Rate Factorization (CR-F). Separate training is a local training method. In contrast to piecewise training, separate training is exact. In contrast to MEMMs, separate training is unaffected by the label bias problem. Experiments show that separate training (i) is unaffected by the label bias problem; (ii) reduces the training time from weeks to seconds; and (iii) obtains competitive results to the standard and piecewise training on linear-chain CRFs.

18 January 2013 (Friday from 13.45h. to 15.30h. in Dutch)
Edo Drenthen (Avanade)
Big Data & Hadoop
BigData staat voor data die zo groot en complex is dat ze moeilijk zijn op te slaan en te bewerken met traditionele database management tools. Big Datasets verschillen van traditionele datasets over 3 assen: 1) Volume: We zijn meer dan ook in staat om data te genereren en op te slaan. Dit heeft tot gevolg dat datasets in orde en factor groter worden; 2) Variety: Data analyze is nodig over een combinatie van gestructureerd (bijv RDBMS) en ongestructureerd (bijv logs, tweets); 3) Velocity: Data wordt steeds sneller gegenereerd, bijvoorbeeld door mobile devices, sensor netwerken, logs, streaming systems. Het onderscheid tussen relevante en en niet relevante data is hierdoor voor bedrijven steeds moeilijker te maken. Organisaties staan voor de uitdaging om de juiste informatie uit Big Data te halen en deze om te zetten naar waarde. Afhankelijk van o.a doelstellingen en de maturity van een organisatie zal een strategie bedacht moeten worden om dit te bewerkstelligen. Avanade helpt haar klanten bij het beantwoorden van deze vragen. Deze presentatie zal gaan over BigData integratie en Hadoop op het Microsoft platform. Verschillende aspecten van BigData uiteengezet zullen worden naar aanleiding van een business-case en toegelicht worden door middel van een demo. Voor BI, Datawarehousing en data mining enthousiastelingen is dit een presentatie die je zeker niet mag missen!

Jasper Stoop
Process Mining and Fraud Detection:
A case study on the theoretical and practical value of using process mining for the detection of fraudulent behavior in the procurement process
The presentation is about how process mining can be utilized in fraud detection and what the benefits of using process mining for fraud detection are. The concepts of fraud and fraud detection are discussed. These results are combined with an analysis of existing case studies on the application of process mining and fraud detection to construct an initial setup of two case studies, in which process mining is applied to detect possible fraudulent behavior in the procurement process. Based on the experiences and results of these case studies, the 1+5+1 methodology is presented as a first step towards operationalizing principles with advice on how process mining techniques can be used in practice when trying to detect fraud.

23 January 2013 (Wednesday)
Ander de Keijzer (Windesheim University of Applied Sciences)
Securing Data
With computers everywhere and pretty much always online and/or accessible, securing data is of the utmost importance. In this presentation we highlight several points of interest when thinking of security and suggest some possible (and practical) solutions.

5 February 2013 (15:45h. – 17.30h. in Waaier 1&2)
Ed Brinksma
Try out Euclid
Prof. Brinksma has prepared an amazing lecture for next year's bachelor programme. He will take you to exciting corners of mathematics where you have never been before. But, above all, he will challenge you with some remarkable brain teasers. Solutions may be submitted live through your smartphone or tablet. Among the ten best solutions we will allot nine Spotify Gift Cards. The first prize will be a surprise.

5 March 2013
Robin Aly
A Unified Framework to Evaluate Plurality Oriented Retrieval
A standard way to evaluate search engines is the Cranfield paradigm. The paradigm assumes well-defined information needs, golden standard relevance assessments, and an evaluation measure on graded relevance. The literature believes that the Cranfield paradigm is a too poor model of reality and that better paradigms should include phenomena such as the following: first, users issue ambiguous queries, second, information needs require diverse documents, and finally, relevance assessments could be erroneous. Until now, each approach in the literature addresses only a subset of these phenomena. We propose, however, that these phenomena will almost always co-occur and an search evaluation approach should account for this. In this talk I present a unified framework of approaches from the literature that has exactly one component for each phenomena. Using the framework, we investigate how the comparison of existing search engines changes under a simulated mixture of these phenomena.

18 March 2013 (Monday at 15.45 h. in the SmartXP lab)
Peter Norvig (Google)
Norvig Web Data Science Award Ceremony
In a live video connection from California, USA, Peter Norvig, Director of Research at Google, will award the winners of the Web Data Science Award that was named in his honor: Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente. The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science.
Read more…

26 March 2013
Vincent van Donselaar
TF×IDF ranking: a physical approach
This paper tries to find similarities and differences between TFxIDF ranking and the theory of network analysis. A simplified network model based on the principle of an electrical circuit acts as a guide to gain understanding of the model’s operation. The correctness of this model is tested by implementing it as a function of the Terrier Information Retrieval System, whereupon it is evaluated against Terrier’s predefined TF×IDF model.

2 April 2013
COMMIT day
This day is a follow up to the successful COMMIT Kick-Off meeting of last year. COMMIT/ted to you! is meant only for people who are officially participating in the COMMIT program.
Read more

9 April 2013
Paul Stapersma
Efficient Query Evaluation on Probabilistic XML Data
In this thesis, we instruct MayBMS to cope with probabilistic XML (P-XML) in order to evaluate XPath queries on P-XML data as SQL queries on uncertain relational data. This approach entails two aspects: (1) a data mapping from P-XML to U-Rel that ensures that the same information is represented by database instances of both data structures, and (2) a query mapping from XPath to SQL that ensures that the same question is specified in both query languages.

16 April 2013
Maurice van Keulen
Parels der Informatica (Pearls of Computer Science)
Maurice will talk about the plans for the new Computer Science bachelor curriculum, where student learn about the computer science's most impressive achievements: the pearls of Computer Science.

22 April 2013 (15:00 h. Atrium Ravelijn)
with Maurice van Keulen and Djoerd Hiemstra
TOM inspiration market
On Monday, 22nd April, from 15.00 – 17.00 (Atrium, Ravelijn), there will be an Inspiration market about Twente's Educational Model. You can walk in and out as you wish, there is no fixed schedule. Maurice and Djoerd will both be present at the market, talking about using digital white boards, online lectures, and challenges.
Read more (access restricted)

1 May 2013 (Wednesday)
Rezwan Huq
Inference-based Framework managing Data Provenance
Data provenance allows scientists to validate their model as well as to investigate the origin of an unexpected value. Furthermore, it can be used as a replication recipe of output data products. However, capturing provenance requires enormous effort by scientists in terms of time and training. First, they need to design the workflow of the scientific model, i.e., workflow provenance, which requires both time and training. Second, they need to capture provenance while the model is running, i.e., fine-grained data provenance. Explicit documentation of fine-grained provenance is not feasible because of the massive storage consumption by provenance data in the applications where data is continuously arriving and is processed. We propose an inference-based framework which provides both workflow and fine-grained data provenance at a minimal cost in terms of time, training and disk consumption. Our framework is applicable to any given scientific model and is capable of handling different system dynamics such as variation in the processing time as well as input data products arrival pattern. We evaluate the framework on two different use cases. Our evaluation shows that the proposed framework can infer accurate provenance information at reduced costs and therefore, is relevant and suitable for scientists in different domains.

27 May 2013 (Monday)
Mena Habib
University of Twente at #MSM2013
Twitter messages are a potentially rich source of continuously and instantly updated information. Shortness and informality of such messages are challenges for Natural Language Processing tasks. In this paper we present a hybrid approach for Named Entity Extraction (NEE) and Classification (NEC) for tweets. The system uses the power of the Conditional Random Fields (CRF) and the Support Vector Machines (SVM) in a hybrid way to achieve better results. For named entity type classification we used the AIDA disambiguation system to disambiguate the extracted named entities and hence find their type.

4 June 2013 (10.00h. in Waaier 4)
CTIT Symposium on Big Data
The CTIT Symposium Big Data and the emergence of Data Science addresses the multidisciplinary opportunities and challenges of Big Data from the perspectives of industrial research, academic research, and education.
Read more…

10 June 2013 (Monday)
Dan Ionita and Niek Tax
An API-based Search System for One Click Access to Information
This paper proposes a prototype One Click access system, based on previous work in the field and the related 1CLICK-2@NTCIR10 task. The proposed solution integrates methods from previous such attempts into a three tier algorithm: query categorization, information extraction and output generation and offers suggestions on how each of these can be implemented. Finally, a thorough user-based evaluation concludes that such an information retrieval system outperforms the textual preview collected from Google search results, based on a paired sign test. Based on validation results possible suggestions on future improvements are proposed.

18 June 2013
Design projects
Online leeromgeving voor derde wereld landen
We will present the result of our ‘Ontwerpproject’. The last semester, we have built a system that will help students in Third World Countries with their Master Thesis. These students will graduate at their own university, but can use our system to get in touch with supervisors from around the world to help them with their research. This way, the limitations on resources and expertises in those areas are dealt with to make sure the student can reach out to their potential. During this presentation, we will demonstrate the system and discuss some techniques we used to accomplish our goals.
Science Challenges
For the 'Ontwerpproject' we set out to build a system that will support the Challenge initiative. This initiative strives to organize and promote science challenges for students of the University of Twente. Challenges are short, engaging and fun projects – often combined with a competition – in which students can participate in teams or alone. The presentation will briefly describe the initiative, our workflow and the system that we built.

26 June 2013 (Wednesday)
Ilya Markov (Università della Svizzera italiana)
Reducing the Uncertainty in Resource Selection
The distributed retrieval process is plagued by uncertainty. Sampling, selection, merging and ranking are all based on very limited information compared to centralized retrieval. This talk will be focused on reducing the uncertainty within the resource selection phase by obtaining a number of estimates, rather than relying upon only one point estimate. Three methods for reducing uncertainty will be proposed, which will be compared against state-of-the-art baselines across three distributed retrieval testbeds. The experimental results show that the proposed methods significantly improve baselines, reduce the uncertainty and improve robustness of resource selection.

2 July 2013
Christian Wartena (Hochschule Hannover)
Distributional Similarity of Words with Different Frequencies
In this talk I will present three case studies in which we deal the frequency bias of distributional similarity. In the first study we compare context vectors of words with word distributions of documents in order to find words that can represent the content of a document. Here we explicitly require that the context vectors of the words are similar to the word distribution of the document and dissimilar to the general word distribution of the collection. In two recent studies we used distributional similarity to determine the semantic equivalence of words and short phrases. Here we eliminate the frequency bias by modeling the dependency of vector similarity on the word frequency.
We are not yet able to give a final solution to the frequency problem of distributional semantics, but the three case studies show that distributional similarity can be substantially improved when we take the frequency bias serious. Thus the topic is definitely worth investigating further

9 July 2013
Robin Aly
Taily: Shard Selection Using the Tail of Score Distributions
I propose Taily, a novel shard selection algorithm that models a query’s score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function’s features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time. Read more

See also: 2012's DB colloquium.

Maarten Fokkinga retires

Today, Maarten Fokkinga retires after a scientific career of more than 40 years. Maarten is well-kown for his work on functional programming and category theory. Some of his well-known and well-cited works include: Functional programming with bananas, lenses, envelopes and barbed wire with Eric Meijer and Ross Paterson, Law and Order in Algorithmics, his Ph.D thesis, and Monadic Maps and Folds for Arbitrary Datatypes (yes, those are maps and reduces!)

To celebrate Maarten's long successful career, Jan Kuper and I wrote recipes for curried bananas and pasta, appropriately formalized in Haskell, so Maarten can both cook and enjoy programming after his retirement. Download the recipes from Github.