Archive for the 'Photos' Category

SIKS/CBS Data Camp & Advanced Course on Managing Big Data

Monday, September 12th, 2016, posted by Djoerd Hiemstra

On December 06 and 07 2016 The Netherlands School for Information and Knowledge Systems (SIKS) and Statistics Netherlands (CBS) organize a two day tutorial on the management of Big Data, the DataCamp, hosted at the University of Twente.
The Data Camp’s objective is to use big data sets to produce valuable and innovative answers to research questions with societal relevance. SIKS PhD students and CBS data analysts will learn about big data technologies and create, in small groups, feasibility studies for a research question of their choice.
Participants get access to predefined CBS research questions and massive datasets, including a large collection of Dutch Tweets, traffic data from Dutch high ways, and AIS data from ships. Participants will get access to the Twente Hadoop cluster, a 56 node cluster with almost 1 petabyte of storage space. The tutorial focuses on hands-on experience. The Data Camp participants will work in small, mixed teams in an informal setting, which stimulates intense contact with technologies and research questions. Experienced data scientists will support the teams by short lectures and hands-on support. Short lectures will introduce technologies to manage and visualize big data, that were first adopted by Google and are now used by many companies that manage large datasets. The tutorial teaches how to process terabytes of data on large clusters of commodity machines using new programming styles like MapReduce and Spark. The tutorial will be given in English and is part of the educational program for SIKS PhD students.

Also see the SIKS announcement.

Niek Tax wins the ENIAC thesis award

Tuesday, December 15th, 2015, posted by Djoerd Hiemstra

Another thesis prize for Niek Tax: Best master thesis in computer science in 2014/2015 at the University of Twente, awarded by Alumni Association ENIAC. Photo: Niek Tax receives the award from Johan Noltes on behalf of the ENIAC jury. Congrats, Niek! Other nominees were Justyna Chromik (DACS), Vincent Bloemen (FMT), Maarten Brilman (HMI), Tim Paauw (IEBIS), and Moritz Müller (SCS).

Niek Tax

Niek Tax runner-up in Ngi-NGN thesis awards

Monday, November 30th, 2015, posted by Djoerd Hiemstra

Niek Tax was awarded today for his master thesis Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank by the Dutch association for ICT professionals and managers (Nederlandse beroepsvereniging van en voor ICT-professionals en -managers, Ngi-NGN). More information at Ngi-NGN and UT Nieuws. Congratulations, Niek!

Niek Tax

CBS / UT Data Camp 2015

Tuesday, September 15th, 2015, posted by Djoerd Hiemstra

On 23-27 November 2015, the Data Camp, a joint event organized by the Central Bureau for Statistics of the Netherlands (CBS) and the University of Twente (UT). During the camp, a set of CBS data analysts and UT researchers will answer research questions about statistics using big data technologies. On Monday, the participants will be presented with overview presentations about the research questions and technologies. The data camp participants will work in small, mixed teams in an informal setting. Experienced data scientists will support the teams by short mini-workshops and hands-on support. The hope is that the intense contact with the research question in an informal and spontaneous environment will produce valuable and innovative answers to the posed questions.

Guest speakers are Erik Tjong Kim Sang (Meertens Institute, Amsterdam) and David González (Vizzuality, Madrid).

[download report]

CTIT Hadoop cluster open

Thursday, January 8th, 2015, posted by Djoerd Hiemstra

CTIT Hadoop cluster with Djoerd, Frederik, Maurice, and Robin

The new CTIT Hadoop cluster, 512 cores, 2 TB ram, and 0.5 PB storage, is now open for researchers and students in Twente. We started by giving accounts to the 60 students that follow the course Managing Big Data. From left to right in the (not so) cold aisle: Djoerd, Frederik, Maurice, and Robin.

Overview of the TREC 2014 Federated Web Search Track

Wednesday, November 26th, 2014, posted by Djoerd Hiemstra

by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Ke Zhou, and Djoerd Hiemstra

The TREC Federated Web Search track facilitates research in topics related to federated web search, by providing a large realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 challenges of Resource Selection and Results Merging challenges are again included in FedWeb 2014, and we additionally introduced the task of vertical selection. Other new aspects are the required link between the Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants’ results for the tasks are introduced, analyzed, and compared.

[download pdf]

Presented at the 23rd Text Retrieval Conference (TREC) in Gaithersburg, USA

Niek Tax graduates on scaling learning to rank to big data

Tuesday, November 25th, 2014, posted by Djoerd Hiemstra

Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank

by Niek Tax

Niek Tax

Learning to rank is an increasingly important task within the scientific fields of machine learning and information retrieval, that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated in terms of ranking accuracy on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by non-existence of a standard set of evaluation benchmark collections. Furthermore, little research is done in the field of scalability of the training procedure of Learning to Rank methods, to prepare us for input data sets that are getting larger and larger. This thesis concerns both the comparison of Learning to Rank methods using a sparse set of evaluation results on benchmark data sets, as well as the speed-up that can be achieved by parallelising Learning to Rank methods using MapReduce.

In the first part of this thesis we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: 1) Normalized Winning Number, which gives insight in the ranking accuracy of the learning to rank method, and 2) Ideal Winning Number, which gives insight in the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF were found to be the best performing learning to rank methods in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number. Of these ranking algorithms, FenchelRank and FSMRank are pairwise ranking algorithms and the others are listwise ranking algorithms.

In the second part of this thesis we analyse the speed-up of the ListNet training algorithm when implemented in the MapReduce computing model. We found that running ListNet on MapReduce comes with a job scheduling overhead in the range of 150-200 seconds per training iteration. This makes MapReduce very inefficient to process small data sets with ListNet, compared to a single-machine implementation of the algorithm. The MapReduce implementation of ListNet was found to be able to offer improvements in processing time for data sets that are larger than the physical memory of the single machine otherwise available for computation. In addition we showed that ListNet tends to converge faster when a normalisation preprocessing procedure is applied to the input data. The training time of our cluster version of ListNet was found to grow linearly in terms of data size increase. This shows that the cluster implementation of ListNet can be used to scale the ListNet training procedure to arbitrarily large data sets, given that enough data nodes are available for computation.

[download pdf]

Jump into the future with the DesignLab

Thursday, August 28th, 2014, posted by Djoerd Hiemstra

DesignLab

DesignLab is a university-wide facility for teaching and research, where the unique profile of the University of Twente flourishes. High Tech, Human Touch: Technology for Society. DesignLab is a place for innovation, creativity and inspiration, where talented students and researchers come to seek and drive application of their work, and where industry, government, NGO’s and SME’s explore options to tackle the challenges they face. Researchers bring in promising new knowledge and technologies, while industrial and societal partners bring in real-world challenges. Together, with joint energy and creativity, new solutions and new possibilities will be explored.

More information at: http://www.utwente.nl/designlab/.

Tesfay Aregay graduates on Ranking Factors for Web Search

Thursday, July 31st, 2014, posted by Djoerd Hiemstra

Ranking Factors for Web Search : Case Study In The Netherlands

by Tesfay Aregay

It is essential for search engines to constantly adjust ranking function to satisfy their users, at the same time SEO companies and SEO specialists are observed trying to keep track of the factors prioritized by these ranking functions. In this thesis, the problem of identifying highly influential ranking factors for better ranking on search engines is examined in detail, looking at two different approaches currently in use and their limitations. The first approach is to calculate correlation coefficient (e.g. Spearman rank) between a factor and the rank of it’s corresponding webpages (ranked document in general) on a particular search engine. The second approach is to train a ranking model using machine learning techniques, on datasets and select the features that contributed most for a better performing ranker. We present results that show whether or not combining the two approaches of feature selection can lead to a significantly better set of factors that improve the rank of webpages on search engines. We also provide results that show calculating correlation coefficients between values of ranking factors and a webpage’s rank gives stronger result if a dataset that contains a combination of top few and least few ranked pages is used. In addition list of ranking factors that have higher contribution to well-ranking webpages, for the Dutch web dataset (our case study) and LETOR dataset are provided.

[download pdf]

Tesfay Aregay
Photo by @Indenty.

The Importance of Prior Probabilities for Entry Page Search

Thursday, July 10th, 2014, posted by Djoerd Hiemstra

by Wessel Kraaij, Thijs Westerveld, and Djoerd Hiemstra

An important class of searches on the world-wide-web has the goal to find an entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links and URL form. Especially the URL form proved to be a good predictor. Using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability

[download pdf]

SIGIR 2014 Test of Time Honourable Mention

The paper was published at SIGIR 2002 and received an Honourable Mention for the ACM SIGIR Test of Time award at the 37th Annual ACM SIGIR conference on Research & development in information retrieval in Gold Coast Australia on 9 July 2014.