Archive for the 'Course Big Data' Category

SIKS/CBS DataCamp Spark tutorial notebook

Thursday, December 22nd, 2016, posted by Djoerd Hiemstra

by Djoerd Hiemstra and Robin Aly

SIKS/CBS DataCamp participants can download the answers for the Jupyter Scala/Spark notebook exercises below.

SIKS/CBS Data Camp & Advanced Course on Managing Big Data

Monday, September 12th, 2016, posted by Djoerd Hiemstra

On December 6 and 7, 2016, the Netherlands School for Information and Knowledge Systems (SIKS) and Statistics Netherlands (CBS) organize a two-day tutorial on the management of Big Data, the DataCamp, hosted at the University of Twente.
The Data Camp’s objective is to use big data sets to produce valuable and innovative answers to research questions with societal relevance. SIKS PhD students and CBS data analysts will learn about big data technologies and create, in small groups, feasibility studies for a research question of their choice.
Participants get access to predefined CBS research questions and massive datasets, including a large collection of Dutch tweets, traffic data from Dutch highways, and AIS data from ships. Participants will get access to the Twente Hadoop cluster, a 56-node cluster with almost 1 petabyte of storage space. The tutorial focuses on hands-on experience. The Data Camp participants will work in small, mixed teams in an informal setting, which stimulates intense contact with the technologies and research questions. Experienced data scientists will support the teams with short lectures and hands-on help. Short lectures will introduce technologies to manage and visualize big data that were first adopted by Google and are now used by many companies that manage large datasets. The tutorial teaches how to process terabytes of data on large clusters of commodity machines using new programming styles like MapReduce and Spark. The tutorial will be given in English and is part of the educational program for SIKS PhD students.
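The MapReduce programming style mentioned above can be illustrated without a cluster. The sketch below (plain Python, not actual Hadoop or Spark code) mimics the three phases of a MapReduce job, map, shuffle, and reduce, using word count, the canonical introductory example:

```python
from itertools import groupby

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    return [(word, 1) for doc in documents for word in doc.split()]

def shuffle_phase(pairs):
    """Shuffle: sort the emitted pairs and group them by key (the word)."""
    pairs = sorted(pairs)
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "big data camp"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'camp': 1, 'clusters': 1, 'data': 2}
```

On a real cluster the map and reduce functions run in parallel on many machines, and the framework performs the shuffle over the network; the logic per phase, however, stays this simple.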

Also see the SIKS announcement.

Jeroen Vonk graduates on Bisimulation reduction with MapReduce

Monday, August 29th, 2016, posted by Djoerd Hiemstra

by Jeroen Vonk

Within the field of Computer Science, much past and current research focuses on model checking. Model checking allows researchers to simulate a process or system and exhaustively test for wanted or unwanted properties. Logically, the results of these tests are only as dependable as the model's representation of the actual system. The best model, then, would represent the system down to its last atom, allowing for every possible interaction with it. Such a model of course becomes extremely large, a situation known as state space explosion. Current research therefore focuses on:

  • Storing larger models
  • Processing large models faster and smarter
  • Reducing the size of models, whilst keeping the same properties
In this thesis we focus on reducing the size of models using bisimulation reduction. Bisimulation reduction allows us to identify similar states that can be merged whilst preserving certain properties of the model. These similar, or redundant, states are identified by comparing them with other states in the model using a bisimulation relation. The bisimulation relation identifies states that show the same behavior and can therefore be merged. This process is called bisimulation reduction. A common method to determine the smallest model is partition refinement. In order to use the algorithm on large models it needs to be scalable. We therefore use a framework for distributed processing that is part of Hadoop, called MapReduce. Using this framework provides us with a robust system that automatically recovers from, for example, hardware faults. The use of MapReduce also makes our algorithm scalable and easy to execute on third-party clusters.
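As a rough illustration of partition refinement, the sketch below (a sequential toy version, not the thesis's distributed MapReduce implementation) repeatedly splits blocks of states by their signatures until the partition is stable; each refinement round corresponds to one MapReduce iteration:

```python
def bisimulation_reduce(states, transitions):
    """Signature-based partition refinement for strong bisimulation.
    transitions: set of (source, label, target) triples."""
    # Start with a single block containing all states.
    block = {s: 0 for s in states}
    while True:
        # Signature of a state: the (label, target block) pairs it can reach.
        sigs = {s: frozenset((a, block[t])
                             for (src, a, t) in transitions if src == s)
                for s in states}
        # Give every distinct (old block, signature) pair a fresh block number.
        keys, new_block = {}, {}
        for s in states:
            key = (block[s], sigs[s])
            if key not in keys:
                keys[key] = len(keys)
            new_block[s] = keys[key]
        # Stable partition: no block was split, so the reduction is done.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

# Two states doing 'a' then stopping are bisimilar; the 'b' state is not.
states = ['s0', 's1', 's2', 's3']
transitions = {('s0', 'a', 's3'), ('s1', 'a', 's3'), ('s2', 'b', 's3')}
blocks = bisimulation_reduce(states, transitions)
print(blocks['s0'] == blocks['s1'])  # True
print(blocks['s0'] == blocks['s2'])  # False
```

States that end up in the same block can be merged into one state of the reduced model while preserving strong bisimilarity.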
During our experiments we saw that the execution time of a MapReduce job is relatively long. We estimated a startup cost of circa 30 seconds for each job. This means that the cost of reducing transition systems that need many iterations can be very high. Extreme cases such as vasy_40_60, which takes over 20,000 iterations, therefore could not be benchmarked within an acceptable time frame. In each iteration all of our data passes over the disk, so it is not unreasonable to see a factor 10 to 100 slowdown compared to an MPI-based implementation (e.g. LTSmin). From our experiments we concluded that the separate iteration times of our algorithm scale linearly up to 10^8 transitions for strong bisimulation and 10^7 for branching bisimulation. On larger models the iteration time increases exponentially; therefore we were not able to benchmark our largest model.
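The startup-cost estimate above can be made concrete with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the job-startup overhead mentioned above.
startup_per_job = 30   # seconds, estimated MapReduce job startup cost
iterations = 20_000    # iterations needed for the vasy_40_60 benchmark

overhead_seconds = startup_per_job * iterations
overhead_days = overhead_seconds / (60 * 60 * 24)
print(f"{overhead_seconds} s = {overhead_days:.1f} days")  # 600000 s = 6.9 days
```

So for vasy_40_60 the job startup alone costs about a week of wall-clock time, before any actual computation, which explains why such cases fall outside an acceptable benchmarking time frame.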

[download pdf]

CBS / UT Data Camp 2015

Tuesday, September 15th, 2015, posted by Djoerd Hiemstra

On 23-27 November 2015, the Data Camp takes place, a joint event organized by the Central Bureau for Statistics of the Netherlands (CBS) and the University of Twente (UT). During the camp, a group of CBS data analysts and UT researchers will answer research questions about statistics using big data technologies. On Monday, the participants will be given overview presentations of the research questions and technologies. The data camp participants will work in small, mixed teams in an informal setting. Experienced data scientists will support the teams with short mini-workshops and hands-on support. The hope is that the intense contact with the research questions in an informal and spontaneous environment will produce valuable and innovative answers to the posed questions.

Guest speakers are Erik Tjong Kim Sang (Meertens Institute, Amsterdam) and David González (Vizzuality, Madrid).

[download report]

How to build Google in an Afternoon

Friday, May 29th, 2015, posted by Djoerd Hiemstra

How many machines do we need to search and manage an index of billions of documents? In this lecture, I will discuss basic techniques for indexing very large document collections. I will discuss inverted files, index compression, and top-k query optimization techniques, showing that a single desktop PC suffices for searching billions of documents. An important part of the lecture will be spent on estimating index sizes and processing times. At the end of the afternoon, students will have a better understanding of the scale of the web and its consequences for building large-scale web search engines, and will be able to implement a cheap but powerful new ‘Google’.
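A toy version of the inverted file idea fits in a few lines. The sketch below (an illustrative example, not the lecture's actual material) maps each term to a sorted postings list of document ids and answers conjunctive queries by intersecting those lists:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted file: term -> sorted postings list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Conjunctive (AND) query: intersect the postings of all query terms."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["how to build google",
        "google indexes billions of documents",
        "build an inverted index"]
index = build_index(docs)
print(search(index, "build google"))  # [0]
```

A production index adds the techniques named above, compressed postings (so billions of documents fit on one machine's disks) and top-k pruning (so queries need not scan whole postings lists), but the core data structure is this one.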

To be presented at the SIKS Course Advances in Information Retrieval on 18 and 19 June in Vught, The Netherlands.

CTIT Hadoop cluster open

Thursday, January 8th, 2015, posted by Djoerd Hiemstra

CTIT Hadoop cluster with Djoerd, Frederik, Maurice, and Robin

The new CTIT Hadoop cluster, with 512 cores, 2 TB of RAM, and 0.5 PB of storage, is now open for researchers and students in Twente. We started by giving accounts to the 60 students that follow the course Managing Big Data. From left to right in the (not so) cold aisle: Djoerd, Frederik, Maurice, and Robin.

Niek Tax graduates on scaling learning to rank to big data

Tuesday, November 25th, 2014, posted by Djoerd Hiemstra

Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank

by Niek Tax

Niek Tax

Learning to rank is an increasingly important task within the scientific fields of machine learning and information retrieval that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated in terms of ranking accuracy on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the absence of a standard set of evaluation benchmark collections. Furthermore, little research has been done on the scalability of the training procedure of learning to rank methods, to prepare us for ever-growing input data sets. This thesis concerns both the comparison of learning to rank methods using a sparse set of evaluation results on benchmark data sets, and the speed-up that can be achieved by parallelising learning to rank methods using MapReduce.

In the first part of this thesis we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: 1) Normalized Winning Number, which gives insight into the ranking accuracy of the learning to rank method, and 2) Ideal Winning Number, which gives insight into the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF were found to be the best performing learning to rank methods in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number. Of these ranking algorithms, FenchelRank and FSMRank are pairwise ranking algorithms and the others are listwise ranking algorithms.

In the second part of this thesis we analyse the speed-up of the ListNet training algorithm when implemented in the MapReduce computing model. We found that running ListNet on MapReduce comes with a job scheduling overhead in the range of 150-200 seconds per training iteration. This makes MapReduce very inefficient for processing small data sets with ListNet compared to a single-machine implementation of the algorithm. The MapReduce implementation of ListNet was found to offer improvements in processing time for data sets that are larger than the physical memory of the single machine otherwise available for computation. In addition we showed that ListNet tends to converge faster when a normalisation preprocessing procedure is applied to the input data. The training time of our cluster version of ListNet was found to grow linearly with data size. This shows that the cluster implementation of ListNet can be used to scale the ListNet training procedure to arbitrarily large data sets, given that enough data nodes are available for computation.

[download pdf]

Eight questions about Big Data

Friday, July 11th, 2014, posted by Djoerd Hiemstra

8 questions about Big Data (in Dutch)

Everyone is talking about Big Data these days. But what is it, exactly? Why is it such a big deal? And how can you handle Big Data responsibly? Eight questions about Big Data, together with Peter-Paul Verbeek, Elmer Lastdrager, Oscar Olthoff and Floris Kreiken.

Norvig Web Data Science Award 2014

Monday, May 19th, 2014, posted by Djoerd Hiemstra

The Norvig Web Data Science Award is organized by Common Crawl and SURFsara for researchers and students in the Benelux. SURFsara provides free access to their Hadoop cluster with a copy of the full Common Crawl web crawl from March 2014, almost 3 billion web pages. Participants are completely free in choosing their research question. For example, last year there were submissions looking at concept association, connections between languages, readability, and more. Be creative and think outside of the box!

The award is named after Peter Norvig, Director of Research at Google, who chairs the jury that will select the winning submission. The contest will run until July 31, 2014. The winning team will be announced at the award ceremony in September 2014 and will get a tablet, smart watch and Github small plan for a year.

Sign up on:

Simple models and Big Data beat smart models

Thursday, January 30th, 2014, posted by Djoerd Hiemstra

Simple models and Big Data trump smart models

Big Data (or, in better-alliterating Dutch, “Grote Gegevens”) is a term that has been used since the beginning of this century to denote data collections that were hard to process with the software of their day, collections many terabytes or petabytes in size. Techniques to process and analyse such enormous data collections were developed in particular by Google. Google’s starting point: put a great many cheap machines together in large data centers, and use smart tools so that application developers and data analysts can use the entire data center for their data analyses. The data center is the new computer! Google’s smart tools touch many core elements of Computer Science: file systems (Google File System), new programming paradigms (MapReduce), new programming languages (for example Sawzall), and new approaches to data management (BigTable), all developed to make large data collections easily accessible. These techniques are now also available in open source variants. The best known, Hadoop, was developed to a large extent at Google’s competitor Yahoo. At the University of Twente, these techniques have been taught since 2009 in the master programme Computer Science. Now that we are able to train on large-scale data collections, the following phenomenon presents itself: simple models trained on big data trump complex models based on less data…

[Read more]

Published in STAtOR 14(3-4), journal of the Vereniging voor Statistiek en Operationele Research (the Dutch Society for Statistics and Operations Research)