Archive for the 'Course Big Data' Category

Maarten Fokkinga retires

Friday, August 30th, 2013, posted by Djoerd Hiemstra

Today, Maarten Fokkinga retires after a scientific career of more than 40 years. Maarten is well-known for his work on functional programming and category theory. Some of his most cited works include: Functional programming with bananas, lenses, envelopes and barbed wire with Erik Meijer and Ross Paterson, Law and Order in Algorithmics, his Ph.D. thesis, and Monadic Maps and Folds for Arbitrary Datatypes (yes, those are maps and reduces!).

To celebrate Maarten’s long and successful career, Jan Kuper and I wrote recipes for curried bananas and pasta, appropriately formalized in Haskell, so Maarten can both cook and enjoy programming after his retirement. Download the recipes from GitHub.

Keynote by Ravi Kumar

Thursday, May 23rd, 2013, posted by Djoerd Hiemstra

We are very proud that Ravi Kumar from Google has agreed to give a keynote speech at the CTIT Symposium on Big Data and the Emergence of Data Science. Kumar, who is well-known for his work on web and data mining and algorithms for large data sets, has been a senior staff research scientist at Google since June 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! Research. He obtained his Ph.D. in Computer Science from Cornell University in 1998.
Ravi Kumar’s talk will cover two non-conventional computational models for analyzing big data. The first is data streams: in this model, data arrives in a stream and the algorithm is tasked with computing a function of the data without explicitly storing it. The second is map-reduce: in this model, data is distributed across many machines and computation is done as a sequence of map and reduce operations. Kumar will present a few algorithms in these models and discuss their scalability.
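To make the two models a bit more concrete, here is a minimal single-machine sketch in Python. It is not taken from the talk: the running mean and the word count are just common textbook illustrations, and all function names are made up for this example. A real map-reduce framework would run the map and reduce calls on many machines and do the grouping ("shuffle") step itself.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def streaming_mean(stream: Iterable[float]) -> float:
    """Data-stream model: one pass, constant memory, items are never stored."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update of the running mean
    return mean  # 0.0 for an empty stream (a choice made for this sketch)


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_phase(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def map_reduce_word_count(documents: Iterable[str]) -> dict[str, int]:
    """Run map, group by key, then reduce, all in one process."""
    # Shuffle step: group the mapped values by key, as a framework would
    # do between the map phase and the reduce phase.
    grouped: dict[str, list[int]] = defaultdict(list)
    for document in documents:
        for word, one in map_phase(document):
            grouped[word].append(one)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

The streaming function only ever keeps two numbers, however long the stream is. The word count is written so that `map_phase` and `reduce_phase` see nothing but their own input, which is what allows a framework to distribute them.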

The workshop takes place on Tuesday 4 June at the University of Twente. Other invited speakers at the CTIT symposium are Maarten de Rijke (U. Amsterdam) and Milan Petkovic (Philips).

Mattijs Ugen graduates on scalable performance for digital forensics

Wednesday, April 24th, 2013, posted by Djoerd Hiemstra

Scalable performance for a forensic database application

by Mattijs Ugen

As digital forensic investigations deal with more and more data, the Netherlands Forensic Institute, NFI, foresees scalability issues with the current solution in the near future. Following the global trend towards distributed solutions for ‘Big data’ problems, the NFI wants to find a suitable architecture to replace the currently used XIRAF system. Using experimental implementations on top of a selection of distributed data stores, we present query performance timings in three different scaling dimensions: cluster size, working set size and the number of parallel clients. We show that scaling with the number of parallel clients follows a linear trend, but that scaling proves hard to measure in the other two dimensions. A distributed search engine architecture proves the best candidate for the NFI, warranting closer investigation in that area for a real-world deployment.

[download pdf]

Traitor: Associating Concepts using the WWW

Wednesday, April 17th, 2013, posted by Djoerd Hiemstra

by Wanno Drijfhout, Oliver Jundt, and Lesley Wevers

Traitor uses Common Crawl’s 25TB data set of web pages to construct a database of associated concepts using Hadoop. The database can be queried through a web application with two query interfaces. A textual interface allows searching for similarities and differences between multiple concepts using a query language similar to set notation, and a graphical interface allows users to visualize similarity relationships between concepts in a force-directed graph.
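As a rough illustration of what querying associated concepts with set notation can look like, here is a small Python sketch. It is not Traitor's actual method or query language: counting terms that share a page is an assumption made for this example, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_associations(pages):
    """For each term, count how often every other term shares a page with it."""
    associations = defaultdict(Counter)
    for page in pages:
        terms = set(page.lower().split())
        for a, b in permutations(terms, 2):
            associations[a][b] += 1
    return associations


def similarities(associations, a, b):
    """Terms associated with both concepts (set intersection)."""
    return set(associations.get(a, {})) & set(associations.get(b, {}))


def differences(associations, a, b):
    """Terms associated with the first concept but not the second (set difference)."""
    return set(associations.get(a, {})) - set(associations.get(b, {}))
```

With three toy pages, "apple fruit red", "banana fruit yellow" and "apple pie", the similarity of apple and banana is the single term fruit, and the difference of apple and banana is red and pie.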

To be presented at the 13th Dutch-Belgian Information Retrieval Workshop DIR 2013 on 26 April in Delft, The Netherlands

[download pdf]

Try Traitor at

Readability of the Web

Monday, April 15th, 2013, posted by Djoerd Hiemstra

A study on 1 billion web pages.

by Marije de Heus

Automated Readability Index for the Web

We have performed a readability study on more than 1 billion web pages. The Automated Readability Index was used to determine the average grade level required to easily comprehend a website. Some of the results are that a 16-year-old can easily understand 50% of the web and an 18-year-old can easily understand 77% of the web. This information can be used in a search engine to filter websites that are likely to be incomprehensible for younger users.
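For readers who have not seen it, the Automated Readability Index is computed from just three counts: characters, words and sentences. The Python sketch below uses the standard formula; the tokenization (whitespace-separated words, letters and digits as characters, runs of ".", "!" or "?" as sentence ends) is a simplifying assumption for this example and not necessarily what the study used.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43"""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    # Only letters and digits count as characters; punctuation is ignored.
    characters = sum(ch.isalnum() for ch in text)
    # Treat each run of sentence-ending punctuation as one sentence end,
    # and assume at least one sentence if no such punctuation is found.
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    return 4.71 * characters / len(words) + 0.5 * len(words) / sentences - 21.43
```

The result is an approximate U.S. grade level, so longer words and longer sentences both push the score up. Very simple text can even score below zero.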

To be presented at the 13th Dutch-Belgian Information Retrieval Workshop DIR 2013 on 26 April in Delft, The Netherlands

[download pdf]

18 March: Norvig Award Ceremony

Wednesday, February 27th, 2013, posted by Djoerd Hiemstra

Update (19 March): See the photos of the event.

On 18 March, from 15.45 h. until 17.30 h., the Norvig Web Data Science Award Ceremony takes place in the SmartXP lab in building Zilverling of the University of Twente. During the ceremony, Peter Norvig, Director of Research at Google, will award the prize (funds to attend the 2013 edition of SIGIR in Dublin, Ireland, a tablet, and a lightning talk at Hadoop Summit in Amsterdam) to the winners via a live video connection from California, USA. Participation in the event is free of charge. Please register by sending your name and affiliation to: Students and researchers will get the opportunity to ask Peter Norvig questions during the event. If you have a good question, please send it to the email address above too; maybe your question will be selected to be asked at the event.

Announcement at Inter-Actief

More information at U. Twente Activities

The Winners of The Norvig Web Data Science Award

Tuesday, February 26th, 2013, posted by Lisa Green

by Lisa Green (Common Crawl)

We are very excited to announce the winners of the Norvig Web Data Science Award: Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente! The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science and named in honor of distinguished computer scientist Peter Norvig.

There were many excellent submissions that demonstrated how you can extract valuable insight and knowledge from web crawl data. Be sure to check out the work of the winning team, Traitor – Associating Concepts Using The World Wide Web, and the other finalists on the award website. You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus. All code is open source and we are looking forward to seeing it reused and adapted for other projects.

A huge thank you to our distinguished panel of judges: Peter Norvig, Ricardo Baeza-Yates, Hilary Mason, Jimmy Lin, and Evert Lammerts!

Added on 18 March: Award winners Oliver Jundt, Wanno Drijfhout, and Lesley Wevers with their prize: a high-end Android tablet!

Participate in the Dutch Common Crawl Challenge

Wednesday, November 14th, 2012, posted by Djoerd Hiemstra

What can you do with 6 billion webpages?

Together with Common Crawl and SARA, we invite students and researchers studying at or employed by research institutes or universities in the Netherlands to dive into the Common Crawl web corpus using the SARA Hadoop service. The best submission will receive the Norvig Web Data Science Award, a tablet, and 1500 Euro to spend on travel, accommodation, and conference registration fee for SIGIR 2013 to be held in Dublin, Ireland.

The award is named after Peter Norvig, Google’s director of research with a resume too impressive to summarize. Peter is on the advisory board of Common Crawl, and is chair of the jury for this award. Other jury members are Ricardo Baeza-Yates (Yahoo!), Hilary Mason, Jimmy Lin (University of Maryland), and Evert Lammerts (SARA).

Find out more at the Norvig Award page at GitHub, the Common Crawl Blog, or come to the Inter-Actief Challenges Information Lunch on 22 November at 12.30 h. in Absint.

Welcome to the Big Data course

Sunday, November 11th, 2012, posted by Djoerd Hiemstra

Welcome to the new course Managing Big Data. We will closely follow developments to manage huge amounts of data on large clusters of commodity machines, initiated by Google, and followed by many other web companies such as Yahoo, Amazon, AOL, Facebook, Hyves, Spotify, Twitter, etc. Big data gives rise to a redesign of many core computer science concepts: We will discuss file systems (Google FS), programming paradigms (MapReduce), programming languages and query languages (for instance Sawzall and Pig Latin), and ‘NoSQL’ database paradigms (for instance BigTable and Dynamo) for managing big data. The first lecture is next Friday, 16 November at 10.45 h. in RA 2502.

More information on Blackboard. (Access restricted; sorry, our university does not like me to share courses :-( )