Archive for the 'Course MapReduce' Category

Welcome to the MapReduce course

Friday, October 15th, 2010, posted by Djoerd Hiemstra

Welcome to Distributed Data Processing using MapReduce

This course builds on some very exciting developments in cloud computing and data centers, initiated by Google and followed by many others such as Yahoo, Amazon, AOL, Baidu, Joost, Mylife, and Facebook. The course is about processing terabytes of data on large clusters. But not only that: few courses in the Computer Science master’s programme are as “core computer science” as this one. We will discuss new file systems (GFS and Hadoop FS), new programming paradigms (MapReduce), new programming languages and query languages (Sawzall, Pig Latin), new database paradigms (BigTable, Cassandra and Dynamo), and of course many web search and data mining applications that made Google one of today’s leading IT companies.

We hope to see you at our lectures on Fridays, 3rd and 4th hour.
Robin Aly, Maarten Fokkinga, and Djoerd Hiemstra.

Expertise centre for cloud computing

Thursday, June 3rd, 2010, posted by Djoerd Hiemstra

Enschede will open an expertise centre for cloud computing on Thursday 17 June. The Centre 4 Cloud Computing will support open innovation and the sharing of knowledge on cloud computing. Cloud computing is an Internet-based computing paradigm, whereby shared resources, software and information are provided on-demand in a highly scalable way.

Cloud computing logical diagram

The expertise centre offers companies and organisations the following:

  1. Knowledge Exchange: To make (applied) knowledge and best practices available to professionals, management and other interested parties
  2. Research: Scientific applied research into technical, security, legal, and business aspects of cloud computing
  3. Commercial: Contribute to business development for companies that offer services based on cloud computing solutions
For more information, see

MIREX: MapReduce IR Experiments

Wednesday, April 28th, 2010, posted by Djoerd Hiemstra

MIREX (MapReduce Information Retrieval Experiments) provides solutions to easily and quickly run large-scale information retrieval experiments on a cluster of machines using Hadoop. Version 0.1 has tools for the TREC ClueWeb09 collection. The code is available to other researchers at:

Anchor text for ClueWeb09 Category A

Tuesday, April 27th, 2010, posted by Djoerd Hiemstra

We’ve put anchor text for the English Category A documents of the TREC ClueWeb09 collection online using BitTorrent:

The file contains anchor text for about 88% of the pages in Category A. To keep the file manageable, the anchor text for a page is cut off once more than 10 MB of anchors have been collected for that page. The size is about 24.5 GB (gzipped). The file is a tab-separated text file consisting of (TREC-ID, URL, ANCHOR TEXT).

The anchor text extraction is described in (please cite the report if you use the data in your research): The source code is available from:
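To read the tab-separated file described above, a minimal Python sketch could look like the following. It is separate from the extraction source code, and it rests on assumptions not stated in the post: one record per line, UTF-8 text, and any extra tabs belonging to the anchor text field. The names `AnchorRecord`, `parse_anchor_line`, and `read_anchor_records` are hypothetical and chosen for illustration only.

```python
import gzip
from typing import Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One row of the file: (TREC-ID, URL, ANCHOR TEXT)."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_line(line: str) -> AnchorRecord:
    """Split one tab-separated line into its three fields.

    Assumption: only the first two tabs separate fields, so any
    further tabs are kept as part of the anchor text.
    Raises ValueError if the line has fewer than three fields.
    """
    trec_id, url, anchor_text = line.rstrip("\n").split("\t", 2)
    return AnchorRecord(trec_id, url, anchor_text)


def read_anchor_records(path: str) -> Iterator[AnchorRecord]:
    """Stream records from the gzipped file without unpacking it to disk."""
    # Assumption: the text is UTF-8; undecodable bytes are replaced
    # rather than aborting a pass over a 24.5 GB file.
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():  # skip empty lines
                yield parse_anchor_line(line)
```

Because the reader is a generator, a loop such as `for record in read_anchor_records(path): ...` processes one page's anchors at a time, which matters for a file of this size.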

Google’s MapReduce patent - no threat to stuffed elephants

Tuesday, February 23rd, 2010, posted by Djoerd Hiemstra

You now officially have a 50% chance of getting a job at Google. ;-)

Google hired about half the students who took Bisciglia’s first class.

Read the Register’s article on MapReduce.

MapReduce book by Lin and Dyer

Monday, February 22nd, 2010, posted by Djoerd Hiemstra

Data-Intensive Text Processing with MapReduce

An interesting book by Jimmy Lin and Chris Dyer is forthcoming, in which they show how MapReduce can be used to solve large-scale text processing problems, including examples that use Expectation Maximization training.

This book is about MapReduce algorithm design, particularly for text processing applications. Although our presentation most closely follows implementations in the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don’t for example, discuss APIs, driver programs for composing jobs, command-line invocations for running jobs, etc.

See pre-prints of the book.

Final grades for MapReduce course

Friday, February 19th, 2010, posted by Djoerd Hiemstra

Final grades for the course are out. You can find them in your personal Grade Center on Blackboard.

Final MapReduce assignment on Blackboard

Monday, January 11th, 2010, posted by Djoerd Hiemstra

The final assignment, Assignment 7, is now on Blackboard. The deadline for this assignment is January 29. This is a hard deadline.

More info on Blackboard.

Progress on MapReduce assignments

Thursday, January 7th, 2010, posted by Djoerd Hiemstra

Happy New Year! Here’s an update on the course Distributed Data Processing using MapReduce:

  1. Running and debugging the results of Assignment 4 on the cluster is taking more time than expected. Every student's submission has been tried at least once on the cluster, or is currently running. If your assignment needs improvements, you can send in a new version.
  2. Assignment 5 (Sawzall) is graded, see the Grade Center on Blackboard.
  3. The deadline for Assignment 6 is Friday, January 8. Please let me know beforehand if you need more time.

Guest lecture by Giovane Moura

Tuesday, December 8th, 2009, posted by Djoerd Hiemstra

Next Friday, 11 December, in the first part of the lecture, Giovane Moura will give a guest lecture about analyzing network management data. Giovane is a Ph.D. student at the Design and Analysis of Communication Systems Group (DACS). His research topics include scalability of network analysis and intrusion detection, scalable storage for network flows, and self-management approaches for network management.

In the second part of the lecture we will discuss the SIGMOD 2008 paper by Christopher Olston et al.: “Pig Latin: A Not-So-Foreign Language for Data Processing”. The goal of Assignment 6 is to use Pig Latin for analyzing the network management data provided by Giovane and his colleagues of the DACS group.

Assignment 5 (Sawzall) and Assignment 6 (Analyzing Network Management Data) for Distributed Data Processing using MapReduce are now online in the Blackboard Assignments section.