Course MapReduce – Page 3 – Djoerd Hiemstra

Solution Assignment 2

To do Assignment 3, you need a correct solution for Assignment 2. Therefore, a possible solution to Assignment 2 has been added to Assignment 3. You can of course also use your own solution.

More info on Blackboard.

MapReduce Assignment 1 corrected

For the course Distributed Data Processing using MapReduce, grades and feedback for Assignment 1 can be found under “3. Feedback From Instructor” when you click on “Assignments” on Blackboard.

There are 7 more assignments, so there are lots of opportunities to improve. If after 8 assignments the average grade is 5 or lower, there will be additional assignments to pass the course. If you have not got the system working yet, please note that you will need to be able to run Hadoop (at minimum in stand-alone mode) for Assignment 4.

Welcome to the MapReduce course

Welcome to the course Distributed Data Processing using MapReduce! You can find the schedule of lectures and assignments on Blackboard under “Course Information” (scroll down).

This course stays on top of some very exciting developments in cluster computing and data centers, initiated by Google and followed by many others such as Yahoo, Amazon, AOL, Baidu, Joost, Mylife, and Facebook. The course is not only about processing terabytes of data on large clusters. In fact, few courses in the Computer Science master's programme are as “core computer science” as this one: we will discuss new file systems (GFS and Hadoop FS), new programming paradigms (MapReduce), new programming languages and query languages (Sawzall, Pig), and of course the many web search and data mining applications that made Google one of today's leading IT companies.

I hope to see you at our lectures on Fridays, 3rd and 4th hour.

Distributed data processing using MapReduce

Distributed data processing using MapReduce is a new course that teaches how to carry out large-scale distributed data analysis using Google's MapReduce as the programming abstraction. MapReduce is inspired by the functions 'map' and 'reduce' found in functional programming languages such as Lisp. It was developed at Google as a mechanism for large-scale distributed processing of data in data centers consisting of thousands of low-cost machines. MapReduce allows programmers to distribute their programs over many machines without having to worry about system failures, threads, locks, semaphores, and other concepts from concurrent and distributed programming. Students will learn to specify algorithms using map and reduce steps and to implement these algorithms using Hadoop, an open source implementation of Google's file system and MapReduce. The course will also introduce recent attempts to develop high-level languages for simplified relational data processing, such as Yahoo's Pig Latin on top of Hadoop and Microsoft's DryadLINQ.
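To make the map and reduce steps concrete, here is a minimal single-machine sketch in Python of the classic word count example. It is not Hadoop code and not part of the course material: the helper names (map_phase, shuffle, reduce_phase, run_mapreduce) are made up for illustration, and the grouping step only imitates what a real framework does between the map and reduce phases across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect all (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all values collected for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_mapreduce(records, mapper, reducer):
    """Hypothetical driver: map, then group by key, then reduce."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Map step: emit (word, 1) for every word in a line of text."""
    return [(word, 1) for word in line.lower().split()]


def sum_reducer(_word, counts):
    """Reduce step: add up the partial counts for one word."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(lines, word_count_mapper, sum_reducer))
```

The programmer only writes the mapper and the reducer. Everything in the driver (distribution, grouping, fault handling) is what a framework such as Hadoop takes care of.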

The course consists of lectures and practical assignments. Students will solve lab exercises on a large cluster of machines in order to get hands-on experience and solve real large-scale problems. The lab exercises will be done on the University of Twente PRISMA-2 computer, a data center consisting of 16 dual-core systems sponsored by Yahoo Research. Examples of lab exercises are: counting bigrams in large web crawls, inverted index construction, and the computation of Google's PageRank. After successful completion of the course, the student is able to:

Dissect complex problems into algorithms that use map and reduce steps,
Specify these algorithms in a functional language such as Haskell,
Implement these algorithms using the Hadoop framework,
Specify simplified relational queries using Pig Latin.
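As a rough illustration of the first goal, two of the lab exercises mentioned above (bigram counting and inverted index construction) can be sketched as map and reduce steps. The Python below is a single-machine sketch under stated assumptions, not the course's solution: run_mapreduce is a hypothetical stand-in for the framework, and documents are assumed to arrive as (doc_id, text) pairs.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def bigram_mapper(record):
    """Map step: emit ((word, next_word), 1) for each adjacent word pair."""
    _doc_id, text = record
    words = text.lower().split()
    return [((first, second), 1) for first, second in zip(words, words[1:])]


def count_reducer(_bigram, counts):
    """Reduce step: total number of occurrences of one bigram."""
    return sum(counts)


def index_mapper(record):
    """Map step: emit (word, doc_id) for each word in the document."""
    doc_id, text = record
    return [(word, doc_id) for word in text.lower().split()]


def postings_reducer(_word, doc_ids):
    """Reduce step: sorted list of distinct documents containing the word."""
    return sorted(set(doc_ids))


if __name__ == "__main__":
    docs = [("d1", "new york times"), ("d2", "new york city"), ("d3", "york minster")]
    print(run_mapreduce(docs, bigram_mapper, count_reducer))
    print(run_mapreduce(docs, index_mapper, postings_reducer))
```

Both problems reuse the same driver. Only the mapper and reducer change, which is the kind of decomposition the first goal refers to.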

More information on Blackboard.