Archive for the 'Course MapReduce' Category

Assignments 5 and 6 on-line

Tuesday, December 8th, 2009, posted by Djoerd Hiemstra

Assignment 5 (Sawzall) and Assignment 6 (Analyzing Network Management Data) for Distributed Data Processing using MapReduce are now on-line in the Blackboard Assignments section.

More info on Blackboard.

Tips and additional information for Assignment 3

Wednesday, December 2nd, 2009, posted by Djoerd Hiemstra

The deadline for Assignment 3 is Friday, 4 December, at 10.45 h (start of the lecture). Some tips for Assignment 3:

  • To run the example code for regular expression matching in Haskell, you need to import the modules Text.Regex and Data.Maybe.
  • Assignment 3.4: calculate a hash value over the complete content of each web page. Two duplicates always receive exactly the same hash value, but because of collisions two different pages might also receive the same hash value. So after computing the hashes, you have to do a final check that compares the actual content of pages with the same hash value and removes only the true duplicates.
  • As an example of the result of the sample stage of Assignment 3.5, consider sorting people by their height on three machines. The sample stage would set boundaries on the values that divide the data into three approximately equal parts, for instance:
    • values between 0 and 1.75: part 1
    • values between 1.75 and 1.80: part 2
    • values between 1.80 and infinity: part 3
(You might get this if the sampling stage reveals that about 1/3 of the persons are shorter than 1.75 m, 1/3 are between 1.75 m and 1.80 m tall, and 1/3 are taller than 1.80 m.) Note that an actual implementation in Hadoop needs a user-defined "partitioner", but for the Haskell assignment this is unimportant.
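The sample stage described above can be sketched as follows. This is only a minimal single-machine illustration in Python, not the Haskell solution for the assignment and not Hadoop code. The function names `choose_boundaries` and `partition_for` and the list of heights are hypothetical, and for simplicity the whole input is used as the sample.

```python
import bisect

def choose_boundaries(sample, num_parts):
    """Pick num_parts - 1 cut points that split the sorted sample
    into parts of roughly equal size."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_parts]
            for i in range(1, num_parts)]

def partition_for(value, boundaries):
    """Return the 0-based part index for a value.
    A value equal to a boundary goes to the upper part."""
    return bisect.bisect_right(boundaries, value)

# Hypothetical heights in metres, purely for illustration.
heights = [1.81, 1.62, 1.79, 1.70, 1.92, 1.74, 1.76, 1.85, 1.78]

# Sample stage: here the "sample" is simply the whole input.
boundaries = choose_boundaries(heights, 3)

# Partition stage: send every value to the part that covers its range.
parts = [[] for _ in range(3)]
for h in heights:
    parts[partition_for(h, boundaries)].append(h)

# Each part is sorted independently (one part per machine);
# concatenating the sorted parts yields a globally sorted list.
result = [h for part in parts for h in sorted(part)]
```

Because every value in part 1 is smaller than every value in part 2, and so on, no merge step is needed after the per-part sorts.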
Finally, for the next lecture, please think of what problem you want to solve with Hadoop for Assignment 4.
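The hashing tip for Assignment 3.4 above might look roughly like the sketch below. It is a single-machine Python illustration under stated assumptions, not the Haskell solution: the names `content_hash` and `remove_duplicates`, the choice of SHA-1, and the sample pages are all hypothetical. The `hash_fn` parameter exists only so that a deliberately weak hash can be passed in to show why the final check is needed.

```python
import hashlib
from collections import defaultdict

def content_hash(content):
    """Hash the complete page content (SHA-1 chosen only as an example)."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()

def remove_duplicates(pages, hash_fn=content_hash):
    """pages is a list of (url, content) pairs.
    Returns the pages with exact duplicates removed, keeping the
    first page seen for each distinct content."""
    # Map-like step: group pages by the hash of their content.
    buckets = defaultdict(list)
    for url, content in pages:
        buckets[hash_fn(content)].append((url, content))

    # Reduce-like step: within one hash value, compare the actual
    # content, because different pages may collide on the same hash.
    unique = []
    for candidates in buckets.values():
        kept = []
        for url, content in candidates:
            if all(content != other for _, other in kept):
                kept.append((url, content))
        unique.extend(kept)
    return unique

# Hypothetical pages: "a" and "b" are duplicates, "c" is different.
pages = [("a", "hello"), ("b", "hello"), ("c", "world")]
deduplicated = remove_duplicates(pages)

# With a weak hash (the content length), "hello" and "world" collide,
# but the final content check still keeps both of them.
deduplicated_weak = remove_duplicates(pages, hash_fn=len)
```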

More info on Blackboard.

Solution Assignment 2

Friday, November 27th, 2009, posted by Djoerd Hiemstra

To do Assignment 3, you need a correct solution for Assignment 2. Therefore, a possible solution to Assignment 2 is added to Assignment 3. You can of course also use your own solution.

More info on Blackboard.

MapReduce Assignment 1 corrected

Wednesday, November 25th, 2009, posted by Djoerd Hiemstra

For the course Distributed Data Processing using MapReduce, grades and feedback for Assignment 1 can be found under “3. Feedback From Instructor” when you click on “Assignments” on Blackboard.

There are 7 more assignments, so there are lots of opportunities to improve. If after 8 assignments the average grade is 5 or lower, then there will be additional assignments to pass the course. If you did not get the system working yet, please note that you will have to be able to run Hadoop (at minimum in stand-alone mode) for Assignment 4.

Welcome to the MapReduce course

Monday, November 2nd, 2009, posted by Djoerd Hiemstra

Welcome to the course Distributed Data Processing using MapReduce! Please find a schedule of the lectures and assignments on Blackboard under “Course Information” (scroll down).

This course follows some very exciting developments in cluster computing and data centers, initiated by Google and followed by many others such as Yahoo, Amazon, AOL, Baidu, Joost, Mylife, and Facebook. The course is not only about processing terabytes of data on large clusters. In fact, not many courses in the Computer Science master’s programme will be so “core computer science”: we will discuss new file systems (GFS and Hadoop FS), new programming paradigms (MapReduce), new programming languages and query languages (Sawzall, Pig), and of course many web search and data mining applications that made Google one of today’s leading IT companies.

I hope to see you at our lectures on Fridays, 3rd/4th hour.

Distributed data processing using MapReduce

Friday, August 14th, 2009, posted by Djoerd Hiemstra

Distributed data processing using MapReduce is a new course that teaches how to carry out large-scale distributed data analysis using Google’s MapReduce as the programming abstraction. MapReduce is inspired by the functions ‘map’ and ‘reduce’ as found in functional programming languages such as Lisp. It was developed at Google as a mechanism to allow large-scale distributed processing of data on data centers consisting of thousands of low-cost machines. MapReduce allows programmers to distribute their programs over many machines without the need to worry about system failures, threads, locks, semaphores, and other concepts from concurrent and distributed programming. Students will learn to specify algorithms using map and reduce steps and to implement these algorithms using Hadoop, an open source implementation of Google’s file system and MapReduce. The course will introduce recent attempts to develop high-level languages for simplified relational data processing on top of Hadoop, such as Yahoo’s Pig Latin and Microsoft’s DryadLINQ.
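To give an impression of what specifying an algorithm in map and reduce steps means, here is a minimal word count sketch. It runs on a single machine in plain Python and only imitates the programming model; it is not Hadoop code, and the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def map_step(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would do
    between the map and the reduce phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(word, counts):
    """Combine all counts for one word into a single total."""
    return word, reduce(lambda a, b: a + b, counts)

def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_step(doc)]
    return dict(reduce_step(word, counts)
                for word, counts in shuffle(pairs).items())

# Hypothetical input documents.
counts = word_count(["a b a", "b c"])
```

Because each call to `map_step` looks at one document only, and each call to `reduce_step` looks at one key only, a framework is free to run these calls on different machines.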

The course consists of lectures and practical assignments. Students will solve lab exercises on a large cluster of machines in order to get hands-on experience and solve real large-scale problems. The lab exercises will be done on the University of Twente PRISMA-2 computer, a data center consisting of 16 dual-core systems sponsored by Yahoo Research. Examples of lab exercises are: counting bigrams in large web crawls, inverted index construction, and the computation of Google’s PageRank. After successful completion of the course, the student is able to:

  • Dissect complex problems into algorithms that use map and reduce steps,
  • Specify these algorithms in a functional language such as Haskell,
  • Implement these algorithms using the Hadoop framework,
  • Specify simplified relational queries using Pig Latin.
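As an illustration of the first objective, one of the lab exercises mentioned above, inverted index construction, might be broken down into a map step and a reduce step as sketched below. This is a single-machine Python sketch with hypothetical function names and toy documents; it is not the course's Haskell or Hadoop code.

```python
from collections import defaultdict

def map_step(doc_id, text):
    """Emit a (term, doc_id) pair for every distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]

def reduce_step(term, doc_ids):
    """Turn all document ids for one term into a sorted posting list."""
    return term, sorted(doc_ids)

def build_inverted_index(documents):
    """documents maps a document id to its text.
    Returns a mapping from each term to the ids of documents containing it."""
    # Grouping by term stands in for the framework's shuffle phase.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_step(doc_id, text):
            groups[term].append(emitted_id)
    return dict(reduce_step(term, ids) for term, ids in groups.items())

# Hypothetical toy collection.
index = build_inverted_index({1: "map reduce", 2: "reduce step"})
```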

More information at Blackboard.