Expertise centre for cloud computing

Enschede will open an expertise centre for cloud computing on Thursday 17 June. The Centre 4 Cloud Computing will support open innovation and the sharing of knowledge on cloud computing. Cloud computing is an Internet-based computing paradigm, whereby shared resources, software and information are provided on-demand in a highly scalable way.

Cloud computing logical diagram

The expertise centre offers companies and organisations the following:

  1. Knowledge Exchange: making (applied) knowledge and best practices available to professionals, management, and other interested parties
  2. Research: applied scientific research into the technical, security, legal, and business aspects of cloud computing
  3. Commercial: contributing to business development for companies that offer services based on cloud computing solutions

For more information, see http://www.centre4cloud.com

Anchor text for ClueWeb09 Category A

We've put the anchor text for the English Category A documents of the TREC ClueWeb09 collection online using BitTorrent:

The file contains anchor text for about 88% of the pages in Category A. To keep the file manageable, the anchor text for a single page is cut off once more than 10 MB of anchors has been collected for that page. The size is about 24.5 GB (gzipped). The file is a tab-separated text file consisting of (TREC-ID, URL, ANCHOR TEXT) records. The anchor text extraction is described in (please cite the report if you use the data in your research):

The source code is available from: http://mirex.sourceforge.net
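As a rough illustration of how the tab-separated records could be consumed, here is a minimal Haskell sketch; the function names are ours and are not part of the MIREX code, and we assume everything after the second tab belongs to the anchor text field:

    import Data.List (intercalate)

    -- Split a string on a separator character (here: tab).
    splitOn :: Char -> String -> [String]
    splitOn sep s = case break (== sep) s of
      (chunk, [])     -> [chunk]
      (chunk, _:rest) -> chunk : splitOn sep rest

    -- Parse one line into (TREC-ID, URL, ANCHOR TEXT); everything after
    -- the second tab is treated as anchor text.
    parseRecord :: String -> Maybe (String, String, String)
    parseRecord line = case splitOn '\t' line of
      (trecId : url : anchors@(_:_)) -> Just (trecId, url, intercalate "\t" anchors)
      _                              -> Nothing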

MapReduce book by Lin and Dyer

Data-Intensive Text Processing with MapReduce

An interesting book by Jimmy Lin and Chris Dyer is forthcoming, in which they show how MapReduce can be used to solve large-scale text processing problems, including examples that use Expectation Maximization training.

This book is about MapReduce algorithm design, particularly for text processing applications. Although our presentation most closely follows implementations in the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don't, for example, discuss APIs, driver programs for composing jobs, command-line invocations for running jobs, etc.
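To give a flavour of the map and reduce abstractions the book builds on, here is a minimal word-count sketch in Haskell (the language used for the assignments in this course); the function names are ours and are not taken from the book:

    import qualified Data.Map as Map

    -- Map phase: emit a (word, 1) pair for every word in a document.
    mapper :: String -> [(String, Int)]
    mapper doc = [ (w, 1) | w <- words doc ]

    -- Shuffle and reduce phase: group the pairs by word and sum the counts.
    wordCount :: [String] -> Map.Map String Int
    wordCount docs = Map.fromListWith (+) (concatMap mapper docs)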

See pre-prints of the book.

Progress on MapReduce assignments

Happy New Year! Here's an update on the course Distributed Data Processing using MapReduce:

  1. Running and debugging the results of Assignment 4 on the cluster is taking more time than expected. The submissions of all students have been tried at least once on the cluster, or are currently running. For assignments that need improvement, you can send in a new version.
  2. Assignment 5 (Sawzall) has been graded; see the Grade Center on Blackboard.
  3. The deadline for Assignment 6 is Friday, January 8. Please let me know beforehand if you need more time.

Guest lecture by Giovane Moura

Next Friday, 11 December, in the first part of the lecture, Giovane Moura will give a guest lecture about analyzing network management data. Giovane is a Ph.D. student at the Design and Analysis of Communication Systems Group (DACS). His research topics include scalability of network analysis and intrusion detection, scalable storage for network flows, and self-management approaches for network management.

In the second part of the lecture we will discuss the SIGMOD 2008 paper by Christopher Olston et al.: “Pig Latin: A Not-So-Foreign Language for Data Processing”. The goal of Assignment 6 is to use Pig Latin for analyzing the network management data provided by Giovane and his colleagues of the DACS group.

Assignment 5 (Sawzall) and Assignment 6 (Analyzing Network Management Data) for Distributed Data Processing using MapReduce are now online in the Blackboard Assignments section.

Tips and additional information for Assignment 3

The deadline for Assignment 3 is Friday 4 December, 10:45 h (start of the lecture). Some tips for Assignment 3:

  • To run the example code for regular expression matching in Haskell, you need to import Text.Regex and Data.Maybe (see the import sketch after this list).

  • Assignment 3.4: Tip: calculate some hash value over the complete web site content. Two duplicate pages will receive exactly the same hash value, but because of collisions two different pages might also get the same hash value. After computing the hashes, you therefore have to do a final check: only pages that share a hash value and have identical content are true duplicates and should be removed (see the duplicate-detection sketch after this list).
  • As an example of the result of the sample stage of Assignment 3.5, consider sorting people by their height on three machines. The sample stage would determine boundaries on the values that divide the data into three approximately equal parts (a small Haskell sketch of this idea follows after the note below), for instance:
    • values between 0 and 1.75: part 1
    • values between 1.75 and 1.80: part 2
    • values between 1.80 and infinity: part 3
(You might get this if the sampling stage reveals that about one third of the people are shorter than 1.75 m, one third are between 1.75 m and 1.80 m tall, and one third are taller than 1.80 m.)
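For the regular expression tip above, a minimal sketch of the required imports, assuming the regex-compat package that provides Text.Regex; the helper name grep is ours:

    import Text.Regex (mkRegex, matchRegex)
    import Data.Maybe (isJust)

    -- Keep only the lines that match a given regular expression.
    grep :: String -> [String] -> [String]
    grep pattern = filter (isJust . matchRegex re)
      where re = mkRegex pattern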
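For the hashing tip of Assignment 3.4, a sketch of the final check, assuming some hash function over the complete page content is given; findDuplicates is our name and not part of the assignment:

    import qualified Data.Map as Map
    import Data.List (partition)

    -- Group pages by a hash of their content, then compare pages within each
    -- bucket, because two different pages might share the same hash value.
    findDuplicates :: (String -> Int)      -- hash over the complete content
                   -> [(String, String)]   -- (url, content) pairs
                   -> [[String]]           -- groups of urls with identical content
    findDuplicates hashFn pages =
        [ map fst grp
        | bucket <- Map.elems byHash
        , grp    <- groupByContent bucket
        , length grp > 1 ]
      where
        byHash = Map.fromListWith (++) [ (hashFn c, [(u, c)]) | (u, c) <- pages ]
        groupByContent [] = []
        groupByContent (p:ps) =
          let (same, rest) = partition ((== snd p) . snd) ps
          in  (p : same) : groupByContent rest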

Note that an actual implementation in Hadoop needs a user-defined "partitioner", but for the Haskell assignment this is not important.
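A small Haskell sketch of the sample stage and the resulting partition function for the height example above; splitPoints and partition3 are our names:

    import Data.List (sort)

    -- Pick two split points from a sample (assumed non-empty) so that the
    -- sample, and hopefully the full data set, is divided into three
    -- roughly equal ranges.
    splitPoints :: [Double] -> (Double, Double)
    splitPoints sample = (sorted !! (n `div` 3), sorted !! (2 * n `div` 3))
      where
        sorted = sort sample
        n      = length sorted

    -- Assign a value to one of the three parts based on the split points.
    partition3 :: (Double, Double) -> Double -> Int
    partition3 (low, high) x
      | x < low   = 1
      | x < high  = 2
      | otherwise = 3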

Finally, for the next lecture, please think of what problem you want to solve with Hadoop for Assignment 4.

More info on Blackboard.