Anchor text for ClueWeb09 Category A

We’ve made anchor text for the English Category A documents of the TREC ClueWeb09 collection available online via BitTorrent:

The file contains anchor text for about 88% of the pages in Category A. To keep the file manageable, the anchor text for a page is cut off once more than 10 MB of anchors have been collected for that page. The file is about 24.5 GB gzipped. It is a tab-separated text file consisting of three fields: TREC-ID, URL, and anchor text.

The anchor text extraction is described in (please cite the report if you use the data in your research):

The source code is available from: http://mirex.sourceforge.net
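
As a rough illustration of the tab-separated layout described above, here is a minimal Python sketch that streams the gzipped file record by record instead of loading it into memory. It is not part of MIREX. The names AnchorRecord, parse_anchor_lines and read_anchor_file are made up for this example, and it assumes one record per line, UTF-8 encoded text, and that any tab after the second one belongs to the anchor text.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One (TREC-ID, URL, anchor text) record; a hypothetical helper type."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_lines(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Parse tab-separated lines, skipping lines without three fields."""
    for line in lines:
        # Split on the first two tabs only, so that any further tabs
        # stay inside the anchor text field (an assumption about the data).
        parts = line.rstrip("\r\n").split("\t", 2)
        if len(parts) != 3:
            continue  # malformed or empty line
        yield AnchorRecord(*parts)


def read_anchor_file(path: str) -> Iterator[AnchorRecord]:
    """Stream records from a gzipped anchor file, one line at a time."""
    # UTF-8 is assumed here; undecodable bytes are replaced, not fatal.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_anchor_lines(handle)
```

Because read_anchor_file is a generator, a loop such as "for record in read_anchor_file(path)" keeps only one line in memory at a time, which matters for a file of roughly 24.5 GB compressed. The path is whatever name the downloaded file has on your system.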

5 Responses to “Anchor text for ClueWeb09 Category A”

  1. Yin Says:
    Hello, thank you for your technical report regarding MIREX. Would you clarify the specifications of the 15 machines in the cluster in terms of RAM, CPU types (number of cores), hard drive storage, and OS version (i.e. 32 bit or 64 bit)? Thank you.
  2. Djoerd Hiemstra Says:
    Hi Yin, Thanks for your interest. We use the PowerEdge R200 from Dell. The machines have dual-core Intel Xeon E3110 64-bit processors with 6MB cache. Each machine has 8GB main memory (4 times 2GB DDR2) and two hard disks: one of 750 GB is used for the Hadoop distributed file system; the other, of 150 GB, is used for the operating system and non-Hadoop related activities on the cluster. The machines run Suse 10.3 and Hadoop 0.19.2.
    The machines were a real bargain when we bought them. Still, if we were to buy the systems now, I would opt for a little more storage, and maybe a little less computing power. These machines are very fast, but you would be surprised how easy it is to fill almost 10 TB of storage space. Best, Djoerd.
  3. You Wang Says:
    Hello, thank you for providing these two datasets. I have downloaded the file “mirex-anchors-catb.txt.gz”; however, it is only 416.99MB. Is this its real size? Would you be so kind as to provide an address that supports multi-threaded download?
  4. Djoerd Hiemstra Says:
    Dear You Wang, The size of the Category B anchors should be approximately 2GB zipped, so something must have gone wrong. Please try again. We cannot provide more download capacity for these big files, unfortunately (the files come directly from our cluster, otherwise I would have liked to provide a torrent file). The size of the Category A anchors is over ten times as big. Best, Djoerd.
  5. Djoerd Hiemstra Says:
    The Category B anchors mentioned in the comments above are no longer available.