Archive for the 'MIREX' Category

MIREX 0.3 for ClueWeb12

Monday, July 1st, 2013, posted by Djoerd Hiemstra

We released MIREX version 0.3 for TREC Web track participants who work on the new ClueWeb12 dataset. The code now uses the new Hadoop API. It was tested on Cloudera’s cdh3u5 Hadoop distribution (Hadoop version 0.20.2) and, with some minor tweaks to the build.xml file, also on Cloudera cdh4 versions. Download MIREX at:
http://mirex.sourceforge.net.

Anchor text for ClueWeb12

Thursday, June 27th, 2013, posted by Djoerd Hiemstra

We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:

  • ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)
The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. To keep the files manageable, the anchor text for a page is cut once more than 10MB of anchors has been collected. Web pages were truncated at 50KB before extracting the anchors. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files with the fields (TREC-ID, URL, ANCHOR TEXT).
The anchor text extraction is described in (please cite the report if you use the data in your research):
The source code is available from: http://mirex.sourceforge.net.
(See also Anchor Text for ClueWeb09.)
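As a rough illustration of the file format described above, the following Python sketch reads the gzipped, tab-separated records one at a time. The function names, the UTF-8 encoding, and the decision to skip lines with fewer than three fields are assumptions made for this example; they are not part of the MIREX code.

```python
import gzip


def parse_anchor_line(line):
    """Split one record into (trec_id, url, anchor_text)."""
    # Split on the first two tabs only, so any further tabs stay inside
    # the anchor text (an assumption about how the field may look).
    trec_id, url, anchor_text = line.rstrip("\n").split("\t", 2)
    return trec_id, url, anchor_text


def read_anchors(path):
    """Yield (trec_id, url, anchor_text) tuples from a gzipped anchor file."""
    # UTF-8 with replacement of undecodable bytes is an assumption;
    # the post does not state the encoding of the files.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.count("\t") >= 2:  # skip lines without all three fields
                yield parse_anchor_line(line)
```

Because `read_anchors` is a generator, a file can be processed record by record without decompressing it into memory first.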

Taily: Shard Selection Using the Tail of Score Distributions

Wednesday, May 15th, 2013, posted by Djoerd Hiemstra

by Robin Aly, Djoerd Hiemstra, and Thomas Demeester

Search engines can improve their efficiency by selecting only a few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query’s score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function’s features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
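To make the idea in the abstract more concrete, here is a simplified Python sketch, not the algorithm from the paper. It assumes that the mean and variance of a query's scores are already known per shard, fits a Gamma distribution by the method of moments, and keeps the shards that are expected to hold at least a minimum number of documents above a score threshold. The names `gamma_params`, `gamma_sf`, `select_shards`, `threshold`, and `min_expected` are hypothetical, and the paper's own way of deriving these quantities from term statistics is not reproduced here.

```python
import math


def gamma_params(mean, variance):
    """Method-of-moments fit: return (shape, scale) of a Gamma distribution."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    return mean * mean / variance, variance / mean


def gamma_sf(x, shape, scale, max_terms=10000, tol=1e-12):
    """P(X > x) for X ~ Gamma(shape, scale).

    Uses the power series of the lower incomplete gamma function, which is
    adequate for moderate x / scale; very large ratios would need a
    different method.
    """
    if x <= 0:
        return 1.0
    z = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= z / (shape + n)
        total += term
        if term < tol * total:
            break
    cdf = math.exp(shape * math.log(z) - z - math.lgamma(shape)) * total
    return max(0.0, 1.0 - min(1.0, cdf))


def select_shards(shard_stats, threshold, min_expected):
    """Return names of shards expected to hold enough high-scoring documents.

    shard_stats maps a shard name to (n_docs, mean_score, score_variance).
    """
    selected = []
    for name, (n_docs, mean, variance) in shard_stats.items():
        shape, scale = gamma_params(mean, variance)
        if n_docs * gamma_sf(threshold, shape, scale) >= min_expected:
            selected.append(name)
    return selected
```

For example, `select_shards({"A": (1000, 2.0, 1.0), "B": (1000, 1.0, 0.5)}, threshold=5.0, min_expected=1.0)` keeps only shard A, because the tail of its fitted distribution places roughly ten documents above the threshold while shard B's places fewer than one.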

SIGIR 2013 presentation

Presented at the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in Dublin, Ireland, 28 July - 1 August.

[download pdf]

Ensemble clustering for result diversification

Friday, October 26th, 2012, posted by Djoerd Hiemstra

by Dong Nguyen and Djoerd Hiemstra

This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequentially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA-based diversification and also better than a non-diversification run.

[download pdf]

MIREX in ERCIM News Big Data Special

Wednesday, April 11th, 2012, posted by Djoerd Hiemstra

by Djoerd Hiemstra and Claudia Hauff

MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of data uncompressed. MIREX shows that the execution of test queries by a brute force linear scan of pages is a viable alternative to running the test queries on a search engine’s inverted index. MIREX is open source and available at SourceForge.

More information in ERCIM News 89.

University of Twente at TREC 2010

Thursday, October 28th, 2010, posted by Djoerd Hiemstra

MapReduce for Experimental Search

by Djoerd Hiemstra and Claudia Hauff

This draft report presents preliminary results for the TREC 2010 ad-hoc web search task. We ran our MIREX system on 0.5 billion web documents from the ClueWeb09 crawl. On average, the system retrieves at least 3 relevant documents on the first result page containing 10 results, using a simple index consisting of anchor texts, page titles, and spam removal.

[download pdf]

Let’s quickly test this on 12TB of data

Thursday, June 24th, 2010, posted by Djoerd Hiemstra

MapReduce for Information Retrieval Evaluation

by Djoerd Hiemstra and Claudia Hauff

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net.

The paper will be presented at the CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation on 20-23 September 2010 in Padua, Italy.
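The sequential-scan approach proposed in this post can be illustrated with a small single-process Python simulation of the map and reduce steps. MIREX itself runs on Hadoop, so this is only a sketch: the function names, the term-count scoring, and the in-memory grouping are assumptions made for the example, not the MIREX implementation.

```python
import heapq
from collections import defaultdict


def map_document(doc_id, text, queries):
    """Map step: emit (query_id, (score, doc_id)) for every matching query."""
    tokens = text.lower().split()
    for query_id, terms in queries.items():
        # Toy scoring: count occurrences of the query terms in the document.
        score = sum(tokens.count(term) for term in terms)
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(scored_docs, k):
    """Reduce step: keep the k highest-scoring documents for one query."""
    return heapq.nlargest(k, scored_docs)


def run_scan(documents, queries, k=10):
    """Scan every document once, then rank the results per query."""
    grouped = defaultdict(list)
    for doc_id, text in documents:
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return {query_id: reduce_query(pairs, k) for query_id, pairs in grouped.items()}
```

Every document is read exactly once and scored against all test queries in that pass, which is what removes the need to build an inverted index first. On a cluster, the grouping by query would be handled by the framework's shuffle phase rather than by a dictionary in memory.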

MIREX: MapReduce IR Experiments

Wednesday, April 28th, 2010, posted by Djoerd Hiemstra

MIREX (MapReduce Information Retrieval Experiments) provides solutions to easily and quickly run large-scale information retrieval experiments on a cluster of machines using Hadoop. Version 0.1 has tools for the TREC ClueWeb09 collection. The code is available to other researchers at: http://mirex.sourceforge.net/.

Anchor text for ClueWeb09 Category A

Tuesday, April 27th, 2010, posted by Djoerd Hiemstra

We’ve put anchor text for the English Category A documents of the TREC ClueWeb09 collection online using BitTorrent:

The file contains anchor text for about 88% of the pages in Category A. To keep the file manageable, the anchor text for a page is cut once more than 10MB of anchors has been collected. The size is about 24.5 GB (gzipped). The file is a tab-separated text file with the fields (TREC-ID, URL, ANCHOR TEXT).
The anchor text extraction is described in (please cite the report if you use the data in your research):
The source code is available from: http://mirex.sourceforge.net