Ranking XPaths for extracting search result records
by Dolf Trieschnigg, Kien Tjin-Kam-Jet and Djoerd Hiemstra
Extracting search result records (SRRs) from webpages is
useful for building an aggregated search engine that combines
search results from a variety of search engines. Most
automatic approaches to search result extraction are not
portable: the complete process has to be rerun on a new
search result page. In this paper we describe an algorithm to
automatically determine XPath expressions to extract SRRs
from webpages. Based on a single search result page, an
XPath expression is determined which can be reused to
extract SRRs from pages based on the same template. The
algorithm is evaluated on six datasets, including two new
datasets containing a variety of web, image, video, shopping
and news search results. The evaluation shows that for 85%
of the tested search result pages, a useful XPath is determined.
The algorithm is implemented as a browser plugin
and as a standalone application, both of which are available as
open-source software.
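
To illustrate the reuse step described above, here is a minimal sketch in Python of applying one XPath expression to two pages built from the same template. It does not implement the ranking algorithm from the paper. The page markup, the XPath in SRR_XPATH and the helper extract_srrs are all hypothetical, and the pages are assumed to be well-formed XML so that the standard library parser can read them; real result pages would normally need an HTML parser first.

```python
import xml.etree.ElementTree as ET

# Two hypothetical result pages that share one template (assumed well-formed).
PAGE_1 = """<html><body>
<div id="nav"><a href="/about">About</a></div>
<div id="results">
  <div class="result"><a href="http://example.org/a">Result A</a><p>Snippet A</p></div>
  <div class="result"><a href="http://example.org/b">Result B</a><p>Snippet B</p></div>
</div>
</body></html>"""

PAGE_2 = """<html><body>
<div id="nav"><a href="/about">About</a></div>
<div id="results">
  <div class="result"><a href="http://example.org/c">Result C</a><p>Snippet C</p></div>
</div>
</body></html>"""

# Hypothetical XPath, imagined as having been determined from PAGE_1 only.
# ElementTree supports a limited XPath subset, which is enough here.
SRR_XPATH = ".//div[@id='results']/div[@class='result']"


def extract_srrs(page, xpath):
    """Return one dict per search result record matched by xpath."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(xpath):
        link = node.find("a")
        snippet = node.find("p")
        records.append({
            "url": link.get("href") if link is not None else None,
            "title": link.text if link is not None else None,
            "snippet": snippet.text if snippet is not None else None,
        })
    return records


if __name__ == "__main__":
    # The same expression is applied to both pages without re-deriving it.
    for page in (PAGE_1, PAGE_2):
        for record in extract_srrs(page, SRR_XPATH):
            print(record)
```

The navigation link is not matched because the expression selects only the repeated result containers, which is the property that makes a single expression portable across pages of one template.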
[download pdf]
Download Search Result Finder Firefox plugin.