Monday, May 20th, 2013 | Author:

One of my Master's students, Oliver Jundt, has a paper at EUSFLAT 2013.
Sample-based XPath Ranking for Web Information Extraction
Oliver Jundt and Maurice van Keulen
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
The paper will be presented at the EUSFLAT 2013 conference, 11-13 Sep 2013, Milan, Italy [details]
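
To give a feel for what "ranking XPaths on sample data" can mean, here is a minimal sketch in Python. It is not the algorithm from the paper: the scoring rule (fraction of samples for which a path extracts exactly the expected value), the function names `candidate_paths` and `rank_xpaths`, and the toy pages are all assumptions made for illustration. It also assumes the detail pages are well-formed XML so that the standard library parser can read them; real web pages would typically need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def candidate_paths(root):
    """Yield (path, text) for every element that has non-empty text.

    Paths are positional, e.g. ./body[1]/div[1]/span[2], which is within
    the XPath subset that ElementTree supports.
    """
    def walk(elem, path):
        text = (elem.text or "").strip()
        if text:
            yield path, text
        seen = Counter()  # per-tag position among the children of elem
        for child in elem:
            seen[child.tag] += 1
            yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    yield from walk(root, ".")


def rank_xpaths(samples):
    """Rank candidate paths by how often they extract the sample value.

    samples: list of (page_markup, expected_value) pairs (hypothetical input
    format). Returns a list of (path, score) pairs, best first, where score
    is the fraction of samples the path got right.
    """
    hits = Counter()
    for markup, expected in samples:
        root = ET.fromstring(markup)
        matching = {p for p, t in candidate_paths(root) if t == expected}
        hits.update(matching)
    scored = [(path, count / len(samples)) for path, count in hits.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy detail pages with a known value for the "color" attribute.
# On the second page the product name happens to equal the color,
# so the <h1> path matches once but not consistently.
SAMPLES = [
    ("<html><body><h1>Lamp</h1><div><span>Color</span><span>Red</span></div></body></html>", "Red"),
    ("<html><body><h1>Blue</h1><div><span>Color</span><span>Blue</span></div></body></html>", "Blue"),
    ("<html><body><h1>Chair</h1><div><span>Color</span><span>Green</span></div></body></html>", "Green"),
]

ranked = rank_xpaths(SAMPLES)
best_path, best_score = ranked[0]
```

In this toy setup the path to the second `span` matches all three samples, while the `h1` path matches only one, so the former ends up ranked first. The top-ranked path could then be applied to an unseen page from the same site with `ET.fromstring(page).findtext(best_path)`.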