Archive for the 'Deep Web' Category

Kien Tjin-Kam-Jet presents Q-Able at the Young Technology Award

Monday, February 3rd, 2014, posted by Djoerd Hiemstra

Kien at YTA

On Thursday 30 January, the final of the Young Technology Award was held at Atak, with an excellent performance by Kien Tjin-Kam-Jet of Q-Able.
See the photo impression.

Q-Able nominated for the Young Technology Award

Friday, December 20th, 2013, posted by Djoerd Hiemstra

Vote now for Q-Able

Kien Tjin-Kam-Jet defends PhD thesis on Distributed Deep Web Search

Thursday, December 19th, 2013, posted by Djoerd Hiemstra

Distributed Deep Web Search

by Kien Tjin-Kam-Jet

The World Wide Web contains billions of documents (and counting); hence, it is likely that some document will contain the answer or content you are searching for. While major search engines like Bing and Google often manage to return relevant results to your query, there are plenty of situations in which they are less capable of doing so. Specifically, there is a noticeable shortcoming in situations that involve the retrieval of data from the deep web. Deep web data is difficult to crawl and index for today’s web search engines, and this is largely due to the fact that the data must be accessed via complex web forms. However, deep web data can be highly relevant to the information-need of the end-user. This thesis overviews the problems, solutions, and paradigms for deep web search. Moreover, it proposes a new paradigm to overcome the apparent limitations in the current state of deep web search, and makes the following scientific contributions:

  1. A more specific classification scheme for deep web search systems, to better illustrate the differences and variation between these systems.
  2. Virtual surfacing, a new, and in our opinion better, deep web search paradigm which tries to combine the benefits of the two already existing paradigms, surfacing and virtual integration, and which also raises new research opportunities.
  3. A stack decoding approach which combines rules and statistical usage information to interpret the end-user’s free-text query, and to subsequently derive filled-out web forms based on that interpretation.
  4. A practical comparison of the developed approach against a well-established text-processing toolkit.
  5. Empirical evidence that, for a single site, end-users would rather use the proposed free-text search interface than a complex web form.

Analysis of data obtained from user studies shows that the stack decoding approach works as well as, or better than, today’s top-performing alternatives.
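
To give a feel for the idea, here is a minimal sketch of stack decoding over a free-text query. It is not the implementation from the thesis: the regular-expression rules, the field priors standing in for "statistical usage information", the skip penalty, and the function name `decode` are all assumptions made up for illustration.

```python
import math
import re

# Hypothetical rules: one regular expression per form field.
RULES = {
    "from": re.compile(r"from (\w+)$"),
    "to": re.compile(r"to (\w+)$"),
    "time": re.compile(r"(?:at )?(\d{1,2}(?:am|pm))$"),
}
# Hypothetical usage statistics: how often users fill in each field.
FIELD_PRIOR = {"from": 0.45, "to": 0.45, "time": 0.10}
# Assumed penalty for leaving a token uninterpreted.
SKIP_LOGPROB = math.log(0.01)


def decode(query, beam_size=5, max_span=3):
    """Return the best (log_score, filled_form) for a free-text query."""
    tokens = query.lower().split()
    # stacks[i] holds hypotheses that have consumed exactly i tokens.
    stacks = [[] for _ in range(len(tokens) + 1)]
    stacks[0].append((0.0, {}))
    for i in range(len(tokens)):
        # Prune: keep only the best hypotheses before extending them.
        stacks[i] = sorted(stacks[i], key=lambda h: h[0], reverse=True)[:beam_size]
        for score, form in stacks[i]:
            # Option 1: skip one token.
            stacks[i + 1].append((score + SKIP_LOGPROB, form))
            # Option 2: let a rule interpret the next 1..max_span tokens.
            for j in range(i + 1, min(i + max_span, len(tokens)) + 1):
                span = " ".join(tokens[i:j])
                for field, pattern in RULES.items():
                    match = pattern.match(span)
                    if match and field not in form:
                        new_form = {**form, field: match.group(1)}
                        new_score = score + math.log(FIELD_PRIOR[field])
                        stacks[j].append((new_score, new_form))
    return max(stacks[-1], key=lambda h: h[0])


score, form = decode("from Enschede to Amsterdam arriving at 9am")
print(form)
```

The rules propose which field a span of tokens could fill, the priors score each proposal, and the per-position stacks keep only a handful of partial interpretations alive, so the search stays cheap even when many rules fire.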

[download pdf]

Sabbatical at Q-Able

Monday, September 2nd, 2013, posted by Djoerd Hiemstra

Starting today, I am on sabbatical at Q-Able, an exciting new internet startup and spinoff of the University of Twente. Q-Able will bring new search capabilities to internet web shops, hotel and travel booking sites, online banking, etc. by replacing multi-field web forms with free-text querying. Instead of meticulously filling in a web form one field at a time, users of your web site get a simple, single search field. Q-Able’s solutions provide a better user experience for the visitors of web sites, and they give the company running the web site the opportunity to find out what their customers really want (you’d be surprised at the things people will enter in single search fields).

More information shortly at: q-able.com.

STW Valorization Grant for Q-Able

Friday, June 21st, 2013, posted by Djoerd Hiemstra

The University of Twente spin-off Q-Able receives a Valorization Grant Phase 1 from the Dutch Technology Foundation STW to further develop and market their OneBox search technology.

When it comes to web applications, users love the “single text box” interface because it is extremely easy to use. However, much information on the web is stored in structured databases and can only be accessed by filling out a web form with multiple input fields. Examples include planning a trip, booking a hotel room, looking for a second-hand car, etc.

The approach of web search engines – to crawl sites and make a central index of the pages – does not suffice in many cases. First, some sites are hard to crawl because the pages can only be accessed via the web form. Second, some sites provide information that changes quickly, like available hotel rooms, and crawled pages would be almost immediately outdated. Third, some sites provide information that is generated dynamically, like planning a trip from one address to another on a certain date, and it is impossible to crawl all combinations of addresses and dates. Finally, a simple text index that search engines provide does not easily allow structured queries on arbitrary fields. In all these cases, the sites that provide such information can be found using a search engine like Google, but the information itself can only be retrieved after filling in a web form. Filling in one or more web forms with many fields can be a tedious job.

Q-Able replaces a site’s web forms with OneBox, a simple text field, giving complex sites the look and feel of Google: a single field for asking questions and performing simple transactions. OneBox allows users to plan a trip by typing for instance “Next Wednesday from Enschede to Amsterdam arriving at 9am”, or to search for second-hand cars by typing “Ford C-max 4-doors less than 200,000 kilometres from before 2008”. OneBox can be configured to operate on any web site that provides complex web forms. Furthermore, OneBox can be configured to operate on multiple web sites using a single simple text field. This way, to search for a second-hand car for instance, users enter a single query and search multiple second-hand car sites with a single click. OneBox only replaces the user interface of a web database: it does not copy, crawl or otherwise index the data itself.
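
As a rough illustration of the multi-site setup, the sketch below turns one interpreted query into a filled-out form URL for each configured site. Everything in it is an assumption: the site names, the `example` URLs, the parameter names and the `form_urls` helper are invented, and OneBox’s real configuration format is not described here.

```python
from urllib.parse import urlencode

# Hypothetical site configurations: each maps shared field names to the
# parameter names of that site's own search form. URLs are placeholders.
SITES = {
    "cars-a": {
        "action": "https://cars-a.example/search",
        "params": {"make": "brand", "model": "model", "max_km": "mileage_to"},
    },
    "cars-b": {
        "action": "https://cars-b.example/find",
        "params": {"make": "mk", "model": "md", "max_km": "km_max"},
    },
}


def form_urls(interpretation):
    """Build one form-submission URL per site from a single interpretation."""
    urls = {}
    for name, site in SITES.items():
        # Rename each interpreted field to the site's own parameter name,
        # dropping fields that the site's form does not offer.
        filled = {
            site["params"][field]: value
            for field, value in interpretation.items()
            if field in site["params"]
        }
        urls[name] = site["action"] + "?" + urlencode(filled)
    return urls


# An interpretation such as a query parser might produce for
# "Ford C-max less than 200,000 kilometres".
urls = form_urls({"make": "Ford", "model": "C-max", "max_km": 200000})
for site, url in urls.items():
    print(site, url)
```

Only the form submission is generated; the data stays on the target sites, which matches the point that the user interface is replaced without copying or indexing anything.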

OneBox is the result of the Ph.D. research project done by Kien Tjin-Kam-Jet at the University of Twente. His research identified several successful novel approaches to query understanding by combining rule-based approaches with probabilistic approaches that rank query interpretations. Furthermore, the research resulted in an efficient implementation of OneBox that needs only a fraction of a second to interpret queries, even in complex configurations for accessing multiple web databases. Treinplanner.info, Q-Able’s first public demonstration of OneBox, demonstrates natural search for the Dutch Railways (Nederlandse Spoorwegen) travel planner, and was well received in user questionnaires, on Twitter, and on Dutch national public radio and television. Q-Able will use the STW valorization grant to investigate the technical and commercial feasibility of OneBox.

TREC Federated Web Search track

Thursday, June 6th, 2013, posted by Djoerd Hiemstra

http://sites.google.com/site/trecfedweb/

First submission due on August 11, 2013

The Federated Web Search track is part of NIST’s Text REtrieval Conference TREC 2013. The track investigates techniques for the selection and combination of search results from a large number of real on-line web search services. The data set, consisting of search results from 157 search engines, is now available. The search engines cover a broad range of categories, including news, books, academic, travel, etc. We have included one big general web search engine, which is based on a combination of existing web search engines.

Federated search is the approach of querying multiple search engines simultaneously, and combining their results into one coherent search engine result page. The goal of the Federated Web Search (FedWeb) track is to evaluate approaches to federated search at very large scale in a realistic setting, by combining the search results of existing web search engines. This year the track focuses on resource selection (selecting the search engines that should be queried), and results merging (combining the results into a single ranked list). You may submit up to 3 runs for each task. All runs will be judged. Upon submission, you will be asked for each run to indicate whether you used result snippets and/or pages, and whether any external data was used. Precise guidelines can be found at:
http://sites.google.com/site/trecfedweb/
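
For readers new to the two tasks, here is a small sketch of each. These are naive baselines chosen for brevity, not methods prescribed by the track: resource selection scores each engine by query-term counts in its sampled snippets, and results merging uses round-robin interleaving. The function names and the sample data are made up.

```python
from collections import Counter
from itertools import zip_longest


def select_resources(query, samples, k=2):
    """Rank engines by how often query terms occur in their sampled snippets."""
    terms = query.lower().split()
    scores = {}
    for engine, snippets in samples.items():
        counts = Counter(" ".join(snippets).lower().split())
        scores[engine] = sum(counts[t] for t in terms)
    # Highest score first; ties are broken alphabetically by engine name.
    return sorted(scores, key=lambda e: (-scores[e], e))[:k]


def merge_round_robin(result_lists):
    """Interleave ranked lists rank by rank, dropping duplicate results."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Invented sample snippets for three hypothetical engines.
samples = {
    "news": ["election results today", "football results"],
    "books": ["novel about football"],
    "travel": ["cheap flights", "hotel deals"],
}
print(select_resources("football results", samples))
print(merge_round_robin([["a", "b", "c"], ["b", "d"]]))
```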

Track coordinators

  • Djoerd Hiemstra - University of Twente, The Netherlands
  • Thomas Demeester - Ghent University, Belgium
  • Dolf Trieschnigg - University of Twente, The Netherlands
  • Dong Nguyen - University of Twente, The Netherlands

Federated Search Made Easy

Tuesday, May 21st, 2013, posted by Djoerd Hiemstra

by Dolf Trieschnigg, Kien Tjin-Kam-Jet, and Djoerd Hiemstra

Building a federated search engine based on a large number of existing web search engines is a challenge: implementing the programming interface (API) for each search engine is an exacting and time-consuming job. In this demonstration we present SearchResultFinder, a browser plugin which speeds up determining reusable XPaths for extracting search result items from HTML search result pages. Based on a single search result page, the tool presents a ranked list of candidate extraction XPaths and allows highlighting to view the extraction result. An evaluation with 148 web search engines shows that in 90% of the cases a correct XPath is suggested.
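
To show what a reusable extraction XPath does, the sketch below ranks a few candidate XPaths on a toy result page and extracts the result items with the best one. The page, the candidate list and the scoring heuristic are assumptions for illustration and do not reproduce SearchResultFinder’s ranking. The page is well-formed so that Python’s standard `xml.etree.ElementTree`, which supports only a subset of XPath, can parse it; real HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# A toy, well-formed search result page (made up for this example).
PAGE = """<html><body>
<div class="nav"><a href="/">Home</a><a href="/about">About</a></div>
<ol>
  <li class="result"><a href="http://a.example">First</a><p>Snippet one</p></li>
  <li class="result"><a href="http://b.example">Second</a><p>Snippet two</p></li>
  <li class="result"><a href="http://c.example">Third</a><p>Snippet three</p></li>
</ol>
</body></html>"""

CANDIDATES = [".//div[@class='nav']/a", ".//li[@class='result']", ".//ol"]


def rank_xpaths(root, candidates):
    """Hypothetical heuristic: prefer XPaths that match many nodes which
    each contain both a link and a snippet paragraph."""
    def score(xpath):
        nodes = root.findall(xpath)
        return sum(
            1 for n in nodes
            if n.find(".//a") is not None and n.find(".//p") is not None
        )
    return sorted(candidates, key=score, reverse=True)


def extract(root, xpath):
    """Return (url, title) for every node matched by the XPath."""
    items = []
    for node in root.findall(xpath):
        link = node.find(".//a")
        items.append((link.get("href"), link.text))
    return items


root = ET.fromstring(PAGE)
ranked = rank_xpaths(root, CANDIDATES)
print(ranked[0])
print(extract(root, ranked[0]))
```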

The software can be downloaded as a Firefox plugin.

SIGIR 2013 demonstration

The tool was demonstrated at the ACM SIGIR Conference in Dublin.

[download pdf]

Twente at NTCIR 2013

Thursday, May 2nd, 2013, posted by Djoerd Hiemstra

An API-based Search System for One Click Access to Information

by Dan Ionita, Niek Tax, and Djoerd Hiemstra

This paper proposes a prototype One Click access system, based on previous work in the field and the related 1CLICK-2@NTCIR10 task. The proposed solution integrates methods from previous such attempts into a three-tier algorithm (query categorization, information extraction, and output generation) and offers suggestions on how each of these can be implemented. Finally, a thorough user-based evaluation concludes that such an information retrieval system outperforms the textual preview collected from Google search results, based on a paired sign test. Based on the validation results, suggestions for future improvements are proposed.
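
For reference, a paired sign test only looks at which of the two systems each user preferred. A minimal two-sided version is sketched below; the ratings are invented and are not the data from the paper, and the function name `sign_test` is ours.

```python
from math import comb


def sign_test(pairs):
    """Two-sided paired sign test on (system_a, system_b) score pairs.

    Ties are discarded. Under the null hypothesis, each remaining pair
    favours either system with probability 0.5.
    """
    wins = sum(1 for a, b in pairs if a > b)
    losses = sum(1 for a, b in pairs if a < b)
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of k or more successes out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented ratings: 9 users prefer system A, 1 prefers B, 2 are tied.
ratings = [(5, 3)] * 9 + [(2, 4)] + [(3, 3)] * 2
print(sign_test(ratings))
```

With 9 wins out of 10 untied pairs, the p-value is 22/1024, about 0.021.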

To be presented at the Japanese National Institute of Informatics (NII) Testbeds and Community for Information access Research (NTCIR-10) Conference at the National Center of Sciences, Tokyo, Japan, on June 18-21.

[download pdf]

Deep Web Entity Monitoring

Wednesday, March 27th, 2013, posted by Djoerd Hiemstra

by Mohammad Khelghati

Search engines do not cover all the data available on the Web. Besides the fact that no search engine covers every existing web page, they miss the data behind web search forms. This data, known as the hidden web or deep web, is not accessible through search engines. The deep web is estimated to contain several times more data than the surface web, the part that search engines can access. Although deep web information can be accessed through each source’s own interface, finding and querying all potentially useful sources can be a difficult, time-consuming and tiring task. Given the huge amount of information that might relate to one’s information needs, it may even be impossible for a person to cover all the deep web sources of interest. Therefore, there is great demand for applications that facilitate access to the large amount of data locked behind web search forms. Realizing approaches to meet this demand is one of the main issues targeted in this PhD project. Once access to deep web data is provided, different techniques can be applied to offer users additional value from this data, such as analyzing the data and finding patterns and relationships among data items and data sources. In this research, however, the focus is on monitoring entities that exist in deep web sources.

To be presented at the World Wide Web Conference Doctoral Consortium on 13 May in Rio de Janeiro, Brazil.

A probabilistic approach for mapping free-text queries to complex web forms

Friday, December 21st, 2012, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

Web applications with complex interfaces consisting of multiple input fields should understand free-text queries. We propose a probabilistic approach to map parts of a free-text query to the fields of a complex web form. Our method uses token models rather than only static dictionaries to create this mapping, offering greater flexibility and requiring less domain knowledge than existing systems. We evaluate different implementations of our mapping model and show that our system effectively maps free-text queries without using a dictionary. If a dictionary is available, the performance increases and is significantly better than a rule-based baseline.
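
The toy sketch below hints at how a token model can map query tokens to form fields without a dictionary of known values. It is a strong simplification and not the model from the paper: it reduces each token to a coarse character shape, estimates P(shape | field) from a few invented training values with add-one smoothing, and labels each token independently. The fields, training values and function names are all assumptions.

```python
import re
from collections import Counter

# Invented example values per form field (not from the paper).
TRAINING = {
    "year": ["2008", "1999", "2012"],
    "mileage": ["200000", "150000km", "80000"],
    "make": ["ford", "opel", "volkswagen"],
}


def shape(token):
    """Map each digit to 'd' and each run of letters to 'a',
    e.g. '2008' -> 'dddd' and '150000km' -> 'dddddda'."""
    return re.sub(
        r"[0-9]|[A-Za-z]+",
        lambda m: "d" if m.group().isdigit() else "a",
        token,
    )


def train(examples, alpha=1.0):
    """Return prob(token, field), a smoothed estimate of P(shape | field)."""
    counts = {
        field: Counter(shape(v) for v in values)
        for field, values in examples.items()
    }
    vocab = {s for c in counts.values() for s in c}
    size = len(vocab) + 1  # one extra slot for unseen shapes

    def prob(token, field):
        c = counts[field]
        return (c[shape(token)] + alpha) / (sum(c.values()) + alpha * size)

    return prob


def map_query(query, prob, fields):
    """Assign each token to its most likely field (uniform field prior)."""
    return [(tok, max(fields, key=lambda f: prob(tok, f))) for tok in query.split()]


prob = train(TRAINING)
print(map_query("Ford 2008 120000", prob, list(TRAINING)))
```

Note that "120000" never occurs in the training values; it is still mapped to the mileage field because its shape matches, which is the kind of flexibility a static dictionary lacks.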

[download pdf]