Tag-Archive for » commit/infiniti «

Thursday, June 02nd, 2016

Today, a PhD student of mine, Mohammad S. Khelghati, defended his thesis.
Deep Web Content Monitoring [Download]
In this thesis, we investigate the path towards a focused web harvesting approach that can automatically and efficiently query websites, navigate through results, download data, store it, and track data changes over time. Such an approach can also help users access a complete collection of data relevant to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles: finding methods that achieve the best coverage in harvesting data for a topic; reducing the cost of harvesting a website, in terms of the number of submitted requests, by estimating its actual size; and monitoring data changes over time in web data repositories. We also combine our harvesting experience with studies from the literature to suggest a general framework for designing and developing a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings. These steps further improve the data coverage and monitoring functionality of web harvesters, and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring exceptional software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
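The thesis software itself is not reproduced here, but as a rough illustration of the harvesting loop the abstract describes (query, page through results, download, store, detect changes), here is a minimal Python sketch. The search endpoint, its `page` parameter, and the `store` mapping are hypothetical placeholders, not part of the thesis work.

```python
import hashlib
import requests

def harvest(query_url, pages, store):
    """Minimal harvesting loop: submit a query, page through the results,
    and keep a content hash per result page so that later runs can detect
    changes. `query_url` (a search endpoint assumed to accept a `page`
    parameter) and `store` (any dict-like mapping URL -> hash) are
    hypothetical placeholders."""
    for page in range(1, pages + 1):
        response = requests.get(query_url, params={"page": page}, timeout=30)
        response.raise_for_status()
        digest = hashlib.sha256(response.content).hexdigest()
        if store.get(response.url) != digest:      # new or changed since last run
            store[response.url] = digest
            yield response.url, response.content   # hand off for parsing/storage
```

Re-running the loop with the same `store` yields only pages whose content changed since the previous run, which is the monitoring aspect in miniature.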

Category: COMMIT, Web harvesting | Comments off
Thursday, January 28th, 2016

My PhD student Mohammad Khelghati released his web harvesting software, called HarvestED.

Monday, May 13th, 2013

Together with my PhD student Mena Badieh Habib and another PhD student of our group, Zhemin Zhu, we participated in the “Making Sense of Microposts” challenge at the WWW 2013 conference … and we won the best IE award!
[paper | presentation | poster]

Tuesday, October 02nd, 2012

One of my PhD students, Mohammad Khelghati, has a paper at iiWAS 2012.
Size Estimation of Non-Cooperative Data Collections
Mohammad Khelghati, Djoerd Hiemstra, and Maurice van Keulen
With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. The algorithms applied for this purpose rely on knowledge of a data source's size to decide accurately when to stop crawling or sampling, processes that can be very costly in some cases [14]. The need to know the sizes of data sources is reinforced by competition among businesses on the Web, in which data coverage is critical. This information is also helpful for quality assessment of search engines [7], for search engine selection in federated search, and for resource/collection selection in distributed search [19]. In addition, it can provide insight into useful statistics for public sectors such as governments. In any of these scenarios, when facing a non-cooperative collection that does not publish its information, the size has to be estimated [17]. In this paper, the approaches suggested for this purpose in the literature are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. One of these modifications improves the estimations of other approaches by 35 to 65 percent.
The paper will be presented at the iiWAS 2012 conference, 3-5 Dec 2012, Bali, Indonesia [details]
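The paper's four modified methods are not reproduced here, but as a minimal illustration of the family of estimators this literature builds on, here is a sketch of the classic capture-recapture (Lincoln-Petersen) estimator: probe the collection twice with independent random samples and infer its size from the overlap. This is a textbook baseline, not the paper's own contribution.

```python
import random

def estimate_collection_size(sample_a, sample_b):
    """Lincoln-Petersen capture-recapture estimate.

    sample_a, sample_b: sets of document identifiers obtained by two
    independent random probes of the collection. The estimated size is
    |A| * |B| / |A intersect B|."""
    overlap = len(sample_a & sample_b)
    if overlap == 0:
        raise ValueError("No overlap between samples; take larger samples.")
    return len(sample_a) * len(sample_b) / overlap

# Toy demonstration against a synthetic collection of known size (100,000).
collection = list(range(100_000))
a = set(random.sample(collection, 2_000))
b = set(random.sample(collection, 2_000))
print(f"Estimated size: {estimate_collection_size(a, b):,.0f}")
```

In practice the hard part, which the paper addresses, is that samples from a non-cooperative web collection are obtained via queries and are far from uniform, which biases simple estimators like this one.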

Category: COMMIT | Comments off