Steven Verkuil graduates on Reference Extraction Techniques

July 6th, 2016, posted by Djoerd Hiemstra

Journal Citation Statistics for Library Collections using Document Reference Extraction Techniques

by Steven Verkuil

Providing access to journals often comes with a considerable subscription fee for universities. It is not always clear how these journal subscriptions actually contribute to ongoing research. This thesis provides a multistage process for evaluating which journals are actively referenced in publications. Our software tool for journal citation reports, CiteRep, is designed to aid decision-making processes by providing statistics about the number of times a journal is referenced in a document set. Citation reports are automatically generated from online repositories containing PDF documents. The process of extracting citations and identifying journals is user and maintenance friendly. CiteRep allows generated reports to be filtered by year, faculty and study, providing detailed insight into journal usage for specific user groups. Our software tool achieves an overall weighted precision and recall of 66.2% when identifying journals in a fresh set of PDF documents. While leaving open some areas of improvement, CiteRep outperforms the two most popular citation parsing libraries, ParsCit and FreeCite, with respect to journal identification accuracy. CiteRep should be considered for creation of journal citation reports from document repositories.
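
To give a feel for the kind of statistic such a report contains, here is a minimal sketch that counts how often known journal titles occur in already-extracted reference strings. It is not CiteRep's actual pipeline: the function names, the loose title matching, and the made-up references and journal titles are all assumptions for illustration.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and collapse punctuation/whitespace so titles compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def count_journal_references(references, journal_titles):
    """Count how many reference strings mention each known journal title."""
    normalized_titles = {title: normalize(title) for title in journal_titles}
    counts = Counter()
    for reference in references:
        # Pad with spaces so a title only matches on whole-word boundaries.
        padded = f" {normalize(reference)} "
        for title, norm in normalized_titles.items():
            if f" {norm} " in padded:
                counts[title] += 1
    return counts

# Hypothetical reference strings and journal titles, invented for this example.
references = [
    "A. Author. Some title. Journal of Example Studies, 12(3), 2009.",
    "B. Author. Another title. Example Letters 5 (2010).",
    "C. Author. Third title. journal of example studies 14(1), 2011.",
    "D. Author. A book chapter with no journal, 2012.",
]
journals = ["Journal of Example Studies", "Example Letters"]
report = count_journal_references(references, journals)
```

Exact substring matching like this misses abbreviated or misspelled titles, which is one reason journal identification in real PDF documents is much harder than the sketch suggests.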

[download pdf]

Clone CiteRep on GitHub.

Mohammad Khelghati defends PhD thesis on Deep Web Entity Monitoring

June 2nd, 2016, posted by Djoerd Hiemstra

by Mohammadreza Khelghati

Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist or a business analyst, your primary concern is to access all the data relevant to your topics of interest. In such an information-thirsty environment, accessing every source of information is valuable. This emphasizes the role of the web as one of the biggest and main sources of data. In accessing web data through either general search engines or direct querying of deep web sources, the laborious work of querying, navigating results, downloading, storing and tracking data changes is a burden on the shoulders of users. To reduce this labor-intensive work of accessing data, (semi-)automatic harvesters have a crucial role. However, they lack a number of functionalities that we discuss and address in this work.
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also help users access a complete collection of data relevant to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles. First, we try to find methods that can achieve the best coverage in harvesting data for a topic. Although using a fully automatic general harvester facilitates accessing web data, it is not a complete solution for achieving thorough data coverage on a given topic. Some search engines, in both the surface web and the deep web, restrict the number of requests from a user or limit the number of returned results presented to them. We suggest an efficient approach which can work around these limitations and achieve complete data coverage.
Second, we investigate reducing the cost of harvesting a website, in terms of the number of submitted requests, by estimating its actual size. Harvesting tasks continue until they hit the query submission limits imposed by search engines or consume all the allocated resources. To prevent this undesirable situation, we need to know the size of the targeted source. For a website that hides the true size of the data it holds, we suggest an accurate method to estimate its size.
As the third challenge, we focus on monitoring data changes over time in web data repositories. This information is helpful in providing the most up-to-date answers to information needs of users. The fast evolving web adds extra challenges for having an up-to-date data collection. Considering the costly process of harvesting, it is important to find methods which facilitate efficient re-harvesting processes.
Lastly, we combine our experiences in harvesting with the studies in the literature to suggest a general framework for designing and developing a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings.
These steps bring further improvements to data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
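
The size estimation step mentioned above can be illustrated with a classic capture-recapture (Lincoln-Petersen) estimate. This is a swapped-in textbook technique, not necessarily the method proposed in the thesis, and the function name and sample data are assumptions. It takes two samples of document identifiers drawn from the same source and uses their overlap to estimate the total size.

```python
def estimate_collection_size(sample_a, sample_b):
    """Lincoln-Petersen estimate: N is roughly |A| * |B| / |A intersect B|.

    Assumes both samples are drawn independently and uniformly at random,
    which ranked search results generally violate.
    """
    ids_a, ids_b = set(sample_a), set(sample_b)
    overlap = len(ids_a & ids_b)
    if overlap == 0:
        raise ValueError("samples do not overlap; cannot estimate size")
    return len(ids_a) * len(ids_b) / overlap

# Two hypothetical samples of 100 document ids each, sharing 20 ids.
sample_a = range(0, 100)
sample_b = range(80, 180)
estimated_size = estimate_collection_size(sample_a, sample_b)  # 100 * 100 / 20
```

Because search engines return ranked rather than random results, a plain estimate like this tends to be biased, so it should be read as a starting point only.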

[download pdf]

13th SSR on Deep Web Entity Monitoring

May 27th, 2016, posted by Djoerd Hiemstra

On the 2nd of June 2016, we organize the 13th Seminar on Search and Ranking on Deep Web Entity Monitoring with 3 invited speakers: Gianluca Demartini (University of Sheffield, UK), Andrea Calì (Birkbeck, University of London, UK), and Pierre Senellart (Télécom ParisTech, France).

More information at:

A new search engine for the university

March 24th, 2016, posted by Djoerd Hiemstra

As of today, the university is using our Distributed Search approach as its main search engine (and also stand-alone). The UT search engine offers its users not only the results from a large web crawl, but also live results from many sources that were previously invisible, such as courses, timetables, staff contact information, publications, the local photo database “Beeldbank”, vacancies, etc. The search engine combines about 30 such sources, and learns over time which sources should be included for a query, even if it has never seen that query or its results before.

University of Twente

Read more in the official announcement (in Dutch).

Efficient Web Harvesting Strategies for Monitoring Deep Web Content

March 22nd, 2016, posted by Djoerd Hiemstra

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

Focused Web Harvesting aims at achieving a complete harvest of a set of related web data for a given topic. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, in Focused Web Harvesting, we lack an efficient approach that detects changes of the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up-to-date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding changed content on the web for a given entity, performing at least 20 percent better on average.
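
As a small illustration of what “new and changed documents” means in practice, the sketch below compares two harvests of the same entity by content fingerprint. It only covers the bookkeeping side; it is not the paper's query-selection approach, and the function names and example URLs are assumptions.

```python
import hashlib

def fingerprint(content):
    """Return a stable hash of a document's text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def diff_harvests(previous, current):
    """Compare two harvests (dicts mapping url -> text).

    Returns two sorted lists: urls that are new, and urls whose text changed.
    """
    previous_prints = {url: fingerprint(text) for url, text in previous.items()}
    new, changed = [], []
    for url, text in current.items():
        if url not in previous_prints:
            new.append(url)
        elif previous_prints[url] != fingerprint(text):
            changed.append(url)
    return sorted(new), sorted(changed)

# Hypothetical harvests for one entity at two points in time.
previous = {
    "http://example.org/a": "old biography",
    "http://example.org/b": "tour dates",
}
current = {
    "http://example.org/a": "updated biography",
    "http://example.org/b": "tour dates",
    "http://example.org/c": "new interview",
}
new_urls, changed_urls = diff_harvests(previous, current)
```

Exact hashing flags even trivial edits (such as a changed timestamp) as changes; a real harvester would likely need a more tolerant comparison.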

[download pdf]

The software for this work is available as: HaverstED.

3TU NIRICT theme Data Science

January 12th, 2016, posted by Djoerd Hiemstra

The main objective of the NIRICT research in Data Science is to study the science and technology to unlock the intelligence that is hidden inside Big Data.
The amounts of data that information systems are working with are rapidly increasing. The explosion of data happens at a pace that is unprecedented, and in our networked world of today the trend is even accelerating. Companies have transactional data with trillions of bytes of information about their customers, suppliers and operations. Sensors in smart devices generate unparalleled amounts of sensor data. Social media sites and mobile phones have allowed billions of individuals globally to create their own enormous trails of data.
The driving force behind this data explosion is the networked world we live in, where information systems, organizations that employ them, people that use them, and processes that they support are connected and integrated, together with the data contained in those systems.

What happens in an internet minute in 2016?

Unlocking the Hidden Intelligence

Data alone is just a commodity; it is Data Science that converts big data into knowledge and insights. Intelligence is hidden in all sorts of data and data systems.
Data in information systems is usually created and generated for specific purposes: it is mostly designed to support operational processes within organizations. However, as a by-product, such event data provide an enormous source of hidden intelligence about what is happening, but organizations can only capitalize on that intelligence if they are able to extract it and transform the intelligence into novel services.
Analyzing the data provides opportunities for organizations to gather intelligence, to capitalize on the historic and current performance of their processes, and to exploit future chances for performance improvement.
Another rich source of information and insights is data from the Social Web. Analyzing Social Web Data provides governments, society and companies with better understanding of their community and knowledge about human behavior and preferences.
Each 3TU institute has its own Data Science program, where local data science expertise is bundled and connected to real-world challenges.

Delft Data Science (DDS) – TU Delft
Scientific director: Prof. Geert-Jan Houben

Data Science Center Eindhoven (DSC/e) – TU/e
Scientific director: Prof. Wil van der Aalst

Data Science Center UTwente (DSC UT) – UT
Scientific director: Dr. Djoerd Hiemstra

More information at:

#SupportTheCause: Online Protest and Advocacy Symposium

January 6th, 2016, posted by Djoerd Hiemstra

21-22 January 2016
University of Twente

If you’re interested in social media analysis and/or computational social science, there will be interesting guest speakers, including speakers from UCLA, TNO, TU Delft, Greenpeace, Sanquin, and Twitter.

Niels Visser graduates on automated web harvesting

December 16th, 2015, posted by Djoerd Hiemstra

Fully automated web harvesting using a combination of new and existing heuristics

by Niels Visser

Several techniques exist for extracting useful content from web pages. However, the definition of ‘useful’ is very broad and context-dependent. In this research, several techniques – existing ones and new ones – are evaluated and combined in order to extract object data in a fully automatic way. The data sources used for this are mostly web shops, sites that promote housing, and vacancy sites. The data to be extracted from these pages are, respectively, items, houses and vacancies. Three kinds of approaches are combined and evaluated: clustering algorithms, algorithms that compare pages, and algorithms that look at the structure of single pages. Clustering is done in order to differentiate between pages that contain data and pages that do not. The algorithms that extract the actual data are then executed on the cluster that is expected to contain the most useful data. The quality measures used to assess the performance of the applied techniques are precision and recall per page. It can be seen that without proper clustering, the algorithms that extract the actual data perform very badly. Whether or not clustering performs acceptably depends heavily on the web site. For some sites, URL-based clustering stands out, with precisions of around 33% and recalls of around 85%. URL-based clustering is therefore the most promising clustering method reviewed by this research. Of the extraction methods, the existing methods perform better than the alterations proposed by this research. Algorithms that look at the structure (intra-page document structure) perform best of all four methods that are compared, with an average recall between 30% and 50%, and an average precision ranging from very low (around 2%) to quite low (around 33%). Template induction, an algorithm that compares pages, also performs relatively well; however, it is more dependent on the quality of the clusters.
The conclusion of this research is that it is not yet possible to extract data from websites fully automatically using a combination of the methods that are discussed and proposed.
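
For readers unfamiliar with the quality measures used above, here is a minimal sketch of precision and recall for a single page, computed from the set of extracted items and the set of items that should have been extracted. The function name and the example item labels are assumptions; the thesis may define the per-page measures in more detail.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for one page.

    precision: fraction of extracted items that are correct.
    recall: fraction of expected items that were extracted.
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical page: four items extracted, three items actually present.
extracted_items = {"item-a", "item-b", "item-c", "item-d"}
expected_items = {"item-a", "item-b", "item-e"}
precision, recall = precision_recall(extracted_items, expected_items)
```

In this example two of the four extracted items are correct (precision 0.5) and two of the three expected items are found (recall 2/3), which shows how a method can have a reasonable recall while its precision stays low.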

Niek Tax wins the ENIAC thesis award

December 15th, 2015, posted by Djoerd Hiemstra

Another thesis prize for Niek Tax: Best master thesis in computer science in 2014/2015 at the University of Twente, awarded by Alumni Association ENIAC. Photo: Niek Tax receives the award from Johan Noltes on behalf of the ENIAC jury. Congrats, Niek! Other nominees were Justyna Chromik (DACS), Vincent Bloemen (FMT), Maarten Brilman (HMI), Tim Paauw (IEBIS), and Moritz Müller (SCS).

Niek Tax

Towards Complete Coverage in Focused Web Harvesting

December 1st, 2015, posted by Djoerd Hiemstra

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

With the goal of harvesting all information about a given entity, in this paper, we try to harvest all matching documents for a given query submitted on a search engine. The objective is to retrieve all information about for instance “Michael Jackson”, “Islamic State”, or “FC Barcelona” from indexed data in search engines, or hidden data behind web forms, using a minimum number of queries. Policies of web search engines usually do not allow accessing all of the matching query search results for a given query. They limit the number of returned documents and the number of user requests. These limitations are also applied in deep web sources, for instance in social networks like Twitter. In this work, we propose a new approach which automatically collects information related to a given query from a search engine, given the search engine’s limitations. The approach minimizes the number of queries that need to be sent by analysing the retrieved results and combining this analysed information with information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measuring the total number of unique documents found per query.

To be presented at the 17th International Conference on Information Integration and Web-based Applications & Services on 11 - 13 December 2015 in Brussels, Belgium

[download pdf]