Two of my PhD students, Mohammad Khelgati and Victor de Graaff, are presenting at the Dutch-Belgian DataBase Day (DBDBD): Mohammad on “Size Estimation of Non-Cooperative Data Collections” and Victor on “Semantic Enrichment of GPS Trajectories”.
A journal paper with my vision of data interoperability and a basic formalization of it has been accepted for a special issue of the Journal of IT, volume 54, issue 3.
Managing Uncertainty: The Road Towards Better Data Interoperability.
Maurice van Keulen
Data interoperability encompasses the many data management activities needed for effective information management in anyone’s or any organization’s everyday work, such as data cleaning, coupling, fusion, mapping, and information extraction. It is our conviction that a significant amount of the money and time in IT devoted to these activities is about dealing with one problem: “semantic uncertainty”. Sometimes data is subjective, incomplete, not current, or incorrect; sometimes it can be interpreted in different ways; etc. In our opinion, clean correct data is only a special case, hence data management technology should treat data quality problems as a fact of life, not as something to be repaired afterwards. Recent approaches treat uncertainty as an additional source of information which should be preserved to reduce its impact. We believe that the road towards better data interoperability is to be found in teaching our data processing tools and systems about all forms of doubt and how to live with them. In this paper, we show for several data interoperability use cases (deduplication, data coupling/fusion, and information extraction) how to formally model the associated data quality problems as semantic uncertainty. Furthermore, we provide an argument why our approach leads to better data interoperability in terms of natural problem exposure and risk assessment, more robustness and automation, reduced development costs, and potential for natural and effective feedback loops leveraging human attention.
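To give a flavour of what treating data quality problems as semantic uncertainty means in practice, here is a minimal sketch (Python; the records and probabilities are invented for illustration): rather than repairing a dubious value up front, the data keeps its alternative interpretations, each with a probability, and queries return probabilistically qualified answers.

```python
# An uncertain attribute: alternative interpretations with probabilities,
# kept in the data instead of being 'cleaned' away up front.
customer_city = {
    "c1": [("Enschede", 0.8), ("Eindhoven", 0.2)],  # dubious entry, kept as-is
    "c2": [("Amsterdam", 1.0)],                     # clean data: the special case
}

# A query returns each answer with the probability that it is correct,
# exposing the data quality problem instead of hiding a forced repair.
def customers_in(city):
    return {cid: p for cid, alts in customer_city.items()
            for c, p in alts if c == city}

print(customers_in("Enschede"))  # {'c1': 0.8}
```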
Together with one of my PhD students, Victor de Graaff, and Rolf de By, I got a paper accepted at the WI&C workshop at WWW 2012. It describes our research plans on geo-social recommendation in the COMMIT/TimeTrails project.
Towards Geosocial Recommender Systems
Victor de Graaff, Maurice van Keulen, Rolf de By
The usage of social network sites (SNSs), such as Facebook, and geosocial networks (GSNs), such as Foursquare, has increased tremendously over the past years. The willingness of users to share their current locations and experiences facilitates the creation of geographical recommender systems based on user generated content (UGC). This idea has already been used to create a substantial number of geosocial recommender systems (GRSs), such as Gogobot, TripIt, and Trippy, but it can also be applied to more complex scenarios, such as the recommendation of products with a strong binding to their region, like real estate or vacation destinations.
This extended form of GRS development requires advanced functionality for information collection (from the web, other social media, and sensors), information enrichment (such as data quality assessment and advanced data analysis), and personalized recommendation. The creation of a toolset to cope with these challenges is the goal of this research project, whose outline is presented in this paper.
The paper will be presented at the WI&C workshop co-located with WWW 2012, 16 April 2012, Lyon, France [details]
I wrote a journal paper with Fabian Panse (University of Hamburg) about handling ambiguous situations in deduplication in a probabilistic way. It has been accepted for the ACM Journal of Data and Information Quality.
Indeterministic Handling of Uncertain Decisions in Deduplication.
Fabian Panse, Norbert Ritter, and Maurice van Keulen
In current research and practice, deduplication is usually considered a deterministic approach in which tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not completely clear-cut which tuples represent the same real-world entity. In deterministic approaches, many realistic possibilities may be ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach to deduplication using a probabilistic target model, including techniques for a proper probabilistic interpretation of similarity matching results. Thus, instead of deciding for a most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
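The following is a small illustrative sketch in Python (the tuple identifiers and match probabilities are hypothetical) contrasting the deterministic and the full-indeterministic treatment of matching decisions:

```python
from itertools import product

# Hypothetical pairwise duplicate probabilities from similarity matching.
match_prob = {
    ("t1", "t2"): 0.55,  # ambiguous: barely more likely duplicates than not
    ("t3", "t4"): 0.95,  # clear-cut duplicates
}

# Deterministic deduplication: threshold every pair and commit to a single
# outcome; the ambiguous pair (t1, t2) is forced one way, so a false
# decision may be baked into the result.
deterministic = {pair: p >= 0.5 for pair, p in match_prob.items()}
print("deterministic:", deterministic)

# Indeterministic deduplication: keep every combination of decisions as a
# possible world, weighted by its probability.
pairs = list(match_prob)
for decisions in product([False, True], repeat=len(pairs)):
    prob = 1.0
    for pair, dup in zip(pairs, decisions):
        prob *= match_prob[pair] if dup else 1.0 - match_prob[pair]
    print(f"P = {prob:.4f}:", dict(zip(pairs, decisions)))
```

A semi-indeterministic method in the spirit of the paper would decide near-certain pairs such as (t3, t4) deterministically and keep alternatives only for genuinely ambiguous pairs, heuristically reducing the number of possible worlds.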
One of my PhD students, Mena Badieh Habib, has given a talk at the Dutch-Belgian DataBase Day (DBDBD) on “Named Entity Extraction and Disambiguation from an Uncertainty Perspective”.
I wrote a position paper about a different approach towards the development of information extractors, which I call Sherlock Holmes-style after his famous quote “when you have eliminated the impossible, whatever remains, however improbable, must be the truth”. The idea is that we fundamentally treat annotations as uncertain. We even start from a “no knowledge”, i.e., “everything is possible”, state and then interactively add more knowledge, applying it directly to the annotation state by removing annotations that have become impossible and recalculating the probabilities of the remaining ones. For example, “Paris Hilton”, “Paris”, and “Hilton” can all be interpreted as a City, Hotel, or Person name. But adding knowledge like “If a phrase is interpreted as a Person name, then its subphrases should not be interpreted as a City” makes the annotations <"Paris Hilton":Person Name> and <"Paris":City> mutually exclusive. Observe that initially all annotations were independent, whereas these two are now dependent. We argue in the paper that the main challenge in this approach lies in efficient storage and conditioning of probabilistic dependencies, because trivial approaches do not work.
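To make this concrete, here is a minimal sketch (Python; the interpretation sets and probabilities are made up) of one elimination step: enumerate the joint possibilities, remove those the new knowledge declares impossible, and renormalize, after which the surviving annotations are no longer independent.

```python
from itertools import product

# Each annotation is a phrase with candidate interpretations and prior
# probabilities (made-up numbers; initially the annotations are independent).
annotations = {
    "Paris Hilton": {"Person": 0.5, "Hotel": 0.3, "City": 0.2},
    "Paris":        {"City": 0.6, "Person": 0.4},
}

# A joint possibility ("world") assigns one interpretation to every phrase.
phrases = list(annotations)
worlds = []
for combo in product(*(annotations[p].items() for p in phrases)):
    labels = {p: label for p, (label, _) in zip(phrases, combo)}
    prob = 1.0
    for _, p in combo:
        prob *= p
    worlds.append((labels, prob))

# Knowledge rule: if a phrase is a Person name, its subphrases are no City.
def allowed(labels):
    return not (labels["Paris Hilton"] == "Person" and labels["Paris"] == "City")

# Sherlock Holmes-style: eliminate the impossible, renormalize the rest.
surviving = [(lab, pr) for lab, pr in worlds if allowed(lab)]
total = sum(pr for _, pr in surviving)
surviving = [(lab, pr / total) for lab, pr in surviving]

# The marginals now encode the induced dependency between the annotations.
for phrase in phrases:
    for label in annotations[phrase]:
        marginal = sum(pr for lab, pr in surviving if lab[phrase] == label)
        print(f"P({phrase!r} = {label}) = {marginal:.3f}")
```

Note that this exhaustive enumeration of joint possibilities is precisely the trivial approach that does not scale, hence the paper's emphasis on efficient storage and conditioning of the probabilistic dependencies.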
Handling Uncertainty in Information Extraction.
Maurice van Keulen, Mena Badieh Habib
This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.
The paper will be presented at the URSW workshop co-located with ISWC 2011, 23 October 2011, Bonn, Germany [details]
A Master's student performed a problem exploration for the PayDIBI project. This is the report he wrote.
Integration of Biological Sources – Exploring the Case of Protein Homology
Tjeerd W. Boerman, Maurice van Keulen, Paul van der Vet, Edouard I. Severing (Wageningen University)
Data integration is a key issue in the domain of bioinformatics, which deals with huge amounts of heterogeneous biological data that grow and change rapidly. This paper serves as an introduction to the field of bioinformatics and the biological concepts it deals with, and as an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration effort. We propose several directions for future work in which uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.
One of my PhD students, Mena Badieh Habib, and I submitted a paper about improving the effectiveness of named entity extraction (NEE) with what we call “the reinforcement effect” to the MUD workshop of VLDB 2011.
Named Entity Extraction and Disambiguation: The Reinforcement Effect.
Mena Badieh Habib, Maurice van Keulen
Named entity extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. Although these topics are highly dependent, almost no existing work examines this dependency. It is the aim of this paper to examine this dependency and to show how one affects the other, and vice versa. We conducted experiments on a set of descriptions of holiday homes with the aim of extracting and disambiguating toponyms as a representative example of named entities. We experimented with three approaches to disambiguation with the purpose of inferring the country of the holiday home. We examined how the effectiveness of extraction influences the effectiveness of disambiguation and, reciprocally, how filtering out ambiguous names (an activity that depends on the disambiguation process) improves the effectiveness of extraction. Since this, in turn, may improve the effectiveness of disambiguation again, it shows that extraction and disambiguation may reinforce each other.
The paper will be presented at the MUD workshop co-located with VLDB 2011, 29 August 2011, Seattle, USA [details]
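As a toy illustration of this reinforcement loop (a Python sketch; the gazetteer, the example text, and the voting scheme are invented and far simpler than the approaches in the paper):

```python
# Toy gazetteer: toponym -> candidate countries (invented data).
gazetteer = {
    "Paris":   ["FR", "US"],  # Paris, France vs Paris, Texas
    "Texas":   ["US"],
    "Houston": ["US"],
    "Nice":    ["FR"],        # also an English adjective: false-positive risk
}

def extract(text, stoplist=frozenset()):
    # Naive extraction: gazetteer lookup, minus names filtered as unreliable.
    return [t for t in gazetteer if t in text and t not in stoplist]

def disambiguate(toponyms):
    # Naive disambiguation: each name votes, split over its candidate countries.
    votes = {}
    for t in toponyms:
        for c in gazetteer[t]:
            votes[c] = votes.get(c, 0) + 1 / len(gazetteer[t])
    return max(votes, key=votes.get, default=None)

text = "Nice holiday home in Paris, Texas, near Houston."

toponyms = extract(text)          # ['Paris', 'Texas', 'Houston', 'Nice']
country = disambiguate(toponyms)  # 'US': 2.5 votes vs 1.5 for 'FR'

# Reinforcement: filter out names that contradict the inferred country.
# The false positive 'Nice' disappears, which sharpens disambiguation,
# which could in turn justify another round of filtering.
stoplist = {t for t in toponyms if country not in gazetteer[t]}
print(extract(text, stoplist), disambiguate(extract(text, stoplist)))
```

Filtering improves extraction precision; the cleaner extraction then feeds a more confident disambiguation, which is the reinforcement effect in miniature.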
One of my PhD students, Mena Badieh Habib, submitted a paper with his research plans for the Neogeography project to the PhD workshop of ICDE 2011.
Neogeography: The Challenge of Channelling Large and Ill-Behaved Data Streams
Mena Badieh Habib
Neogeography is the combination of user generated data and experiences with mapping technologies. In this paper we propose a research project to extract valuable structured information with a geographic component from unstructured user generated text in wikis, forums, or SMSes. The project intends to help workers' communities in developing countries share their knowledge, providing a simple and cheap way to contribute and benefit using the available communication technology.
The paper will be presented at the PhD workshop co-located with ICDE 2011, 11 April 2011, Hannover, Germany [details]
For his “Research Topic” course, MSc student Emiel Hollander experimented with a mapping from Probabilistic XML to the probabilistic relational database Trio to investigate whether it is feasible to use Trio as a back-end for processing XPath queries on Probabilistic XML.
Storing and Querying Probabilistic XML Using a Probabilistic Relational DBMS
Emiel Hollander, Maurice van Keulen
This work explores the feasibility of storing and querying probabilistic XML in a probabilistic relational database. Our approach is to adapt known techniques for mapping XML to relational data such that the possible worlds are preserved. We show that this approach can work for any XML-to-relational technique by adapting a representative schema-based technique (inlining) as well as a representative schemaless one (XPath Accelerator). We investigate the maturity of probabilistic relational databases for this task through experiments with one of the state-of-the-art systems, Trio.
The paper will be presented at the 4th International Workshop on Management of Uncertain Data (MUD 2010) co-located with VLDB, 13 September 2010, Singapore [details]
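As a rough illustration of the core requirement (Python; the document, the shredding scheme, and the probabilities are invented and much simpler than the inlining and XPath Accelerator mappings in the paper): whichever XML-to-relational shredding is used, each node tuple must carry its probabilistic-choice information so that the set of possible worlds is preserved and queries can be answered per world.

```python
from itertools import product

# Toy probabilistic XML: a 'mux' node offers mutually exclusive alternatives.
# <person><name> {mux: "J. Smith" p=0.6 | "John Smith" p=0.4} </name></person>

# Edge-table shredding (one tuple per node), extended with an event column:
# (node_id, parent_id, label, event). The event ties a tuple to a choice.
nodes = [
    (1, None, "person",     None),
    (2, 1,    "name",       None),
    (3, 2,    "J. Smith",   ("x", 0)),  # alternative 0 of choice x
    (4, 2,    "John Smith", ("x", 1)),  # alternative 1 of choice x
]
choices = {"x": [0.6, 0.4]}  # probability of each alternative

# Possible worlds: one assignment of an alternative to every choice node.
def worlds():
    names = list(choices)
    for combo in product(*(range(len(choices[n])) for n in names)):
        assignment = dict(zip(names, combo))
        prob = 1.0
        for n, alt in assignment.items():
            prob *= choices[n][alt]
        yield assignment, prob

# A query such as //name/text() is answered per world: a tuple qualifies
# if its event agrees with the world's choice assignment.
for assignment, prob in worlds():
    answer = [label for (_, parent, label, event) in nodes
              if parent == 2 and (event is None
                                  or assignment[event[0]] == event[1])]
    print(f"P = {prob:.1f}: {answer}")
```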