Tag-Archive for ◊ uncertainty in data ◊

• Wednesday, February 27th, 2013

Tomorrow, 28 Feb 2013, a PhD student of mine, Victor de Graaff, is going to give a presentation on how to estimate the boundaries of objects for which you only have a point and other public data, such as OpenStreetMap [Announcement].
Point of Interest to Region of Interest Conversion
Date/Time: Thursday, February 28, 2013 – 13:30 to 14:30; Room: 0-142
GPS trajectories from a mobile device, such as a smartphone, indirectly contain a vast amount of information on the interests of the owner of the device. Collections of GPS trajectories even provide insight into the popularity of locations, and the time spent at those locations. To obtain this information, the visited places on such a trajectory need to be recognized. However, the location information on a point of interest (POI) in a database is normally limited to an address and a GPS coordinate, rather than a geometry describing its boundaries. To create a match with a GPS trajectory, a two-dimensional shape representing this place, a region of interest (ROI), is needed. In the absence of detailed spatial data that is expensive and hard to obtain, such as cadastral data, we need to estimate this ROI. In this research project, we bridge this gap by presenting several approaches to estimate the size and shape of the ROI, and validate these estimations against the cadastral data of the city of Enschede, The Netherlands.

• Monday, September 05th, 2011

I wrote a position paper about a different approach towards development of information extractors, which I call Sherlock Holmes-style based on his famous quote “when you have eliminated the impossible, whatever remains, however improbable, must be the truth”. The idea is that we fundamentally treat annotations as uncertain. We even start with a “no knowledge”, i.e., “everything is possible” starting point and then interactively add more knowledge, apply the knowledge directly to the annotation state by removing possible annotations and recalculating the probabilities of the remaining ones. For example, “Paris Hilton”, “Paris”, and “Hilton” can all be interpreted as a City, Hotel or Person name. But adding knowledge like “If a phrase is interpreted as a Person Name, then its subphrases should not be interpreted as a City” makes the annotations <"Paris Hilton":Person Name> and <"Paris":City> mutually exclusive. Observe that initially all annotations were independent, and these two are now dependent. We argue in the paper that the main challenge in this approach lies in efficient storage and conditioning of probabilistic dependencies, because trivial approaches do not work.
Handling Uncertainty in Information Extraction.
Maurice van Keulen, Mena Badieh Habib
This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.

The paper will be presented at the URSW workshop co-located with ISWC 2011, 23 October 2011, Bonn, Germany [details]
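The conditioning step described in this post can be illustrated with a small sketch. It is a brute-force illustration under stated assumptions, not the method from the paper: the prior probabilities (0.5 each), the three candidate annotations, and all function names are hypothetical. It enumerates every possible world over the initially independent annotations, removes the worlds that violate the Person/City rule, and renormalizes the rest.

```python
from itertools import product

# Hypothetical candidate annotations with illustrative priors
# (the 0.5 values are assumptions, not taken from the paper).
PRIORS = {
    ("Paris Hilton", "Person"): 0.5,
    ("Paris", "City"): 0.5,
    ("Hilton", "Hotel"): 0.5,
}


def possible_worlds(priors):
    """Enumerate all worlds (subsets of annotations), assuming independence."""
    annotations = list(priors)
    worlds = {}
    for choice in product([True, False], repeat=len(annotations)):
        probability = 1.0
        for ann, present in zip(annotations, choice):
            probability *= priors[ann] if present else 1.0 - priors[ann]
        world = frozenset(a for a, present in zip(annotations, choice) if present)
        worlds[world] = probability
    return worlds


def person_excludes_city(world):
    """Rule: a Person annotation forbids City annotations on its subphrases."""
    for phrase, label in world:
        if label != "Person":
            continue
        for other_phrase, other_label in world:
            if (other_label == "City" and other_phrase != phrase
                    and other_phrase in phrase):
                return False
    return True


def condition(worlds, rule):
    """Remove worlds that violate the rule and renormalize the remainder."""
    kept = {w: p for w, p in worlds.items() if rule(w)}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}


def marginal(worlds, annotation):
    """Probability that an annotation holds, summed over the worlds containing it."""
    return sum(p for w, p in worlds.items() if annotation in w)


before = possible_worlds(PRIORS)
after = condition(before, person_excludes_city)
for ann in PRIORS:
    print(ann, round(marginal(before, ann), 3), "->", round(marginal(after, ann), 3))
```

Enumerating worlds like this grows exponentially with the number of annotations, which is one way to see why trivial approaches do not work and why efficient storage and conditioning of the dependencies is the real challenge.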

• Thursday, October 21st, 2010

Fabian Panse from the University of Hamburg in Germany just launched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.

• Friday, July 30th, 2010

For his “Research Topic” course, MSc student Emiel Hollander experimented with a mapping from Probabilistic XML to the probabilistic relational database Trio to investigate whether or not it is feasible to use Trio as a back-end for processing XPath queries on Probabilistic XML.
Storing and Querying Probabilistic XML Using a Probabilistic Relational DBMS
Emiel Hollander, Maurice van Keulen
This work explores the feasibility of storing and querying probabilistic XML in a probabilistic relational database. Our approach is to adapt known techniques for mapping XML to relational data such that the possible worlds are preserved. We show that this approach can work for any XML-to-relational technique by adapting a representative schema-based (inlining) as well as a representative schemaless technique (XPath Accelerator). We investigate the maturity of probabilistic relational databases for this task with experiments with one of the state-of-the-art systems, called Trio.

The paper will be presented at the 4th International Workshop on Management of Uncertain Data (MUD 2010) co-located with VLDB, 13 September 2010, Singapore [details]

• Friday, June 18th, 2010

I wrote an article about “uncertain databases” for DB/M Database Magazine of Array Publications. It appears in issue 4, the June issue, so it is on sale now. The theme of this special issue is “Data quality”.
Onzekere databases (Uncertain databases)
A recent development in database research concerns so-called ‘uncertain databases’. This article describes what uncertain databases are, how they can be used, and which applications in particular could benefit from this technology [details].

• Tuesday, April 27th, 2010

I wrote an article about “uncertain databases” for Database Magazine of Array Publications. It will be published in issue 4, a special issue on “Data quality”.

• Friday, March 12th, 2010

To improve the integration of the new faculty ITC (Geo-Information Science and Earth Observation) into the university, the boards of directors of ITC and UT decided some time ago to subsidize several cooperation projects, each with two PhD students, one at ITC and one at the UT. I am involved in one: “Neogeography: the challenge of channelling large and ill-behaved data streams” (see description below). Rolf de By (ITC) and I presented our Neogeography project at the kick-off meeting on 12 March 2010 [presentation]. Rolf’s PhD student is Clarisse Kagoyire and she arrived in The Netherlands just in time to make it to the meeting. My PhD student is Mena Badieh Habib; he will start 1 May 2010.

Neogeography: the challenge of channelling large and ill-behaved data streams
In this project, we develop XML-based data technology to support the channeling of large and ill-behaved neogeographic data streams. In neogeography, geographic information is derived from end-users, not from official bodies like mapping agencies, cadasters or other official, (para-)governmental organizations. The motivation is that multiple (neo)geographic information sources on the same phenomenon can be mutually enriching.
Content provision and feedback from large communities of end-users have great potential for sustaining a high level of data quality. The technology is meant to reach a substantial user community in the less-developed world through content provision and delivery via cell phone networks. Exploiting such neogeographic data requires, among other things, the extraction of the where and when from textual descriptions. This comes with intrinsic uncertainty in space, time, but also thematically in terms of entity identification: which is the restaurant, bus stop, farm, market, forest mentioned in this information source? The rise of sensor networks adds to the mix a badly needed verification mechanism for the real-time neogeographic data.
We strive for a proper mix of carefully integrated techniques in geoinformation handling, approaches to spatiotemporal imprecision and incompleteness, as well as data augmentation through sensors in a generic framework with which purpose-oriented end-user communities can be served appropriately.
The UT PhD position focuses on spatiotemporal data technology in XML databases and theory and support technology for storage, manipulation and reasoning with spatiotemporal and thematic uncertainty. The work is to be validated through testbed use cases, such as the H20 project with google.org (water consumers in Zanzibar), AGCommons project with the Gates Foundation (smallholder farmers in sub-Saharan Africa), and other projects with large user communities.

• Friday, January 22nd, 2010

On Friday 22 January 2010, Michiel Punter defended his MSc thesis “Multi-Source Entity Resolution”. The MSc project was supervised by me, Ander de Keijzer, and Riham Abdel Kader.

“Multi-Source Entity Resolution” [download]
Background: The focus of this research was on multi-source entity resolution in the setting of pair-wise data integration. In contrast to most existing approaches to entity resolution this research does not consider matching to be transitive. A consequence of this is that entity resolution on multiple sources is not guaranteed to be associative. The goal of this research was to construct a generic model for multi-source entity resolution in the setting of pair-wise data integration that is associative.
Results: The main contributions of this research are: (1) a formal model for multi-source entity resolution and (2) strategies that can be used to resolve matching conflicts in a way that renders multi-source entity resolution associative. The possible worlds semantics is used to handle uncertainty originating from possible matches. The presented model is generic enough to allow different match and merge functions as well as different strategies to resolve matching conflicts.
Conclusions: A formalization of an example of multi-source entity resolution is presented to show the utility of the proposed model. By using small examples in which three sources are integrated it is shown that the strategies resulted in associative behavior of the integrate function.
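Why non-transitive matching breaks associativity can be seen in a toy sketch. This is not the model from the thesis: the match function (values within 1 of each other), the merge function (keep the smaller value), the `integrate` procedure, and the three single-record sources are all assumptions chosen only to make the effect visible.

```python
def matches(a, b):
    """Toy match function: values within 1 of each other match (not transitive)."""
    return abs(a - b) <= 1


def merge(a, b):
    """Toy merge function: keep the smaller value as the representative."""
    return min(a, b)


def integrate(left, right):
    """Pair-wise integration: merge each right record into the first
    matching left record, or add it as a new record if nothing matches."""
    result = list(left)
    for r in right:
        for i, l in enumerate(result):
            if matches(l, r):
                result[i] = merge(l, r)
                break
        else:
            result.append(r)
    return sorted(result)


# Three hypothetical sources: 1 matches 2, and 2 matches 3, but 1 does not match 3.
s1, s2, s3 = [1], [2], [3]

left_first = integrate(integrate(s1, s2), s3)
right_first = integrate(s1, integrate(s2, s3))

print("(s1 + s2) + s3 =", left_first)
print("s1 + (s2 + s3) =", right_first)
```

The two integration orders give different results, so this naive `integrate` is not associative. Strategies for resolving such matching conflicts, as studied in the thesis, are what would be needed to restore associativity.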

• Thursday, November 05th, 2009

On Thursday 5 November 2009, Tjitze Rienstra defended his MSc thesis “Dealing with uncertainty in the semantic web”. The MSc project was supervised by me, Paul van der Vet, and Maarten Fokkinga. The work was evaluated by the committee as excellent and received the rarely awarded grade of 10.

“Dealing with uncertainty in the semantic web” [download]
Standardizing the Semantic Web is still an ongoing process. For some aspects, the standardization seems to have completed. For example, the syntax layer, the RDF data model layer and the RDFS and OWL semantic extensions have proven to fulfill their purpose in real world applications. Other aspects, while necessary to realize the greater ideal of the Semantic Web, are yet to be standardized. One of these is dealing with uncertainty. Like classical logic, the languages of the Semantic Web (RDF, RDFS and OWL) work under the assumption that knowledge is certain. Many forms of knowledge, e.g. in computer vision, computational linguistics and information retrieval, exhibit notions of uncertainty. Uncertainty also arises as a side effect of knowledge integration and ontology mapping. This thesis describes an extension for the Semantic Web to deal with uncertainty. The extension, called URDF (Uncertain RDF), extends RDF with the capability to express uncertainty by allowing RDF formulas to be associated with probabilities. It not only extends RDF, but also supports the semantics of RDFS and part of OWL. The main contribution is an extension that adheres to the incremental design of the Semantic Web language stack. It can act as a unifying framework for different kinds of probabilistic representation and reasoning, at different levels of expressivity (RDF, RDFS or OWL). In this thesis, we focus on two kinds of reasoning: rule-based reasoning with RDFS/OWL knowledge, and Bayesian networks and inference.

• Monday, September 07th, 2009

In cooperation with ITC (International Institute for Geo-Information Science and Earth Observation), we have a PhD position available on Neogeography: the challenge of channeling large and ill-behaved data streams. In neogeography, geographic information is derived from end-users, not from official bodies. The technology is meant to reach a substantial user community in the less-developed world through content provision and delivery via cell phone networks. Exploiting such neogeographic data requires, among other things, the extraction of the where and when from textual descriptions. This comes with intrinsic uncertainty in space, time, but also thematically in terms of entity identification: which is the restaurant, bus stop, farm, market, forest mentioned in this information source? Anyone with an MSc degree interested in doing PhD research on this topic is welcome to apply before October 10 (see the vacancy for details).