Fabian Panse from the University of Hamburg in Germany just launched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.
Urgently wanted: a student for the graduation assignment below, for the ESCAPE project.
ESCAPE is a project on a new form of scientific communication that is no longer based on articles alone. It builds on semantic web technology, with which broad knowledge about articles, data, results, researchers, projects, organizations, and the relations among them can be stored, queried, and manipulated. Entering this data and knowledge is rather labor-intensive, however. This assignment is about developing tools for automatically enriching the data and knowledge. At the lowest level this means importing publication data from publishers' websites and the like; at a higher level it means enrichment by automatically creating links to Open Linked Data and other databases and websites.
On Wednesday 14 July 2010, Tom Palsma defended his MSc thesis entitled “Discovering groups using short messages from social network profiles”. The MSc project was carried out at Topicus FinCare. It was supervised by me, Dolf Trieschnigg (UT), Jasper Laagland (FinCare), and Wouter de Jong (FinCare).
Discovering groups using short messages from social network profiles [download]
In the past few years, people have increasingly used the internet to share their photos, thoughts and activities via photo albums, user profiles, blogs and short text messages on online social networking sites such as Facebook and MySpace. This information could be very useful for (personal) marketing and advertising. Besides the groups that users of social network sites form explicitly, people may be related through similar interests that they reveal implicitly in short text messages.
This research focuses on discovering groups, taking into account semantic relations between user profiles, and on describing the characteristics of the groups and relations. Because it is hard to discover these types of relations at word level by matching similar words in messages, we introduce a hierarchical structure of concepts obtained from the Wikipedia category system. With it, we discover groups and (semantic) relations at more abstract conceptual levels between profiles with short text messages obtained from Twitter.
In order to provide a general approach that is not limited to a specific set of concepts, we use a naive classification approach. Concepts (and their parent concepts) are assigned to profiles when concept (related) terms occur in a short message of the profile. Manual evaluation of this approach shows that 37.4% of the assignments are correct (the precision), which results in an F-score of 0.54. To improve the precision of the classification results we use Support Vector Machines. Using features related to characteristics of the concept structure and the relations between concepts and profiles improves the F-score by 14 percentage points, to 0.68.
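The naive assignment step can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code from the thesis: the tiny concept hierarchy `PARENTS`, the term lists in `TERMS`, and the function names are all hypothetical, and real Wikipedia categories are far larger and messier.

```python
import re

# Hypothetical toy hierarchy: concept -> list of parent concepts.
PARENTS = {
    "Football": ["Team sports"],
    "Tennis": ["Racket sports"],
    "Team sports": ["Sports"],
    "Racket sports": ["Sports"],
    "Sports": [],
}

# Hypothetical concept-related terms that trigger an assignment.
TERMS = {
    "Football": {"goal", "striker"},
    "Tennis": {"racket", "serve"},
}


def with_ancestors(concept):
    """Return the concept together with all of its (transitive) parents."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def assign_concepts(messages):
    """Assign a concept and its parents when one of its terms occurs in a message."""
    assigned = set()
    for message in messages:
        words = set(re.findall(r"[a-z']+", message.lower()))
        for concept, terms in TERMS.items():
            if words & terms:
                assigned |= with_ancestors(concept)
    return assigned


profile_messages = ["What a goal!", "Stuck in traffic again"]
print(sorted(assign_concepts(profile_messages)))
```

Plain word matching like this is cheap and domain-independent, but it cannot tell a football "goal" from a personal "goal", which is in line with the low precision reported for the naive approach.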
The grouping process consists of clustering of statistical data of concept occurrences in user profiles. Interesting groups discovered based on the clustering results are groups of concepts that are not grouped together in the original Wikipedia category structure. Besides these types of groups the results also show groups of concepts that have a semantic relation, which is not reflected in the Wikipedia category structure. This information could be used to improve the Wikipedia category structure.
The overall process shows that the usage of hierarchical concepts and clustering helps to discover groups based on semantic relations on abstract conceptual levels. The selection of concepts and their assignment to user profiles could guide the grouping results to desired domains, and the concepts help to describe the groups. However, due to problems with the ambiguous meaning of concepts and the characteristics of the messages, another approach to assigning concepts could improve the quality of the discovered groups. Determining how useful the groups are for marketing and advertising requires more research.
On Friday 22 January 2010, Michiel Punter defended his MSc thesis “Multi-Source Entity Resolution”. The MSc project was supervised by me, Ander de Keijzer, and Riham Abdel Kader.
“Multi-Source Entity Resolution” [download]
Background: The focus of this research was on multi-source entity resolution in the setting of pair-wise data integration. In contrast to most existing approaches to entity resolution this research does not consider matching to be transitive. A consequence of this is that entity resolution on multiple sources is not guaranteed to be associative. The goal of this research was to construct a generic model for multi-source entity resolution in the setting of pair-wise data integration that is associative.
Results: The main contributions of this research are: (1) a formal model for multi-source entity resolution and (2) strategies that can be used to resolve matching conflicts in a way that renders multi-source entity resolution associative. The possible worlds semantics is used to handle uncertainty originating from possible matches. The presented model is generic enough to allow different match and merge functions as well as different strategies to resolve matching conflicts.
Conclusions: A formalization of an example of multi-source entity resolution is presented to show the utility of the proposed model. By using small examples in which three sources are integrated it is shown that the strategies resulted in associative behavior of the integrate function.
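To see why non-transitive matching makes pair-wise integration order-dependent, here is a deliberately tiny Python sketch with three sources. It only illustrates the problem; it is not the model or the conflict-resolution strategies from the thesis, and it ignores uncertainty and possible worlds entirely. The numeric records, the tolerance-based `match`, the averaging `merge`, and the `integrate` function are all assumptions of mine.

```python
def match(x, y, tolerance=5):
    """Hypothetical match function: two records match when their values are close."""
    return abs(x - y) <= tolerance


def merge(x, y):
    """Hypothetical merge function: replace two matching records by their midpoint."""
    return (x + y) // 2


def integrate(left, right):
    """Pair-wise integration: merge each right record into the first matching left record."""
    result = list(left)
    for r in right:
        for i, l in enumerate(result):
            if match(l, r):
                result[i] = merge(l, r)
                break
        else:
            # No left record matched, so the right record is kept as a new entity.
            result.append(r)
    return sorted(result)


a, b, c = [10], [14], [18]

# Matching is not transitive here: a~b and b~c, but not a~c.
print(match(10, 14), match(14, 18), match(10, 18))

# The two integration orders give different results.
print(integrate(integrate(a, b), c))
print(integrate(a, integrate(b, c)))
```

Integrating A with B first yields `[12, 18]`, while integrating B with C first yields `[10, 16]`, so this naive `integrate` is not associative.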
On Friday 23 October 2009, Irma Veldman defended her MSc thesis “Matching Profiles from Social Network Sites – Similarity Calculations with Social Network Support”. The MSc project was carried out at Topicus FinCare. It was supervised by me, Ander de Keijzer (UT), Jasper Laagland (FinCare), and Wouter de Jong (FinCare).
“Matching Profiles from Social Network Sites – Similarity Calculations with Social Network Support” [download]
In recent years, social networking sites have become very popular. Many people are members of one or more of these profile sites and tend to put a lot of information about themselves online. This data, which is often publicly available, can be useful for many purposes, and even more so once all available data about one person is retrieved and merged into a single profile. Detecting which profiles belong to the same person therefore becomes very important. This task is called Entity Resolution (ER).
In this research we develop a model to solve the ER problem for profiles from social networking sites. First we present a simple model. Then we try to improve this model by making use of the social networks a member can have on these sites. We believe that involving the networks can improve the results significantly.
The general idea is that we have two sites with profiles. With the model, we try to find out which profiles on the first site correspond to which profiles on the second site, assuming that a person has at most one profile on each site.
In the simple model, we compare all profiles of the first profile site against all profiles of the second site. This comparison will result in a score for each pair: the pairwise similarity score. The higher this score, the higher the probability that these profiles belong to the same person. The pairs that satisfy the so-called pairwise threshold are the candidate matches. From these candidate matches, the matches are chosen.
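A minimal Python sketch of the simple model might look as follows. The abstract does not say how the similarity score is computed or how the final matches are picked from the candidates, so both are assumptions here: the score is an average string similarity over shared attributes (using `difflib.SequenceMatcher` from the standard library), and matches are chosen greedily by score so that each profile is used at most once. The profile data and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def pairwise_similarity(p1, p2):
    """Assumed score: mean string similarity over the attributes both profiles have."""
    keys = p1.keys() & p2.keys()
    if not keys:
        return 0.0
    total = sum(
        SequenceMatcher(None, str(p1[k]).lower(), str(p2[k]).lower()).ratio()
        for k in keys
    )
    return total / len(keys)


def candidate_matches(site1, site2, pairwise_threshold):
    """Compare all profiles of site 1 against all profiles of site 2."""
    candidates = []
    for id1, p1 in site1.items():
        for id2, p2 in site2.items():
            score = pairwise_similarity(p1, p2)
            if score >= pairwise_threshold:
                candidates.append((id1, id2, score))
    return candidates


def choose_matches(candidates):
    """Assumed selection rule: best score first, at most one match per profile."""
    used1, used2, matches = set(), set(), []
    for id1, id2, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if id1 not in used1 and id2 not in used2:
            matches.append((id1, id2, score))
            used1.add(id1)
            used2.add(id2)
    return matches


# Hypothetical profiles on two sites.
SITE1 = {
    "a1": {"name": "Alice Jansen", "city": "Enschede"},
    "a2": {"name": "Bob de Vries", "city": "Zwolle"},
}
SITE2 = {
    "b1": {"name": "alice jansen", "city": "Enschede"},
    "b2": {"name": "Bob Vries", "city": "Zwolle"},
}

for id1, id2, score in choose_matches(candidate_matches(SITE1, SITE2, 0.8)):
    print(id1, id2, round(score, 2))
```

The all-against-all comparison is quadratic in the number of profiles, which is acceptable for a sketch but would need some form of blocking or indexing on larger data.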
In the network model, we start the same way. When the list of candidate matches is determined, the network phase is started. For each candidate match the network similarity score is calculated. This is done by determining the overlap in the networks of both profiles in the candidate match. The more overlap between the networks, the higher the network similarity score, the higher the probability that the profiles in the candidate match belong to the same person. This time, the candidate matches should satisfy a network threshold in order to stay a candidate match. Then from the remaining candidate matches, the matches are chosen.
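The network phase can be sketched in the same spirit. The abstract does not define how overlap between networks on two different sites is measured, so this sketch makes an assumption: a friend on the first site counts as overlapping when that friend has a candidate match among the friends on the second site. The identifiers, the data, and the function names are hypothetical.

```python
def network_similarity(id1, id2, friends1, friends2, candidate_pairs):
    """Assumed overlap measure between the networks of two profiles.

    Counts friends of id1 that have a candidate match among the friends of id2,
    divided by the size of the larger network so the score stays in [0, 1].
    """
    n1 = friends1.get(id1, set())
    n2 = friends2.get(id2, set())
    if not n1 or not n2:
        return 0.0
    overlap = sum(
        1 for f1 in n1 if any((f1, f2) in candidate_pairs for f2 in n2)
    )
    return overlap / max(len(n1), len(n2))


def filter_by_network(candidate_pairs, friends1, friends2, network_threshold):
    """Keep only the candidate matches that also satisfy the network threshold."""
    return {
        (id1, id2)
        for id1, id2 in candidate_pairs
        if network_similarity(id1, id2, friends1, friends2, candidate_pairs)
        >= network_threshold
    }


# Hypothetical candidate matches from the pairwise phase.
# Profile a1 is ambiguous: it resembles both b1 and b2.
CANDIDATES = {("a1", "b1"), ("a1", "b2"), ("a3", "b3"), ("a4", "b4")}

# Hypothetical friend lists per site.
FRIENDS1 = {"a1": {"a3", "a4"}, "a3": {"a1"}, "a4": {"a1"}}
FRIENDS2 = {"b1": {"b3", "b4"}, "b2": {"b5"}, "b3": {"b1"}, "b4": {"b1"}}

print(sorted(filter_by_network(CANDIDATES, FRIENDS1, FRIENDS2, 0.5)))
```

In this toy data the pair (a1, b2) is dropped because the two networks share nobody, while (a1, b1) survives, which is the kind of disambiguation the network phase is meant to provide. Profiles without any friends get a score of 0.0 and are filtered out, a simple way in which stricter conditions can cost recall.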
In order to test whether the network model would indeed improve the simple model, we have set up experiments. Since no suitable data sets were available, we retrieved our own data set. Unfortunately, it appeared to have some limitations. Also, we have built a prototype that implemented the model. The prototype has several parameters for which we could vary the values in the experiments to find a good configuration.
The network model adds conditions that a pair must meet to be a match, and the experimental results confirm this: the precision of the results increases. On the other hand, because of these stricter conditions, some corresponding profiles are missed, which is undesirable. However, when the set contains ambiguous profiles, the network model can single out the correct profile, which is highly desirable. This situation will occur frequently in real life, hence we think the network model can really contribute to solving the ER problem.
Together with Ander de Keijzer, I gave a 4-hour lecture, including a 1-hour exercise, for the SIKS Advanced Course “Probabilistic Methods for Entity Resolution and Entity Ranking” (see Program for the slides). Our lecture was about a “Probabilistic Data Integration approach to Entity Resolution”. In the exercise we tried to integrate information about movies from two independent sources using the probabilistic database Trio. This proved too difficult for a 1-hour exercise in such a course, but at the same time very interesting as an example use of a probabilistic database. We decided to try to turn it into a sample script that illustrates both the power of probabilistic databases like Trio and how to do probabilistic data integration.