On Friday 23 October 2009, Irma Veldman defended her MSc thesis “Matching Profiles from Social Network Sites – Similarity Calculations with Social Network Support”. The MSc project was carried out at Topicus FinCare. It was supervised by me, Ander de Keijzer (UT), Jasper Laagland (FinCare), and Wouter de Jong (FinCare).
“Matching Profiles from Social Network Sites – Similarity Calculations with Social Network Support” [download]
In recent years social networking sites have become very popular. Many people are members of one or more of these profile sites and tend to put a lot of information about themselves online. This often publicly available data can be useful for many purposes, and even more so when all available data about one person is retrieved and merged into one profile. Detecting which profiles belong to the same person therefore becomes very important. This task is called Entity Resolution (ER).
In this research we develop a model to solve the ER problem for profiles from social networking sites. First we present a simple model. Then we try to improve this model by making use of the social networks a member can have on these sites. We believe that involving the networks can improve the results significantly.
The general idea is that we have two sites with profiles. With the model we try to find out which profiles on the first site correspond to which profiles on the second site, assuming that a person has at most one profile on each site.
In the simple model, we compare all profiles of the first profile site against all profiles of the second site. This comparison will result in a score for each pair: the pairwise similarity score. The higher this score, the higher the probability that these profiles belong to the same person. The pairs that satisfy the so-called pairwise threshold are the candidate matches. From these candidate matches, the matches are chosen.
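The simple model described above can be sketched roughly as follows. This is only an illustration, not the thesis prototype: the similarity measure (mean string similarity over shared fields), the profile layout (dictionaries with an "id" key), and the greedy one-to-one selection step are all assumptions, since the abstract does not specify how scores are computed or how matches are chosen from the candidates.

```python
from difflib import SequenceMatcher


def pairwise_similarity(a, b):
    """Assumed measure: mean string similarity over the fields both profiles share."""
    shared = [k for k in a if k in b and k != "id"]
    if not shared:
        return 0.0
    total = sum(
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in shared
    )
    return total / len(shared)


def candidate_matches(site_a, site_b, pairwise_threshold):
    """Compare every profile of site A with every profile of site B."""
    candidates = []
    for a in site_a:
        for b in site_b:
            score = pairwise_similarity(a, b)
            if score >= pairwise_threshold:
                candidates.append((score, a["id"], b["id"]))
    return candidates


def choose_matches(candidates):
    """Assumed selection: greedy, best score first, at most one match per profile."""
    used_a, used_b, matches = set(), set(), []
    for score, a_id, b_id in sorted(candidates, reverse=True):
        if a_id not in used_a and b_id not in used_b:
            matches.append((a_id, b_id))
            used_a.add(a_id)
            used_b.add(b_id)
    return matches
```

Calling choose_matches(candidate_matches(site_a, site_b, 0.8)) on two lists of profile dictionaries returns the selected (site A id, site B id) pairs. The one-match-per-profile rule reflects the assumption that a person has at most one profile on each site.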
In the network model, we start the same way. Once the list of candidate matches has been determined, the network phase starts. For each candidate match the network similarity score is calculated by determining the overlap between the networks of the two profiles in the candidate match. The more the networks overlap, the higher the network similarity score, and thus the higher the probability that the profiles in the candidate match belong to the same person. This time, a candidate match must also satisfy a network threshold in order to remain a candidate match. The matches are then chosen from the remaining candidate matches.
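The network phase could look something like the sketch below. It rests on an assumption the abstract does not spell out: because the two networks live on different sites, two friends are counted as overlapping when they themselves form a candidate match. The function names, the friend dictionaries, and the normalisation by the larger network are likewise hypothetical.

```python
def network_similarity(a_id, b_id, friends_a, friends_b, candidate_pairs):
    """Share of a's friends that are candidate-matched to one of b's friends.

    Normalised by the larger of the two networks so the score stays in [0, 1].
    """
    fa = friends_a.get(a_id, set())
    fb = friends_b.get(b_id, set())
    if not fa or not fb:
        return 0.0
    overlap = sum(1 for x in fa if any((x, y) in candidate_pairs for y in fb))
    return overlap / max(len(fa), len(fb))


def filter_by_network(candidate_pairs, friends_a, friends_b, network_threshold):
    """Keep only the candidate matches whose network score meets the threshold."""
    return sorted(
        (a_id, b_id)
        for a_id, b_id in candidate_pairs
        if network_similarity(a_id, b_id, friends_a, friends_b, candidate_pairs)
        >= network_threshold
    )
```

Here candidate_pairs is a set of (site A id, site B id) tuples produced by the pairwise phase, and friends_a and friends_b map a profile id to the set of ids in its network on the same site. A profile without any network data scores 0.0 and is dropped, which is one way corresponding profiles can be missed under the stricter conditions.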
In order to test whether the network model would indeed improve on the simple model, we set up experiments. Since no suitable data sets were available, we retrieved our own data set. Unfortunately, it turned out to have some limitations. We also built a prototype that implements the model. The prototype has several parameters whose values we could vary in the experiments to find a good configuration.
The network model imposes more conditions that a pair must meet to be a match, so the precision of the results increases. The experimental results confirm this. On the other hand, due to these strict conditions, some corresponding profiles are missed, which is undesired. However, when there are ambiguous profiles in the set, the network model can distinguish the correct profile, which is highly desired. This situation will occur frequently in real life, hence we think the network model can really contribute to solving the ER problem.