
Wednesday, July 14th, 2010 | Author:

On Wednesday 14 July 2010, Tom Palsma defended his MSc thesis entitled “Discovering groups using short messages from social network profiles”. The MSc project was carried out at Topicus FinCare. It was supervised by me, Dolf Trieschnigg (UT), Jasper Laagland (FinCare), and Wouter de Jong (FinCare).

Discovering groups using short messages from social network profiles
In the past few years, people have increasingly used the internet to share their photos, thoughts and activities via photo albums, user profiles, blogs and short text messages on online social networking sites such as Facebook and MySpace. This information could be very useful for (personal) marketing and advertising. Besides the groups that users of social network sites form explicitly, people can also be related through similar interests that they reveal implicitly in short text messages.
This research focuses on discovering groups by taking into account semantic relations between user profiles, and on describing the characteristics of those groups and relations. Because such relations are hard to discover at word level by matching similar words in messages, we introduce a hierarchical structure of concepts obtained from the Wikipedia category system. With it we discover groups and (semantic) relations at more abstract conceptual levels between profiles containing short text messages obtained from Twitter.
To provide a general approach that is not limited to a specific set of concepts, we use a naive classification approach: concepts (and their parent concepts) are assigned to a profile when concept terms or related terms occur in one of the profile's short messages. Manual evaluation of this approach shows that 37.4% of the assignments are correct (the precision), which results in an F-score of 0.54. To improve the precision of the classification results, we use Support Vector Machines. Using features related to characteristics of the concept structure and to the relations between concepts and profiles raises the F-score by 0.14, to 0.68.
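The naive assignment step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept hierarchy, the trigger terms and the function names are hypothetical toy values, not the Wikipedia-derived structure or the code used in the thesis. The small `f1` helper assumes the reported F-score is the balanced F1 measure; the abstract does not report recall, so the recall value in the comment is back-computed from the two published numbers and is sensitive to rounding.

```python
import re

# Hypothetical toy hierarchy: concept -> parent concept (None marks a root).
PARENT = {
    "Football": "Sports",
    "Tennis": "Sports",
    "Sports": None,
    "Jazz": "Music",
    "Music": None,
}

# Hypothetical trigger terms per concept (concept terms and related terms).
TERMS = {
    "Football": {"football", "goal"},
    "Tennis": {"tennis", "wimbledon"},
    "Jazz": {"jazz", "saxophone"},
}


def tokenize(message):
    """Lowercase a message and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", message.lower()))


def ancestors(concept):
    """Return all parent concepts of `concept`, nearest first."""
    chain = []
    parent = PARENT.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain


def assign_concepts(messages):
    """Assign a concept, plus its parents, when any of its terms occurs."""
    assigned = set()
    for message in messages:
        words = tokenize(message)
        for concept, terms in TERMS.items():
            if words & terms:
                assigned.add(concept)
                assigned.update(ancestors(concept))
    return assigned


def f1(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


profile = ["Great goal last night!", "Listening to jazz all day"]
print(sorted(assign_concepts(profile)))

# Assumption: with precision 0.374, an F1 of about 0.54 corresponds to a
# recall of roughly 0.97 (not stated in the abstract).
print(round(f1(0.374, 0.971), 2))
```

Matching on bare terms like this finds almost every relevant concept but also fires on ambiguous words, which is consistent with the low precision reported for the naive approach.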
The grouping process consists of clustering statistical data on concept occurrences in user profiles. The interesting groups discovered from the clustering results are groups of concepts that are not grouped together in the original Wikipedia category structure. The results also show groups of concepts that have a semantic relation which is not reflected in the Wikipedia category structure. This information could be used to improve the Wikipedia category structure.
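As a rough illustration of grouping concepts by their occurrence statistics, the sketch below clusters concepts whose occurrence vectors across profiles are similar. The abstract does not name the clustering algorithm or the similarity measure, so the choice here (single-link merging on cosine similarity with a fixed threshold) is an assumption, and the counts are invented toy data.

```python
from math import sqrt

# Hypothetical counts: concept -> occurrences in profiles p0..p3.
OCCURRENCES = {
    "Football": [3, 2, 0, 0],
    "Beer": [2, 3, 0, 0],
    "Jazz": [0, 0, 4, 1],
    "Vinyl": [0, 0, 2, 2],
}


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_concepts(occurrences, threshold=0.8):
    """Single-link clustering: merge two clusters while any cross pair
    of concepts has cosine similarity at or above `threshold`."""
    clusters = [{concept} for concept in occurrences]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(
                    cosine(occurrences[a], occurrences[b]) >= threshold
                    for a in clusters[i]
                    for b in clusters[j]
                ):
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


for group in cluster_concepts(OCCURRENCES):
    print(sorted(group))
```

In this toy data "Football" and "Beer" end up together although they would sit in different branches of a category tree, which is the kind of cross-category group the paragraph above describes.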
The overall process shows that using hierarchical concepts and clustering helps to discover groups based on semantic relations at abstract conceptual levels. Selecting concepts and assigning them to user profiles can steer the grouping results towards desired domains, and the concepts help to describe the groups. However, because of the ambiguous meaning of some concepts and the characteristics of the messages, a different approach to assigning concepts could improve the quality of the discovered groups. More research is required to determine how useful the groups are for marketing and advertising.