Archive for the 'Social Data' Category

Guest speakers at 12th SSR

Tuesday, October 6th, 2015, posted by Djoerd Hiemstra

We are proud to announce the 12th Seminar on Searching and Ranking, with guest presentations by Ingo Frommholz from the University of Bedfordshire, UK, and Tom Heskes from Radboud University Nijmegen, the Netherlands.
More information at: SSR 12.

CBS / UT Data Camp 2015

Tuesday, September 15th, 2015, posted by Djoerd Hiemstra

On 23-27 November 2015, the Data Camp will take place, a joint event organized by the Central Bureau for Statistics of the Netherlands (CBS) and the University of Twente (UT). During the camp, a group of CBS data analysts and UT researchers will answer research questions about statistics using big data technologies. On Monday, the participants will receive overview presentations about the research questions and technologies. The data camp participants will work in small, mixed teams in an informal setting. Experienced data scientists will support the teams with mini-workshops and hands-on help. The hope is that intense engagement with the research questions in an informal and spontaneous environment will produce valuable and innovative answers.

Guest speakers are Erik Tjong Kim Sang (Meertens Institute, Amsterdam) and David González (Vizzuality, Madrid).

[download report]

The Influence of Prosocial Norms and Online Network Structure on Prosocial Behavior

Friday, August 28th, 2015, posted by Djoerd Hiemstra

The Influence of Prosocial Norms and Online Network Structure on Prosocial Behavior: An Analysis of Movember’s Twitter Campaign in 24 Countries

by Tijs van den Broek, Ariana Need, Michel Ehrenhard, Anna Priante and Djoerd Hiemstra

Sociological research points at norms and social networks as antecedents of prosocial behavior. To date, the literature remains undecided on how these factors jointly influence prosocial behavior. Furthermore, the use of social media by campaign organizations may change the need for formal networks to organize large-scale collective action. Hence, in this paper we examine the interplay of prosocial norms and the structure of online social networks on offline prosocial behavior. For this purpose we use donation data from the global Movember campaign, messages about the Movember campaign on the online social networking site Twitter, and data from the World Giving Index. A multi-level analysis of Movember’s campaigns in 24 countries finds support for the logic of connective action: larger and more decentralized networks raise more donations. Furthermore, we find that the effect of prosocial norms on donations is decreased by larger and denser campaign networks.

To be presented at Social media, Activism, and Organizations 2015 (SMAO) on 6 November in London, UK.

Anne van de Venis graduates on Recommendations using DBpedia

Wednesday, August 26th, 2015, posted by Djoerd Hiemstra

Recommendations using DBpedia: How your Facebook profile can be used to find your next greeting card

by Anne van de Venis

Recommender systems (RS) are systems that provide suggestions that users may find interesting. In this thesis we present our Interest-Based Recommender System (IBRS) that can recommend tagged item sets from any domain. This RS is validated with item sets from two different domains, namely postcards and holiday homes. While postcards and holiday homes are very different items, with different characteristics, IBRS uses the same recommender engine to create recommendations. IBRS solves several problems that are present in classic RSs, such as the cold-start problem and language dependence. The cold-start problem for new users is solved by using Facebook likes to create a user profile. IBRS uses information in DBpedia to create recommendations in a tag-based item set for multiple domains, independent of the language. Using both external knowledge sources and user content makes our system a hybrid of a knowledge-based and a content-based RS. We validated our system through an online evaluation system in two evaluation rounds with test user groups of approximately 71 and 44 people. The main contributions in this thesis are:

  • a literature study of existing recommendation approaches;
  • a language-independent mapping approach for tags and social media resources onto DBpedia resources;
  • a domain-independent algorithm for detecting related concepts in the DBpedia graph;
  • a recommendation approach based on both Facebook and DBpedia;
  • a validation of our recommendation approach.

[download pdf]

On the Impact of Twitter-based Health Campaigns

Friday, August 21st, 2015, posted by Djoerd Hiemstra

A Cross-Country Analysis of Movember

by Nugroho Dwi Prasetyo (TU Delft), Claudia Hauff (TU Delft), Dong Nguyen, Tijs van den Broek, Djoerd Hiemstra

Health campaigns that aim to raise awareness and subsequently raise funds for research and treatment are commonplace. While many local campaigns exist, very few attract the attention of a global audience. One of those global campaigns is Movember, an annual campaign during the month of November that is directed at men’s health, with a special focus on cancer and mental health. Health campaigns routinely use social media portals to capture people’s attention. Recently, researchers began to consider to what extent social media is effective in raising the awareness of health campaigns. In this paper we expand on those works by conducting an investigation across four different countries, considering not only the impact on awareness but also the impact on fund-raising. To that end, we analyze the 2013 Movember Twitter campaigns in Canada, Australia, the United Kingdom and the United States.

To be presented at the 6th International Workshop on Health Text Mining and Information Analysis (Louhi 2015) at EMNLP 2015 on September 17 in Lisbon, Portugal.

[download pdf]

Han van der Veen graduates on composing a more complete and relevant Twitter dataset

Tuesday, August 18th, 2015, posted by Djoerd Hiemstra

Composing a more complete and relevant Twitter dataset

by Han van der Veen

Social data is widely used by many researchers. Facebook, Twitter and other social networks produce huge amounts of social data, which can be used for analyzing human behavior. Social datasets are typically collected by hashtag; however, not all relevant data includes the hashtag, and a better overview can be constructed with more data. This research focuses on creating a more complete and relevant dataset, using additional keywords to find more relevant tweets and a filtering mechanism to remove the irrelevant ones. Three methods for selecting additional keywords are proposed and evaluated: one based on word frequency, one on the probability of a word in a dataset, and one on estimates of the volume of tweets. Two classifiers for filtering tweets are compared: a Naive Bayes classifier and a Support Vector Machine classifier. Our method increases the size of the dataset by 105%. The average precision drops from 95% when using only a hashtag to 76% for the resulting dataset. These evaluations were carried out on two TV shows and two sports events. A tool was developed that automatically executes all parts of the program. It takes the hashtag of an event as input and outputs a more complete and relevant dataset than the original hashtag alone would yield. This is useful for social researchers who use tweets, but also for other researchers who use tweets as their data.
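As a rough illustration of the word-frequency idea mentioned in the abstract, the sketch below proposes extra keywords from tweets that already carry the seed hashtag. It is not the thesis implementation: the function name `suggest_keywords`, the tiny stopword list, and the example tweets are all assumptions made up for this example.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "on", "for", "at", "this"}

def suggest_keywords(tweets, hashtag, k=3):
    """Return the k terms that occur in the most tweets carrying the seed hashtag."""
    seed = hashtag.lower()
    counts = Counter()
    for text in tweets:
        tokens = re.findall(r"[#@]?\w+", text.lower())
        if seed not in tokens:
            continue  # only learn from tweets that contain the seed hashtag
        # Count each term once per tweet so a single repetitive tweet cannot dominate.
        for token in set(tokens):
            if token != seed and token not in STOPWORDS:
                counts[token] += 1
    return [term for term, _ in counts.most_common(k)]

# Made-up example tweets, for illustration only.
tweets = [
    "Watching the final tonight #worldcup",
    "What a goal in the final! #worldcup",
    "Penalty drama at the final #worldcup",
    "Goal! #worldcup",
]
print(suggest_keywords(tweets, "#worldcup", k=2))
```

Tweets retrieved with the suggested keywords would still need the filtering step described above, since frequent co-occurring words also match many irrelevant tweets.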

[download pdf]

#SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns

Monday, August 10th, 2015, posted by Djoerd Hiemstra

by Dong Nguyen, Tijs van den Broek, Claudia Hauff (TU Delft), Djoerd Hiemstra, and Michel Ehrenhard

We consider the task of automatically identifying participants’ motivations in the public health campaign Movember and investigate the impact of the different motivations on the amount of campaign donations raised. Our classification scheme is based on the Social Identity Model of Collective Action (van Zomeren et al., 2008). We find that automatic classification based on Movember profiles is fairly accurate, while automatic classification based on tweets is challenging. Using our classifier, we find a strong relation between types of motivations and donations. Our study is a first step towards scaling-up collective action research methods.

The paper will be presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP) on September 17-21, in Lisbon, Portugal.

[download pdf]

What country was this tweeted from?

Monday, August 3rd, 2015, posted by Djoerd Hiemstra

Determine the User Country of a Tweet

by Han van der Veen, Djoerd Hiemstra, Tijs van den Broek, Michel Ehrenhard, and Ariana Need

On the widely used messaging platform Twitter, about 2% of the tweets contain a geographical location in the form of exact GPS coordinates (latitude and longitude). Knowing the location of a tweet is useful for many data analytics questions. This research looks at determining a location for tweets that do not contain GPS coordinates. An accuracy of 82% was achieved using a Naive Bayes model trained on features such as the user’s timezone, the user’s language, and the parsed user location. The classifier performs well on active Twitter countries such as the Netherlands and the United Kingdom. An analysis of the errors made by the classifier shows that mistakes were due to limited information and to properties shared between countries, such as a shared timezone. A feature analysis was performed to see the effect of the different features. The timezone and the parsed user location were the most informative features.
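To make the approach concrete, here is a minimal sketch of a Naive Bayes classifier over categorical profile features. It is an assumption-laden toy, not the model from the paper: the class `CategoricalNaiveBayes`, the feature names, the add-one smoothing, and the four training rows are invented for illustration.

```python
from collections import Counter, defaultdict
import math

class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.label_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[label][feature][value] -> number of training rows
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        self.vocab = defaultdict(set)  # distinct values seen per feature
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[label][feature][value] += 1
                self.vocab[feature].add(value)
        return self

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / self.total)  # log prior of the country
            for feature, value in row.items():
                seen = self.value_counts[label][feature][value]
                # The extra +1 in the denominator reserves probability mass
                # for feature values never seen during training.
                score += math.log((seen + 1) / (n + len(self.vocab[feature]) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data; the feature names are assumptions for this sketch.
rows = [
    {"timezone": "Amsterdam", "lang": "nl", "location": "enschede"},
    {"timezone": "Amsterdam", "lang": "nl", "location": "utrecht"},
    {"timezone": "London", "lang": "en", "location": "london"},
    {"timezone": "London", "lang": "en", "location": "leeds"},
]
labels = ["NL", "NL", "GB", "GB"]
model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict({"timezone": "Amsterdam", "lang": "nl", "location": "unknown"}))
```

The sketch also hints at the error source named in the abstract: two countries that share a timezone and a language produce nearly identical scores, so the remaining features must carry the decision.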

[download pdf]

Mike Kolkman graduates on cross-domain geocoding

Thursday, July 2nd, 2015, posted by Djoerd Hiemstra

Cross-domain textual geocoding: influence of domain-specific training data

by Mike Kolkman

Modern technology is more and more able to understand natural language. To do so, unstructured texts need to be analysed and structured. One such structuring method is geocoding, which is aimed at recognizing and disambiguating references to geographical locations in text. These locations can be countries and cities, but also streets and buildings, or even rivers and lakes. A word or phrase that refers to a location is called a toponym. Approaches to tackle the geocoding task mainly use natural language processing techniques and machine learning. The difficulty of the geocoding task depends on multiple aspects, one of which is the data domain. The domain of a text describes the type of the text, like its goal, degree of formality, and target audience. When texts come from two (or more) different domains, like a Twitter post and a news item, they are said to be cross-domain.
An analysis of baseline geocoding systems shows that identifying toponyms on cross-domain data still has room for improvement, as existing systems depend significantly on domain-specific metadata. Systems focused on Twitter data are often dependent on account information of the author and other Twitter-specific metadata. This causes the performance of these systems to drop significantly when applied to news item data.
This thesis presents a geocoding system, called XD-Geocoder, aimed at robust cross-domain performance by using only text-based and lookup-list-based features. Such a lookup list is called a gazetteer and contains a vast number of geographical locations and information about these locations. Features are built up using word shape, part-of-speech tags, dictionaries and gazetteers. The features are used to train SVM and CRF classifiers.
Both classifiers are trained and evaluated on three corpora from three domains: Twitter posts, news items and historical documents. These evaluations show Twitter data to be the best for training out of the tested data sets, because both classifiers show the best overall performance when trained on tweets. However, this good performance might also be caused by the relatively high toponym-to-word ratio in the Twitter data used.
Furthermore, the XD-Geocoder was compared to existing geocoding systems. Although the XD-Geocoder is outperformed by state-of-the-art geocoders on single-domain evaluations (trained and evaluated on data from the same domain), it outperforms the baseline systems on cross-domain evaluations.
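The text-based and gazetteer-based features mentioned in the abstract can be sketched roughly as follows. This is an illustrative assumption, not the XD-Geocoder's actual feature set: the names `word_shape`, `token_features` and `GAZETTEER` are hypothetical, the gazetteer is a four-entry stand-in, and part-of-speech tags are left out to keep the example free of external dependencies.

```python
import re

# Hypothetical miniature gazetteer; real gazetteers hold millions of entries.
GAZETTEER = {"enschede", "amsterdam", "lisbon", "santiago"}

def word_shape(token):
    """Map ASCII characters to X (upper), x (lower), d (digit) and collapse repeats."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)  # runs before digits so the 'd' marker survives
    shape = re.sub(r"\d", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)

def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "in_gazetteer": token.lower() in GAZETTEER,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

tokens = ["Flying", "to", "Lisbon"]
print(token_features(tokens, 2))
```

Feature dictionaries of this kind, one per token, are the usual input format for sequence labellers such as CRFs. Because they rely only on the text and a lookup list, they do not depend on domain-specific metadata such as a Twitter author's account information.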

[download pdf]

Where to go on your next trip?

Wednesday, June 3rd, 2015, posted by Djoerd Hiemstra

Optimizing Travel Destinations Based on User Preferences

by Julia Kiseleva (TU Eindhoven), Melanie Müller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, Jaap Kamps (University of Amsterdam), Alexander Tuzhilin (New York University), Djoerd Hiemstra

Recommendation based on user preferences is a common task for e-commerce websites. New recommendation algorithms are often evaluated by offline comparison to baseline algorithms such as recommending random or the most popular items. Here, we investigate how these algorithms themselves perform and compare to the operational production system in large-scale online experiments in a real-world application. Specifically, we focus on recommending travel destinations at a major online travel site to users searching for their preferred vacation activities. To build ranking models we use multi-criteria rating data provided by previous users after their stay at a destination. We implement three methods (random, most popular, and Naive Bayes) and compare them to the current baseline. Our general conclusion is that, in an online A/B test with live users, our Naive-Bayes based ranker increased user engagement significantly over the current online system.

To be presented at SIGIR 2015, the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, on 12 August in Santiago de Chile.

[download preprint]