Archive for the 'Social Data' Category

Welcome to REDI

Monday, April 23rd, 2018, posted by Djoerd Hiemstra

Welcome to Research Experiments in Databases and Information Retrieval (REDI)! The theme of this year’s course is: Recommendation in federated social networks. Federated social networks consist of multiple independent servers that cooperate. An example is Mastodon, a free open source implementation of a micro-blogging social network that resembles Twitter. Unlike Twitter (or Facebook for that matter), nobody has a complete view of all accounts and posts in a federated social network. We will address two research problems: 1) How to implement recommendations using only local knowledge of the network? and 2) How to evaluate your system in such a highly dynamic environment?

We are the first University of Twente course with a public Canvas syllabus. Of course, we will appropriately use Mastodon to communicate about REDI. Please make an account on U. Twente Mastodon and follow the hashtag #REDI. Use the hashtag in questions and toots about the course.

UT Mastodon now live for all students, alumni and employees

Sunday, April 15th, 2018, posted by Djoerd Hiemstra

The University of Twente is the first Dutch university to run its own Mastodon server. Mastodon is a social network based on open web protocols and free, open-source software. It is decentralized like e-mail. Learning from failures of other networks, Mastodon aims to make ethical design choices to combat the misuse of social media. By joining U. Twente Mastodon, you join a global social network with more than a million people. The university will not sell your data, nor show you advertisements. Mastodon U. Twente is available to all students, alumni, and employees.

Join Mastodon U. Twente now.

Christel Geurts graduates on Cross-Domain Authorship Attribution

Friday, January 12th, 2018, posted by Djoerd Hiemstra

Cross-Domain Authorship Attribution as a Tool for Digital Investigations

by Christel Geurts

On the dark web, sites promoting illegal content are abundant, and new sites are constantly created. At the same time, law enforcement is working hard to take these sites down and track down the persons involved. Often, after a site is taken down, users change their names and move to a different site. But what if law enforcement could track users across sites? Different sites or sources of information are called domains. As the domain changes, the context of a message often changes too, making it challenging to track users simply by the words they use. The aim of this thesis is to develop a system that can link written text of authors in a cross-domain setting. The system was tested on a blog corpus and verified on police data. Tests show that multinomial logistic regression and Support Vector Machines with a linear kernel perform well. Character 3-grams work well as features, and combining multiple feature sets increases performance. Logistic regression models with a combined feature set performed best on the blog corpus (accuracy = 0.717, MRR = 0.7785, 1000 authors). On the police data, the logistic regression model had an accuracy of 0.612 and an MRR of 0.6883 for 521 authors.
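
To make the features and the two reported metrics concrete, here is a minimal sketch in Python. It is not the thesis code: the function names and the toy rankings are assumptions made for illustration. It shows how character 3-grams can be counted, and how accuracy and mean reciprocal rank (MRR) are computed when a system returns a ranked list of candidate authors per text.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (3-grams by default) in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def top1_accuracy(rankings, true_authors):
    """Fraction of texts whose top-ranked candidate is the true author."""
    hits = sum(
        1 for ranked, truth in zip(rankings, true_authors)
        if ranked and ranked[0] == truth
    )
    return hits / len(true_authors)


def mean_reciprocal_rank(rankings, true_authors):
    """Average of 1/rank of the true author; a missing author contributes 0."""
    total = 0.0
    for ranked, truth in zip(rankings, true_authors):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_authors)


# Toy data (hypothetical): one ranked candidate list per test document.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
true_authors = ["a", "a", "a"]

features = char_ngrams("hello")
acc = top1_accuracy(rankings, true_authors)
mrr = mean_reciprocal_rank(rankings, true_authors)
```

MRR is never lower than accuracy, because it also gives partial credit when the true author is ranked second or third. This is consistent with the reported figures, where MRR is above accuracy on both data sets.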

Slavica Zivanovic graduates on capturing and mapping QOL using Twitter data

Monday, February 27th, 2017, posted by Djoerd Hiemstra

by Slavica Zivanovic

There is an ongoing discussion about the applicability of social media data in scientific research. Moreover, little is known about the feasibility of using these data to capture the Quality of Life (QoL). This study explores the use of social media in QoL research by capturing and analysing people’s perceptions about their QoL using Twitter messages. The methodology is based on a mixed method approach, combining manual coding of the messages, automated classification, and spatial analysis. The city of Bristol is used as a case study, with a dataset containing 1,374,706 geotagged Tweets sent within the city boundaries in 2013. Based on the manual coding results, health, transport, and environment domains were selected to be further analysed. Results show the difference between Bristol wards in number and type of QoL perceptions in every domain, spatial distribution of positive and negative perceptions, and differences between the domains. Furthermore, results from this study are compared to the official QoL survey results from Bristol, statistically and spatially. Overall, three main conclusions are underlined. First, Twitter data can be used to evaluate QoL. Second, based on people’s opinions, there is a difference in QoL between Bristol neighbourhoods. And, third, Twitter messages can be used to complement QoL surveys but not as a proxy. The main contribution of this study is in recognising the potential Twitter data have in QoL research. This potential lies in producing additional knowledge about QoL that can be placed in a planning context and effectively used to improve the decision-making process and enhance quality of life of residents.

[download pdf]

#WhoAmI in 160 Characters?

Wednesday, October 5th, 2016, posted by Djoerd Hiemstra

Classifying Social Identities Based on Twitter

by Anna Priante, Djoerd Hiemstra, Tijs van den Broek, Aaqib Saeed, Michel Ehrenhard, and Ariana Need

We combine social theory and NLP methods to classify English-speaking Twitter users’ online social identity in profile descriptions. We conduct two text classification experiments. In Experiment 1 we use a 5-category online social identity classification based on identity and self-categorization theories. While we are able to automatically classify two identity categories (Relational and Occupational), automatic classification of the other three identities (Political, Ethnic/religious and Stigmatized) is challenging. In Experiment 2 we test a merger of such identities based on theoretical arguments. We find that by combining these identities we can improve the predictive performance of the classifiers in the experiment. Our study shows how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.

To be presented at the EMNLP Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) on November 5 in Austin, Texas, USA.

[download pdf]

Download the code book and classifier source code from github.

Marco Schultewolter graduates on Verification of User Information

Thursday, July 7th, 2016, posted by Djoerd Hiemstra

by Marco Schultewolter

Software providers often ask users to enter personal data before granting them the right to use their software. These companies want user profiles to be as correct as possible, but users sometimes enter incorrect information. This thesis researches and discusses approaches to automatically verify this information using third-party web resources.
To this end, a series of experiments is carried out. One experiment compares different similarity measures, each combined with different search approaches, in the context of a German phone book directory. Another experiment uses a search engine without a specific predefined data source. For this purpose, ways of finding persons in search engines and of extracting address information from unknown websites are compared.
The results show that automatic verification can be done to some extent. Verifying name and address data against external web resources, with Jaro-Winkler as the similarity measure, can support the decision, but it is not yet solid enough to rely on exclusively. Extracting address information from unknown pages is very reliable when using a sophisticated regular expression. Persons are best found on the internet by searching for just the full name without any additions.
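
For readers unfamiliar with Jaro-Winkler, the following is a minimal Python sketch of the standard textbook definition. It is not taken from the thesis. The prefix scale of 0.1 and the maximum prefix length of 4 are the commonly used defaults, and the example strings are chosen here purely for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Classic illustration pair: two transposed letters after a shared prefix.
score = jaro_winkler("MARTHA", "MARHTA")
```

The prefix boost rewards strings that agree at the start, which tends to suit names where typing errors occur later in the word. Any threshold for treating two names as "the same" would have to be chosen empirically; the abstract only states that the score can support a decision, not replace it.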

[download pdf]

#SupportTheCause: Online Protest and Advocacy Symposium

Wednesday, January 6th, 2016, posted by Djoerd Hiemstra

21-22 January 2016
University of Twente

If you’re interested in social media analysis and/or computational social science, there will be interesting guest speakers, including speakers from UCLA, TNO, TU Delft, Greenpeace, Sanquin, and Twitter.

IPython Notebook Exercises for Web Science

Friday, November 6th, 2015, posted by Djoerd Hiemstra

Check out the Jupyter IPython Notebook Exercises made for the module Web Science. The exercises closely follow the exercises from Chapters 13 and 14 of the wonderful Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. Download the notebooks here:

Update (February 2016). The notebooks with answers are now available below:

Maurice Bolhuis graduates on Estimating Creditworthiness using Uncertain Online Data

Thursday, October 15th, 2015, posted by Djoerd Hiemstra

Estimating Creditworthiness using Uncertain Online Data

by Maurice Bolhuis

The rules for credit lenders have become stricter since the financial crisis of 2007-2008. As a consequence, it has become more difficult for companies to obtain a loan. Many people and companies leave a trail of information about themselves on the Internet. Searching and extracting this information is accompanied by uncertainty. In this research, we study whether this uncertain online information can be used as an alternative or extra indicator for estimating a company’s creditworthiness and how accounting for information uncertainty impacts the prediction performance.
A data set consisting of 3579 corporate ratings has been constructed using the data of an external data provider. Based on the results of a survey, a literature study and information availability tests, LinkedIn accounts of company owners, corporate Twitter accounts and corporate Facebook accounts were chosen as information sources for extracting indicators. In total, the Twitter and Facebook accounts of 387 companies and 436 corresponding LinkedIn owner accounts of this data set were manually searched. Information was harvested from these sources and several indicators have been derived from the harvested information.
Two experiments were performed with this data. In the first experiment, Naive Bayes, J48, Random Forest and Support Vector Machine classifiers were trained and tested using solely these Internet features. A comparison of their accuracy to the 31% accuracy of the ZeroR classifier, which as a rule always predicts the most frequent target class, showed that none of the models performed statistically better. In a second experiment, it was tested whether combining Internet features with financial data increases the accuracy. A financial data mining model was created that approximates the rating model of the ratings in our data set and that uses the same financial data as the rating model. The two best performing financial models were built using the Random Forest and J48 classifiers, with an accuracy of 68% and 63% respectively. Adding Internet features to these models gave mixed results, with a significant decrease and an insignificant increase respectively.
An experimental setup for testing how incorporating uncertainty affects the prediction accuracy of our model is explained. As part of this setup, a search system is described to find candidate results of online information related to a subject and to classify the degree of uncertainty of this online information. It is illustrated how uncertainty can be incorporated into the data mining process.
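
The ZeroR baseline mentioned above is simple enough to sketch in a few lines of Python. This is an illustration only, not the implementation used in the thesis: the class and function names are made up for this example, and the rating labels are hypothetical toy data.

```python
from collections import Counter


class ZeroR:
    """Baseline that ignores all features and always predicts the majority class."""

    def fit(self, labels):
        # Remember the most frequent label seen during training.
        self.majority_class = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_samples):
        # Every sample receives the same prediction.
        return [self.majority_class] * n_samples


def accuracy(predicted, actual):
    """Fraction of predictions that equal the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical rating classes, for illustration only.
train_labels = ["B", "A", "B", "C", "B", "A"]
test_labels = ["B", "A", "B", "C", "B"]

baseline = ZeroR().fit(train_labels)
baseline_accuracy = accuracy(baseline.predict(len(test_labels)), test_labels)
```

Because ZeroR uses no features at all, its accuracy equals the share of the majority class in the test data. A classifier that cannot beat it, as in the first experiment, has learned nothing useful from its features.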

[download pdf]

Guest speakers at 12th SSR

Tuesday, October 6th, 2015, posted by Djoerd Hiemstra

We are proud to announce the 12th Seminar on Searching and Ranking, with guest presentations by Ingo Frommholz from the University of Bedfordshire, UK, and Tom Heskes from Radboud University Nijmegen, the Netherlands.
More information at: SSR 12.