Archive for the 'Data Science' Category

IPython Notebook Exercises for Web Science

Friday, November 6th, 2015, posted by Djoerd Hiemstra

Check out the Jupyter IPython Notebook Exercises made for the module Web Science. The exercises closely follow the exercises from Chapters 13 and 14 of the wonderful Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. Download the notebooks here:

Update (February 2016). The notebooks with answers are now available below:

Maurice Bolhuis graduates on Estimating Creditworthiness using Uncertain Online Data

Thursday, October 15th, 2015, posted by Djoerd Hiemstra

Estimating Creditworthiness using Uncertain Online Data

by Maurice Bolhuis

The rules for credit lenders have become stricter since the financial crisis of 2007-2008. As a consequence, it has become more difficult for companies to obtain a loan. Many people and companies leave a trail of information about themselves on the Internet. Searching and extracting this information is accompanied by uncertainty. In this research, we study whether this uncertain online information can be used as an alternative or extra indicator for estimating a company’s creditworthiness and how accounting for information uncertainty impacts the prediction performance.
A data set consisting of 3579 corporate ratings has been constructed using the data of an external data provider. Based on the results of a survey, a literature study and information availability tests, LinkedIn accounts of company owners, corporate Twitter accounts and corporate Facebook accounts were chosen as information sources for extracting indicators. In total, the Twitter and Facebook accounts of 387 companies and 436 corresponding LinkedIn owner accounts of this data set were manually searched. Information was harvested from these sources and several indicators have been derived from the harvested information.
Two experiments were performed with this data. In the first experiment, Naive Bayes, J48, Random Forest and Support Vector Machine classifiers were trained and tested using solely these Internet features. A comparison of their accuracy to the 31% accuracy of the ZeroR classifier, which as a rule always predicts the most frequently occurring target class, showed that none of the models performed statistically significantly better. In a second experiment, it was tested whether combining Internet features with financial data increases the accuracy. A financial data mining model was created that approximates the rating model of the ratings in our data set and that uses the same financial data as the rating model. The two best performing financial models were built using the Random Forest and J48 classifiers, with accuracies of 68% and 63% respectively. Adding Internet features to these models gave mixed results, with a significant decrease and an insignificant increase respectively.
An experimental setup for testing how incorporating uncertainty affects the prediction accuracy of our model is explained. As part of this setup, a search system is described to find candidate results of online information related to a subject and to classify the degree of uncertainty of this online information. It is illustrated how uncertainty can be incorporated into the data mining process.
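
For readers unfamiliar with the ZeroR baseline mentioned in the abstract, the following is a minimal Python sketch of the idea: always predict the class that occurs most often in the training data. The rating labels are invented for illustration and are not the thesis data, and the function names are hypothetical; this is not the implementation used in the thesis.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent class label in the training labels."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical rating classes, invented for illustration only.
train_labels = ["A", "B", "B", "C", "B", "A"]
test_labels = ["B", "A", "B", "C"]

# ZeroR ignores all features: every test item gets the majority class.
majority = zero_r_fit(train_labels)
predictions = [majority] * len(test_labels)
baseline_accuracy = accuracy(predictions, test_labels)
print(majority, baseline_accuracy)
```

Any classifier that uses features should beat this number to be considered useful, which is why it serves as the point of comparison.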

[download pdf]

Guest speakers at 12th SSR

Tuesday, October 6th, 2015, posted by Djoerd Hiemstra

We are proud to announce the 12th Seminar on Searching and Ranking, with guest presentations by Ingo Frommholz from the University of Bedfordshire, UK, and Tom Heskes from Radboud University Nijmegen, the Netherlands.
More information at: SSR 12.

CBS / UT Data Camp 2015

Tuesday, September 15th, 2015, posted by Djoerd Hiemstra

On 23-27 November 2015, the Data Camp takes place, a joint event organized by the Central Bureau for Statistics of the Netherlands (CBS) and the University of Twente (UT). During the camp, a set of CBS data analysts and UT researchers will answer research questions about statistics using big data technologies. On Monday, the participants will be given overview presentations about the research questions and technologies. The data camp participants will work in small, mixed teams in an informal setting. Experienced data scientists will support the teams with short mini-workshops and hands-on support. The hope is that the intense contact with the research question in an informal and spontaneous environment will produce valuable and innovative answers to the posed questions.

Guest speakers are Erik Tjong Kim Sang (Meertens Institute, Amsterdam) and David González (Vizzuality, Madrid).

[download report]

Where to go on your next trip?

Wednesday, June 3rd, 2015, posted by Djoerd Hiemstra

Optimizing Travel Destinations Based on User Preferences

by Julia Kiseleva (TU Eindhoven), Melanie Müller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, Jaap Kamps (University of Amsterdam), Alexander Tuzhilin (New York University), Djoerd Hiemstra

Recommendation based on user preferences is a common task for e-commerce websites. New recommendation algorithms are often evaluated by offline comparison to baseline algorithms such as recommending random or the most popular items. Here, we investigate how these algorithms themselves perform and compare to the operational production system in large-scale online experiments in a real-world application. Specifically, we focus on recommending travel destinations at a major online travel site to users searching for their preferred vacation activities. To build ranking models, we use multi-criteria rating data provided by previous users after their stay at a destination. We implement three methods (random, most popular, and Naive Bayes) and compare them to the current baseline. Our general conclusion is that, in an online A/B test with live users, our Naive Bayes-based ranker increased user engagement significantly over the current online system.
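
To give a feel for the difference between a most-popular baseline and a Naive Bayes ranker, here is a small Python sketch. It assumes the rating data can be reduced to (destination, activity) endorsement pairs; the data, function names, and smoothing parameter are all hypothetical, and the ranker in the paper may differ in its details.

```python
import math
from collections import Counter, defaultdict

# Hypothetical endorsement data: (destination, activity) pairs.
endorsements = [
    ("Paris", "museums"), ("Paris", "shopping"), ("Paris", "museums"),
    ("Bali", "beach"), ("Bali", "surfing"), ("Bali", "beach"),
    ("Lisbon", "beach"), ("Lisbon", "museums"),
]

dest_counts = Counter(d for d, _ in endorsements)
pair_counts = defaultdict(Counter)
for dest, activity in endorsements:
    pair_counts[dest][activity] += 1
activities = {a for _, a in endorsements}


def most_popular():
    """Baseline: rank destinations by total number of endorsements."""
    return [d for d, _ in dest_counts.most_common()]


def naive_bayes_rank(query_activities, alpha=1.0):
    """Rank destinations by log P(dest) + sum of log P(activity | dest)."""
    total = sum(dest_counts.values())
    scores = {}
    for dest, n in dest_counts.items():
        score = math.log(n / total)
        for a in query_activities:
            # Laplace smoothing avoids zero probabilities for unseen pairs.
            p = (pair_counts[dest][a] + alpha) / (n + alpha * len(activities))
            score += math.log(p)
        scores[dest] = score
    return sorted(scores, key=scores.get, reverse=True)


ranking = naive_bayes_rank(["beach"])
print(most_popular())
print(ranking)
```

The most-popular baseline returns the same list for every user, whereas the Naive Bayes ranker reorders destinations according to the activities in the query.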

To be presented at SIGIR 2015, the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, on 12 August in Santiago de Chile.

[download preprint]

Data Science Day

Wednesday, April 1st, 2015, posted by Djoerd Hiemstra

On 20 April, we organize a Data Science Day in the DesignLab. Invited speakers at the Data Science Colloquium are Piet Daas, methodologist and big data research coordinator of the CBS (Centraal Bureau voor de Statistiek), who will talk about big data from Twitter and Facebook as a data source for official statistics; Rolf de By and Raul Zurita Milla, professors of ITC Geo-Information Science and Earth Observation, who will talk about remote sensing techniques, using satellites and drones to help economies in poor areas of the world, a prestigious project funded by the Bill and Melinda Gates Foundation; and Jan Willem Tulp, creator of interactive data visualisations for magazines like Scientific American and Popular Science, as well as for companies, for instance the Tax Free Retail Analysis Tool for Schiphol Amsterdam Airport.

The Data Science colloquia are kindly sponsored by the CTIT and the Netherlands Research School for Information and Knowledge Systems (SIKS) and are part of the SIKS educational program.

[more information]

CTIT Hadoop cluster open

Thursday, January 8th, 2015, posted by Djoerd Hiemstra

CTIT Hadoop cluster with Djoerd, Frederik, Maurice, and Robin

The new CTIT Hadoop cluster, with 512 cores, 2 TB RAM, and 0.5 PB storage, is now open for researchers and students in Twente. We started by giving accounts to the 60 students who follow the course Managing Big Data. From left to right in the (not so) cold aisle: Djoerd, Frederik, Maurice, and Robin.

Twente Data Science Center

Wednesday, October 8th, 2014, posted by Djoerd Hiemstra

Scientific and economic progress is increasingly powered by our capabilities to explore big datasets. Data is the driving force behind the successful innovation of Internet companies like Google, Twitter, and Yahoo, and job advertisements show an increasing need for data scientists and big data analysts. Data scientists dig for value in data by analyzing, for instance, texts, application usage logs, and sensor data. The need for data scientists and big data analysts is apparent in almost every sector of our society, including business, health care, and education.

The Twente Data Science Center is a collaboration between research groups of the University of Twente to research, promote and facilitate big data analysis for all scientific disciplines. The center operates by the participants sharing their expertise, sharing their contacts, sharing their data, and sharing their research infrastructure (hardware and software) for large-scale data analysis.

The Twente Data Science Center offers a unique combination of expertise in computer science, mathematics, management, behavioral sciences and social sciences; collaborations with leading international companies such as Google, Twitter and Yahoo; and local infrastructure and support for the analysis of very large datasets.

More information

Norvig Web Data Science Award 2014

Monday, May 19th, 2014, posted by Djoerd Hiemstra

The Norvig Web Data Science Award is organized by Common Crawl and SURFsara for researchers and students in the Benelux. SURFsara provides free access to their Hadoop cluster with a copy of the full Common Crawl web crawl from March 2014 - almost 3 billion web pages. Participants are completely free in choosing their research question. For example, last year there were submissions looking at concept association, connections between languages, readability and more. Be creative and think outside of the box!

The award is named after Peter Norvig, Director of Research at Google, who chairs the jury that will select the winning submission. The contest will run until July 31, 2014. The winning team will be announced at the award ceremony in September 2014 and will get a tablet, a smart watch and a GitHub small plan for a year.

Sign up on:

Keynote by Ravi Kumar

Thursday, May 23rd, 2013, posted by Djoerd Hiemstra

We are very proud that Ravi Kumar from Google agreed to give a keynote speech at the CTIT Symposium on Big Data and the Emergence of Data Science. Kumar, who is well-known for his work on web and data mining and algorithms for large data sets, has been a senior staff research scientist at Google since June 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! Research. He obtained his Ph.D. in Computer Science from Cornell University in 1998.
Ravi Kumar’s talk will cover two non-conventional computational models for analyzing big data. The first is data streams: in this model, data arrives in a stream and the algorithm is tasked with computing a function of the data without explicitly storing it. The second is map-reduce: in this model, data is distributed across many machines and computation is done as a sequence of map and reduce operations. Kumar will present a few algorithms in these models and discuss their scalability.
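
As a rough illustration of the two models, the Python sketch below computes a mean over a stream in a single pass with constant memory, and counts words using explicit map, shuffle, and reduce steps. These are textbook examples chosen for brevity, not the algorithms from the talk; the function names are hypothetical, and the map-reduce part runs on one machine and only mimics the phases of a distributed job.

```python
from collections import defaultdict
from itertools import chain


def streaming_mean(stream):
    """One pass, constant memory: update the mean as each item arrives."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count
    return mean


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def map_reduce_word_count(documents):
    # Shuffle: group the mapped pairs by key before reducing.
    groups = defaultdict(list)
    for word, one in chain.from_iterable(map_phase(d) for d in documents):
        groups[word].append(one)
    return dict(reduce_phase(w, c) for w, c in groups.items())


mean = streaming_mean(iter([2, 4, 6, 8]))
counts = map_reduce_word_count(["big data", "data streams", "big big data"])
print(mean, counts)
```

In a real cluster, each call to `map_phase` could run on a different machine, and the shuffle step would move all pairs with the same key to the machine that runs the corresponding reduce.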

The workshop takes place on Tuesday 4 June at the University of Twente. Other invited speakers at the CTIT symposium are Maarten de Rijke (U. Amsterdam) and Milan Petkovic (Philips).