Archive for 2013

Empirical Training for Conditional Random Fields

Tuesday, February 26th, 2013, posted by Djoerd Hiemstra

A Closed Form Maximum Likelihood Estimator Of Conditional Random Fields

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers and Andreas Wombacher

Training Conditional Random Fields (CRFs) can be very slow for big data. In this paper, we present a new training method for CRFs called Empirical Training, which is motivated by the concept of co-occurrence rate. We show that standard (unregularized) training can have many maximum likelihood estimates (MLEs). Empirical training has a unique closed-form MLE, which is also an MLE of standard training. We are the first to identify the Test Time Problem of standard training, which may lead to low accuracy. Empirical training is immune to this problem. Empirical training is also unaffected by the label bias problem even though it is locally normalized. All of these claims have been verified by experiments. Experiments also show that empirical training reduces the training time from weeks to seconds, and obtains results competitive with standard and piecewise training on linear-chain CRFs, especially when data are insufficient.
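The abstract does not define co-occurrence rate. As a rough illustration only, the sketch below assumes the common definition CR(a; b) = P(a, b) / (P(a) P(b)) and estimates it for adjacent labels by relative frequency. The function name `cooccurrence_rates` and the toy label sequences are hypothetical, and this is not the estimator from the paper. It only shows why quantities of this kind can be read off from counts in a single pass, without iterative optimization.

```python
from collections import Counter


def cooccurrence_rates(sequences):
    """Estimate CR(a; b) = P(a, b) / (P(a) * P(b)) for adjacent label pairs.

    Assumption: probabilities are relative frequencies over adjacent
    positions, so P(a) is the fraction of pairs whose left label is a
    and P(b) is the fraction of pairs whose right label is b.
    """
    pairs = Counter()  # counts of (left label, right label)
    left = Counter()   # counts of left labels
    right = Counter()  # counts of right labels
    for labels in sequences:
        for a, b in zip(labels, labels[1:]):
            pairs[(a, b)] += 1
            left[a] += 1
            right[b] += 1
    total = sum(pairs.values())
    # (n / total) / ((left / total) * (right / total)) simplifies to
    # n * total / (left * right).
    return {
        (a, b): n * total / (left[a] * right[b])
        for (a, b), n in pairs.items()
    }


if __name__ == "__main__":
    # Hypothetical toy label sequences (D = determiner, N = noun, V = verb).
    toy = [["D", "N"], ["D", "N"], ["N", "V"], ["D", "V"]]
    for pair, rate in sorted(cooccurrence_rates(toy).items()):
        print(pair, round(rate, 3))
```

A rate above 1 means the two labels appear next to each other more often than independence would predict; a rate below 1 means less often.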

[download pdf]

The Winners of The Norvig Web Data Science Award

Tuesday, February 26th, 2013, posted by Lisa Green

by Lisa Green (Common Crawl)

We are very excited to announce the winners of the Norvig Web Data Science Award: Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente! The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science and is named in honor of distinguished computer scientist Peter Norvig.

There were many excellent submissions that demonstrated how you can extract valuable insight and knowledge from web crawl data. Be sure to check out the work of the winning team, Traitor – Associating Concepts Using The World Wide Web, and the other finalists on the award website. You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus. All code is open source and we are looking forward to seeing it reused and adapted for other projects.

A huge thank you to our distinguished panel of judges: Peter Norvig, Ricardo Baeza-Yates, Hilary Mason, Jimmy Lin, and Evert Lammerts!

Added on 18 March: Award winners Oliver Jundt, Wanno Drijfhout, and Lesley Wevers with their prize: a high-end Android tablet!

Snippet-based Relevance Predictions

Wednesday, February 13th, 2013, posted by Djoerd Hiemstra

Snippet-based Relevance Predictions for Federated Web Search

by Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Chris Develder, and Djoerd Hiemstra

How well can the relevance of a page be predicted, purely based on snippets? This would be highly useful in a Federated Web Search setting where caching large amounts of result snippets is more feasible than caching entire pages. The experiments reported in this paper make use of result snippets and pages from a diverse set of actual Web search engines. A linear classifier is trained to predict the snippet-based user estimate of page relevance, but also, to predict the actual page relevance, again based on snippets alone. The presented results confirm the validity of the proposed approach and provide promising insights into future result merging strategies for a Federated Web Search setting.

The paper will be presented at the 35th European Conference on Information Retrieval (ECIR) on 25 March 2013 in Moscow, Russia.

[download pdf]

DIY online lecture

Wednesday, January 30th, 2013, posted by Djoerd Hiemstra

Education is changing rapidly. Many universities are starting to provide their courses online, free for everyone to take. To keep up with these developments, I made a short do-it-yourself (DIY) video lecture (appropriately, one discussing online education) that combines four techniques: 1) an ordinary lecture with slides, as always; 2) my talking head from the webcam; 3) written notes on the slides, “Khan Academy-style”; 4) a screencast, capturing live actions on the screen.

In case you would like to do this too: the video was recorded using recordMyDesktop, CamDesk, and Xournal, and edited with OpenShot, all running on Ubuntu. I used a webcam to record myself, and a pen tablet (Wacom Bamboo) to annotate the slides. As can be seen, the result is not great: the frame rate is low (the default in recordMyDesktop is 15 frames per second), the audio quality is low (I need a better microphone), and audio and video are slightly out of sync (I am not sure why this happened). On Windows, a combination of CamStudio, PowerPoint, and Movie Maker might do better.

Many thanks to the following people who advised me or otherwise supported me in making the test online lecture: Wanno Drijfhout & Marije de Heus (Course Managing Big Data), Theo Huibers & Eelco Eerenberg (Thaesis), Tonnie Tibben (Twente iTunes U), Alfred de Vries (SmartXP lab), Peter de Boer & Roy Juninck (FB lecture halls and digital whiteboards). Additional advice and comments are very much appreciated.

Google Online Marketing Challenge

Wednesday, January 23rd, 2013, posted by Djoerd Hiemstra

Interested in online advertising and marketing? Together with Inter-Actief we will run a second science challenge in the next quarter, from 11 February to 4 April. With a US$250 budget provided by Google, students will develop an online advertising strategy for a real business or non-profit organization that has not used Google’s AdWords in the last six months. The winners will receive a trip to the Google Headquarters in Mountain View, California to meet with the AdWords team. For more information, and to enroll, visit

Also, see the Google Online Marketing Challenge page.

DIR 2013 in Delft

Wednesday, January 9th, 2013, posted by Djoerd Hiemstra

On 26 April, the 13th edition of the Dutch-Belgian Information Retrieval Workshop series, DIR 2013, will be hosted at Delft University of Technology in the Netherlands. DIR invites novel, previously unpublished work, compressed presentations of previous major international contributions, as well as demonstrations of applied research and industry applications. The workshop serves as a forum for exchange and discussion on relevant challenges in the fields of information retrieval, data mining and natural language processing.

More information at:

In memory of Joost van Honschoten

Thursday, January 3rd, 2013, posted by Djoerd Hiemstra

Today would have been the 41st birthday of Joost van Honschoten, who passed away almost 2 years ago. Joost was a talented young researcher, holding grants from STW and NWO, working as a professor at the Transducers Science and Technology Group of the University of Twente. Joost and I published several “papers” together around 1983, not as researchers, but as comic book writers when we were about 11 and 12 years old. One of them, “Honne & Ponnie en de Jacht op Ruige Robbie”, can be downloaded from the link below. The comic gives an idea of the friendship, creativity and humour that we shared.

Honne en Ponnie en de Jacht op Ruige Robbie