Archive for the 'Uncategorized' Category

New team member: Mohammad Khelghati

Thursday, December 15th, 2011, posted by Djoerd Hiemstra
Mohammad Khelghati joined the database group to work on deep web entity monitoring. Welcome Mohammad!

Rianne Kaptein defends PhD thesis on Focused Retrieval

Monday, October 10th, 2011, posted by Djoerd Hiemstra

Effective Focused Retrieval by Exploiting Query Context and Document Structure

by Rianne Kaptein

The classic IR (Information Retrieval) model of the search process consists of three elements: query, documents and search results. A user looking to fulfill an information need formulates a query, usually consisting of a small set of keywords summarizing the information need. The goal of an IR system is to retrieve documents containing information which might be useful or relevant to the user. Throughout the search process there is a loss of focus, because keyword queries entered by users often do not suitably summarize their complex information needs, and IR systems do not sufficiently interpret the contents of documents, leading to result lists containing irrelevant and redundant information. The main research question of this thesis is how to exploit query context and document structure to provide more focused retrieval.

More info

PhD position: Deep Web Entity Monitoring

Monday, May 16th, 2011, posted by Djoerd Hiemstra

The Database Group of the University of Twente offers a PhD student position in the Dutch national project COMMIT, a 100M Euro project involving 10 universities and 70 companies. The program brings together leading researchers in search engines, parallel computing, databases, interaction in context, embedded systems and knowledge technology.

A large part of the web, the invisible web or deep web, cannot be indexed by web crawlers, for instance dynamic web pages that are returned in response to filling in a web form, or performing a search in a search engine. Instead of crawling deep web data, the approach will monitor web pages for certain (types of) queries. The objective is to develop approaches for monitoring web data that allow users to see a page’s full history of relevant/important changes by identifying entities: people, organizations, products, geographic locations, events, etc. The approach should relate changes in multiple web sites, giving the user a data-warehouse-like overview of the pages they monitor; drilling down to time periods, persons, events, etc.

The research will be done in co-operation with WCC. WCC, founded in 1996, is a successful software company based in Utrecht (NL) and Reston (USA). WCC’s current focus areas are the Employment and Identification Security markets. Both commercial and government customers worldwide use WCC’s smart search & match solutions to support their primary processes. Both WCC and the Database Group of the University of Twente have made significant advances in entity matching and entity ranking, applied to, for instance, Employment Matching and Expert Search. This project will extend this work to the monitoring of deep web pages, such as social networking sites, micro-blogging sites, job sites, etc. The candidate will spend part of the time at WCC in Utrecht.

[official vacancy text] (deadline: July 3rd, 2011)

Open source alternatives for Blackboard?

Thursday, November 11th, 2010, posted by Djoerd Hiemstra

Since 2009, the University of Twente has used Blackboard as its online learning management system. However, Blackboard turns out to be very insecure; see for instance the news item (in Dutch) Universiteitssoftware blijkt langdurig lek. Among other things, it is not only possible but actually easy for students to hack into a teacher’s account and invisibly change grades. As it turns out, this has been known amongst our students for quite some time.

Blackboard is a commercial system and its internals are a company secret. Kerckhoffs’ Principle states that a secure system must not require secrecy of its design, so that the system can fall into the hands of the enemy without causing trouble. In the design of software systems, this argument is used in favour of open source software security: security through obscurity is considered bad practice, see for instance Jaap-Henk Hoepman and Bart Jacobs’ Communications of the ACM article Increased security through open source (CACM 50-1, 2007). So, maybe it is time to look at some of the open source alternatives out there, such as Sakai or Moodle. Both come with commercial support, in case our technical university does not want to invest in the expertise to deploy such a system in-house.

Keith van Rijsbergen retired

Thursday, July 22nd, 2010, posted by Djoerd Hiemstra

Keith van Rijsbergen is retiring this year. To celebrate his long and successful career, you can download his book “Information Retrieval” in the popular epub format, an open format that is supported by most e-readers.

InformationRetrieval.epub

Since the publication in 1976 of the first edition of Van Rijsbergen’s book, it has established itself as a classic. The book gives a thorough introduction to “automatic ranked” retrieval, which today forms the basis of web search engines, but at that time was still highly experimental. The book covers all important information retrieval topics, but it is Van Rijsbergen’s personal view on information retrieval that makes the book so different from other scientific books on information retrieval: the book is written in the first person, a writing style I would normally not recommend for scientific documents. In this book, however, Van Rijsbergen’s personal style of writing inspired me a lot. Maybe it is his undisputed expertise, maybe it is his critical analysis of the work of others, or maybe it is merely his enthusiastic account of science; whatever it is, it is a pleasure to read the book, even almost 35 years after its first publication. Here is a nice example, where Van Rijsbergen shares his view on significance tests:

Unfortunately, I have to agree with the findings of the Comparative Systems Laboratory in 1968, that there are no known statistical tests applicable to IR. This may sound like a counsel of defeat but let me hasten to add that it is possible to select a test which violates only a few of the assumptions it makes.

His analysis led me to use the paired sign test in my PhD thesis, and I motivated this by adding that Van Rijsbergen says I am allowed to do so. (Actually, he claims I am allowed to do so only conservatively, because some of the test’s assumptions are not met…) The book is also a no-nonsense book in many respects, with many practical approaches that are directly applicable. In several of our experiments, we used the stop word list printed in the book (see Table 2.1). This is science in its best form. Experiments should be easily reproducible, and what is easier than using an officially published stop word list?
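
For readers who have not met the paired sign test, here is a minimal Python sketch of how it could be applied to two retrieval systems evaluated on the same queries. It is not code from the thesis or the book: the function name `paired_sign_test` and the per-query scores are made up for illustration, and the sketch assumes that ties are simply dropped and that a two-sided test is wanted.

```python
from math import comb


def paired_sign_test(scores_a, scores_b):
    """Two-sided paired sign test on per-query scores of two systems.

    Only the sign of each per-query difference is used; ties are dropped
    (an assumption of this sketch). Returns the p-value under the null
    hypothesis that either system is equally likely to win a query.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")

    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins_a + wins_b  # number of queries without a tie
    if n == 0:
        return 1.0  # no differences at all, nothing to reject

    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical average precision per query for two systems.
    system_a = [0.42, 0.31, 0.55, 0.60, 0.12, 0.48, 0.39, 0.27]
    system_b = [0.40, 0.25, 0.50, 0.61, 0.10, 0.41, 0.33, 0.20]
    print(paired_sign_test(system_a, system_b))
```

Because the test looks only at which system wins each query and ignores by how much, it makes few assumptions about the distribution of the scores, at the price of lower statistical power.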

So, if you are still looking for a good, personal, entertaining, no-nonsense, scientific book on information retrieval to be read by the pool during the holidays, please consider Information Retrieval. No e-reader yet? Then you can read the ebook using the EPUBReader Firefox addon.

[download epub]

Ralf Schimmel graduates on keyword suggestion

Monday, March 15th, 2010, posted by Djoerd Hiemstra

Keyword Suggestion for Search Engine Marketing

by Ralf Schimmel

Every person acquainted with the web is also a frequent user of search engines like Yahoo and Google. Any person with a web site makes this web site with a vision in mind; most of the time this entails being found on the web. Search engines offer several methods to users that help them to be found. One group of the techniques used in this field is Search Engine Optimization (SEO), which covers everything that can be done to optimize a web site for the search engine. The whole idea of SEO is to ensure that a web site is listed in the set of search results once a matching query is entered by a user. A second important part of the search engines is Search Engine Advertisement (SEA). Billions of dollars are paid by companies that bid on keywords that match their advertisements to a user’s query. These keywords are hard to find: of course a company knows what it sells, but it does not know how the users search for the same products or services. Advertising in search engines can be done in multiple ways. The focus of this research lies in finding many long-tail keywords, words that often have a low search volume, but which are cheap (low competition) and which are often specific enough to ensure high conversion rates (a visitor becomes a customer). Several keyword suggestion techniques are researched and evaluated for practical use. One applicable technique is chosen, implemented and evaluated. The chosen technique is a web-based technique that uses an undirected weighted graph of candidate terms (nodes), where the weight of each edge is the semantic similarity between the two nodes it connects, and where the term frequency of the term is stored in the node. The evaluation shows that it is a technique capable of suggesting a lot of relevant keywords that can be used for search engine marketing. According to the evaluation the technique is capable of using the term frequencies and the semantic similarities to find and rank suggestions based on popularity and relevance.
The most important conclusion is that, for single-term suggestions, the system outperforms Google’s suggestion system. Google’s precision on single-term suggestions is better than the precision of the new tool; however, the relative recall of Google is a lot worse, for both obvious and non-obvious single-term suggestions. Currently the tool can only be used to complement Google’s tool; however, once extended with support for multi-term suggestions it can replace the entire system.
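
To make the graph in the abstract above a bit more concrete, here is a small Python sketch of such a data structure: terms are nodes that store a term frequency, and undirected edges carry a semantic similarity. This is an illustration only and not the implementation from the thesis. The class name `TermGraph`, the example terms and numbers, and in particular the ranking rule (similarity multiplied by term frequency) are assumptions, since the abstract does not say how popularity and relevance are combined.

```python
from collections import defaultdict


class TermGraph:
    """Undirected weighted graph of candidate terms (illustrative sketch)."""

    def __init__(self):
        self.frequency = {}                   # term -> term frequency (popularity)
        self.similarity = defaultdict(dict)   # term -> {neighbour: similarity}

    def add_term(self, term, frequency):
        self.frequency[term] = frequency

    def add_edge(self, term_a, term_b, similarity):
        # The graph is undirected, so the weight is stored in both directions.
        self.similarity[term_a][term_b] = similarity
        self.similarity[term_b][term_a] = similarity

    def suggest(self, seed, limit=5):
        """Rank the neighbours of `seed` by similarity * frequency.

        The product is an assumed scoring rule for this sketch, not the
        formula used in the thesis.
        """
        neighbours = self.similarity.get(seed, {})
        scored = [
            (weight * self.frequency.get(term, 0), term)
            for term, weight in neighbours.items()
        ]
        # Highest score first; ties are broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [term for _, term in scored[:limit]]


if __name__ == "__main__":
    # Hypothetical terms, frequencies and similarities.
    graph = TermGraph()
    for term, freq in [("bike", 100), ("bicycle", 80), ("helmet", 10), ("car", 500)]:
        graph.add_term(term, freq)
    graph.add_edge("bike", "bicycle", 0.9)
    graph.add_edge("bike", "helmet", 0.6)
    graph.add_edge("bike", "car", 0.1)
    print(graph.suggest("bike"))
```

Note how a very frequent but weakly related term (here "car") can still rank above a rare but closely related one under this simple product; a real system would need to balance the two signals more carefully.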

[download pdf]

Searching in the free world

Wednesday, January 13th, 2010, posted by Djoerd Hiemstra

Google faced a cyber attack originating from computers in China that was serious enough for the company to send an ultimatum to the Chinese government:

…We have decided we are no longer willing to continue censoring our results on Google.cn, and so over the next few weeks we will be discussing with the Chinese government the basis on which we could operate an unfiltered search engine within the law, if at all…

See: Google’s blog.

Sander Bockting wins ENIAC thesis award

Monday, December 7th, 2009, posted by Djoerd Hiemstra

Sander Bockting has won this year’s ENIAC thesis award. ENIAC is the alumni association for former students of Computer Science, Business Information Technology and Telematics. Each year, ENIAC awards a prize for the best graduation thesis. The jury report reads:

The jury has decided to award the ENIAC thesis award 2009 to the thesis “Collection Selection for Distributed Web Search: Using Highly Discriminative Keys, Query-driven Indexing and ColRank” by Sander Bockting. The jury chose this thesis because of the relevance of the research, the scientific approach and the large ‘design’ component (the prototype Sophos) embodied in the work. In addition, Sander’s research offers a (possible) answer to the question of how to keep the internet accessible. Searching the internet, and the search engines that make it possible, fulfil a societal function in opening up information. Because of the strong growth of the internet, however, it is impossible to keep indexing the entire internet centrally. Moreover, this method gives a great deal of power to the owners of a few central search engines. Sander shows that applying distributed search systems is a promising approach, which can potentially open up data better while reducing the dependence on a few central search engines. The five techniques he compared are therefore an excellent basis for follow-up research that is relevant to both society and science.

Searching in the 21st Century

Thursday, November 26th, 2009, posted by Djoerd Hiemstra

Information retrieval (IR) can be defined as the process of representing, managing, searching, retrieving, and presenting information. Good IR involves understanding information needs and interests, developing an effective search technique, system, presentation, distribution and delivery. The increased use of the Web and wider availability of information in this environment led to the development of Web search engines. This change has brought fresh challenges to a wider variety of users’ needs, tasks, and types of information. Today, search engines are seen in enterprises, on laptops, in individual websites, in library catalogues, and elsewhere. Information Retrieval: Searching in the 21st Century focuses on core concepts, and current trends in the field. This book focuses on:

  • Information Retrieval Models
  • User-centred Evaluation of Information Retrieval Systems
  • Multimedia Resource Discovery
  • Image Users’ Needs and Searching Behaviour
  • Web Information Retrieval
  • Mobile Search
  • Context and Information Retrieval
  • Text Categorisation and Genre in Information Retrieval
  • Semantic Search
  • The Role of Natural Language Processing in Information Retrieval: Search for Meaning and Structure
  • Cross-language Information Retrieval
  • Performance Issues in Parallel Computing for Information Retrieval
This book is an invaluable reference for graduate students on IR courses or courses in related disciplines (e.g. computer science, information science, human-computer interaction, and knowledge management), academic and industrial researchers, and industrial personnel tracking information search technology developments to understand the business implications. Intermediate-advanced level undergraduate students on IR or related courses will also find this text insightful. Chapters are supplemented with exercises to stimulate further thinking.

More information at Wiley.

Susan Dumais won the Salton award

Friday, August 7th, 2009, posted by Djoerd Hiemstra

Sue Dumais won the Salton award, and gave a terrific keynote talk at the SIGIR Conference in Boston entitled “An Interdisciplinary Perspective on Information Retrieval”. Susan was awarded for “nearly thirty years of significant, sustained, and continuing contributions to research, for exceptional mentorship, and for leadership in bridging the fields of information retrieval and human computer interaction. Her contributions to both the theoretical development and practical implementations of Latent Semantic Indexing, question-answering, desktop search, combining search and navigation, and incorporating the user and their context, have all substantially advanced and enriched the field of Information Retrieval.”

More info at ACM SIGIR.