## Russian Summer School in Information Retrieval

I will give several lectures on information retrieval modeling at the Russian Summer School in Information Retrieval, which will be held September 11-16, 2009 in Petrozavodsk, Russia. The main audience of the school is graduate and post-graduate students, young scientists, and professionals with experience in developing information retrieval applications. The school will host approximately 100 participants.

**Information Retrieval Modeling**

There is no such thing as a dominant model or theory of information retrieval, unlike, for instance, the field of databases, where the relational model dominates. In information retrieval, some models work for some applications, whereas others work for other applications. For instance, vector space models are well-suited for similarity search and relevance feedback in many (also non-textual) situations if a good weighting function is available; the probabilistic retrieval model or naive Bayes model might be a good choice if examples of relevant and non-relevant documents are available; Google’s PageRank model is often used in situations that need modeling of more or less static relations between documents; region models have been designed to search structured text; and language models are helpful in situations that require models of language similarity or document priors.

In this tutorial, I carefully describe all these models by explaining the consequences of the modeling assumptions. I address approaches based on statistical language models in greater depth. After the course, students will be able to choose a model of information retrieval that is adequate in new situations, and to apply the model in practice.
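As a small illustration of the first of these, a vector space model ranks documents by the cosine between tf-idf vectors. The toy collection, query, and weighting below are my own minimal sketch, not material from the tutorial:

```python
import math
from collections import Counter

# Toy collection; tf-idf weighting and cosine similarity are the
# classic vector space ingredients.
docs = [
    "information retrieval models",
    "relational databases and models",
    "language models for retrieval",
]
query = "retrieval models"

# Document frequency of each term (number of documents containing it).
df = Counter(t for d in docs for t in set(d.split()))

def tf_idf(text, df, n_docs):
    """Weight each term by term frequency times log inverse document frequency."""
    tf = Counter(text.split())
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

vecs = [tf_idf(d, df, len(docs)) for d in docs]
qvec = tf_idf(query, df, len(docs))

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

ranking = sorted(range(len(docs)), key=lambda i: cosine(qvec, vecs[i]), reverse=True)
print([docs[i] for i in ranking])
```

Note how the weighting function does the real work here: "models" occurs in every document, so its idf is zero and only "retrieval" discriminates between documents.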

More information at RuSSIR 2009.

September 21st, 2009 at 12:19 pm

A small photo impression of the Russian Summer School in Information Retrieval:

A nice canal in St. Petersburg

Last minute organizational phone call by Pavel

The Hermitage

6:00 am local time: Jimmi arrives in Petrozavodsk (4 hours before lectures)

Boat trip with Sergey, Pavel, and Kseniya

Karelian folk music and dance at the banquet

Sunday trip: decorated tree

Mineral water…

…no tap water

tastes rather, err, minerally

Me and a waterfall

Wooden church

This guy is still here

Back in St. Petersburg: Pavel in front of the Church of the Savior on Blood.

October 29th, 2009 at 11:40 am

In response to follow-up questions on my last lecture, in which I did a live Expectation Maximization (EM) training on the blackboard, let me explain how it is done. Off the top of my head, I had the following model (LaTeX-like equations):

P(T|D) = \sum_{A} P(T|A) \, P(A|D)

where P(T|A) is the author language model, and P(A|D) is 1 over the number of authors of the document. We’re going to use EM to estimate the best author language models, and the best author models are those that maximize the probability of the data (the documents), so the likelihood function is something like:

\prod_{D} \prod_{T \in D} \sum_{A} P(T|A) \, P(A|D)

EM will give those estimates for P(T|A) that maximize the likelihood function. I assumed the documents are independent (product over documents), and so are the terms given the documents (product over terms). Since we’re estimating P(T|A), for every pair (term, author) in a document there are two possibilities: the term was either written by that author (= 1 occurrence) or not (= 0 occurrences, i.e., it was written by one of the other authors of the document). I do this for every occurrence of a term in a document. So the steps are:

- E-step: for each occurrence of term T in document D, give each author A of D the expected count P(T|A) P(A|D) / \sum_{A'} P(T|A') P(A'|D), summing over the authors A' of D.
- M-step: normalize the expected counts of each author model to obtain new estimates of P(T|A).

The M-step simply normalizes the probabilities of each author model. I included a little Perl script to show what happens.
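The steps above can be sketched as follows. This is a minimal Python rendition of the same EM loop (the toy corpus and variable names are my own, not the original Perl script):

```python
from collections import defaultdict

# Toy corpus: each document is (list of terms, list of authors).
docs = [
    (["ir", "models", "ranking"], ["alice", "bob"]),
    (["ranking", "evaluation"], ["bob"]),
    (["ir", "evaluation", "models"], ["alice"]),
]

# Initialize each author language model P(T|A) uniformly over the vocabulary.
vocab = sorted({t for terms, _ in docs for t in terms})
authors = sorted({a for _, auths in docs for a in auths})
p_t_a = {a: {t: 1.0 / len(vocab) for t in vocab} for a in authors}

for _ in range(50):
    # E-step: distribute each term occurrence over the document's authors
    # in proportion to P(T|A) * P(A|D), with P(A|D) = 1 / number of authors.
    counts = {a: defaultdict(float) for a in authors}
    for terms, auths in docs:
        p_a_d = 1.0 / len(auths)
        for t in terms:
            norm = sum(p_t_a[a][t] * p_a_d for a in auths)
            for a in auths:
                counts[a][t] += (p_t_a[a][t] * p_a_d) / norm
    # M-step: normalize the expected counts of each author model.
    for a in authors:
        total = sum(counts[a].values())
        p_t_a[a] = {t: counts[a][t] / total for t in vocab}

for a in authors:
    print(a, {t: round(p, 3) for t, p in p_t_a[a].items()})
```

Each iteration increases the likelihood of the documents; terms that an author uses alone (like "evaluation" for the single-author documents above) gradually pull probability mass into that author's model.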

Picture taken by Alexander.

April 26th, 2010 at 12:03 pm

All lectures are on-line now at: http://videolectures.net/russir09_petrozavodsk/