Russian Summer School in Information Retrieval

I will give several lectures on information retrieval modeling at the Russian Summer School in Information Retrieval, which will be held September 11-16, 2009 in Petrozavodsk, Russia. The main audience of the school is graduate and post-graduate students, young scientists, and professionals with experience in developing information retrieval applications. The school will host approximately 100 participants.

Information Retrieval Modeling
There is no single dominating model or theory of information retrieval, unlike the situation in, for instance, the field of databases, where the relational model dominates. In information retrieval, some models work for some applications, whereas others work for other applications. For instance, vector space models are well suited for similarity search and relevance feedback in many (also non-textual) situations if a good weighting function is available; the probabilistic retrieval model or naive Bayes model might be a good choice if examples of relevant and non-relevant documents are available; Google’s PageRank model is often used in situations that need modelling of more or less static relations between documents; region models have been designed to search structured text; and language models are helpful in situations that require models of language similarity or document priors. In this tutorial, I carefully describe all these models by explaining the consequences of their modelling assumptions. I address approaches based on statistical language models in great depth. After the course, students will be able to choose a model of information retrieval that is adequate in new situations, and to apply that model in practice.

More information at RuSSIR 2009.

3 Responses to “Russian Summer School in Information Retrieval”

  1. Djoerd Hiemstra Says:

    A small photo impression of the Russian Summer School in Information Retrieval:

    A nice canal in St. Petersburg

    Last minute organizational phone call by Pavel

    The Hermitage

    6:00 am local time: Jimmi arrives in Petrozavodsk (4 hours before lectures)

    Boat trip with Sergey, Pavel, and Kseniya

    Karelian folk music and dance at the banquet

    Sunday trip: decorated tree

    Mineral water…

    …no tap water

    tastes rather, err, minerally

    Me and a waterfall

    Wooden church

    This guy is still here

    Back in St. Petersburg: Pavel in front of the Church of the Savior on Blood.

  2. Djoerd Hiemstra Says:

    In response to follow-up questions on my last lecture, in which I did a live Expectation Maximization training on the blackboard, let me explain how it is done. Off the top of my head, I had the following model (LaTeX-like equations):

    P(T|D) = \sum_{A \in Authors of D} P(T|A)P(A|D)

    where P(T|A) is the author language model, and P(A|D) is 1 over the number of authors. We’re going to use EM to estimate the best author language models, and the best author models are those that optimize the probability of the data (the documents), so the likelihood function is something like:

    \prod_{D \in Documents} \prod_{T \in D} ( \sum_{A \in D} P(T|A)P(A|D) )^{freq(T,D)}

    EM will give those estimates for P(T|A) that maximize the likelihood function. I assumed the documents are independent (product over documents), and so are the terms given the documents (product over terms). Since we are estimating P(T|A), for every pair (term, author) in a document there are two possibilities: the term was either written by that author (= 1 occurrence), or not (= 0 occurrences, i.e., it was written by one of the other authors of the document). I do this for every occurrence of a term in a document. So the steps are:

    E-step: e_{T,A} = \sum_D ( freq(T, D) * ( P(T|A)P(A|D) / ( \sum_{A’ \in D} P(T|A’)P(A’|D) ) ) )
    M-step: P_{new}(T|A) = e_{T,A} / ( \sum_{T’} e_{T’,A} )

    The M-step simply normalizes the probabilities of each author model. I included a little Perl script to show what happens.
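    Since the original Perl script is not reproduced here, the E- and M-steps above can be sketched in Python instead. This is a minimal sketch: the toy corpus, author names, and iteration count are all made up for illustration, and P(A|D) is taken as uniform over a document's authors, as in the model above.

    ```python
    from collections import Counter, defaultdict

    # Hypothetical toy corpus: each document is (authors, terms).
    docs = [
        (["alice", "bob"], "the cat sat on the mat".split()),
        (["alice"], "the dog ate the cat".split()),
        (["bob"], "information retrieval models".split()),
    ]

    authors = sorted({a for auths, _ in docs for a in auths})
    vocab = sorted({t for _, terms in docs for t in terms})

    # Initialize the author language models P(T|A) uniformly.
    p_t_a = {a: {t: 1.0 / len(vocab) for t in vocab} for a in authors}

    for _ in range(50):  # EM iterations (fixed count for the sketch)
        e = {a: defaultdict(float) for a in authors}  # expected counts e_{T,A}
        for auths, terms in docs:
            p_a_d = 1.0 / len(auths)  # P(A|D): uniform over the document's authors
            for t, freq in Counter(terms).items():
                denom = sum(p_t_a[a][t] * p_a_d for a in auths)
                # E-step: divide each term occurrence over the possible authors,
                # in proportion to P(T|A)P(A|D).
                for a in auths:
                    e[a][t] += freq * p_t_a[a][t] * p_a_d / denom
        # M-step: renormalize each author model over the vocabulary.
        for a in authors:
            total = sum(e[a].values())
            p_t_a[a] = {t: e[a][t] / total for t in vocab}
    ```

    After a few iterations, terms that occur only in single-author documents are attributed entirely to that author, while terms from co-authored documents are shared according to the current models.
    
    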

    Me teaching at RuSSIR
    Picture taken by Alexander.

  3. Djoerd Hiemstra Says:
    All lectures are on-line now at: