Eurolan '97 Summer School on Corpus Linguistics

From the 13th of July till the 26th of July 1997 I attended the Eurolan Summer School on Corpus Linguistics in Tusnad, Romania. On this page you can find a little report of the serious stuff I did there.

The Eurolan Summer School was organised for the third time in its history and is getting bigger and bigger. This time there were 15 faculty members giving lectures to 75 students from 13 different European countries. Topics covered by the Summer School were: Corpus Annotation, Word-sense Disambiguation, Lexicography, Discourse Linguistics, Statistical methods, Grammar Engineering and Finite State methods. In the remainder of this report a short description will be given of the most striking lectures per topic.

Corpus Annotation and Sense Disambiguation

The lecture by Tomaž Erjavec from the Jožef Stefan Institute, Ljubljana, covered the use of SGML (Standard Generalised Mark-up Language) for corpus annotation. SGML was used in the Multext East project for the annotation of parallel versions of Orwell's 1984, of fiction and of newspapers. Dan Tufis from the Romanian Academy, Bucharest, also presented work on Multext East.

Nancy Ide from Vassar College, USA, and the University of Aix-en-Provence, France, also presented work on corpus annotation. Ide mentioned standards developed in the Text Encoding Initiative (TEI) project and the Corpus Encoding Standard (CES) project. Ide also gave an extensive overview (from the sixties until now) of work on word sense disambiguation, which will be published in the journal Computational Linguistics in early 1998.

Lexicography


John Sinclair from the University of Birmingham presented his work on the corpus-based Cobuild dictionaries. Cobuild uses large corpora to extract concordances of words or phenomena. Sinclair distinguishes five 'levels of meaning': the core, which is a single word or phrase; collocation, which is physical co-occurrence; colligation, which is grammatical co-occurrence; semantic preference, which covers regularities of word choice; and prosody, which covers pragmatic regularities.
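
To make the notion of a concordance concrete, here is a minimal keyword-in-context sketch in Python. It is not the Cobuild software: the function name `concordance`, the crude tokenisation and the sample text are all assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Tokenisation is a crude lower-cased word split; a real corpus tool
    would use a proper tokeniser.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Invented sample text, not taken from any real corpus.
sample = ("The naked eye cannot see it. "
          "It is visible to the naked eye on a clear night.")
hits = concordance(sample, "eye")
for left, keyword, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{keyword}] {right}")
```

Lining up the keyword in a column is what lets a lexicographer spot recurring neighbours, i.e. candidate collocations, at a glance.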

Nicoletta Calzolari from the University of Pisa presented some aspects of the management of multilingual computational lexicons, in particular building multilingual lexicons from machine-readable dictionaries (MRDs) and parallel corpora. The European project SPARKLE (Shallow Parsing and Knowledge extraction for Language Engineering) will use shallow parsing for (semi-)automatic lexicon acquisition and word sense disambiguation for English, French, German and Italian. Companies like Xerox and Sharp will use technology developed in SPARKLE to build pilot systems for multilingual information retrieval.

Discourse Linguistics

Massimo Poesio from the University of Edinburgh presented the collection and annotation of a dialogue corpus in the Maptask project. In Maptask, dialogues are collected by giving two people slightly different maps and instructing them to guide each other to a goal. The linguistic interpretation of the Maptask corpus is (partially) automated for time stamps, speech segmentation, part-of-speech tagging, syntactic analysis and speech acts. The annotation of speech acts is especially difficult: the set of speech acts must be chosen in such a way that human annotators assign them consistently. This consistency can be evaluated with the kappa statistic, which measures agreement between annotators corrected for the agreement expected by chance. Laurent Romary from CRIN-CNRS, Nancy, also presented work on the annotation of spoken dialogues.
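
As an illustration of the kappa statistic, the sketch below computes Cohen's kappa for two annotators. This is one common variant; the lecture did not specify which variant is used for Maptask. The speech-act labels and the annotations are invented for the example.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty annotation lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical speech-act labels for ten utterances.
annotator_a = ["instruct"] * 4 + ["query"] * 3 + ["acknowledge"] * 3
annotator_b = ["instruct"] * 3 + ["query"] * 3 + ["acknowledge"] * 4
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # about 0.70
```

Here the annotators agree on 8 of 10 utterances (0.80), but 0.33 agreement would be expected by chance, so kappa is (0.80 - 0.33) / (1 - 0.33), roughly 0.70.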

Statistical methods

Martin Rajman from the Swiss Federal Institute of Technology in Lausanne (EPFL) presented work on Statistical Context-Free Grammars, Hidden Markov Models and Data Oriented Parsing. At EPFL interesting work is being done on comparing taggers from different companies. The EPFL group will probably be the first to verify Rens Bod's results on Data Oriented Parsing.

Grammar Engineering

Paola Monachesi from the University of Tübingen and Liviu Ciortuz from DFKI both presented work on HPSG (Head-driven Phrase Structure Grammar), for Italian and Romanian respectively.

Hans Uszkoreit from DFKI and the University of Saarbrücken presented work on grammar development and evaluation. For grammar development DFKI developed the PAGE system. For grammar evaluation DFKI developed the TSNLP test suites, which consist of annotated example sentences that are representative of certain language phenomena. Test suites are considered to be competence data. For grammar engineering, 'performance data' will also be used for evaluation, i.e. linguistically interpreted 'real-life' data.

Aravind Joshi from the University of Pennsylvania presented work on Lexicalised Tree Adjoining Grammars (LTAG). In a lexicalised grammar every rule is associated with a lexical item (a word). Parsing with LTAG is difficult because a derived tree may have several derivations. An alternative way of 'parsing' with LTAG is to tag each word or lexical item with a partial tree using standard Hidden Markov techniques, so-called supertagging.
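
The idea behind supertagging can be sketched with a toy bigram Hidden Markov Model and Viterbi decoding. This is not the system presented in the lecture: the three supertag names stand in for elementary trees, and all probabilities are invented for the example.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               tag chosen for word i - 1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Hypothetical supertags: NP = noun phrase tree, TV = transitive verb tree,
# IV = intransitive verb tree. All numbers below are made up.
tags = ["NP", "TV", "IV"]
start_p = {"NP": 0.8, "TV": 0.1, "IV": 0.1}
trans_p = {
    "NP": {"NP": 0.1, "TV": 0.5, "IV": 0.4},
    "TV": {"NP": 0.9, "TV": 0.05, "IV": 0.05},
    "IV": {"NP": 0.2, "TV": 0.4, "IV": 0.4},
}
emit_p = {
    "NP": {"dogs": 0.5, "cats": 0.5},
    "TV": {"chase": 1.0},
    "IV": {"chase": 0.6, "bark": 0.4},
}
supertags = viterbi(["dogs", "chase", "cats"], tags, start_p, trans_p, emit_p)
print(supertags)
```

In this toy model 'chase' is ambiguous between the transitive and the intransitive tree; the following noun makes the transitive tree the more probable choice, so most of the disambiguation is done before any actual parsing takes place.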

Finite State methods

Jean-Pierre Chanod from Xerox Research Centre in Grenoble presented work on Finite State methods. At Xerox, Finite State Transducers are used for Tokenisation, Morphological Analysis, Part of Speech Tagging and Shallow Parsing. More information on Finite State Tools at Xerox can be found at: