B.W. van Schooten
19 May 1998. Definitely a draft.
This part describes and comments on PARADISE, and discusses the problems that were found when trying to use it to evaluate PADIS.
The goal of PARADISE [WHF+97][WLKA97], developed at AT&T, is to allow some basic analysis of a dialogue strategy, or comparison of different (competing) dialogue strategies, even among different tasks. In accordance with the competitive, goal-oriented nature of commercial companies, the emphasis of this evaluation method lies on overall performance comparison of different (sub)strategies.
The evaluation uses a task success metric based on the Kappa value, and supplies a formula into which one's own task cost metrics can be filled in along with the task success metric; the formula is then correlated against a separate subjective measurement of user satisfaction. The resulting equation may be used to predict and/or compare the performance of specific subdialogues or other strategies, in order to gain some insight into which parts of the evaluated system are succeeding or failing, or to choose between competing strategies.
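Kappa is typically used as a measure of agreement between different information sources.

\[ \kappa = \frac{P(A) - P(E)}{1 - P(E)} \]

with $P(A)$ the proportion of cases in which the sources agree, and $P(E)$ the proportion of cases in which they would be expected to agree by chance.
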
In the case of PARADISE, the 'agreement' in a NL dialogue is the agreement between the contents of a piece of information sent by a user and the contents as it was eventually interpreted by the system. This assumes that the main objective of the dialogue is to pass specific, clearly identifiable items of information.
These items of information are identified by determining an Attribute-Value Matrix (AVM). Each row in the matrix stands for one attribute of information which can be seen as clearly identifiable in the task domain. In the left column, each attribute type is named, while in the right column, the set of possible values for each attribute type is listed.
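Using the AVM and a corpus which contains the user's goal and the system's interpretation at the end of each dialogue, a confusion matrix $M$ can be determined. This is a square matrix of dimension $n \times n$, with

$n$ the total number of values over all attributes in the AVM, and $M_{ij}$ the number of times the user intended the value with index $j$ and the system interpreted it as the value with index $i$.
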
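Using $M$, $P(A)$ in the $\kappa$ equation can be determined using the obvious

\[ P(A) = \frac{\sum_{i=1}^{n} M_{ii}}{T} \]

with $T$ the sum of all entries of $M$.

In case we assume the probability distribution of the different attribute values when guessed randomly by the system equals the distribution of the attribute values the users are trying to convey among all the different tasks (which is the best guess the system can make if it can't understand its users at all), $P(E)$ turns out as

\[ P(E) = \sum_{i=1}^{n} \left( \frac{t_i}{T} \right)^2 \]

where $t_i$ is the sum of column $i$ of $M$, i.e. the number of times users tried to convey value $i$.
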
The basic idea behind using $\kappa$ is to be able to correct for differences in confusion matrices, and hence, for differences in tasks. However, note that $P(E)$ is typically very small for $M$s of any serious size: if we assume that each value occurs about equally often, $P(E) \approx 1/n$.
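The calculation of the success metric can be sketched in a few lines of Python. This is only a minimal sketch under assumptions of my own: the matrix is a list of rows, rows stand for interpreted values and columns for intended values, and the function names are hypothetical. It uses the usual definitions, i.e. P(A) as the diagonal sum over the total, P(E) as the sum of squared column proportions, and kappa as (P(A) - P(E)) / (1 - P(E)).

```python
def agreement(M):
    """P(A): fraction of all counts that lie on the diagonal of M."""
    total = sum(sum(row) for row in M)
    return sum(M[i][i] for i in range(len(M))) / total


def chance_agreement(M):
    """P(E): sum over columns of (column total / grand total) squared."""
    total = sum(sum(row) for row in M)
    column_totals = [sum(row[j] for row in M) for j in range(len(M))]
    return sum((t / total) ** 2 for t in column_totals)


def kappa(M):
    """Kappa for a square confusion matrix M.

    Assumption: M[i][j] counts how often intended value j was
    interpreted as value i. Fails with ZeroDivisionError when
    P(E) is 1, i.e. when only one value ever occurs.
    """
    p_a = agreement(M)
    p_e = chance_agreement(M)
    return (p_a - p_e) / (1 - p_e)


# Two values, 40 of 50 items understood correctly.
example = [[20, 5],
           [5, 20]]
print(kappa(example))
```

For a matrix in which all n values are intended equally often, chance_agreement returns 1/n, so for large n the value of kappa stays close to P(A) and hardly corrects for anything.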
The article [WLKA97] notes this, and offers a couple of alternatives:
1. ... $\kappa$s for each single attribute and average these
2. ... $\kappa$s can be calculated.
Except for identifying attributes with a very limited range of values (like yes-no questions), or with a very-often occurring default value, $\kappa$ will still not account for values that are easier or less easy to confuse with others for other reasons, for example, because certain words are pronounced in nearly the same way, or because the typical structure of sentences in the problem domain makes values easier or harder to interpret or disambiguate. So, this probably means that one should be careful with interpreting the results of success across different tasks.
Also, the confusion over the subject or meaning of a (sub)dialogue is not explicitly accounted for: the AVM approach only accounts for data passed and not for control of the dialogue. At best, the existence of very high confusions across attributes may indicate some things went wrong at the control level. It might be possible to add some control aspects to the AVM by adding topic selections as attributes. The idea proposed in alternative 2 seems to go some way into this direction, though no concrete examples in this direction are available.
Another incompleteness of the $\kappa$ metric became clear after a personal discussion with Dr. Eckert: it is not clear what one should do in case a piece of information was communicated wrongly but was actually received correctly.
Actually, there seems to be a slight error in the definition of M: attributes that were never communicated (for example, because the dialogue came to an untimely end) are not accounted for. Several solutions for this are conceivable, but probably the simplest solution is adding an extra row to M, in which all omissions or unclear utterances of the users can be collected. This also corresponds neatly to the one that was suggested for dealing with the case of more concepts in the total system than are used in the user tasks ([Bou98], see also 2.1.1).
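The extra-row idea can be illustrated with a small Python sketch. Everything concrete in it is an assumption of mine: the dictionary representation of the AVM and of the corpus, the example attributes, and the convention that rows stand for interpreted values and columns for intended values.

```python
def confusion_matrix(avm, corpus):
    """Build M with one extra row for attributes that never arrived.

    avm:    dict mapping attribute name -> list of possible values
    corpus: list of (goal, interpreted) pairs of dicts, one per dialogue
    Returns (labels, M); M has len(labels) + 1 rows and len(labels)
    columns, the last row collecting omissions.
    """
    labels = [(a, v) for a, values in avm.items() for v in values]
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    M = [[0] * n for _ in range(n + 1)]
    for goal, interpreted in corpus:
        for attribute, intended in goal.items():
            column = index[(attribute, intended)]
            got = interpreted.get(attribute)
            # Row n is the extra row: the value was never communicated.
            row = n if got is None else index[(attribute, got)]
            M[row][column] += 1
    return labels, M


# Hypothetical two-attribute AVM and a corpus of two dialogues.
avm = {"name": ["Smith", "Jones"], "group": ["speech", "vision"]}
corpus = [
    ({"name": "Smith", "group": "speech"}, {"name": "Smith", "group": "vision"}),
    ({"name": "Jones", "group": "vision"}, {"name": "Jones"}),  # group never arrived
]
labels, M = confusion_matrix(avm, corpus)
```

With the extra row, the column totals still count how often each value was intended, so dialogues that ended early lower P(A) instead of silently dropping out of the matrix.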
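General cost metrics $c_i$ can be determined, which can then be incorporated in the performance calculation, along with the $\kappa$ metric, resulting in

\[ \mathrm{Performance} = \alpha \, N(\kappa) - \sum_i w_i \, N(c_i) \]

with $\alpha$ and $w_i$ weights, and $N$ the Z-score normalisation function, $N(x) = (x - \bar{x}) / \sigma_x$, in which $\bar{x}$ is the mean and $\sigma_x$ the standard deviation of $x$.
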
The Z-score normalisation normalises the results to have mean 0 and standard deviation 1. It is used to normalise for differences in scales or units among the different metrics $c_i$.
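As an illustration, the performance calculation could look as follows in Python. This is a sketch under assumptions of my own: the weights are simply given, the population standard deviation is used for the Z-score (the sample deviation would be an equally defensible choice), and all names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalise measurements to mean 0 and standard deviation 1.

    Assumption: population standard deviation. Fails with
    ZeroDivisionError if all values are identical.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def performance(kappas, costs, alpha, weights):
    """Return alpha * N(kappa) - sum_i w_i * N(c_i) per dialogue.

    kappas:  list with one kappa value per dialogue (or user)
    costs:   dict metric name -> list of values, aligned with kappas
    weights: dict metric name -> weight w_i
    """
    norm_kappa = z_scores(kappas)
    norm_costs = {name: z_scores(vals) for name, vals in costs.items()}
    return [
        alpha * norm_kappa[d]
        - sum(weights[name] * norm_costs[name][d] for name in costs)
        for d in range(len(kappas))
    ]


scores = performance(
    kappas=[0.2, 0.4, 0.6],
    costs={"utterances": [10, 20, 30]},
    alpha=1.0,
    weights={"utterances": 1.0},
)
```

In this toy input success and cost rise together at the same normalised rate, so with equal weights the two terms cancel and every dialogue scores about zero.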
The normalisation is probably not suitable for normalising one metric across tasks. Each task may have its own mean and deviation; at least this may be so for the $\kappa$ values, as argued before, and probably for some of the typically-used metrics as well. For example, number of utterances and dialogue time vary with task size; they are not corrected for the nature of the task. So, it seems better to normalise for each individual task: this amounts to having a separate Z-normalisation or perhaps even a separate metric for each task.
User Satisfaction (US) should then be determined, for example by means of a user survey. The weights $\alpha$ and $w_i$ can then be determined by linear regression of US against the normalised metrics in the performance formula.
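To make the regression step concrete, here is a minimal least-squares sketch in plain Python. It is not taken from the PARADISE articles: the intercept term, the use of the normal equations, and all function names are assumptions of mine. It fits US = b0 + alpha * N(kappa) - sum_i w_i * N(c_i) and returns alpha and the weights.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting.

    Raises ZeroDivisionError if A is singular (e.g. collinear metrics).
    """
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_weights(us, norm_kappa, norm_costs):
    """Least-squares estimate of alpha and the cost weights w_i.

    us:         list of user satisfaction scores
    norm_kappa: list of normalised kappa values, aligned with us
    norm_costs: list of lists, one list of normalised values per metric
    """
    columns = [[1.0] * len(us), norm_kappa] + norm_costs
    k = len(columns)
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(a * y for a, y in zip(columns[i], us)) for i in range(k)]
    beta = solve(xtx, xty)
    alpha = beta[1]
    weights = [-b for b in beta[2:]]  # costs enter the formula with a minus sign
    return alpha, weights
```

Given satisfaction scores generated exactly as 3 + 2 * N(kappa) - 0.5 * N(c), the fit recovers alpha = 2 and w = 0.5 up to rounding; with real survey data the fit will of course leave a residual.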
The precise limitations and caveats of linear regression are at the moment not known to me, but linearity itself may not take into account some important features of performance. For example, one can imagine users tend to be especially irritated with the part of the dialogue that performs worst. This example would suggest taking something like the minimum of the performance metrics as basis for user satisfaction, rather than a linear combination of them.
The method also allows for evaluation of subdialogues. This assumes that each subdialogue always corresponds neatly to a submatrix of the AVM, and that, if different methods are evaluated, they have comparable subdialogues.
In order to do this, one has to determine $\kappa$ for the appropriate submatrix of M and $c_i$ for the appropriate subdialogues. It is therefore necessary that the chosen metrics $c_i$ are also applicable to subdialogues.
The basic theory assumes one does not obtain US values for every kind of subdialogue; therefore, the weights as calculated in 1.3 have to be used. The obtained formula can be used to predict the performance of the subdialogue.
Despite its limitations, the method could be used as a preselection step on very large corpora to determine interesting subsets of the data, to detect interesting global features, or to compare similar systems by means of the same tasks and users.
The method allows for some free interpretation of the formulas, for example there are several alternatives for $\kappa$ and the other metrics, normalisation, AVM specification, and Z-score normalisation. There is no clear way to go about this; one apparently has to trust one's intuition or some experimental design methodology, as can be seen in [WHF+97].
It is not clear how to make sure the metrics will not omit properties of the system which are essential for proper evaluation. Since there is no universal set of metrics, metrics of very different systems are probably not comparable. Note for example [WHF+97], in which the actual database queries took most of the time, making the duration of the dialogue itself less important to the users, and actually giving the impression that an efficient dialogue had a slower pace than an inefficient one.
PADIS [Bou98], an automated telephone operator, allows users to make direct connections to other people, or to ask for their phone numbers, room numbers or email addresses, by identifying them. A variety of ways to identify people are possible, and the amount or type of information needed to identify specific people uniquely varies.
The system is evaluated by means of a fixed set of tasks which each of a number of users is asked to perform. Success, cost, and user satisfaction data are thus obtained.
However, what was claimed in [Bou98] was that PARADISE was not fit to use for evaluation of PADIS. There were two arguments, which are given and commented on below:
1. ... [WHF+97], as was also explained in the correspondence with Dr. Walker), together in a single attribute. This does mean glossing over differences in implementation of the same task, so this solution always means that subtasks cannot be evaluated separately; information is thrown away.
2. ... [WHF+97]: specify a logical expression which shows the necessary attributes that were still required to complete the task. For example, if the user specified name but still has to specify either group or the combination title+gender, the attribute column in the AVM turns out as

\[ \mathit{name}, \quad \mathit{group} \lor (\mathit{title} \land \mathit{gender}) \]

However, there is no method given to calculate $\kappa$ for AVMs with such logical expressions. An obvious but crude way to do this is by using a statistical approach, i.e. basing the calculation on probabilities obtained from the task corpus. First, one writes the logical expression in 'canonical' form, like

\[ e_1 \lor e_2 \lor \dots \lor e_m \]

in which each $e_k$ is a conjunction of attributes. Then, $\kappa$ can be calculated using

\[ \kappa = \sum_{k=1}^{m} P(e_k) \, \kappa(M_k) \]

$M_k$ is the confusion matrix in which the logical expression has been replaced with the actual attributes corresponding to $e_k$. The probabilities $P(e_k)$ are obtained by counting the number of times each of the implementations $e_k$ was actually attempted to be communicated in the corpus of that specific task.
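One possible reading of this statistical approach is a probability-weighted average of the kappa values of the individual implementations. The Python sketch below shows only that reading; the function name, the input format, and the choice of a plain weighted average are assumptions of mine, not something given in the articles.

```python
def weighted_kappa(implementations):
    """Combine per-implementation kappa values into one value.

    implementations: list of (attempt_count, kappa_value) pairs, one per
    conjunction in the canonical expression. attempt_count is how often
    users tried that implementation in the task corpus; the relative
    counts serve as the probabilities.
    """
    total = sum(count for count, _ in implementations)
    return sum(count / total * value for count, value in implementations)


# Hypothetical task: 30 users gave name+group, 10 gave name+title+gender.
combined = weighted_kappa([(30, 0.8), (10, 0.4)])
print(combined)
```

The weights depend on what users happened to attempt, which ties the resulting kappa to the corpus.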
However, a problem still remains: the set of AVMs you get after the evaluation is unpredictable. As argued before, the results of different AVMs may not be quite comparable.
A new objective evaluation method was introduced at the latest TWLT workshop (TWLT13) [ELP98].
Basically, the idea is to construct a simulated user, the behaviour of which is based on corpus material, which makes it possible to do intermediate evaluation of the dialogue system without needing real users.
Like PARADISE, performance is calculated using a metric formula which should represent overall performance, and which is calibrated using separate user satisfaction surveys.
If we assume PARADISE-like normalisation may be incorporated into the metrics somewhere, the performance function is similar to the PARADISE function. However, it is calculated on a per-dialogue rather than a per-user basis:

with

The text does not specify whether to use $\kappa$ or what other kind of metrics to use.
Much depends on the performance function: the evaluation of the dialogue system as well as that of the simulated user (see next section). However, as in PARADISE, the performance function itself is not evaluated. There is still no clear way how to make sure the performance function covers all relevant aspects of the dialogue.
The method proposes the construction of a simulated user, which should be statistical in nature (in order to add some humanlike, lifelike `randomness' to the simulation), and based on corpus material. This simulated user can then generate unlimited amounts of `virtual' corpus material on demand. This approach is argued to have the following uses:
A simple implementation of the simulated user, based on utterance bigrams, is proposed.
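To give an impression of how little such a simulated user needs to contain, here is a minimal Python sketch of a bigram model. It rests on assumptions of my own: that a bigram is a pair of system utterance type and following user utterance type, that the corpus is a list of dialogues made of such pairs, and that the utterance type labels in the example are purely hypothetical.

```python
import random
from collections import Counter, defaultdict


class BigramUser:
    """Simulated user: picks a reply type given the last system utterance type."""

    def __init__(self, corpus, seed=None):
        # counts[system_act][user_act] = how often this reply was observed
        self.counts = defaultdict(Counter)
        for dialogue in corpus:
            for system_act, user_act in dialogue:
                self.counts[system_act][user_act] += 1
        self.rng = random.Random(seed)

    def probability(self, system_act, user_act):
        """Relative frequency of user_act after system_act (0.0 if unseen)."""
        seen = self.counts.get(system_act)
        if not seen:
            return 0.0
        return seen[user_act] / sum(seen.values())

    def respond(self, system_act):
        """Draw a reply at random, proportionally to the corpus counts."""
        seen = self.counts.get(system_act)
        if not seen:
            raise ValueError("no corpus data for system act %r" % system_act)
        acts = list(seen)
        return self.rng.choices(acts, weights=[seen[a] for a in acts])[0]


# Hypothetical corpus of three short dialogues.
corpus = [
    [("ask_name", "give_name"), ("confirm", "yes")],
    [("ask_name", "give_name"), ("confirm", "no")],
    [("ask_name", "silence"), ("confirm", "yes")],
]
user = BigramUser(corpus, seed=0)
reply = user.respond("confirm")
```

The model has no memory beyond the last system utterance and no goal of its own, so its whole behaviour is contained in the count table.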
The realism of the simulated user can be evaluated by calculating the difference between real user performance and simulated user performance.
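Such a realism measure could be as simple as the following Python sketch. The article does not prescribe a formula here, so the choice of the absolute difference between mean performance values, and the function name, are assumptions of mine.

```python
from statistics import mean


def realism_gap(real_scores, simulated_scores):
    """Absolute difference between mean performance with real users
    and mean performance with the simulated user; 0 means the
    averages coincide."""
    return abs(mean(real_scores) - mean(simulated_scores))


gap = realism_gap([0.5, 0.7], [0.4, 0.6])
print(gap)
```

A gap of zero only says that the averages match; the spread of the two sets of scores may still differ considerably.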
One problem is the fact that the simulated user proposed here is only an average, therefore not accounting for individual differences between users.
The actual validity of the first use is dubious. It was claimed that, in case the real corpus is too small, one `... can run a large number of dialogues and reach results that are significant according to a predetermined confidence level.'. However, this would mean the amount of `virtual' corpus material would far exceed the amount of `real' material, which would mean that one is actually testing the system with simulated users instead of real ones.
In case the system is seriously used to evaluate competing systems without feeding back to real users (as in the third use), some methods will obviously have an advantage over others because of artifacts in the simulated user model. This also leads the discussion to a more fundamental problem, as is discussed in the next section.
The idea of simulating user behaviour somehow and feeding it back into a dialogue system has actually crossed my mind even before I saw this article, but I dismissed the idea because I thought there was something fundamentally wrong with it: one would need pretty good AI to simulate a user so realistically that it would actually aid in improving one's dialogue system. However, if this AI is already available, and it is more advanced than the dialogue system, it seems more logical to incorporate it into the system instead of teasing the system with it. Back then, I wasn't thinking of a statistical corpus-based simulated user (the architecture of which is probably fundamentally different from the dialogue system's, which means it might provide an additional 'viewpoint' of the problem next to that of the system's without being 'better' than the system), but I will argue the same still goes for this case.
There is a very easy, sure-fire 'cheat' to get a perfect score with this evaluation method: since the simulated user's behaviour is simple (in the actual implementation proposed) and fully specified formally (if only statistically), the dialogue system should have an easy time predicting what the user will do, and get maximum results, once one actually incorporates knowledge about the user's behaviour into the dialogue system.
Of course, a dialogue system engineered only to get optimal scores with simulated users may not work well in reality. Actually, the method even dictates that the system should be tested with real users after re-engineering done with simulated users. So, this would imply that the dialogue system is engineered separately from knowledge about the implementation of the simulated user; one does not 'cheat'. However, this means one deliberately does not use the knowledge that is found in the simulated user model, i.e. one is deliberately throwing away knowledge. Actually, incorporating knowledge about the simulated user in addition to the system that is already there is perhaps not cheating, but merely making optimal use of the existing knowledge, unless it should prove somehow impossible to do this (which I do not believe).
About objective dialogue evaluation methods