About objective dialogue evaluation methods

B.W. van Schooten

19 May 1998. Definitely a draft.

Abstract:

'Objective' evaluation means evaluation using numbers (metrics) which can be calculated without intervention by humans. An obvious advantage is that human effort is reduced. It is also claimed to be less biased by human opinion, but this text shows that many choices have to be made when applying these methods, which introduces bias at a different level. Two objective evaluation methods which are claimed to be state-of-the-art, both developed at AT&T, are described and commented upon. Both are black-box methods, meaning that they can be used without referring to the implementation of the dialogue system.

Using PARADISE

In this part, PARADISE will be described and commented on, and the problems encountered when trying to use it to evaluate PADIS will be discussed.

Summary of and remarks on PARADISE

The goal of PARADISE [WHF97][WLKA97], developed at AT&T, is to allow some basic analysis of a dialogue strategy, or comparison of different (competing) dialogue strategies, even among different tasks. In accordance with the competitive, goal-oriented nature of commercial companies, the emphasis of this evaluation method lies on overall performance comparison of different (sub)strategies.

The evaluation is achieved by using a task success metric based on the Kappa value, and by supplying a formula into which task cost metrics of one's own can be filled in along with the task success metric, and which is correlated against a separate subjective measurement of user satisfaction. The resulting equation may be used to predict and/or compare performance of specific subdialogues or other strategies, in order to gain some insight into which parts of the evaluated system are succeeding or failing, or to choose between competing strategies.

Task success

Kappa value

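Kappa is typically used as a measure of agreement between different information sources:

\[ \kappa = \frac{P(A) - P(E)}{1 - P(E)} \]

with $P(A)$ the proportion of cases in which the sources agree, and $P(E)$ the proportion of cases in which they would be expected to agree by chance.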

In the case of PARADISE, the 'agreement' in a NL dialogue is the agreement between the contents of a piece of information sent by a user and the contents as it was eventually interpreted by the system. This assumes that the main objective of the dialogue is to pass specific, clearly identifiable items of information.

Attribute-Value Matrix

These items of information are identified by determining an Attribute-Value Matrix (AVM). Each row in the matrix stands for one attribute of information which can be seen as clearly identifiable in the task domain. In the left column, each attribute type is named, while in the right column, the set of possible values for each attribute type is listed.

Confusion matrix

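Using the AVM and a corpus which contains the user's goal and system's interpretation at the end of each dialogue, a confusion matrix $M$ can be determined. This is a square matrix of dimension $n \times n$, with $n$ the total number of attribute values in the AVM. Each column corresponds to a value the user intended, each row to a value as interpreted by the system, and each cell counts how often that combination occurred in the corpus.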

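Using $M$, $P(A)$ in the $\kappa$ equation can be determined using the obvious

\[ P(A) = \frac{\sum_{i} M(i,i)}{T} \]

with $T$ the sum of all entries of $M$.

If we assume that the probability distribution of the attribute values when guessed randomly by the system equals the distribution of the attribute values the users are trying to convey among all the different tasks (which is the best guess the system can make if it cannot understand its users at all), $P(E)$ turns out as

\[ P(E) = \sum_{i} \left( \frac{t_i}{T} \right)^2 \]

with $t_i$ the sum of the entries in column $i$ of $M$.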
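The calculation described above can be sketched in a few lines of Python. This is only a minimal illustration, not code from [WLKA97]: the function name `kappa_from_confusion` and the convention that rows hold the system's interpretation and columns the user's intention are assumptions made for this example.

```python
def kappa_from_confusion(matrix):
    """Compute kappa = (P(A) - P(E)) / (1 - P(E)) from a square confusion matrix.

    Assumed layout: matrix[i][j] counts how often the system interpreted
    value i (row) when the user intended value j (column).
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("confusion matrix must be square")
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix contains no observations")

    # P(A): proportion of observations on the diagonal (agreement).
    p_agree = sum(matrix[i][i] for i in range(n)) / total

    # P(E): chance agreement, based on the column totals t_i.
    column_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    p_chance = sum((t / total) ** 2 for t in column_totals)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one value ever occurs")
    return (p_agree - p_chance) / (1 - p_chance)
```

For instance, the two-value matrix `[[20, 5], [5, 20]]` has P(A) = 0.8 and P(E) = 0.5, so kappa comes out at about 0.6, while a matrix with all observations on the diagonal gives kappa = 1.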

Discussion: Validity of $\kappa$

The basic idea behind using $\kappa$ is to be able to correct for differences in confusion matrices, and hence for differences in tasks. However, note that $P(E)$ is typically very small for confusion matrices of any serious size: if we assume that each of the $n$ values occurs about equally often, $P(E) \approx 1/n$, so that $\kappa$ hardly differs from $P(A)$. The article [WLKA97] notes this, and offers a couple of alternatives:

  1. calculate $\kappa$s for each single attribute and average these
  2. separate the identification of the attributes and the filling-in of the attribute values into two different subtasks, for which separate $\kappa$s can be calculated.
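The remark that chance agreement becomes very small for large matrices can be made explicit with a short worked derivation. It assumes, purely for illustration, that all $n$ values occur exactly equally often, so that every column total is $T/n$; the numbers in the last line are an invented example.

```latex
% Assumption: uniform distribution, i.e. every column total is T/n.
\begin{align*}
P(E) &= \sum_{i=1}^{n} \left( \frac{T/n}{T} \right)^2
      = n \cdot \frac{1}{n^2}
      = \frac{1}{n} \\
\kappa &= \frac{P(A) - 1/n}{1 - 1/n}
        = \frac{n \, P(A) - 1}{n - 1} \\
% Invented example: n = 100 values, P(A) = 0.8
\kappa &= \frac{100 \cdot 0.8 - 1}{99} \approx 0.798
\end{align*}
```

So for a matrix with a hundred values, the chance correction changes the score only in the third decimal, which is why the correction is of little use for comparing tasks with large confusion matrices.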

Even then, apart from singling out attributes with a very limited range of values (like yes-no questions) or with a very frequently occurring default value, $\kappa$ will still not account for values that are easier or harder to confuse with others for other reasons, for example because certain words are pronounced in nearly the same way, or because the typical structure of sentences in the problem domain makes values easier or harder to interpret or disambiguate. So one should probably be careful when comparing task success results across different tasks.

Also, the confusion over the subject or meaning of a (sub)dialogue is not explicitly accounted for: the AVM approach only accounts for data passed and not for control of the dialogue. At best, the existence of very high confusions across attributes may indicate some things went wrong at the control level. It might be possible to add some control aspects to the AVM by adding topic selections as attributes. The idea proposed in alternative 2 seems to go some way into this direction, though no concrete examples in this direction are available.

Another incompleteness of the metric became clear after a personal discussion with Dr. Eckert: it is not clear what one should do in case a piece of information was communicated wrongly but was actually received correctly.

Discussion: undefined attributes in M

Actually, there seems to be a slight error in the definition of M: attributes that were never communicated (for example, because the dialogue came to an untimely end) are not accounted for. Several solutions for this are conceivable, but probably the simplest is adding an extra row to M, in which all omissions or unclear utterances of the users can be collected. This also corresponds neatly to the solution that was suggested for dealing with the case of more concepts in the total system than are used in the user tasks ([Bou98], see also 2.1.1).

Costs

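General cost metrics $c_i$ can be determined, which can then be incorporated in the performance calculation, along with the $\kappa$ metric, resulting in

\[ \mathrm{Performance} = \alpha \, N(\kappa) - \sum_{i} w_i \, N(c_i) \]

with $\alpha$ the weight of $\kappa$, $w_i$ the weight of cost metric $c_i$, and $N$ the Z-score normalisation function

\[ N(x) = \frac{x - \bar{x}}{\sigma_x} \]

The Z-score normalisation normalises the results to have mean 0 and standard deviation 1. It is used to normalise for differences in scales or units among the different metrics $c_i$ and $\kappa$.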
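As a minimal sketch of how the performance calculation could be carried out, the following Python fragment normalises each metric over a set of dialogues and combines the results. The function names, the dictionary layout of the cost metrics, and the choice of the population standard deviation are all assumptions made for this illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalise values to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used. If all values
    are identical, every score is defined here as 0.
    """
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def performance(kappas, costs, alpha, weights):
    """Return alpha * N(kappa) - sum_i w_i * N(c_i) for each dialogue.

    kappas:  one kappa value per dialogue
    costs:   hypothetical layout, metric name -> one value per dialogue
    weights: metric name -> weight w_i
    """
    norm_kappa = z_scores(kappas)
    norm_costs = {name: z_scores(vals) for name, vals in costs.items()}
    return [
        alpha * norm_kappa[d]
        - sum(weights[name] * norm_costs[name][d] for name in costs)
        for d in range(len(kappas))
    ]
```

With two dialogues, `kappas=[0.2, 0.8]`, a single cost `{"utterances": [30, 10]}` and all weights set to 1, the second dialogue scores higher on success and lower on cost, and therefore ends up with the higher performance value.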

Discussion: how to normalise

The normalisation $N$ is probably not good for normalisation of one metric across tasks. Each task may have its own mean and deviation; at least this may be so for the $\kappa$ values, as argued before, and probably for some of the typically-used cost metrics as well. For example, number of utterances and dialogue time vary with task size; they are not corrected for the nature of the task. So, it seems better to normalise for each individual task: this amounts to having a separate Z-normalisation or perhaps even a separate metric for each task.
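Normalising per task, as suggested above, could look like the following sketch. The function name `z_scores_per_task` and the use of task labels as grouping keys are assumptions for this example; the population standard deviation is again an arbitrary choice.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_scores_per_task(values, tasks):
    """Z-normalise each value against the other values of the same task.

    values: one metric value per dialogue
    tasks:  the (hypothetical) task label of each dialogue, same order
    """
    grouped = defaultdict(list)
    for value, task in zip(values, tasks):
        grouped[task].append(value)

    # Mean and standard deviation are computed separately for every task.
    stats = {task: (mean(vs), pstdev(vs)) for task, vs in grouped.items()}

    result = []
    for value, task in zip(values, tasks):
        m, s = stats[task]
        result.append(0.0 if s == 0 else (value - m) / s)
    return result
```

With dialogue times `[10, 20, 100, 200]` for the tasks `["short", "short", "long", "long"]`, a single global normalisation would mainly separate the long task from the short one, whereas the per-task version rates the faster dialogue of each task equally well.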

Correlation against User Satisfaction

 

User Satisfaction (US) should then be determined, for example by means of a user survey. The weights $\alpha$ and $w_i$ can then be determined by linear regression of US against the normalised metrics in the formula.
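The regression step can be illustrated with ordinary least squares on the normal equations, written here in plain Python. This is a sketch under assumptions: the function names are invented, an intercept term is included although the text does not mention one, and the sign of the cost coefficients is flipped so that the weights match a formula in which costs are subtracted.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("regressors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        rest = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - rest) / aug[row][row]
    return x


def fit_weights(norm_kappa, norm_costs, satisfaction):
    """Fit US ~ intercept + alpha * N(kappa) - sum_i w_i * N(c_i).

    norm_kappa:   normalised kappa, one value per dialogue
    norm_costs:   list of normalised cost metrics, each one value per dialogue
    satisfaction: user satisfaction score per dialogue
    Returns (intercept, alpha, [w_1, w_2, ...]).
    """
    rows = [
        [1.0, norm_kappa[d]] + [cost[d] for cost in norm_costs]
        for d in range(len(satisfaction))
    ]
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, satisfaction)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    # Costs enter the performance formula with a minus sign, so negate.
    return beta[0], beta[1], [-b for b in beta[2:]]
```

Solving the normal equations directly is adequate for a handful of metrics; with many strongly correlated metrics a numerically more robust method would be preferable.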

Discussion: limits of linear regression

The precise limitations and caveats of linear regression are at the moment not known to me, but linearity itself may not take into account some important features of performance. For example, one can imagine users tend to be especially irritated with the part of the dialogue that performs worst. This example would suggest taking something like the minimum of the performance metrics as basis for user satisfaction, rather than a linear combination of them.

Subdialogues

The method also allows for evaluation of subdialogues. This assumes that each subdialogue always corresponds neatly to a submatrix of the AVM, and that, if different methods are evaluated, they have comparable subdialogues.

In order to do this, one has to determine $\kappa$ for the appropriate submatrix of M and $c_i$ for the appropriate subdialogues. It is therefore necessary that the chosen metrics are also applicable to subdialogues.

The basic theory assumes one does not obtain US values for every kind of subdialogue, therefore the weights as calculated in 1.3 have to be used. The obtained formula can be used to predict performance of the subdialogue.

Discussion: general issues

Despite its limitations, the method could be used for preselection in very large corpora to determine interesting subsets of the data, for detecting interesting global features, or for comparing similar systems by means of the same tasks and users.

The method allows for some free interpretation of the formulas: for example, there are several alternatives for $\kappa$ and the other metrics, for the AVM specification, and for the Z-score normalisation. There is no clear way to go about this; one apparently has to trust one's intuition or some experimental design methodology, as can be seen in [WHF97].

It is not clear how to make sure the metrics will not omit properties of the system which are essential for proper evaluation. Since there is no universal set of metrics, metrics of very different systems are probably not comparable. Note for example [WHF97], in which the actual database queries took most of the time, making the duration of the dialogue itself less important to the users, and actually giving the impression that an efficient dialogue had a slower pace than an inefficient one.

Using PARADISE for PADIS

PADIS [Bou98], an automated telephone operator, allows users to make direct connections to other people, or to ask for their phone and room numbers or email addresses, by identifying them. A variety of ways to identify people are possible, and the amount or type of information needed to identify specific people uniquely varies.

Evaluation of PADIS

The system is evaluated by means of a fixed set of tasks, which each of a number of users is asked to perform. Success, cost, and user satisfaction data are thus obtained.

PARADISE and PADIS according to [Bou98]

 

However, what was claimed in [Bou98] was that PARADISE was not fit to use for evaluation of PADIS. There were two arguments, which are given and commented on below:

  1. M is too large because of the large vocabulary. Since not all words occur in the evaluation tasks anyway, the following solution was suggested: use only the task words for M. Other words can be incorporated by adding an extra row 'others', which was probably needed anyway to account for undefined values, as mentioned before. However, it was still argued that there is no way to guarantee representativeness of the used subset. Representativeness is an issue with any method that evaluates only a subset of what really occurs or could occur, which is probably just about always. If the subset is not representative, the outcome of the evaluation cannot be seen as a general performance metric, but only in relation to the actual tasks (and users) in question. This does not mean it cannot be useful for detecting interesting features.

  2. A task does not necessarily correspond 1-to-1 to the AVM of that task; there are several possible ways to 'implement' each task. For example, there is a person who may be identified by giving either name+group or name+title+gender. Given are two possible ways around this problem:

    1. lump all tasks, or all implementations of each task (the latter is the approach taken in [WHF97], as was also explained in the correspondence with Dr. Walker), together in a single attribute. This does mean glossing over differences in implementation of the same task, so this solution does always mean that subtasks cannot be evaluated separately; information is thrown away.

    2. construct the AVM from the dialogue afterwards, so that some account is given for partial success, i.e. what the user did or did not succeed to convey. As was argued, the question of what the user failed to supply, given that there are multiple possible solutions, remains. It would however be possible to specify this in the way tasks were specified in [WHF97]: specify a logical expression which shows the necessary attributes that were still required to complete the task. For example, if the user specified name but still has to specify either group or the combination title+gender, the attribute column in the AVM turns out as

      name
      group $\lor$ (title $\land$ gender)

      However, there is no method given to calculate $\kappa$ for AVMs with such logical expressions. An obvious but crude way to do this is by using a statistical approach, i.e. basing the calculation on probabilities obtained from the task corpus. First, one writes the logical expression in 'canonical' form, i.e. as a disjunction of conjunctions $e_1 \lor e_2 \lor \ldots$, like

      \[ \mathit{group} \lor (\mathit{title} \land \mathit{gender}) \]

      Then, $\kappa$ can be calculated using

      \[ \kappa = \sum_{j} P(e_j) \, \kappa(M_{e_j}) \]

      $M_{e_j}$ is the confusion matrix in which the logical expression has been replaced with the actual attributes corresponding to $e_j$. The probabilities $P(e_j)$ are obtained by counting the number of times each of the implementations was actually attempted to be communicated in the corpus of that specific task.

      However, a problem still remains: the set of AVMs you get after the evaluation is unpredictable. As argued before, the results of different AVMs may not be quite comparable.

Eckert et al.'s 'simulated user' method

A new objective evaluation method was introduced at the latest TWLT workshop (TWLT13) [ELP98].

The method

Basically, the idea is to construct a simulated user, the behaviour of which is based on corpus material, which makes it possible to do intermediate evaluation of the dialogue system without needing real users.

Like PARADISE, performance is calculated using a metric formula which should represent overall performance, and which is calibrated using separate user satisfaction surveys.

Performance function

If we assume PARADISE-like normalisation may be incorporated into the metrics somewhere, the performance function is similar to the PARADISE function: a weighted combination of metrics. However, it is calculated on a per-dialogue rather than a per-user basis. The text does not specify whether to use $\kappa$ or what other kind of metrics to use.

Discussion: validity of the performance function

Much depends on the performance function: the evaluation of the dialogue system as well as that of the simulated user (see next section). However, as in PARADISE, the performance function itself is not evaluated. There is still no clear way how to make sure the performance function covers all relevant aspects of the dialogue.

Simulated user

The method proposes the construction of a simulated user, which should be statistical in nature (in order to add some humanlike, lifelike 'randomness' to the simulation), and based on corpus material. This simulated user can then generate unlimited amounts of 'virtual' corpus material on demand. This approach is argued to have the following uses:

  1. increasing the corpus size
  2. testing a re-engineered system without needing real users
  3. being able to compare two systems with the same (simulated) user population

A simple implementation of the simulated user, based on utterance bigrams, is proposed.
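To make the idea of a bigram-based simulated user concrete, here is a minimal Python sketch. It is not the implementation from [ELP98]: it assumes that a bigram is a pair of a system utterance type followed by a user utterance type, and the function names and the utterance labels in the usage note are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram_user(corpus):
    """Count how often each user utterance type follows each system utterance type.

    corpus: list of dialogues, each a list of (system_act, user_act) pairs.
    This data layout is an assumption made for the example.
    """
    counts = defaultdict(Counter)
    for dialogue in corpus:
        for system_act, user_act in dialogue:
            counts[system_act][user_act] += 1
    return counts


def simulate_response(model, system_act, rng):
    """Draw a user utterance type with probability proportional to its corpus count."""
    options = model.get(system_act)
    if not options:
        raise KeyError(f"no corpus data for system act {system_act!r}")
    acts = list(options.keys())
    frequencies = [options[act] for act in acts]
    return rng.choices(acts, weights=frequencies, k=1)[0]
```

A model trained with `train_bigram_user([[("ask_name", "give_name"), ("confirm", "yes")]])` can be queried with `simulate_response(model, "confirm", random.Random(0))`. Since the response depends only on the last system utterance, the simulated user has no memory of the rest of the dialogue and no goal of its own.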

The realism of the simulated user can be evaluated by calculating the difference between real user performance and simulated user performance.

Discussion: validity of user simulation

One problem is the fact that the simulated user proposed here is only an average, therefore not accounting for individual differences between users.

The actual validity of the first use is dubious. It was claimed that, in case the real corpus is too small, one '... can run a large number of dialogues and reach results that are significant according to a predetermined confidence level.' However, this would mean that the amount of 'virtual' corpus material would far exceed the amount of 'real' material, so that one is actually testing the system with simulated users instead of real ones.

In case the system is seriously used to evaluate competing systems without feeding back to real users (as in the third use), some methods will obviously have an advantage over others because of artifacts in the simulated user model. This also leads the discussion to a more fundamental problem, as is discussed in the next section.

Discussion: the principle of the simulated user

The idea of simulating user behaviour somehow and feeding it back into a dialogue system has actually crossed my mind even before I saw this article, but I dismissed the idea because I thought there was something fundamentally wrong with it: one would need pretty good AI to simulate a user so realistically that it would actually aid in improving one's dialogue system. However, if this AI is already available, and it is more advanced than the dialogue system, it seems more logical to incorporate it into the system instead of teasing the system with it. Back then, I wasn't thinking of a statistical corpus-based simulated user (the architecture of which is probably fundamentally different from the dialogue system's, which means it might provide an additional 'viewpoint' of the problem next to that of the system's without being 'better' than the system), but I will argue the same still goes for this case.

There is a very easy, sure-fire 'cheat' to get a perfect score with this evaluation method: since the simulated user's behaviour is simple (in the actual implementation proposed) and fully specified formally (if only statistically), the dialogue system should have an easy time predicting what the user will do, and get maximum results, once one actually incorporates knowledge about the user's behaviour into the dialogue system.

Of course, a dialogue system engineered only to get optimal scores with simulated users may not work well in reality. Actually, the method even dictates that the system should be tested with real users after re-engineering done with simulated users. So, this would imply that the dialogue system is engineered separately from knowledge about the implementation of the simulated user; one does not 'cheat'. However, this means one deliberately does not use the knowledge that is found in the simulated user model, i.e. one is deliberately throwing away knowledge. Actually, incorporating knowledge about the simulated user in addition to the system that is already there is perhaps not cheating, but merely making optimal use of the existing knowledge, unless it should prove somehow impossible to do this (which I do not believe).

References

Bou98
A.G.G. Bouwman. Spoken dialog system evaluation and user-centered redesign with reliability measurements. Master's thesis, University of Twente, Department of Computer Science, 1998. February draft.

ELP98
Wieland Eckert, Esther Levin, and Roberto Pieraccini. Automatic evaluation of spoken dialogue systems. In TWLT13: Formal semantics and pragmatics of dialogue, 1998.

WHF97
Marilyn Walker, Donald Hindle, Jeanne Fromer, Giuseppe Di Fabbrizio, and Craig Mestel. Evaluating competing agent strategies for a voice email agent. In EUROSPEECH97: Proceedings of the European Conference on Speech Communication and Technology, 1997.

WLKA97
Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. PARADISE: a framework for evaluating spoken dialogue agents. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997.
