Dialogue modelling

We assume a "shared plan" or "shared state" model. The shared state is part of the common ground, which comprises the shared state of a specific dialogue plus static information about the context. Dialogue is then the "editing of the shared state", and both partners aim to keep the shared state consistent. In principle, everything that is represented may be discussed. This implies that the parties know (at an intuitive or rational level) the representation and dynamics of the shared state. The system will have trouble understanding the user's representation of the shared state (we will have to use a psychological/social model and assume it is sufficiently correct). The user may have less trouble understanding the system's representation, provided the system is in some way transparent about the model it uses. Users should (intuitively) understand the system's abilities and limitations. In particular, the different kinds of task that the system can perform should be known, and the system may decide to tell the user about them. The system may then assume that the user will not try to go beyond the boundaries so set, which makes its task easier.

Error handling / problem solving and "levels of understanding"

Maintaining consistency of the shared state can be done by detecting and repairing inconsistencies, or by prevention. Unexpected or unintelligible dialogue moves can be signs of inconsistencies. Expected moves, on the other hand, can be signs that the other partner agrees that the dialogue is consistent. Locating inconsistencies can be troublesome, as theoretically it is possible that the shared state models of the partners diverge without them being aware of it. Repair can be done by reconsidering what was uttered, or by throwing away uncertain information and asking for it again, or by suggesting a model correction to the other party. Prevention can be done by verifying the dialogue state implicitly or explicitly.

When we consider error handling, the question arises whether there is a fundamental distinction between error handling in the dialogue and problem solving in goal execution. Error handling is solving a problem with the shared state; problem solving is solving a problem with an external subsystem. Note that, in the case of information dialogues, the location of the "database" subsystem is unclear: is it internal or external? Still, in both cases we may speak of updates to the shared state, which are used to drive communication. If one party tries to perform a task, both parties will know this, and the outcome of the task should be represented consistently in the shared state.

The use of understanding levels seems attractive, but becomes harder to grasp when one thinks about it. It seems to be related to modularity issues.

We assume that understanding of the user input occurs in stages, with each stage augmenting the information passed on from the previous stage with its own knowledge. For example: speech recognition, pen recognition, NL parser, reference resolution, semantic tagging, dialogue management, plan management. A level may correspond to a stage. If something is not fully understood at one level, another level may compensate for the problem.

For example, using such a levels concept, one may arrive at something like a "Brooks subsumption architecture" (a concept from robotics). This means that "interaction levels" are introduced into the system so that problem solving can be separated into levels. Responsibility for a problem that occurs is delegated to a specific level, which should be able to solve it so that other levels need not concern themselves with the problem, or even be aware that it is occurring. Only if problem solving fails at that level is the problem passed to the next level. The use of a subdialogue for solving repair problems is a case of levels-based problem solving. Generalising this may be interesting. Rather than levels, we may need a more general concept of "responsibility domains", with some customised responsibility delegation and transparency scheme.
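The delegation scheme described above can be sketched roughly as follows. This is only an illustration under assumptions: the names Problem, Level and delegate, the level names and the canned fixes are all hypothetical and not part of any existing system. A problem is first offered to the level where it originated, and is passed on to the next level only if that level cannot solve it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Problem:
    kind: str                 # hypothetical label, e.g. "asr_garbage"
    origin: str               # name of the level where the problem was detected
    trace: List[str] = field(default_factory=list)  # levels that were tried


@dataclass
class Level:
    name: str
    # Returns a description of the fix, or None if this level cannot solve it.
    solve: Callable[[Problem], Optional[str]]


def delegate(problem: Problem, levels: List[Level]) -> Optional[str]:
    """Offer the problem to its origin level first, then to each later level."""
    names = [level.name for level in levels]
    start = names.index(problem.origin)
    for level in levels[start:]:
        problem.trace.append(level.name)
        fix = level.solve(problem)
        if fix is not None:
            return fix        # solved here; later levels never see the problem
    return None               # no level could solve it


# Hypothetical levels, ordered from low-level input handling to planning.
levels = [
    Level("asr", lambda p: "ask user to repeat" if p.kind == "asr_garbage" else None),
    Level("reference", lambda p: "ask which item is meant"
          if p.kind == "unresolved_reference" else None),
    Level("plan", lambda p: "choose a different course of action"),
]

solved_locally = Problem("asr_garbage", origin="asr")
escalated = Problem("unknown_word", origin="reference")
print(delegate(solved_locally, levels), solved_locally.trace)
print(delegate(escalated, levels), escalated.trace)
```

The trace field is one possible way to keep delegation transparent: a level that solves a problem leaves a record, but other levels are not otherwise involved.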

However, it need not be true that problems at one level are always solved at that level. Problem delegation may instead be deferred to the end of the parsing process, when the maximum amount of knowledge for solving the problem is available.

EXAMPLE: An obvious example is ASR failure. If ASR replies with "garbage", there may be an "ASR-level repair module" that is responsible for asking the user again in some way that elicits a response more likely to yield good ASR output. The module must have some knowledge of humans as well as of the ASR technology. The module will give up after a certain time or when it has detected that the user will not cooperate in solving the subproblem. Responsibility delegation is a difficult problem, though. While total ASR failure will usually result in the system asking again at least once, a single case of "garbage" output might be enough for a plan manager to decide on a different course of action and not dwell on finding out what was said. The plan manager may ask the ASR repair module to try to elicit the response with a certain urgency, for instance by allowing it to spend at most M time or N utterances on finding out the response.
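A minimal sketch of such a budgeted repair module is given below. Everything in it is an assumption made for illustration: the function asr_repair, the prompts, the convention that unusable ASR output is represented as None, and the fake recogniser used to exercise it. The caller (for example a plan manager) supplies the budget as a maximum number of utterances and a maximum time in seconds.

```python
import time
from typing import Callable, List, Optional

# Hypothetical re-prompts, ordered from open to increasingly constrained.
PROMPTS = [
    "Sorry, could you repeat that?",
    "Please answer in a few words.",
    "Please say just yes or no.",
]


def asr_repair(recognise: Callable[[str], Optional[str]],
               max_utterances: int,
               max_seconds: float) -> Optional[str]:
    """Re-prompt until ASR yields usable output or the budget is spent.

    `recognise` plays a prompt and returns the recognised text, or None
    when the ASR output is unusable ("garbage").
    """
    deadline = time.monotonic() + max_seconds
    for attempt in range(max_utterances):
        if time.monotonic() >= deadline:
            break                      # time budget (M) exhausted
        prompt = PROMPTS[min(attempt, len(PROMPTS) - 1)]
        result = recognise(prompt)
        if result is not None:
            return result              # repaired at the ASR level
    return None                        # give up; the caller decides what to do


def make_fake_recogniser(outputs: List[Optional[str]]) -> Callable[[str], Optional[str]]:
    """Test double: returns the given outputs one by one, ignoring the prompt."""
    it = iter(outputs)
    return lambda prompt: next(it)


print(asr_repair(make_fake_recogniser([None, None, "yes"]), 3, 60.0))
print(asr_repair(make_fake_recogniser([None, None, "yes"]), 2, 60.0))
```

Returning None rather than raising an error keeps the decision about what to do next with the caller, which matches the idea that the plan manager, not the repair module, chooses a different course of action.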

Option-based utterance parsing and dialogue management?

We discussed the principle of looking at the goals of the dialogue system first, rather than doing some analysis and finding out afterwards whether we needed the analysis anyway. A striking example application was that of giving an ASR module the a priori knowledge that the only speech input consists of one out of a set of 10 questions (possibly with multiple variations per question). When this knowledge is properly used, enormous performance gains are possible: one knows exactly what word sequences to expect. This is what humans seem to do in noisy situations.
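As a rough illustration of using such a priori knowledge, the sketch below maps a noisy recogniser hypothesis onto a closed set of known questions. Note that this is a simplification: it corrects the text output after recognition using string similarity from Python's difflib, whereas the gains described above would come from constraining the recogniser itself. The question set, the intent labels and the function match_question are hypothetical.

```python
import difflib
from typing import Optional

# Hypothetical closed set: several variations may map to the same question.
QUESTIONS = {
    "what time does the train leave": "departure_time",
    "when does the train leave": "departure_time",
    "how much does a ticket cost": "ticket_price",
    "which platform does the train leave from": "platform",
}


def match_question(hypothesis: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the label of the known question closest to the hypothesis.

    Returns None when nothing in the closed set is similar enough, which
    can be treated as a sign that repair is needed.
    """
    best = difflib.get_close_matches(
        hypothesis.lower(), list(QUESTIONS), n=1, cutoff=cutoff
    )
    return QUESTIONS[best[0]] if best else None


print(match_question("what time does the rain leaf"))
print(match_question("xyzzy"))
```

The cutoff value is an arbitrary choice here; in practice it would have to be tuned against real recogniser output.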

Let us assume that our dialogue system is, at any given moment, capable of a certain set of, say, 3-20 actions (with each action having zero or more parameters). Let us also assume that the user is (mostly) aware of this set of actions. Then, analysis of the user's input boils down to the following questions:

Such knowledge can for example be used to remodel uncertainty information from analysis modules (possibly including the answer set in the case of imix).
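One possible reading of "remodelling uncertainty information" is sketched below: scores from an analysis module are restricted to the actions the system can currently perform, weighted by a prior over those actions, and renormalised. The function rescore, the action names and all numbers are assumptions made for illustration, not output from a real module.

```python
from typing import Dict


def rescore(hypotheses: Dict[str, float],
            available: Dict[str, float]) -> Dict[str, float]:
    """Restrict analysis scores to the available actions and renormalise.

    `hypotheses` maps interpretations to scores from an analysis module;
    `available` maps each currently possible action to a prior weight.
    Interpretations outside the action set are dropped.
    """
    weighted = {
        action: hypotheses.get(action, 0.0) * prior
        for action, prior in available.items()
    }
    total = sum(weighted.values())
    if total == 0.0:
        return {}                      # nothing matched: a candidate for repair
    return {action: w / total for action, w in weighted.items()}


# Hypothetical scores: "play_music" is not something the system can do now.
scores = {"book_flight": 0.40, "book_hotel": 0.35, "play_music": 0.25}
actions = {"book_flight": 0.5, "book_hotel": 0.5}
print(rescore(scores, actions))
```

With a uniform prior this only removes impossible interpretations and rescales the rest; a non-uniform prior would additionally express which actions the dialogue state makes more likely.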

Dialogue history modelling

I think we will also need an explicit representation of the dialogue history. We can at least use it to resolve certain references. The dialogue history should ideally contain all information necessary for understanding what happened. We have full information if we store: With this information it is possible to analyse the course of the dialogue as needed. Note that we do not include any tags added by analysis modules. In fact, apart from efficiency, there is no reason not to recompute all derived information in the entire dialogue (annotations and references) every time new information is added. Even the system utterance can be said to follow from the system's dialogue model, and can hence be recomputed. Recalculating everything would give an optimal analysis, as it enables complete reconsideration of the dialogue. Humans are capable of this to some extent. In practice our system will store computed information rather than recompute it; in other words, we use a caching scheme.
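The caching idea can be sketched as follows. This is a minimal illustration under assumptions: the class DialogueHistory, the choice to store only speaker and utterance text as raw information, and the toy analysis that resolves "it" to the most recently mentioned entity are all hypothetical. Derived annotations are never stored in the history itself; they are recomputed from the raw turns when needed and cached until new information arrives.

```python
from typing import Callable, Dict, List, Optional, Tuple

Turn = Tuple[str, str]            # (speaker, utterance text), raw information only
Annotations = List[Dict[str, Optional[str]]]

ENTITIES = ("train", "ticket")    # hypothetical domain entities


def analyse(turns: Tuple[Turn, ...]) -> Annotations:
    """Toy analysis: resolve "it" to the most recently mentioned entity."""
    last_entity = None
    annotations = []
    for speaker, text in turns:
        words = [w.strip(".,?!") for w in text.lower().split()]
        referent = last_entity if "it" in words else None
        annotations.append({"speaker": speaker, "it_refers_to": referent})
        for word in words:
            if word in ENTITIES:
                last_entity = word
    return annotations


class DialogueHistory:
    def __init__(self, analyse_fn: Callable[[Tuple[Turn, ...]], Annotations]):
        self._turns: List[Turn] = []
        self._analyse = analyse_fn
        self._cache: Optional[Annotations] = None
        self.recomputations = 0   # counts full re-analyses, for illustration

    def add(self, speaker: str, text: str) -> None:
        self._turns.append((speaker, text))
        self._cache = None        # new information may change earlier analyses

    def annotations(self) -> Annotations:
        if self._cache is None:   # recompute everything from the raw turns
            self._cache = self._analyse(tuple(self._turns))
            self.recomputations += 1
        return self._cache


history = DialogueHistory(analyse)
history.add("user", "When does the train leave?")
history.add("system", "At noon.")
history.add("user", "Is it late?")
print(history.annotations()[-1])
```

Invalidating the whole cache on every new turn is the simplest scheme that still allows complete reconsideration of the dialogue; a finer-grained scheme would recompute only the annotations that the new turn can affect.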