Consistency of the shared state can be maintained either by detecting and repairing inconsistencies or by preventing them. Unexpected or unintelligible dialogue moves can be signs of inconsistencies. Expected moves, on the other hand, can be signs that the other partner agrees that the dialogue is consistent. Locating inconsistencies can be difficult, because in principle the partners' models of the shared state can diverge without either of them being aware of it. Repair can be done by reconsidering what was uttered, by throwing away uncertain information and asking for it again, or by suggesting a model correction to the other party. Prevention can be done by verifying the dialogue state implicitly or explicitly.
When we consider error handling, the question emerges whether there is a fundamental distinction between error handling in the dialogue and problem solving in goal execution. Error handling is solving a problem with the shared state; problem solving is solving a problem with an external subsystem. Note that, in the case of information dialogues, the location of the "database" subsystem is unclear: is it internal or external? Still, in both cases we may speak of updates of the shared state, which are used to drive communication. If one party tries to perform a task, both parties will know this, and the outcome of the task should be represented consistently in the shared state.
The use of understanding levels seems attractive, but the concept becomes harder to pin down on closer inspection. It seems to be related to modularity issues.
We assume that the understanding of the user input occurs in stages, with each stage augmenting the information passed on from the previous stage with its own knowledge. Examples of stages are: speech recognition, pen recognition, NL parser, reference resolution, semantic tagging, dialogue management, plan management. A level may correspond to a stage. If something is not fully understood at one level, the problem may be compensated for by another level.
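The staged view can be made concrete with a minimal Python sketch. Everything in it is an illustrative assumption rather than part of any described system: the stage names, the Interpretation record, the "???" marker for an unrecognised word, and the rule that reference resolution may fill in an unrecognised argument from a single salient object. It shows each stage adding its own facts to what it received, and a later stage compensating for a problem left open by an earlier one.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Interpretation:
    """Knowledge accumulated about one user input as it moves through the stages."""
    raw: str
    facts: Dict[str, Any] = field(default_factory=dict)
    open_problems: List[str] = field(default_factory=list)


Stage = Callable[[Interpretation], Interpretation]


def recognise(interp: Interpretation) -> Interpretation:
    # Hypothetical recogniser: a word it could not make out arrives as "???".
    words = interp.raw.lower().split()
    interp.facts["words"] = words
    if "???" in words:
        interp.open_problems.append("unrecognised word")
    return interp


def parse(interp: Interpretation) -> Interpretation:
    # Toy parser: the first word is the action, the rest are its arguments.
    words = interp.facts["words"]
    interp.facts["action"] = words[0] if words else None
    interp.facts["args"] = words[1:]
    return interp


def make_reference_resolver(salient_object: str) -> Stage:
    """Build a stage that knows which object is currently salient in the dialogue."""
    def resolve(interp: Interpretation) -> Interpretation:
        args = interp.facts["args"]
        if "???" in args:
            # Compensate for the recogniser: assume the salient object was meant.
            interp.facts["args"] = [salient_object if a == "???" else a for a in args]
            # The problem is only closed if the action itself was understood.
            if interp.facts["action"] != "???":
                interp.open_problems.remove("unrecognised word")
        return interp
    return resolve


def run_pipeline(raw: str, stages: List[Stage]) -> Interpretation:
    interp = Interpretation(raw)
    for stage in stages:
        interp = stage(interp)
    return interp


stages = [recognise, parse, make_reference_resolver("report")]
result = run_pipeline("delete ???", stages)
print(result.facts["action"], result.facts["args"], result.open_problems)
```

In this sketch a problem that no later stage can compensate for simply stays in open_problems, where a dialogue or plan manager could pick it up.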
For example, using such a levels concept, one may arrive at something like a "Brooks subsumption architecture" (a concept from robotics). This means that "interaction levels" are introduced into the system so that problem solving can be separated into levels. Responsibility for a problem that occurs is delegated to a specific level, which should be able to solve it, so that other levels need not concern themselves with the problem or even be aware that it is occurring. Only if problem solving fails is the problem passed to the next level. The use of a subdialogue for solving repair problems is a case of levels-based problem solving. Attaining a generalisation of this may be interesting. Rather than levels, we may need a more general concept of "responsibility domains", with some customised responsibility delegation and transparency scheme.
However, it need not be true that problems at one level are always solved at that level. Problem delegation may instead be done at the end of the parsing process, when the most knowledge is available for solving the problem.
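One way to picture level-based delegation with escalation is the following Python sketch. It is a simplification under stated assumptions: the level names ("asr", "dialogue", "plan"), the Problem and Handler records, and the toy solving rules are all hypothetical. Problems are collected during parsing and delegated only afterwards; each problem is offered first to the level where it was detected and is passed upwards only if that level fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Problem:
    level: str        # level at which the problem was detected
    description: str


@dataclass
class Handler:
    level: str
    # Returns a description of the fix, or None if this level cannot solve it.
    try_solve: Callable[[Problem], Optional[str]]


def delegate(problem: Problem, handlers: List[Handler]) -> Optional[str]:
    """Offer the problem to its own level first, then to each higher level."""
    levels = [h.level for h in handlers]
    start = levels.index(problem.level) if problem.level in levels else 0
    for handler in handlers[start:]:
        outcome = handler.try_solve(problem)
        if outcome is not None:
            # Levels below the solver never learn how the problem was handled.
            return f"{handler.level}: {outcome}"
    return None  # no level could take responsibility


def delegate_all(problems: List[Problem], handlers: List[Handler]) -> List[Optional[str]]:
    """Delegation at the end of parsing, when all problems are known."""
    return [delegate(p, handlers) for p in problems]


# Toy solving rules, ordered from lowest to highest level (assumptions).
def asr_repair(p: Problem) -> Optional[str]:
    return "ask the user to repeat" if "noise" in p.description else None


def dialogue_repair(p: Problem) -> Optional[str]:
    return "open a clarification subdialogue" if "ambiguous" in p.description else None


def plan_repair(p: Problem) -> Optional[str]:
    return "choose a different plan"


HANDLERS = [
    Handler("asr", asr_repair),
    Handler("dialogue", dialogue_repair),
    Handler("plan", plan_repair),
]

collected = [Problem("asr", "noise burst"), Problem("asr", "ambiguous word")]
for outcome in delegate_all(collected, HANDLERS):
    print(outcome)
```

A "responsibility domain" scheme would replace the fixed ordering of HANDLERS with a more flexible routing rule; the linear list here is only the simplest case.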
EXAMPLE: An obvious example is ASR failure. If the ASR replies with "garbage", there may be an "ASR-level repair module" that is responsible for asking the user again in a way that is more likely to elicit a response the ASR can recognise. The module must have some knowledge of humans as well as of the ASR technology. The module will give up after a certain time or when it has detected that the user will not cooperate in solving the subproblem. Responsibility delegation is a difficult problem, though. While total ASR failure will usually result in the system asking again at least once, a single case of "garbage" output might be enough for a plan manager to decide on a different course of action rather than dwell on finding out what was said. The plan manager may ask the ASR repair module to try to elicit the response with a certain urgency; for example, it may spend at most an amount of time M or N utterances on finding out the response.
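The budgeted repair request in this example could look roughly like the Python sketch below. It is only an illustration: RepairBudget, the ask_user and is_garbage callables, and the prompt wording are hypothetical names introduced here, and detecting an uncooperative user is not modelled. The plan manager hands the repair module a budget of N utterances and M seconds; the module re-prompts until it gets a usable hypothesis or the budget runs out, and returns None to hand the problem back.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RepairBudget:
    max_utterances: int   # N: how many times the user may be asked again
    max_seconds: float    # M: total time the repair may take


# Re-prompts become more directive with each failed attempt (assumed wording).
PROMPTS = [
    "Sorry, could you repeat that?",
    "Please say it once more, a little slower.",
    "Please answer with just one or two words.",
]


def asr_repair(
    ask_user: Callable[[str], str],
    is_garbage: Callable[[str], bool],
    budget: RepairBudget,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Try to elicit a recognisable response; return None when the budget is spent.

    ask_user plays a prompt and returns the ASR hypothesis for the reply.
    """
    deadline = clock() + budget.max_seconds
    for attempt in range(budget.max_utterances):
        if clock() >= deadline:
            break  # time budget exhausted before this attempt
        prompt = PROMPTS[min(attempt, len(PROMPTS) - 1)]
        hypothesis = ask_user(prompt)
        if not is_garbage(hypothesis):
            return hypothesis
    return None  # give up: responsibility returns to the plan manager


# Usage with a scripted "user" whose third reply is recognised.
scripted = iter(["garbage", "garbage", "book a flight"])
answer = asr_repair(
    ask_user=lambda prompt: next(scripted),
    is_garbage=lambda hyp: hyp == "garbage",
    budget=RepairBudget(max_utterances=3, max_seconds=60.0),
)
print(answer)
```

A plan manager that does not want to dwell on the problem would pass a small budget, for instance a single utterance, and choose another course of action as soon as None comes back.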
Let us assume that our dialogue system is, at any moment, capable of a certain set of, say, 3-20 actions (with each action having zero or more parameters). Let us also assume that the user is (mostly) aware of this set of actions. Then, analysis of the user's input boils down to the following questions: