Building a framework for developing interaction models:
Overview of current research
on dialogue and interactive systems
B.W. van Schooten
Faculty of Computer Science, University of Twente
Postbus 217, 7500 AE, Enschede, The Netherlands
schooten@cs.utwente.nl
February 15, 1999
This text is an overview of literature relevant to the development and modelling of interactive systems. It is the most comprehensive part of the documentation I made during the literature research stage of my PhD project, which should result in a modelling framework for interactive systems. More precisely, the scope of this project is natural language, multimodal, and virtual reality systems, and ways of modelling the dynamics of interaction of such systems for the purpose of aiding system development. The topics addressed in this text naturally include existing systems and modelling techniques, but theoretical frameworks and methodological issues are also addressed. Short introductions are given to cognitive psychology, semiotics, communication models, and HCI. Systems and strategies are reviewed, covering the topics multimodality, redundancy and confirmation, and ways of modelling dialogue and task structure. Methodological issues are reviewed, in particular some methodologies, and strategies for design/implementation and analysis/evaluation. Finally, some conclusions are given, as well as an indication of the directions of my future research.
This text is a result of my work as a PhD student at the Parlevink project, which is part of the TKI (Taal, Kennis, en Interactie) group, formerly called SETI (Software Engineering en Theoretische Informatica). Thanks go to my advisors E.M.A.G. van Dijk, A. Nijholt, and R. op den Akker for reviewing this text.
This text is meant as an overview of the field of natural language interaction, multimodal interaction, and general HCI. This includes theories, general HCI issues, methodological issues, modelling and analysis techniques, design environments, existing systems and existing interaction strategies. It is an overview of most of the literature I have accumulated until now, classified and summarised in order to allow me, and other people, to retrieve information and literature on the subject more easily. This text is a neatened snapshot of a draft text that will be maintained further as new literature and ideas come in. The newest version of the draft text can be found on my homepage:
http://wwwhome.cs.utwente.nl/~schooten/
In the beginning of most sections, some introductory literature on the subject is listed. Note that some of the topics are covered in more detail than others. This reflects:
The preliminary title of my PhD thesis is `Interaction models in (spoken) dialogue systems'. The initial PhD proposal consists of building a modelling framework to obtain more control over the development of interaction models in (spoken and natural language) interactive human/computer systems. Meanwhile, the scope has extended to include multimodal systems as well.
An `interaction model' or `dialogue model' dictates the interaction patterns, styles or strategies that are and can be used in the interaction between human and computer system. Knowledge about existing or possible interaction models should facilitate a more structured Human Computer Interaction (HCI) design. In particular, more needs to be known about designing interaction for information retrieval and exploration tasks, done by different kinds of users, both computer-naive and experienced. Also, more needs to be known about some of the more interesting interaction types by means of which such interaction may be achieved, in particular natural language (NL) and multimodal interaction.
Note that the following objectives are very abstract and general, but they do show the benefits of each objective and the different options available. The purpose of this text is to review the literature in order to enable us to identify more specific objectives, and to obtain an idea of the specific form that an actual framework may take.
A framework may provide a means to specify an interaction model, i.e. to `cast' any (envisioned) dialogue system or user interface within a well-defined class of such systems into a uniform specification, with as little loss of information as possible. This may include the relevant task, goal, and environment. Possible benefits are: easier comparison of systems and implementations, processing of corpora, or more mutual applicability and transferability of software tools or theories based on the framework.
Furthermore, if the framework allows abstraction (i.e. simplification with as little loss of relevant information as possible) of specific aspects of the interaction, then interaction models may be understood more easily in some respects. For example, it may establish a stronger link between NL and other types of HCI, allow abstraction from content or modality in multimodal interaction, or enable a more balanced choice of evaluation metrics.
If the framework explicitly incorporates a description of features or variables of human cognition, it may provide a means to understand, model, and generally talk about human cognition. It might provide a basis for understanding and possibly predicting aspects of individuals and individual differences. It might also provide a basis for forming criteria for when, or reasons why, some types of interaction may or may not work in some situations.
Part i describes various theoretic views of and approaches to human cognition, human interaction with the environment, and communication. This should give a notion of the available intuitive ideas and concepts being used in the relevant areas. The areas that have been considered relevant up to now are: theories related to cognition, HCI, formal communication models, agent models, and semiotics.
Part ii gives an overview of real situations, existing interaction strategies and dialogue systems, and existing HCI methodologies, methods and tools. The emphasis of this part is practical, explaining real human/computer systems, their architecture, and how they are and can be built.
In this part, we will look at cognition, interaction, and communication. In our terminology, cognition is anything that goes on inside the human mind. A cognitive framework is any theoretical framework that helps in explaining aspects of cognition. Communication is similar to interaction, but implies meaning and cooperation: the parties involved must understand and meaningfully react to each other in a certain sense. Interaction is a more general term, and covers any kind of interaction, whether possibly meaningful or not.
The best-known cognitive framework is cognitive psychology. It is the study of human intelligence, viewing thought and intelligence as computation [Pasquini et al., 1998]. This is in clear contrast with its predecessor, behaviourism, which effectively denies that there is anything substantial going on inside the mind. The traditional viewpoint is that `computation' is analogous to symbol manipulation as found in classical AI.
In an introductory text on cognitive psychology, one typically finds the following aspects of human cognition [Veer and Lenting, 1995]:
However, people have argued that traditional cognitive psychology does not
account well enough for context. Other frameworks try to remedy the situation
[Nardi, 1992]: there is activity theory [Kaptelinin et al., 1995],
distributed cognition theory [Zhang and Norman, 1994], and situated action
modelling.
Basically, they assert that human cognition cannot be seen as separate from
the environment, in particular, the tasks and tools in the environment. It is
said that, as a human may shape a tool according to a task at hand, the tool
will also shape the human, as the human is restricted to working with the
limitations of the tool; humans and their tools co-evolve. Representation
tools are a relevant case: an external representation is read (internalised)
or formed (externalised). Thinking is seen as an internal activity with
internal representations. Use of a specific external representation may make
cognitive tasks easier or harder, and may eventually shape thought. Consider
the following example: the two-player `game of 15'. In this game, there are
nine numbers, 1 to 9. The players take turns encircling a number that is
still free. The first player to hold three numbers that sum to 15 wins the
game. As it turns out, this game is simply tic-tac-toe with a different
representation. However, one may expect that most people understand and play
the original spatial version more quickly than the numerical one.
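The equivalence between the two representations can be checked mechanically. The following is a minimal sketch, assuming the numbers 1 to 9 are laid out as a 3x3 magic square (the layout and all function names are illustrative choices, not taken from the cited literature): the triples of numbers that sum to 15 turn out to be exactly the rows, columns, and diagonals of the square.

```python
from itertools import combinations

# Assumed layout: the classic 3x3 magic square, in which every row,
# column, and diagonal sums to 15.
MAGIC_SQUARE = [
    [2, 7, 6],
    [9, 5, 1],
    [4, 3, 8],
]


def winning_lines(square):
    """Return the eight tic-tac-toe lines of the square as sets of numbers."""
    rows = [frozenset(row) for row in square]
    columns = [frozenset(column) for column in zip(*square)]
    diagonals = [
        frozenset(square[i][i] for i in range(3)),
        frozenset(square[i][2 - i] for i in range(3)),
    ]
    return set(rows + columns + diagonals)


def triples_summing_to_15():
    """Return every set of three distinct numbers from 1..9 that sums to 15."""
    return {
        frozenset(triple)
        for triple in combinations(range(1, 10), 3)
        if sum(triple) == 15
    }


def has_won(numbers):
    """Numerical win test: does the player hold three numbers summing to 15?"""
    return any(sum(triple) == 15 for triple in combinations(numbers, 3))


if __name__ == "__main__":
    # The two representations define the same winning configurations.
    print(winning_lines(MAGIC_SQUARE) == triples_summing_to_15())
```

The numerical win test has to search over triples, whereas in the spatial version a win is visible as a straight line, which is one way to make the difference in cognitive effort concrete.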
One concrete example of the importance of context can be found in Rizzo's lecture in [Pasquini et al., 1998]: he considers distinguishing coins. A traditional cognitive psychologist might say that the distinguishability between two coins is adequately represented by difference in measurable features, such as colour and size, assuming that the human mind simply computes these to determine coin type. However, in reality, people's ability to distinguish two types of coins also depends on the existence of other coins: for example, if only one copper-coloured coin type exists, people may confuse any new copper coin type with the existing one, even though it is as different in size from the existing one as are any two of the existing silver-coloured coin types.
Semiotics is the study of signs. The best-known people who have laid the foundation for semiotics are Peirce [Peirce, 1958], Saussure [de Saussure, 1916], and Eco [Eco, 1976]. For a full beginners' introduction to semiotics and some of its applications, see [Zoest, 1978]. In semiotics, a sign identifies something that stands for something else, when interpreted by some individual interpreter in some individual situation. In its simplest form, a sign is a 2-tuple, consisting of: the form of the sign (the significatum), and what it stands for (the denotatum). The precise form of the significatum and denotatum, and how interpretation takes place, is not prescribed by general semiotics. In some semiotic theories though, a sign simply stands for another sign. For any individual, anything that involves interpretation may be called a sign: for example, smoke may be a sign of fire, a closed door may be a sign of a certain person's absence. This also includes signs of culture and convention, such as language, road signs, etc., at least when they indeed are interpreted as such. Note that the notion of the sign's significatum in these examples already implies some (relatively low) level of interpretation. Another sign may have participated at this level. In the first example, it may have been a sign consisting of a significatum in the form of a retinal pattern and a denotatum in the form of the concept 'smoke'.
Around this concept of sign, there are general classifications, some of which are used often. We will discuss the particularly popular classification into syntax, semantics, and pragmatics, and into icon, index, and symbol below.
So, firstly, semiotics identifies the existence of signs, in the form explained above, as units in human interpretation. Also note that the idea of interpretation as replacing one sign by another looks much like the symbolic processing paradigm of cognitive science. In these ways, semiotics is a cognitive framework. We may also view signs not only as units of interpretation, but as units of transmission by someone with the purpose of causing interpretation by another, where transmission is the inverse process of interpretation. In this way, semiotics is also a framework of communication. In both cases, semiotics by itself is a very loose framework, as little is actually prescribed. For example, it is not prescribed when to identify something as a sign, or as part of a sign, or as multiple signs, etc., so this is left to intuition or to other disciplines. However, there is a body of sign classifications available in the literature: specific signs are identified, described, and classified using semiotic concepts. Such semiotic theories exist for psychology, music, cinematography, HCI, etc.
Peirce classifies signs according to the relation between the significatum and denotatum: icon (likeness), index (indirect manifestation), and symbol (relation by convention). Any sign may be more or less iconic, indexical, or symbolic in nature. According to Van Zoest, this classification is well-established because it has proven itself in practice. Each class may be attributed certain extra properties [Zoest, 1978]:
Syntax, semantics, and pragmatics are about sign systems (called `grammars' in Van Zoest's terminology), rather than individual signs. Van Zoest calls them semiotic syntax, semiotic semantics, etc. to distinguish them from the linguistic versions of the concepts. In semiotics, these concepts have a more abstract and general meaning. First, of course, one has to identify what is part of a `sign system', and what is not. Examples of sign systems given in [Zoest, 1978] are: traffic signs (red for danger, circle for `forbidden', etc.), fairy tales (they often start with `Once upon a time...', there's usually a story with certain elements, etc.). Viewing one sign system in isolation may not be enough, because multiple sign systems sometimes work together as a bigger sign system: for example, facial expression and tone may denote sarcasm, possibly inverting the meaning of a spoken sentence.
In dialogue, utterances which influence or are explicitly meant to control the course of the dialogue may be seen as pragmatic in nature, as they are a result of the goals of the participants in relation to the dialogue. On the other hand, one can imagine that there are rules governing dialogue control, which form a separate sign system.
Indeed, some people like to speak of `conversational grammar' or `dialogue grammar' to describe patterns found in sequences of utterances. This dialogue grammar is on a different level than the language grammar, and amounts to a classification of utterance types (for example statement, question, or affirmation) and possible sequences of these. The `dialogue grammar' view has also been criticised [Good, 1989]. See fig. 1 for a schematic view of this idea.
Figure 1: Levels of semiotics in dialogue
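To make the notion of a dialogue grammar more tangible, the sketch below treats it as a table stating which utterance type may follow which. The three utterance types come from the text above; the particular transitions, the `start` state, and the function name are hypothetical and only serve as an illustration.

```python
# Hypothetical dialogue grammar: for each utterance type, the set of
# utterance types that may directly follow it. "start" marks the
# beginning of the dialogue.
DIALOGUE_GRAMMAR = {
    "start": {"statement", "question"},
    "statement": {"statement", "question", "affirmation"},
    "question": {"statement", "question"},  # an answer or a counter-question
    "affirmation": {"statement", "question"},
}


def conforms(utterance_types, grammar=DIALOGUE_GRAMMAR):
    """Check whether a sequence of utterance types is allowed by the grammar."""
    previous = "start"
    for current in utterance_types:
        if current not in grammar.get(previous, set()):
            return False
        previous = current
    return True


if __name__ == "__main__":
    print(conforms(["question", "statement", "affirmation"]))
    print(conforms(["question", "affirmation"]))
```

Note that such a table only looks at adjacent utterances and says nothing about the goals of the participants.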
Communication complexity [Kushilevitz, 1997] addresses how two parties could compute a function, while each party only has a part of the needed information, with a minimum amount of communication.
Information theory [Shannon and Weaver, 1964] is a mathematical theory of computer-computer communication, but it has had some impact on people's thinking about human communication as well. It deals with the efficiency and reliability of the communication of essentially raw data. One of the main ideas is that a highly expected event carries less information than an unexpected one, which is reflected in the fact that it can be coded more efficiently. The theory also addresses how redundancy can be added to communicate reliably over a noisy channel.
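Both ideas can be illustrated in a few lines. The sketch below uses the standard definition of the information content of an event with probability p, namely -log2(p) bits, and a simple repetition code as an example of added redundancy. The repetition code is chosen here only because it is the simplest such scheme; the function names are illustrative.

```python
import math


def self_information(p):
    """Information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    return -math.log2(p)


def entropy(distribution):
    """Average information content, in bits, of a probability distribution."""
    return sum(p * self_information(p) for p in distribution if p > 0)


def encode_repetition(bits, n=3):
    """Add redundancy by sending every bit n times."""
    return [bit for bit in bits for _ in range(n)]


def decode_repetition(received, n=3):
    """Recover each bit by majority vote over its block of n copies."""
    blocks = [received[i:i + n] for i in range(0, len(received), n)]
    return [1 if 2 * sum(block) > len(block) else 0 for block in blocks]


if __name__ == "__main__":
    # An expected event (p = 0.5) carries less information than a rare one.
    print(self_information(0.5), self_information(0.125))

    message = [1, 0, 1]
    sent = encode_repetition(message)
    sent[0] = 0  # simulated channel noise: one flipped bit in the first block
    sent[4] = 1  # and one flipped bit in the second block
    print(decode_repetition(sent) == message)
```

The repetition code triples the message length in order to survive one flipped bit per block, which shows the trade-off between efficiency and reliability.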
Some analyses of cooperating human behaviour exist, see for example [Darses et al., 1993] and [Karsenty, 1993]. In these particular cases, the purpose is to gain insight into HCI issues. The article [Darses et al., 1993] analyses cooperating human partners in design dialogues and advisory dialogues, in a tutor-tutee setting. The data was segmented by hand into Basic Cooperative Behaviours (much like illocutionary acts, see section 3.2) and classified into one of 8 types, and the segments grouped into Basic Cooperative Interactions (much like subdialogues) and classified into one of 11 types. Various qualitative observations are made about tutor-tutee dialogues.
Well known are the Gricean cooperative principles [Grice, 1975]: quantity (make the response exactly as informative as required), quality (do not lie, do not state something you are uncertain of), relation (be relevant), manner (avoid obscurity of expression, ambiguity, be brief and orderly, etc.). Grice also discusses the relation to other principles, and clashes and violations of principles. The principles can be said to have the common basis of maximising effective communication. The Gricean principles imply an understanding of the other participant's intentions. However, they do not explicitly say when the cooperating partner should take initiative.
Castelfranchi [Castelfranchi, 1991] decouples linguistic cooperation from goal-level cooperation. For example, compliance of language utterances (taking desires `literally') may mean uncooperativeness at the goal level. He argues cooperation may or may not exist at the monitoring level (checking whether information is received), linguistic level (following of orders), and at the goal level independently. Note that he distinguishes between these three levels implicitly. Explicitly, he distinguishes only between level one and level two/three. He argues that his view is in conflict with most views of cooperative AI, which assume cooperation at the first two levels, but ignore the third. However, what is usually called `cooperative behaviour' in NL systems also encompasses tracking the user's implicit goals and plans.
Some theories address ways of tracking and aiding a person's intentions. One is missing-axiom-theory [Smith, 1996], which is a theory of human reasoning. In this theory, goals are modelled as the intention to prove certain propositions. Wanting to obtain missing information (missing axioms) needed to prove the propositions is what results in dialogue. A related idea can be found in [Dessalles, 1998]: here, it is proposed that cognitive conflict drives dialogue. An attempt to form a complete executable model of cooperative communication is found in the DenK project [Bunt et al., 1995]. DenK includes NL parsing, the result of which is represented in underspecified language form (ULF), and belief modelling and visual knowledge modelling, using type theory to represent knowledge.
When communicating, people adapt to a certain extent to the other participant. This is a problem when envisioning new systems using previous experience. An example of lexical accommodation can be found in [Fais and ho Loken-Kim, 1995]: people adapt their word usage to their dialogue partner. The cause that is suggested in this article is social approval.
The origin of the term `agents' lies in AI, where people found out that large knowledge bases became unmaintainable by a single inference engine, because they would become intractable or inconsistent. As a way around the problem, multiple inference engines (agents) with private databases and negotiation protocols were introduced. In computer science, use of the word `agent' has recently become more fashionable, and is probably being used for many more things than it originally was. It is sometimes even used as a new name for the more established concept of module or subprogram.
There are several reasons for adopting the concept `agents'. They are software engineering (as an alternative and extension to the object-oriented view), social and emergent system modelling, design of negotiation systems and protocols, and user interface modelling.
In all cases, `agent-oriented' is a metaphor to aid in thinking, like `object-oriented'. Therefore, it is no surprise that the term `agent' is based on the intuitive, possibly naive conception that people have when they think of the word `agent'. Here, we assume any agent has at least one of the following qualities:
Possible benefits of agent theory for dialogue system development are:
Ex. It is important that intelligent agents should communicate their own view in a clear way. The two parties usually found in NL systems are like agents in this respect: the system should understand how the user thinks, and especially when the system's understanding starts failing, it is important that the user understands how the system `thinks'.
Ex. The protocols used are often a kind of formal version of the ideas found in dialogue systems. Analysis of the workability (like self-consistency or completeness) of such protocols may be extendable to those of NL systems.
Ex. Existing agent implementation languages may be used to easily build prototypes of interactive systems.
Ex. Some models assign roles to humans and computers after a first stage of design. [Palanque and Bastide, 1996] even uses formal models to assess this kind of design decision.
Ex. Anthropomorphic interfaces, for example in a virtual-reality (VR) environment, may be directly modelled using agents.
Ex. In a VR environment with multiple computer and human agents, all parties may be viewed as agents.
We continue with an overview of existing agent systems, models, and theories.
[Norman, 1994] discusses social and ergonomic aspects of humans using computer agents. Some guidelines given are: let the user keep a feeling of control, and make sure people do not overestimate the agent's intelligence. Safety and privacy issues are also discussed.
A model of psychological agents is introduced in [Watt, 1997]. The article addresses means to integrate human and computer parties effectively. It is argued that computer agents should be able to communicate with humans using means that are native to humans, because this would allow integration in human environments. Agents are classified by the number of understanding levels they incorporate. The levels are, from low to high, behavioural (any program), intentional (with reasoning engine) and psychological (able to understand/be understood by humans). A scheduler agent Luigi is explained in this article, though not in much detail.
A formal model of social agents is described in [Dignum and Linder, 1997]. This model has four levels: informational (belief and knowledge), action (temporal aspects, including possibilities, agent actions and cause-effect relationships), motivational (wishes and goals), social (illocutionary acts: commitments, directions, declarations, assertions). Some related approaches are summarised also.
The PAC-Amodeus model [Nigay and Coutaz, 1997] is a compositional model for software and user interface design. PAC stands for Presentation-Abstraction-Control, referring to the composition of a basic unit (the agent) in a PAC specification. Note that this looks much like some of the Seeheim models discussed in section 5.2.1.
The Cyc project [Guha and Lenat, 1994] aims at providing an engine that allows common sense reasoning using complex inference over a huge database of logical statements. It is argued that such a large amount of common sense is needed to allow any two agents to communicate. Internally, Cyc is in a way also agent-oriented, as multiple independent subdatabases are maintained.
An article focussing on agent communication languages is [Nwana and Wooldridge, 1997]. It is argued that cooperating multi-agent systems imply complex communication, because agents cannot know everything that is relevant to the global task. Agent communication languages, ontologies, and software tools are reviewed. This includes Knowledge Query and Manipulation Language (KQML), which resembles speech act theory: each KQML message has an illocutionary meaning, which is explicitly stated in the message. Other information carried by the message includes: the body, the language in which the body is written, the ontology which is to be used, and information about the dialogue that the message is part of. Also reviewed are Knowledge Interchange Format (KIF), which is an interlingua for ontologies, and Cyc.
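The parts of a KQML message listed above can be sketched as a small data structure. This is an illustration under assumptions rather than a KQML implementation: the class and its `render` method are hypothetical, the agent names, ontology name, and content expression are made up, and the textual rendering is a simplified version of the parenthesised keyword notation commonly used for KQML.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KqmlMessage:
    """Illustrative container for the parts of a KQML-style message."""

    performative: str  # the explicitly stated illocutionary meaning
    content: str  # the body of the message
    language: str  # the language in which the body is written
    ontology: str  # the ontology the body should be interpreted in
    sender: Optional[str] = None
    receiver: Optional[str] = None
    reply_with: Optional[str] = None  # dialogue bookkeeping: label for replies
    in_reply_to: Optional[str] = None  # dialogue bookkeeping: label answered

    def render(self):
        """Render the message in a simplified keyword notation."""
        fields = [
            ("sender", self.sender),
            ("receiver", self.receiver),
            ("language", self.language),
            ("ontology", self.ontology),
            ("reply-with", self.reply_with),
            ("in-reply-to", self.in_reply_to),
            ("content", f'"{self.content}"'),
        ]
        parts = [f":{name} {value}" for name, value in fields if value is not None]
        return "(" + " ".join([self.performative] + parts) + ")"


if __name__ == "__main__":
    # Hypothetical exchange: a buyer agent asks a seller agent for a price.
    question = KqmlMessage(
        performative="ask-one",
        content="(price widget ?p)",
        language="KIF",
        ontology="catalogue",
        sender="buyer",
        receiver="seller",
        reply_with="q1",
    )
    print(question.render())
```

Because the performative is a separate field, a receiving agent can decide how to treat a message without first interpreting its body.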
The article [Genesereth and Ketchpel, 1994] is also about agent communication, but focusses specifically on trying to set a standard for communication between heterogeneous applications. This is done using a universal language, Agent Communication Language (ACL). ACL is made up of a lexicon, which may be selected from one of several ontologies, the language KIF, and a communication protocol using KIF which is similar to KQML.
The article [Barbuceanu and Fox, 1996] discusses COOL, an agent implementation language which enables KQML-like communication, and cooperation and conflict management.
DESIRE (DEsign and Specification of Interactive REasoning components) is described in [Brazier et al., 1995] and [Brazier et al., 1994]. DESIRE is a design framework, complete with diagram notation and executable specification language. The framework is based on the `shared task model', i.e. a model of the (design) task that can be understood by all parties.
The task knowledge includes: information about the task structure, which should be hierarchical, with complex tasks (which can be subdivided) and primitive tasks (which cannot be subdivided further); the knowledge structures; the necessary information exchange between the tasks; subtask sequencing constraints; and task delegation constraints (specifying which agent may do what).
This task knowledge can be specified in the specification language. Each task is mapped 1-to-1 to a component. As is the case with tasks, there are composed and primitive components. Composed components contain: input/output interface (the interface to components outside itself), task control structure (control flow constraints of the (sub)tasks known by the component), the subtasks and their information links, and domain knowledge. Primitive tasks may be implemented using conventional software.
Each component is specified in predicate logic, using sorts, predicates (relations), functions, and constants. In order to allow flexible task allocation, information about task-knowledge allocation and task allocation (which agent can and will do what) can also be specified. Between the components there are information links. Each link connects the truth value of an atom in one component to that of an atom in another component. The mapping can be changed using a translation table. Component-environment and component-human interaction can also be described by viewing the environment or the human, respectively, as a separate component.
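The idea of an information link with a translation table can be sketched as follows. This is not DESIRE's specification language; the classes, the atom names, and the `transfer` method are hypothetical and only mirror the description above, in which a link carries the truth value of an atom in one component to an atom in another.

```python
class Component:
    """Hypothetical component holding truth values for its atoms."""

    def __init__(self, name):
        self.name = name
        self.atoms = {}  # atom name -> truth value


class InformationLink:
    """Hypothetical link that copies truth values between two components."""

    def __init__(self, source, destination, translation):
        self.source = source
        self.destination = destination
        # Translation table: atom name in source -> atom name in destination.
        self.translation = dict(translation)

    def transfer(self):
        """Copy the truth value of every mapped atom the source knows about."""
        for source_atom, destination_atom in self.translation.items():
            if source_atom in self.source.atoms:
                self.destination.atoms[destination_atom] = self.source.atoms[source_atom]


if __name__ == "__main__":
    sensor = Component("sensor")
    planner = Component("planner")
    sensor.atoms["door_open"] = True

    link = InformationLink(sensor, planner, {"door_open": "exit_available"})
    link.transfer()
    print(planner.atoms)
```

Changing the translation table redirects the information flow without touching either component, which is the flexibility the description above refers to.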
The specification can be executed. The reasoning engine of each component entails object reasoning (reasoning about states of variables), meta-reasoning, meta-meta-reasoning, etc. using the specifications given.
There's no complex negotiation between components built into DESIRE; communication means sending or waiting for raw data of a specific type over each specific link. Agents with complex negotiation will have to be implemented by hand, for example by implementing a special kind of composed component [Brazier et al., 1995].
Language allows communication to others, but language also seems to be closely related to human thought processes, as advocated by cognitive science, see section 3. This seems to be supported by experimental evidence from HCI, for example, there appears to be interference of language use with thought [Ericsson and Simon, 1980]. Thus, dialogue can be seen as collective thinking [Allwood, 1995].
According to [Levinson, 1987], people understate and oversuppose. If anything turns out to be unclear, this will generally be detected later, and can be explained afterwards. Note that it may be a significant problem finding out just what was and was not understood. For NL systems, this means that good dialogue management may compensate for bad speech or language understanding [Fraser, 1995].
On the other hand, language contains a great degree of redundancy. It is often assumed that redundancy is meant to provide more robust communication in noisy everyday environments. The noise may sneak in at any level of communication: examples are speech understanding trouble and misunderstood shades of meaning. This would also mean that more efficient communication can and probably will occur when there is less noise. For example, when explaining something complex, it is probably best to define each word you use precisely and to use very strict and unambiguous syntax, whereas when talking about something routine or within a clear context, it is possible to skip a lot of words and generally be ungrammatical.
An overview of HCI is given in [Shneiderman, 1998] and in [Veer and Lenting, 1995]. Overviews typically include models of human-computer systems, classifications of types of user interfaces, and methods and guidelines to improve usability of a computer system. For an example of user interface (UI) design recommendations, see [Roe and Arnold, 1988].
The article [Haan et al., 1991] summarises possible strategies to improve usability, which are: user cooperation in system design, prototyping, using analogies or metaphors, using guidelines or standards, or formal task-goal modelling coupled with metrics.
Generally, what is usually called `HCI' focusses on more `classical' user interfaces rather than multimodal and NL interfaces. This is a problem, because much of it is based on experimental findings, which cannot always be generalised easily. Still, it would be useful to understand the relationship between these different kinds of interaction. For an overview of speech and NL research topics in relation to HCI, see for example [Baecker et al., 1995]. For an introduction to dialogue system design, see [Bernsen et al., 1998]. For articles on NL in HCI, see for example [Aust and Oerder, 1995]. For some examples of models and theories of semantics and pragmatics in human-computer (and human-human) dialogue, see the workshop proceedings [Hulstijn and Nijholt, 1998].
Well-known in the field of usability of computer systems is Jakob Nielsen (see for example [Nielsen, 1993]). For Nielsen, usability is part of `usefulness' which is in turn part of a bigger system acceptability scheme, distinguishing `social acceptability' and `practical acceptability'. Usability is classified into: easy to learn, efficient to use, easy to remember, few errors, and subjectively pleasing. Note the distinction between learning and remembering: learning includes learning the general concept of the system, while remembering is only remembering the fussy details, while the general concept is already known. The criterion `easy to remember' is therefore especially important for casual users.
The criteria most often found in other literature are effectiveness, efficiency, satisfaction and learnability [Jones and Galliers, 1996]. Variations on this theme exist. For example, there is Relevance, Efficiency, Attitude, Learnability (REAL) [Ek, 1997] and functional, easy to use, learnable, pleasant to use [Haan et al., 1991]. Note that the aspect effective/relevant/functional is not included in Nielsen's view of usability. For Nielsen, this is part of a more general aspect, usefulness. Note also that error is no longer a separate category.
The article [Carlson and Hall, 1993] tries to determine a finer-grained covering set of criteria by means of a review of existing literature. These are user performance (degree of success or number of mistakes or time spent in error recovery), reliability (system does what user intends), stability (system guards against damaging actions or allows easy recovery), power (amount of functionality and number of different ways to achieve something), speed (speed in executing the user's commands), efficiency (speed with which the user can control the system), effectiveness (time taken by the user to accomplish a task). The tradeoff relations between the criteria are explained. The criteria consistency, flexibility, ease of use, ease of learning are argued to be contained in the others. However, the argument for ease of learning is not quite convincing: it is more than the combination of reliable, stable and effective, because these do not imply a good learning curve. The satisfaction/attitude criterion is not accounted for. Note that consistency is usually not seen as a general usability criterion but as one of the means to achieve usability, see section 5.2.
Users, and often the interface designers themselves, typically don't want to concern themselves with the details of the underlying system. Rather, some HCI designers like to talk about a user's conception of the system as fully separate from the system, and have a separate model for it. This is called the conceptual model or User's Virtual Machine (UVM). This is the view that a person has or should have of the system he/she is using. The UVM can be designed independently from the actual system according to ergonomic principles.
One of the most important principles for UVM design is consistency. Consistency means that the different actions and representations in the UVM are somehow analogous to other actions and representations. This should make the system easier for people to learn, understand, and remember. Note that `analogous' is in principle a highly intuitive concept, but its meaning is often quite clear in practice. Example: In a command-line system, it would be consistent to have the source operand always in front of the destination operand, and the command line switches always in front of the operands. One can imagine that a command-line system in which the order of these operands is different for each command is much harder to use. Analogy can refer either to the relation between different features within the UVM, or to the UVM in relation to everyday life. The last form of analogy is called metaphor.
However, care should be taken that the actual system will be able to comply with the UVM. Often, it turns out that it is not fully possible to hide the system internals from the user. This may happen in the following situations (note that this classification is my own):
One may expect that these problems get worse when the UVM is farther away from the real system, or when the possible user tasks are more complex. One way to prevent them would be to design the UVM carefully, taking into account the system internals as well.
This is a class of HCI models which model human-computer interaction as two-party communication, described as flows of specific kinds and levels of information between software components and the human. Components typically include presentation, domain, and interaction handlers. One of the first and best-known is the Seeheim model, hence the name of this section. This model, along with some others, is described in [Encarnacão, 1997]. The Seeheim model basically places three components between the human and the computer program, connected serially as follows: the human, the presentation component (lexical level), the dialogue component (syntactic level), the application interface (semantic level), and the computer program. The other models are rather similar to the Seeheim model, but are more elaborate. They include: two versions of the arch model (this model, specifically meant for modelling adaptive user interfaces, has similar components, represented as an arch bridging between the user and the system), and the triple agent model (which consists of three components: the user, the user discourse machine, and the task machine). The purpose of such models is usually to model a proper distinction between the UVM and the underlying system. Some of these also model the user in terms of the same kind of components as the computer. This more symmetrical view is closer to agent models, discussed in section 4.2.1. The difference is that they always model a single human and a computer party, and that they do not explicitly refer to the concept of agent as used in typical agent models.
The article [Waugh and Taylor, 1995] introduces a layered feedback model of communication. In this model, three theories are examined and brought together: Norman's human-computer hierarchical feedback theory (goal-intention-action-feedback-evaluation); Layered Protocol (LP), a two-party cooperation model like Norman's feedback theory, but with parallelism and information integration within each party and `virtual communication' between similar levels of the two parties (`real' communication only occurs at the lowest, physical, level); and Perceptual Control Theory (PCT), in which a human is modelled using a hierarchy of Elementary Control Systems, each of which stands for an activity related to other activities by a task-subtask hierarchy.
In [Diel et al., 1993], an object-oriented UI specification language is described, with the main goal of separating program from user interface. The UI is specified by specifying objects out of four main types: Abstract (modality-independent), Data (specifies data structures), View (which of the items of the Abstract object should be viewed), Presentation (implementation of view). The bulk of the interaction should be implemented in the Presentation objects. Some related models are mentioned also. These are Seeheim, Model View Controller, PAC, Tube UIMS, ScreenView, MacApp, InterView.
Cognitive psychology and HCI are closely related, but often, the contribution of cognitive psychology to HCI design is implicit or indirect. Sometimes however, cognitive psychology concepts are translated directly to HCI guidelines. Some examples of the latter case are given here.
In [Nielsen, 1993] for example, it is argued that gestalt theory should be applied to design layouts, memory load should be minimised, and the frequency and consequences of slips should be minimised by good interaction design. In `chunking and phrasing in HCI' [Buxton, 1986], the concept of chunking is translated to: one should design possible actions in the interface in such a way that they correspond one-to-one to chunks in an expert's memory. This enables quicker learning. In `designing for error' [Lewis and Norman, 1986], the distinction between slips and mistakes parallels a distinction in the feedback needed to enable the user to recover from an error.
Some people argue that ideas from cinematography and theatre can effectively be used for HCI. These ideas can be divided into two kinds: adapting the design paradigm, or adapting practical principles, such as filming techniques.
The book [Laurel, 1993] argues for adopting the design paradigm of structuralist theory of theatre. Basically, the human and computer (or rather, representations of `agents' by the computer) are seen as analogous to agents participating in a play. One of the main points made is that the traditional use of metaphor is insufficient for designing easy-to-use systems, because human-computer activity is necessarily different from real-life activity. The alternative would be to design the metaphor as a deliberately and carefully simplified and modified caricature of reality. Another main point can be seen in the fact that the term `human-computer interaction' is deliberately replaced by `human-computer activity' throughout the book. The main purpose of this is apparently to advocate an integrated view, in which the human is supposed to feel part of the action, rather than feel that the computer is something separate which has to be communicated with (as is stated in the book, `see the computer as a medium, not as a tool'). More detailed adaptations of theatre theory include the notions of formal cause (form), material cause (the materials used), efficient cause (the way in which they are implemented by the actors) and end cause (the purpose); probability and possibility; complication and resolution; discovery, surprise, and reversal; the effect of constraints, etc. However, the examples given are not quite concrete enough to show how this theory is an actual improvement of current practice in HCI development, nor are the illustrations of the `failures' of traditional HCI always quite accurate.
An example of the use of practical principles of cinematography in HCI can be found in [May and Barnard, 1995]. The main idea pointed out in this article is: make sure of cognitive congruence. Two examples are given: a new object that should be in focus should start at the same screen position as the previous object that was in focus, and continuous sound should remain continuous during change of viewpoint.
Some material exists that tries to couple semiotic theory with HCI, for example [Andersen, 1990], [Uzilevsky, 1994], and [Ehrich and Williges, 1986] chapter 5. Possibly, the concept of `signs' may aid in forming a more general understanding of, and relations between human cognition, HCI, NL, and multimodal interaction.
In [Callahan, 1994], the relevance of the sign type classifications icon, index, and symbol for HCI is addressed. It is argued that the choice of sign type in a user interface (for example, direct manipulation is mostly iconic, command line is mostly symbolic) influences the kinds of interaction that are possible or feasible. For example, a point is made in favour of using symbols (words) rather than icons (pictures), because if you give something a name, you may give the user an opportunity to talk about it, using the superior techniques that symbolic communication enables.
Many HCI analyses and models are based on the assumption that there is some concrete task to be achieved or problem to be solved (task-oriented or cooperative problem solving models), while some focus on vaguer tasks, like advisory, exploratory, and learning tasks. [Walker and Whittaker, 1990] contrasts properties of advisory dialogues (AD) and task-oriented dialogues (TOD).
Some task-oriented NL dialogue systems can be found in [Fischer and Reeves, 1991], [Sikorski and Allen, 1996], and [Stent and Allen, 1997].
Relatively often however, NL, multimodal, and `intelligent' interfaces are used for the classes of less well-defined tasks.
One is evaluation-oriented information provision [Jameson et al., 1995], which means information provision specifically meant to assess a certain situation, such as personal assistants or sales assistants. Another well-known example is expert systems [Fass et al., 1996].
Often found is information retrieval (IR) [Bouzeghoub and Metais, 1995]. [Androutsopoulos et al., 1995] gives an overview of existing IR systems. A more specific example is an NL database query (NLDB) system for network management [Chau, 1993]. This article also lists general advantages and disadvantages of NLDB. Advantages are adaptation to user domain knowledge, no learning time and no need to remember commands, and contextual ability. One disadvantage is the imperfection of NLP technology: there are hard-to-avoid ambiguities at all levels, unambiguous generated language looks awkward, and users always make (grammatical) mistakes. Another is users' overestimation of system capabilities.
In educational research, much material concerns instructional dialogues. [Beun et al., 1995] contrasts instructional with informational dialogues. In instructional dialogues, the tutor typically provides information and then asks questions about the subject matter, after which a dialogue may occur. Unlike informational dialogues, the first information is given spontaneously, i.e. without a query, and questions are asked by the one who gave the information (the tutor), because the tutee usually doesn't react spontaneously. So, the tutor usually takes initiative even while the information to be conveyed is usually complex. The student's current knowledge and learning abilities will have to be understood well by the tutor. A particularly subtle technique in instruction is pedagogically motivated misrepresentation (PMM). [Gutwin and McCalla, 1992] explains how and when PMMs are used, and a system that uses them is presented. More specific examples of instructional dialogue are: design instruction [Lee, 1995], optics instruction [Reiner, 1995].
In some systems, and especially in NL systems, cooperativity is explicitly modelled. For a taxonomy of cooperative response types, see [Yamada et al., 1993].
Much of the basic `cooperative answers' as referred to in NL (and sometimes command line) systems can be seen as equivalent to browsing in GUIs. In GUIs, data that is related to the data requested is displayed on screen while the user is selecting the request. For example, users select a specific entry out of a list of files, emails, or icons. If the user is not quite sure what to select he/she is enabled to browse. The linguistic equivalent of browsing is the computer coming up with data that may be useful in response to a not-quite-valid or imprecise request. In other words, cooperativeness is the same as ensuring visibility. See for example [Wolf et al., 1995], where a voice-driven email system was tested and was compared to its graphical equivalent. `Visibility' is shown to be an issue in the voice-driven variant: one has to: 1. make sure the user knows what the computer has understood, 2. give cues as to where the user is in the interface, 3. for larger mailboxes, query commands plus cooperative answers rather than browse commands would be more appropriate.
There is very little research on uncooperative dialogue systems: the most notable example is PRACMA [Jameson et al., 1994], which tries to maintain discourse obligations while having conflicting goals with the user [Jameson and Weis, 1995]. The example application domain is second-hand car sales.
In this section, an overview of some major existing NL and multimodal projects and systems is given. Other such overviews exist. In [Androutsopoulos et al., 1995], an overview of dialogue systems for IR is given. In [Smith and Hipp, 1994], an overview of NL dialogue systems is given.
The systems are summarised in the table below. The classification of the systems is divided into application domain, technical abilities, and the strategies and styles used. Most systems use a NL strategy; all systems use at least a NL style (see section 7 for the definition of strategy and style).
Abilities:
Strategies and styles:
Schisma ([Nijholt et al., 1998]; more specific information about the parser can be found in [Lie et al., 1998]) is a theatre inquiry and booking system. Its language parser is based on `rewrite-and-understand', which means any language utterance is rewritten, using production rules, into a `canonical' form. The dialogue manager is based on resolving unfilled entries after a transaction is initiated. Schisma is currently being integrated with the Virtual Music Centre (VMC), which is a virtual-reality version of the music centre building in Enschede.
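The combination of rewriting into a canonical form and resolving unfilled entries can be illustrated with a minimal sketch. The production rules, the slot names (`count`, `performance`, `date`), and the function names below are hypothetical and chosen purely for illustration; they are not taken from Schisma itself.

```python
import re

# Hypothetical production rules (pattern, replacement); NOT Schisma's own rules.
REWRITE_RULES = [
    (r"\b(i would like|i'd like|i want|can i have)\b", "want"),
    (r"\b(tickets?|seats?|places?)\b", "ticket"),
    (r"\b(two|a couple of)\b", "2"),
    (r"\b(tonight|this evening)\b", "today"),
    (r"\b(please|to|for|the|some)\b", ""),
]

# Hypothetical entries of a booking transaction, in the order they are asked for.
SLOTS = ("count", "performance", "date")


def rewrite(utterance):
    """Rewrite an utterance into a canonical form by applying each rule in order."""
    text = re.sub(r"[^\w\s']", " ", utterance.lower())  # drop punctuation
    for pattern, replacement in REWRITE_RULES:
        text = re.sub(pattern, replacement, text)
    return " ".join(text.split())  # normalise whitespace


def fill_slots(canonical, frame):
    """Copy whatever the canonical form provides into the transaction frame."""
    for token in canonical.split():
        if token.isdigit():
            frame["count"] = int(token)
        elif token in ("today", "tomorrow"):
            frame["date"] = token
    return frame


def next_unfilled(frame):
    """Return the first entry that still has to be asked for, or None."""
    for slot in SLOTS:
        if slot not in frame:
            return slot
    return None


frame = fill_slots(rewrite("I'd like two seats for tonight, please"), {})
print(frame, "-> ask for:", next_unfilled(frame))
```

In this sketch the dialogue manager would simply keep asking for the entry returned by `next_unfilled` until the frame is complete; a real system would need far richer rules and a proper parser.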
Dialogos [Albesano et al., 1997] is a task-oriented railway-enquiry telephone speech system which has been tested with a large user base.
TRAINS93 [Sikorski and Allen, 1996], [Stent and Allen, 1997] is a system for route planning of cargo trains. TRAINS93's dialogue manager [Traum, 1997] is described as a reactive-deliberative dialogue agent. Modelled are social attitudes: mutual belief (grounding), obligations, and multi-agent plan execution. Formal descriptions of intention, commitment, planning and plan execution, goal satisfaction, failure, and repair are given.
ARTIMIS [Sadek et al., 1997] [Bretier and Sadek, 1997] is a telephone inquiry system. It is plan-based, and is described as a rational communicating agent. It is reputed to be a good system.
PADIS (Philips Automatic Directory Information System) [Bouwman, 1998] is a voice dialling system, used as a test-bed for HDDL (High Level Dialogue Definition Language), developed at Philips.
COSMA [Busemann et al., 1997] is a NL front-end for appointment scheduling applications. It uses the InterRAP agent architecture as a basis. The program has to communicate with applications and multiple users.
CASSY (Cockpit ASsistant SYstem) [Gerlach and Onken, 1993] tries to assist pilots in flight planning and decision tasks. Its ultimate goal is to reduce error in the human decision-making process.
WOMBAT [Blandford, 1995] is a decision making tutor/assistant (`agent') system, which actually `abstracts' from using full natural language, but instead, focusses on problem solving and argumentation. Each party chooses from a set of canned sentences with parameters that can be filled in at specific places. The system models user goals and beliefs and knowledge about the situation. It decides what to say next by weighing the different options against a Gricean-like priority-system (though the rules are somewhat more tailored to the `tutor' purpose). The system generates reasonable replies but does not support complex argumentation.
CARTOON (CARTography and cOOperatioN between modalities) [Martin, 1997] is a system to assist in obtaining information from a map. The system attempts modal integration of user input, which comes in the form of NL and pointing.
InterLACE [Trafton et al., 1997] is a multimodal system, combining pointing on a map with typed text. It includes a full natural language engine.
ORIMUHS [Encarnacão, 1997] is a generic intelligent help and support system for users of complex software. It keeps a user, task, and discourse model to generate multimedia help. It can also be given commands using speech, keyboard, and mouse. It is implemented as a server, serving multiple individual users at the same time.
QuickSet [Johnston et al., 1997] is an assistant for planning war tactics. The user can supply speech combined with gestures on a map. The gestures recognised are out of a limited set of symbols, such as fortified line, area, tank platoon and mortar. The input from the different modalities is unified to form a feature structure. This is called unification-based multimodal integration.
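To make the idea of unifying input from different modalities into one feature structure more concrete, here is a minimal sketch of feature-structure unification over nested dictionaries. The attribute names (`act`, `object`, `location`, `coords`) and the example values are illustrative assumptions, not QuickSet's actual representation, and typed feature structures as used in practice are considerably richer.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None when two atomic values clash.
    Assumption of this sketch: None is never used as a feature value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_value in b.items():
            if key in result:
                merged = unify(result[key], b_value)
                if merged is None:
                    return None  # clash somewhere below this feature
                result[key] = merged
            else:
                result[key] = b_value
        return result
    return a if a == b else None  # atomic values must be identical


# Hypothetical partial interpretation of the spoken input "tank platoon":
# the location is required to be a point, but its coordinates are unknown.
speech = {"act": "create", "object": {"kind": "tank platoon"},
          "location": {"type": "point"}}

# Hypothetical interpretation of a pointing gesture on the map.
gesture = {"location": {"type": "point", "coords": (12, 40)}}

print(unify(speech, gesture))
print(unify(speech, {"location": {"type": "area"}}))  # clash on location type
```

The second call fails because the gesture denotes an area while the spoken command expects a point, which is how unification can rule out incompatible combinations of speech and gesture hypotheses.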
The circuit fix-it shop is described in detail in the book [Smith and Hipp, 1994], along with the underlying theories and ideas, and evaluations of the system. The application domain is highly task-oriented, and consists of diagnosing and repairing an electronic circuit. The user is capable of doing the physical actions to repair the circuit, while the computer knows what to do. It is plan-based, assuming a single shared goal, and dynamically generating the best next step, based on the current situation. An inference engine, combined with missing-axiom theory, is the central mechanism for driving the dialogue. It is used to: determine the next step in the task, keep track of user knowledge, and model dialogue focus. The system allows mixed-initiative by distinguishing four modes, which augment the working of the central engine. The modes are: directive (computer has control), suggestive (computer suggests but user may change the course of the task or dialogue), declarative (the user has control but the computer mentions relevant facts), and passive (user has control).
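The four modes can be read as a policy that decides whose proposed step is pursued and whether the computer adds a remark. The sketch below encodes only the one-line descriptions given above; the function name, its parameters, and the example steps are hypothetical, and the actual system derives its steps from the inference engine rather than receiving them as arguments.

```python
from enum import Enum


class Mode(Enum):
    DIRECTIVE = 1    # computer has control
    SUGGESTIVE = 2   # computer suggests, but the user may change course
    DECLARATIVE = 3  # user has control, computer mentions relevant facts
    PASSIVE = 4      # user has control


def next_turn(mode, computer_step, user_step=None, relevant_fact=None):
    """Return (step to pursue, remark made by the computer or None)."""
    if mode is Mode.DIRECTIVE:
        return computer_step, None
    if mode is Mode.SUGGESTIVE:
        # The computer voices its suggestion; the user's step wins if given.
        chosen = user_step if user_step is not None else computer_step
        return chosen, f"suggestion: {computer_step}"
    if user_step is None:
        raise ValueError("the user has control in this mode and must propose a step")
    if mode is Mode.DECLARATIVE:
        return user_step, relevant_fact
    return user_step, None  # PASSIVE


print(next_turn(Mode.SUGGESTIVE, "measure voltage", "check switch"))
print(next_turn(Mode.DECLARATIVE, "measure voltage", "check switch", "the LED is off"))
```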
Classifications of dialogue strategies in HCI can be found in many HCI overview books. One classification that is recommended often is the one by Shneiderman [Shneiderman, 1998]. A similar one can be found in [Veer and Lenting, 1995].
Mode and style are sometimes used to indicate classification of surface features of interaction, though the precise meaning varies. Indexical, iconic, and symbolic are sometimes called modes [Callahan, 1994]. Others call `interactive' and `noninteractive' modes. [Oviatt and Cohen, 1989], for example, discusses the effect of interactiveness on verbal explanations.
A classification of user interface types often found is: conversational, direct-manipulation, command-line, question answering, menu choosing, and form-filling. In [Shneiderman, 1998], they are called styles, while in [Veer and Lenting, 1995], they are called `paradigms'.
In [Androutsopoulos et al., 1995] (page 7), the advantages of NL are contrasted with those of other interface styles. Some interesting advantages are given: NL is better for some questions, which would require lengthy notation in other languages, and NL discourse has context, which allows briefer queries. An apparently similar contrast, namely conversational versus direct-manipulation, is given in [Stein and Maier, 1995]. However, here it is argued that the conversational style can also be used in conventional GUIs. The GUI example given in this article even uses a dialogue grammar.
In an attempt to get rid of the confusion, a terminology will be defined here. We will not use the word `mode'. We will indicate the use of a particular (static) representation that is used to communicate a state of the system to the user and vice-versa with the word `style', while we indicate the possible system dynamics, when viewed as abstract system states and possible transitions between them, with the word `strategy'. In this view, the two discussions summarised in the previous paragraph are really talking about different things: while both are talking about dialogue strategy and style as found in dialogue-grammar models of human-human conversation, [Androutsopoulos et al., 1995] is talking about natural-language strategy, while [Stein and Maier, 1995] is talking about natural-language style combined with GUI strategies.
What is generally meant by `a modality' is a specific channel through which communication can be conveyed. Typically, multiple channels may be identified, each with its own distinct properties, and each, in some way, separate from the others. The first question that comes to mind is: what channels are there? The answer probably depends on your viewpoint: in what way are the channels meaningfully distinct and separate? We will not take one viewpoint here, but instead, show some of the options available.
If we look at human perception, the most `objective' answer is probably that a modality corresponds to a human sensory and motoric subsystem, like the eyes, ears, hands, and voice. When we consider a `deeper' cognitive level, we might also say that, for example, people think differently about written language than they do about graphs, even though both are visual, suggesting another classification of modalities.
On the other side, if we are considering technological issues, we might view modalities as corresponding to the computer's input and output subsystems, like screen, speaker, keyboard, mouse, and microphone. A classification according to a `deeper' level of technological issues could also be made, for example, continuous speech versus isolated-word speech, or screen versus paper.
After having recognised the existence of modalities, some issues arise:
A classification of the ways in which combinations of modalities can be used to convey information is found in [Martin, 1997]. He identifies equivalence (modalities can convey the same information), specialisation (a modality is used for a specific subset of the information), redundancy (information conveyed in modalities overlaps), complementarity (information from different modalities has to be integrated to arrive at coherent information), transfer (information from one modality is transferred to another), concurrency (information from different modalities is not related, but merely speeds up interaction).
[Bernsen, 1996] presents modality theory, which classifies modalities according to both human and computer issues, for example, written text, typed text, typed keywords, continuous speech, and isolated-word speech are different modalities. Properties of each modality are given, so that the effects of a certain choice of modality or modalities can be predicted. The emphasis of the article lies on the speech modalities, and on when to choose for or against using speech input or output. Another article contrasting advantages and disadvantages of voice to other modalities is [Cohen and Oviatt, 1995].
Using multiple modalities coherently in some way is called multimodality. [Nagao and Takeuchi, 1994] gives a classification of the multimodal interfaces typically encountered:
In the general case, multimodality helps to decrease ambiguity by supplying additional context. The article itself addresses an example of the third kind. Examples of the first and second kinds are given below.
An example of integrating gestures and natural language can be found in [Fahnrich and Hanne, 1993]. Another can be found in [Johnston et al., 1997], which is called unification-based multimodal integration. This approach uses feature structures in which drawings are incorporated as concepts. The system QuickSet, which is also described, illustrates the approach. [Lee, 1995] concerns the feasibility of combining graphics and NL in design instruction. Design activities are argued to be found as part of all kinds of problem-solving activities. Design is argued to be impossible to explain and has to be learnt while doing. So, a `design studio', an environment for designing and learning to design, is argued for. Such a design studio may be implemented on a computer. An analysis is given of human-human interior design dialogues aided by drawings. The messages conveyed in the drawings are sometimes quite subtle, and go beyond the usage of graphical symbols out of a limited symbol set, as is assumed by a lot of gesture- and drawing-interpreting computer systems.
An example of the second class may be found in [Nagao and Rekimoto, 1995]. Here, the context is some nearby object. The interface is meant to act as if the user talks to the object directly.
[Dybkjaer et al., 1995] addresses the effect that choice of representation modality has on the language use of the user.
An example of modalities at a `deeper' cognitive level is given in [Reiner, 1995], which discusses the use of different kinds of symbols in an optics learning environment. Disciplines like physics and optics use a non-everyday set of symbols and expressions. It is argued that symbols (`labels') can be learned best by using them in communication. It is studied how such means of expression can be acquired in a collaborative-learning setting, using a computer tool to construct optics models. Four kinds of symbols are studied during the learning session: graphical (using the program), mathematical, verbal, and physical (using real objects). The introduction of new symbols in the course of the dialogue is plotted against time. Some requirement relations between symbols are shown. The students appear to prefer graphical representations.
Using multiple output modalities is called multimedia. See for example [Andre and Rist, 1995] and [Sutcliffe and Faraday, 1993]. Generating multimedia presentations from task plans using a generalised text-discourse generation method is discussed in [Rist and Andre, 1993]. A cognitive-walkthrough method for designing multimedia is discussed in [Faraday and Sutcliffe, 1993]. Their underlying theory is that visual and text objects are interpreted to form respectively objects and propositions, which are then integrated using rules from LTM into `macro-propositions', which are mode-independent. From there, a sequence of `macro structures' is formed, representing the discourse. The method is as follows: first, make a task analysis, then evaluate user attention, topic focus, argument repetition (is a concept repeated at the right times and consistently?), and macro-proposition and macro structure forming (do the propositions make sense?).
[Rauterberg, 1993] proposes a semantic description of user interfaces, which can be used to derive measures of user interface properties, such as interactive directness, feedback, flexibility of dialog interface, flexibility of application interface. The main idea behind the description is the identification of `object space', `function space' and `functional interaction points'. Note: history is not accounted for in the model. For example, command-line interfaces should have scored higher on feedback because they explicitly show history.
The relation of redundancy with mutual belief is discussed in [Walker, 1992]. It is argued that informationally redundant utterances (IRUs) often occur in human-human dialogue. IRUs are meant to make inferences explicit. For example, paraphrases occur often and are meant to represent the conclusion that the hearer has drawn. [Walker, 1994] and [Walker, 1996a] go on where the previous article left off, and discuss the relation between reasoning and redundancy. It is argued that the occurrence of IRUs is also related to the trade-off between cost of retrieval for the hearer and cost of communication for the speaker. This contrasts with the assumption of omniscient parties.
[Smith, 1997] evaluates strategies for selection of utterance verification in case of speech understanding uncertainty in the Circuit Fix-It Shop. It is not efficient or user-friendly to verify everything, so, selections are made according to `parse cost' (the unlikeliness that an insertion, substitution, or deletion of a recognised word could have happened) and `expectation cost' (the unlikeliness that a specific semantic frame could occur given the previous utterances). Experimentally, a combination of both proves best. Introduction of a variable, topic-dependent, verification threshold is shown to make little difference. The remaining bottleneck appears to be misrecognition of content words (like one digit for another, or one name for another).
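A selection rule of this kind can be sketched as a comparison of a combined cost against a threshold. Note that the way the two costs are combined here (a weighted sum), the weights, the threshold values, and the topic names are all assumptions made for illustration; the summary above does not specify how [Smith, 1997] combines them.

```python
# Hypothetical topic-dependent thresholds; the default applies to other topics.
TOPIC_THRESHOLDS = {"default": 1.0, "wire_colour": 0.8}


def should_verify(parse_cost, expectation_cost, topic="default",
                  w_parse=0.5, w_expect=0.5):
    """Decide whether an understood utterance should be verified with the user.

    parse_cost:       unlikeliness of the word-level edits needed to parse it
    expectation_cost: unlikeliness of the semantic frame given the dialogue so far
    The weighted sum is an assumption of this sketch, not the published method.
    """
    threshold = TOPIC_THRESHOLDS.get(topic, TOPIC_THRESHOLDS["default"])
    combined = w_parse * parse_cost + w_expect * expectation_cost
    return combined > threshold


print(should_verify(0.2, 0.3))  # cheap parse, expected meaning
print(should_verify(2.0, 1.5))  # costly parse, unexpected meaning
```

Lowering the threshold verifies more utterances, which costs dialogue time, while raising it lets more misunderstandings through; this is the trade-off the evaluation is concerned with.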
[Trabelsi et al., 1993] discusses heuristics for generating informative responses to failing queries in NLDB. The kinds of failures are classified into: value (there were no matches for a specific value of an attribute), condition (no matches for value in current context), attribute (attribute does not exist), concept (concept or entity type is not known). Basically, the solution to each is to state clearly the reason why the query failed. Repair is done using lists of options to be selected until a complete query is specified or the user decides that the option wanted is not present.
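The four failure classes can be illustrated with a toy query checker that reports the most basic reason for a failure first. The table layout, the example data, and the interpretation of a `condition' failure as "each value occurs somewhere, but not together in one record" are assumptions of this sketch rather than details taken from the article.

```python
# Toy database, purely illustrative.
TABLES = {
    "employee": {
        "columns": ("name", "dept", "city"),
        "rows": [
            {"name": "ann", "dept": "sales", "city": "enschede"},
            {"name": "bob", "dept": "research", "city": "utrecht"},
        ],
    }
}


def diagnose(concept, conditions, tables=TABLES):
    """Return (failure_kind, message), or (None, matching_rows) on success.

    conditions is a list of (attribute, value) pairs that must all hold.
    """
    if concept not in tables:
        return "concept", f"I do not know anything about '{concept}'."
    table = tables[concept]
    for attr, _ in conditions:
        if attr not in table["columns"]:
            return "attribute", f"A {concept} has no attribute '{attr}'."
    for attr, value in conditions:
        if not any(row[attr] == value for row in table["rows"]):
            return "value", f"No {concept} has {attr} = '{value}'."
    matches = [row for row in table["rows"]
               if all(row[attr] == value for attr, value in conditions)]
    if not matches:
        return "condition", f"No single {concept} satisfies all conditions together."
    return None, matches


print(diagnose("employee", [("dept", "sales"), ("city", "utrecht")]))
```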
Several varieties of plan-based models exist: one can model one joint plan, one plan per agent, or multiple (tentative) plans for a single agent. [Ramshaw, 1991] discusses a plan exploration model. This allows multiple and hypothetical plans. It has three levels: domain level, exploration level, and discourse level. [Jameson and Weis, 1995] discusses discourse obligations in noncooperative dialogues, which implies non-shared plans.
Initiative, or control, refers to who is taking control of the interaction. Typically distinguished are mixed-, user-, and system-initiative. [Kitano and Ess-Dykema, 1991] discusses a plan-based understanding model for mixed-initiative dialogues. The model proposed allows non-shared domain plans.
Mixed versus system initiative in NL dialogues is examined experimentally in [Walker et al., 1997a]. One particularly striking remark made here was that people found the pace of the system-initiative alternative faster, even though the mixed-initiative was actually quicker. A reference was made to similar findings in the case of GUIs [Smith, 1996].
Dialogue structure refers to the discourse structure of one participant, and to initiative and initiative shifts between participants. [Passoneau, 1989] discusses discourse structure in relation to the discourse referents `it' and `that'. Theory about the function of the two pronouns is given. The article concludes with a set of rules relating pronoun to discourse center, center retention, and antecedent. [Walker and Whittaker, 1990] discusses topic centering in relation to mixed initiative.
[Karsenty, 1993] models interactive explanations using rhetorical schemas. Human-human design dialogues and draft pictures were recorded and analysed using rhetorical schemas. It was found that rhetorical schemas only explain dialogue goals, and not task goals.
The relation between communication goals and feedback is discussed in [Nivre, 1995]. Communication goals are classified into a three-level hierarchy: evocative (pragmatic), signalling (semantic), utterance (syntactic). Feedback may concern each of these goals, and may be positive (OK), neutral (inconclusive), negative (failed). Some combinations are not possible: negative feedback on a lower goal implies negative feedback on a higher goal. The converse goes for positive feedback. Veridical feedback is any feedback received; intentional feedback is explicit feedback by the other participant.
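The two implication rules can be written down as a small inference procedure over the three levels. The sketch assumes the ordering utterance (lowest), signalling, evocative (highest), as suggested by the syntactic/semantic/pragmatic labels; the function name and the strict treatment of contradictions are choices made for this illustration.

```python
# Goal levels from lowest to highest (assumed ordering, see lead-in).
LEVELS = ("utterance", "signalling", "evocative")


def infer(feedback):
    """Complete a partial {level: 'positive'|'neutral'|'negative'} mapping.

    Rule 1: negative feedback on a goal implies negative on all higher goals.
    Rule 2: positive feedback on a goal implies positive on all lower goals.
    Raises ValueError if the given feedback contradicts an implied value.
    """
    result = dict(feedback)
    for i, level in enumerate(LEVELS):
        if result.get(level) == "negative":
            for higher in LEVELS[i + 1:]:
                if result.get(higher, "negative") != "negative":
                    raise ValueError(f"{higher} cannot succeed when {level} failed")
                result[higher] = "negative"
    for i, level in enumerate(LEVELS):
        if result.get(level) == "positive":
            for lower in LEVELS[:i]:
                if result.get(lower, "positive") != "positive":
                    raise ValueError(f"{lower} must have succeeded if {level} did")
                result[lower] = "positive"
    return result


print(infer({"signalling": "negative"}))
print(infer({"evocative": "positive"}))
```

For instance, a reply that shows the hearer did not grasp the meaning (negative at the signalling level) also tells the speaker that the intended effect was not achieved, whereas it says nothing yet about whether the words themselves were perceived correctly.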
[Bunt, 1995] argues for transparency and naturalness as the key concepts for successful interaction. To achieve this, it is argued that precise modelling of what is going on in a dialogue is needed. Presented is Dynamic Interpretation Theory: the contextual aspects covered are linguistic, semantic, physical, social, and cognitive context. Dialogue acts (one utterance may consist of multiple acts) are classified into dialogue control and task-oriented dialogue acts. A classification of acts is given.
In [Tillmann and Tischer, 1995], self-repair disfluencies (hesitations and repetitions) are measured in different control circumstances, which are using a button to shift control, using normal conversation, and using normal conversation without a given problem to solve.
[Heeman and Allen, 1997] analyses speech to determine intonational boundaries (segmenting the speech into phrases), speech repairs (self-repairs), and discourse markers (these influence the structure of the discourse). They argue for the need of an integrated analysis, integrating these three kinds of information.
In plan-based models, plans may include discourse plans next to task plans. For example, [Moore and Paris, 1989] discusses text planning in advisory dialogue. A distinction is made between intentional and attentional structure. A scheme is explained with which the system is able to take task goals, text goals and text structure into account for generating text. [Lambert and Carberry, 1991] discusses another plan-based model of dialogue, which has three plan levels: domain level, problem solving level, and discourse level. [Lambert and Carberry, 1992] continues on the previous model, which is extended for modelling conflicting beliefs and negotiating about such conflicts, thus enabling negotiation subdialogues. Another approach is found in [Kitano and Ess-Dykema, 1991], a plan-based understanding model for mixed-initiative dialogues. This particular model also allows non-shared domain plans.
Another often-found approach to modelling the effect of dialogue utterances on the dialogue structure is the dialogue grammar approach. Sometimes, the dialogue grammar approach is meant merely to give some extra cues to predict control shifts. The types of utterances identified include at least the distinction question-assertion. In [Jönsson, 1993], this approach is compared to the plan-based approach. It is argued that, in practice, the dialogue grammar approach works as well as the plan-based approach. Shifts in initiative are usually modelled either by means of a finite state automaton or a hierarchy of dialogues and subdialogues. [Walker, 1996b] contrasts a linear with a hierarchical model of control shifts. Since old utterances are forgotten, it is argued that control shifts are never completely hierarchical.
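As an illustration of the finite state automaton variant, a dialogue grammar can be written as a transition table over utterance types. The sketch below is a toy grammar made up for this purpose, not one taken from the cited articles; the state names, the `TRANSITIONS` table and the function `accepts` are hypothetical.

```python
# Hypothetical dialogue grammar: (state, utterance type) -> next state.
TRANSITIONS = {
    ("idle", "assertion"): "idle",
    ("idle", "question"): "awaiting_answer",
    ("awaiting_answer", "assertion"): "idle",  # the assertion answers the question
}


def accepts(utterance_types, start="idle", final=("idle",)):
    """Return True if the sequence of utterance types is a well-formed dialogue."""
    state = start
    for utype in utterance_types:
        key = (state, utype)
        if key not in TRANSITIONS:
            return False  # the grammar does not predict this utterance here
        state = TRANSITIONS[key]
    return state in final
```

In this flat automaton a question asked while another question is still open is rejected. Allowing such clarification subdialogues is what the hierarchical alternative adds, for instance by keeping a stack of open questions.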
[Walker and Whittaker, 1990] discusses the relation between centering and conversation control in advisory dialogues (ADs) and task-oriented dialogues (TODs). The utterance types that are identified are assertions, commands, questions, and prompts. Control shift types that are identified are abdication, summary, and interruption. Control shifts are either persistent or temporary. In the case of temporary control shifts, the control is relinquished as soon as possible. Possibly, these interruptions have a hierarchical structure. Apparently, interruptions happen when there are problems with either the information quality or the plan quality. It is shown that control shifts have a specific influence on references. Also, control shifts, especially summaries and abdications, occur more often in ADs.
[Chu-Carroll and Brown, 1997] discusses prediction of initiative in collaborative (planning) dialogues. A summary of previous work on dialogue initiative is given. It is argued that task initiative and dialogue initiative have to be separated. A predictive-cue approach, which is similar to the dialogue grammar approach, is proposed, obtained after annotating and analysing TRAINS91 dialogues. Nine cue types, classified into three classes, are used to predict initiative.
[Iwadera et al., 1995] divides a dialogue in components (acts or utterances) which each have a type, and can be combined into higher-level structures, called moves and exchanges. The theory classifies topics into short-term and long-term. The scope of a topic is determined by the act-move-exchange-structure. Note: it is not clear how the theory deals with sudden interruptions or changes of topic.
There are some studies that try to analyse the processes that developers (analysts, designers, programmers) go through [Schooten, 1997]. The following general observations are particularly relevant:
We will define a methodology as a stepwise prescription of development activities. Methodologies may have various levels of detail and rigidity. For example, specific activities and/or specific kinds of descriptions may be prescribed for each step. Various low-detail and general examples can be found in Kaaniche and Mazet's lecture in [Pasquini et al., 1998], some of them prescribing a rather complex sequence of steps. Most often found, however, is the `design cycle' that goes something like: analysis - design - implementation - evaluation - analysis - and so on. The purpose of going through the stages systematically is to be able to gain the maximum insight needed to develop a successful system. Usually, one starts with the analysis stage, analysing the system that is already present and has to be rebuilt, but in some methods, one may jump in at any stage. Some variants are not cyclic, but finish at the evaluation stage, assuming the problems found in evaluation are small enough to be fixed on the fly, without requiring a reiteration. Some methods lump together the design and implementation stage into a single stage, while others lump together the analysis and evaluation stage.
Why use a development methodology? [Tullis, 1993] shows that a naive choice of interface strategy may not be the best one. In the experiment, naive designers turn out to be fond of `drag and drop', while evaluation shows that it is one of the slowest strategies to use, and that it is not the strategy that the users prefer. The study also shows a positive correlation between users' preference and efficiency.
The EAGLES handbook [Gibbon et al., 1997] proposes guidelines for specification, design, and assessment (which means evaluation) of NL and spoken dialogue systems, as well as practical tips. The book identifies and addresses three different kinds of dialogue systems: menu, spoken, and multimodal. These are also contrasted with command systems. The book gives some general recommendations for dialogue systems, as well as methodological recommendations for building such systems. Possible purposes of development methodology are classified into:
Design strategies are classified into: design by intuition, by observation (analysing existing corpora), or by simulation (also called Wizard of Oz (WOz) experiment; this means using a human to simulate a computer, see section 10). Iteration in design strategy is also described. Some issues concerning WOz experiments are addressed: a set of subject, wizard, dialogue model, and communication channel variables for WOz are explained. An assessment is described as consisting of two components: characterisation and assessment framework. Characterisation consists of a list of system, user, task, environment, the resulting corpus, and overall characteristics. Assessment framework consists of assessment situation, choice of glass or black-box view, quantitative and qualitative measures, and methodological recommendations. The quantitative metrics listed in the book are: average dialogue duration, average turn duration, contextual appropriateness, correction rate, transaction success.
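Most of the quantitative metrics listed above can be computed directly from an annotated dialogue log. The sketch below assumes a hypothetical log format (one record per dialogue, with a success flag and a list of turns that each carry a duration in seconds and a correction flag) and uses straightforward definitions of the metrics; the handbook's exact definitions may differ, and contextual appropriateness is left out because it needs human judgement.

```python
def summarise(dialogues):
    """Compute simple corpus-level metrics from a list of annotated dialogues."""
    turns = [turn for dialogue in dialogues for turn in dialogue["turns"]]
    total_time = sum(turn["duration"] for turn in turns)
    return {
        "avg_dialogue_duration": total_time / len(dialogues),
        "avg_turn_duration": total_time / len(turns),
        # Fraction of turns that were spent correcting an earlier problem.
        "correction_rate": sum(turn["correction"] for turn in turns) / len(turns),
        # Fraction of dialogues in which the task was completed.
        "transaction_success": sum(d["success"] for d in dialogues) / len(dialogues),
    }


# Hypothetical two-dialogue corpus.
log = [
    {"success": True, "turns": [{"duration": 2.0, "correction": False},
                                {"duration": 4.0, "correction": True}]},
    {"success": False, "turns": [{"duration": 6.0, "correction": False}]},
]
result = summarise(log)
```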
Group Task Analysis (GTA) [Veer et al., 1996b], [Veer et al., 1996a] prescribes a combination of ethnography with more traditional task analysis methods and notations, like Hierarchical Task Analysis (HTA), work flow analysis, and object modelling. Contrary to what the name suggests, it covers both analysis and design stages, but does not prescribe precise procedures for specific stages. Furthermore, one may jump in at either stage. The method does prescribe the definition of two task models for each loop in the design cycle, and suggests various methods for obtaining the knowledge needed to form these models. Task model 1 describes the current situation, and is part of analysis. Task model 2 describes the functioning of the envisioned changed system, and is part of design.
Task Oriented Modelling [Warren, 1993] (TOM) splits the problem to be tackled into three models: domain, user, and device model. The development process is split into four steps. The process may be repeated until the resulting system is satisfactory.
Method for Usability Engineering [Lim and Long, 1994] (MUSE) is an attempt to integrate HCI into structured software engineering methods. It is argued that in most methodologies used in practice, the consideration of HCI issues is limited to the evaluation phase, which is usually at the end of the design process. This means there is not enough human factors input. In their own words, human factors are addressed `too little, too late'. In order to overcome this problem, concepts, notations, and design procedures have to be integrated with existing software engineering methodologies. MUSE's emphasis is on defining procedures of notational transformation, rather than defining HCI knowledge and methods needed to do this successfully; it is assumed that the method is used by experienced HCI developers. The general idea of MUSE is to incorporate both human and computer factors in the task and domain models used in the development process. Human and computer factors are conceptualised by separating online tasks (with computer) from offline tasks (completely without computer). Online tasks are split into interactive tasks and automated tasks. A number of notations are suggested, in particular, semantic nets for domain modelling, and structured diagrams for task modelling. The stages of the methodology are:
Specification, or modelling, is description of a system and context in any stage of the development process, typically describing objects, processes, or constraints directly related to the system.
Data collection is gathering raw data about the system which may be interpreted by the developers afterwards. Sometimes, the data is gathered by humans, by hanging around, observing, and keeping a log, as for example happens in ethnographical methods. In other cases, data may be collected automatically, by means of computer input logging or video cameras. In any case, the experimental setup (or absence of any explicit setup, as in ethnography), has to be specified, which describes how the data is obtained. After data is collected, it may be interpreted, resulting for example in performance information (such as performance metrics) or qualitative information (such as a specification). Data collection only happens in the analysis and evaluation stage.
We define abstraction as simplification or reduction of information to filter out only those parts that are the most relevant, as seen from a specific viewpoint. Specification necessarily means abstraction, though some abstractions may be more explicitly or better chosen than others. Note that data collection also implies abstraction, because, necessarily, not all data is collected. It is important to realise that abstraction always implies throwing away data that may actually turn out to be relevant. Abstractions are sometimes chosen according to the same theory that underlies the design of the system that is being abstracted. If the abstraction is wrong, the developers may remain blind to the real issues. So, it is useful to be aware when abstraction is occurring, and to know the underlying assumptions.
Formal specification is unambiguous and explicit, hence it allows exact reasoning and description. Sometimes such a specification is executable (which means it can be run on a computer) as well. It is also possible to under-specify, explicitly leaving things unspecified, thus specifying a range of systems, rather than just one. Underspecified specifications are generally not simulable.
The article [Wright et al., 1997] shows an example of how formal specification may play a role in the HCI development process. Formal specification (Z) is considered complementary to non-formal empirical evaluation (cognitive walkthrough and usability inspection). Formal modelling enforces precision, and going from an abstract to a concrete model shows possible options clearly. Empirical evaluation shows what relevant aspects of the model are still missing, and need further elaboration.
The article [Palanque and Bastide, 1996] argues for formal modelling of both the user's task (task model) and the system's functionality using Petri nets. It is shown how these two models can be combined to obtain a human-computer-interaction model with explicit and even executable dynamics. This allows a clear view of what roles the human and computer are playing, and may make remodelling by shifting tasks between human and computer easier.
[Lauesen and Harning, 1993] argues that current UI design is mostly done either using formal methods (which are not good for user-centered design) or using prototyping (which is not structured enough for large systems). An attempt is made to supply an alternative by using a variation on traditional design methods for user-centered design, involving the users in the design process. Special care has to be taken to make sure the users understand the specifications, i.e. the diagrams used in the design method have to be novice-friendly. The notations in the method consist of:
The article [Lim and Long, 1993] argues for using structured notations from software engineering for UI specification. In their view, structured notations may solve the lack of specificity, descriptive scope, communicability, and maintainability of current methods. The notations are intended to be read by the users as well. Examples given are: semantic nets, network diagrams (DFDs), and structured diagrams (flowcharts).
In [Ehrich and Williges, 1986] chapter 3, a notation for control and data flow for specifying dialogues, called SUPERMAN, is described. Chapter 5 describes language specification in Backus-Naur Form (BNF) and by stating examples.
Some notations are simply programming languages, tailored for interaction design. Sometimes, the programs written in these languages are not quite fit for producing professional products, because the languages are very limited, to keep them simple, but they are still useful to create an approximation of an envisioned system which allows early analysis. Examples are Speechmania [Philips, 1998], which is meant for NL dialogue design, and 3dt [Lewin, 1997], which is meant for dialogue design for more traditional user interfaces.
The article [Baekgaard, 1995] describes Dialogue Description Language (DDL). DDL consists of three layers: the graphical layer, the frame layer, and the textual layer. In this article, only the graphical layer is described, which is similar to a finite-state automaton. It is not possible to describe mixed-initiative systems with this language.
The article [Palanque and Bastide, 1996] describes Petri nets to model both human and computer activity in a global task. It is also shown how the Petri net can be used to provide automatically-generated help on specific actions.
The article [Kinoe et al., 1993] introduces a tool for both the analysis stage and the redesign stage of (iterative) user-centered UI design. In the first stage, empirical data is segmented (cut up in units according to the user's basic subtasks) which are tagged (given attributes according to the analysis model). This is done by hand with support by the tool. The results can be ordered in several ways, and can then be reordered and commented on by hand. Reordering is supported by a genetic algorithm that reshuffles orderings. The designers can select the produced orderings that look best.
DIGIS [Bruin and Bouwman, 1993] (Direct Interactive Generation of Interacting Systems) is an object oriented design environment based on the PAC (Presentation, Abstraction, Control) model. UI design is argued to have 3 aspects: presentation, control flow, and interfacing with application. Objects consist of attributes and access protocols. A protocol is described using a regular expression that defines the allowable sequences of actions. On top of that, higher-level tasks can be described as regular expressions of lower-level tasks.
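The idea of describing a protocol as a regular expression over actions, and a task as a regular expression over protocols, can be sketched with an ordinary regular expression library. The action names and the protocol below are invented for illustration and do not come from DIGIS; actions are encoded as space-separated words so that Python's `re` module can be used directly.

```python
import re

# Hypothetical access protocol: open, any number of reads or writes, then close.
PROTOCOL_SRC = r"open(?: (?:read|write))* close"
PROTOCOL = re.compile(PROTOCOL_SRC)

# Hypothetical higher-level task: one or more complete protocol runs in a row.
TASK = re.compile(rf"{PROTOCOL_SRC}(?: {PROTOCOL_SRC})*")


def allowed(actions, pattern=PROTOCOL):
    """Return True if the action sequence is permitted by the pattern."""
    return pattern.fullmatch(" ".join(actions)) is not None
```

A sequence such as open, read, write, close is accepted, while a sequence that reads before opening, or never closes, is rejected.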
AME [Martin and Winterhalder, 1993] (Application Modeling Environment) is based on using traditional CASE-tools for object oriented (OO) analysis and design, combined with a knowledge-based User Interface Management System. AME's representations are split into three levels: analysis/design, construction, and generation. In the analysis/design level, an OO representation of the interface is constructed using object oriented analysis and design tools. This can then be refined in the construction level. The generation level only generates code. Note that there is no extra help for UI design except the ability of constructing an interface rapidly, assuming that using OO as a paradigm is indeed a good choice.
The book [Jones and Galliers, 1996] addresses some methodological aspects of NL evaluation in great detail. The book is mostly about speech recognition, language parsing, language generation, and NL information retrieval (this aspect is the closest to NL dialogue).
First, terminology and experimental design issues are addressed. An evaluation has a setup (environment of the system), system (including goal and identifications of subsystems), task, and domain language. Evaluation is separated into three levels: the criteria, measures, and methods levels. The criteria are the general requirements. They are classified into intrinsic (related to system objective), and extrinsic (related to system function inside setup). Multiple measures (or metrics) may be used to measure each criterion. Measures may be general (applicable to multiple systems), in the form of baselines or benchmarks, and can be compared to exemplars or norms. A method is the way an experiment is designed. Evaluations are classified into investigation (reviewing a system at work), and experiment (trying to figure out how something will work). The factors that determine performance are classified into system variables and environment variables. It is argued, though, that it is hard to identify environment variables meaningfully. Note that a system may be viewed from different angles and perspectives: the goals or interests of different users/people may be very different, implying there is no one set of criteria that is unequivocally important. For evaluation, generic systems (systems which can be reprogrammed entirely) are a particularly hard case: possible setups and environments may vary greatly, and cost of customisation is also important.
Evaluation tools are described next. For the purpose of comparative evaluation, these tools are shared criteria, measures, and methodologies. General problems with the use of such tools are incomparability of systems, or goals and setups that are too different. For basic evaluations, there are: test data, evaluation data (this is test data with answers), benchmarks, test beds, and (support) toolkits. Tools from social science are: the general usability criteria effectiveness, efficiency and acceptability, and the general validation measures reliability (are the results consistent?) and validity (how good is the relation to the criterion?). There is supposed to be a norm value for each measure. Other things discussed are antecedent and intervening variables, and quantitative and qualitative measures.
The book continues with a review of evaluations and evaluation methodologies in the literature. The only actual dialogue systems reviewed are database query systems. Typical measures used are ratio of successful answers, percentages of utterances understood, duration measures, problem complexity measures, and utterance rating (such as type of utterance and correctness). An interesting measure is dialogue tree breadth, obtained by asking many people what utterance should be next at a point in the dialogue. A particular problem is the validity of using general measures, which may not be quite applicable to systems. For example, a parser may yield output that works very well for the next subsystem in the chain, but may look bad when viewed in comparison with a parser norm based on parsers with a slightly different purpose or underlying theory. Summarising performance in only very few numbers is dangerous: the numbers do not answer why a system has a particular performance, and the numbers may not even be relevant, especially in view of variations in the larger setting of a system.
Corpus issues are discussed next. Addressed first is design and purpose of Wizard of Oz (WOz) experiments. The idea behind WOz is to assist evaluation by supplying corpora which can be tested. The performance results can be gathered easily and may cover many aspects of language processing. However, it is limited to syntax parsing, and the usual problems of incompatible metrics are not solved. Also, domain properties are not accounted for. Further corpus design issues addressed are using corpora and test collections, test suites and tools, architectures (which means ways to build systems from components), and standards (such as annotation standards).
Mega-evaluation, which is evaluation across many systems, is addressed next. Two models are discussed: the hub-and-spokes model and the braided-chain model. In the hub-and-spokes model, evaluations are considered to be linked (meaning that they are comparable) when they have common data, tasks, systems, or metrics. The links can be mapped to a hub-and-spokes map, with the hub standing for inter-system comparison, and the spokes for testing data mismatch problems. In the braided chain model, comparison among different tasks is modelled as well. Different kinds of systems can be `braided' together by supplying the output of one as input to another. This way, different kinds of systems can be evaluated stand-alone as well as in combination with other systems.
Some final comments and issues on evaluation are given. In particular, it is claimed that evaluation is generally task-oriented, and it is generally not clear how to decompose or identify system and environment factors.
The article [Frascina and Steele, 1993] discusses using task analysis for UI development for a hospital information system. The old method (using paper instead of computers) was analysed first. The task analysis method used is called TAKD (Task Analysis for Knowledge Descriptions). It consists of: data collection and creation of a list of activities, then construction of a task description hierarchy from the activity list, and then an analysis of sentences of Knowledge Representation Grammar, which are generated from the task hierarchy.
The article [Chase et al., 1993] aims at describing and assessing different user activity notations from the literature according to: scope (what design activities/phases they support), content (what design aspects/objects they can represent), and requirements (what kinds of communication/documentation are required). A 3D chart is made with a (hopefully) covering set of criteria describing each of the three dimensions. Each intersection point in the 3D cube can be given a value. The method was tested by measuring analysts' agreement after analysing User Action Notation (UAN) using the method.
A method for finding out and anticipating common errors by means of early experiments in multimodal systems is described in [Trafton et al., 1997].
The article [Brouwer-Janse, 1995] describes a method to analyse problem-solving procedures. First, data is collected by means of the think-aloud method, followed by an interview and a retrospective report of the task. Analysis is done in two stages: first, the procedures used are identified by analysing the data, then, each step in each procedure is classified into one of 24 `subroutines'.
In [John and Marks, 1997], the effectiveness of some usability evaluation methods is compared. The methods compared are: claims analysis, cognitive walkthrough, GOMS, heuristic evaluation, user action notation, and simply reading the specification. Comparison is done by determining whether problems were identified, whether the problems led to a design change, and whether the design change was good or bad. The methods were tested on a partially-implemented multimedia authoring package. The results are not clear-cut, but it was concluded that they were generally unsatisfactory, especially since the `simply reading the specification' method seems to come out relatively well.
Figure 2 shows the possible experimental setups and sources of experimental data. This only concerns the evaluation of a one-on-one system, and the broader context of global goals or organisational settings is not included. The different combinations of agents and data flows shown in the figure will be explained below.
Figure 2: Evaluation setups and data collection
The following combinations of systems may be found:
An unusual example of protocol verification in NL systems is found in [Guinn, 1995]. Here, the self-consistency of a natural-language dialogue system is tested by simulating it against itself. Since NL dialogue tries to offer `human-like' communication, the relation between human and computer is more symmetrical, and it becomes feasible to make a computer system talk to itself to test it.
In an evaluation with a real system and real users, the role of the user may be filled by actual end users, or by other people representative of the end users. The task and environment may be the real-life situation (which is called field study) or may be designed carefully to reduce random environment factors (which is called laboratory- or controlled study).
The following means of getting data for analysis are found:
There are several attempts at standardisation of corpus annotation. Such standardisation should allow general tools to be used for annotation, and better comparison of different systems. One such attempt is described in [Dahlback et al., 1997]. It addresses:
In the proceedings [Andernach et al., 1995], using corpora for systems development is addressed. Using corpora for utterance type prediction is addressed in [Andernach, 1996]. Using corpora for obtaining a simulated-user model for evaluation is described in [Eckert et al., 1998].
The following means of analysing data are found:
Some examples of formal verification exist. In [Lewin, 1997], deadlock and reachability of states are verified, using a finite-state model of the system. In [McInnes et al., 1995], similar verification is done, though it also includes some text style checks. The article also discusses the possibilities of statistical state-transition checking, based on statistics of frequency of use. This idea is much like the user simulation scheme described in [Eckert et al., 1998].
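Checks of this kind are easy to express on a finite-state model. The sketch below is a generic breadth-first search and is not the procedure used in the cited articles; the example dialogue model, with its state names, is hypothetical. A state counts as a deadlock here if it is reachable, has no outgoing transitions, and is not an intended final state.

```python
from collections import deque


def reachable(transitions, start):
    """Return the set of states reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def check(transitions, start, final):
    """Return (unreachable states, deadlocked states) of a finite-state model."""
    states = set(transitions) | {s for succ in transitions.values() for s in succ}
    seen = reachable(transitions, start)
    deadlocks = {s for s in seen if not transitions.get(s) and s not in final}
    return states - seen, deadlocks


# Hypothetical dialogue model: state -> list of successor states.
model = {
    "greet": ["ask"],
    "ask": ["confirm", "stuck"],
    "confirm": ["bye", "ask"],
    "bye": [],
    "stuck": [],      # no way out and not a final state: a deadlock
    "help": ["ask"],  # never entered from the start state: unreachable
}
unreachable, deadlocks = check(model, "greet", final={"bye"})
```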
There are a particularly large number of user simulation systems, which are described in this separate section. Usually, the simulation model is directly based on cognitive psychology, and often, the results of the simulation are used in combination with metrical methods.
The article [Haan et al., 1991] summarises a number of simulated-user analysis models: ETIT, TAG, GOMS, ETAG, CCT, and CLG. All these models are based on a compositional description of the user's task. Rules are defined which break down a high-level description of the task into low-level descriptions (usually, this simply means the steps to be taken to achieve the task). Levels that are typically identified in these models are some selection from the levels task, semantic, syntactic, and physical. Usually missing in these models is a way of accounting for error. Missing in the examples and considerations in the article however, is the possibility of modelling the effect of computer feedback (which is essential in NL systems; note that this may also account for error). Computer feedback may actually cause the user's subtasks and subgoals to change, especially for tasks where the outcome is not known yet (for example in exploration, when getting to know a computer system, or when searching complex databases).
Goals, Operators, Methods, and Selection rules (GOMS, see [Shneiderman, 1998] page 55 and [Olson and Olson, 1990] for an overview) is a procedural model of user activities. GOMS is a well-established method, and many variations exist. GOMS tries to model users' tasks all the way from the `goal level' to the `operator level' (which is the lowest level of subtasks). This is done using a set of rules (methods) describing the sequence of steps (operators) that have to be taken to complete a goal, and rules (selection rules) of how to choose between alternative methods according to more specific goal information.
The main idea is that one arrives at a sequence of operators, which are low-level enough to allow prediction of performance (like speed) easily, using easily-obtainable experimental data (for example, duration of pressing a key, typing a word, clicking a mouse, etc.). At this lowest level, a simple cognitive performance model is assumed. Basically, it amounts to summing up all operator durations to arrive at the total duration. Some variations on GOMS have more complex low-level cognitive models, which take into account operators which can be done simultaneously, by using critical path analysis [Gray et al., 1990]. Users can be simulated by feeding all rules into an inference engine or an AI (for example, SOAR), and then feeding the resulting sequence of operators into the low-level cognitive model.
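The two low-level performance models mentioned here differ only in how operator durations are combined. The following sketch contrasts them; the operator names and durations are placeholder values chosen for illustration, not measured data, and the dependency graph is hypothetical.

```python
def serial_time(operators, durations):
    """Basic model: total time is the sum of all operator durations."""
    return sum(durations[op] for op in operators)


def critical_path_time(durations, depends_on):
    """Parallel model: total time is the longest path through the
    dependency graph (an operator starts when all its prerequisites end)."""
    finish = {}

    def finish_time(op):
        if op not in finish:
            start = max((finish_time(d) for d in depends_on.get(op, [])),
                        default=0.0)
            finish[op] = start + durations[op]
        return finish[op]

    return max(finish_time(op) for op in durations)


# Placeholder durations in seconds (illustrative only).
durations = {"perceive": 0.1, "decide": 0.5, "move_hand": 0.4, "press": 0.2}
# Deciding and moving the hand can overlap once the stimulus is perceived.
depends_on = {"decide": ["perceive"], "move_hand": ["perceive"],
              "press": ["decide", "move_hand"]}

t_serial = serial_time(["perceive", "decide", "move_hand", "press"], durations)
t_parallel = critical_path_time(durations, depends_on)
```

With these numbers the serial model predicts 1.2 seconds and the critical path model 0.8 seconds, because moving the hand overlaps with deciding.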
GOMS is typically used for traditional HCI in which the users' goals are clearly defined and the user drives the system. However, it was also used with good effect for a real-time machine-paced interface with machine-driven subtasks (a video game) [John and Vera, 1992], which shows that GOMS is more universally usable.
[Byrne et al., 1994] describes a system (USAGE) that automatically generates an NGOMSL specification from a user interface specification created in a UIDE (User Interface Development Environment) and runs it with user task specifications to obtain an efficiency prediction automatically. The biggest limitation mentioned is that no multi-level task hierarchies are supported.
Execute Process-Interactive Control (EPIC) is described in [Kieras et al., 1997]. EPIC is based on CPM-GOMS, which makes use of information about temporal dependencies of subtasks, and uses the Model Human Processor model to simulate human behaviour. Like CPM-GOMS, EPIC is advocated as being well-suited for multimodal and complex tasks. It is argued that CPM-GOMS is labour-intensive, as the sequence of human behaviour corresponding to each task to be tested must be specified explicitly. EPIC tries to solve this problem by generating simulated human behaviour directly from the task specification. The article refers to the other cognitive simulation models MHP, HOS, SAINT, CCT, ACT-R, and SOAR.
EPIC consists of a relatively complete and detailed cognitive model, with auditory and visual processor, vocal and manual processor, and a cognitive processor with short-term memory which is based on production rules. This means that the computer party, with input and output devices, can also be effectively simulated with a good deal of detail. This way, reactive tasks (where information that influences the task structure only arrives later on in the task) can also be modelled. Results show that performance times can be predicted reasonably accurately.
The Procedural Knowledge Structure Model (PKSM) is described in [Benysh and Koubek, 1993]. PKSM models procedural knowledge as a flowchart of which some nodes (called the task goals) can be subdivided further until one obtains the full-detail flowchart with only task actions and decisions. The flowchart allows multiple ways to do the same task.
There are also automatic methods which are specifically used on NL dialogue. In [Baber and Hone, 1993] and [Hone and Baber, 1995], task flow modelling is used to determine dialogue duration. The emphasis lies on comparing different repair strategies in relatively simple NL dialogues. In their task flow model, statistical word duration, word misrecognition probability and utterance correction probability are used as parameters for a flowchart model of the dialogue, which can then be used to predict task duration. One of their findings was that the best choice of strategy depends on error rate.
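To see why the best repair strategy can depend on the error rate, consider the following toy model. It is not the task flow model of the cited articles, only a sketch under loudly stated assumptions: word errors are independent, an utterance fails if any word is misrecognised, the strategy without confirmation pays a fixed repair cost once when an error occurs, and the strategy with confirmation repeats utterance plus confirmation until the utterance is recognised. All function and parameter names are hypothetical.

```python
def utterance_error(p_word, n_words):
    """Probability that at least one of n independent words is misrecognised."""
    return 1.0 - (1.0 - p_word) ** n_words


def no_confirmation(n_words, word_dur, p_word, repair_dur):
    """Expected duration when errors are repaired afterwards at a fixed cost."""
    p = utterance_error(p_word, n_words)
    return n_words * word_dur + p * repair_dur


def with_confirmation(n_words, word_dur, p_word, confirm_dur):
    """Expected duration when every attempt is confirmed and failed attempts
    are repeated; the expected number of attempts is 1 / (1 - p)."""
    p = utterance_error(p_word, n_words)
    return (n_words * word_dur + confirm_dur) / (1.0 - p)


# Illustrative parameters: 4 words of 0.5 s, 1.5 s confirmation, 10 s repair.
low = (no_confirmation(4, 0.5, 0.0125, 10.0), with_confirmation(4, 0.5, 0.0125, 1.5))
high = (no_confirmation(4, 0.5, 0.12, 10.0), with_confirmation(4, 0.5, 0.12, 1.5))
```

With these illustrative parameters, skipping confirmation is faster at the low word error rate, while confirming every utterance is faster at the high one.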
In [Eckert et al., 1998], a dialogue system is evaluated using a statistical user model, obtained from statistical corpus analysis. The model assumes there are only a limited number of different kinds of utterances, and that the user's utterance depends on the last few utterances only.
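The two assumptions of such a user model (a finite set of utterance types, and dependence on only the last few utterances) amount to an n-gram model over utterance labels. The following sketch shows the idea; it is not the implementation of Eckert et al., and the label names and the function names `train` and `sample_next` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train(corpus, order=1):
    """Estimate P(next utterance type | last `order` utterance types).

    `corpus` is a list of dialogues, each a list of utterance-type labels."""
    counts = defaultdict(Counter)
    for dialogue in corpus:
        for i in range(order, len(dialogue)):
            history = tuple(dialogue[i - order:i])
            counts[history][dialogue[i]] += 1
    # Turn raw counts into relative frequencies per history.
    model = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        model[history] = {label: n / total for label, n in counter.items()}
    return model


def sample_next(model, history, rng):
    """Draw the next utterance type given the recent history."""
    dist = model[tuple(history)]
    labels = list(dist)
    weights = [dist[label] for label in labels]
    return rng.choices(labels, weights=weights)[0]


# Hypothetical annotated corpus: system (S) and user (U) utterance types.
corpus = [
    ["S:ask", "U:answer", "S:confirm", "U:yes"],
    ["S:ask", "U:answer", "S:confirm", "U:no"],
    ["S:ask", "U:silence"],
]
model = train(corpus, order=1)
simulated_reply = sample_next(model, ["S:confirm"], random.Random(0))
```

A simulated user built this way can be run against the dialogue system many times, so that metrics can be collected without new human subjects; the price is that behaviour not present in the corpus is never generated.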
Evaluation by means of metrics is common, and warrants a separate section. Such evaluation tries to tell whether a system component or aspect is good or bad. Metrics are easy to use, but there is a risk that important information is not reflected in them. Metrics are often used to compare one system to another system or to another version of the same system (comparative analysis), or to check whether an existing system is acceptable. Such comparative analysis is useful for verifying a redesign or for deciding whether to commit to the usage of a new system. It may even be possible to compare multiple systems across multiple domains (benchmarking). See [Minker, 1998] and [Hirschman and Thompson, 1996] for an overview of metrical evaluation. The evaluation schemes described below are mostly for evaluating NL systems.
Note that there is often no way to verify the evaluation method itself. For some concrete problems that may occur, see [Walker, 1989]. Also, metrical evaluation does not explicitly say why one system is better or worse than another, though there are some ways in which it may help to obtain such information:
Several metrical systems have been proposed, each measuring different aspects of a system. The most often used metric is length, expressed as duration or as number of turns. Another is accuracy, for instance word recognition accuracy, meaning accuracy, or the number of turns spent on repair dialogues.
In [Minker, 1998], some benchmarking metrics are proposed. These are: word recognition error (substitutions, insertions, deletions), semantic all-label (compare all generated semantic word labels with manual transcriptions), semantic concept-value (compare only the labels that belong to values that fill the slots in the relevant query), and system response (first, see how answerable the user query was, then compare the system's information with a minimal and maximal reference answer). The article also addresses evaluation of translation systems.
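Word recognition error is commonly computed by aligning the recognised word sequence with a reference transcription and counting substitutions, insertions, and deletions. The sketch below uses the standard edit-distance alignment; it is a generic illustration rather than the exact procedure of [Minker, 1998], and the function name `word_errors` is an assumption.

```python
def word_errors(reference, hypothesis):
    """Align two whitespace-tokenised strings with minimal edit distance and
    return the error counts and the word error rate (errors / reference length)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # table[i][j] = (total, substitutions, insertions, deletions)
    # for aligning ref[:i] with hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        table[i][0] = (i, 0, 0, i)      # only deletions
    for j in range(len(hyp) + 1):
        table[0][j] = (j, 0, j, 0)      # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]   # match, no cost
                continue
            t, s, n, d = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, n, d)
            t, s, n, d = table[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            t, s, n, d = table[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            table[i][j] = min(substitution, insertion, deletion)
    total, subs, ins, dels = table[-1][-1]
    return {
        "substitutions": subs,
        "insertions": ins,
        "deletions": dels,
        "wer": total / len(ref),
    }
```

Note that the total number of errors is unique, but the split into substitutions, insertions, and deletions may depend on which of several minimal alignments is chosen.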
Another metrical evaluation framework is PARADISE (see [Walker et al., 1997b] and [Walker et al., 1997a]), which measures success rate and dialogue duration. PARADISE's aim is to specify metrics that calibrate for differences in tasks, so it can be used for comparison across different systems. A metrical system similar to PARADISE can be found in [Eckert et al., 1998].
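For reference, the performance function of PARADISE is usually presented in roughly the following form. The symbols are not taken from the text above and should be checked against the cited articles: kappa is a task success measure corrected for chance agreement, the c_i are cost measures such as dialogue duration or number of turns, and N is a normalisation to zero mean and unit standard deviation, which is what makes measures on different scales comparable.

```latex
\[
\mathrm{Performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i),
\qquad
\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x},
\qquad
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
\]
```

Here P(A) is the observed proportion of agreement with the reference task solution and P(E) is the agreement expected by chance; the weights alpha and w_i are intended to be estimated by regression against user satisfaction scores.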
In [Danieli and Gerbino, 1995], more metrics are described. These are: Implicit Recovery (IR), Contextual Appropriateness (CA), Turn Correction Ratio (TCR), and Transaction Success (TS).
TRAINS95 [Sikorski and Allen, 1996] and TRAINS96 [Stent and Allen, 1997] have a task (namely, find as short a route as possible) which allows multiple possible solutions but which is clear enough to allow a special metric to be used effectively, namely solution quality.
In [Albesano et al., 1997], evaluation of Dialogos is described. A special metric is used for transaction success. This amounts to classifying transactions into: Success (S), Success with Constraint Relaxation (SC), System Failure (SF), User Failure (UF).
In [Polifroni et al., 1998], metrical analysis of whole systems and system components is described, with the main criterion measured being accuracy. Accuracy was obtained by comparing the computer parse with a human transcription containing syntactic and semantic information. A concrete system with several components is analysed. The discourse-tracking module is tested by looking at the accuracy difference between the semantic frames with the predictor turned on or off. The response generator is tested by comparing the query hypothesis with the query hypothesis resulting from the transcription.
[Dillon et al., 1993] studies the effects of vocabulary size and user experience on the efficiency and acceptability of speech systems. Efficiency was measured using completion time, word non-recognitions, phrase misrecognitions, and items skipped by the users. Acceptability was measured by a survey with 15 bipolar questions. In short, both factors improve efficiency, and experience also improves acceptability.
After getting an idea of what literature is relevant for understanding human-computer systems, we may attempt to specify the framework in further detail.
When looking at today's everyday computer applications and transaction systems, it is apparent that there are many usability problems. In the literature discussed here, the existence, and the cause and nature of these problems are often implicit in the solutions offered. Sometimes it is questionable whether the solutions are actually solutions to real, existing problems. Hence, the next stage of this research necessarily includes working with a real system.
In real life, systems are often developed purely by intuition, rather than by making use of any of the available theories or methods. Even if such theories or methods are used, results are not always clear-cut. It appears that they are not easy to use, because [Nielsen, 1995]:
In a way, one of the usability problems is that the usability frameworks themselves need to be made more usable.
NL and multimodal systems have some special problems. In these systems, an attempt is made to make the interaction strategies, and not just some of the styles, analogous to real life, and in particular, to human-human interaction strategies. However, this implies that the computer has to do things it isn't very good at. To make the systems work, natural interaction has to be `forged' by only accounting for the interaction patterns that occur most frequently, and by careful engineering around the available techniques, like the various existing speech processing, integration of modalities, NL parsing, and dialogue tracking techniques. Most of these techniques have their own specific problems, and, when built into a system, often turn out to have unpredictable weaknesses. Some of these weaknesses are caused by inconsistency between the conceptual model and the actual system. A basic example is that people tend to overestimate the abilities of NL systems. Hence, a relatively large part of NL and multimodal systems development is about corpus collection and evaluation.
A first look at our own Schisma system makes it apparent that its domain coverage could be improved, even, for example, by the ability to supply answers to questions such as `Where am I?' or `Who are you?', since users could have arrived at Schisma from any location on the Web. It could also be improved at the dialogue level, for example by providing a better invalidation mechanism for repair utterances. Perhaps, the story is not as easy as it sounds, and these particular problems are a result of underlying design decisions that imply that the other feasible alternatives are worse. Perhaps more problems, or solutions, could be found after specifying the system in a more abstract way, or making a first analysis of the system's possible usage and real-life setting.
The current development surrounding Schisma includes adding Schisma as an agent (the `Karin' agent) into the Virtual Music Centre (VMC) virtual-reality environment, giving Karin a facial expression, adding other agents to make the system easier and/or quicker to use, such as the talking notice board and the navigation-assisting agent, and adding new possibilities to the lower-level interface, like multiple dynamically-generated and tiled windows. This implies a lot of extra complexity, and may provide a test bed for examining many of the aspects of human-computer system development we have encountered in this review of literature.
In the literature we have reviewed, we have found several major classes of strategies for attacking the usability problem:
In this second attempt at the framework, we try to make room for all of these strategies. Perhaps we will fill in only one or two of them, but further research may result in a more complete development framework. The framework may be part of a methodology, even if just prescribing documentation of the system using certain specifications at first. Some care should be taken to make the framework developer-friendly.
It seems feasible to use the current VMC system as a test bed to examine whether such a framework can be a help in its development. The current research is centred around a formal specification notation, based on `agent' models. It seems feasible to use such a compositional `agent' model for this system. Such a model could be used to specify interaction between the different elements of the interface, both at the `deeper' level (the agents walking around inside the 3D environment) and the connection between the `deeper' and `surface' level (how are the windows that are opened and can be used related to the agents; for example, which window belongs to which agent?). Windows may be modelled as a special kind of agent.
Such a notation, if properly designed, may aid in all five strategies listed:
It may also provide help in the issues specific to NL and multimodal interaction. The context and all available communication channels for each agent should be made explicit. Existing agent communication frameworks could be adopted, giving a basis for reasoning, communication, and dialogue. Existing agent or parallel programming languages could be adopted, allowing parts of the specification to be immediately executable.
Our current research includes combining process algebra (CSP) and predicate logic (Z) to obtain a system model at the `deeper' level. The process of going from an `intuitive' model to a formal model is being examined. For example, our CSP specification starts with a data flow diagram. After CSP specification, we will try to add more details by means of Z. Our first experience with modelling the VMC's agent platform was that several aspects of the documentation need further clarification. HCI aspects and implementation aspects may be unified by viewing the model as a conceptual model, so that it can be seen whether it is consistent. For example, we found that discussions emerged about the naturalness of having the blackboard communicate with Karin directly, about whether agents that were in different parts of the world should know of each other, and about how a dialogue initiation should take place, based on context cues such as proximity and direction of vision. Formal verification of consistency may be possible by designing a generic abstract model, and fitting a concrete model of the current system into it.
This document was generated using the LaTeX2HTML translator Version 96.1 (Feb 5, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The translation was initiated by Boris van Schooten on Mon Feb 15 15:22:42 MET 1999