Dagstuhl seminar on Evaluating Embodied Conversational Agents
15-19 March, 2004
Organized by Z. Ruttkay (CWI, Amsterdam, NL), E. André (Univ. Augsburg, DE), K. Höök (IT-University Kista, SE), W. Lewis Johnson (USC Marina del Rey, US), C. Pelachaud (Univ. Paris, FR)
In the past years we have seen an increasing number of ECAs (Embodied Conversational Agents), as results (and sometimes also as a medium) of academic research as well as in commercial applications in fields like information provision (news readers, info kiosks), user interfaces for the deaf (lip-reading, signing avatars), education, commerce and entertainment. Due to the novelty of the field and the complexity of the problems, different paradigms and tools have been used to produce the individual ECAs, often tailored to the special needs of an application or of a research issue.
The comparison and re-use of ECAs has not been dealt with to a sufficient degree, in spite of the high costs of developing yet more new ECAs. There have been studies that indicate the added value of particular features of ECAs, but it remains to be determined in what situations fully functional ECAs are advantageous, and for what purpose. The sporadic evaluation experiments concentrated on special issues and were conducted under restricted circumstances. All the same, several of these partial evaluations have already produced results that are surprising and essential for the future development and application of ECAs, e.g. on the different preferences of different users.
For the evaluation and design guidelines of ECAs one should rely on results and methodologies of related fields like HCI, psychology, cultural anthropology and even the arts. The different languages and methodologies of these disciplines and the lack of common forums have hindered cross-fertilization. Feedback from industry on commercial applications would also be very useful.
Motivated by the above observations, the main objectives of the seminar were:
· to review the results, tools and resources;
· to identify key problems and future research directions;
· to initiate further activities to trigger and disseminate work on ECA design and evaluation.
There were 35 participants (and a waiting list of more potential ones), with different professional backgrounds such as ECA research, psychology, AI and linguistics, coming from research (and a few industrial) centers in
The list of participants:
André, Elisabeth; Barker, Tim; Bernsen, Niels Ole; Beskow, Jonas; Biswas, Gautam; Cassell, Justine; Cavazza, Marc; Egges, Arjan; Eliëns, Anton; Gerhard, Michael; Gratch, Jonathan; Heylen, Dirk; Isbister, Katherine; Ishizuka, Mitsuru; Johnson, Lewis; Krahmer, Emiel; Marriott, Andrew; Marsella, Stacy; Martin, Jean-Claude; Massaro, Dominic; Nijholt, Anton; Noor, Christoph; Noot, Han; Olivier, Patrick; Otero, Nuno; Paiva, Ana; Pandzic, Igor; Pelachaud, Catherine; Pele, Danielle; Prendinger, Helmut; Rehm, Matthias; Rist, Thomas; Ruttkay, Zsofia; Ten Hagen, Paul.
See the group photo.
The seminar was primarily a work and discussion forum. Only a few invited talks were given, in order to outline the diversity of issues and provide analysis of case studies. The titles are listed here; the slides and papers can be found at http://www.dagstuhl.de/04121/Talks/
1. Zsófia Ruttkay: Evaluating ECAs – why, what and how?
2. Justine Cassell: Experience with ECA design, implementation and evaluation
3. Emiel Krahmer: Combining analysis-by-observation and analysis-by-synthesis for micro-evaluation of ECAs
4. Elisabeth André and Thomas Rist: Lessons learned from evaluating animated presentation agents
5. Katherine Isbister (with on-the-spot experiments with games): Why are game characters better than PaperClip?
6. Lewis Johnson: Embodied Conversational Agents - How Effective Are They?
7. Ana Paiva, Nuno Otero: Synthetic Characters: from design to evaluation ... and back
In a 1.5-day introductory session, alongside the plenary talks, each participant got the opportunity to introduce his/her background and interests (see presentations by participants at: http://www.dagstuhl.de/04121/).
Then 5 working groups were formed. Each working group set out to explore one major aspect of ECA evaluation. Progress and emerging issues were reported at a plenary session on Thursday morning. At the closing plenary session each group gave a presentation of their results (see below).
Participants appreciated the active work setting, different from the competitive and time-pressed atmosphere at most traditional conferences. They were eager to share tools, resources and experience, to get into fruitful discussion, and to give open, often self-critical reports on their own ECAs and applications. Though work was intense, and sometimes continued in the evenings too, there were opportunities offered (and created) to have fun: we walked, cycled, prepared chocolate balls, tasted wine… all of which is also documented in the photos.
Below you find a short account of the work done in the separate working groups, and references to further materials and resources.
1. ECAs and human-human interaction
WG participants: Elisabeth André (moderator), Jean-Claude Martin, Nuno Otero, Matthias Rehm, Zsofia Ruttkay (partially)
Slides of final presentation: WG1
Summary:
A number of approaches to modeling ECA behaviors
are based on a direct simulation of human behaviors. Consequently, it comes as
no surprise that the use of data-driven approaches which allow us to validate
design choices empirically has become increasingly popular in the ECA field.
The trend already started in 1999 with a Dagstuhl Seminar on Multimodality
where a working group on multimodal corpora was established. Since then, a
number of useful tools have been developed which facilitate the annotation and
analysis of multimodal corpora and boosted progress in our field. Building upon
this experience, the objective of the present working group was to investigate
the potential benefits of corpora for the creation and evaluation of ECAs.
Given that excellent material from a previous workshop on multimodal corpora was already available, our working group did not need to spend much time and effort on a survey of the current state of the art, but could concentrate instead on updating a document on multimodal corpora. The document was provided by Jean-Claude Martin, and we decided to make it publicly available.
In addition, we designed a questionnaire for the Dagstuhl attendees in order to get a clearer picture of current trends in multimodal corpora from a representative group of ECA researchers. In the following, we summarize some of the results:
· Employed information resources
As it turned out, ECA researchers rely on a large variety of resources to inform the design of their ECAs, including recordings of users in “natural” or staged situations, TV shows, Wizard of Oz studies, movies, games and motion capture data.
· Use of corpora
Only around 60% of the Dagstuhl attendees make use of a corpus, most of them relying on an annotation tool. The evaluation of the questionnaires also revealed that a surprisingly high number of different annotation tools and home-made annotation schemes are currently being used.
· Major problems with corpora
A major problem we identified when analyzing the questionnaires was the limited scope for re-using corpora, which in most cases are collected for a specific purpose. Furthermore, the creation of a model or the extraction of ECA behaviors from a corpus is still an open research question. There is also the danger that human users expect different behavior from an ECA than from a human conversational partner, which might limit the potential benefits of a simulation-based approach.
· Criteria for the evaluation of ECAs
All participants indicated that they performed a variety of experiments to analyze the user’s objective and/or subjective response to the ECA. About 50% came up with a catalogue of evaluation criteria. Only 25% based their evaluation on a comparison of the ECA’s behavior with that of human conversational partners.
Our working group agreed that the use of a corpus provides a promising approach to the modeling of ECA behaviors, since it allows us to ground ECA behaviors in empirical data. In addition, we identified some interesting new options. For instance, we discussed extracting ECA behaviors from cartoons in order to capture implicit knowledge from professional designers. Furthermore, a corpus might help to compare the behavior of different ECAs by serving as a kind of reference system. In turn, an ECA might be employed to validate corpus annotations, e.g. by visualizing certain extracted behaviors. A comparison of the visualized and the original behaviors might then serve as a measure of the accuracy and completeness of the corpus annotations.
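The comparison of visualized and original behaviors could be quantified in many ways; the working group did not prescribe one. As a minimal sketch, assume that both the original recording and the ECA replay have been coded as sets of (time slot, behavior label) pairs. Precision then stands in for accuracy and recall for completeness. The function name, the labels and the data below are hypothetical and serve only as illustration.

```python
def annotation_agreement(original, recovered):
    """Compare two codings given as iterables of (time_slot, label) pairs.

    original  -- behaviors coded from the human recording (assumed reference)
    recovered -- behaviors observed when the ECA replays the annotations
    Returns (precision, recall): the share of replayed behaviors that match
    the original, and the share of original behaviors that were reproduced.
    """
    original, recovered = set(original), set(recovered)
    matched = original & recovered
    precision = len(matched) / len(recovered) if recovered else 0.0
    recall = len(matched) / len(original) if original else 0.0
    return precision, recall


# Hypothetical example data, not taken from any real corpus.
original = {(0, "gaze_at_user"), (1, "nod"), (2, "raise_eyebrows"), (3, "smile")}
recovered = {(0, "gaze_at_user"), (1, "nod"), (2, "frown")}

precision, recall = annotation_agreement(original, recovered)
print(f"accuracy (precision): {precision:.2f}, completeness (recall): {recall:.2f}")
```

Exact matching on time slots is a simplifying assumption; real multimodal annotations would usually need a tolerance for temporal overlap.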
As a common future project for ECA researchers interested in corpora, we discussed the realization of culture-specific ECAs based on the recording and analysis of standardized staged or natural situations (e.g. asking for directions in different countries).
2. ECA design parameters and aspects
WG participants: Jonathan Gratch (moderator), Arjan Egges, Anton Eliëns, Katherine Isbister, Stacy Marsella, Ana Paiva, Thomas Rist, Paul ten Hagen
Slides of final presentation: WG2
Summary:
How does one go about designing a human? With the rise in recent years of virtual humans this is no longer purely a philosophical question. Virtual humans are intelligent agents with a body, often a human-like graphical body, that interact verbally and non-verbally with human users on a variety of tasks and applications. Our working group approached this question from the perspective of interactivity. Specifically, how can one design effective interactive experiences involving a virtual human, and what constraints does this goal place on the form and function of an embodied conversational agent?
Our group grappled with several related questions: What ideals should designers aspire to, what
sources of theory and data will best lead to this goal and what methodologies
can inform and validate the design process? A longer article (.pdf) summarizes the output of this WG and suggests
a specific framework, borrowed from interactive media design, as a vehicle for
advancing the state of interactive experiences with virtual humans.
3. Micro-level evaluation of ECAs (single modalities and aspects)
WG participants: Emiel Krahmer (moderator), Jonas Beskow, Justine Cassell, Dirk Heylen, Andrew Marriott, Dominic Massaro, Han Noot, Patrick Olivier, Danielle Pele
Slides of final presentation: WG3
Summary:
Embodied conversational agents (ECAs) implement all kinds of (para)linguistic and behavioral models. Typically, these are included for a particular purpose, for instance, to manage the interaction between a human user and the ECA (e.g., using gaze) or to suggest a particular emotion (e.g., by lowering the eyebrows). Micro-evaluation is a method to test whether these behaviors, as implemented in the ECA, are understood by users in the intended way. Arguably, this evaluation comes first, since the added value of ECAs in applications might not be provable before one is sure that the underlying models (and their implementation) are correct.
In this working group, we discussed micro-evaluation of:
(1) audio-visual speech,
(2) non-verbal behavior,
(3) natural language content,
(4) dialogue control and interaction and
(5) personality and emotion.
The micro-evaluation paradigm in these respective subfields appears to follow a general pattern. As a starting point, an ECA developer can look for models in the literature. As it turns out, many relevant models have been developed and published (in phonetics, conversational analysis, cognitive science, social psychology etc.). These models are often taken as a starting point and implemented in the ECA. Via judgment (perception) studies, it can be checked whether the ECA implementation is faithful to the model. However, these models are often incomplete from an ECA perspective, for instance because they often lack information about the timing and execution of specific behavioral aspects. If that happens to be the case, additional effort has to be undertaken to fill in the missing pieces in a particular model. This can be done, for instance, using elicitation (production) studies with human speakers. The collected data can either be used to fine-tune the model or as part of a data-driven approach to ECAs.
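As one possible illustration of the judgment (perception) studies mentioned above, consider a forced-choice test in which each subject sees an ECA expression and picks the intended emotion from a fixed list of options. The sketch below checks whether the recognition rate exceeds chance with an exact one-sided binomial test. The study design, the numbers and the function name are assumptions made for this example, not a procedure reported by the working group.

```python
from math import comb


def binomial_tail(successes, trials, chance):
    """Return P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical study: 20 subjects, 4 answer options (chance level 0.25),
# 12 subjects picked the emotion the ECA was meant to display.
p_value = binomial_tail(successes=12, trials=20, chance=0.25)
recognized_above_chance = p_value < 0.05

print(f"p = {p_value:.4f}, above chance: {recognized_above_chance}")
```

A result above chance only shows that the behavior is recognizable in isolation; it says nothing yet about how the behavior works in an actual interaction.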
While discussing the five topics listed above, it was found that the models get less detailed and more controversial when we move to higher-level issues. Thus, while audio-visual speech and non-verbal behavior are relatively well-understood, the models for personality and emotion are less easy to apply to ECAs. In these fields, more 'foundational' work is needed. Another general observation that came up at various times is that setting up good perception studies is no easy matter. It does not seem to be a good strategy to ask subjects directly whether they feel the ECA under evaluation has property X (e.g. is trustworthy), since the results of such experiments tend to be somewhat unreliable. Rather the perception test has to be set up in such a way that subjects have to make functional use of the ECA and thereby may show indirectly whether the agent has property X. This is what makes micro-evaluation difficult, but also what makes it so much fun to do.
4. System-level evaluation of ECAs (ECA in application, usage context)
WG participants: Lewis Johnson (moderator), Tim Barker, Niels Ole Bernsen, Gautam Biswas, Marc Cavazza, Noor Christoph, Michael Gerhard, Anton Nijholt, Helmut Prendinger
Slides of final presentation: WG4
Summary:
ECAs are embedded in applications that interact with users, in some task and environmental context. In order to develop a clear picture of ECA performance, it is necessary to consider the user(s) and the application as a system, and to evaluate the overall performance of this system. One must consider what environmental factors might have an impact on the performance of this system. This can be helpful in interpreting the role of micro-level ECA evaluation in determining system-level performance; one must make sure that micro-level evaluations are performed in a context that is comparable to the context of the user-application system.
System-level evaluation ideally involves three things: 1) evaluation of user-ECA performance, e.g., how fluent and efficient it is, 2) evaluation of the user experience, from a subjective standpoint, and 3) evaluation of the effectiveness of the ECA-enabled application in achieving its goals, e.g., learning outcomes or entertainment engagement. Each of these contributes to our understanding of the role of the ECA in the system.
Relevant environmental factors may include target user group, physical
environment of use, integration into a larger work activity, stance of the user
toward the environment (observer or participant), etc. The role of the
ECA in the application may be central or ancillary, and may change over time;
the ECA may play the role of personal assistant, companion, antagonist,
advisor, tutor, etc.
ECA development ideally involves a spiral approach, and evaluation should be
incorporated into that spiral as well. This means for example that
evaluations need to be simple enough and easy enough to perform to inform the
ongoing design of the ECA, but at the same time provide information that is
predictive of what the ultimate system-level performance will be.
Evaluation methods can involve a combination of extended observations, interviews, and questionnaires. Logs of interaction data, physiological data, and videotapes can all be helpful.
The system-level view can offer a different perspective on questions that are
commonly asked regarding ECAs. For example, believability is a concern
for ECAs, and believability is often assessed by having subjects observe ECAs
and give their judgments. But passive observers may get a different
impression of ECAs than users who are engaged in interacting with the
ECA. It would be better to engage users in an interaction with the ECA,
and then assess both the user's subjective impression of believability and
system performance characteristics that correlate with believability.
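To make the last point concrete, the following sketch correlates per-user believability ratings with one logged performance measure using the Pearson coefficient. The rating scale, the choice of measure ("turns completed") and all data values are hypothetical; the working group did not specify which performance characteristics to use or how to analyze them.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical data for six users who interacted with the ECA:
believability = [2, 3, 3, 4, 5, 5]       # subjective rating on a 1-5 scale
turns_completed = [4, 6, 5, 8, 9, 10]    # logged interaction measure

r = pearson(believability, turns_completed)
print(f"correlation between believability and performance: r = {r:.2f}")
```

With so few users the coefficient is only indicative, and a correlation by itself does not show which of the two quantities drives the other.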
5. Sharing work and results
WG participants: Catherine Pelachaud (moderator), Mitsuru Ishizuka, Igor Pandzic, Zsófia Ruttkay
Slides of final presentation: WG5
Summary:
One of the major goals of this group was to assemble an arsenal of models, tools and resources to facilitate ECA development and evaluation. We set up a site for ECA-tools, which hopefully will grow further. Please contribute; contact Igor Pandzic.
Another way of comparing and sharing ECAs is a kind of show or festival where ECAs in different application categories (like presenter, tutor) are compared, possibly on benchmark presentation tasks. It is not easy to come up with a framework in which different aspects of ECAs can be fairly compared. Note that we are not aiming at a kind of ‘Miss/Mr ECA’ competition, but at an ECA-challenge event where inventive designs, quality of communication capabilities and novelty of applications can be reviewed. A handful of enthusiastic experts are busy getting the ball rolling. More help and suggestions are welcome; contact zsofi@cs.utwente.nl.
Initially we also intended to produce a uniform terminology and definitions for design and evaluation aspects. The idea was partially addressed in working groups 2-4, but there was not enough time for a joint and careful investigation. This is an item left for future work.
Further materials about the seminar
Audio recordings of sessions: Tuesday Plenary (217Mb .mp3), Friday Plenary (144Mb .mp3)
Group photo: http://www.dagstuhl.de/04121/Photos/
All the photos (on talks, fun events, …) by Andrew Marriott: http://www.vhml.org/workshops/dagstuhl/photos/marriott/
Photos by Anton Nijholt from Wednesday afternoon and evening: http://wwwhome.cs.utwente.nl/~anijholt/dagstuhlphotos.html
Here you see at a glance links to resources useful for ECA R&D. We intend to list all relevant resources, also compiled by others.
Future plans
At the closing plenary session it was agreed that we should make the results available to the community. The idea is to produce a tutorial on designing and evaluating ECAs, richly illustrated with demos and possibly in electronic form, which could also serve as a textbook for courses.
References to the seminar in publications
1. At the AAMAS04 Workshop on Balanced Perception and Action in ECAs the outcome of WG2 will be presented as an invited talk.
2. Reference from the links page of the NICE project at http://www.niceproject.com/
SEND YOUR REFERENCES!
EVENTS
COMING
Virtual Humans Workshop: Design Criteria, Techniques and Case Studies for Creating and Evaluating Interactive Experiences, Marine
AAMAS05
IVA05
PAST
SIGGRAPH/EG Symposium on Computer Animation, 27-29 August 2004,
AAMAS04, 19-23 July,
AAMAS04 WS on Balanced Perception and Action in ECAs, 19-20 July,
AAMAS04 WS on Empathic Agents, 19-20 July,
CASA, 7-9 July,
AISB 2004 Convention: Motion, Emotion and Cognition, 29 March -
Multimodal Corpora, 25 May 2004
Workshop on Affective Dialogue Systems, 14-16 June, 2004
This site is maintained by Zsofia Ruttkay, mail: zsofi@cs.utwente.nl. Last update: 1.10.2004