We argue that fusion and other kinds of multimodal reference resolution have a common denominator, which justifies the use of a common term. We will use the term multimodal reference resolution here.
Since utterances consist of more than just language, we need a way to model the non-linguistic parts of an utterance. Here, we follow the SmartKom framework, which models all utterance material as a set of objects called modality objects. This "object-based" assumption seems justified by the fact that references can be modelled naturally as referring to such objects. It is in fact just an extension of linguistic reference models, which typically refer to words (which are naturally a kind of object).
We distinguish the following modalities: speech/linguistic objects (user and system), pointing/gesture objects (user), and visual objects (system).
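To make the object-based view concrete, the following is a minimal sketch of how modality objects and references between them could be represented. It is not SmartKom's actual data model: the names `Modality`, `Producer`, `ModalityObject` and `Reference`, and all field names and example values, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    SPEECH = "speech"    # speech/linguistic objects (user or system)
    GESTURE = "gesture"  # pointing/gesture objects (user)
    VISUAL = "visual"    # visual objects (system)


class Producer(Enum):
    USER = "user"
    SYSTEM = "system"


@dataclass(frozen=True)
class ModalityObject:
    """One piece of utterance material (hypothetical representation)."""
    obj_id: str
    modality: Modality
    producer: Producer
    content: str
    start: float  # start time in seconds
    end: float    # end time in seconds


@dataclass(frozen=True)
class Reference:
    """A reference from one modality object to another."""
    source: ModalityObject
    target: ModalityObject


# A visual object shown by the system, and two user objects that refer to it.
icon = ModalityObject("v1", Modality.VISUAL, Producer.SYSTEM, "cinema icon", 0.0, 10.0)
word = ModalityObject("s1", Modality.SPEECH, Producer.USER, "this", 2.1, 2.4)
point = ModalityObject("g1", Modality.GESTURE, Producer.USER, "pointing", 2.0, 2.5)

references = [Reference(word, icon), Reference(point, icon)]
```

In this representation a linguistic reference and a pointing reference have the same shape, which is the property the object-based assumption relies on.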
When users have multiple modalities at their disposal, they may combine them in different ways to suit their communication needs. They may use multiple modalities complementarily, or they may use a specific modality for a specific task. The same holds for a system trying to format its output optimally.
The use of multiple modalities in fusion can be extended towards multimodal reference resolution. Referring to something (linguistically or otherwise) is a way to save resources, and it is also part of fusion: both are fundamentally about references. When I point at something, I obviously make a reference. When I simultaneously say "this", I make another, linguistic, reference.
Relevant here are the framework by J.C. Martin and the CARE properties (Complementarity, Assignment, Redundancy, Equivalence) by Coutaz.
One approach to fusion is to integrate interpretation of information from the different modalities while the information comes in (early fusion). Instead of early fusion, we may interpret each modality separately, and then combine the information at the end of the dialogue turn (late fusion).
The potential advantage of early fusion is that the interpretation of one modality can draw on early information from the others, for example to improve speech recognition. With late fusion, potentially useful information may be thrown away before the fusion starts.
We may compensate for this loss of information in late fusion by retaining as much of the interpretation information as possible in the interpreted result (e.g. a word graph with probabilities instead of a single "most probable" sentence), so that it can be used by the fusion process.
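The effect of retaining alternative hypotheses can be illustrated with a small sketch. It simplifies the word graph to an n-best list, assumes that speech and gesture scores are independent so that they can be multiplied, and uses a type match as the only compatibility check. The names `SpeechHyp`, `GestureHyp` and `late_fuse`, and all example values, are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class SpeechHyp(NamedTuple):
    text: str
    referent_type: str  # type of object the deictic phrase asks for
    prob: float


class GestureHyp(NamedTuple):
    target_id: str
    target_type: str    # type of the object pointed at
    prob: float


def late_fuse(
    speech: List[SpeechHyp], gestures: List[GestureHyp]
) -> Optional[Tuple[SpeechHyp, GestureHyp, float]]:
    """Return the best type-compatible (speech, gesture, score) triple, or None."""
    best = None
    for s in speech:
        for g in gestures:
            if s.referent_type != g.target_type:
                continue  # incompatible pairing, cannot be fused
            score = s.prob * g.prob  # assumption: independent scores
            if best is None or score > best[2]:
                best = (s, g, score)
    return best


speech_nbest = [
    SpeechHyp("show this hotel", "hotel", 0.5),
    SpeechHyp("show this cinema", "cinema", 0.4),
]
gesture_nbest = [
    GestureHyp("obj7", "cinema", 0.9),
    GestureHyp("obj3", "hotel", 0.05),
]

# Keeping only the single most probable result per modality leaves nothing to fuse.
one_best = late_fuse(speech_nbest[:1], gesture_nbest[:1])

# Keeping the alternatives lets the gesture promote the second speech hypothesis.
full = late_fuse(speech_nbest, gesture_nbest)
```

With only the top hypothesis from each recogniser, the "hotel" reading conflicts with the gesture and fusion fails. With the full lists, the lower-ranked "cinema" reading is selected.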
A second dimension in the approach to modal fusion is the degree to which fusion is integrated with other kinds of analysis. We may do fusion as a first step in an interpretation pipeline, with little extra information (we will call this isolated fusion), or we may integrate it fully into the other analysis processes (integrated fusion). It is also possible to postpone fusion to the last possible moment, that is, to do it after the other kinds of interpretation have been applied to the data. This is not the same as integrated fusion, because integration allows the different analysis processes to exchange the information they obtain interactively.
In isolated fusion, the fusion process takes as input the output of speech and gesture recognition, and outputs the fused information, which is typically passed to the next step in the interpretation pipeline. The fusion process may or may not use other information that is already available, in particular the dialogue history and other information from past dialogue. Some fusion approaches use no such information at all. If they take timing information into account (e.g. did a pointing action occur simultaneously with a particular referential expression?), they can still get quite far.
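A timing-only variant of isolated fusion might look like the following sketch. It pairs each deictic word with the pointing gesture that overlaps it most in time and uses no dialogue history. The word list `DEICTIC`, the `slack` tolerance, and the names `Word`, `Pointing` and `fuse_by_timing` are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float


class Pointing(NamedTuple):
    target_id: str
    start: float
    end: float


DEICTIC = {"this", "that", "here", "there"}  # assumed list of deictic words


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def fuse_by_timing(
    words: List[Word], gestures: List[Pointing], slack: float = 0.25
) -> List[Tuple[str, Optional[str]]]:
    """Pair each deictic word with the gesture that overlaps it most in time."""
    resolved = []
    for w in words:
        if w.text.lower() not in DEICTIC:
            continue
        best, best_overlap = None, 0.0
        for g in gestures:
            # Widen the word's interval, since pointing rarely aligns exactly.
            ov = overlap(w.start - slack, w.end + slack, g.start, g.end)
            if ov > best_overlap:
                best, best_overlap = g, ov
        resolved.append((w.text, best.target_id if best else None))
    return resolved


words = [
    Word("put", 0.0, 0.5),
    Word("that", 0.5, 0.75),
    Word("there", 1.5, 1.75),
]
gestures = [Pointing("obj1", 0.5, 1.0), Pointing("loc9", 1.5, 2.0)]

pairs = fuse_by_timing(words, gestures)
```

A deictic word with no overlapping gesture is returned with `None` as its target, which leaves room for a later stage to resolve it by other means.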
Some kinds of utterance require a more integrated approach. For example, a reference such as "that" may refer to something in the dialogue history rather than being a deictic reference, but an isolated fusion process may not know this. Such cases require fusion to be integrated with other kinds of reference resolution. Various cases also require pragmatic-level information to interpret an utterance and its references.
Leaving multiple interpretations open after the fusion is complete may compensate for a lack of integrated fusion.
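As a rough sketch of this idea, an isolated fusion step could return both a deictic and an anaphoric reading of "that", leaving the choice to a later stage that has access to the dialogue history. Everything here is an assumption made for illustration: the scoring (the anaphoric reading receives the probability mass the gesture does not claim), the choice of the most recent history item as antecedent, and the names `Interpretation`, `isolated_fusion` and `resolve_with_history`.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Interpretation:
    expression: str
    kind: str                   # "deictic" or "anaphoric"
    referent_id: Optional[str]  # None while still unresolved
    score: float


def isolated_fusion(
    expression: str, gesture_target: str, gesture_conf: float
) -> List[Interpretation]:
    """Keep both readings open instead of committing to the deictic one."""
    return [
        Interpretation(expression, "deictic", gesture_target, gesture_conf),
        Interpretation(expression, "anaphoric", None, 1.0 - gesture_conf),
    ]


def resolve_with_history(
    candidates: List[Interpretation], history: List[str]
) -> Optional[Interpretation]:
    """Later stage: fill in the anaphoric reading, then pick the best candidate."""
    filled = []
    for c in candidates:
        if c.kind == "anaphoric":
            if not history:
                continue  # nothing to refer back to, so drop this reading
            # Assumption: the most recently mentioned item is the antecedent.
            c = replace(c, referent_id=history[-1])
        filled.append(c)
    return max(filled, key=lambda c: c.score) if filled else None


history = ["film42"]

# A weak, possibly accidental gesture: the anaphoric reading wins.
weak = resolve_with_history(isolated_fusion("that", "obj2", 0.3), history)

# A clear pointing gesture: the deictic reading wins.
strong = resolve_with_history(isolated_fusion("that", "obj2", 0.9), history)
```

The fusion step itself never consults the dialogue history. It only avoids discarding the reading that a later, better-informed stage may prefer.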