Speech dialogue, multimodality and Wizard of Oz experiments

I am currently working on two projects related to dialogue.

For more info: David Portabella <david.portabella@epfl.ch>, <david@portabella.name>


Background

Spoken interfaces based upon VoiceXML prompt users with synthetic speech and understand simple words or phrases, using a defined dialog model specific to the application. As the technology improves we can look forward to richer natural language conversations. There is now an emerging interest in combining speech interaction with other modes of interaction. Multimodal interaction will enable the user to speak, write and type, as well as hear and see, using a more natural user interface than today's single-mode browsers.

More in the W3C Multimodal Interaction Activity.


Wizard Of Oz Experiment

The Wizard of Oz experiment is a method that helps developers verify their dialog models.

People using the system believe that they are interacting with a real system, while it is actually controlled by a human. A text-to-speech synthesizer is used. Data is saved and analyzed later in order to revise the dialogue model.

The person who acts as the Wizard of Oz can do the whole job, or only part of it. Doing the whole job is difficult and of little use, as the wizard does not have time to react quickly while emulating the entire system.

We propose a Wizard of Oz setup to test the dialogue model and the grammars used. The models have to be implemented in VoiceXML. An extremely simple grammar also has to be written, but for the model to be usable it only needs to indicate the semantic pairs to be used. The person controlling the system, the wizard, does not need to control the model, only the results of the grammars.

The experiment begins here. No speech recognizer is used. The VoiceXML interpreter runs the model and, when speech input is needed, informs the wizard of all the active grammars and the semantic pairs available. The wizard listens to the user's speech and selects the response, which is sent back to the VoiceXML interpreter. This is useful for validating the model, and a better grammar can be built from the information saved.
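
The first phase can be sketched roughly as follows. This is a minimal Python illustration, not code from the VoiceXML Woz package: the names ActiveGrammar, WizardSession and choose are hypothetical, and the wizard's screen is modelled as a plain callback that returns the index of the selected result.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; not the API of the VoiceXML Woz package.
SemanticResult = Dict[str, str]
Option = Tuple[str, SemanticResult]  # (grammar name, semantic result)


@dataclass
class ActiveGrammar:
    name: str
    # Semantic results the grammar can produce, e.g. {"command": "on"}.
    results: List[SemanticResult]


@dataclass
class WizardSession:
    # Stand-in for the wizard's screen: receives the selectable options and
    # returns the index of the chosen one, or None for "no match".
    choose: Callable[[List[Option]], Optional[int]]
    log: List[dict] = field(default_factory=list)

    def request_input(self, grammars: List[ActiveGrammar]) -> Optional[SemanticResult]:
        """Called by the interpreter whenever speech input is needed."""
        options = [(g.name, r) for g in grammars for r in g.results]
        index = self.choose(options)
        selected = None if index is None else options[index][1]
        # Every turn is recorded so that the grammars can be revised later.
        self.log.append({"options": options, "selected": selected})
        return selected


grammars = [ActiveGrammar("command", [{"command": "on"}, {"command": "off"}])]
session = WizardSession(choose=lambda options: 1)  # the wizard picks the second option
result = session.request_input(grammars)
print(result)
```

The log kept by the session plays the role of the saved data mentioned above: it records which options were offered at each turn and which one the wizard selected.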

In a second phase, the system uses the speech recognizer with the revised grammars and proposes the solution it finds to the wizard, who can accept or modify it. The models and the grammars can be revised again until a good solution is found.
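
A rough Python sketch of this second phase is given below. It is only an illustration under stated assumptions: confirm_hypothesis and correction_rate are hypothetical names, the wizard is modelled as a callback, and the correction rate is just one possible indicator of whether the grammars still need revising, not a measure taken from the actual system.

```python
from typing import Callable, Dict, List, Optional

SemanticResult = Dict[str, str]


def confirm_hypothesis(
    hypothesis: Optional[SemanticResult],
    review: Callable[[Optional[SemanticResult]], Optional[SemanticResult]],
    log: List[dict],
) -> Optional[SemanticResult]:
    """Show the recognizer's hypothesis to the wizard and return the final result.

    `review` stands in for the wizard: it returns the hypothesis unchanged to
    accept it, or a different result to correct it.
    """
    final = review(hypothesis)
    log.append({
        "hypothesis": hypothesis,
        "final": final,
        "corrected": final != hypothesis,
    })
    return final


def correction_rate(log: List[dict]) -> float:
    """Fraction of turns in which the wizard modified the hypothesis."""
    if not log:
        return 0.0
    return sum(entry["corrected"] for entry in log) / len(log)


log: List[dict] = []
confirm_hypothesis({"command": "on"}, lambda h: h, log)                   # accepted
confirm_hypothesis({"command": "on"}, lambda h: {"command": "off"}, log)  # corrected
print(correction_rate(log))
```

Under this assumption, a rate that stays high after a revision would suggest that the grammars or the model need another iteration.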

After a solution is found, the Wizard of Oz experiment is disabled and the system is ready for production use.

You can download the working implementation of the VoiceXML Woz: voicexmlwoz2003-04-25.zip
This is a work in progress; suggestions are welcome.

Here you can see a screenshot. The VoiceXML interpreter runs the model, and whenever speech input is needed, the Woz Server shows the active grammars and the wizard is asked to listen to the user and select the appropriate semantic result.

VoiceXML Woz screenshot

Input multimodality for VoiceXML

VoiceXML can be extended to use other input devices. For instance, in the SmartHome application, a pointer device could be used to choose a light and ask for it to be switched on. The device sends its input to the interpreter, which parses it with the help of a grammar to produce a semantic result.

On the other hand, passive devices can help with contextualization. If the user says "Switch on this light", and the following grammar is used:


public <main> = [<politeness>] <command> [<politeness>] <object> [<politeness>];

<command> = switch on {on} | switch off {off};

<object> =
entrance light {entrance}
| bedroom light {bedroom}
| dining room light {dining_room}
| this light {this};

<politeness> = please;

the interpreter needs to contextualize "this", and can ask a passive pointer device which light the user is pointing at.
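
To make the semantic result concrete, here is a small Python sketch that matches an utterance against a hand-written approximation of the grammar above. It is only an illustration: a real interpreter compiles the grammar itself, and the function names parse and needs_context, as well as the choice of rule names as keys of the result, are assumptions.

```python
import re
from typing import Dict, Optional

# Phrases and tags copied by hand from the grammar; a real interpreter
# would compile the grammar instead of using a regular expression.
COMMANDS = {"switch on": "on", "switch off": "off"}
OBJECTS = {
    "entrance light": "entrance",
    "bedroom light": "bedroom",
    "dining room light": "dining_room",
    "this light": "this",
}


def _alternatives(phrases: Dict[str, str]) -> str:
    return "|".join(re.escape(phrase) for phrase in phrases)


# [<politeness>] <command> [<politeness>] <object> [<politeness>]
PATTERN = re.compile(
    r"^(?:please )?(?P<command>{c}) (?:please )?(?P<object>{o})(?: please)?$".format(
        c=_alternatives(COMMANDS), o=_alternatives(OBJECTS)
    )
)


def parse(utterance: str) -> Optional[Dict[str, str]]:
    """Return the semantic pairs for an utterance, or None if it is not covered."""
    text = " ".join(utterance.lower().split())  # normalize case and spacing
    match = PATTERN.match(text)
    if match is None:
        return None
    return {
        "command": COMMANDS[match.group("command")],
        "object": OBJECTS[match.group("object")],
    }


def needs_context(result: Dict[str, str]) -> bool:
    """True when the object is the deictic "this" and must still be resolved."""
    return result["object"] == "this"


result = parse("Switch on this light")
print(result, needs_context(result))
```

The parse succeeds, but the object is still the symbolic value "this"; it is this value that a passive device has to resolve.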

We have designed an input component for the VoiceXML interpreter based on Web Services. It takes input from the speech recognizer, from the keyboard and from all other specified input devices. When input arrives from any of them, it is passed to the interpreter, which then simply matches it against the active grammars to produce a semantic result.

In this way, the input devices can provide information that is semantically equivalent to a specific phrase spoken or typed by the user.
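
The idea of a single entry point shared by all modalities can be sketched as follows. This is a simplified, in-process Python illustration: the actual component is exposed as a Web Service, and the names InputComponent, submit and light_grammar are hypothetical. A grammar is modelled here as a function from input text to a semantic result.

```python
from typing import Callable, Dict, List, Optional, Tuple

SemanticResult = Dict[str, str]
# Simplification: a grammar maps input text to a semantic result,
# or to None when the text is not covered.
Grammar = Callable[[str], Optional[SemanticResult]]


class InputComponent:
    """Funnels input from every modality through the same grammar matching."""

    def __init__(self, active_grammars: List[Grammar]) -> None:
        self.active_grammars = active_grammars
        self.history: List[Tuple[str, str]] = []

    def submit(self, source: str, text: str) -> Optional[SemanticResult]:
        """Entry point for all devices; `source` is kept only for logging."""
        self.history.append((source, text))
        for grammar in self.active_grammars:
            result = grammar(text)
            if result is not None:
                return result
        return None


def light_grammar(text: str) -> Optional[SemanticResult]:
    lights = {"entrance light": "entrance", "bedroom light": "bedroom"}
    key = text.strip().lower()
    return {"object": lights[key]} if key in lights else None


component = InputComponent([light_grammar])
spoken = component.submit("speech", "Bedroom light")
pointed = component.submit("pointer", "bedroom light")  # sent by a pointing device
print(spoken == pointed)
```

Because the pointer device submits the same phrase the user could have spoken, both inputs yield the same semantic result and the dialog model does not need to know which modality was used.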

Passive devices, in contrast, have to be called by the dialog model when it needs to contextualize. Again, a solution based on Web Services has been implemented: the added input devices, such as pointer devices, can each implement a Web Service.

The dialog model, using the VoiceXML object tag, calls the specified Web Service to ask for some information and immediately receives the response, which it uses to contextualize.
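
The request and response exchange with a passive device might look like the Python sketch below. It is an illustration only: plain JSON over HTTP on the local machine stands in for the Web Service protocol of the real system, and the handler PointerHandler, the path /pointed-light and the function query_pointed_light are all hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class PointerHandler(BaseHTTPRequestHandler):
    """Passive device: it only answers queries and never initiates input."""

    pointed_light = "bedroom"  # hypothetical current state of the pointer

    def do_GET(self):
        body = json.dumps({"light": self.pointed_light}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def query_pointed_light(url: str) -> str:
    """What the dialog model would do when it has to resolve "this"."""
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
    with opener.open(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))["light"]


server = HTTPServer(("127.0.0.1", 0), PointerHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    url = "http://127.0.0.1:%d/pointed-light" % server.server_port
    semantic = {"command": "on", "object": "this"}
    if semantic["object"] == "this":
        # Contextualize: replace the deictic value with the device's answer.
        semantic["object"] = query_pointed_light(url)
finally:
    server.shutdown()
    server.server_close()
print(semantic)
```

The call is synchronous, which matches the description above: the dialog model asks, receives the answer immediately, and continues with a fully resolved semantic result.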

Thus, a web browser can be an active device, calling the VoiceXML Web Service. At the same time, it can also be a passive device, implementing a Web Service (using an IndirectWebServiceProxy) that VoiceXML can call at any time to contextualize.

Figure 1: Framework for VoiceXML multimodality
An active device sends an utterance by calling the VoiceXML input Web Service. VoiceXML contextualizes by calling a passive device's Web Service. A web browser can be active and passive simultaneously using an IndirectWebServiceProxy.

Demos

Two demos concerning the Inspire and IM2.MDM projects are now available for download: dialogue2003-03-05.zip
Here you can see two screenshots:

A sample from the Inspire project. The user can control appliances in a room through voice dialogue.


A sample from the Multimodal Dialogue Management project. The user can get information about the people shown in the picture. The person the user wants to focus on can be specified by voice or by a mouse click in the marked area. Anaphoric phrases can be handled (e.g. "give me the request from him" refers to the person the user talked about earlier).


See also a screenshot of the second IM2.MDM demo:
screenshot of the second IM2.MDM demo


Download

Some packages are available for download under the terms of the GNU Lesser General Public License:

Acknowledgements

The author is grateful to Pavel Cenek, from the Laboratory of Speech and Dialogue at the Faculty of Informatics, MU Brno, for his support in using his VoiceXML 2.0 browser, now called OptimTalk.