Today I'll give an intro to VoiceXML, which is used in the current project.
The human calls an 800 number, usually 800-555-TELL, and reaches the TellMe service. TellMe gets a VoiceXML script from your web server and executes it. The VoiceXML script can be stored in a text file with a .vxml extension, or it can be generated as the output of a PHP script.
The TellMe servers interpret VoiceXML scripts, convert text to speech, recognize speech, and compile grammars for voice recognition. All have load balancing, fault tolerance, and adaptive caching.
Between the phone and the web server sits a voice server. This voice server interprets the VoiceXML documents. The VoiceXML interpreter could instead be built into a device, such as a car radio that is wirelessly connected to the Web, in which case there is no need for a separate voice server. Most of the time, however, the VoiceXML architecture will be structured as in the illustration below.
VoiceXML serving architecture
Inside the VoiceXML interpreter resides a voice recognition and synthesis engine used to automate a conversation between a machine and a human being. The caller can be connected to it over either a wireline or a wireless network.
JavaScript is available inside VoiceXML, but complex code inside VoiceXML is discouraged, for the same reasons that client-side Java and C++ inside HTML failed. Instead, a VoiceXML script can invoke other server-side scripts.
TellMe servers are hosted by an Exodus facility, which provides Internet
connectivity. The servers are connected by a dedicated OC-48 network
(about 2500 Mbit/s) to AT&T telephone switches in three different locations.
VoiceXML is a language similar to HTML, but for telephone-based interaction. Like HTML, VoiceXML can be called either a rendering language or a programming language.
XML documents are built from nested "elements." A VoiceXML document
contains a single <vxml> element, which is the root element.
The basic units of a VoiceXML document are dialogs, specified as <form>
elements, and menus, identified by <menu> elements. A form is a set of
fields to be entered by the human, like an HTML form. A menu lets the
human make a choice from a list.
Basic structure of a VoiceXML document
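As a rough sketch of that structure, assuming the same VoiceXML 1.0 dialect used in the rest of these notes, a document with one menu and two forms might look like the following. The ids (main, login, bye), the choice wording, and the grammar name Five_digits are assumptions made for illustration, not names confirmed by the TellMe documentation.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A menu: the caller makes one choice from a list. -->
  <menu id="main">
    <prompt>Say login or goodbye.</prompt>
    <choice next="#login">login</choice>
    <choice next="#bye">goodbye</choice>
  </menu>

  <!-- A form: a set of fields to be entered by the caller. -->
  <form id="login">
    <field name="userID">
      <!-- Grammar name is assumed by analogy with Four_digits. -->
      <grammar><![CDATA[Five_digits]]></grammar>
      <prompt>Please dial or say your five-digit user ID.</prompt>
    </field>
  </form>

  <!-- A form with no fields, used only to say something and hang up. -->
  <form id="bye">
    <block>Goodbye.<exit/></block>
  </form>
</vxml>
```

The single <vxml> root wraps everything, and each <choice> points at a dialog in the same document through its next attribute.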
In our example, we'll be using a form to obtain a user ID and a password from the user. Because we are using a phone, we'll ask the user for five digits for the user ID and four digits for the password. Digits are easily typed on phone keys, or simply spoken into a phone.
Each <field> element inside a form should contain multiple elements that specify various aspects of the field. These elements are not necessarily executed sequentially.
A <prompt> element specifies what the voice browser will
say. For instance:
<prompt>
  <audio>Please dial or say your five-digit user ID</audio>
</prompt>
The <audio> element instructs the voice browser to use the text-to-speech engine to read the data content and say it on the phone. After the prompt, the voice browser waits for an answer from the interlocutor at the other end of the phone line.
A <grammar> element is a set of rules that specify what
will be recognized by the voice server. For instance:
<grammar>
  <![CDATA[Four_digits]]>
</grammar>
When the person says or dials the user ID, the <filled> element causes the browser to give feedback to the interlocutor by repeating what the speech recognition engine understood. It then jumps to the next form by following the <goto> element.
<filled>
  <audio>I heard you say {document.login.userID}</audio>
  <goto next="#password"/>
</filled>
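To show where that <goto> lands, here is a hedged sketch of the two forms together. The form id login, the field name userID, and the target #password come from the snippet above. The field name pin, the second prompt's wording, and the grammar name Five_digits are assumptions for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="login">
    <field name="userID">
      <!-- Assumed name for the five-digit library grammar. -->
      <grammar><![CDATA[Five_digits]]></grammar>
      <prompt>
        <audio>Please dial or say your five-digit user ID</audio>
      </prompt>
      <filled>
        <!-- Read back what was recognized, then move to the next form. -->
        <audio>I heard you say {document.login.userID}</audio>
        <goto next="#password"/>
      </filled>
    </field>
  </form>

  <!-- The id here must match the fragment used in goto next="#password". -->
  <form id="password">
    <field name="pin">
      <grammar><![CDATA[Four_digits]]></grammar>
      <prompt>
        <audio>Please dial or say your four-digit password</audio>
      </prompt>
      <filled>
        <audio>Thank you.</audio>
      </filled>
    </field>
  </form>
</vxml>
```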
A different <filled> element could be used to send variables
to the server for further processing, for example with a <submit> element that passes them to a server-side script.
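A minimal sketch of such a <filled> element follows. It is a fragment meant to sit inside a <field> named pin. The URL is only a placeholder, and the namelist and method attributes are the standard VoiceXML ones, shown here on the assumption that the TellMe platform supports them.

```xml
<filled>
  <!-- Send the value of the "pin" field to a server-side script. -->
  <!-- The script's output is expected to be the next VoiceXML document. -->
  <submit next="http://www.web.com/pin.php" namelist="pin" method="post"/>
</filled>
```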
The <noinput> element allows rudimentary error handling by catching the case where the user does not dial or say anything. Specifically, the <noinput> element says what to do when a request is still unanswered after a certain time.
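As a hedged illustration, the field below gives the caller five seconds to answer and reacts differently on repeated silences. The timeout attribute on <prompt> and the count attribute on <noinput> are part of standard VoiceXML, but whether the TellMe platform honors them exactly this way is an assumption, as are the messages and the grammar name Five_digits.

```xml
<field name="userID">
  <!-- Assumed name for the five-digit library grammar. -->
  <grammar><![CDATA[Five_digits]]></grammar>
  <!-- Wait up to five seconds for an answer after this prompt. -->
  <prompt timeout="5s">
    <audio>Please dial or say your five-digit user ID</audio>
  </prompt>
  <!-- First and second silence: say so and ask again. -->
  <noinput count="1">I did not hear anything.<reprompt/></noinput>
  <!-- Third silence: give up and end the call. -->
  <noinput count="3">Goodbye.<exit/></noinput>
</field>
```

The interpreter picks the handler with the highest count that does not exceed the number of times the event has occurred, so the first handler also covers the second silence.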
This example is adapted from http://www.webreference.com/perl/tutorial/20/tutorial20.html.
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="login">
    <field name="pin">
      <grammar>
        <![CDATA[Four_digits]]>
      </grammar>
      <prompt>Please enter your 4 digit pin code.</prompt>
      <filled>
        <submit next="http://www.web.com/pin.php"/>
      </filled>
      <noinput>No PIN entered.<reprompt/></noinput>
      <nomatch count="1">Invalid pin code.<reprompt/></nomatch>
      <nomatch count="2">Too many attempts.<exit/></nomatch>
    </field>
  </form>
</vxml>
By including a <grammar> inside a <field> element, we limit the scope of the grammar rules to that field's context. For instance, in our experiment, we use the TellMe default library grammar named "Five-digit" for the <field> element associated with the user ID, and the library grammar named "Four-digit" for the password.
Voice recognition is easier when any of these restrictions is true:
This PDF document explains the role of VoiceXML and the architecture
of the TellMe service:
http://www.tellme.com/business/downloads/VoiceXML_facts_and_fiction.pdf
These documents discuss two small but interesting VoiceXML applications:
http://studio.tellme.com/articles/TRAIN.html
http://studio.tellme.com/articles/OnCalls.html