CSE134A LECTURE NOTES

October 30, 2002
 
 

ANNOUNCEMENTS

The midterm is on Wednesday next week.  Everything covered through next Monday is fair game.  Use the tests from last year to practice.  Ask questions on Discus, but don't ask "what's the answer to Question 3?".  See the precise policy on what you can and cannot bring to the exam, but note that the exam is not designed to need any of these references specifically.

Today I'll give an intro to VoiceXML, which is used in the current project.
 
 

VOICE SERVERS AND TELLME

Some of the text and pictures below are taken from http://www.xml.com/lpt/a/2000/08/23/didier/index.html.

The human calls an 800 number, usually 800-555-TELL, and reaches the TellMe service.  The TellMe server loads a VoiceXML script from your web server, then executes the script.

Executing a VoiceXML script means following its commands, converting text to speech, and recognizing words and sentences spoken by the human user.

For most applications, the VoiceXML architecture is as in the illustration below.

[Figure: VoiceXML serving architecture]

Alternatively, the VoiceXML interpreter can be included in a device such as a wireless phone or a car radio that acts as a client of the web server.  In this case there is no need for a separate voice server.

TellMe has multiple servers, with load balancing, fault tolerance, and adaptive caching.  The servers are hosted at an Exodus facility, which provides Internet connectivity.  The servers are connected by a dedicated OC-48 network (about 2,500 Mbit/s) to AT&T telephone switches in three different locations.

The VoiceXML script can be stored in a text file with a .vxml extension, or it can be generated as the output of a PHP script.  Voice servers also compile grammars for voice recognition.
 
 

VOICEXML

XML is a human-readable notation for writing and exchanging structured information of all sorts.  We'll go into detail on XML later in the quarter.  For now, it's just a strange syntax, and VoiceXML happens to be written using XML.

VoiceXML is a language similar to HTML, but for telephone-based interaction.  Like HTML, VoiceXML can be called a rendering language, but it is also a programming language.  As with HTML, JavaScript is available inside VoiceXML, but complex code inside VoiceXML is discouraged.  The reasons are the same ones that led to the failure of client-side Java and C++ inside HTML.

A VoiceXML script can invoke another server-side script, either as a subroutine (e.g. to ask for, acquire, and verify a credit card number) or as the follow-on phase of interacting with the human user.
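
As a rough sketch of the subroutine case, the VoiceXML specification provides a <subdialog> element that calls another dialog and receives its results through <return>.  The dialog names, the variable names, and the pay.php URL below are made up for illustration, and type="digits" is the built-in digits grammar from the specification rather than a TellMe library grammar.  To keep the sketch in one file, the called dialog lives in the same document (src="#get_card"); src could equally be the URL of a server-side script that generates the called dialog.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

  <!-- Calling dialog: collects a card number via a subroutine, then submits it. -->
  <form id="checkout">
    <var name="cardnumber"/>
    <subdialog name="card" src="#get_card">
      <filled>
        <!-- Returned values appear as properties of the subdialog's variable. -->
        <assign name="cardnumber" expr="card.number"/>
        <submit next="http://www.web.com/pay.php" namelist="cardnumber"/>
      </filled>
    </subdialog>
  </form>

  <!-- Called dialog: asks for the number and hands it back to the caller. -->
  <form id="get_card">
    <field name="number" type="digits">
      <prompt>Please say or dial your credit card number.</prompt>
      <filled>
        <return namelist="number"/>
      </filled>
      <noinput>I did not hear anything.<reprompt/></noinput>
      <nomatch>I did not understand.<reprompt/></nomatch>
    </field>
  </form>

</vxml>
```

The follow-on case needs no special element: a <submit> or <goto> whose next attribute names a server-side script simply hands control to whatever VoiceXML that script returns.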

XML documents are built from nested "elements."  A VoiceXML document contains a single <vxml> element, which is the root element. The basic units of a VoiceXML document are dialogs, specified as <form> elements, and menus, identified by <menu> elements.  A dialog is a set of fields to be entered by the human, like an HTML form.  A menu lets the human make a choice from a list.
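
The structure just described might look as follows in a complete document.  This is only a sketch: the dialog names, the prompts, and the opening hours are invented, and type="digits" is the built-in digits grammar from the VoiceXML specification, used here in place of a TellMe library grammar.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

  <!-- A menu: the caller picks one of the listed choices. -->
  <menu id="main">
    <prompt>Welcome. Say one of: <enumerate/></prompt>
    <choice next="#login">log in</choice>
    <choice next="#hours">opening hours</choice>
    <noinput>Please say one of: <enumerate/></noinput>
  </menu>

  <!-- A dialog: a form with one field to be filled in by the caller. -->
  <form id="login">
    <field name="pin" type="digits">
      <prompt>Please enter your pin code.</prompt>
      <filled>
        <submit next="http://www.web.com/pin.php"/>
      </filled>
    </field>
  </form>

  <!-- A dialog with no fields: it only speaks, then ends the session. -->
  <form id="hours">
    <block>
      We are open from nine to five on weekdays.
      <exit/>
    </block>
  </form>

</vxml>
```

The first dialog in the document (here the menu) is the one the interpreter starts with; each <choice> then jumps to a form by its id.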

One way to design a VoiceXML document is to draw a diagram that shows the nesting of dialogs and menus, starting with a template like this:


 
 

VOICEXML EXAMPLE

This example uses a form to obtain a user ID and a password from the user.  Because we are using a phone, we'll ask the user for five digits for the user ID and four digits for the password.  Digits are easy to type on phone keys and easy to speak into a phone.  The listing below shows the form for the four-digit PIN.

The example is adapted from http://www.webreference.com/perl/tutorial/20/tutorial20.html.

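<?xml version="1.0"?>
<vxml version="1.0">
  <form id="login">
    <field name="pin">
      <!-- Four_digits names a pre-programmed grammar that accepts four digits. -->
      <grammar>
        <![CDATA[Four_digits]]>
      </grammar>
      <prompt>Please enter your 4 digit pin code.</prompt>
      <!-- Once the field is filled, send the collected value to the server. -->
      <filled>
        <submit next="http://www.web.com/pin.php"/>
      </filled>
      <!-- The caller stayed silent: say so and repeat the prompt. -->
      <noinput>No PIN entered.<reprompt/></noinput>
      <!-- First unrecognized input: ask again.  From the second on: give up. -->
      <nomatch count="1">Invalid pin code.<reprompt/></nomatch>
      <nomatch count="2">Too many attempts.<exit/></nomatch>
    </field>
  </form>
</vxml>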
 
 

ELEMENTS INSIDE DIALOGS

Each form contains one <field> element for each piece of information to be obtained.  Each field contains subelements that specify various aspects of the field.  The elements inside a field are not necessarily executed sequentially.

A <prompt> element specifies what the voice browser will say. For instance:
        <prompt>
               <audio>Please dial or say your five-digit user ID</audio>
        </prompt>

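The <audio> element, used here without a src attribute, instructs the VoiceXML interpreter to use the text-to-speech engine to speak the given text as output.  (With a src attribute, <audio> plays a recorded audio file instead, and the enclosed text serves as the fallback.)  After the prompt, the interpreter waits for an answer from the human at the other end of the phone line.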

A <grammar> element gives the name of a set of rules that specify what will be recognized by the voice server.   For instance:
        <grammar>
               <![CDATA[Four_digits]]>
        </grammar>

Most sets of rules, like Four_digits, are pre-programmed.

A <filled> element specifies what the interpreter does once the field has been filled in.  In this example, it gives feedback to the caller by saying what the speech recognition engine understood, and the <goto> element then jumps immediately to the next form:

        <filled>
             <audio>I heard you say {document.login.userID}</audio>
             <goto next="#password"/>
        </filled>

A different <filled> element could be used to send variables to the server for further processing, e.g.
       <filled>
            <submit next="http://www.web.com/pin.php"/>
       </filled>

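The <noinput> and <nomatch> elements handle errors: they catch the cases where the user does not dial or say anything, or says something that cannot be recognized.  Specifically, the <noinput> element says what to do if the user has not responded within a certain time, and the <nomatch> element says what to do if the input does not match the active grammar.  The optional count attribute selects a handler according to how many times the error has occurred, as in the example above.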
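
Putting the pieces of this section together, a two-step login could be sketched roughly as follows.  Several things here are assumptions rather than part of the original example: the field and variable names, the document-level userID variable (used so that the value is still available when the second form submits), the built-in type="digits" grammar in place of the TellMe library grammars, and the standard <value> element in place of the {document.login.userID} notation shown above.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

  <!-- Document-level variable: keeps the user ID after leaving the first form. -->
  <var name="userID"/>

  <form id="login">
    <field name="id" type="digits">
      <prompt>Please dial or say your five-digit user ID.</prompt>
      <filled>
        <assign name="userID" expr="id"/>
        <prompt>I heard you say <value expr="id"/>.</prompt>
        <goto next="#password"/>
      </filled>
      <!-- Handlers are chosen by count: the highest count not above the
           number of times the event has occurred wins. -->
      <noinput count="1">No user ID entered.<reprompt/></noinput>
      <noinput count="3">Goodbye.<exit/></noinput>
      <nomatch count="1">Invalid user ID.<reprompt/></nomatch>
      <nomatch count="3">Too many attempts.<exit/></nomatch>
    </field>
  </form>

  <form id="password">
    <field name="pin" type="digits">
      <prompt>Please enter your 4 digit pin code.</prompt>
      <filled>
        <!-- Send both values to the server-side script. -->
        <submit next="http://www.web.com/pin.php" namelist="userID pin"/>
      </filled>
      <noinput>No PIN entered.<reprompt/></noinput>
      <nomatch count="1">Invalid pin code.<reprompt/></nomatch>
      <nomatch count="2">Too many attempts.<exit/></nomatch>
    </field>
  </form>

</vxml>
```

Note that the built-in digits grammar does not by itself enforce a length of five or four; a length-specific grammar such as Four_digits does that job in the original example.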
 
 

SPEECH RECOGNITION

A grammar specifies what the alternatives are for what the user might be saying.  Before a grammar can be used, it must be compiled.  Reusing grammars that have already been compiled is critical for speed.

By including a <grammar> inside a field element, we limit the scope of the grammar rules to the field's context.  For instance, in our example, we use the Tellme default library grammar named "Five-digit" for the <field> element associated with the user ID, and we use the "Four-digit" library grammar for the password.
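
A small sketch of field-scoped recognition, using the <option> element from the VoiceXML specification instead of a named grammar: each <option> contributes one alternative to a grammar that is active only while its field is being filled.  Whether a given voice server supports <option> is an assumption, and the form, the field names, and the order.php URL are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">

    <!-- While this field is active, only the three sizes can be recognized. -->
    <field name="size">
      <prompt>What size would you like: small, medium, or large?</prompt>
      <option>small</option>
      <option>medium</option>
      <option>large</option>
      <nomatch>Sorry, I did not understand.<reprompt/></nomatch>
    </field>

    <!-- A different field with a different, equally small, set of alternatives:
         the built-in boolean grammar accepts only yes or no. -->
    <field name="confirm" type="boolean">
      <prompt>You asked for <value expr="size"/>. Is that right?</prompt>
      <filled>
        <if cond="confirm">
          <submit next="http://www.web.com/order.php" namelist="size"/>
        <else/>
          <!-- Forget both answers so the form asks again from the start. -->
          <clear namelist="size confirm"/>
        </if>
      </filled>
    </field>

  </form>
</vxml>
```

Keeping each field's set of alternatives this small is what makes the recognizer's job tractable.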

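Voice recognition is easier when any of these restrictions is true: the words are spoken in isolation rather than as continuous speech, the vocabulary is limited, or the speaker is already known to the system.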

If at least one of these restrictions is true, then voice recognition can be done with reasonable accuracy using a Pentium III class processor.  Understanding continuous speech with an unlimited vocabulary, from novel speakers, is beyond the current state of the art.
 
 

REFERENCES AND TUTORIALS

This shows how to write a simple form in VoiceXML using TellMe Studio:
    http://www.webreference.com/perl/tutorial/20/tutorial20.html
This shows how to build an interactive application that links to a backend Perl script:
    http://www.webreference.com/perl/tutorial/21/tutorial21.html

This PDF document explains the role of VoiceXML and the architecture of the TellMe service: "VoiceXML facts and fiction".

These documents discuss two small but interesting VoiceXML applications:
    http://studio.tellme.com/articles/TRAIN.html
    http://studio.tellme.com/articles/OnCalls.html
The latter describes an application from a small business, Oncalls.com.
 
 



Copyright (c) by Charles Elkan, 2002.