CSE134A LECTURE NOTES

October 29, 2001
 
 

ANNOUNCEMENTS

The midterm is on Wednesday this week.  Everything covered through last week (not today) is fair game.  Use the tests from last year to practice.  Ask questions on Discus, but don't ask "what's the answer to Question 3?".  See the precise policy on what you can and cannot bring to the exam, but note that the exam is not designed to need any of these references specifically.

Today I'll give an intro to VoiceXML, which is used in the current project.
 
 

VOICE SERVERS AND TELLME

Some of the text and pictures below are taken from http://www.xml.com/lpt/a/2000/08/23/didier/index.html.

The human calls an 800 number, usually 800-555-TELL, and reaches the TellMe service.  TellMe gets a VoiceXML script from your web server and executes it.  The VoiceXML script can be stored in a text file with a .vxml extension, or it can be generated as the output of a PHP script.
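
To make the second option concrete, here is a minimal sketch of a server-side script that generates VoiceXML as its output.  It is written in Python rather than PHP, and the function name build_greeting_vxml, the form id, and the greeting text are all made up for illustration.  It assumes a CGI-style setup in which the script prints a content type header followed by the document.

```python
from xml.sax.saxutils import escape


def build_greeting_vxml(message: str) -> str:
    """Return a minimal VoiceXML document that speaks `message`."""
    # escape() protects against characters such as & and < in the message.
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="1.0">\n'
        '  <form id="greeting">\n'
        '    <block>\n'
        f'      <prompt>{escape(message)}</prompt>\n'
        '    </block>\n'
        '  </form>\n'
        '</vxml>\n'
    )


if __name__ == "__main__":
    # A CGI-style script sends a header, a blank line, then the document.
    print("Content-Type: text/xml")
    print()
    print(build_greeting_vxml("Welcome to CSE134A"))
```

The voice server sees no difference between this output and a static .vxml file; only the web server knows the document was generated.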

The TellMe servers interpret VoiceXML scripts, convert text to speech, recognize speech, and compile grammars for voice recognition.  All of them have load balancing, fault tolerance, and adaptive caching.

Between the phone and the web server sits a voice server. This voice server interprets the VoiceXML documents.  The VoiceXML interpreter could instead be included in a device such as a car radio that is wirelessly connected to the Web, in which case there is no need for a separate voice server.  However, most of the time, the VoiceXML architecture will be structured as in the illustration below.

[Figure: VoiceXML serving architecture]

Inside the VoiceXML interpreter resides a voice recognition and synthesis engine used to automate a conversation between a machine and a human being. The human can reach the interpreter over either a wireline or a wireless network.

JavaScript is available inside VoiceXML, but complex code inside VoiceXML is discouraged.  The reasons are the same ones that led to the failure of client-side Java and C++ inside HTML.  Instead, a VoiceXML script can invoke other server-side scripts.

TellMe servers are hosted at an Exodus facility, which provides Internet connectivity.  The servers are connected by a dedicated OC-48 network (about 2500 Mbits/s) to AT&T telephone switches in three different locations.
 
 

VOICEXML

XML is a human-readable notation for writing and exchanging structured information of all sorts.  We'll go into detail on XML later in the quarter.  For now, it's just a strange syntax, and VoiceXML happens to be written using XML.

VoiceXML is a language similar to HTML, but for telephone-based interaction.  Like HTML, VoiceXML can be called a rendering language, or a programming language.

XML documents are built from nested "elements."  A VoiceXML document contains a single <vxml> element, which is the root element. The basic units of a VoiceXML document are dialogs, specified as <form> elements, and menus, identified by <menu> elements.  A form is a set of fields to be entered by the human, like an HTML form.  A menu lets the human make a choice from a list.
 
 

[Figure: Basic structure of a VoiceXML document]
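
The nesting described above can be checked mechanically, since a VoiceXML document is ordinary XML.  The sketch below parses a small document and lists its dialogs, meaning the <form> and <menu> elements directly under the <vxml> root.  The sample document, its ids, and the function name list_dialogs are invented for illustration and are not taken from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one menu and two forms under the root.
SAMPLE_VXML = """<?xml version="1.0"?>
<vxml version="1.0">
  <menu id="main">
    <prompt>Say login or help.</prompt>
    <choice next="#login">login</choice>
    <choice next="#help">help</choice>
  </menu>
  <form id="login">
    <field name="userID">
      <prompt>Please dial or say your five-digit user ID</prompt>
    </field>
  </form>
  <form id="help">
    <block>Say login to sign in.</block>
  </form>
</vxml>
"""


def list_dialogs(vxml_text: str) -> list:
    """Return (element name, id) for each top-level form or menu."""
    root = ET.fromstring(vxml_text)
    if root.tag != "vxml":
        raise ValueError("root element must be <vxml>")
    # Iterating over an element yields its direct children in order.
    return [(child.tag, child.get("id", "")) for child in root
            if child.tag in ("form", "menu")]


if __name__ == "__main__":
    for tag, dialog_id in list_dialogs(SAMPLE_VXML):
        print(tag, dialog_id)
```

A check like this is a cheap way to catch a mistyped or unclosed tag before the voice server ever sees the document.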

In our example, we'll be using a form to obtain a user ID and a password from the user. Because we are using a phone, we'll ask the user for five digits for the user ID and four digits for the password. Digits are easily typed on phone keys, or simply spoken into a phone.

Each <field> element inside a form should contain multiple elements that specify various aspects of the field.  These elements are not necessarily executed sequentially.

A <prompt> element specifies what the voice browser will say. For instance:
        <prompt>
               <audio>Please dial or say your five-digit user ID</audio>
        </prompt>

The <audio> element instructs the voice browser to use the text-to-speech engine to read the data content and say it on the phone. After the prompt, the voice browser waits for an answer from the interlocutor at the other end of the phone line.

A <grammar> element contains the rules that specify what the voice server will recognize.  For instance:
        <grammar>
               <![CDATA[Four_digits]]>
        </grammar>

Once the person says or dials the user ID, the <filled> element causes the browser to give feedback to the interlocutor by saying what the speech recognition engine understood.  The browser then jumps to the next form by following the <goto> element.

        <filled>
               <audio>I heard you say {document.login.userID}</audio>
               <goto next="#password"/>
        </filled>

A different <filled> element could be used to send variables to the server for further processing, e.g.
       <filled>
            <submit next="http://www.web.com/pin.php"/>
       </filled>
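
On the receiving end, the script named in <submit> gets the field values as ordinary request parameters and must answer with another VoiceXML document.  Here is a rough sketch of such a handler, in Python rather than PHP.  It assumes the form has a field named pin, as in the complete example later in these notes; the function name, the EXPECTED_PIN constant, and the reply wording are hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical stored PIN; a real application would look it up per user.
EXPECTED_PIN = "1234"


def handle_pin_submission(fields: dict) -> str:
    """Return a VoiceXML reply for the submitted form fields."""
    pin = fields.get("pin", "")
    if not (len(pin) == 4 and pin.isdigit()):
        message = "That was not a four digit PIN."
    elif pin != EXPECTED_PIN:
        message = "Sorry, that PIN is not correct."
    else:
        message = "Your PIN was accepted."
    # The reply is itself a VoiceXML document for the voice server to speak.
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="1.0">\n'
        '  <form id="result">\n'
        f'    <block>{escape(message)}</block>\n'
        '  </form>\n'
        '</vxml>\n'
    )
```

Checking the format on the server as well as in the grammar is deliberate: the server should not trust that a request really came from a well-behaved voice browser.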

The <noinput> element allows rudimentary error handling by catching the case where the user does not dial or say anything.  Specifically, the <noinput> element says what to do when a prompt remains unanswered after a certain time.

This example is adapted from http://www.webreference.com/perl/tutorial/20/tutorial20.html.

<?xml version="1.0"?>
  <vxml version="1.0" >
    <form id="login">
      <field name="pin">
             <grammar>
                <![CDATA[Four_digits]]>
             </grammar>
             <prompt>Please enter your 4 digit pin code.</prompt>
             <filled>
               <submit next="http://www.web.com/pin.php"/>
             </filled>
             <noinput>No PIN entered.<reprompt/></noinput>
             <nomatch count="1">Invalid pin code.<reprompt/></nomatch>
             <nomatch count="2">Too many attempts.<exit/></nomatch>
      </field>
    </form>
  </vxml>
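
The two <nomatch> elements in this example differ only in their count attribute.  As I understand the VoiceXML rule, the interpreter counts how many times an event has occurred in the field and runs the handler with the highest count that does not exceed that number.  The sketch below imitates this selection in Python; the function name and the handler descriptions are made up, and a real interpreter does considerably more.

```python
def select_handler(handlers: dict, occurrence: int):
    """Pick the handler with the highest count not exceeding `occurrence`.

    `handlers` maps a count attribute value to a handler description.
    Returns None when no handler applies yet.
    """
    eligible = [count for count in handlers if count <= occurrence]
    if not eligible:
        return None
    return handlers[max(eligible)]


# The two <nomatch> handlers from the example above.
NOMATCH_HANDLERS = {
    1: "say 'Invalid pin code.' and reprompt",
    2: "say 'Too many attempts.' and exit",
}

if __name__ == "__main__":
    for occurrence in (1, 2, 3):
        print(occurrence, select_handler(NOMATCH_HANDLERS, occurrence))
```

So the first unrecognized answer leads to a reprompt, and the second one ends the call.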
 
 

SPEECH RECOGNITION

A grammar specifies what the alternatives are for what the user might be saying.  Before a grammar can be used, it must be compiled.  Reusing grammars that have already been compiled is critical for speed.

Including a <grammar> inside a <field> element limits the scope of the grammar rules to that field.  For instance, in our example, we use the TellMe default library grammar named "Five-digit" for the <field> element associated with the user ID, and the "Four-digit" library grammar for the password.

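Voice recognition is easier when any of these restrictions is true:

    The vocabulary is limited.
    Words are spoken separately, rather than as continuous speech.
    The speaker is already known to the system, rather than being a novel speaker.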

If at least one of these restrictions is true, then voice recognition can be done with reasonable accuracy using a Pentium III class processor.  Understanding continuous speech with an unlimited vocabulary, from novel speakers, is beyond the current state of the art.
 
 

REFERENCES AND TUTORIALS

This shows how to write a simple form in VoiceXML using TellMe Studio:
    http://www.webreference.com/perl/tutorial/20/tutorial20.html
This shows how to build an interactive application that links to a backend Perl script:
    http://www.webreference.com/perl/tutorial/21/tutorial21.html

This PDF document explains the role of VoiceXML and the architecture of the TellMe service:
    http://www.tellme.com/business/downloads/VoiceXML_facts_and_fiction.pdf

These documents discuss two small but interesting VoiceXML applications:
    http://studio.tellme.com/articles/TRAIN.html
    http://studio.tellme.com/articles/OnCalls.html
 



Copyright (c) by Charles Elkan, 2001.