Today I'll give an intro to VoiceXML, which is used in the current project.
The human calls an 800 number, usually 800-555-TELL, and reaches the TellMe service. The TellMe server loads a VoiceXML script from your web server, then executes the script.
Executing a VoiceXML script means following its commands, converting text to speech, and recognizing words and sentences said by the human user.
For most applications, the VoiceXML architecture is as in the illustration below.

Alternatively, the VoiceXML interpreter can be included in a device such as a wireless phone or a car radio that acts as a client of the web server. In this case there is no need for a separate voice server.
TellMe has multiple servers, with load balancing, fault tolerance, and adaptive caching. The servers are hosted by an Exodus facility, which provides Internet connectivity. The servers are connected by a dedicated OC-48 network (about 2500 Mbit/s) to AT&T telephone switches in three different locations.
The VoiceXML script can be stored in a text file with a .vxml extension,
or it can be generated as the output of a PHP script. Voice servers
also compile grammars for voice recognition.
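As a minimal sketch of the static case, a text file on the web server might contain nothing more than one form that speaks a greeting. The file name hello.vxml and the greeting are assumptions made for illustration; the <block> element holds commands that run without waiting for caller input.

```xml
<?xml version="1.0"?>
<!-- hello.vxml: hypothetical file name, served as a static text file -->
<vxml version="1.0">
  <form id="hello">
    <block>
      <!-- Spoken by the voice server's text-to-speech engine -->
      <prompt>Hello, world.</prompt>
      <exit/>
    </block>
  </form>
</vxml>
```

A PHP script that generates VoiceXML would simply print a document of this shape as its output.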
VoiceXML is a language similar to HTML, but for telephone-based interaction. Like HTML, VoiceXML can be called a rendering language, but it is also a programming language. As with HTML, JavaScript is available inside VoiceXML, but complex code inside VoiceXML is discouraged. The reasons are the same as those behind the failure of client-side Java and C++ inside HTML.
A VoiceXML script can invoke another server-side script, either as a subroutine (e.g. to ask for, acquire, and verify a credit card number) or as the follow-on phase of interacting with the human user.
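The subroutine case can be sketched with the standard <subdialog> element, which runs another dialog and resumes when that dialog executes <return>. The example.com URLs, the form and variable names, and the assumption that the called document returns a variable named number are all hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="payment">
    <!-- Form-level variable that will hold the verified card number -->
    <var name="cardnumber"/>
    <!-- Call another document as a subroutine. It is assumed to end with
         <return namelist="number"/>, so the result arrives as card.number -->
    <subdialog name="card" src="http://www.example.com/get_card.vxml">
      <filled>
        <assign name="cardnumber" expr="card.number"/>
        <!-- Follow-on phase: hand the result to a server-side script -->
        <submit next="http://www.example.com/charge.php" namelist="cardnumber"/>
      </filled>
    </subdialog>
  </form>
</vxml>
```

The <submit> at the end illustrates the other case: control passes to the next script and does not come back.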
XML documents are built from nested "elements." A VoiceXML document contains a single <vxml> element, which is the root element. The basic units of a VoiceXML document are dialogs, which come in two kinds: forms, specified as <form> elements, and menus, specified as <menu> elements. A form is a set of fields to be entered by the human, like an HTML form. A menu lets the human make a choice from a list.
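Since the examples in these notes all use <form>, here is a hedged sketch of a <menu> for comparison. The menu choices and the target form names are invented for illustration; <choice> and <enumerate> are standard VoiceXML elements.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <menu id="main">
    <!-- <enumerate/> reads the choices back to the caller in order -->
    <prompt>Welcome. Please say one of: <enumerate/></prompt>
    <choice next="#weather">weather</choice>
    <choice next="#news">news</choice>
    <noinput>I did not hear anything.<reprompt/></noinput>
    <nomatch>I did not understand.<reprompt/></nomatch>
  </menu>
  <form id="weather">
    <block><prompt>Here is the weather.</prompt><exit/></block>
  </form>
  <form id="news">
    <block><prompt>Here is the news.</prompt><exit/></block>
  </form>
</vxml>
```

Each <choice> supplies both the phrase to recognize and the place to go next, so a menu needs no separate grammar.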
One way to design a VoiceXML document is to draw a diagram that shows the nesting of forms and menus, starting with a template like this:
The example is adapted from http://www.webreference.com/perl/tutorial/20/tutorial20.html.
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="login">
    <field name="pin">
      <!-- Named grammar that defines what the recognizer accepts -->
      <grammar>
        <![CDATA[Four_digits]]>
      </grammar>
      <prompt>Please enter your 4 digit pin code.</prompt>
      <!-- Runs once the field holds a recognized value -->
      <filled>
        <submit next="http://www.web.com/pin.php"/>
      </filled>
      <!-- Error handling: silence first, then unrecognized input -->
      <noinput>No PIN entered.<reprompt/></noinput>
      <nomatch count="1">Invalid pin code.<reprompt/></nomatch>
      <nomatch count="2">Too many attempts.<exit/></nomatch>
    </field>
  </form>
</vxml>
A <prompt> element specifies what the voice browser will
say. For instance:
<prompt>
<audio>Please dial or say your five-digit user ID</audio>
</prompt>
Here the <audio> element has no src attribute naming a recorded sound file, so the VoiceXML interpreter uses the text-to-speech engine to speak the enclosed text as output. After the prompt, the interpreter waits for an answer from the human at the other end of the phone line.
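For comparison, a sketch of the same prompt backed by a recording. In standard VoiceXML the src attribute of <audio> names a sound file to play, and the enclosed text serves as a fallback if the file cannot be played. The file name userid.wav is hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="login">
    <field name="userID" type="digits">
      <prompt>
        <!-- userid.wav is a hypothetical recording; the text is the fallback -->
        <audio src="userid.wav">Please dial or say your five-digit user ID</audio>
      </prompt>
    </field>
  </form>
</vxml>
```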
A <grammar> element gives the name of a set of rules that
specify what will be recognized by the voice server. For instance:
<grammar>
<![CDATA[Four_digits]]>
</grammar>
Most sets of rules, like Four_digits, are pre-programmed.
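Besides platform-specific named grammars, the VoiceXML 1.0 specification defines built-in field types such as digits, number, and boolean. As a sketch, the type attribute can stand in for an explicit <grammar>. Whether this behaves the same as Four_digits on TellMe is an assumption, since the digits type alone does not fix the number of digits.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="login">
    <!-- type="digits" selects the built-in grammar for a string of digits -->
    <field name="pin" type="digits">
      <prompt>Please enter your 4 digit pin code.</prompt>
    </field>
  </form>
</vxml>
```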
A <filled> element specifies what to do once the field has been filled in with recognized input. In this example, it provides feedback to the interlocutor by saying what the speech recognition engine understood, and the <goto> element then jumps immediately to the next form:
<filled>
<audio>I heard you say {document.login.userID}</audio>
<goto next="#password"/>
</filled>
A different <filled> element could be used to send variables
to the server for further processing, as the <submit> element in the login form above does.
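A sketch of such a <filled> element, reusing the URL and field name from the login form earlier in these notes. The namelist and method attributes are standard <submit> attributes; the choice of POST here is an assumption.

```xml
<filled>
  <!-- Send only the pin variable, as an HTTP POST, to the server script -->
  <submit next="http://www.web.com/pin.php" method="post" namelist="pin"/>
</filled>
```

The server script would then reply with a new VoiceXML document, which the voice server executes next.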
The <noinput> and <nomatch> elements handle errors:
they catch the cases where the user does not dial or say anything, or says something
that cannot be recognized. Specifically, the <noinput>
element says what to do if the user has not responded within a certain time, and the <nomatch> element says what to do if the response does not match the active grammar.
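The following fragment sketches how the waiting time and the escalation could be written. The timeout attribute of <prompt> and the count attribute of the handlers are standard VoiceXML; the five-second value and the wording of the messages are assumptions.

```xml
<field name="pin">
  <grammar><![CDATA[Four_digits]]></grammar>
  <!-- Wait up to five seconds for an answer before throwing noinput -->
  <prompt timeout="5s">Please enter your 4 digit pin code.</prompt>
  <!-- count selects the handler for the first and second occurrence -->
  <noinput count="1">I did not hear you.<reprompt/></noinput>
  <noinput count="2">Still nothing heard. Goodbye.<exit/></noinput>
  <nomatch>That was not a valid pin code.<reprompt/></nomatch>
</field>
```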
By including a <grammar> inside a <field> element, we limit the scope of the grammar rules to that field's context. For instance, in our experiment, we use the Tellme default library grammar named "Five-digit" for the <field> element associated with the user ID, and we use the "Four-digit" library grammar for the password.
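Putting the pieces together, the two-field login could be sketched as below. The grammar name Five_digits is a guess modeled on Four_digits, the savedID variable is invented for illustration, and the standard <value> element is used for the feedback in place of the curly-brace form shown earlier.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-level variable so the user ID survives the jump between forms -->
  <var name="savedID"/>
  <form id="login">
    <field name="userID">
      <!-- Active only while this field is collecting input -->
      <grammar><![CDATA[Five_digits]]></grammar>
      <prompt>Please dial or say your five-digit user ID.</prompt>
      <filled>
        <prompt>I heard you say <value expr="userID"/></prompt>
        <assign name="savedID" expr="userID"/>
        <goto next="#password"/>
      </filled>
    </field>
  </form>
  <form id="password">
    <field name="pin">
      <grammar><![CDATA[Four_digits]]></grammar>
      <prompt>Please enter your 4 digit pin code.</prompt>
      <filled>
        <submit next="http://www.web.com/pin.php" namelist="savedID pin"/>
      </filled>
    </field>
  </form>
</vxml>
```

Because each grammar sits inside its own field, a five-digit answer is only accepted for the user ID and a four-digit answer only for the password.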
Voice recognition is easier when any of these restrictions is true:
This PDF document explains the role of VoiceXML and the architecture of the TellMe service: VoiceXML facts and fiction
These documents discuss two small but interesting VoiceXML applications:
http://studio.tellme.com/articles/TRAIN.html
http://studio.tellme.com/articles/OnCalls.html
The latter is a small business, Oncalls.com.