CSE 134A LECTURE NOTES

November 4, 2002
 
 

ANNOUNCEMENTS

I have very sad news: Aicha Mouline passed away last Thursday in a freeway crash.  Her friend Desiree has created a memorial web site.  Please add some words for Aicha's family if you remember her.

Aicha enjoyed learning and was committed to doing well on everything.  She would understand that we are continuing with 134A, which was a class that she enjoyed a year ago.

The midterm is on Wednesday this week.  Here are the Spring 2001 midterm and final examination.  Sample solutions are not available currently.  The midterm will count for 1/6 of your overall grade.

For the midterm and final, you may bring and use the following materials: one PHP book and one MySQL book, your own personal hand-written notes, documents handed out in class, and a printed copy of the published lecture notes.  You may not use any other materials.
 

Today's lecture will be about regular expressions, but first let's talk about UI design principles for audio interfaces.
 
 

USER INTERFACE

Getting a voice user interface (UI) right is even harder than getting a web interface right.  Bad visual interfaces are still usable, but bad voice interfaces are not.

What is the most fundamental difference between audio and visual interfaces?  It may well be the difference between one dimension and two.  Voice interfaces are unavoidably sequential, whereas a web user can jump around on a page, and absorb a lot of information simultaneously.

There are few universal principles for UI design, but here are some guidelines that are usually appropriate:

A fundamental issue: User initiative versus system initiative versus mixed initiative.
 
 

HOW DO REGULAR EXPRESSIONS MATCH?

The general rule is that the expression gives the longest match possible, starting at the first place where a match is possible.  This behavior is called "leftmost, longest."  For example, in the string "Greetings, planet Earth!" these expressions match as follows:
 
    [[:alpha:]]+    Greetings
    [[:alpha:]]*    Greetings
    n[et]*          n in Greetings
    n[et]+          net in planet
    G.*t            Greetings, planet Eart

Leftmost, longest behavior means that to match the first string delimited by single quotes, you must write the pattern '[^']*' and not '.*', which would run on to the last quote in the string.  The pattern '[^']*' explicitly says that quote characters are not allowed inside the match.

Note that the top priority for the match found is "leftmost."  "Longest" is only the second priority.
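
The matches listed above can be checked with a short script.  This is only a sketch: it uses preg_match() (PCRE) as a stand-in for the POSIX functions, and first_match() is a hypothetical helper, not a PHP built-in.  PCRE is leftmost with greedy quantifiers rather than strictly leftmost, longest, but for these particular patterns the results should be the same.

```php
<?php
// Hypothetical helper: return the first match of $pattern in $subject, or null.
function first_match($pattern, $subject) {
    return preg_match($pattern, $subject, $m) ? $m[0] : null;
}

$s = "Greetings, planet Earth!";

echo first_match('/[[:alpha:]]+/', $s), "\n";  // Greetings
echo first_match('/n[et]*/', $s), "\n";        // n   (in "Greetings": leftmost beats the longer "net")
echo first_match('/n[et]+/', $s), "\n";        // net (in "planet": the first place a match is possible)
echo first_match('/G.*t/', $s), "\n";          // Greetings, planet Eart

// Quoted strings: .* runs on to the last quote, [^']* stops at the next one.
$q = "say 'one' and 'two'";
echo first_match("/'.*'/", $q), "\n";          // 'one' and 'two'
echo first_match("/'[^']*'/", $q), "\n";       // 'one'
```

The n[et]* line shows the priority rule: the match "n" in Greetings is shorter than "net" in planet, but it starts earlier, so it wins.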
 
 

EFFICIENCY WITH REGULAR EXPRESSIONS

Simple regular expressions can be very inefficient to use.

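Avoid especially expressions that can match in multiple ways.  For example, do not write .*<big><b>.*</b></big>.*  This is bad for several reasons: the leading and trailing .* contribute nothing to locating the title, and if the tags occur more than once, the string can be divided among the three .* parts in many ways, which the matcher may have to explore.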

Much better: write

    $text = after("<big><b>",$text);
    $title = before("</b></big>",$text);
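
The functions before() and after() are not PHP built-ins.  A minimal sketch of how they might be defined with plain string functions is below; these definitions, their behavior when the marker is missing, and the sample $text are all assumptions made for illustration.

```php
<?php
// Hypothetical definitions; the helpers used in the course may differ.

// Return the part of $text that follows the first occurrence of $marker,
// or "" if $marker does not occur.
function after($marker, $text) {
    $pos = strpos($text, $marker);
    return $pos === false ? "" : (string) substr($text, $pos + strlen($marker));
}

// Return the part of $text that precedes the first occurrence of $marker,
// or all of $text if $marker does not occur.
function before($marker, $text) {
    $pos = strpos($text, $marker);
    return $pos === false ? $text : substr($text, 0, $pos);
}

// Made-up sample input.
$text = "<html><big><b>Course News</b></big><p>More text</p></html>";
$text = after("<big><b>", $text);
$title = before("</b></big>", $text);
echo $title, "\n";  // Course News
```

Each helper makes one pass with strpos(), so there is only one way for it to succeed and no backtracking.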

When possible, use plain strings instead of regular expressions.  The explode() function has the same effect as split(), but the delimiter is an ordinary string, not a regular expression.  Therefore explode() is much more efficient.
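
As a small illustration of the difference, here is a sketch with made-up sample strings.  Since split() is no longer available in current PHP (it was removed in PHP 7), preg_split() stands in for the regular expression version.

```php
<?php
$line = "red,green,blue";

$parts = explode(",", $line);       // delimiter is an ordinary string
$same  = preg_split('/,/', $line);  // same result, but through the regex engine

print_r($parts);

// A regular expression is only needed when the delimiter itself varies,
// for example runs of spaces and tabs.
$words = preg_split('/[ \t]+/', "alpha  beta\tgamma");
print_r($words);
```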

For the new project and in general, it is important that you adopt a well-organized and efficient approach for doing information extraction.  Do not just use regular expressions developed by trial and error.  One part of your report should be an explanation of your approach, which should be as clear and simple as possible.  In the report, describe the capabilities and limitations of your strategy.  Which changes in the data sources could you handle, and which changes would break your strategy?
 
 

EXTRACTING PARTS OF A STRING

Note:  The material starting here was not covered in lecture, and will not be on the midterm.  Everything above can be on the midterm.

If ereg() has a third argument that is an array variable, say $regs, and the pattern has parenthesized subpatterns, then the match to each subpattern will be stored in $regs[1], $regs[2], etc.  $regs[0] will contain the whole match.

If ereg() finds any matches at all, then $regs is filled with exactly ten elements, even though more or fewer than ten parenthesized substrings may actually match.  If no matches are found, then $regs is not altered.

For example, to convert from an ISO date (YYYY-MM-DD) to a day.month.year date:

    if (ereg("([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})", $date, $regs)) {
        // $regs[1] is the year, $regs[2] the month, $regs[3] the day
        echo "$regs[3].$regs[2].$regs[1]";
    } else {
        echo "Invalid date format: $date";
    }
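
The ereg() function was removed in PHP 7, so to try this example on a current installation, one option is the sketch below.  It keeps the same pattern but calls preg_match(); the wrapper iso_to_dmy() is a hypothetical name introduced only for this illustration.  One difference to keep in mind: preg_match() stores only the whole match and the subpatterns that exist in $regs, not a fixed ten elements.

```php
<?php
// Hypothetical helper: rewrite YYYY-MM-DD as day.month.year,
// or return null if $date does not contain an ISO-style date.
function iso_to_dmy($date) {
    if (preg_match('/([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})/', $date, $regs)) {
        // $regs[0] is the whole match; $regs[1..3] are year, month, day.
        return "$regs[3].$regs[2].$regs[1]";
    }
    return null;
}

echo iso_to_dmy("2002-11-04"), "\n";  // 04.11.2002
var_dump(iso_to_dmy("November 4"));   // NULL
```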
 
 

REPLACING PARTS OF A STRING

The function ereg_replace ($pattern, $replacement, $string) scans $string for matches to $pattern, then replaces each match with $replacement.

For example, suppose we want to separate all the words in a string by commas:

    ereg_replace("[ \n\r\t]+", ",", trim($str));

If the pattern contains parenthesized substrings, then the replacement string may contain substrings of the form \\digit, where digit is between 0 and 9.  Each of these is replaced by the text matching the corresponding parenthesized substring.  \\0 gives the entire match.  If parentheses are nested, they are counted by the opening parenthesis.  For example:

    $string = ereg_replace("([a-z])([A-Z])", "\\1xxx\\2", "FieldNamePlus");

The \\1 says to include whatever matched the first parenthesized subpattern, and the \\2 says to include whatever matched the second.  The output is

    FieldxxxNamexxxPlus

because dN is changed to dxxxN and eP to exxxP.
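
Both replacement examples can be tried on a current PHP installation with preg_replace(), which accepts the same \\digit references in the replacement string.  This is a sketch under that substitution, since ereg_replace() was removed in PHP 7; the sample value of $str is made up.

```php
<?php
// Separate the words of a string by commas.
$str = "  the quick\tbrown\n fox ";
$commas = preg_replace('/[ \n\r\t]+/', ",", trim($str));
echo $commas, "\n";  // the,quick,brown,fox

// Insert xxx wherever a lowercase letter is followed by an uppercase letter.
$string = preg_replace('/([a-z])([A-Z])/', "\\1xxx\\2", "FieldNamePlus");
echo $string, "\n";  // FieldxxxNamexxxPlus
```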
 
 




Copyright (c) by Charles Elkan, 2002.