CSE134A LECTURE NOTES

October 24, 2001
 
 

ANNOUNCEMENTS

Because Monday Nov. 12 is a campus holiday, the third project will be due three weeks from today, on Wednesday Nov. 14.

Remember that the midterm is next Wednesday, October 31.  Here are the Spring 2001 midterm and final examinations.  Sample solutions are not currently available.  The midterm will count for 1/6 of your overall grade.
 
 

USING REGULAR EXPRESSIONS

For a good basic tutorial on regular expressions in PHP, see http://www.phpbuilder.com/columns/dario19990616.php3.  Some of the examples below are taken from it.

The PHP function ereg($pattern,$input) returns TRUE if the pattern is found in the string.  Remember to use ^ and $ if you want the pattern to match only if it matches the whole string.  The function eregi() is similar except it ignores upper case/lower case differences.
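The effect of anchoring can be sketched as follows.  This is only an illustration under one assumption: it is written for a current PHP release, where the ereg() family is no longer available, so preg_match() is used as a stand-in and the patterns carry / delimiters.  The input string and variable names are made up for the example.

```php
<?php
$input = "order 12345 shipped";

// Unanchored: succeeds because digits occur somewhere in the string.
$found_anywhere = preg_match('/[0-9]+/', $input) === 1;

// Anchored with ^ and $: succeeds only if the whole string is digits.
// The D modifier makes $ match only at the very end of the string.
$whole_is_digits = preg_match('/^[0-9]+$/D', $input) === 1;
$whole_ok        = preg_match('/^[0-9]+$/D', "12345") === 1;

// The i modifier ignores case differences, in the spirit of eregi().
$case_blind = preg_match('/^ORDER/i', $input) === 1;
```

Here only the unanchored test and the case-insensitive test succeed on $input; the anchored test succeeds only on the all-digit string.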

The function split($pattern,$input) returns an array of strings that is the result of dividing up its second argument into pieces, using matches to the first argument as a delimiter.  For example:

list ($month, $day, $year) = split ('[/.-]', $date);
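Note that inside the brackets the slash, period, and hyphen are not special characters: they stand for themselves.  (The hyphen is literal because it comes last in the bracket expression.)  The list() construct does not build a new data structure; it assigns the elements of the array returned by split() to the listed variables, in order.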

If the delimiter is not found or the delimiter is empty, then the first element of the array (with subscript 0) gets the whole input.  If the delimiter occurs several times in a row, the array will include an empty string for each gap between consecutive delimiters.
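A small sketch of these splitting rules, assuming a current PHP release in which preg_split() stands in for split().  The date strings are invented for the example, and the pattern uses # as its delimiter so that the slash needs no escaping.

```php
<?php
// Three fields separated by any of the three delimiter characters.
list($month, $day, $year) = preg_split('#[/.-]#', "10/24.2001");

// Delimiter not found: element 0 holds the whole input.
$whole = preg_split('#[/.-]#', "20011024");

// Two delimiters in a row produce an empty string between them.
$gaps = preg_split('#[/.-]#', "10//2001");
```

The empty-delimiter case is left out on purpose, because preg_split() does not treat an empty pattern the way the text describes for split().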

An optional third argument says how many items to return.  The last item then contains the whole remainder of the string.  This is useful for writing before() and after() functions that take a specific part of a web page.  These functions are useful for information extraction, for example:

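function after($pattern, $text) {
    // With an empty pattern, return the text unchanged.
    if ($pattern == "") return $text;
    // Split into at most two pieces around the first match.
    $s = split($pattern, $text, 2);
    // If the pattern was not found, there is no second piece.
    return isset($s[1]) ? $s[1] : "";
}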
Note that this function returns all of $text after the first occurrence of the pattern.
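A companion before() function could look like the sketch below.  It is hypothetical, not code from the course: it assumes a current PHP release, so preg_split() stands in for split(), and it assumes the pattern never contains the character ~, which is used as the pattern delimiter.

```php
<?php
// Hypothetical companion to after(): returns the part of $text that
// precedes the first match of $pattern.  If the pattern is not found,
// the whole text is returned.
function before($pattern, $text) {
    // An empty pattern matches at the very start, so nothing precedes it.
    if ($pattern == "") return "";
    // Split into at most two pieces around the first match.
    $s = preg_split('~' . $pattern . '~', $text, 2);
    return $s[0];
}
```

For instance, before("<b>", "abc<b>def<b>ghi") gives "abc", since only the first occurrence of the pattern is used.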
 
 

HOW DO REGULAR EXPRESSIONS MATCH?

The general rule is that the expression gives the longest match possible, starting at the first place where a match is possible.  This behavior is called "leftmost, longest."  For example, in the string "Greetings, planet Earth!" these expressions match as follows:

[[:alpha:]]+    Greetings
[[:alpha:]]*    Greetings
n[et]*          the n in Greetings
n[et]+          net in planet
G.*t            Greetings, planet Eart

Leftmost, longest behavior means that to match the first string delimited by single quotes you must write the pattern '[^']*'.  This pattern explicitly says that quote characters are not allowed inside the match.  The simpler pattern '.*' would instead run from the first quote to the last quote in the input.

Note that the top priority for the match found is "leftmost."  "Longest" is only the second priority.
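The matches listed above can be checked with a short script.  This sketch assumes a current PHP release and uses preg_match() as a stand-in for ereg().  The PCRE functions do not follow the leftmost, longest rule in every case (alternation is one difference), but for the specific patterns here, which use only greedy repetition, they find the same matches.  The helper first_match() and the quoted test string are hypothetical.

```php
<?php
// Hypothetical helper: returns the text of the first match, or NULL if
// the pattern does not match at all.
function first_match($pattern, $text) {
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$s = "Greetings, planet Earth!";
$a = first_match('/[[:alpha:]]+/', $s);  // first run of letters
$b = first_match('/n[et]*/', $s);        // leftmost n, nothing after it
$c = first_match('/n[et]+/', $s);        // needs e or t after the n
$d = first_match('/G.*t/', $s);          // runs to the last t

$q = "say 'one' and 'two'";
$greedy = first_match("/'.*'/", $q);     // first quote to last quote
$first  = first_match("/'[^']*'/", $q);  // first quoted string only
```

Note that the leftmost n, in Greetings, wins for n[et]* even though a longer match exists later in planet.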
 
 

EFFICIENCY WITH REGULAR EXPRESSIONS

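Even simple-looking regular expressions can be very inefficient to use.

Especially avoid expressions that can match in multiple ways.  For example, do not write .*<big><b>.*</b></big>.*  This is bad for several reasons: the leading and trailing .* are unnecessary, since ereg() already finds a match anywhere in the string; each .* can match in many different ways, which the matcher may have to try; and because matching is longest, the middle .* extends to the last </b></big> in the input, not the first.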

Much better: write

    $text = after("<big><b>",$text);
    $title = before("</b></big>",$text);

When possible, use plain strings instead of regular expressions.  The explode() function has the same effect as split(), but the delimiter is an ordinary string, not a regular expression.  Therefore explode() is much more efficient.
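As a sketch of the plain-string approach, the two helpers below do the same job as after() and before() but rely on explode(), so no regular expression is involved.  The names after_str() and before_str() and the sample page are hypothetical, chosen only for this illustration.

```php
<?php
// Everything after the first occurrence of the plain string $delim,
// or "" if $delim does not occur.
function after_str($delim, $text) {
    if ($delim === "") return $text;
    $parts = explode($delim, $text, 2);   // at most two pieces
    return isset($parts[1]) ? $parts[1] : "";
}

// Everything before the first occurrence of $delim, or the whole text
// if $delim does not occur.
function before_str($delim, $text) {
    if ($delim === "") return "";
    $parts = explode($delim, $text, 2);
    return $parts[0];
}

$page  = "<html><big><b>Lecture Notes</b></big><p>More text</p></html>";
$rest  = after_str("<big><b>", $page);
$title = before_str("</b></big>", $rest);
```

Each call scans the text once for a fixed string, and $title ends up holding only the text between the two markers.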

For the new project and in general, it is important that you adopt a well-organized and efficient approach for doing information extraction.  Do not just use regular expressions developed by trial and error.  One part of your report should be an explanation of your approach, which should be as clear and simple as possible.  In the report, describe the capabilities and limitations of your strategy.  Which changes in the data sources could you handle, and which changes would break your strategy?
 
 

EXTRACTING PARTS OF A STRING

If ereg() has a third argument that is an array variable, say $regs, and the pattern has parenthesized subpatterns, then the match to each subpattern will be stored in $regs[1], $regs[2], etc.  $regs[0] will contain the whole match.

If ereg() finds any matches at all, then $regs is filled with exactly ten elements, even though more or fewer than ten parenthesized substrings may actually match.  If no matches are found, then $regs is not altered.

For example, to convert a date from ISO format (YYYY-MM-DD) to DD.MM.YYYY format:

if (ereg("([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})", $date, $regs)) {
    echo "$regs[3].$regs[2].$regs[1]";
} else {
    echo "Invalid date format: $date";
}
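The same extraction can be wrapped in a function, sketched here under the assumption of a current PHP release where preg_match() stands in for ereg().  The name iso_to_dmy() is hypothetical.  Unlike the example above, this version anchors the pattern with ^ and $, so a date embedded in a longer string is rejected.

```php
<?php
// Hypothetical helper: converts YYYY-MM-DD to DD.MM.YYYY, or returns
// NULL when the input is not in that form.
function iso_to_dmy($date) {
    $pattern = '/^([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})$/D';
    if (preg_match($pattern, $date, $regs)) {
        // $regs[0] is the whole match; $regs[1], $regs[2], $regs[3]
        // hold the year, month, and day subpatterns.
        return "$regs[3].$regs[2].$regs[1]";
    }
    return null;
}
```

For instance, iso_to_dmy("2001-10-24") gives "24.10.2001", while iso_to_dmy("10/24/2001") gives NULL.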
 
 

REPLACING PARTS OF A STRING

The function ereg_replace ($pattern, $replacement, $string) scans $string for matches to $pattern, then replaces each match with $replacement.

For example, suppose we want to separate all the words in a string by commas:

$str = ereg_replace("[ \n\r\t]+", ",", trim($str));
If the pattern contains parenthesized substrings, then the replacement string may contain substrings of the form \\digit, where digit is between 0 and 9.  Each of these is replaced by the text matching the corresponding parenthesized substring.  \\0 gives the entire match to the whole pattern.  If parentheses are nested, they are counted by the opening parenthesis.  For example:
$string = ereg_replace("([a-z])([A-Z])", "\\1xxx\\2", "FieldNamePlus");
The \\1 says to include whatever matched the first parenthesized subpattern, and the \\2 says to include whatever matched the second.  The output is
FieldxxxNamexxxPlus
because dN is changed to dxxxN and eP to exxxP.
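The replacements described in this section can be tried with the sketch below, which assumes a current PHP release and uses preg_replace() as a stand-in for ereg_replace().  The input strings other than FieldNamePlus are made up for the example.

```php
<?php
// Separate words with commas: each run of whitespace becomes one comma.
$csv = preg_replace('/[ \n\r\t]+/', ",", trim("  alpha beta\tgamma \n"));

// Backreferences: \\1 and \\2 insert the text matched by each group.
$marked = preg_replace('/([a-z])([A-Z])/', "\\1xxx\\2", "FieldNamePlus");

// \\0 stands for the entire match.
$bracketed = preg_replace('/[0-9]+/', "[\\0]", "room 12, floor 3");
```

Calling trim() first matters in the comma example: without it, leading or trailing whitespace would turn into a stray comma at either end.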
 
 
 



Copyright (c) by Charles Elkan, 2001.