CSE134A LECTURE NOTES

April 11, 2001
 
 

WELCOME

On Friday in section Alan Su will go into detail about a workaround for the Homeworth site and these topics: SSL from PHP, HTTP headers, and the POST method of handling forms.  Meanwhile, you can continue work on the Yahoo parts of the project.

About the report: We can't provide a past report because this is a brand-new class. You are welcome to bring your outline or your draft to a TA office hour to discuss it and get ideas for improving it. The report should be detailed enough to be interesting, but it should also be five pages at most. You should use the report-writing skills that you learned in your general education and lab science classes.  These skills are important for graduate school, and also for software jobs with top companies.
 
 

STRING HANDLING

PHP has string functions similar to those available in C and other languages.  But for every new language, you must ask questions about the details of semantics.  For example, with substr(string, pos, length) the second argument is a position counted from zero, so 0 means the first character.  A negative position counts from the end of the string, so -1 means starting at the last character.
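
A quick way to check these semantics is to try them.  The following is a minimal sketch; the sample string and the variable names are made up for illustration.

```php
<?php
$s = "There once was";

$first  = substr($s, 0, 5);   // start at position 0, take 5 characters
$middle = substr($s, 6, 4);   // start at position 6, take 4 characters
$last   = substr($s, -1, 1);  // start at the last character
$tail   = substr($s, -3);     // length omitted: take the rest of the string

echo "$first|$middle|$last|$tail\n";   // prints There|once|s|was
```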

When extracting information from a web page, typically you must remove a lot of HTML tags and extraneous characters.  PHP has many useful functions for this, for example trim(string) which removes white space characters from the start and end of its argument.
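
As a small illustration, the sketch below cleans up one table cell.  The HTML fragment is invented, and strip_tags() is one of the built-in functions that may be useful here; it is an addition to the functions named in these notes.

```php
<?php
$raw = "  <td><b> Price: </b>42 </td>\n";

$noTags = strip_tags($raw);   // drop the HTML tags, keep the text between them
$clean  = trim($noTags);      // drop white space at both ends

echo "[$clean]\n";            // prints [Price: 42]
```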

However, functions with fixed functionality such as trim are not enough for information extraction.  We need more sophisticated string handling for extracting info from web pages.  Regular expressions are the answer.
 
 

REGULAR EXPRESSIONS

A regular expression is a pattern.  Some letters stand for themselves, while others are special.  Note that space does stand for itself but a period means any single non-newline character.

In HTML source code, newlines have no meaning.  You may want to remove them as a first step in information extraction.
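
One possible way to do this is with str_replace().  The sketch below replaces each newline with a space rather than deleting it, so that words on adjacent lines do not run together; that choice, and the sample input, are assumptions.

```php
<?php
$html = "<p>Price:\n<b>42</b>\n</p>";

// Replace carriage returns and newlines with a space.
$flat = str_replace(array("\r", "\n"), " ", $html);

echo $flat . "\n";   // prints <p>Price: <b>42</b> </p>
```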

A regular expression pattern matches a string if the pattern can be found anywhere in the string.  So, for example, the pattern once matches the string "There once was".

The special character ^ means "at the start of the string" and $ means "at the end."

Escape sequences begin with a backslash, e.g. \n and \t and \. and \-.

Characters inside square brackets are alternatives, e.g. [aeiou] matches any one vowel.  Inside [] the character - is special because it indicates a range, e.g. [0-9\.\-].  You can write [ ] to indicate a space explicitly.

Immediately after [ the character ^ means "anything except."  Double square brackets indicate a special character class, for example [[:alnum:]] and [[:space:]]
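
The rules so far can be tried out with a few one-character patterns.  This is only a sketch: it uses preg_match(), because ereg() was removed in PHP 7, and the helper found() is a hypothetical name.  The patterns shown behave as described above, including the [[:...:]] classes.

```php
<?php
// Hypothetical helper: TRUE if $pattern is found anywhere in $input.
// Assumes $pattern contains no "/" character, since "/" is the delimiter.
function found($pattern, $input) {
    return preg_match('/' . $pattern . '/', $input) === 1;
}

$a = found('once', 'There once was');     // TRUE: found in the middle
$b = found('^once', 'There once was');    // FALSE: not at the start
$c = found('was$', 'There once was');     // TRUE: at the end
$d = found('[aeiou]', 'rhythm');          // FALSE: no vowel
$e = found('[^0-9]', '2001');             // FALSE: nothing except digits
$f = found('[[:space:]]', 'no_spaces');   // FALSE: no white space
$g = found('^[[:alnum:]]', 'x1');         // TRUE: starts with a letter or digit
```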
 
 

MATCHING MULTIPLE CHARACTERS

All the above are patterns that match a single character.  Curly braces enclosing a number or a range are a postfix operator indicating multiple matches.  The two ends of a range are separated by a comma.  For example, [[:alpha:]]{2,8} matches between two and eight letters.

Note: Does .{3} mean three of the same character, or any three characters?  Try it and see.

Exercise: Write a pattern that will match any real number, similar to the PHP is_double() function.

? is an abbreviation for {0,1}
* is short for {0,}
+ is short for {1,}
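
The multipliers can be checked with a few test strings.  This is a sketch using preg_match(), because ereg() was removed in PHP 7; the test strings are made up.  preg_match() returns 1 for a match and 0 for no match.

```php
<?php
$r1 = preg_match('/^[[:alpha:]]{2,8}$/', 'Elkan');   // 1: five letters
$r2 = preg_match('/^[[:alpha:]]{2,8}$/', 'a');       // 0: fewer than two letters
$r3 = preg_match('/^colou?r$/', 'color');            // 1: the u is optional
$r4 = preg_match('/^ab*c$/', 'ac');                  // 1: zero b's is allowed
$r5 = preg_match('/^ab+c$/', 'ac');                  // 0: at least one b is needed
$r6 = preg_match('/^ab+c$/', 'abbbc');               // 1: three b's
```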

Parentheses ( ) allow these multipliers to apply to a sequence of characters.  The vertical bar gives alternatives.  Examples:
    (Nant|b)ucket
    Fran|Nan$
    (Fran|Nan)$
Precedence, i.e. tightness of binding, is important for RE operators.  The $ binds more tightly than |, so Fran|Nan$ means Fran anywhere in the string, or Nan at the end of the string.
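
The difference between the last two example patterns shows up on a string that contains Fran but does not end with either name.  A sketch using preg_match() (ereg() was removed in PHP 7); the test strings are invented.

```php
<?php
$loose = '/Fran|Nan$/';     // "Fran" anywhere, OR "Nan" at the end
$tight = '/(Fran|Nan)$/';   // either name, but only at the end

$a = preg_match($loose, 'Fran and Stan');   // 1: Fran is found at the start
$b = preg_match($tight, 'Fran and Stan');   // 0: the string ends with Stan
$c = preg_match($tight, 'Stan and Nan');    // 1: the string ends with Nan

$d = preg_match('/(Nant|b)ucket/', 'Nantucket');   // 1
$e = preg_match('/(Nant|b)ucket/', 'bucket');      // 1
```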

Exercise: What does this pattern test for: ^.+@.+\\..+$

Here \\. is a literal period, with the backslash doubled as it would be written inside a PHP string, and .+ means one or more of any character.

Intuitively, this matches anya@anyb.anyc which is close to the syntax of email addresses.  The pattern is not perfect because it also matches @@... for example.
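
To see both the intended matches and the false positive, the pattern can be run against a few inputs.  This is a sketch using preg_match() (ereg() was removed in PHP 7); the addresses are invented.

```php
<?php
// In a PHP string literal, \\ stands for one backslash, so the regular
// expression that reaches the matcher is ^.+@.+\..+$
$pattern = '/^.+@.+\\..+$/';

$good    = preg_match($pattern, 'user@example.com');   // 1
$noDot   = preg_match($pattern, 'user@example');       // 0: no period after the @
$noAt    = preg_match($pattern, 'user.example.com');   // 0: no @ at all
$strange = preg_match($pattern, '@@...');              // 1: matches, but is not an address
```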
 
 

USING REGULAR EXPRESSIONS

The PHP function ereg($pattern,$input) returns TRUE if the pattern is found in the string.  Remember to use ^ and $ if you want the pattern to match the whole string.

eregi() does the same, but ignores differences between upper case and lower case.

The split($pattern,$input) function gives an array of strings.

split has many applications.  One is to divide a long page into repeated parts, where each part can be processed separately.  Often the first and last parts are special.

If the delimiter is not found, the first element of the array gets the whole input.  If the delimiter is repeated, the array will include empty strings.
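
Both cases are easy to check.  The sketch below uses preg_split(), since split() was removed in PHP 7; for a simple delimiter like this one it behaves the same way.  The input strings are made up.

```php
<?php
// Repeated delimiter: the array has an empty item between the two commas.
$repeated = preg_split('/,/', 'a,b,,c');

// Delimiter not found: the whole input is the only element.
$missing = preg_split('/,/', 'no commas here');

echo count($repeated) . " " . count($missing) . "\n";   // prints 4 1
```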

An optional third argument says at most how many items to return.  The last item then contains the whole remainder of the string.  This is useful for extracting a specific variable part of a web page with before() and after() functions.  These functions are not built in; you write them yourself.  They are useful for information extraction, for example:

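// Return the part of $text that follows the first match of $pattern.
function after($pattern, $text) {
    if ($pattern == "") return $text;
    $s = split($pattern, $text, 2);
    // If the pattern is not found, split returns only one item.
    if (count($s) < 2) return "";
    return $s[1];
}

// Return the identifier (letters, then digits) that follows $pattern.
// printerror() is assumed to be defined elsewhere.
function idafter($pattern, $text) {
    $idpattern = "^([[:alpha:]]*[[:digit:]]*)";
    $text = after($pattern, $text);
    if (!eregi($idpattern, $text, $id))
        printerror("No identifier found in $text");
    return $id[1];
}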

EFFICIENCY WITH REGULAR EXPRESSIONS

Even simple regular expressions can be very inefficient to use.

Especially avoid expressions that can match in multiple ways.  For example, do not write .*<big><b>.*</b></big>.*  This is bad for several reasons.  One, the .* at the beginning and end are unnecessary, because a pattern can match anywhere in the string.  Don't add ^ and $; instead delete the .* at both ends.  Two, if there are multiple <big><b> regions, the pattern does not single out the first one.

Much better: write
    $text = after("<big><b>",$text);
    $title = before("</b></big>",$text);
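
For reference, here is one way the two helpers might look in current PHP.  This is a sketch, not the course's own code: split() was removed in PHP 7, so it uses preg_split(); the body of before() is a guess that mirrors after(); and the # delimiter assumes that patterns never contain a # character.  The sample page is invented.

```php
<?php
// Everything after the first match of $pattern ("" if there is no match).
function after($pattern, $text) {
    if ($pattern == "") return $text;
    $s = preg_split('#' . $pattern . '#', $text, 2);
    return count($s) > 1 ? $s[1] : "";
}

// Everything before the first match of $pattern (all of $text if no match).
function before($pattern, $text) {
    if ($pattern == "") return $text;
    $s = preg_split('#' . $pattern . '#', $text, 2);
    return $s[0];
}

$page  = "<html><big><b>Lecture 4</b></big><big><b>Other</b></big></html>";
$text  = after("<big><b>", $page);
$title = before("</b></big>", $text);

echo $title . "\n";   // prints Lecture 4
```

Because the limit argument is 2, each call stops at the first occurrence of the delimiter, so only the first <big><b> region is extracted.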

When possible, use plain strings instead of regular expressions.  The explode() function has the same effect as split(), but the delimiter is an ordinary string, not a regular expression.  Therefore explode() is much more efficient.
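
A sketch of dividing a page into repeated parts with explode(); the page content and the <hr> delimiter are invented for illustration.

```php
<?php
$page = "header<hr>item one<hr>item two<hr>footer";

$parts = explode("<hr>", $page);   // plain string delimiter, no pattern matching

// The first and last parts are special, so take them off separately.
$first = array_shift($parts);
$last  = array_pop($parts);

// $parts now holds only the repeated middle parts.
echo count($parts) . "\n";         // prints 2
```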

It is very easy for regular expressions to match in many different ways.  Consider the function call split("[[:space:]]+",$words).  Here the delimiter is any run of white space, so you must use split() as opposed to explode(): explode() works only if the delimiter is a single fixed string.
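
The difference is visible on input with mixed white space.  This sketch uses preg_split() in place of split(), which was removed in PHP 7; the input string is made up.

```php
<?php
$words = "the  quick\tbrown\n fox";

// Regular expression delimiter: any run of white space counts once.
$byPattern = preg_split('/[[:space:]]+/', $words);

// Fixed string delimiter: only a single space counts, so the double space
// gives an empty item and the tab and newline are not split at all.
$byString = explode(" ", $words);

echo count($byPattern) . " " . count($byString) . "\n";   // prints 4 4
```

Both calls return four items here, but only the first gives the four words.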
 
 



Copyright (c) by Charles Elkan, 2001.