CSE 134A LECTURE NOTES

October 28, 2002
 
 

ANNOUNCEMENTS

Remember that the midterm is next week, on Wednesday, November 6.  Here are the Spring 2001 midterm and final examination.  Sample solutions are not available currently.  The midterm will count for 1/6 of your overall grade.
 
 

STRING HANDLING

PHP has string functions similar to those available in C and other languages.  But for every new language, you must ask questions about the details of semantics.  For example, with substr(string, pos, length) the second argument is an integer position, where zero means the first character.  A negative position counts back from the end of the string, so -1 means starting at the last character.  Why -1 and not -0?  Because -0 = 0, which already means the first character.
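The position rules for substr() can be checked with a short script.  This is only a sketch; the sample string is arbitrary.

```php
<?php
// Positions count from 0; negative positions count back from the end.
$s = "Nantucket";

$first = substr($s, 0, 1);   // "N": position 0 is the first character
$last  = substr($s, -1, 1);  // "t": position -1 is the last character
$tail  = substr($s, -6);     // "tucket": omitting the length takes the rest

echo "$first $last $tail\n";
```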

When extracting information from a web page, typically you must remove a lot of HTML tags and extraneous characters.  PHP has many useful functions for this, for example trim(string) which removes white space characters from the start and end of its argument.
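As a small illustration, trim() can be combined with strip_tags(), another built-in PHP function, which removes HTML tags.  The fragment of page source below is made up for the example.

```php
<?php
// A made-up fragment of page source, used only for illustration.
$html = "  <td><b>Price:</b> 42 dollars</td>\n";

// Drop the tags first, then the white space left at both ends.
$text = trim(strip_tags($html));

echo $text, "\n";  // prints: Price: 42 dollars
```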

However, functions with fixed functionality such as trim are not enough for information extraction.  We need more sophisticated string handling for extracting specific items of data from web pages.  Regular expressions are the answer.

For a good basic tutorial on regular expressions in PHP see http://www.phpbuilder.com/columns/dario19990616.php3.  Some examples below are from this.
 
 

REGULAR EXPRESSIONS

A regular expression is a pattern.  Some letters stand for themselves, while others are special.  Note that space does stand for itself but a period means any single character that is not a newline.

In HTML source code, newlines have no meaning.  You may want to remove them as a first step in information extraction.
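One possible way to remove newlines is str_replace(), sketched here on a made-up fragment.  Replacing each line break with a space, rather than with nothing, is a choice made for this example so that words on adjacent lines do not run together.

```php
<?php
// Hypothetical two-line fragment of page source.
$html = "<p>There once\r\nwas a man</p>";

// Replace each line break (\r\n, \r, or \n) with a single space.
$flat = str_replace(array("\r\n", "\r", "\n"), " ", $html);

echo $flat, "\n";  // prints: <p>There once was a man</p>
```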

A regular expression pattern matches a string if the pattern can be found anywhere in the string.  So for example the pattern "once" matches the string "There once was".

The special character ^ means "at the start of the string" and $ means "at the end."  Use these to anchor a pattern, so that it matches only at that position rather than anywhere in the string.
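The difference between matching anywhere and matching at an anchor can be tried out as follows.  This sketch uses preg_match() rather than ereg(), because the ereg family was removed in PHP 7.0; preg_match() needs delimiters (here /) around the pattern, but the anchors behave the same way in these examples.

```php
<?php
// preg_match() returns 1 when the pattern is found and 0 when it is not.
$s = "There once was";

$anywhere = preg_match('/once/', $s);    // 1: found in the middle
$at_start = preg_match('/^once/', $s);   // 0: the string does not start with "once"
$at_end   = preg_match('/was$/', $s);    // 1: the string ends with "was"
$whole    = preg_match('/^once$/', $s);  // 0: the whole string is not "once"

echo "$anywhere $at_start $at_end $whole\n";
```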

Escape sequences are character combinations that designate a character that otherwise has a special meaning.  For example an escape sequence is needed to represent a period.  Escape sequences begin with a backslash, e.g. \n and \t and \. and \-.

Characters inside square brackets are alternatives, e.g. [aeiou].  Inside square brackets the character - indicates a range such as 0-9, so to represent it literally you have to escape it or place it last, as in [0-9\.\-] for example.  You can write [ ] to indicate a space explicitly.

Immediately after [ the character ^ means "anything except," so [^0-9] matches any single character that is not a digit.  Double square brackets indicate a special character class, for example [[:alnum:]] and [[:space:]].
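The bracket forms above can be tested one at a time.  This is a sketch using preg_match() as a stand-in for ereg(), which is not available in PHP 7 and later; the test strings are arbitrary.

```php
<?php
// Each call returns 1 if some character of the string fits the brackets.
$vowel  = preg_match('/[aeiou]/', "rhythm");         // 0: no vowel anywhere
$hyphen = preg_match('/^[0-9\.\-]$/', "-");          // 1: the escaped - is literal
$nondig = preg_match('/[^0-9]/', "2002");            // 0: every character is a digit
$alnum  = preg_match('/[[:alnum:]]/', "?!");         // 0: no letter or digit
$space  = preg_match('/[[:space:]]/', "There once"); // 1: contains a space
$blank  = preg_match('/once[ ]was/', "There once was"); // 1: [ ] is a space

echo "$vowel $hyphen $nondig $alnum $space $blank\n";
```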
 
 

MATCHING MULTIPLE CHARACTERS

The bracket expressions above each match a single character.  Curly braces enclosing a number or a range are a postfix operator indicating multiple matches.  For example [[:alpha:]]{2,8} can match two to eight characters each of which must be a letter.
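A quick check of the range form, sketched with preg_match() in place of ereg() (removed in PHP 7.0).  The anchors ^ and $ are added here so that the whole string, not just part of it, must consist of two to eight letters.

```php
<?php
// Two to eight letters, anchored so the whole string must qualify.
$pattern = '/^[[:alpha:]]{2,8}$/';

$too_short = preg_match($pattern, "a");          // 0: only one letter
$ok        = preg_match($pattern, "Fran");       // 1: four letters
$too_long  = preg_match($pattern, "Nantucket");  // 0: nine letters
$not_alpha = preg_match($pattern, "Nan7");       // 0: contains a digit

echo "$too_short $ok $too_long $not_alpha\n";
```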

Note: Does .{3} mean three of the same character, or any three characters?  Try it and see.

Exercise: Write a pattern that will match any real number, similar to the PHP is_double() function.

? is an abbreviation for {0,1}
* is an abbreviation for {0,}
+ is an abbreviation for {1,}
Parentheses ( ) allow "multipliers" like the above to apply to a sequence of characters.  The vertical bar gives alternatives.  Examples:
    (Nant|b)ucket
    Fran|Nan$
    (Fran|Nan)$
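Precedence, i.e. tightness of binding, is important for regular expression operators.  Concatenation and $ bind more tightly than |, so the second pattern means Fran anywhere or Nan at the end, while the third means either name at the end.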
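The three example patterns can be compared directly.  This sketch uses preg_match() as a substitute for ereg(), which PHP 7 and later no longer provide; the test strings are chosen for illustration.

```php
<?php
// (Nant|b)ucket: the parentheses limit the alternatives to the prefix.
$bucket    = preg_match('/(Nant|b)ucket/', "bucket");     // 1
$nantucket = preg_match('/(Nant|b)ucket/', "Nantucket");  // 1

// Fran|Nan$: the $ applies to Nan only.
$loose_fran = preg_match('/Fran|Nan$/', "Frances");  // 1: Fran found anywhere
$loose_nan  = preg_match('/Fran|Nan$/', "Nancy");    // 0: Nan is not at the end

// (Fran|Nan)$: the $ applies to both alternatives.
$tight_fran = preg_match('/(Fran|Nan)$/', "Frances");  // 0: neither name at the end
$tight_nan  = preg_match('/(Fran|Nan)$/', "Nan");      // 1

echo "$bucket $nantucket $loose_fran $loose_nan $tight_fran $tight_nan\n";
```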

Exercise: What does this pattern test for: ^.+@.+\\..+$

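Here \\. is how the escape sequence \. is written inside a PHP string literal, where \\ stands for a single backslash; it matches a literal period.  .+ means any one or more characters.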

Intuitively, this matches anya@anyb.anyc which is close to the syntax of email addresses.  The pattern is not perfect because it also matches strings that are obviously not valid email addresses, for example strings containing more than one @ symbol.
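The behavior of this pattern, including its weakness, can be observed with a few test strings.  This is a sketch that uses preg_match() instead of ereg() (removed in PHP 7.0).  The pattern is written in a single-quoted string, where \. can be typed with one backslash; the addresses are invented.

```php
<?php
// Same pattern as in the exercise, wrapped in / delimiters for preg_match().
$pattern = '/^.+@.+\..+$/';

$plain     = preg_match($pattern, "anya@anyb.anyc");  // 1: the intended form
$two_ats   = preg_match($pattern, "a@b@c.d");         // 1: accepted although invalid
$no_at     = preg_match($pattern, "anya.anyb");       // 0: no @ symbol
$no_period = preg_match($pattern, "anya@anyb");       // 0: no period after the @

echo "$plain $two_ats $no_at $no_period\n";
```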
 
 

USING REGULAR EXPRESSIONS

The PHP function ereg($pattern,$input) returns TRUE if the pattern is found in the string.  Remember to use ^ and $ if you want the pattern to match only if it matches the whole string.  The function eregi() is similar except it ignores upper case/lower case differences.
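On PHP 7 and later, where ereg() and eregi() are no longer available, roughly the same tests can be written with preg_match().  This is a sketch of that substitution, not the original functions: the pattern is wrapped in / delimiters, and a trailing i plays the role of eregi().

```php
<?php
$input = "Nantucket";

$exact_case  = preg_match('/nantucket/', $input);   // 0: the case differs
$ignore_case = preg_match('/nantucket/i', $input);  // 1: i ignores case
$part        = preg_match('/nan/i', $input);        // 1: found inside the string
$whole       = preg_match('/^nan$/i', $input);      // 0: not the whole string

echo "$exact_case $ignore_case $part $whole\n";
```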

The function split($pattern,$input) returns an array of strings that is the result of dividing up its second argument into pieces, using matches to the first argument as a delimiter.  For example:

list ($month, $day, $year) = split ('[/.-]', $date);
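Note that in this example the slash, period, and hyphen are not escape sequences: they stand for themselves.  The hyphen is literal because it comes last inside the brackets.  The list construct assigns the elements of the resulting array to the listed variables in order.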

If the delimiter is not found or the delimiter is empty, then the first element of the array (with subscript 0) gets the whole input.  If the delimiter is repeated consecutively, the array will include empty strings.
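The cases just described can be seen with preg_split(), used here as a stand-in for split(), which was removed in PHP 7.0.  This is a sketch: the dates are invented, and # is chosen as the pattern delimiter so that the slash inside the brackets needs no escaping.

```php
<?php
// Normal case: three pieces separated by slashes.
list($month, $day, $year) = preg_split('#[/.-]#', "10/28/2002");

// Repeated delimiter: the middle piece is an empty string.
$repeated = preg_split('#[/.-]#', "10//2002");

// Delimiter not found: element 0 holds the whole input.
$missing = preg_split('#[/.-]#', "20021028");

echo "$month $day $year\n";
echo count($repeated), " ", count($missing), "\n";
```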

An optional third argument says how many items to return at most.  The last item then contains the whole remainder of the string.  This is useful for writing before() and after() functions that take a specific part of a web page.  These functions are useful for information extraction, for example:

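function after($pattern, $text) {
    // An empty pattern means "after nothing": return the whole text.
    if ($pattern == "") return $text;
    // Split at the first match only; $s[1] is everything after it.
    $s = split($pattern, $text, 2);
    // If the pattern is not found, there is nothing after it.
    return isset($s[1]) ? $s[1] : "";
}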
Note that this function returns all of $text after the first occurrence of the pattern.
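A companion before() function can be written the same way.  The sketch below is one possible version, not the one from lecture.  It assumes PHP 7 or later, so it uses preg_split() instead of split(), and both functions therefore expect a valid, non-empty pattern with delimiters, such as '#<td>#'.  The table row is made up.

```php
<?php
// Sketch only: both functions expect a delimited pattern such as '#<td>#'.

// Everything after the first match, or "" if there is no match.
function after($pattern, $text) {
    $s = preg_split($pattern, $text, 2);
    return isset($s[1]) ? $s[1] : "";
}

// Everything before the first match, or all of $text if there is no match.
function before($pattern, $text) {
    $s = preg_split($pattern, $text, 2);
    return $s[0];
}

// Extract the contents of the first table cell in a made-up row.
$row   = "<tr><td>42</td></tr>";
$value = before('#</td>#', after('#<td>#', $row));

echo $value, "\n";  // prints: 42
```

Combining the two functions in this way picks out the text between two landmarks on a page.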
 
 




Copyright (c) by Charles Elkan, 2002.