CSE134A LECTURE NOTES

October 22, 2001
 
 

ANNOUNCEMENTS

The reports for the current project are due today, Monday, October 22.  You must use turnin to submit your code before 4:30pm Tuesday.  The extension is because PHP was down on Friday evening.  Both PHP and the server for spell-checking have been working fine since then.

The third project is due three weeks from today.
 
 

DATABASE DESIGN AND CONSTRAINTS

For any database, we usually possess some meta-knowledge about properties that the actual data should always satisfy.  For example:

Ideally, we would be able to state all these constraints in some formal language.  The database server would then give an error message whenever an insert, delete, or update operation would cause any constraint to be violated.

Different database servers can enforce more or fewer types of constraint.  For example, MySQL can enforce the first constraint by declaring the id field NOT NULL in the messages table.  The fourth constraint can be enforced with a primary key declaration on the pair of fields (name,address).

The second constraint is accommodated by not declaring id to be unique for the messages table.  The third constraint is accommodated by making the messages and person tables separate.  This constraint would be violated if we used one big table.

The last constraint is an example of a functional dependency. Unlike more sophisticated database systems, MySQL has no features to enforce functional dependencies.

Guideline: Before choosing a database design, write down what constraints you know should be true.  Then select a design that enforces these constraints.
 
 

STRING HANDLING

PHP has string functions similar to those available in C and other languages.  But for every new language, you must ask questions about the details of semantics.  For example, with substr(string, pos, length) the second argument is an integer position, where zero means the first character.  A value of -1 means starting at the last character.  Why -1 rather than -0? Because -0 = 0, which already means the first character.
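A few calls make these semantics concrete.  This is a minimal sketch; the sample string and variable names are chosen only for illustration.

```php
<?php
// Sample string chosen only for illustration.
$s = "abcdef";

$first3 = substr($s, 0, 3);   // position 0 is the first character
$last1  = substr($s, -1);     // a negative position counts from the end
$middle = substr($s, -3, 2);  // start three from the end, take two characters

echo $first3, "\n", $last1, "\n", $middle, "\n";
```

When the length argument is omitted, as in the second call, substr() returns everything from the starting position to the end of the string.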

When extracting information from a web page, typically you must remove a lot of HTML tags and extraneous characters.  PHP has many useful functions for this, for example trim(string) which removes white space characters from the start and end of its argument.
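As a rough illustration, trim() can be combined with strip_tags(), another standard PHP function that is not discussed in these notes.  The HTML fragment below is invented for the example.

```php
<?php
// Invented HTML fragment standing in for a piece of a downloaded page.
$html = "  <td><b> Price: 42 </b></td>\n";

$text  = strip_tags($html);  // drops the tags, keeps the text between them
$clean = trim($text);        // removes white space at both ends only

echo "[", $clean, "]\n";
```

Note that trim() leaves white space in the middle of the string untouched.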

However, functions with fixed functionality such as trim are not enough for information extraction.  We need more sophisticated string handling for extracting specific items of data from web pages.  Regular expressions are the answer.
 
 

REGULAR EXPRESSIONS

A regular expression is a pattern.  Some characters stand for themselves, while others are special.  Note that a space stands for itself, but a period means any single character that is not a newline.

In HTML source code, a newline is treated like any other white space (except inside elements such as pre).  You may want to remove newlines as a first step in information extraction.
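One way to do this is sketched below with str_replace().  The fragment is invented, and replacing each line break with a space rather than deleting it is a design choice made here so that words on adjacent lines do not run together.

```php
<?php
// Invented two-line fragment of HTML source.
$html = "<p>There once\r\nwas a man</p>\n";

// Replace Windows, old Mac, and Unix line breaks with a single space each.
// "\r\n" is listed first so that it becomes one space, not two.
$flat = trim(str_replace(array("\r\n", "\r", "\n"), " ", $html));

echo $flat, "\n";
```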

A regular expression pattern matches a string if the pattern can be found anywhere in the string.  So for example the pattern "once" matches the string "There once was".

The special character ^ means "at the start of the string" and $ means "at the end."  Use these to anchor a pattern, so that it matches only at that position and not anywhere in the string.
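The effect of anchoring can be checked with a short script.  This sketch uses preg_match() with / as the pattern delimiter, an assumption made so that it runs on PHP 7 and later, where the older ereg() family is no longer available; preg_match() returns 1 for a match and 0 otherwise.

```php
<?php
$input = "There once was";

// Unanchored: the pattern may be found anywhere in the string.
$anywhere = preg_match('/once/', $input);    // 1

// Anchored at the start: "once" is not at the start, "There" is.
$onceAtStart  = preg_match('/^once/', $input);   // 0
$thereAtStart = preg_match('/^There/', $input);  // 1

// Anchored at the end.
$wasAtEnd = preg_match('/was$/', $input);    // 1

echo $anywhere, $onceAtStart, $thereAtStart, $wasAtEnd, "\n";
```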

Escape sequences are character combinations that designate a character that otherwise has a special meaning or cannot be typed directly.  For example an escape sequence is needed to represent a period.  Escape sequences begin with a backslash, e.g. \n and \t and \. and \-.

Characters inside square brackets are alternatives, e.g. [aeiou].  Inside square brackets the character - is special, so you have to escape it to represent it literally, as in [0-9\.\-] for example.  You can write [ ] to indicate a space explicitly.

Immediately after [ the character ^ means "anything except."  Double square brackets indicate a special character class, for example [[:alnum:]] and [[:space:]].
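The bracket forms can be tried out as follows.  This is a sketch using preg_match(), which on current PHP replaces the removed ereg() and accepts the same bracket syntax; the test strings are invented.

```php
<?php
// [aeiou] matches any one vowel.
$skyHasVowel  = preg_match('/[aeiou]/', 'sky');    // 0
$skipHasVowel = preg_match('/[aeiou]/', 'skip');   // 1

// [^0-9] matches any one character that is not a digit.
$nonDigit = preg_match('/[^0-9]/', '2001');        // 0

// Character classes: a tab counts as white space, a hyphen is not alphanumeric.
$hasSpace = preg_match('/[[:space:]]/', "a\tb");   // 1
$isAlnum  = preg_match('/^[[:alnum:]]$/', '-');    // 0

// An escaped hyphen inside brackets is a literal hyphen.
$isNumChar = preg_match('/^[0-9.\-]$/', '-');      // 1

echo $skyHasVowel, $skipHasVowel, $nonDigit, $hasSpace, $isAlnum, $isNumChar, "\n";
```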
 
 

MATCHING MULTIPLE CHARACTERS

The bracket expressions above are patterns that match a single character.  Curly braces enclosing a number or a range are a postfix operator indicating multiple matches.  For example [[:alpha:]]{2,8} can match two to eight characters, each of which must be a letter.

Note: Does .{3} mean three of the same character, or any three characters?  Try it and see.
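A quick experiment along these lines is sketched below, again with preg_match() standing in for ereg() so that it runs on current PHP.  The test strings are arbitrary.

```php
<?php
// Does .{3} need three copies of one character, or any three characters?
$different = preg_match('/^.{3}$/', 'abc');   // 1
$same      = preg_match('/^.{3}$/', 'aaa');   // 1

// Two to eight letters, anchored so the whole string must match.
$ok       = preg_match('/^[[:alpha:]]{2,8}$/', 'Elkan');      // 1
$tooShort = preg_match('/^[[:alpha:]]{2,8}$/', 'x');          // 0
$tooLong  = preg_match('/^[[:alpha:]]{2,8}$/', 'abcdefghi');  // 0

echo $different, $same, $ok, $tooShort, $tooLong, "\n";
```

Both of the first two tests succeed, so the repetition applies to the pattern . and not to whatever character it happened to match first.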

Exercise: Write a pattern that will match any real number, similar to the PHP is_double() function.

? is an abbreviation for {0,1}
* is an abbreviation for {0,}
+ is an abbreviation for {1,}

Parentheses ( ) allow "multipliers" like the above to apply to a sequence of characters.  The vertical bar gives alternatives.  Examples:
    (Nant|b)ucket
    Fran|Nan$
    (Fran|Nan)$
Precedence, i.e. tightness of binding, is important for regular expression operators.  Sequencing, including the anchors ^ and $, binds more tightly than |, so Fran|Nan$ means "Fran anywhere, or Nan at the end" while (Fran|Nan)$ means "Fran or Nan at the end."
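The three example patterns can be checked directly.  The sketch below uses preg_match() because ereg() is not available on PHP 7 and later; the input strings are invented.

```php
<?php
// (Nant|b)ucket: the alternatives apply only inside the parentheses.
$nantucket = preg_match('/(Nant|b)ucket/', 'Nantucket');  // 1
$bucket    = preg_match('/(Nant|b)ucket/', 'bucket');     // 1
$tucket    = preg_match('/(Nant|b)ucket/', 'tucket');     // 0

// Fran|Nan$: "Fran" anywhere, or "Nan" at the end of the string.
$frances1 = preg_match('/Fran|Nan$/', 'Frances');    // 1
$nancy1   = preg_match('/Fran|Nan$/', 'Nancy');      // 0

// (Fran|Nan)$: either name, but only at the end of the string.
$frances2 = preg_match('/(Fran|Nan)$/', 'Frances');  // 0
$fran2    = preg_match('/(Fran|Nan)$/', 'Fran');     // 1

echo $nantucket, $bucket, $tucket, $frances1, $nancy1, $frances2, $fran2, "\n";
```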

Exercise: What does this pattern test for: ^.+@.+\\..+$

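Here \\. is written as it would appear inside a PHP double-quoted string: the string escape \\ produces a single backslash, so the regular expression sees \. which is a literal period.  The pattern .+ means one or more of any character.

Intuitively, this matches anya@anyb.anyc, which is close to the syntax of email addresses.  The pattern is not perfect because it also matches strings that are obviously not valid email addresses, for example strings containing more than one @ symbol.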
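The behavior of this pattern, including its imperfection, can be seen in a short test.  The sketch assumes the pattern is stored in a double-quoted PHP string and passed to preg_match() (used here in place of ereg(), which current PHP no longer provides); the addresses are made up.

```php
<?php
// Inside double quotes, "\\." becomes the two characters \. in the pattern.
$pattern = "^.+@.+\\..+$";
$regex = '/' . $pattern . '/';   // add the delimiters that preg_match() requires

$plain    = preg_match($regex, 'anya@anyb.anyc');  // 1
$twoAts   = preg_match($regex, 'a@b@c.d');         // 1, although not a valid address
$noAt     = preg_match($regex, 'nobody');          // 0
$noPrefix = preg_match($regex, '@anyb.anyc');      // 0, .+ needs at least one character

echo $plain, $twoAts, $noAt, $noPrefix, "\n";
```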
 
 

USING REGULAR EXPRESSIONS

The PHP function ereg($pattern,$input) returns TRUE if the pattern is found in the string.  Remember to use ^ and $ if you want the pattern to match the whole string.  The function eregi()  ignores upper case/lower case differences.
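On PHP 7 and later the ereg() and eregi() functions no longer exist, so the sketch below shows the same ideas with preg_match(), which needs delimiters around the pattern and uses a trailing i flag for case-insensitive matching.  The input string and variable names are only illustrative.

```php
<?php
$input = "CSE134A";

// Case matters by default; the i flag plays the role of eregi().
$caseSensitive   = preg_match('/cse/', $input);    // 0
$caseInsensitive = preg_match('/cse/i', $input);   // 1

// Without anchors the pattern may match just part of the string.
$partial = preg_match('/[0-9]+/', $input);         // 1

// With ^ and $ the pattern must account for the whole string.
$wholeDigits = preg_match('/^[0-9]+$/', $input);               // 0
$wholeCourse = preg_match('/^[A-Z]+[0-9]+[A-Z]?$/', $input);   // 1

echo $caseSensitive, $caseInsensitive, $partial, $wholeDigits, $wholeCourse, "\n";
```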

The function split($pattern,$input) returns an array of strings that is the result of dividing up its second argument into pieces, using matches to the first argument as a delimiter.

split has many applications.  One is to divide a long page into repeated parts, where each part can be processed separately.  Often the first and last parts are special.

If the delimiter is not found, the first element of the array gets the whole input.  If the delimiter is repeated consecutively, the array will include empty strings.
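These cases can be observed with preg_split(), which is used in this sketch as a stand-in for split() because split() was removed in PHP 7.  The inputs are invented.

```php
<?php
// Consecutive delimiters produce an empty string between them.
$parts = preg_split('/,/', 'a,b,,c');

// If the delimiter never occurs, the whole input is the only element.
$whole = preg_split('/;/', 'a,b');

// Dividing a page into repeated parts: the first and last pieces are special.
$list  = "<ul><li>one</li><li>two</li></ul>";
$items = preg_split('/<li>/', $list);

echo implode("|", $parts), "\n";
echo implode("|", $items), "\n";
```

Here the first item is the text before the first list element, and the last item still carries the closing tag of the list, so both need separate handling.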

An optional third argument gives the maximum number of items to return.  The last item then contains the whole remainder of the string.  This is useful for writing before() and after() functions that take a specific part of a web page.  These functions are useful for information extraction, for example:

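function after($pattern, $text) {
    // An empty pattern means there is no delimiter: return the whole text.
    if ($pattern == "") return $text;
    // Divide the text into at most two pieces around the first match.
    $s = split($pattern, $text, 2);
    // If the pattern was not found, there is nothing after it.
    if (count($s) < 2) return "";
    return $s[1];
}
Note that this function returns all of $text after the first occurrence of the pattern, or an empty string if the pattern does not occur.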
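A matching before() can be written the same way.  The sketch below is one possible version for current PHP, where split() is no longer available: it uses preg_split(), so patterns need delimiters, and the names before_match() and after_match() are hypothetical, chosen to avoid confusion with the after() function above.

```php
<?php
// Everything after the first match of $regex, or "" if there is no match.
function after_match($regex, $text) {
    $s = preg_split($regex, $text, 2);
    return count($s) > 1 ? $s[1] : "";
}

// Everything before the first match of $regex, or all of $text if no match.
function before_match($regex, $text) {
    $s = preg_split($regex, $text, 2);
    return $s[0];
}

// Invented page source used only to exercise the two functions.
$page  = "<html><title>CSE134A notes</title><body>text</body></html>";
$title = before_match('/<\/title>/', after_match('/<title>/', $page));

echo $title, "\n";
```

Nesting the two calls picks out the text between a pair of markers, which is a common step in information extraction.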

When possible, use plain strings instead of regular expressions.  The explode() function has the same effect as split(), but the delimiter is an ordinary string, not a regular expression.  Therefore explode() is much more efficient.
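A brief sketch of explode() follows, with invented input.  Its optional third argument limits the number of pieces in the same way as for split().

```php
<?php
$line = "name=Charles=Elkan";

// Split on every occurrence of the delimiter string.
$all = explode("=", $line);

// With a limit of 2, the last piece holds the rest of the string.
$two = explode("=", $line, 2);

// The delimiter is a plain string, so a period has no special meaning.
$dotted = explode(".", "www.ucsd.edu");

echo implode("|", $all), "\n", implode("|", $two), "\n", implode("|", $dotted), "\n";
```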
 



Copyright (c) by Charles Elkan, 2001.