The third
project is due three weeks from today.
Different database servers can enforce more or fewer types of constraint. For example, MySQL can enforce the first constraint by saying that the id field is not null in the messages table. The fourth constraint can be enforced with a primary key declaration on the pair of fields (name,address).
The second constraint is accommodated by not declaring id to be unique for the messages table. The third constraint is accommodated by making the messages and person tables separate. This constraint would be violated if we used one big table.
The last constraint is an example of a functional dependency. Unlike more sophisticated database systems, MySQL has no features to enforce functional dependencies.
Guideline: Before choosing a database design, write down what constraints
you know should be true. Then select a design that enforces these
constraints.
When extracting information from a web page, typically you must remove a lot of HTML tags and extraneous characters. PHP has many useful functions for this, for example trim(string) which removes white space characters from the start and end of its argument.
However, functions with fixed functionality such as trim are
not enough for information extraction. We need more sophisticated
string handling for extracting specific items of data from web pages.
Regular expressions are the answer.
In HTML source code, newlines have no meaning. You may want to remove them as a first step in information extraction.
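As a small sketch of this first cleaning step, the snippet below replaces newlines with spaces and then applies trim(). The sample $html string is made up for illustration, and substituting a space (rather than deleting the newline outright) is an assumption chosen so that words separated only by a line break do not run together.

```php
<?php
// Hypothetical sample input; a real page would come from a fetch.
$html = "\n  <td>\nAlice\r\n</td>  \n";

// Replace every carriage return and newline with a single space.
$flat = str_replace(array("\r", "\n"), " ", $html);

// trim() then removes white space from the start and end only.
$clean = trim($flat);

echo $clean, "\n";
```

Note that trim() leaves interior white space alone, so the spaces around Alice are still present in $clean.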
A regular expression pattern matches a string if the pattern can be found anywhere in the string. So, for example, the pattern once matches the string "There once was".
The special character ^ means "at the start of the string" and $ means "at the end." Use these when a pattern should match only at a specific position rather than anywhere in the string.
Escape sequences are character combinations that designate a character that otherwise has a special meaning. For example, an escape sequence is needed to represent a period. Escape sequences begin with a backslash, e.g. \n and \t and \. and \-.
Characters inside square brackets are alternatives, e.g. [aeiou]. Inside square brackets the character - is special, so you have to escape it to represent it literally, as in [0-9\.\-] for example. You can write [ ] to indicate a space explicitly.
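The effect of the anchors can be checked with a short sketch. It assumes PHP's preg_match() function, which is not named in these notes and which requires the pattern to be wrapped in delimiters (here /); the patterns themselves are the ones discussed above.

```php
<?php
$s = "There once was";

// Unanchored: the pattern may be found anywhere in the string.
$anywhere = preg_match('/once/', $s);        // 1

// ^ ties the match to the start of the string.
$onceAtStart  = preg_match('/^once/', $s);   // 0
$thereAtStart = preg_match('/^There/', $s);  // 1

// $ ties the match to the end of the string.
$wasAtEnd  = preg_match('/was$/', $s);       // 1
$onceAtEnd = preg_match('/once$/', $s);      // 0
```

preg_match() returns 1 when the pattern matches and 0 when it does not.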
Immediately after [ the character ^ means "anything
except." Double square brackets indicate a special character class,
for example [[:alnum:]] and [[:space:]]
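The bracket forms above can be tried out as follows. This is only a sketch: it assumes preg_match() with / delimiters, and the test strings are invented for illustration.

```php
<?php
// [aeiou] matches any single vowel; "rhythm" contains none.
$hasVowel = preg_match('/[aeiou]/', "rhythm");        // 0

// [^0-9] matches any character except a digit.
$allDigits = preg_match('/[^0-9]/', "2024");          // 0
$oneLetter = preg_match('/[^0-9]/', "20a4");          // 1

// [0-9\.\-] matches a digit, a literal period, or a literal hyphen.
$numeric = preg_match('/^[0-9\.\-]+$/', "-3.14");     // 1

// Special character classes go inside a second pair of brackets.
$alnum = preg_match('/^[[:alnum:]]+$/', "abc123");    // 1
$space = preg_match('/[[:space:]]/', "no_spaces");    // 0
```

The patterns are written in single quotes so that PHP passes the backslashes through to the regular expression unchanged.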
Note: Does .{3} mean three of the same character, or any three characters? Try it and see.
Exercise: Write a pattern that will match any real number, similar to the PHP is_double() function.
? is an abbreviation for {0,1}
* is an abbreviation for {0,}
+ is an abbreviation for {1,}
Parentheses ( ) allow "multipliers" like the above to apply to a sequence of characters. The vertical bar gives alternatives. Example:
(Nant|b)ucket
Precedence, i.e. tightness of binding, is important for regular expression operators. The anchor $ binds more tightly than |, so the first pattern below matches Fran anywhere or Nan at the end of the string, while the second requires either name at the end:
Fran|Nan$
(Fran|Nan)$
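The difference between the two anchored patterns shows up on a string that contains Fran but does not end with it. The sketch below assumes preg_match() with / delimiters; the test strings are invented.

```php
<?php
$s = "Fran lives here";

// Read as: "Fran" anywhere, OR "Nan" at the end. Fran is present.
$loose = preg_match('/Fran|Nan$/', $s);             // 1

// Parentheses group the alternatives, so $ applies to both.
$strict = preg_match('/(Fran|Nan)$/', $s);          // 0
$strictNan = preg_match('/(Fran|Nan)$/', "ask Nan"); // 1

// Grouping also limits how far | reaches inside a longer pattern.
$nantucket = preg_match('/(Nant|b)ucket/', "Nantucket"); // 1
$bucket    = preg_match('/(Nant|b)ucket/', "bucket");    // 1
```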
Exercise: What does this pattern test for: ^.+@.+\\..+$
Here \\. is a literal period and .+ means any one or more characters.
Intuitively, this matches anya@anyb.anyc which is close to
the syntax of email addresses. The pattern is not perfect because
it also matches
strings that are obviously not valid email addresses, for example strings
containing more than one @ symbol.
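A quick way to see both the intended matches and the weakness is to wrap the pattern in a function. The helper name looks_like_email() is hypothetical, and the sketch assumes preg_match() with / delimiters. Because the pattern is in single quotes, the literal period is written \. rather than \\.

```php
<?php
// Hypothetical helper: true if $s has the rough shape something@something.something
function looks_like_email($s) {
    return preg_match('/^.+@.+\..+$/', $s) === 1;
}

$ok        = looks_like_email("anya@anyb.anyc");   // true
$noAt      = looks_like_email("no-at-sign.com");   // false
$noDot     = looks_like_email("a@b");              // false
$doubleAt  = looks_like_email("a@@b.c");           // true, although not a valid address
```

The last case illustrates the imperfection: the second @ is absorbed by .+ and the string is accepted.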
The function split($pattern,$input) returns an array of strings that is the result of dividing up its second argument into pieces, using matches to the first argument as a delimiter.
split has many applications. One is to divide a long page into repeated parts, where each part can be processed separately. Often the first and last parts are special.
If the delimiter is not found, the first element of the array gets the whole input. If the delimiter is repeated consecutively, the array will include empty strings.
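These two edge cases are easy to confirm. Since split() is no longer available in PHP 7 and later, this sketch assumes preg_split() as a stand-in; it takes the pattern first and the input second, like split(), but the pattern must be wrapped in delimiters. The sample strings are invented.

```php
<?php
// Ordinary case: three pieces.
$parts = preg_split('/,/', "a,b,c");   // array("a", "b", "c")

// Delimiter not found: one element holding the whole input.
$whole = preg_split('/;/', "a,b,c");   // array("a,b,c")

// Consecutive delimiters: an empty string appears between them.
$gaps = preg_split('/,/', "a,,c");     // array("a", "", "c")

print_r($gaps);
```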
An optional third argument says how many items to return. The last item then contains the whole remainder of the string. This is useful for writing before() and after() functions that take a specific part of a web page. These functions are useful for information extraction, for example:
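function after($pattern, $text) {
    if ($pattern == "") return $text;
    // Split into at most two pieces around the first match.
    $s = split($pattern, $text, 2);
    // If the pattern is not found, there is no second piece.
    return isset($s[1]) ? $s[1] : "";
}
Note that this function returns all of $text after the first occurrence of the pattern.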
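A matching before() function can be sketched along the same lines. Everything here is an assumption rather than the course's own code: the sketch uses preg_split() so that it runs on current PHP, the helper after_preg() is a hypothetical preg-based variant of after(), the patterns are passed with their delimiters (here #), and the sample $page string is invented.

```php
<?php
// Hypothetical counterpart to after(): all of $text before the first match.
// $pattern is a full PCRE pattern including delimiters, e.g. '#</b>#'.
function before($pattern, $text) {
    // With a limit of 2, piece 0 is everything before the first match,
    // or the whole of $text if the pattern is not found.
    $s = preg_split($pattern, $text, 2);
    return $s[0];
}

// Hypothetical preg-based variant of after() for use in this sketch.
function after_preg($pattern, $text) {
    $s = preg_split($pattern, $text, 2);
    return isset($s[1]) ? $s[1] : "";
}

// Extract the text between <b> and </b> from a made-up page fragment.
$page  = "<p>Price: <b>42.50</b> dollars</p>";
$price = before('#</b>#', after_preg('#<b>#', $page));

echo $price, "\n";
```

Nesting the two calls first discards everything up to the opening tag and then keeps only what precedes the closing tag.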
When possible, use plain strings instead of regular expressions.
The explode() function has the same effect as split(),
but the delimiter is an ordinary string, not a regular expression.
Therefore explode() is much more efficient.
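A brief sketch of explode() on an invented sample string follows. It also shows a practical side benefit: a character such as |, which is special in a regular expression, needs no escaping when it is an ordinary string delimiter.

```php
<?php
$row = "Alice|Bob||Carol";

// The delimiter is a plain string, so "|" is taken literally.
$fields = explode("|", $row);      // array("Alice", "Bob", "", "Carol")

// The optional third argument limits the number of pieces;
// the last piece holds the remainder of the string.
$two = explode("|", $row, 2);      // array("Alice", "Bob||Carol")

print_r($fields);
```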