About the report: We can't provide a past report because this is a brand-new
class. You are welcome to bring your outline or your draft to a TA office
hour to discuss it and get ideas for improving it. The report should be
detailed enough to be interesting, but it should also be five pages at
most. You should use the report-writing skills that you learned in your
general education and lab science classes. These skills are important
for graduate school, and also for software jobs with top companies.
When extracting information from a web page, typically you must remove a lot of HTML tags and extraneous characters. PHP has many useful functions for this, for example trim(string) which removes white space characters from the start and end of its argument.
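As a small illustration of this kind of cleanup, here is a minimal sketch. The input string is a made-up fragment, and strip_tags() is one more built-in PHP function with fixed functionality, chosen here as an example alongside trim().

```php
<?php
// Hypothetical fragment of a fetched page (not from any real site).
$raw = "  <td> <b>Nantucket</b> </td>\n";

// strip_tags() removes the HTML tags but keeps the text between them.
$noTags = strip_tags($raw);

// trim() then removes white space from the start and end of the string.
$clean = trim($noTags);

echo "[$clean]\n";
```

The brackets in the output make it easy to see that no white space is left at either end.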
However, functions with fixed functionality such as trim are
not enough for information extraction. We need more sophisticated
string handling for extracting info from web pages. Regular expressions
are the answer.
In HTML source code, a newline is treated like any other white space (except inside elements such as <pre>). You may want to remove newlines as a first step in information extraction.
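One possible way to do this is sketched below, using the plain-string function str_replace(). The sample text is invented, and replacing each newline with a space (rather than deleting it) is an assumption made so that words on adjacent lines do not run together.

```php
<?php
// Hypothetical page fragment with both Windows and Unix line endings.
$html = "<p>There once\r\nwas a man\nfrom Nantucket</p>";

// Replace every kind of line break with a single space.
// "\r\n" is listed first so that it becomes one space, not two.
$oneLine = str_replace(array("\r\n", "\r", "\n"), " ", $html);

echo $oneLine, "\n";
```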
A regular expression pattern matches a string if the pattern can be found anywhere in the string. So for example the pattern once matches the string "There once was".
The special character ^ means "at the start of the string" and $ means "at the end."
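The following sketch shows these rules in action. It is written with preg_match() from PHP's PCRE functions, because the ereg() family was removed in PHP 7; that substitution is an assumption of this example, and PCRE patterns need the extra / delimiters shown.

```php
<?php
// preg_match() returns 1 when the pattern is found and 0 when it is not.
$s = "There once was";

$anywhere = preg_match('/once/', $s);    // found in the middle of the string
$atStart  = preg_match('/^once/', $s);   // "once" is not at the start
$startsOk = preg_match('/^There/', $s);  // "There" is at the start
$atEnd    = preg_match('/was$/', $s);    // "was" is at the end

var_dump($anywhere, $atStart, $startsOk, $atEnd);
```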
Escape sequences begin with a backslash, e.g. \n and \t and \. and \-.
Characters inside square brackets are alternatives, e.g. [aeiou]. Inside [] the character - is special: it indicates a range, e.g. [0-9\.\-]. You can write [ ] to indicate a space explicitly.
Immediately after [ the character ^ means "anything
except." Double square brackets indicate a special character class,
for example [[:alnum:]] and [[:space:]].
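A few bracket expressions can be tried out as follows. This is only a sketch: it uses preg_match() (PCRE) instead of the ereg() functions, which no longer exist in PHP 7 and later, and the test strings are invented.

```php
<?php
// Each call returns 1 for a match and 0 for no match.
$hasVowel   = preg_match('/[aeiou]/', 'rhythm');          // no vowel present
$isNumeric  = preg_match('/^[0-9.\-]+$/', '-3.14');       // digits, period, minus
$noDigits   = preg_match('/^[^0-9]+$/', 'abc');           // ^ after [ negates
$alnumOnly  = preg_match('/^[[:alnum:]]+$/', 'abc123');   // letters and digits
$alnumSpace = preg_match('/^[[:alnum:]]+$/', 'abc 123');  // the space is rejected
$hasSpace   = preg_match('/[[:space:]]/', "a\tb");        // a tab is white space

var_dump($hasVowel, $isNumeric, $noDigits, $alnumOnly, $alnumSpace, $hasSpace);
```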
Note: Does .{3} mean three of the same character, or any three characters? Try it and see.
Exercise: Write a pattern that will match any real number, similar to the PHP is_double() function.
? is an abbreviation for {0,1}
* is short for {0,}
+ is short for {1,}
Parentheses ( ) allow these multipliers to apply to a sequence of characters.
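The multipliers can be checked with a short experiment like the one below. It is a sketch that assumes preg_match() (PCRE) as a stand-in for the ereg() functions; the patterns and strings are made up for illustration.

```php
<?php
// ? means zero or one: the u is optional.
$color  = preg_match('/^colou?r$/', 'color');
$colour = preg_match('/^colou?r$/', 'colour');

// * means zero or more, + means one or more.
$starEmpty = preg_match('/^ab*$/', 'a');   // zero b's are allowed
$plusEmpty = preg_match('/^ab+$/', 'a');   // at least one b is required

// Without parentheses + applies only to b; with them, to the sequence ab.
$plusOnB  = preg_match('/^ab+$/', 'abbb');
$plusOnAb = preg_match('/^(ab)+$/', 'ababab');
$mixed    = preg_match('/^(ab)+$/', 'abbb');

var_dump($color, $colour, $starEmpty, $plusEmpty, $plusOnB, $plusOnAb, $mixed);
```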
The vertical bar gives alternatives. Examples:
(Nant|b)ucket
Fran|Nan$
(Fran|Nan)$
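Precedence, i.e. tightness of binding, is important for RE operators.
The vertical bar binds less tightly than everything else, including $.
So Fran|Nan$ matches Fran anywhere or Nan at the end, while
(Fran|Nan)$ requires either name at the end.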
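You can check how | and $ interact with the three example patterns. The sketch below assumes preg_match() (PCRE) in place of ereg(), which is not available in PHP 7 and later; the test strings are invented.

```php
<?php
// (Nant|b)ucket: the alternatives are Nant and b, each followed by ucket.
$nantucket = preg_match('/(Nant|b)ucket/', 'Nantucket');
$bucket    = preg_match('/(Nant|b)ucket/', 'bucket');

// Fran|Nan$: either Fran anywhere, or Nan at the end of the string.
$frank = preg_match('/Fran|Nan$/', 'Frank');   // Fran is found at the start
$nancy = preg_match('/Fran|Nan$/', 'Nancy');   // Nan is not at the end

// (Fran|Nan)$: either name, but only at the end of the string.
$frankEnd = preg_match('/(Fran|Nan)$/', 'Frank');
$nanEnd   = preg_match('/(Fran|Nan)$/', 'Nan');

var_dump($nantucket, $bucket, $frank, $nancy, $frankEnd, $nanEnd);
```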
Exercise: What does this pattern test for: ^.+@.+\\..+$
Here \\. is a literal period (inside a PHP double-quoted string, \\
stands for a single backslash) and .+ means any one or more characters.
Intuitively, this matches anya@anyb.anyc which is close to
the syntax of email addresses. The pattern is not perfect because
it also matches the string @@... for example.
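The behavior of this pattern, including its weakness, can be seen in a quick test. This sketch assumes preg_match() (PCRE) rather than ereg(), and it writes the pattern in single quotes so that one backslash is enough; the addresses are invented.

```php
<?php
// The email-like pattern from the exercise, with PCRE delimiters added.
$pattern = '/^.+@.+\..+$/';

$good  = preg_match($pattern, 'ann@example.com');  // looks like an address
$noDot = preg_match($pattern, 'ann@example');      // no period after the @
$junk  = preg_match($pattern, '@@...');            // matches, although it is not an address

var_dump($good, $noDot, $junk);
```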
eregi() ignores upper case/lower case differences.
The split($pattern,$input) function gives an array of strings.
split has many applications. One is to divide a long page into repeated parts, where each part can be processed separately. Often the first and last parts are special.
If the delimiter is not found, the first element of the array gets the whole input. If the delimiter is repeated, the array will include empty strings.
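Both special cases are easy to reproduce. The sketch below uses preg_split(), the PCRE counterpart of split(), on the assumption that you are running PHP 7 or later, where split() is no longer available; the input strings are invented.

```php
<?php
// A repeated delimiter produces an empty string between the two commas.
$repeated = preg_split('/,/', 'a,b,,c');

// When the delimiter never occurs, the whole input is the only element.
$missing = preg_split('/,/', 'no delimiter here');

print_r($repeated);
print_r($missing);
```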
An optional third argument says how many items to return at most. The last item then contains the whole remainder of the string. This is the basis of the before() and after() functions, which extract a specific variable part of a web page. For example:
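// Return the part of $text that follows the first match of $pattern.
function after($pattern, $text) {
    if ($pattern == "") return $text;
    $s = split($pattern, $text, 2);
    // If the pattern is not found, there is nothing after it.
    if (count($s) < 2) return "";
    return $s[1];
}

// Return the identifier (letters, then digits) that follows $pattern.
function idafter($pattern, $text) {
    $idpattern = "^([[:alpha:]]*[[:digit:]]*)";
    $text = after($pattern, $text);
    if (!eregi($idpattern, $text, $id))
        printerror("No identifier found in $text");
    return $id[1];
}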
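To see the limit argument and the identifier pattern at work on current PHP versions, here is a sketch that swaps in preg_split() and preg_match() for split() and eregi(), which were removed in PHP 7. The sample text and the variable names are hypothetical.

```php
<?php
// With a limit of 2, everything after the first delimiter stays together.
$parts = preg_split('/:/', 'a:b:c:d', 2);

// Hypothetical page text containing a course identifier.
$text = "Course: CSE291 notes";

// Same idea as after(): keep what follows the first match of the pattern.
$pieces = preg_split('/Course: /', $text, 2);
$rest   = count($pieces) > 1 ? $pieces[1] : "";

// Letters followed by digits at the start of the remainder.
// The i flag ignores case, as eregi() does.
$found = preg_match('/^([[:alpha:]]*[[:digit:]]*)/i', $rest, $id);

print_r($parts);
echo $id[1], "\n";
```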
Especially avoid expressions that can match in multiple ways. For example, do not write .*<big><b>.*</b></big>.* This is bad for several reasons. One, the .* at the beginning and end are unnecessary, because a pattern may match anywhere in the string. Don't add ^ and $; instead delete the .* Two, if there are multiple <big><b> regions, this will not match just the first.
Much better: write
$text = after("<big><b>",$text);
$title = before("</b></big>",$text);
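The difference between the two approaches can be demonstrated on a page with two <big><b> regions. Everything in this sketch is an assumption made for illustration: the page text, the helper delimit(), and the names before_sketch() and after_sketch(), which imitate before() and after() with preg_split() because split() is not available in PHP 7 and later.

```php
<?php
// Hypothetical helper: wrap a bare pattern in ~ delimiters for PCRE.
function delimit($pattern) {
    return '~' . str_replace('~', '\~', $pattern) . '~';
}

// Part of $text before the first match of $pattern.
function before_sketch($pattern, $text) {
    if ($pattern == "") return $text;
    $s = preg_split(delimit($pattern), $text, 2);
    return $s[0];
}

// Part of $text after the first match of $pattern ("" if there is no match).
function after_sketch($pattern, $text) {
    if ($pattern == "") return $text;
    $s = preg_split(delimit($pattern), $text, 2);
    return count($s) > 1 ? $s[1] : "";
}

// Invented page with two regions.
$page = "<p>intro</p><big><b>First Title</b></big><big><b>Second</b></big>";

// The recommended way: cut at the first opening tag, then at the first closing tag.
$text  = after_sketch("<big><b>", $page);
$title = before_sketch("</b></big>", $text);

// The discouraged pattern: the leading .* is greedy, so the captured
// region is the last one on the page, not the first.
preg_match('~.*<big><b>(.*)</b></big>.*~', $page, $m);

echo $title, "\n";
echo $m[1], "\n";
```

The two echo statements print different titles, which is exactly the second problem described above.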
When possible, use plain strings instead of regular expressions. The explode() function has the same effect as split(), but the delimiter is an ordinary string, not a regular expression. Therefore explode() is much more efficient.
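For instance, a page whose repeated parts are separated by a fixed tag can be divided with explode() alone. The page text below is invented for this sketch.

```php
<?php
// Hypothetical page: a header, two items, and a footer separated by <hr>.
$page = "header<hr>item 1<hr>item 2<hr>footer";

// The delimiter is an ordinary string, so no regular expression is needed.
$parts = explode("<hr>", $page);

// The first and last parts are special; the middle ones are the items.
$items = array_slice($parts, 1, -1);

print_r($items);
```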
It is very easy for regular expressions to match in many different ways.
Consider the function call split("[[:space:]]+",$words), which divides $words at every run of white space.
You must use split() as opposed to explode() if the delimiter
is not a single fixed string.
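The sketch below contrasts the two functions on a string with irregular white space. It assumes preg_split() as the present-day replacement for split(), which was removed in PHP 7; the input is invented.

```php
<?php
// Two spaces, a tab, and a newline separate the words.
$input = "There  once\twas\na man";

// A regular expression treats any run of white space as one delimiter.
$words = preg_split('/[[:space:]]+/', $input);

// explode() knows only one fixed delimiter, so the double space
// produces an empty item and the tab and newline are not split at all.
$pieces = explode(" ", $input);

print_r($words);
print_r($pieces);
```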