Midterm Grading
Criteria and Comments
Below are the criteria that
were used to grade the midterm. We also mention some common mistakes that
people made.
Question 1
Graded by Greg Hamerly. Email him with questions.
General Comments:
This question asked for PHP functions that used
regular expressions to extract identifiers from strings. These functions
should be efficient and concise. Each problem can be solved with 2-3 lines
of code.
-
Several people did not write regular expressions
(assuming that they were given), which is incorrect. Other people simply
wrote a regular expression, but not a function. A general guideline of
points for this problem was:
-
5 points: perfectly correct
-
4 points: minor problems with code or tiny problems
with regular expressions
-
3 points: flawed regular expressions
-
2 points: the function does nothing near what was
asked
-
1 point: not used except in extreme cases
-
0 points: no answer given
-
Some people tried to use spaces or some other regular
expression to separate identifiers. The problem defined what separates
identifiers: non-alpha/numeric characters. Therefore the appropriate regular
expression to separate identifiers was "[^[:alnum:]]+" (or "[^a-zA-Z0-9]+").
-
The ereg() function does not do what many people
think it does. The result array (third argument) always returns
10 elements if it finds any match. Also, those elements are for each part
of the regular expression enclosed in parentheses, but not multiple times.
In other words, the regular expression "(a)(b)+" will return only 2 elements
(a and b), and will not extract more b's and put them in the result array.
-
Many people incorrectly used start-of-string ("^")
and end-of-string ("$") markers in their regular expressions.
-
Several people incorrectly used the parentheses operators
"( )", or confused them with the bracket operators "[ ]".
-
Many people did not verify that the beginning of
an identifier did not start with a digit. Points were not taken off for
this mistake.
See the problem-specific comments for further details.
Also, carefully read the documentation on split, explode, ereg, and other
similar functions in PHP if you are confused about how they are used.
(a) [5 points]
Carefully write a PHP function using regular
expressions that returns the first identifier present anywhere in
a long string.
ANSWER:
<?
function firstid($string) {
if (ereg("[[:alpha:]][[:alnum:]]*",
$string, $result)) {
return $result[0];
}
return null;
}
?>
COMMON PROBLEMS:
-
Some people used the split command but with the regular
expression for an identifier, which would have removed all the identifiers
from the string, not returned them. This was a common problem in all three
parts of this question.
-
An accepted answer was splitting all the strings
and returning the first one.
(b) [5 points]
Carefully write a PHP function that returns all
the identifiers in a string.
ANSWER:
<?
function allids($string) {
$result = split("[^[:alnum:]]+",
$string);
$arr = [];
$c = count($result);
for ($i = 0; $i < $c;
$i++) {
if (ereg("[[:alpha:]][[:alnum:]]*", $result[$i], $out)) {
$arr[] = $out[0];
}
}
return $arr;
}
?>
The following simpler function incorrectly allows
identifiers that start with digits, but points were not taken off for this
mistake.
<?
function allids($string) {
$result = split("[^[:alnum:]]+", $string);
return $result;
}
?>
COMMON PROBLEMS:
-
This question's most common problem was code that
was too complex and hard to follow.
-
Several people tried to search character-by-character,
which is not exploiting the power of regular expressions.
-
As said above, some people used the regular expression
for an identifier as the argument to split, which would remove all
identifiers, rather than extract them.
(c) [5 points]
Carefully write a PHP function that returns the
last identifier in a string. (For efficiency, this function must
not be based on extracting all the identifiers in the string).
<?
function lastid($string) {
if (ereg("([[:alpha:]][[:alnum:]]*)[^[:alnum:]]*$", $string, $result)) {
return $result[1];
}
return $null;
}
?>
COMMON PROBLEMS:
-
Many people extracted all identifiers in the string
(using split, or repeatedly extracting the first identifier). This is an
incorrect approach because you were asked not to use this in the question.
Someone using this approach received 2 points.
-
Several people neglected to put the "[^[:alnum:]]*"
before the "$" end-of-string marker.
-
Another clever solution is to reverse the string
and search for "[[:alnum:]]*[[:alpha:]]", and return the reverse of whatever
that matched. However, it was not correct to simply reverse the string
and use the function you wrote to extract the first identifier.
Question 2
Graded by Victor Gidofalvi. Email him with
questions.
(a) [2 points]
Write an SQL query that returns the names of presidents
who are still alive.
ANSWER:
SELECT last_name, first_name
FROM president
WHERE death is null;
COMMON PROBLEMS:
-
Some students returned the names of dead presidents.
[deduction: 1 point]
-
Some students have used = instead of IS in their
query. Remember that the comparison operator for the special value NULL
is IS. Using = will return the empty set. [deduction: 1 point]
(b) [3 points]
Write an SQL query that finds the state in which
the greatest number of presidents have been born.
ANSWER:
SELECT state,
count(*) AS times
FROM president
GROUP BY
state
ORDER BY
times DESC
LIMIT 1;
COMMON PROBLEMS:
-
The use of max(count(state)) and similar syntax in
your query will result in the following error: ERROR 1111: Invalid
use of group function. [deduction: 2 points]
-
Limiting the output of the query to 1 row was also
essential. [deduction: 1 point]
(c) [4 points]
Can you write a query that finds the states in which
no presidents have been born? Explain your answer.
ANSWER:
NO. The entries in the president table only include
states where presidents have been born. To list states where no presidents
have been born we would need to have an additional table listing all states.
COMMON PROBLEMS:
Almost everyone gave a correct answer to this
question. Students who answered YES with a reasonable explanation about
what is needed to achieve this [an additional table with all the states]
received also full credit.
(d) [4 points]
Write an SQL query that finds which presidents have
the same
last name.
ANSWER:
SELECT last_name
FROM president
GROUP BY last_name
HAVING count(*) > 1;
An almost correct answer would have been:
SELECT p1.first_name,
p1.last_name,
p2.first_name,
p2.last_name
FROM president p1,
president p2
WHERE p1.last_name = p2.last_name AND
p1.first_name > p2.first_name;
However this query produces many duplicates. Consider
a scenario, where there are four presidents with the same last name and
distinct first names. Even though the last comparison of first name enforces
a unique ordering of the presidents, the output of this query will contain
3 + 2 + 1 = 6 duplicates.
COMMON PROBLEMS:
-
Many people gave the second solution to the problem,
and forgot about the duplicates. The problem of duplicates can be remedied
by a correct grouping of last_names. [deduction: 1 point]
(e) [4 points]
Why is the following query likely to be slow, regardless
of what indexes are available?
SELECT * FROM president WHERE 2001 - year(BIRTH)
< 75;
Explain how to write a different query that computes
the same result but is much faster.
ANSWER:
In order for an index to work, the indexed field
should appear alone (without any arithmetic operations or functions like
year() or other user-defined functions) on one side of the comparison.
The following are examples of queries that are
not taking advantage of indexes:
SELECT * FROM t1 WHERE a - 2;
SELECT * FROM t1 WHERE abs(a) = 3;
SELECT * FROM t1 WHERE abs(a) > 3;
SELECT * FROM t1 WHERE user_defined_function(a)
< 100;
The following are examples of queries that are
speeded up because of proper use of indexes:
SELECT * FROM t1 WHERE a > 3;
SELECT * FROM t1 WHERE a > abs(-3);
SELECT * FROM t1 WHERE a > year('19930104');
SELECT * FROM t1 WHERE a > year('19930104')
- 1995;
This is the correct way to do use the index, and
therefore speed the query:
# In this query, the date format is 'YYYYMMDD'
SELECT * FROM president WHERE birth > '19260101';
COMMON PROBLEMS:
-
Lots of students were concerned about the arithmetic
used in the where clause and forgot about the function. [deduction:
2 points]
-
Students received partial credit for correct reasoning,
but no improved query. The credit depended on the clarity of the reasoning.
[deduction 1-2 point(s)]
(f) [3 points]
When you are testing the speed of an SQL query with
an index and without an index, it is important to run each version of the
query several times. Explain why.
ANSWER:
For both versions the table and/or the index will
likely be copied to the cache, making subsequent queries faster. By running
both versions of the query several times, we make sure that the cached
versions are compared. This way we know better whether the unindexed
version is really slower than the indexed version in production use.
COMMON PROBLEMS:
-
Some students were listing reasons such as external
factors, load factor of the system, etc., but forgot about the effects
of caching. [deduction: 1-2 point(2)]
Question 3
Graded by Joe Drish. Email him with questions.
(a) Scalability
Option A (database)
Important Advantage
-
As users increase, the number of accesses to external
sources stays constant.
Important Disadvantage
-
As users and data increase beyond a certain maximum
you must make major modifications and enhancements to the data base.
Option B (real-time web information retrieval,
no data base)
Important advantage
-
As users increase the service can be replicated as
much as wanted on independent web servers.
Important disadvantage
-
As users increase, the burden on external sources
increases.
Those were the main advantages and disadvantages.
If you mentioned anything along those lines you were given 3 points.
If you mentioned other, less important advantages or disadvantages you
were also given credit. If you said something that was not necessarily
true then you may have been docked points, depending on the degree to which
the statement was incorrect.
(b) Convenience of presenting media
Option A (database)
Important Advantage
-
Generally more convenient to implement and less code
because the formats are specified in the data base already and are not
dynamic and the downloading code can be reused without change.
Important Disadvantage
-
Harder to reuse formatting info from the original
sources.
Option B (real-time web information retrieval,
no data base)
Important advantage
Important disadvantage
Those were the main advantages and disadvantages.
If you mentioned anything along those lines you were given 3 points.
If you mentioned other, less important advantages or disadvantages you
were also given credit. If you said something that was not necessarily
true then you may have been docked points, depending on the degree to which
the statement was incorrect.
(c) Security
Option A (database)
Important Advantage
-
Privacy depends only on protecting communication
between the user and the system, not between the system and the sources.
Important Disadvantage
-
Hackers may attempt to hack into the database.
If the central database is penetrated, everything is compromised.
Option B (real-time web information retrieval,
no data base)
Important advantage
-
No potential for hacking the data base; no burdensome
data base security maintenance
Important disadvantage
-
Availability and privacy depend on protecting communication
not just between the user and the system, but also between the system and
the sources.
Those were the main advantages and disadvantages.
If you mentioned anything along those lines you were given 3 points.
If you mentioned other, less important advantages or disadvantages you
were also given credit. If you said something that was not necessarily
true then you may have been docked points, depending on the degree to which
the statement was incorrect.
(d) User perception
Option A (database)
-
Users would perceive queries as being faster because
of indexing and other database optimizations.
-
Users would be able to perceive, or obtain, only
the data that is in the database. This includes old data.
Option B (real-time web information retrieval, no
data base)
-
Users would perceive queries as being slower, since
the web needed to be searched.
-
Users would be able to perceive, or obtain, up-to-date
information. They would not necessarily be able to see, or query,
older information, because there is no database.
If you mentioned something about the speed of which
data was obtained by using either option, then you were given 3 points.
If you mentioned something about how up-to-date the information was for
either option, you were given an additional 3 points. You were not
given any points if you talked about how difficult it was to implement
either option.
Common problems for Question 3:
Some people listed advantages for the data base
option when they really applied to the no data base option. You may
have been deducted a point for this. It is better to be clear and
simple rather than wordy when writing your explanations. Scalability
is defined as how well a solution to a problem will work when the size
of the problem increases. Size in this case is users and data.
Please keep this definition in mind when thinking about scalability.