CSE134A LECTURE NOTES

October 17, 2001
 
 

ANNOUNCEMENTS

Everyone should look at the class web page and read the new messages on Discus at least once per day.  We have over 130 students and this is our best way to communicate with you.

We have published the scores for the first project on gradesource.com.  If you have not received your secret number by email, contact Victor Gidofalvi.

Some students are upset about losing points for not providing a working URL.  We did request these through announcements in class and on the web.  Now we have to be fair to the great majority of students, who did the project on time and followed directions.

For the current project, be sure to put your URL, the names of your team members, etc. in your report.  Also be sure that all your file permissions are set correctly, etc.  We will not log in to your account, or fix things for you.  You must carefully take care of all details in your project.  The report is your opportunity to explain anything that might be unclear.

The deadline for the current project is next Monday, October 22.  We've canceled the part where you build your own server for checking spelling is canceled.  We'll revisit that topic later in the quarter.
 
 

GUIDELINES FOR INDEXING

Indexes are only useful for <, <=, =, >=, > comparisons on columns of the same type in where clauses, such that the indexed column is not part of any larger expression.  For example, an index on c2 is not useful for the query
select c1, c2 from t1, t2 where c1 = c2*2
When indexing a text column, make the index be on a string prefix only.  For example to make an index on the first ten characters only of an author's name, use the command alter table messages add index ax (author(10))

Indexes are less useful for values that are repeated, most useful for unique values.

Always do performance experiments when you are not 100% sure about the value of an index.  Note that the same query can be much more than ten times slower when the database is ten times larger.  This can be true even if the smaller database is actually the larger one narrowed done with a where clause.

The same query can be ten times faster when you run it a second time, because the tables have been loaded into the server's in-memory cache.

It is possible to use data from the index directly, instead of using the index to guide a read from the table.  Consider the  query

select author, zip from messages, address where messages.author = address.name
Now we are taking the  zip value from the table address, for each author in messages.   Suppose we have an index on (name,zip) in the address table.  Then the additional values can be taken straight from the index, so the table itself is never accessed at all.  This can be a make-or-break optimization.
 
 

SEEING HOW A SELECT QUERY WILL BE EXECUTED

Use the explain SQL command to see how a query will be executed.  For example, here's how an index is used for a query involving a range restriction on the value of the date field:
mysql> explain select distinct ticker from messages where date>'2001-09-23';
+----------+-------+---------------+------+---------+------+-------+-------+
| table    | type  | possible_keys | key  | key_len | ref  | rows  | Extra |
+----------+-------+---------------+------+---------+------+-------+-------+
| messages | range | date          | date |    NULL | NULL | 11321 |       |
+----------+-------+---------------+------+---------+------+-------+-------+
Which index is used is shown by the key column.  Which value from another table will be looked up in this index is shown by the ref column.  How the index is used is shown by the type column.  The least efficient value for type is ALL, then index, then range, ref, eq_ref.

The product of rows from each table is an estimate of the number of combinations to be processed.

If only index is shown under Extra, then the the actual table does not need to be accessed at all, which is the major efficiency bonus mentioned above.

Here's how an index is used for an equijoin:

mysql> explain select distinct ticker from messages, sorted where messages.id=sorted.k;
+----------+--------+---------------+---------+---------+----------+--------+-------------+
| table    | type   | possible_keys | key     | key_len | ref      | rows   | Extra       |
+----------+--------+---------------+---------+---------+----------+--------+-------------+
| sorted   | index  | k             | k       |       4 | NULL     | 124113 | Using index |
| messages | eq_ref | PRIMARY       | PRIMARY |       4 | sorted.k |      1 |             |
+----------+--------+---------------+---------+---------+----------+--------+-------------+
And here's how both are used together:
mysql> explain select distinct ticker from messages, sorted
            where messages.id=sorted.k and date>'2001-04-23' order by ticker;
+----------+-------+---------------+------+---------+-------------+-------+-------------+
| table    | type  | possible_keys | key  | key_len | ref         | rows  | Extra       |
+----------+-------+---------------+------+---------+-------------+-------+-------------+
| messages | range | PRIMARY,date  | date |    NULL | NULL        | 11321 |             |
| sorted   | ref   | k             | k    |       4 | messages.id |    10 | Using index |
+----------+-------+---------------+------+---------+-------------+-------+-------------+

GOOD AND BAD DATABASE DESIGN

Typically you know which columns you want to store data in, e.g. first and last name, SSN, zip code, date of message, date of userid creation, etc.  But how should you group these into tables?

The most important guideline is to avoid redundancy.  Consider this schema:

messages: first name, last name, address, zip, title, body
To reduce redundancy, we should have two tables.  We still have redundancy in the names, so we should introduce unique ids for authors:
messages: aid, title, body
  person: aid, first name, last name, address, zip
Note that we will often want unique ids for many different entities, e.g. for authors and separately for messages.  That's why the new field is named aid and not just id.

Why is redundancy bad?  There are at least four reasons:

How can we prevent these problems?
 
 

CONSTRAINTS

For any database, we usually possess some meta-knowledge about properties that the actual data should always satisfy.  For example: Ideally, we would be able to state all these constraints in some formal language.  The database server would give an error message whenever an insert or delete or update operation would cause any constraint to be violated.

Different database servers can enforce more or fewer types of constraint.  For example, MySQL can enforce the first constraint by saying that the id field is not null in the messages table.  The fourth constraint can be enforced with a primary key declaration on the pair of fields (name,address).

The second constraint is accommodated by not declaring id to be unique  for the messages table.  The third constraint is accommodated by making the messages and person tables separate.  This constraint would be violated if we used one big table.

The last constraint is an example of a functional dependency. Unlike more sophisticated database systems, MySQL has no features to enforce functional dependencies.

Guideline: Before choosing a database design, write down what constraints you know should be true.  Then select a design that enforces these constraints.
 
 



Copyright (c) by Charles Elkan, 2001.