CSE134A LECTURE NOTES

April 25, 2001
 
 

ANNOUNCEMENTS

The first project reports are due right now, at the start of class.
 
 

INDEXES AND JOIN QUERIES

Consider the query
select c1, c2, c3 from t1, t2, t3 where c1=c2 and c1=c3
With an index on tables t2 and t3 this query can be executed by one scan of table t1.

Doing direct accesses to each row of a table is still much slower than scanning the table sequentially.  Reason: for each row, one block must be read from disk.  When scanning, many rows are processed per read of a block.

It is possible to use data from the index directly, instead of using the index to guide a read from the table.  Consider the  query

select author, zip from postings, address where postings.author = address.name
Now we are taking the  zip value from the table address, for each author in postings.   Suppose we have an index on (name,zip).  Then the additional values can be taken straight from the index.  This can be a make-or-break optimization.
 
 

GUIDELINES FOR INDEXING

Every index causes additional slowdown for inserts and updates.  Slow inserts and updates are especially bad because of locking.  While an insert or update is happening, the table cannot be accessed by any select query; it is locked.  On the other hand, many select queries can read a table at the same time.  So, only create an index if it is necessary.

Remember that an index on multiple columns, e.g. (state, city, zip), is also an index on its prefixes, i.e. (state, city) and (state).

Indexes are only useful for <, <=, =, >=, > comparisons on columns of the same type in where clauses, such that the indexed column is not part of any larger expression.  For example, an index on c2 is not useful for the query

select c1, c2 from t1, t2 where c1 = c2*2
When indexing a text column, make the index be on a string prefix only.  For example to make an index on the first ten characters only of an author's name, use the command alter table postings add index ax (author(10))

Indexes are less useful for values that are repeated, most useful for unique values.
 
 

SEEING HOW A SELECT QUERY WILL BE EXECUTED

Use the explain SQL command to see how a query will be executed.  For example, here's how an index is used for a range restriction:
mysql> explain select distinct ticker from postings where date>'2001-04-23';

+----------+-------+---------------+------+---------+------+-------+-------+
| table    | type  | possible_keys | key  | key_len | ref  | rows  | Extra |
+----------+-------+---------------+------+---------+------+-------+-------+
| postings | range | date          | date |    NULL | NULL | 11321 |       |
+----------+-------+---------------+------+---------+------+-------+-------+

Here's how an index is used for an equijoin:
mysql> explain select distinct ticker from postings, sorted where postings.id=sorted.k;

+----------+--------+---------------+---------+---------+----------+--------+-------------+
| table    | type   | possible_keys | key     | key_len | ref      | rows   | Extra       |
+----------+--------+---------------+---------+---------+----------+--------+-------------+
| sorted   | index  | k             | k       |       4 | NULL     | 124113 | Using index |
| postings | eq_ref | PRIMARY       | PRIMARY |       4 | sorted.k |      1 |             |
+----------+--------+---------------+---------+---------+----------+--------+-------------+

And here's how both are used together:
mysql> explain select distinct ticker from postings, sorted
            where postings.id=sorted.k and date>'2001-04-23' order by ticker;

+----------+-------+---------------+------+---------+-------------+-------+-------------+
| table    | type  | possible_keys | key  | key_len | ref         | rows  | Extra       |
+----------+-------+---------------+------+---------+-------------+-------+-------------+
| postings | range | PRIMARY,date  | date |    NULL | NULL        | 11321 |             |
| sorted   | ref   | k             | k    |       4 | postings.id |    10 | Using index |
+----------+-------+---------------+------+---------+-------------+-------+-------------+

PERFORMANCE AND SCALABILITY

The same query can be much more than ten times slower when the database is ten times larger.  This can be true even if the smaller database is actually the larger one narrowed done with a where clause.

The same query can be ten times faster when you run it a second time, because the tables have been loaded into the server's in-memory cache.
 
 

GOOD AND BAD DATABASE DESIGN

Typically you know which columns you want to store data in, e.g. first and last name, SSN, zip code, date of message, date of userid creation, etc.  But how should you group these into tables?

The most important guideline is to avoid redundancy.  Consider this schema:

postings: first name, last name, address, zip, title, body
To reduce redundancy, we should have two tables:
postings: first name, last name, title, body
  person: first name, last name, address, zip




Copyright (c) by Charles Elkan, 2001.