Tom Kelliher, CS 318
Apr. 22, 2002
Homework due Wednesday, start of class.
Read 13.4--7.
Indexing schemes.
SELECT
processing.
Completion of query processing.
Given a table and one or more access paths, how do we structure queries to optimize performance?
SORT BY
clause, sorting will be necessary
when DISTINCT
is in use. Why?
External sorting algorithm:
// Partial sort on runs of M pages. do { read next M pages from the file; sort rows using an in-memory technique; write the sorted M pages to a file of their own; } until end of file; // Merge the partial sorts into a final sorted file. // Open files for reading to maintain a balanced tree organization // of runs. while there is more than one input run { open M - 1 runs for reading while there is a non-empty run { choose the smallest tuple, with respect to sort key, from each run; output the smallest such tuple and delete it from its run; } write and close the output file; } }M - 1 because we need one output buffer.
Rough example: file contains 13 pages, we have three buffers.
Questions:
DISTINCT
clause.
Assuming a uniform distribution, each hash file will fit in the M buffers. Quite a large file can be handled this way.
Preliminaries:
SELECT
model:
SELECT * FROM R WHERE Col1 op1 Val1 AND Col2 op2 VAL2 ... Coln opn Valn;
Or, if file is sorted on Col1
, perform a binary search to get
to the first ``qualifying'' tuple and then scan forward.
Cost?
Col1
--- Find first qualifying tuple
and then use sibling pointers to scan forward.
Cost? Suppose the file is unclustered? Sort the rids.
Col1
--- Find bucket containing Val1
and scan.
Cost?
Or, if file is sorted on several of the Coli
, perform a binary
search to get to the first ``qualifying'' tuple and then scan forward.
Cost?
For this to be useful, a subset of the Coli
must be a prefix of
the index columns. Example.
Cost?
opi
must be equality.
The columns indexed on must be a subset of the Coli
. Why?
Hash to find buckets, then check buckets for matches.
Cost?