Tom Kelliher, CS 318
Apr. 22, 2002
Homework due Wednesday, start of class.
Read 13.4--7.
Indexing schemes.
SELECT processing.
Completion of query processing.
Given a table and one or more access paths, how do we structure queries to optimize performance?
SORT BY clause, sorting will be necessary
when DISTINCT is in use. Why?
External sorting algorithm:
// Partial sort on runs of M pages.
do {
read next M pages from the file;
sort rows using an in-memory technique;
write the sorted M pages to a file of their own;
} until end of file;
// Merge the partial sorts into a final sorted file.
// Open files for reading to maintain a balanced tree organization
// of runs.
while there is more than one input run {
open M - 1 runs for reading
while there is a non-empty run {
choose the smallest tuple, with respect to sort key, from each
run;
output the smallest such tuple and delete it from its run;
}
write and close the output file;
}
}
M - 1 because we need one output buffer.
Rough example: file contains 13 pages, we have three buffers.
Questions:
DISTINCT clause.
Assuming a uniform distribution, each hash file will fit in the M buffers. Quite a large file can be handled this way.
Preliminaries:
SELECT model:
SELECT * FROM R WHERE Col1 op1 Val1 AND Col2 op2 VAL2 ... Coln opn Valn;
Or, if file is sorted on Col1, perform a binary search to get
to the first ``qualifying'' tuple and then scan forward.
Cost?
tree index on Col1 --- Find first qualifying tuple
and then use sibling pointers to scan forward.
Cost? Suppose the file is unclustered? Sort the rids.
Col1 --- Find bucket containing Val1
and scan.
Cost?
Or, if file is sorted on several of the Coli, perform a binary
search to get to the first ``qualifying'' tuple and then scan forward.
Cost?
tree index on several columns.
For this to be useful, a subset of the Coli must be a prefix of
the index columns. Example.
Cost?
opi must be equality.
The columns indexed on must be a subset of the Coli. Why?
Hash to find buckets, then check buckets for matches.
Cost?