Measuring Performance

Tom Kelliher, CS 220

Sept. 2, 2011

Administrivia

Announcements

Assignment

No additional reading.

From Last Time

Introduction.

Outline

  1. CPU model.

  2. Defining performance.

  3. Measuring performance.

  4. Choosing benchmarks.

Coming Up

Comparing performance.

CPU Model

  1. First off, multipliers: giga ($10^9$), mega ($10^6$), nano ($10^{-9}$), pico ($10^{-12}$).

  2. What is the ``clock''?
    1. Clock frequency/rate. Clock period.

    2. Logic gate circuit delays.

      Combinational and sequential logic.

    3. How much work can I do in a clock period?

  3. CPU model:

    \begin{figure}\centering\includegraphics[]{Figures/cpuModel.eps}\end{figure}

  4. CPU implementations: single cycle, multiple cycle, pipelined.

    What is super-pipelining?

  5. Relating this model to CPI.
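
    As a reference point for relating the clock to CPI (anticipating the equations later in these notes; the numbers below are only illustrative):

    \begin{displaymath}
{\rm cycle~time} = \frac{1}{\rm clock~rate} \qquad\qquad
{\rm avg~CPI} = \frac{\rm CPU~cycles}{\rm instruction~count}
\end{displaymath}

    For example, a 500 MHz clock has a cycle time of $1/(500\times10^6~{\rm s}^{-1}) = 2$ ns, and a run of $4\times10^6$ instructions that takes $10\times10^6$ cycles has an average CPI of 2.5.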

Defining Performance

  1. Why do we care about performance?

  2. What is it? What do we measure?
    1. GHz? Matters to marketing types.

    2. How quickly we can run synthetic benchmark kernels (toy programs)?

    3. Throughput? Matters mostly to system admins.

    4. Response time? Matters mostly to users.

  3. Response time. Definition:
    Begin to finish time for a program, as measured by a ``wall clock.''
    Response time then includes:
    1. I/O time.

    2. Time the CPU is assigned to other users.

    3. Time necessary for system tasks.

    Another measure of response time: user CPU time.

    System performance -- elapsed time (wall time) on an unloaded system. Accounts for everything (I/O, users, OS overhead).

    CPU performance -- user CPU time. Best metric for comparing processors?
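
    To make the wall-clock vs. CPU-time distinction concrete, here is a minimal Python sketch (not from the original notes; the busy_work function and the sleep duration are arbitrary illustrations). time.perf_counter() measures elapsed wall time, while time.process_time() measures the user + system CPU time charged to the process:

    \begin{verbatim}
import time

def busy_work(n=2_000_000):
    # Purely CPU-bound loop: consumes CPU time and wall time together.
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.perf_counter()   # wall-clock ("response time") reference
cpu_start  = time.process_time()   # user + system CPU time for this process

busy_work()
time.sleep(1.0)                    # I/O-like wait: wall time grows, CPU time does not

print("elapsed (wall) time:", time.perf_counter() - wall_start, "s")
print("CPU (user+sys) time:", time.process_time() - cpu_start, "s")
\end{verbatim}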

Measuring Performance

Equations:

  1. Performance:

    \begin{displaymath}
{\rm Performance} = \frac{1}{\rm Execution~Time}
\end{displaymath}

    Higher numbers are better.

  2. Relative performance (suppose machine A is faster than B):

    \begin{displaymath}
\frac{\rm Performance_A}{\rm Performance_B} = \frac{\rm Execution~Time_B}{\rm Execution~Time_A} = n
\end{displaymath}

    We say A is $n$ times faster than B.

  3. Breaking down execution time:
    1. Factoring in cycle time:

      \begin{displaymath}
{\rm CPU~time} = {\rm CPU~cycles} \times {\rm cycle~time}
\end{displaymath}

    2. How many cycles?

      \begin{displaymath}
{\rm CPU~cycles} = {\rm instruction~count} \times {\rm avg~CPI}
\end{displaymath}

      Categorize instructions and then get CPI for each category.

      How do we get instruction counts?

    CPU time:

    \begin{displaymath}
{\rm CPU~time} = {\rm instruction~count} \times {\rm avg~CPI} \times {\rm cycle~time}
\end{displaymath}

    Influences on:

    1. Instruction count: compiler, architecture.

      Static vs. dynamic counts.

    2. Cycle time: architecture, technology, microarchitecture (pipelining).

    3. CPI: cycle time, microarchitecture (pipelining, superscalar, renaming).

    Complexity!!!
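
    Putting the three factors together in code form, a minimal sketch (not from the notes; the cpu_time helper, instruction counts, and CPI values are made up for illustration):

    \begin{verbatim}
def cpu_time(inst_counts, cpi, cycle_time):
    # CPU time = (sum over classes of count x CPI) x cycle time
    cycles = sum(inst_counts[c] * cpi[c] for c in inst_counts)
    return cycles * cycle_time

counts = {"A": 4_000_000, "B": 1_000_000}   # dynamic instruction counts (illustrative)
cpi    = {"A": 1.0, "B": 3.0}               # average CPI per class (illustrative)
cycle  = 1 / 500e6                          # 500 MHz clock -> 2 ns cycle time

print(cpu_time(counts, cpi, cycle), "seconds")   # 7e6 cycles x 2 ns = 0.014 s
\end{verbatim}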

Examples:

  1. Consider two different implementations, M1 and M2, of the same instruction set. There are four classes of instructions (A, B, C, and D) in the instruction set.

    M1 has a clock rate of 500 MHz. The average number of cycles for each instruction class on M1 is as follows:

    Class   CPI
    A       1
    B       2
    C       3
    D       4

    M2 has a clock rate of 750 MHz. The average number of cycles for each instruction class on M2 is as follows:

    Class   CPI
    A       2
    B       2
    C       4
    D       4

    Assume that peak performance is defined as the fastest rate at which a machine can execute an instruction sequence chosen to maximize that rate. What are the peak performances of M1 and M2, expressed in instructions per second?

  2. If the number of instructions executed in a certain program is divided equally among the classes of instructions, how much faster is M2 than M1?

  3. Assuming the previous CPI and instruction distribution values, at what clock rate would M1 have the same performance as M2?
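
One way to set up the arithmetic for these three questions, using only the CPI tables and clock rates above (a Python sketch for checking, not official solutions):

    \begin{verbatim}
cpi_m1 = {"A": 1, "B": 2, "C": 3, "D": 4}; clock_m1 = 500e6   # Hz
cpi_m2 = {"A": 2, "B": 2, "C": 4, "D": 4}; clock_m2 = 750e6   # Hz

# 1. Peak performance: run only the class with the lowest CPI.
peak_m1 = clock_m1 / min(cpi_m1.values())   # 500e6 / 1 = 500 million instr/sec
peak_m2 = clock_m2 / min(cpi_m2.values())   # 750e6 / 2 = 375 million instr/sec

# 2. Instructions divided equally among the four classes: use the average CPI.
avg_cpi_m1 = sum(cpi_m1.values()) / 4       # 2.5
avg_cpi_m2 = sum(cpi_m2.values()) / 4       # 3.0
speedup = (clock_m2 / avg_cpi_m2) / (clock_m1 / avg_cpi_m1)   # 250e6 / 200e6 = 1.25

# 3. Clock rate at which M1 matches M2 for this instruction mix.
clock_needed = avg_cpi_m1 * (clock_m2 / avg_cpi_m2)           # 625 MHz
\end{verbatim}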

Choosing Benchmarks

  1. How do you choose your test programs (benchmarks)?
    1. Workload -- programs used day-in and day-out?

      Too cumbersome. Everyone's workload differs.

    2. Benchmarks -- representative programs. Should be real, substantial applications.

    3. Not synthetic kernels, ripe for one-shot compiler optimizations.

  2. SPEC:
    1. Set of scientific benchmarks (compiler, go, compress, jpeg, plasma physics, quantum chemistry, etc.).

    2. '89 and '95.

    3. Int and fp.

  3. Example of benchmark abuse: Matrix 300 (SPEC '89) on an IBM Powerstation 550:

    \begin{figure}\centering\includegraphics[width=5in]{Figures/f0203.eps}\end{figure}

    99% of the execution time is in a single line of code! The benchmark was designed to test the memory system; the compiler performed a one-shot optimization that eliminated the cache misses.



Thomas P. Kelliher 2011-09-01