Performance, Microprogramming

Tom Kelliher, CS26

Oct. 23, 1996

Don't cares on truth tables.

Performance

Simplified performance of a particular program:
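One common simplified model, assumed here rather than taken from these notes, is T = N * S / R: N is the number of instructions executed, S is the average number of clocks per instruction, and R is the clock rate. A minimal sketch with hypothetical numbers:

```python
def execution_time(instruction_count, clocks_per_instruction, clock_rate_hz):
    """Seconds to run a program under the assumed model T = N * S / R."""
    return instruction_count * clocks_per_instruction / clock_rate_hz


# Hypothetical workload: one million instructions on a 100 MHz clock.
N = 1_000_000
R = 100e6

# A design needing several clocks per instruction (S = 5 is an assumption).
multi_clock = execution_time(N, 5, R)

# The "one instruction per clock" target.
one_per_clock = execution_time(N, 1, R)

speedup = multi_clock / one_per_clock
print(f"{multi_clock:.3f} s vs {one_per_clock:.3f} s, speedup {speedup:.1f}x")
```

Under this model, the only ways to go faster are to execute fewer instructions, use fewer clocks per instruction, or raise the clock rate.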

Poor performance of single-bus CPU implementation:

Serialization of register transfers.
Multiple clocks per instruction.
Very little parallelism.
Poor resource utilization.

Poor relative to what?

One instruction per clock execution. (First achieved by RISC architectures.)
Multiple instructions per clock execution --- superscalar execution.

How achieved?

Improving the Single-Bus CPU

Multiple buses!!!

This organization allows an actual addition to be done in one clock.
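A rough sketch of why extra buses help, assuming a three-bus data path in which buses A and B feed the ALU and bus C carries the result. The register names, bus names, and step sequences are illustrative assumptions, not taken from the notes. Each clock step is a list of (bus, source, destination) transfers, and a step is legal only if no bus is driven twice.

```python
def valid(step):
    """A step is legal when each bus carries at most one transfer."""
    buses = [bus for bus, _src, _dst in step]
    return len(buses) == len(set(buses))


# R3 <- R1 + R2 on a single shared bus: the transfers must be serialized.
SINGLE_BUS_ADD = [
    [("A", "R1", "Y")],   # first operand into the ALU input latch Y
    [("A", "R2", "Z")],   # second operand on the bus; sum latched in Z
    [("A", "Z", "R3")],   # result written back over the same bus
]

# Same operation with three buses: everything moves in one clock.
THREE_BUS_ADD = [
    [("A", "R1", "ALU"), ("B", "R2", "ALU"), ("C", "ALU", "R3")],
]

for name, seq in [("single bus", SINGLE_BUS_ADD), ("three buses", THREE_BUS_ADD)]:
    assert all(valid(step) for step in seq)
    print(f"{name}: {len(seq)} clock(s)")

# Trying to merge two single-bus steps would put two sources on bus A.
merged = SINGLE_BUS_ADD[0] + SINGLE_BUS_ADD[1]
print("merged single-bus step legal?", valid(merged))
```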

How are the other sub-cycles affected?

Simultaneous PC increment and MAR load.
Direct, indirect, indexed, immediate addressing.
MDR transfers.

Instruction Prefetching: Rudimentary Pipelining

Instruction Unit.

At the beginning of each instruction cycle, the CPU stalls waiting on the instruction fetch.
Next fetch (maybe) can be overlapped with current execute.
Buffers a few instructions --- caches small loops.
Branch prediction.
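The benefit of overlapping the next fetch with the current execute can be estimated with a small model. It assumes no branches and that fetch and execute never compete for the same bus (the "maybe" above), and the cycle counts are hypothetical.

```python
def serial_cycles(n, fetch, execute):
    """Every instruction pays for its fetch and then its execute."""
    return n * (fetch + execute)


def overlapped_cycles(n, fetch, execute):
    """Fetch of instruction i+1 runs alongside execute of instruction i.

    Only the first fetch and the last execute are fully exposed; each of
    the n - 1 overlapped slots takes as long as the slower activity.
    """
    if n == 0:
        return 0
    return fetch + (n - 1) * max(fetch, execute) + execute


n, fetch, execute = 10, 2, 3
print("serial:    ", serial_cycles(n, fetch, execute))
print("overlapped:", overlapped_cycles(n, fetch, execute))
```

A taken branch discards the prefetched instruction, which is where branch prediction comes in.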

Caching

Small, fast memory placed between CPU and main memory.

Discussed in detail later.

Organization of a Complete CPU

Superscalar (more execution units).
Why two L1 caches?

Consider a typical 3-stage RISC instruction cycle:

Consider pipelining it.
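As a sketch of what pipelining buys, the code below builds a clock-by-clock occupancy table for a three-stage pipeline. The stage names are assumptions, and hazards and branches are ignored.

```python
STAGES = ["fetch", "execute", "write"]  # assumed stage names


def pipeline_timeline(n_instructions, stages=STAGES):
    """One row per clock: the instruction index in each stage, or None."""
    if n_instructions == 0:
        return []
    total_clocks = len(stages) + n_instructions - 1
    rows = []
    for clock in range(total_clocks):
        row = []
        for stage_index in range(len(stages)):
            instr = clock - stage_index  # instruction that entered this many clocks ago
            row.append(instr if 0 <= instr < n_instructions else None)
        rows.append(row)
    return rows


timeline = pipeline_timeline(4)
for clock, row in enumerate(timeline):
    cells = ["--" if i is None else f"I{i}" for i in row]
    print(clock, " ".join(cells))

unpipelined_clocks = 4 * len(STAGES)
pipelined_clocks = len(timeline)
```

Once the pipeline is full, one instruction completes every clock, even though each instruction still takes three clocks from start to finish.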

Microprogramming

First the what, then the why.

Introduction to Microprogramming

Data path (defn):
1. Register file.
2. ALU.
3. MDR, other ``data'' registers.
Control unit (defn):
1. IR, PC, MAR.
2. Instruction decoder/encoder.
Data path needs sequences of 0's and 1's on control inputs to execute instructions.
Control unit provides the sequence.
Can the control unit be replaced with a memory (control store) whose output is connected to the data path's control inputs?
Contents of the control store --- a program for each instruction. Microprogram. Microinstructions.
How do we sequence the control store? Required operations:
1. Straight line execution.
2. Unconditional branches.
3. Conditional branches.
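The three sequencing operations above can be sketched in software. This is a toy model, not a description of any real microsequencer: the field names, control-signal names, and the microroutine itself are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MicroInstr:
    signals: tuple      # control lines asserted during this clock
    seq: str = "next"   # "next", "jump", "branch", or "end"
    target: int = 0     # control store address used by jump/branch
    cond: str = ""      # name of the condition flag tested by "branch"


def run(control_store, start, flags, max_steps=100):
    """Step through the control store; return the addresses visited."""
    upc = start  # micro program counter
    visited = []
    for _ in range(max_steps):
        mi = control_store[upc]
        visited.append(upc)
        if mi.seq == "end":
            break
        if mi.seq == "jump":
            upc = mi.target                      # unconditional branch
        elif mi.seq == "branch" and flags.get(mi.cond, False):
            upc = mi.target                      # conditional branch taken
        else:
            upc += 1                             # straight-line execution
    return visited


CONTROL_STORE = [
    MicroInstr(("PCout", "MARin", "Read")),                   # 0
    MicroInstr(("MDRout", "IRin")),                           # 1
    MicroInstr(("R1out", "Yin"), "branch", 5, "zero"),        # 2
    MicroInstr(("R2out", "Add", "Zin")),                      # 3
    MicroInstr(("Zout", "R3in"), "jump", 6),                  # 4
    MicroInstr(("ClearR3",)),                                 # 5
    MicroInstr(("End",), "end"),                              # 6
]

print(run(CONTROL_STORE, 0, {"zero": False}))
print(run(CONTROL_STORE, 0, {"zero": True}))
```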

A microsequencer:

Horizontal microcode.
Vertical microcode.
Limiting the size of the control store:
1. Commonalities between microroutines.
2. Utilize branching to ``factor out'' common code.
3. Micro-subroutines!!!
Prefetching microinstructions?!?
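The difference between horizontal and vertical microcode comes down to word width versus decoding. In the sketch below, a horizontal word has one bit per control line, and a vertical word has one encoded field per group of lines. The grouping and signal names are assumptions, and the encoding treats the lines in a group as mutually exclusive, which is exactly the parallelism vertical microcode gives up.

```python
GROUPS = {
    "bus_source": ["PCout", "MDRout", "Zout", "R0out", "R1out", "R2out", "R3out"],
    "bus_dest": ["MARin", "IRin", "Yin", "R0in", "R1in", "R2in", "R3in"],
    "alu_op": ["Add", "Sub", "And"],
}


def field_bits(signals):
    """Bits needed to name one signal in the group, plus a 'none' code 0."""
    return len(signals).bit_length()


def encode_vertical(active):
    """Pack {group: signal} into one integer, first group in the high bits."""
    word = 0
    for name, signals in GROUPS.items():
        code = signals.index(active[name]) + 1 if name in active else 0
        word = (word << field_bits(signals)) | code
    return word


def decode_to_horizontal(word):
    """Expand a vertical word into the set of asserted control lines."""
    asserted = set()
    for name, signals in reversed(list(GROUPS.items())):
        width = field_bits(signals)
        code = word & ((1 << width) - 1)
        word >>= width
        if code:
            asserted.add(signals[code - 1])
    return asserted


horizontal_width = sum(len(s) for s in GROUPS.values())
vertical_width = sum(field_bits(s) for s in GROUPS.values())
print("horizontal bits:", horizontal_width, "vertical bits:", vertical_width)

word = encode_vertical({"bus_source": "R1out", "bus_dest": "Yin"})
print(sorted(decode_to_horizontal(word)))
```

The narrower vertical word shrinks the control store, at the cost of a decoder between the store and the data path.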

What a Long, Strange Trip It's Been

The Case for Microprogramming

Advantages off the bat:

Easier debugging.
Quicker to market.
Emulation.
Extending the instruction set.
Easier upgrades.

Disadvantages off the bat:

Slower than hard-wired.

1970s technology:

Main memory was core; control stores were solid state (10 times faster).
No caches.
An 8 Kb ROM took about as much chip area as an 8-bit register.

Implications:

Program speed was proportional to program size (bandwidth).
Control stores were ``cheap.''

Solution: Microprogramming and richer instruction sets

Simplify compiler construction.
Close the ``semantic gap.''
Improve architectural quality by decreasing program size and bandwidth.
Microinstructions were ``faster'' than regular instructions.
Register-based architectures were unwieldy; use stack-based or memory-memory.

1980s technology:

Main memory was now solid state.
Caches were common.
CMOS VLSI.
Control store ROMs were becoming RAMs (so that microcode bugs could be patched).
Compilers were sub-setting architectures (using only part of the instruction set).

Some weird developments:

Writable control stores.
Virtual memory at the control store level.
Nanocode.

Two CPUs:

RISC design philosophy:

Functions should be kept simple unless there is very good reason to do otherwise.
Microinstructions should not be faster than simple instructions.
Microcode is not magic.
Simple decoding and pipelined execution are more important than program size.

RISC CPU traits:

Load/store; operations are register-register.
The operations and addressing modes are reduced.
Instruction formats are simple and do not cross word boundaries.
RISC branches avoid pipeline penalties.

Thomas P. Kelliher
Wed Oct 23 22:49:36 EDT 1996
