Caches and Memory Hierarchies

Tyler Bletsch
Duke University

Slides are derived from work by
Daniel J. Sorin (Duke), Amir Roth (Penn), and Alvin Lebeck (Duke)
Where We Are in This Course Right Now

- So far:
  - We know how to design a processor that can fetch, decode, and execute the instructions in an ISA
  - We have assumed that memory storage (for instructions and data) is a magic black box

- Now:
  - We learn why memory storage systems are hierarchical
  - We learn about caches and SRAM technology for caches

- Next:
  - We learn how to implement main memory
Readings

• Patterson and Hennessy
  • Chapter 5
This Unit: Caches and Memory Hierarchies

- Memory hierarchy
  - Basic concepts
- Cache organization
- Cache implementation
Why Isn’t This Sufficient?

processor core (CPU)

- instruction fetch requests;
- load requests;
- stores

MEMORY

- $2^N$ bytes of storage, where $N=32$ or $64$ (if 32-bit or 64-bit ISA)

- fetched instructions;
- loaded data

$2^{32}$ bytes per running program = ridiculous
$2^{64}$ bytes per running program = ridiculous++
→ will defer this problem for a week or so

Let’s assume we can buy all that memory.
So now what’s the problem?

Access latency of memory grows with its size. Accessing 4GB of memory would take hundreds of cycles → way too long.
An Analogy: Duke’s Library System

- Student keeps small subset of Duke library books on bookshelf at home
  - Books she’s actively reading/using
  - Small subset of all books owned by Duke
  - Fast access time
- If book not on her shelf, she goes to Perkins
  - Much larger subset of all books owned by Duke
  - Takes longer to get books from Perkins
- If book not at Perkins, must get from off-site storage
  - Guaranteed (in my analogy) to get book at this point
  - Takes much longer to get books from here
An Analogy: Duke’s Library System

- **CPU keeps small subset of memory in its level-1 (L1) cache**
  - Data it’s actively reading/using
  - Small subset of all data in memory
  - Fast access time

- **If data not in CPU’s cache, CPU goes to level-2 (L2) cache**
  - Much larger subset of all data in memory
  - Takes longer to get data from L2 cache

- **If data not in L2 cache, must get from main memory**
  - Guaranteed to get data at this point
  - Takes much longer to get data from here
Big Concept: Memory Hierarchy

- Use hierarchy of memory components
  - Upper components (closer to CPU)
    - Fast ↔ Small ↔ Expensive
  - Lower components (further from CPU)
    - Slow ↔ Big ↔ Cheap
  - Bottom component (for now!) = what we have been calling “memory” until now

- Make average access time close to L1’s
  - How?
    - Most frequently accessed data in L1
    - L1 + next most frequently accessed in L2, etc.
  - **Automatically** move data up and down the hierarchy
Some Terminology

• If we access a level of memory and find what we want → called a hit

• If we access a level of memory and do NOT find what we want → called a miss
Some Goals

• Key 1: High “hit rate” → high probability of finding what we want at a given level

• Key 2: Low access latency

• Misses are expensive (take a long time)
  • Try to avoid them
  • But, if they happen, amortize their costs → bring in more than just the specific word you want → bring in a whole block of data (multiple words)
Blocks

- Block = a group of *spatially contiguous and aligned* bytes
  - Typical sizes are 32B, 64B, 128B
- Spatially contiguous and aligned
  - Example: 32B blocks
  - Blocks = [address 0 - address 31], [32-63], [64-95], etc.
  - NOT:
    - [13-44] = unaligned
    - [0-22, 26-34] = not contiguous
    - [0-20] = wrong size (not 32B)
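
The alignment rule can be checked with a little integer arithmetic. The following is a minimal sketch, assuming the block size is a power of two; the helper names `block_base` and `is_valid_block` are illustrative and not part of the slides.

```c
#include <stdint.h>
#include <stdbool.h>

/* Base address of the aligned block that contains addr.
   Assumption: block_size is a power of two (e.g., 32, 64, 128). */
uint32_t block_base(uint32_t addr, uint32_t block_size) {
    return addr & ~(block_size - 1u);
}

/* True if the byte range [first, last] is exactly one aligned block. */
bool is_valid_block(uint32_t first, uint32_t last, uint32_t block_size) {
    return (first % block_size == 0) && (last - first + 1u == block_size);
}
```

With 32B blocks, `block_base(35, 32)` gives 32, and `is_valid_block(13, 44, 32)` is false because the range is the right size but starts at an unaligned address.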
Why Hierarchy Works For Duke Books

- **Temporal locality**
  - Recently accessed book likely to be accessed again soon

- **Spatial locality**
  - Books near recently accessed book likely to be accessed soon (assuming spatially nearby books are on same topic)
Why Hierarchy Works for Memory

- **Temporal locality**
  - Recently executed instructions likely to be executed again soon
    - Loops
  - Recently referenced data likely to be referenced again soon
    - Data in loops, hot global data

- **Spatial locality**
  - Insns near recently executed insns likely to be executed soon
    - Sequential execution
  - Data near recently referenced data likely to be referenced soon
    - Elements in array, fields in struct, variables in stack frame

- **Locality** is one of the most important concepts in computer architecture → don’t forget it!
Hierarchy Leverages Non-Uniform Patterns

- **10/90 rule (of thumb)**
  - For Instruction Memory:
    - 10% of static insns account for 90% of executed insns
    - Inner loops
  - For Data Memory:
    - 10% of variables account for 90% of accesses
    - Frequently used globals, inner loop stack variables

- What if processor accessed every block with equal likelihood? Small caches wouldn’t help much.
Memory Hierarchy: All About Performance

\[ t_{\text{avg}} = t_{\text{hit}} + \%_{\text{miss}} \times t_{\text{miss}} \]

- \( t_{\text{avg}} \) = average time to satisfy request at given level of hierarchy
- \( t_{\text{hit}} \) = time to hit (or discover miss) at given level
- \( t_{\text{miss}} \) = time to satisfy miss at given level
- Problem: hard to get low \( t_{\text{hit}} \) and \( \%_{\text{miss}} \) in one structure
  - Large structures have low \( \%_{\text{miss}} \) but high \( t_{\text{hit}} \)
  - Small structures have low \( t_{\text{hit}} \) but high \( \%_{\text{miss}} \)
- Solution: use a hierarchy of memory structures

“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available ... We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”

Burks, Goldstine, and Von Neumann, 1946
Memory Performance Equation

- For memory component M
  - **Access**: read or write to M
  - **Hit**: desired data found in M
  - **Miss**: desired data not found in M
    - Must get from another (slower) component
  - **Fill**: action of placing data in M

- $$\%_{\text{miss}}$$ (miss-rate): #misses / #accesses
- $$t_{\text{hit}}$$: time to read data from (write data to) M
- $$t_{\text{miss}}$$: time to read data into M from lower level

- **Performance metric**
  - $$t_{\text{avg}}$$: average access time
    - $$t_{\text{avg}} = t_{\text{hit}} + (\%_{\text{miss}} \times t_{\text{miss}})$$
Abstract Hierarchy Performance

How do we compute $t_{avg}$?

$= t_{avg-M1}$

$= t_{hit-M1} + (\%miss_{M1} * t_{miss-M1})$

$= t_{hit-M1} + (\%miss_{M1} * t_{avg-M2})$

$= t_{hit-M1} + (\%miss_{M1} * (t_{hit-M2} + (\%miss_{M2} * t_{miss-M2})))$

$= t_{hit-M1} + (\%miss_{M1} * (t_{hit-M2} + (\%miss_{M2} * t_{avg-M3})))$

$= \ldots$

Note: Miss at level X = access at level X+1
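
The recursion above maps directly onto a short function. This is a sketch under stated assumptions: the `struct level` layout and the function name `t_avg` are invented for illustration, and the last level is assumed to always hit.

```c
#include <stddef.h>

/* One level of the hierarchy (hypothetical layout for illustration). */
struct level {
    double t_hit;     /* time to hit (or discover a miss) at this level */
    double miss_rate; /* fraction of accesses that miss, 0.0 to 1.0 */
};

/* t_avg of level i = t_hit + miss_rate * t_avg of level i+1.
   Assumption: n >= 1 and the last level always hits, so its
   miss_rate is ignored. */
double t_avg(const struct level *levels, size_t n, size_t i) {
    if (i + 1 == n)
        return levels[i].t_hit;
    return levels[i].t_hit + levels[i].miss_rate * t_avg(levels, n, i + 1);
}
```

For example, with made-up levels of (1ns, 5%), (10ns, 20%), and (50ns, always hits), the result is 1 + 0.05 × (10 + 0.2 × 50) = 2ns.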
Typical Memory Hierarchy

- **1st level:** L1 I$, L1 D$ (L1 insn/data caches)
- **2nd level:** L2 cache (L2$)
  - Also on same chip with CPU
  - Made of SRAM (same circuit type as CPU)
  - Managed in hardware
  - This unit of ECE/CS 250
- **3rd level:** main memory
  - Made of DRAM
  - Managed in software
  - Next unit of ECE/CS 250
- **4th level:** disk (swap space)
  - Made of magnetic iron oxide discs
  - Managed in software
  - Course unit after main memory
- Could be other levels (e.g., Flash, PCM, tape, etc.)

Note: many processors have L3$ between L2$ and memory
• Much of today’s chips used for caches → important!
A Typical Die Photo

Intel Pentium4 Prescott chip with 2MB L2$

L2 Cache
A Closer Look at that Die Photo

Intel Pentium chip with 2x16kB split L1$
A Multicore Die Photo from IBM

IBM’s Xenon chip with 3 PowerPC cores
This Unit: Caches and Memory Hierarchies

- Memory hierarchy
- Cache organization
- Cache implementation
Cache Structure

- Cache consists of **frames**, and each frame is the storage to hold one block of data
  - Also holds a “**valid**” bit and a “**tag**” to label the block in that frame
- **Valid**: if 1, frame holds valid data; if 0, data is invalid
  - Useful? Yes. Example: when you turn on computer, cache is full of invalid “data”
- **Tag**: specifies which block is living in this frame
  - Useful? Yes. Far fewer frames than blocks of memory!

<table>
<thead>
<tr>
<th>valid</th>
<th>tag</th>
<th>block data</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>[64-95]</td>
<td>32 bytes of valid data</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>1</td>
<td>[0-31]</td>
<td>32 bytes of valid data</td>
</tr>
<tr>
<td>1</td>
<td>[1024-1055]</td>
<td>32 bytes of valid data</td>
</tr>
</tbody>
</table>
Cache Example (very simplified for now)

<table>
<thead>
<tr>
<th>valid</th>
<th>tag</th>
<th>block data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
</tbody>
</table>

- When computer turned on, no valid data in cache (everything is zero, including valid bits)
Cache Example (very simplified for now)

- Assume CPU asks for word at byte addresses [32-35]
  - Either due to a load or an instruction fetch
- Word [32-35] is part of block [32-63]
- Miss! (no blocks in cache yet)
- Fill cache (from lower level) with block [32-63]
  - don’t forget to set valid bit and write tag

<table>
<thead>
<tr>
<th>valid</th>
<th>tag</th>
<th>block data</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>[32-63]</td>
<td>32 bytes of valid data</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
</tbody>
</table>
Cache Example (very simplified for now)

<table>
<thead>
<tr>
<th>valid</th>
<th>tag</th>
<th>block data</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>[32-63]</td>
<td>32 bytes of valid data</td>
</tr>
<tr>
<td>1</td>
<td>[1024-1055]</td>
<td>32 bytes of valid data</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
</tbody>
</table>

• Assume CPU asks for word [1028-1031]
  • Either due to a load or an instruction fetch
• Word [1028-1031] is part of block [1024-1055]
• Miss!
• Fill cache (from lower level) with block [1024-1055]
Cache Example (very simplified for now)

<table>
<thead>
<tr>
<th>valid</th>
<th>tag</th>
<th>block data</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>[32-63]</td>
<td>32 bytes of valid data</td>
</tr>
<tr>
<td>1</td>
<td>[1024-1055]</td>
<td>32 bytes of valid data</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
<tr>
<td>0</td>
<td>[0-31]</td>
<td>32 bytes of junk</td>
</tr>
</tbody>
</table>

- Assume CPU asks (again!) for word [1028-1031]
  - Hit! Hooray for **temporal locality**

- Assume CPU asks for word [1032-1035]
  - Hit! Hooray for **spatial locality**

- Assume CPU asks for word [0-3]
  - Miss! Don’t forget those valid bits.
Where to Put Blocks in Cache

• How to decide which frame holds which block?
  • And then how to find block we’re looking for?

• Some more cache structure:
  • Divide cache into sets
    • A block can only go in its set → there is a 1-to-1 mapping from block address to set
  • Each set holds some number of frames = set associativity
    • E.g., 4 frames per set = 4-way set-associative

• At extremes
  • Whole cache has just one set = fully associative
    • Most flexible (longest access latency)
  • Each set has 1 frame = 1-way set-associative = “direct mapped”
    • Least flexible (shortest access latency)
Direct-Mapped (1-way) Cache

- Assume 8B blocks
- 8 sets, 1 way/set → 8 frames
- Each block can only be put into 1 set (1 option)
  - Block [0-7] → set 0
  - Block [8-15] → set 1
  - Block [16-23] → set 2
  
  ...  
  - Block [56-63] → set 7
  - Block [64-71] → set 0
  - Block [72-79] → set 1

- Block [X-(X+7)] → set (X/8)%8
  - The first 8 is the block size (8B); the second 8 is the number of sets
Direct-Mapped (1-way) Cache

• Assume 8B blocks
• Consider the following stream of 1-byte requests from the CPU:
  • 2, 11, 5, 50, 67, 51, 3
• Which hit? Which miss?

<table>
<thead>
<tr>
<th>Location</th>
<th>Set</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>11</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>0</td>
</tr>
<tr>
<td>50</td>
<td>6</td>
</tr>
<tr>
<td>67</td>
<td>0</td>
</tr>
<tr>
<td>51</td>
<td>6</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>way 0</th>
<th>valid</th>
<th>tag</th>
<th>data</th>
</tr>
</thead>
<tbody>
<tr>
<td>set 0</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 5</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 6</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 7</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
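
One way to check the answer is to simulate the cache. The sketch below models the 8-set, 8B-block direct-mapped cache from this slide, starting cold; the names `simulate_dm` and `struct frame` are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 8u
#define NUM_SETS   8u

struct frame { bool valid; uint32_t tag; };

/* Simulate a cold direct-mapped cache on a stream of byte addresses.
   hit[i] is set to true if access i hits. Returns the number of hits. */
int simulate_dm(const uint32_t *addrs, int n, bool *hit) {
    struct frame cache[NUM_SETS] = {0};
    int hits = 0;
    for (int i = 0; i < n; i++) {
        uint32_t block = addrs[i] / BLOCK_SIZE;
        uint32_t set = block % NUM_SETS;   /* which frame to look in */
        uint32_t tag = block / NUM_SETS;   /* which block lives there */
        hit[i] = cache[set].valid && cache[set].tag == tag;
        if (hit[i]) {
            hits++;
        } else {                           /* miss: fill the frame */
            cache[set].valid = true;
            cache[set].tag = tag;
        }
    }
    return hits;
}
```

For the stream 2, 11, 5, 50, 67, 51, 3 this reports miss, miss, hit, miss, miss, hit, miss: the access to 67 evicts block [0-7] from set 0, so the final access to 3 misses.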
Problem with Direct Mapped Caches

- Assume 8B blocks
- Consider the following stream of 1-byte requests from the CPU:
  - 2, 67, 2, 67, 2, 67, 2, 67, ...
- Which hit? Which miss?
- Did we make good use of all of our cache capacity?

<table>
<thead>
<tr>
<th>Location</th>
<th>Set</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>67</td>
<td>0</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>valid</th>
<th>tag</th>
<th>data</th>
</tr>
</thead>
<tbody>
<tr>
<td>set 0</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 5</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 6</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 7</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
2-Way Set-Associativity

<table>
<thead>
<tr>
<th></th>
<th colspan="3">way 0</th>
<th colspan="3">way 1</th>
</tr>
<tr>
<th></th>
<th>valid</th>
<th>tag</th>
<th>data</th>
<th>valid</th>
<th>tag</th>
<th>data</th>
</tr>
</thead>
<tbody>
<tr>
<td>set 0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>set 3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- 4 sets, 2 ways/set → 8 frames (just like our 1-way cache)
  - Block [0-7] → set 0
  - Block [8-15] → set 1
  - Block [16-23] → set 2
  - Block [24-31] → set 3
  - Block [32-39] → set 0
  - Etc.
• Assume the same pathological stream of CPU requests:
  • Byte addresses 2, 67, 2, 67, 2, 67, etc.
  • Which hit? Which miss?

• Now how about this: 2, 67, 131, 2, 67, 131, etc.

• How much more associativity can we have?
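
Both streams can be checked with a small simulation. This is a sketch that assumes LRU replacement (the slide does not name a policy) on the 4-set, 2-way, 8B-block cache above; `simulate_2way` and the struct names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 8u
#define NUM_SETS   4u
#define NUM_WAYS   2

struct way { bool valid; uint32_t tag; };
struct set { struct way ways[NUM_WAYS]; int lru; /* way to evict next */ };

/* Count hits for a stream of byte addresses on a cold 2-way cache.
   Assumption: LRU replacement within each set. */
int simulate_2way(const uint32_t *addrs, int n) {
    struct set cache[NUM_SETS] = {0};
    int hits = 0;
    for (int i = 0; i < n; i++) {
        uint32_t block = addrs[i] / BLOCK_SIZE;
        struct set *s = &cache[block % NUM_SETS];
        uint32_t tag = block / NUM_SETS;
        int w = -1;
        for (int k = 0; k < NUM_WAYS; k++)
            if (s->ways[k].valid && s->ways[k].tag == tag)
                w = k;
        if (w >= 0) {
            hits++;
        } else {
            w = s->lru;               /* evict the least recently used way */
            s->ways[w].valid = true;
            s->ways[w].tag = tag;
        }
        s->lru = 1 - w;               /* the other way is now LRU */
    }
    return hits;
}
```

Under these assumptions, 2, 67, 2, 67, 2, 67 misses twice and then always hits, because both blocks fit in set 0. The stream 2, 67, 131, 2, 67, 131 never hits: three blocks compete for two ways, and LRU always evicts the block that is needed next.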
Full Associativity

<table>
<thead>
<tr>
<th></th>
<th>way 0</th>
<th>way 1</th>
<th>way 2</th>
<th>way 3</th>
<th>way 4</th>
<th>way 5</th>
<th>way 6</th>
<th>way 7</th>
</tr>
</thead>
<tbody>
<tr>
<td>set 0</td>
<td>v t d</td>
<td>v t d</td>
<td>v t d</td>
<td>v t d</td>
<td>v t d</td>
<td>v t d</td>
<td>v t d</td>
<td>v t d</td>
</tr>
</tbody>
</table>

- 1 set, 8 ways/set → 8 frames (just like previous examples)
  - Block [0-7] → set 0
  - Block [8-15] → set 0
  - Block [16-23] → set 0
  - Etc.
Mapping Addresses to Sets

- MIPS has 32-bit addresses
  - Let’s break down address into three components
- If blocks are 8B, then \( \log_2 8 = 3 \) bits required to identify a byte within a block. These bits are called block offset.
  - Given block, offset tells you which byte within block
- If there are \( S \) sets, then \( \log_2 S \) bits required to identify the set. These bits are called set index or just index.
- Rest of the bits (32-3- \( \log_2 S \)) specify the tag

| tag | index | block offset |
Mapping Addresses to Sets

- How many blocks map to the same set?
- Let’s assume 8-byte blocks
  - $8 = 2^3 \rightarrow$ 3 bits to specify block offset
- Let’s assume we have direct-mapped cache with 256 sets
  - 256 sets = $2^8$ sets $\rightarrow$ 8 bits to specify set index
- $2^{32}$ bytes of memory/(8 bytes/block) = $2^{29}$ blocks
- $2^{29}$ blocks / 256 sets = $2^{21}$ blocks / set
- So that means we need $2^{21}$ tags to distinguish between all possible blocks in the set $\rightarrow$ 21 tag bits
  - Note: $21 = 32 - 3 - 8$ 😊

| tag (21) | index (8) | block offset (3) |
Mapping Addresses to Sets

- Assume cache from previous slide (8B blocks, 256 sets)
- Example: What do we do with the address 10?

```
0000 0000 0000 0000 0000 0000 0000 1010
```
- offset = 2 (byte 2 of the block, counting from 0)
- index=1 (set 1)
- tag = 0

- This matches what we did before – recall:
  - Block [0-7] → set 0
  - Block [8-15] → set 1
  - Block [16-23] → set 2
  - etc.
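
The same split can be written with shifts and masks. A minimal sketch for the 8B-block, 256-set cache above; the struct and function names are illustrative assumptions.

```c
#include <stdint.h>

#define OFFSET_BITS 3u  /* 8B blocks -> 3 offset bits */
#define INDEX_BITS  8u  /* 256 sets  -> 8 index bits  */

struct addr_fields { uint32_t tag, index, offset; };

/* Split a 32-bit address into | tag (21) | index (8) | offset (3) |. */
struct addr_fields split_address(uint32_t addr) {
    struct addr_fields f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}
```

`split_address(10)` yields offset 2, index 1, tag 0. Address 2058 (= 2048 + 10) lands in the same set with the same offset but has tag 1.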
Cache Replacement Policies

- Set-associative caches present a new design choice
  - On cache miss, which block in set to replace (kick out)?
- Some options
  - **Random**
  - **LRU (least recently used)**
    - Fits with temporal locality, LRU = least likely to be used in future
  - **NMRU (not most recently used)**
    - An easier-to-implement approximation of LRU
    - NMRU=LRU for 2-way set-associative caches
  - **FIFO (first-in first-out)**
    - When is this a good idea?
ABCs of Cache Design

• Architects control three primary aspects of cache design
  • And can choose for each cache independently
• A = Associativity
• B = Block size
• C = Capacity of cache

• Secondary aspects of cache design
  • Replacement algorithm
  • Some other more subtle issues we’ll discuss later
Analyzing Cache Misses: 3C Model

- Divide cache misses into three categories
  - **Compulsory (cold)**: never seen this address before
    - Easy to identify
  - **Capacity**: miss caused because cache is too small – would’ve been miss even if cache had been fully associative
    - Consecutive accesses to block separated by accesses to at least N other distinct blocks where N is number of frames in cache
  - **Conflict**: miss caused because cache associativity is too low – would’ve been hit if cache had been fully associative
    - All other misses
3C Example

- Assume 8B blocks
- Consider the following stream of 1-byte requests from the CPU:
  - 2, 11, 5, 50, 67, 128, 256, 512, 1024, 2
- Is the last access a capacity miss or a conflict miss?

<table>
<thead>
<tr>
<th>Location</th>
<th>Set</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>11</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>0</td>
</tr>
<tr>
<td>50</td>
<td>6</td>
</tr>
<tr>
<td>67</td>
<td>0</td>
</tr>
<tr>
<td>128</td>
<td>0</td>
</tr>
<tr>
<td>256</td>
<td>0</td>
</tr>
<tr>
<td>512</td>
<td>0</td>
</tr>
<tr>
<td>1024</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
</tr>
</tbody>
</table>
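
The 3C definitions suggest a mechanical test: run the stream through the real cache and through a fully associative LRU cache of the same size, and compare. The sketch below does this for the 8-frame direct-mapped cache in this example; `classify_last`, the enum, and the `MAX_BLOCKS` bound are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 8u
#define NUM_FRAMES 8      /* 8 sets x 1 way, as in the example */
#define MAX_BLOCKS 1024   /* illustrative bound on distinct blocks tracked */

enum kind { HIT, COMPULSORY, CAPACITY, CONFLICT };

/* Classify the final access of a stream under the 3C model. */
enum kind classify_last(const uint32_t *addrs, int n) {
    uint32_t dm_block[NUM_FRAMES];  bool dm_valid[NUM_FRAMES] = {false};
    uint32_t fa_block[NUM_FRAMES];  int fa_last[NUM_FRAMES];  int fa_used = 0;
    uint32_t seen[MAX_BLOCKS];      int n_seen = 0;
    enum kind result = HIT;

    for (int t = 0; t < n; t++) {
        uint32_t b = addrs[t] / BLOCK_SIZE;

        /* Has this block ever been referenced before? */
        bool first_ref = true;
        for (int i = 0; i < n_seen; i++)
            if (seen[i] == b) first_ref = false;
        if (first_ref && n_seen < MAX_BLOCKS) seen[n_seen++] = b;

        /* Direct-mapped lookup and fill. */
        uint32_t set = b % NUM_FRAMES;
        bool dm_hit = dm_valid[set] && dm_block[set] == b;
        dm_valid[set] = true;
        dm_block[set] = b;

        /* Fully associative LRU lookup and fill. */
        int way = -1, victim = 0;
        for (int i = 0; i < fa_used; i++) {
            if (fa_block[i] == b) way = i;
            if (fa_last[i] < fa_last[victim]) victim = i;
        }
        bool fa_hit = (way >= 0);
        if (!fa_hit) way = (fa_used < NUM_FRAMES) ? fa_used++ : victim;
        fa_block[way] = b;
        fa_last[way] = t;   /* timestamp of most recent use */

        if (dm_hit)          result = HIT;
        else if (first_ref)  result = COMPULSORY;
        else if (!fa_hit)    result = CAPACITY;
        else                 result = CONFLICT;
    }
    return result;
}
```

For the stream on this slide the function returns `CONFLICT`: the stream touches only 8 distinct blocks, so a fully associative cache with 8 frames would still hold block [0-7] at the last access.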
ABCs of Cache Design and 3C Model

- **Associativity** (increase, all else equal)
  - + Decreases conflict misses
  - − Increases $t_{\text{hit}}$

- **Block size** (increase, all else equal)
  - − Increases conflict misses
  - + Decreases compulsory misses
  - ± Increases or decreases capacity misses
  - • Negligible effect on $t_{\text{hit}}$

- **Capacity** (increase, all else equal)
  - + Decreases capacity misses
  - − Increases $t_{\text{hit}}$
Inclusion/Exclusion

- If L2 holds every block that is in L1 (L2's contents are a superset of L1's), then L2 is **inclusive** with respect to L1.
- If L2 holds no block that is in L1, then L2 and L1 are **exclusive**.
- L2 could be neither inclusive nor exclusive.
  - Has some blocks in L1 but not all.
- This issue matters a lot for multicores, but not a major issue in this class.
- Same issue for L3/L2.
Stores: Write-Through vs. Write-Back

• When to propagate new value to (lower level) memory?
  • **Write-through**: immediately (as soon as store writes to this level)
    + Conceptually simpler
    + Uniform latency on misses
    – Requires additional bandwidth to next level
  • **Write-back**: later, when block is replaced from this level
    • Requires additional “dirty” bit per block → why?
    + Minimal bandwidth to next level
      • Only write back dirty blocks
    – Non-uniform miss latency
      • Miss that evicts clean block: just a fill from lower level
      • Miss that evicts dirty block: writeback dirty block and then fill from lower level
Stores: Write-allocate vs. Write-non-allocate

• What to do on a write miss?
  • **Write-allocate**: read block from lower level, write value into it
    + Decreases read misses
    – Requires additional bandwidth
  • Use with write-back
  • **Write-non-allocate**: just write to next level
    – Potentially more read misses
    + Uses less bandwidth
  • Use with write-through
Optimization: Write Buffer

- **Write buffer**: between cache and memory
  - Write-through cache? Helps with store misses
    + Write to buffer to avoid waiting for next level
  - Store misses become store hits
  - Write-back cache? Helps with dirty misses
    + Allows you to do read (important part) first
      1. Write dirty block to buffer
      2. Read new block from next level to cache
      3. Write buffer contents to next level
Typical Processor Cache Hierarchy

• First level caches: optimized for $t_{\text{hit}}$ and parallel access
  • Insns and data in separate caches (I$, D$) → why?
  • Capacity: 8–64KB, block size: 16–64B, associativity: 1–4
  • Other: write-through or write-back
  • $t_{\text{hit}}$: 1–4 cycles

• Second level cache (L2): optimized for $\%_{\text{miss}}$
  • Insns and data in one cache for better utilization
  • Capacity: 128KB–1MB, block size: 64–256B, associativity: 4–16
  • Other: write-back
  • $t_{\text{hit}}$: 10–20 cycles

• Third level caches (L3): also optimized for $\%_{\text{miss}}$
  • Capacity: 2–16MB
  • $t_{\text{hit}}$: ~30 cycles
Performance Calculation Example

- **Parameters**
  - Reference stream: 20% stores, 80% loads
  - L1 D$: \( t_{hit} = 1\text{ns}, \%_{miss} = 5\%, \) write-through + write-buffer
  - L2: \( t_{hit} = 10\text{ns}, \%_{miss} = 20\%, \) write-back, 50% dirty blocks
  - Main memory: \( t_{hit} = 50\text{ns}, \%_{miss} = 0\% \)

- **What is** \( t_{avgL1D$} \) **without an L2?**
  - Write-through+write-buffer means all stores effectively hit
  - \( t_{missL1D$} = t_{hitM} \)
  - \( t_{avgL1D$} = t_{hitL1D$} + \%_{loads} \times \%_{missL1D$} \times t_{hitM} = 1\text{ns} + (0.8 \times 0.05 \times 50\text{ns}) = 3\text{ns} \)

- **What is** \( t_{avgL1D$} \) **with an L2?**
  - \( t_{missL1D$} = t_{avgL2} \)
  - Write-back (no buffer) means dirty misses cost double
  - \( t_{avgL2} = t_{hitL2} + (1 + \%_{dirty}) \times \%_{missL2} \times t_{hitM} = 10\text{ns} + (1.5 \times 0.2 \times 50\text{ns}) = 25\text{ns} \)
  - \( t_{avgL1D$} = t_{hitL1D$} + \%_{loads} \times \%_{missL1D$} \times t_{avgL2} = 1\text{ns} + (0.8 \times 0.05 \times 25\text{ns}) = 2\text{ns} \)
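
The arithmetic above can be reproduced with two small functions. This is a sketch; the function and parameter names are illustrative, and the formulas are exactly the ones on this slide.

```c
/* Average access time seen at the L1 D$ when the next level has
   average access time t_next. Stores are assumed to hit thanks to
   the write buffer, so only loads pay the miss penalty. */
double t_avg_l1(double t_hit_l1, double load_frac,
                double miss_l1, double t_next) {
    return t_hit_l1 + load_frac * miss_l1 * t_next;
}

/* Average access time of a write-back L2 with no write buffer:
   a miss that evicts a dirty block pays the memory latency twice. */
double t_avg_l2(double t_hit_l2, double miss_l2,
                double dirty_frac, double t_mem) {
    return t_hit_l2 + (1.0 + dirty_frac) * miss_l2 * t_mem;
}
```

With the parameters above, `t_avg_l1(1, 0.8, 0.05, 50)` is 3ns without an L2, `t_avg_l2(10, 0.2, 0.5, 50)` is 25ns, and `t_avg_l1(1, 0.8, 0.05, 25)` is 2ns with the L2.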
Cost of Tags

• “4KB cache” means cache holds 4KB of data
  • Called **capacity**
  • Tag storage is considered overhead (not included in capacity)
• Calculate tag overhead of 4KB cache with 1024 4B frames
  • Not including valid bits
  • 4B frames → 2-bit offset
  • 1024 frames → 10-bit index
  • 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
  • 20-bit tag * 1024 frames = 20Kb tags = 2.5KB tags
  • 63% overhead → much higher than usual because blocks are so small (and cache is small)
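
The same calculation works for other configurations. A minimal sketch for a direct-mapped cache, assuming power-of-two sizes; the helper names are illustrative.

```c
#include <stdint.h>

/* Number of bits needed to index n items; assumes n is a power of two. */
static unsigned log2_u32(uint32_t n) {
    unsigned bits = 0;
    while (n > 1u) { n >>= 1; bits++; }
    return bits;
}

/* Total tag storage in bits for a direct-mapped cache
   (valid bits not included). */
uint32_t tag_storage_bits(uint32_t capacity_bytes, uint32_t block_bytes,
                          unsigned addr_bits) {
    uint32_t frames = capacity_bytes / block_bytes;
    unsigned tag_bits = addr_bits - log2_u32(block_bytes) - log2_u32(frames);
    return tag_bits * frames;
}
```

`tag_storage_bits(4096, 4, 32)` returns 20480 bits (2.5KB), matching the slide. With 64B blocks the same 4KB cache needs only 1280 bits (160B) of tags, roughly 4% overhead.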
Two (of many possible) Optimizations

- **Victim buffer**: for conflict misses
- **Prefetching**: for capacity/compulsory misses
Victim Buffer

- Conflict misses: not enough associativity
  - High-associativity is expensive, but also rarely needed
    - 3 blocks mapping to same 2-way set and accessed (ABC)*

- **Victim buffer (VB):** small FA cache (e.g., 4 entries)
  - Sits on I$/D$ fill path
  - VB is small $\rightarrow$ very fast
  - Blocks kicked out of I$/D$ placed in VB
  - On miss, check VB: hit? Place block back in I$/D$
  - 4 extra ways, shared among all sets
    - Only a few sets will need it at any given time
    - Very effective in practice
Prefetching

- **Prefetching**: put blocks in cache proactively/speculatively
  - Key: anticipate upcoming miss addresses accurately
    - Can do in software or hardware

- Simple example: **next block prefetching**
  - Miss on address $X$ → anticipate miss on $X + \text{block-size}$
  - Works for insns: sequential execution
  - Works for data: arrays

- **Timeliness**: initiate prefetches sufficiently in advance
- **Accuracy**: don’t evict useful data
What this means to the programmer

• If you’re writing code, you want good performance.

• The cache is crucial to getting good performance.

• The effect of the cache is influenced by the order of memory accesses.

CONCLUSION:
The programmer can change the order of memory accesses to improve performance!
Cache performance matters!

- A **HUGE** component of software performance is how it interacts with cache

Example:

Which will have fewer cache misses?

```c
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];
```

```c
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];
```

Adapted from Lebeck and Porter (creative commons)
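
To make the comparison concrete, the two loop orders can be replayed against a toy cache model. This is a sketch under loud assumptions: 4-byte elements, `x` starting at a block-aligned address, and a 32KB direct-mapped cache with 64B blocks (none of these parameters come from the slide). It counts misses for one pass over the array (one value of `k`).

```c
#include <stdint.h>
#include <stdbool.h>

#define ROWS 5000
#define COLS 100
#define BLOCK_SIZE 64u   /* assumed block size in bytes */
#define NUM_FRAMES 512u  /* assumed 32KB direct-mapped cache */

/* Count misses for one pass over x[ROWS][COLS] stored row-major.
   If row_inner is true, j is the innermost loop (walks along a row);
   otherwise i is innermost (walks down a column). */
long count_misses(bool row_inner) {
    static uint32_t tag[NUM_FRAMES];
    static bool valid[NUM_FRAMES];
    long misses = 0;
    for (uint32_t f = 0; f < NUM_FRAMES; f++) valid[f] = false;

    int outer = row_inner ? ROWS : COLS;
    int inner = row_inner ? COLS : ROWS;
    for (int a = 0; a < outer; a++) {
        for (int b = 0; b < inner; b++) {
            int i = row_inner ? a : b;
            int j = row_inner ? b : a;
            uint32_t addr = (uint32_t)(i * COLS + j) * 4u; /* offset of x[i][j] */
            uint32_t block = addr / BLOCK_SIZE;
            uint32_t set = block % NUM_FRAMES;
            if (!valid[set] || tag[set] != block) {
                misses++;
                valid[set] = true;
                tag[set] = block;   /* full block number used as the tag */
            }
        }
    }
    return misses;
}
```

In this model the row-order walk (`j` innermost) misses once per 64B block, while the column-order walk (`i` innermost) jumps 400 bytes per access and misses far more often.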
Blocking (Tiling) Example

```c
/* Before */

for(i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
        for (k = 0; k < SIZE; k++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
```

- **Two Inner Loops:**
  - Read all NxN elements of b[ ] (N = SIZE)
  - Read N elements of 1 row of a[ ] repeatedly
  - Write N elements of 1 row of c[ ]

- **Capacity Misses a function of N & Cache Size:**
  - Cache can hold all 3 NxN matrices => no capacity misses; otherwise ...

- **Idea:** compute on BxB submatrix that fits

Adapted from Lebeck and Porter (creative commons)
```c
/* After */
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* B is the blocking factor; each tile covers indices [ii, min(ii+B, SIZE)) */
for(ii = 0; ii < SIZE; ii += B)
    for (jj = 0; jj < SIZE; jj += B)
        for (kk = 0; kk < SIZE; kk += B)
            for(i = ii; i < MIN(ii+B,SIZE); i++)
                for (j = jj; j < MIN(jj+B,SIZE); j++)
                    for (k = kk; k < MIN(kk+B,SIZE); k++)
                        c[i][j] = c[i][j] + a[i][k]*b[k][j];
```

• Capacity Misses decrease from $2N^3 + N^2$ to $2N^3/B + N^2$
• B called *Blocking Factor (Also called Tile Size)*

Adapted from Lebeck and Porter (creative commons)
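
Tiling only reorders the additions, so it must produce the same product as the plain triple loop. The following sketch checks that on a small integer matrix whose size is deliberately not a multiple of the tile size; the function name, the fill values, and the half-open tile bounds `[ii, min(ii + B, SIZE))` are choices made for this illustration.

```c
#include <stdbool.h>

#define SIZE 10  /* deliberately not a multiple of B, to exercise edge tiles */
#define B 4
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Multiply with both loop nests and report whether the results agree. */
bool tiled_matches_naive(void) {
    static int a[SIZE][SIZE], b[SIZE][SIZE], c1[SIZE][SIZE], c2[SIZE][SIZE];
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++) {
            a[i][j] = i + j;          /* arbitrary test values */
            b[i][j] = i - j;
            c1[i][j] = c2[i][j] = 0;
        }

    /* Untiled reference. */
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            for (int k = 0; k < SIZE; k++)
                c1[i][j] += a[i][k] * b[k][j];

    /* Tiled: each tile covers indices [xx, min(xx + B, SIZE)). */
    for (int ii = 0; ii < SIZE; ii += B)
        for (int jj = 0; jj < SIZE; jj += B)
            for (int kk = 0; kk < SIZE; kk += B)
                for (int i = ii; i < MIN(ii + B, SIZE); i++)
                    for (int j = jj; j < MIN(jj + B, SIZE); j++)
                        for (int k = kk; k < MIN(kk + B, SIZE); k++)
                            c2[i][j] += a[i][k] * b[k][j];

    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            if (c1[i][j] != c2[i][j]) return false;
    return true;
}
```

A check like this is cheap insurance when writing tiled loops: an off-by-one in a tile bound silently drops elements rather than crashing.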
Hilbert curves: A fancy trick for matrix locality

• Turn a 1D value into an n-dimensional “walk” of a cube space (like a 2D or 3D matrix) in a manner that maximizes locality

• Extra overhead to compute curve path, but computation takes no memory, and cache misses are very expensive, so it may be worth it

• (Actual algorithm for these curves is simple and easy to find)
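
For reference, here is a sketch of one commonly published iterative formulation that converts a 1D position `d` along the curve into 2D coordinates. It assumes the grid side `n` is a power of two; the function names are illustrative.

```c
/* Rotate/flip a quadrant so the sub-curve is oriented correctly. */
static void hilbert_rot(int n, int *x, int *y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x;  /* swap x and y */
        *x = *y;
        *y = t;
    }
}

/* Convert distance d along the curve (0 .. n*n-1) into (x, y) on an
   n x n grid. Assumption: n is a power of two. */
void hilbert_d2xy(int n, int d, int *x, int *y) {
    int t = d;
    *x = 0;
    *y = 0;
    for (int s = 1; s < n; s *= 2) {
        int rx = 1 & (t / 2);
        int ry = 1 & (t ^ rx);
        hilbert_rot(s, x, y, rx, ry);
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}
```

Consecutive values of `d` always map to neighboring cells, which is the property that helps locality when a matrix is traversed in curve order.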
This Unit: Caches and Memory Hierarchies

- Memory hierarchy
- Cache organization
- Cache implementation
How to Build Large Storage Components?

• Functionally, we could implement large storage as a vast number of D flip-flops

• But for big storage, our goal is density (bits/area)
  • And FFs are big: ~32 transistors per bit

• It turns out we can get much better density
  • And this is what we do for caches (and for register files)
Static Random Access Memory (SRAM)

- Reality: large storage arrays implemented in “analog” way
  - Bits as cross-coupled inverters, not flip-flops
    - Inverters: 2 gates = 4 transistors per bit
    - Flip-flops: 8 gates ≈ 32 transistors per bit
- Ports implemented as shared buses called bitlines (next slide)
- Called **SRAM (static random access memory)**
  - “Static” → a written bit maintains its value (doesn’t leak out)
  - But still volatile → bit loses value if chip loses power
- Example: storage array with two 2-bit words

[Figure: SRAM storage array with two 2-bit words; columns are bit 0 and bit 1, rows are word 0 and word 1]
• To write (a 1):
  1. Drive bit lines ($\text{bit}=1$, $\overline{\text{bit}}=0$)
  2. Select row

• To read:
  1. Pre-charge bit and $\overline{\text{bit}}$ to Vdd (set to 1)
  2. Select row
  3. Cell pulls one line lower (pulls towards 0)
  4. Sense amp on column detects difference between bit and $\overline{\text{bit}}$
Typical SRAM Organization: 16-word x 4-bit

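[Figure: 16-word x 4-bit SRAM array. Address decoder (inputs A0 to A3) selects a row of SRAM cells; write drivers and prechargers (inputs Din 0 to 3, WrEn) sit on one side of each column and sense amps (outputs Dout 0 to 3) on the other]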
• Write Enable is usually active low (WE_L)
• Din and Dout are combined (D) to save pins:
  • A new control signal, output enable (OE_L) is needed
  • WE_L is asserted (Low), OE_L is de-asserted (High)
    • D serves as the data input pin
  • WE_L is de-asserted (High), OE_L is asserted (Low)
    • D is now the data output pin
• Both WE_L and OE_L are asserted:
  • Result is unknown. Don’t do that!!!
SRAM Executive Summary

- Large storage arrays cannot be implemented “digitally”
  - Muxing and wire routing become impractical
- SRAM implementation exploits analog transistor properties
  - Inverter pair bits much smaller than flip-flop bits
  - Wordline/bitline arrangement makes for simple “grid-like” routing
  - Basic understanding of reading and writing
    - Wordlines select words
    - Overwhelm inverter-pair to write
    - Drain pre-charged line or swing voltage to read
  - Access latency proportional to $\sqrt{\#\text{bits} \times \#\text{ports}}$

- You must understand important properties of SRAM
  - Will help when we talk about DRAM (next unit)
Basic Cache Structure

- Basic cache: array of block frames
  - Example: 4KB cache made up of 1024 4B frames
- To find frame: decode part of address
  - Which part?
  - 32-bit address
  - 4B blocks → 2 LS bits locate byte within block
    - These are called offset bits
  - 1024 frames → next 10 bits find frame
    - These are the index bits
  - Note: nothing says index must be these bits
  - But these work best (think about why)
Basic Cache Structure

- Each frame can hold one of \(2^{20}\) blocks
  - All blocks with same index bit pattern
- How to know which if any is currently there?
  - To each frame attach **tag** and **valid bit**
  - Compare frame tag to address **tag bits**
    - No need to match index bits (why?)
- Lookup algorithm
  - Read frame indicated by index bits
  - If (tag matches && valid bit set)
    then Hit → data is good
    Else Miss → data is no good, wait
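
The lookup algorithm can be sketched in software for the 4KB, 1024-frame, 4B-block cache used on these slides. The struct layout and the function names `cache_lookup` and `cache_fill` are assumptions made for illustration; real hardware does the tag compare and data read in parallel.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_FRAMES  1024u
#define BLOCK_BYTES 4u

struct frame {
    bool     valid;
    uint32_t tag;                 /* upper 20 address bits */
    uint8_t  data[BLOCK_BYTES];
};

struct cache { struct frame frames[NUM_FRAMES]; };

/* Returns true on a hit and copies the block into out. */
bool cache_lookup(const struct cache *c, uint32_t addr,
                  uint8_t out[BLOCK_BYTES]) {
    uint32_t index = (addr >> 2) & (NUM_FRAMES - 1u);  /* bits [11:2]  */
    uint32_t tag   = addr >> 12;                       /* bits [31:12] */
    const struct frame *f = &c->frames[index];
    if (f->valid && f->tag == tag) {
        memcpy(out, f->data, BLOCK_BYTES);
        return true;
    }
    return false;   /* miss: caller must fetch from the next level */
}

/* Fill the frame for addr with a block fetched from the next level. */
void cache_fill(struct cache *c, uint32_t addr,
                const uint8_t block[BLOCK_BYTES]) {
    struct frame *f = &c->frames[(addr >> 2) & (NUM_FRAMES - 1u)];
    f->valid = true;
    f->tag = addr >> 12;
    memcpy(f->data, block, BLOCK_BYTES);
}
```

Two addresses that differ only above bit 11 select the same frame, which is why the tag compare is needed and why the index bits do not need to be stored in the tag.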
• **Set-associativity**
  - Block can reside in one of few frames
  - Frame groups called **sets**
  - Each frame in set called a **way**
  - This is **2-way set-associative (SA)**
  - 1-way → **direct-mapped (DM)**
  - 1-set → **fully-associative (FA)**

+ Reduces conflicts
– Increases $t_{hit}$: additional mux
Set-Associativity

- Lookup algorithm
  - Use index bits to find set
  - Read data/tags in all frames in parallel
  - **Any** (match && valid bit)?
    - Then Hit
    - Else Miss
  - Notice tag/index/offset bits

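[Figure: 2-way set-associative cache with 512 sets; way 0 holds frames 0-511 and way 1 holds frames 512-1023. The CPU address is split into tag [31:11], index [10:2], and offset [1:0]; one comparator (==) per way produces hit/miss]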
NMRU and Miss Handling

- Add MRU field to each set
  - MRU data is encoded “way”
  - Hit? update MRU
  - Fill? write enable ~MRU
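
A software sketch of the NMRU bookkeeping follows. It assumes a 4-way set and picks among the non-MRU ways pseudo-randomly; both choices, and the names `nmru_touch` and `nmru_victim`, are illustrative rather than taken from the slides.

```c
#include <stdlib.h>

#define NUM_WAYS 4   /* assumed associativity for this sketch */

struct set_state { int mru; /* way most recently hit or filled */ };

/* On a hit (or after a fill), record the way as most recently used. */
void nmru_touch(struct set_state *s, int way) {
    s->mru = way;
}

/* Pick any way other than the MRU one. Which non-MRU way to pick is a
   free choice; this sketch picks pseudo-randomly with rand(). */
int nmru_victim(const struct set_state *s) {
    int v = rand() % (NUM_WAYS - 1);   /* 0 .. NUM_WAYS-2 */
    return (v >= s->mru) ? v + 1 : v;  /* skip over the MRU way */
}
```

With only two ways the single non-MRU way is also the least recently used one, which is why NMRU and LRU coincide for 2-way caches.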
• How to implement full (or at least high) associativity?
  • Doing it this way is terribly inefficient
  • 1K matches are unavoidable, but 1K data reads + 1K-to-1 mux?
Normal RAM vs Content Addressable Memory

**RAM**
- Cell number 5, what are you storing?

**CAM**
- Attention all cells, will the owner of data “5” please stand up?
**Full-Associativity with CAMs**

- **CAM**: content addressable memory
  - Array of words with built-in comparators
  - Matchlines instead of bitlines
  - Output is “one-hot” encoding of match

- **FA cache?**
  - Tags as CAM
  - Data as RAM
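
The behavior of a CAM, though not its speed, can be mimicked in software. In this sketch the 8-entry size and the names `struct cam` and `cam_match` are assumptions; the loop stands in for comparators that all fire at once in hardware.

```c
#include <stdint.h>
#include <stdbool.h>

#define CAM_ENTRIES 8   /* assumed size for this sketch */

struct cam {
    bool     valid[CAM_ENTRIES];
    uint32_t tag[CAM_ENTRIES];
};

/* Compare key against every entry; bit i of the result is set when
   entry i is valid and matches. If stored tags are unique, the result
   is one-hot on a hit and zero on a miss. */
uint32_t cam_match(const struct cam *c, uint32_t key) {
    uint32_t match = 0;
    for (int i = 0; i < CAM_ENTRIES; i++)
        if (c->valid[i] && c->tag[i] == key)
            match |= 1u << i;
    return match;
}
```

In a fully associative cache built this way, the one-hot result would select the matching row of the data RAM directly, with no index decode.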
CAM Circuit

- **Matchlines** (correspond to bitlines in SRAM): inputs
- **Wordlines**: outputs

- **Two phase match**
  - Phase I: clk=1, pre-charge wordlines to 1
  - Phase II: clk=0, enable matchlines, non-matched bits dis-charge wordlines
CAM Circuit In Action

- Phase I: clk=1
  - Pre-charge wordlines to 1
CAM Circuit In Action

• Phase II: clk=0
  • Enable matchlines (notice, match bits are flipped)
  • Any non-matching bit discharges entire wordline
    • Implicitly ANDs all bit matches (NORs all bit non-matches)
  • Similar technique for doing a fast OR for hit detection
CAM Upshot

• CAMs are effective but expensive
  – Matchlines are very expensive (for nasty circuit-level reasons)
• CAMs are used but only for 16 or 32 way (max) associativity
  • See an example soon
• Not for 1024-way associativity
  – No good way of doing something like that
  + No real need for it either
Stores: Tag/Data Access

- **Reads**: read tag and data in parallel
  - Tag mis-match → data is garbage (OK)

- **Writes**: read tag, write data in parallel?
  - Tag mis-match → clobbered data (oops)
  - For SA cache, which way is written?

- **Writes are a pipelined 2 cycle process**
  - Cycle 1: match tag
  - Cycle 2: write to matching way
Stores: Tag/Data Access

- Cycle 1: check tag
  - Hit? Write data next cycle
  - Miss? Depends (write-alloc or write-no-alloc)
• Cycle 2 (if hit): write data
This Unit: Caches and Memory Hierarchies

- Memory hierarchy
- Cache organization
- Cache implementation