Caching and Performance of CPUs

Why Caches are Important:

  • Memory-Processor Speed Gap: Modern CPUs operate at extremely high clock speeds, often in the range of 3-4 GHz (billions of cycles per second). On the other hand, accessing main memory (RAM) is much slower, with latency typically around 50-100 ns.

  • Cache Basics: A cache is a small, faster memory located closer to the CPU. It holds copies of frequently accessed data from main memory, allowing the CPU to access it quickly. Modern systems use multi-level caches (L1, L2, L3) to strike a balance between speed, size, and cost.

Multi-level Cache System:

  • L1 Cache: This is the fastest and smallest cache, located directly on each CPU core. It has a very low latency, typically 1-4 cycles (roughly a nanosecond or less). It's usually split into separate instruction and data caches. The size ranges from 32 KB to 128 KB.

  • L2 Cache: Slightly larger and slower than L1, typically around 256 KB to 2 MB, with a latency of 7-14 cycles (~3-5 ns).

  • L3 Cache: Shared across all cores in a multi-core processor, L3 caches are larger (4 MB to 64 MB) but slower, with latencies of 20-40 cycles (~10-20 ns). However, it's still much faster than RAM.

  • RAM: Accessing RAM takes 50-100 ns, which can be more than 100 CPU cycles.

Reducing Wait Cycle Stalls in Pipelines:

In pipelined CPUs, stalls occur when the CPU needs to wait for data from memory. A single stage in the pipeline might complete in 1 cycle, but if the next instruction relies on data that is not yet available (e.g., from a previous load instruction), the pipeline stalls.

  • Caches Mitigate Stalls: Caches dramatically reduce the time to access memory, reducing the number of wait cycles introduced by memory stalls.

    • When data is found in L1 (a cache hit), the CPU can continue without stalling because it only takes a few cycles to retrieve the data.

    • If the data is in L2 or L3, the wait is longer, but still much faster than accessing RAM.

  • Multi-level caches work together to minimize the impact of a cache miss. When the CPU misses in L1, it checks L2, and then L3, before finally going to RAM, progressively reducing the likelihood of a long stall.

Real Modern Numbers (Approximate):

  • CPU Speed: 3.5 GHz (1 cycle = ~0.29 ns)

  • L1 Cache Latency: 1-4 cycles (~0.3 to 1.2 ns)

  • L2 Cache Latency: 7-14 cycles (~2 to 5 ns)

  • L3 Cache Latency: 20-40 cycles (~10 to 20 ns)

  • RAM Latency: 50-100 ns (hundreds of cycles)

In modern CPUs, keeping data in caches as much as possible is crucial for maintaining high performance, especially in pipelined architectures, where a single memory stall can hold up every instruction behind it. By reducing the need to access slower memory, caches significantly reduce pipeline stalls and improve overall efficiency.
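
To make the lookup cascade concrete, below is a minimal C sketch of the order in which the levels are checked. The cycle counts are the approximate figures from the list above, and the hit/miss flags stand in for real tag comparisons; this is an illustration of the idea, not a simulator.

#include <stdio.h>
#include <stdbool.h>

/* Approximate latencies in CPU cycles at ~3.5 GHz (figures from the list above;
   RAM is taken as roughly 200 cycles). */
enum { L1_CYCLES = 4, L2_CYCLES = 14, L3_CYCLES = 40, RAM_CYCLES = 200 };

/* Cost of one access given which levels hit: check L1, then L2, then L3,
   and fall back to RAM only when every level misses. */
static int access_cost(bool l1_hit, bool l2_hit, bool l3_hit) {
    if (l1_hit) return L1_CYCLES;
    if (l2_hit) return L2_CYCLES;
    if (l3_hit) return L3_CYCLES;
    return RAM_CYCLES;
}

int main(void) {
    printf("L1 hit:   %d cycles\n", access_cost(true,  false, false));
    printf("L2 hit:   %d cycles\n", access_cost(false, true,  false));
    printf("L3 hit:   %d cycles\n", access_cost(false, false, true));
    printf("RAM only: %d cycles\n", access_cost(false, false, false));
    return 0;
}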

Spatial and Temporal Locality and Caching

Spatial and temporal locality are key principles in understanding why caches work so well. Here's how they apply in typical C programming and how caching can take advantage of them to improve performance.

1. Temporal Locality:

Temporal locality refers to the idea that if a piece of data is accessed, it is likely to be accessed again soon.

Example in C:

int sum = 0;
int arr[1000]; // Large array

for (int i = 0; i < 1000; i++) {
    arr[i] = i; // First access (write)
}

for (int i = 0; i < 1000; i++) {
    sum += arr[i]; // Second access (read)
}
  • First access: We write to each element of the array arr[i] in sequence.

  • Second access: We later read each element of arr[i] again in the summation loop.

Because the array elements are accessed multiple times, caching can store these elements after the first access. Temporal locality means that when we access arr[i] for the summation, it's likely to still be in the cache, speeding up the read operation.

2. Spatial Locality:

Spatial locality refers to the likelihood that if one piece of data is accessed, other nearby data will be accessed soon.

Example in C:

int sum = 0;
int arr[1000]; // Large array

for (int i = 0; i < 1000; i++) {
    sum += arr[i]; // Accessing elements sequentially
}

Here, the array arr is accessed sequentially from arr[0] to arr[999]. When arr[0] is accessed, it's very likely that arr[1], arr[2], and so on will be accessed soon. Caches are designed with cache blocks or lines, where each block holds several contiguous memory locations (e.g., 64 bytes). When arr[0] is loaded into the cache, the cache might also load arr[1], arr[2], and other nearby elements. This is spatial locality—accessing nearby data soon after accessing a specific piece of data.

Cache Block Size:

Assume a cache block (or line) size of 64 bytes. On a system where each int is 4 bytes, a single cache block can hold 16 int elements. If arr[0] is accessed, the cache will load not just arr[0], but also arr[1] through arr[15]. So, when arr[1] is accessed, it will already be in the cache.
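
The effect of cache lines on access order is easiest to see with a two-dimensional array. The sketch below (an illustrative example, not part of the code above) sums the same matrix twice: the row-major loop walks memory sequentially and reuses each 64-byte line for 16 ints, while the column-major loop jumps N × 4 bytes between accesses and touches a different cache line on almost every iteration.

#include <stdio.h>

#define N 1000

static int m[N][N]; /* stored row by row (row-major order in C) */

int main(void) {
    long sum = 0;

    /* Cache-friendly: consecutive iterations touch adjacent addresses,
       so each 64-byte line is reused for 16 ints before being evicted. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    /* Cache-hostile: consecutive iterations are N * sizeof(int) bytes apart,
       so nearly every access lands in a different cache line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("%ld\n", sum);
    return 0;
}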

Performance Comparison: With and Without Cache

Without Cache:

  • When there’s no cache, each memory access for arr[i] requires a direct fetch from main memory (RAM), which takes around 50-100 ns.

  • Accessing 1000 elements requires around 1000 memory accesses, each taking 50-100 ns, for a total memory access time between 50,000 ns (50 µs) and 100,000 ns (100 µs).

With Cache:

  1. Temporal Locality: In the write-then-sum example above, the first pass loads each arr[i] into the cache, so the second pass can find it there (assuming it has not been evicted in the meantime).

    • L1 Cache hit time: 1-4 cycles (~1 ns). The second access benefits from cache hits instead of going back to RAM.
  2. Spatial Locality: In the example where elements of arr are accessed sequentially, spatial locality ensures that after loading arr[0], the cache will already have arr[1] through arr[15], reducing the number of cache misses.

    • Accessing 1000 elements with a block size of 64 bytes results in 1000 / 16 = 62.5 (rounded up to 63) main-memory fetches (cache misses), rather than 1000.

    • If the first access to a block takes 50 ns (an L1 miss), the remaining accesses within the same cache block take just 1-4 cycles (about 1 ns each). This drastically reduces the total memory access time.

Real-World Comparison:

  • Without Cache: 1000 memory accesses at 50-100 ns per access = 50,000-100,000 ns (or 50-100 µs).

  • With Cache:

    • For temporal locality, assuming 10% of the accesses are cache misses (due to cold start), we might have 100 cache misses and 900 cache hits.

    • Each miss takes 50 ns, and each hit takes 1 ns. The total access time = (100 × 50 ns) + (900 × 1 ns) = 5000 ns + 900 ns = 5900 ns (about 5.9 µs).

    • For spatial locality, only about 63 of the 1000 accesses miss (one per 64-byte block), so the total access time drops further, to roughly (63 × 50 ns) + (937 × 1 ns) ≈ 4100 ns (about 4.1 µs).

Cache Mapping Policies and Write Strategies

Let's assume for this example that we have 32-bit addresses (though they could vary in size in other systems). The cache must map these addresses into cache blocks using various mapping policies, and different strategies are used to manage both reads and writes. Here's how we break down the 32-bit address and explore the different cache policies in detail.

Breakdown of a 32-bit Address

A 32-bit address can be divided into:

  • Offset: Identifies the exact byte within a cache block.

  • Index: Identifies which set or cache line the block is stored in.

  • Tag: Used to determine if the data in the cache is the data that was requested.

Cache Mapping Policies

1. Direct-Mapped Cache

  • In a direct-mapped cache, each memory block maps to exactly one cache line. It’s simple but can lead to more conflicts when multiple memory addresses map to the same cache line.

  • For example, if we have a cache with 1024 cache lines and each block is 64 bytes:

    • Offset: With a 64-byte block, 6 bits are needed for the byte offset (since \( 2^6 = 64 \)).

    • Index: With 1024 lines, we need 10 bits for the index (since \( 2^{10} = 1024 \)).

    • Tag: The remaining 16 bits (32 - 6 - 10) form the tag, which uniquely identifies if the data in the cache matches the requested address.

In this setup, each memory block can only be stored in one specific cache line.
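
The field breakdown above maps directly onto shifts and masks. The helper below is a sketch for this 1024-line, 64-byte-block example (6 offset bits, 10 index bits, 16 tag bits); the struct and function names are illustrative, not from any particular codebase.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 64-byte blocks   -> 2^6  */
#define INDEX_BITS  10  /* 1024 cache lines -> 2^10 */

typedef struct {
    uint32_t tag;    /* upper 16 bits of the address  */
    uint32_t index;  /* which cache line              */
    uint32_t offset; /* byte within the 64-byte block */
} addr_fields;

static addr_fields split_address(uint32_t addr) {
    addr_fields f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}

int main(void) {
    addr_fields f = split_address(0x12345678u);
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)f.tag, (unsigned)f.index, (unsigned)f.offset);
    return 0;
}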

2. Set-Associative Cache

  • A set-associative cache is more flexible than direct-mapped. The cache is divided into several sets, and each set has multiple ways (or slots) where blocks can be stored. This allows for multiple locations for each memory block, reducing conflicts.

  • For example, in a 4-way set-associative cache with 1024 lines:

    • If there are 256 sets (since \( 1024 / 4 = 256 \)), we need 8 bits for the index (since \( 2^8 = 256 \)).

    • The block size still requires 6 bits for the offset.

    • The remaining 18 bits are used for the tag.

The key advantage of set-associative caches is that blocks can be stored in any of the "ways" in a set, reducing the number of conflicts compared to direct-mapped caches.

3. Fully Associative Cache

  • In a fully associative cache, any block can be stored in any cache line. There’s no specific index, and every cache block must be checked for a match with the requested address's tag.

    • In this case, there’s no index field, and the tag field uses most of the address.

    • For a block size of 64 bytes, the offset is still 6 bits, and the remaining 26 bits are used for the tag.

Fully associative caches are the most flexible but are also the most expensive and complex because they require searching the entire cache for a matching tag.

  • Note: Direct-mapped is a special case of set-associative with 1-way (1 block per set), and fully associative is a special case with 1 set and all blocks in that set.
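
All three organizations follow the same arithmetic, which reappears later as the \( 2^{\text{index}} \) equation. The sketch below computes the offset, index, and tag widths for an arbitrary configuration (sizes assumed to be powers of two); a fully associative cache is modeled by passing the total number of blocks as the associativity, which drives the index width to zero. The function names are illustrative.

#include <stdio.h>

/* log2 for exact powers of two */
static unsigned log2u(unsigned long long x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void field_widths(unsigned long long cache_bytes,
                         unsigned block_bytes,
                         unsigned ways,
                         unsigned addr_bits) {
    unsigned long long sets = cache_bytes / ((unsigned long long)block_bytes * ways);
    unsigned offset = log2u(block_bytes);
    unsigned index  = log2u(sets);
    unsigned tag    = addr_bits - index - offset;
    printf("%llu sets: offset=%u index=%u tag=%u\n", sets, offset, index, tag);
}

int main(void) {
    field_widths(64 * 1024, 64, 1, 32);    /* 1024 lines, direct-mapped: offset 6, index 10, tag 16 */
    field_widths(64 * 1024, 64, 4, 32);    /* 4-way, 256 sets: offset 6, index 8, tag 18            */
    field_widths(64 * 1024, 64, 1024, 32); /* fully associative, 1 set: offset 6, index 0, tag 26   */
    return 0;
}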

Replacement Policies for Eviction

When a cache is full and a new block must be loaded, the cache needs a policy to decide which block to evict. Common replacement policies include:

  1. Least Recently Used (LRU):

    • The block that hasn’t been accessed for the longest time is replaced. This policy exploits temporal locality, assuming that recently accessed data is more likely to be accessed again soon. (A minimal C sketch of LRU tracking for one set appears after this list.)
  2. First-In, First-Out (FIFO):

    • The oldest block in the cache (the one that was loaded first) is replaced, regardless of access patterns.
  3. Random Replacement:

    • A random block is evicted. This is sometimes used because it is simple to implement and avoids the overhead of tracking usage.
  4. Least Frequently Used (LFU):

    • The block that has been accessed the least number of times is evicted. This policy can be useful when data access patterns are skewed towards a small number of blocks.
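
As referenced under item 1, here is a minimal C sketch of LRU bookkeeping for a single 4-way set, using an age counter per way; the victim is simply the oldest valid entry. The structure and function names are illustrative, and real hardware typically uses cheaper pseudo-LRU approximations rather than exact counters.

#define WAYS 4

/* One set of a 4-way set-associative cache. age[i] counts accesses
   since way i was last used; the largest age marks the LRU victim. */
typedef struct {
    unsigned tag[WAYS];
    int      valid[WAYS];
    unsigned age[WAYS];
} cache_set;

static void touch(cache_set *s, int way) {
    for (int i = 0; i < WAYS; i++)
        s->age[i]++;            /* everyone gets older ...            */
    s->age[way] = 0;            /* ... except the way just referenced */
}

static int lru_victim(const cache_set *s) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!s->valid[i]) return i;                 /* prefer a free slot   */
        if (s->age[i] > s->age[victim]) victim = i; /* otherwise the oldest */
    }
    return victim;
}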

When a CPU writes data to the cache, it must ensure consistency with the main memory. The strategies employed to maintain this consistency are known as write policies. The primary write policies are write-through and write-back, each with its own method of handling data synchronization between the cache and main memory.

Write-Through Policy

In a write-through policy, every write operation to the cache is immediately reflected in the main memory. This approach ensures that the main memory remains consistent with the cache at all times.

  • Data Consistency: Main memory is always up-to-date with the latest data from the cache.

  • Performance Impact: This policy can lead to increased memory traffic, as every write to the cache results in a corresponding write to the main memory, potentially slowing down the system due to frequent memory accesses.

Write-Back Policy

With a write-back policy, data written to the cache is not immediately updated in the main memory. Instead, the updated data remains in the cache and is written back to the main memory only when the corresponding cache block is evicted.

  • Data Consistency: The main memory may not always have the most recent data until the cache block is evicted and written back.

  • Performance Impact: This approach reduces memory traffic, as data is written to the main memory only upon eviction, allowing for faster write operations to the cache. However, it introduces complexity, as the system must track which cache blocks have been modified (these are known as "dirty" blocks).

Write Allocate vs. No-Write Allocate

These strategies determine how the system handles write operations that miss the cache.

  • Write Allocate: On a write miss, the system loads the corresponding block from the main memory into the cache and then performs the write operation. This approach is beneficial when the data is likely to be accessed again soon, as it brings the data closer to the processor.

  • No-Write Allocate (Write Around): On a write miss, the system writes the data directly to the main memory without loading the block into the cache. This method is efficient when the data is unlikely to be reused, as it avoids occupying cache space with seldom-used data.
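
These choices pair naturally in practice: write-through is usually combined with no-write allocate, and write-back with write allocate. The sketch below models a single cache line over a tiny array standing in for main memory to show the difference; all names and sizes are illustrative, and addresses are assumed to fall inside the small memory array.

#include <stdbool.h>
#include <string.h>
#include <stdint.h>

enum { BLOCK = 64 };
static uint8_t memory[4096]; /* stand-in for main memory */

typedef struct {
    bool     valid, dirty;
    uint32_t tag;            /* block number currently held in the line */
    uint8_t  data[BLOCK];
} line;

/* Write-through + no-write allocate: memory is always updated; the cache
   is only updated if the block already happens to be present. */
static void write_through(line *l, uint32_t addr, uint8_t value) {
    memory[addr] = value;
    if (l->valid && l->tag == addr / BLOCK)
        l->data[addr % BLOCK] = value;
}

/* Write-back + write allocate: fetch the block on a miss, update only the
   cache, and mark the line dirty so it is flushed to memory on eviction. */
static void write_back(line *l, uint32_t addr, uint8_t value) {
    uint32_t block = addr / BLOCK;
    if (!l->valid || l->tag != block) {
        if (l->valid && l->dirty)                        /* evict dirty block */
            memcpy(&memory[l->tag * BLOCK], l->data, BLOCK);
        memcpy(l->data, &memory[block * BLOCK], BLOCK);  /* allocate new block */
        l->tag = block; l->valid = true; l->dirty = false;
    }
    l->data[addr % BLOCK] = value;
    l->dirty = true;
}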

Inclusive vs. Exclusive Cache Policies

In multi-level cache hierarchies, inclusion policies dictate whether data present in one cache level is also present in other levels.

  • Inclusive Policy: All data in the upper-level cache (e.g., L1) is also present in the lower-level cache (e.g., L2). This redundancy ensures that the lower-level cache contains all the data of the upper-level cache, simplifying data management and coherence. However, it may lead to inefficient use of cache space due to duplication.

  • Exclusive Policy: Data is present in only one level of the cache hierarchy at a time. If data is in the L1 cache, it is not in the L2 cache, and vice versa. This approach maximizes the effective cache capacity, as each level stores unique data. However, it introduces complexity in managing data movement between cache levels.

Cache Performance

In the context of cache performance, we aim to understand how the processor's execution time is affected by cache behavior, specifically the memory stall cycles due to cache misses. The performance impact of the cache is measured by incorporating both cache hits and cache misses into the CPU execution time. Below is a detailed explanation and breakdown of the relevant formulas.

The overall CPU execution time can be expressed as the sum of the CPU clock cycles and the memory stall cycles:

$$\text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time}$$

Here, CPU clock cycles refer to the cycles the processor would spend executing the program if every memory access hit in the cache, while memory stall cycles are the additional cycles required due to cache misses, where the processor must wait for data to be retrieved from lower levels of the memory hierarchy (e.g., lower cache levels or main memory).

The number of memory stall cycles depends on the frequency of cache misses and the cost of each miss (the miss penalty). The general formula for memory stall cycles is:

$$\text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty}$$

Given the instruction count (IC) and the number of memory accesses per instruction, the memory stall cycles can be broken down as:

$$\text{Memory stall cycles} = \text{IC} \times \left(\frac{\text{Misses}}{\text{Instruction}}\right) \times \text{Miss penalty}$$

Alternatively, expressing this in terms of memory accesses per instruction, miss rate, and miss penalty:

$$\text{Memory stall cycles} = \text{IC} \times \left(\frac{\text{Memory accesses}}{\text{Instruction}}\right) \times \text{Miss rate} \times \text{Miss penalty}$$

Where IC represents the instruction count, which is the total number of instructions executed, memory accesses per instruction is the average number of memory accesses made by each instruction, miss rate is the fraction of memory accesses that result in a cache miss, and miss penalty is the time it takes (in clock cycles) to service a cache miss, typically the time to fetch data from a lower level of the memory hierarchy or main memory.
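
Plugging representative numbers into this formula makes the cost visible. The short sketch below assumes a hypothetical program with one million instructions, 1.5 memory accesses per instruction, a 2% miss rate, a 100-cycle miss penalty, and a base CPI of 1.0 on a 3.5 GHz clock; all of these values are assumptions chosen for illustration.

#include <stdio.h>

int main(void) {
    /* Assumed workload and machine parameters (illustrative only). */
    double ic             = 1e6;       /* instruction count               */
    double accesses_per_i = 1.5;       /* memory accesses per instruction */
    double miss_rate      = 0.02;      /* fraction of accesses that miss  */
    double miss_penalty   = 100;       /* cycles per miss                 */
    double cpi_exec       = 1.0;       /* CPI assuming all accesses hit   */
    double cycle_time_ns  = 1.0 / 3.5; /* ~0.29 ns at 3.5 GHz             */

    /* Memory stall cycles = IC x accesses/instruction x miss rate x penalty */
    double stall_cycles = ic * accesses_per_i * miss_rate * miss_penalty;

    /* CPU time = IC x (CPI_exec + stalls per instruction) x clock cycle time */
    double cpu_time_ns = ic * (cpi_exec + stall_cycles / ic) * cycle_time_ns;

    printf("stall cycles: %.0f\n", stall_cycles);
    printf("CPU time: %.0f ns (vs. %.0f ns with a perfect cache)\n",
           cpu_time_ns, ic * cpi_exec * cycle_time_ns);
    return 0;
}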

In real-world scenarios, the miss rates and penalties for reads and writes can be different. Therefore, the memory stall cycles can be further broken down into reads and writes:

$$\text{Memory stall clock cycles} = \text{IC} \times \text{Reads per instruction} \times \text{Read miss rate} \times \text{Read miss penalty} + \text{IC} \times \text{Writes per instruction} \times \text{Write miss rate} \times \text{Write miss penalty}$$

This version of the formula accounts for the different behavior of reads and writes, where the miss rate and penalty may not be the same for both types of operations.

To simplify the above equations, we can calculate an average miss rate and average miss penalty across both reads and writes:

$$\text{Memory stall clock cycles} = \text{IC} \times \left(\frac{\text{Memory accesses}}{\text{Instruction}}\right) \times \text{Miss rate} \times \text{Miss penalty}$$

This form combines both reads and writes into a single average miss rate and penalty, making it easier to compute the total memory stall cycles when precise read/write differentiation isn’t needed or isn’t available.

The miss rate is simply the fraction of cache accesses that result in a miss, i.e., the percentage of data requests not found in the cache. The miss penalty is the number of clock cycles required to fetch the data from the next level of the memory hierarchy (e.g., L2 cache, L3 cache, or main memory). Instruction count (IC) represents the total number of instructions executed by the CPU, and memory accesses per instruction refers to the average number of memory access operations per instruction.

Higher miss rates lead to more frequent cache misses and, consequently, more memory stall cycles, slowing down overall execution. Lower miss penalties reduce the time penalty per miss, allowing the CPU to resume execution faster even after a miss. Higher associativity generally reduces miss rate at the cost of increased complexity and access time.

The equations above are useful for understanding how various cache parameters impact overall system performance. By adjusting cache size, associativity, and block size, along with strategies like prefetching or multi-level caching, you can reduce the number of cache misses and minimize the memory stall cycles, significantly boosting CPU efficiency.

Designers often prefer to measure cache performance using misses per instruction rather than misses per memory reference. This metric provides a clearer understanding of how cache performance affects the execution of instructions, rather than focusing purely on memory accesses. The two metrics are closely related, and can be converted using the following equation:

$$\frac{\text{Misses}}{\text{Instruction}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}}$$

In this equation:

  • Miss rate refers to the percentage of memory accesses that result in a cache miss.

  • Memory accesses per instruction represent how many memory operations are performed per executed instruction.

This formulation allows the miss rate, which is typically measured per memory reference, to be expressed as misses per instruction. For example, if the miss rate is 2% (0.02) and each instruction performs an average of 1.5 memory accesses, then the misses per instruction would be:

$$\frac{\text{Misses}}{\text{Instruction}} = 0.02 \times 1.5 = 0.030$$

This means that, on average, there are 30 cache misses for every 1000 instructions.

One of the main benefits of measuring misses per instruction is that it is independent of the underlying hardware. Speculative processors, which may fetch more instructions than they commit, can skew the miss rate when measured per memory reference. In such cases, using misses per instruction avoids underestimating the miss rate by focusing on committed instructions.

However, the metric is architecture-dependent. Different architectures have varying numbers of memory accesses per instruction, which affects the calculation. For example, RISC architectures might have more consistent memory access patterns, while CISC architectures might vary widely. Despite this, misses per instruction remain a valuable metric for comparing performance within a given architecture.

List of Cache Equations:

$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$

$$\text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time}$$

$$\text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty}$$

$$\text{Memory stall cycles} = \text{IC} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$$

$$\frac{\text{Misses}}{\text{Instruction}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}}$$

$$\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$

$$\text{CPU execution time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory stall clock cycles}}{\text{Instruction}} \right) \times \text{Clock cycle time}$$

$$\text{CPU execution time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

$$\text{CPU execution time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Miss rate} \times \text{Memory accesses}}{\text{Instruction}} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

$$\frac{\text{Memory stall cycles}}{\text{Instruction}} = \frac{\text{Misses}}{\text{Instruction}} \times (\text{Total miss latency} - \text{Overlapped miss latency})$$

$$\text{Average memory access time} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times (\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2})$$

$$\frac{\text{Memory stall cycles}}{\text{Instruction}} = \frac{\text{Misses}_{L1}}{\text{Instruction}} \times \text{Hit time}_{L2} + \frac{\text{Misses}_{L2}}{\text{Instruction}} \times \text{Miss penalty}_{L2}$$

Basic Cache Optimizations

When discussing cache optimizations, it’s important to first understand the average memory access time equation, which can be broken down as:

$$\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$

Using this framework, we can categorize cache optimizations into three main strategies:

  1. Reducing the Miss Rate: This can be achieved by increasing the block size, enlarging the overall cache size, or using higher associativity (i.e., more ways per set). By doing so, the likelihood of cache misses decreases.

  2. Reducing the Miss Penalty: Multilevel caches (L1, L2, etc.) are commonly used to reduce the cost of a cache miss. Additionally, prioritizing reads over writes can further lower miss penalties, as reads are more critical to overall performance.

  3. Reducing the Hit Time: Optimizing the time taken to determine a cache hit can also improve performance. Techniques like avoiding address translation when indexing the cache can help minimize the time spent on each hit.

A classical approach to improving cache performance is to focus on reducing the miss rate. To understand the causes of cache misses, misses can be divided into three types:

  • Compulsory Misses: These occur the first time a block is accessed. Since the block is not yet in the cache, it must be loaded. These misses are also called cold-start misses or first-reference misses.

  • Capacity Misses: If the cache is not large enough to hold all the data needed during execution, capacity misses occur. Blocks that were previously cached are discarded and later needed again, causing them to be reloaded.

  • Conflict Misses: In set-associative or direct-mapped caches, multiple memory blocks may map to the same set or line. When too many blocks map to the same location, some blocks will be evicted and cause a conflict miss. This issue is also referred to as a collision miss and is specific to caches with fixed placement strategies.

Optimization 1: Increase Block Size

The first optimization for reducing cache miss rates is increasing the block size. A larger block size can significantly lower the miss rate, particularly by reducing compulsory misses (also known as cold-start misses). This optimization works because of the principles of spatial locality and temporal locality.

Spatial locality refers to the likelihood that if a certain memory location is accessed, nearby memory locations will soon be accessed as well. By increasing the block size, more adjacent data is loaded into the cache, allowing for future accesses to nearby memory locations without additional misses. This is especially effective in programs that frequently access contiguous data, like arrays or loops with sequential data processing.

However, increasing the block size comes with a trade-off. Larger blocks reduce the number of blocks that can fit in the cache, which can lead to an increase in conflict misses and capacity misses. Conflict misses occur when multiple blocks map to the same cache location in set-associative or direct-mapped caches, leading to evictions of blocks that are still in use. Capacity misses happen when the cache is not large enough to hold all the data required during program execution, leading to frequent evictions and reloads.

Moreover, increasing the block size also increases the miss penalty. Since larger blocks require more data to be transferred, the time to move a block from the next level of the memory hierarchy (such as the L2 cache or main memory) into the L1 cache increases, raising the penalty for a miss. Therefore, while larger blocks can reduce the miss rate, the associated increase in miss penalty may outweigh the benefits if the block size becomes too large.

The choice of block size also depends on the latency and bandwidth of the memory system. Systems with high latency and high bandwidth can tolerate larger block sizes because they can transfer more bytes in a single transaction without significantly impacting overall performance. On the other hand, systems with low latency and low bandwidth may benefit more from smaller block sizes, as the marginal benefit of loading larger blocks is diminished due to shorter latencies.

Increasing block size is an effective way to reduce miss rates by leveraging spatial locality. However, the block size should be chosen carefully to balance the reduction in miss rate with the potential increase in conflict and capacity misses, as well as the larger miss penalty. Cache designers must consider the specific workload and memory system characteristics when determining the optimal block size for reducing miss rate and improving overall cache performance.

Optimization 2: Increase cache size

The second optimization for reducing cache miss rates is to increase the cache size. This strategy specifically targets reducing capacity misses, which occur when the cache is too small to store all the data needed during the execution of a program. By increasing the overall capacity of the cache, more data can be retained, reducing the frequency with which blocks are evicted and later needed again, thus lowering the miss rate.

While increasing cache size can be an effective way to reduce miss rates, it comes with certain drawbacks. A larger cache generally results in a longer hit time, which is the time it takes to access data that is present in the cache. Larger caches are also more expensive to build and consume more power, making them less suitable for systems where energy efficiency and cost are critical considerations.

Despite these potential drawbacks, increasing cache size remains a popular optimization, especially for off-chip caches, such as L2 and L3 caches. Off-chip caches typically handle large data sets and benefit more from increased capacity, as the longer hit times are more tolerable compared to the benefits of reduced miss rates in high-latency, high-throughput systems. This approach is especially beneficial in systems with workloads that require a large working set of data to be stored in the cache, minimizing the need to frequently access slower main memory.

While larger caches can reduce miss rates by storing more data, cache designers must carefully weigh the benefits against the potential costs, longer hit times, and increased power consumption.

Optimization 3: Higher Associativity to Reduce Miss Rate

The third optimization for reducing cache miss rates involves increasing the associativity of the cache. Higher associativity refers to allowing more potential locations (or "ways") where a block can be stored within a set in a set-associative cache. This flexibility helps reduce conflict misses, which occur when multiple blocks are mapped to the same cache line in a lower-associative cache.

There are two key observations when increasing associativity. The first is that an eight-way set-associative cache is, for practical purposes, almost as effective as a fully associative cache in terms of reducing miss rates. Fully associative caches allow any memory block to be placed in any cache line, which minimizes conflict misses. However, fully associative caches are often impractical due to their complexity and high hit times. An eight-way set-associative cache offers a good compromise, significantly reducing conflict misses without the full overhead of a fully associative design.

The second observation is referred to as the 2:1 cache rule of thumb. This rule suggests that a direct-mapped cache of size N has approximately the same miss rate as a two-way set-associative cache of size N/2. This is important because it illustrates how associativity can compensate for a smaller cache size, as a smaller but more associative cache can perform similarly to a larger direct-mapped cache in terms of miss rate.

However, increasing associativity also comes with trade-offs. Higher associativity tends to increase the hit time, as the cache must check multiple possible locations for a block, making each cache lookup more complex. This means that while increasing associativity reduces miss rates, it also raises the miss penalty and hit time, which can negatively impact overall performance. The challenge for cache designers is to balance the benefits of reduced miss rates with the potential cost of increased hit time and complexity.

Increasing cache associativity is a powerful tool for reducing miss rates, particularly conflict misses. However, it must be used carefully to avoid significantly increasing hit times and miss penalties, especially in systems with fast processor clock cycles where quick access times are critical.

Optimization 4: Multilevel Caches to Reduce Miss Penalty

The fourth optimization focuses on multilevel caches to reduce the miss penalty. As processors have become faster over time while DRAM speeds have not improved at the same rate, the gap between processor and memory speeds has widened. This increasing gap means that reducing the miss penalty can be just as beneficial to overall performance as reducing the miss rate.

One common way to reduce miss penalty is by introducing a second level (L2) of cache between the L1 cache and the main memory. The goal of this multilevel caching strategy is to ensure that when a miss occurs in the L1 cache, the data can still be retrieved quickly from the L2 cache rather than going all the way to the slower main memory. The first-level cache (L1) is kept small and fast to match the clock cycle of the processor, while the second-level cache (L2) can be larger and slower but still significantly faster than accessing main memory.

The average memory access time for a two-level cache system can be calculated as:

$$\text{Average memory access time} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \text{Miss penalty}_{L1}$$

For multilevel caches, the miss penalty for the L1 cache is the time required to access the L2 cache. The miss penalty for the L2 cache is the time needed to access main memory. So, the equation becomes:

$$\text{Miss penalty}_{L1} = \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}$$

Therefore, the average memory access time for a two-level cache system is:

$$\text{Average memory access time} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times (\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2})$$

This equation demonstrates that the L1 cache is fast but has a higher miss rate compared to the L2 cache, which is slower but larger. By introducing an L2 cache, we can drastically reduce the miss penalty associated with L1 cache misses, thus improving the overall performance.

When analyzing the miss rates in a multilevel cache system, we distinguish between the local miss rate and the global miss rate:

  • Local miss rate refers to the number of misses in a specific cache (L1 or L2) divided by the total number of memory accesses to that particular cache.

  • Global miss rate measures the overall number of misses in the cache relative to the total number of memory accesses generated by the processor. For example, the global miss rate for the L2 cache is the product of the L1 miss rate and the L2 miss rate:

$$\text{Global miss rate}_{L2} = \text{Miss rate}_{L1} \times \text{Miss rate}_{L2}$$

In practice, the local miss rate for L2 is typically high because the L1 cache handles the bulk of the memory accesses, leaving only the "leftover" accesses that were missed in L1 for L2 to handle. Therefore, the global miss rate is the more useful measure when evaluating the effectiveness of the second-level cache.
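
A small numeric sketch ties these formulas together; the hit times, miss rates, and penalty below are assumed values for illustration. Note how the L1 miss penalty is itself computed from the L2 parameters, and how the global L2 miss rate is the product of the two local miss rates.

#include <stdio.h>

int main(void) {
    /* Assumed parameters, in cycles and fractions (illustrative only). */
    double hit_l1 = 4, hit_l2 = 14, penalty_l2 = 200;
    double miss_l1 = 0.05;   /* local L1 miss rate                 */
    double miss_l2 = 0.40;   /* local L2 miss rate (of L1 misses)  */

    /* Miss penalty seen by L1 = time to resolve the access in L2 (or below). */
    double penalty_l1 = hit_l2 + miss_l2 * penalty_l2;

    /* AMAT = HitTime_L1 + MissRate_L1 x MissPenalty_L1 */
    double amat = hit_l1 + miss_l1 * penalty_l1;

    /* Global L2 miss rate = MissRate_L1 x MissRate_L2 */
    double global_l2 = miss_l1 * miss_l2;

    printf("L1 miss penalty:     %.1f cycles\n", penalty_l1);
    printf("AMAT:                %.1f cycles\n", amat);
    printf("global L2 miss rate: %.3f\n", global_l2);
    return 0;
}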

In terms of memory stalls per instruction, we can expand the formula to account for the second-level cache:

$$\text{Average memory stalls per instruction} = \text{Misses per instruction}_{L1} \times \text{Hit time}_{L2} + \text{Misses per instruction}_{L2} \times \text{Miss penalty}_{L2}$$

By adding another level in the cache hierarchy, the overall complexity of performance analysis increases, but the benefits of reducing the miss penalty often outweigh the drawbacks. One challenge in multilevel caches is determining the best size for the L2 cache. Since the L2 cache stores data that is already in L1 and is typically much larger than the L1 cache, designers must ensure that it is big enough to reduce the frequency of accesses to main memory.

An important design consideration is multilevel inclusion versus multilevel exclusion:

  • Multilevel inclusion means that any data in the L1 cache is also present in the L2 cache. This simplifies consistency and coherence mechanisms because only the L2 cache needs to be checked for data.

  • Multilevel exclusion means that L1 data is not stored in L2, allowing the L2 cache to be fully dedicated to storing data not present in L1. This approach can be more efficient in terms of space usage, as seen in processors like the AMD Opteron, which employs this strategy.

Multilevel caches reduce the miss penalty by adding additional layers of cache between the processor and main memory. Proper cache hierarchy design, including the use of inclusion or exclusion policies, higher associativity, and careful consideration of cache size, can greatly improve the performance of a memory system by minimizing the time spent accessing slow main memory.

Optimization 5: Giving Priority to Read Misses over Writes to Reduce Miss Penalty

This optimization involves serving reads before writes have been completed, an approach commonly used in modern cache designs. Specifically, the use of a write buffer can improve the performance of write-through caches, allowing writes to be buffered while reads proceed.

In a write-through cache, the data is immediately written to both the cache and main memory. A write buffer temporarily stores the data being written, which helps the processor continue executing reads without having to wait for the write to complete. The challenge with write buffers arises when a read miss occurs, and the read data might be in the write buffer (i.e., an updated value that hasn't yet been written to memory). To avoid incorrect results, the cache must check the write buffer for potential conflicts.

One way to handle this issue is to stall the read miss until the write buffer is empty, but this can degrade performance. A more effective approach is to check the write buffer on a read miss, and if no conflicts are found (i.e., the data needed by the read is not in the buffer), the processor can proceed with the read. Most modern processors use this technique, prioritizing reads over writes for improved performance, as reads are more critical to the flow of program execution.

For write-back caches, where data is only written to memory when a cache block is evicted, the challenge is a bit different. Suppose a read miss is set to replace a dirty block (i.e., a block that has been modified but not yet written back to memory). Instead of first writing the dirty block to memory, then performing the read, an optimization can be applied where the dirty block is temporarily moved to the buffer, the memory read is performed, and then the block is written back. This strategy allows the read to complete sooner, reducing processor stalls.

In both cases, these optimizations help reduce write penalties by allowing reads to proceed even when writes are pending. As cache designs have evolved, optimizing hit time has become increasingly important. Hit time directly impacts the processor's clock rate, as many modern processors rely on fast cache access to maintain high clock speeds. By focusing on reducing hit time, cache optimizations provide broad performance improvements across all areas of memory access.

Optimization 6: Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time

The sixth optimization focuses on avoiding address translation during cache indexing to reduce hit time. Address translation refers to the conversion of a virtual address used by the processor into a physical address to access memory. Typically, when a processor accesses memory, the address must go through this translation process, which can add latency to cache hits.

To address this, designers can implement virtual caches, which use virtual addresses for indexing, eliminating the need for address translation during cache hits. In contrast, physical caches require the virtual address to be translated into a physical address before accessing the cache. Since cache hits are more frequent than cache misses, optimizing the hit case by using virtual addresses can provide significant performance gains.

When designing caches, there are two key aspects to consider:

  1. Indexing the cache: whether the virtual or physical address is used to locate the cache line.

  2. Tag comparison: whether the tag used to identify the block is based on the virtual or physical address.

Using virtual addressing for both indexing and tag comparison can completely eliminate address translation during cache hits, but this approach presents several challenges. One of the primary concerns is protection: page-level protection is enforced as part of the virtual-to-physical address translation process. If this protection is bypassed by using virtual caches, the system would need additional mechanisms to ensure security and proper memory management, such as copying protection information from the Translation Lookaside Buffer (TLB) and checking it on every access.

Another issue arises when processes are switched. Since different processes may have the same virtual addresses but map to different physical addresses, switching processes would require flushing the cache to avoid conflicts. One solution to this problem is using a process-identifier tag (PID). By assigning a PID to each process, the cache can distinguish between data from different processes, avoiding unnecessary cache flushes.

A third challenge with virtual caches is the occurrence of synonyms or aliases, where multiple virtual addresses map to the same physical address. If a program modifies data at one virtual address, the cache may contain two copies of the same data, leading to inconsistencies. Physical caches avoid this problem by ensuring that all virtual addresses are translated to the same physical address before accessing the cache. Hardware solutions, like anti-aliasing, can address this by guaranteeing that every cache block has a unique physical address, even when multiple virtual addresses point to the same data.

One software-based approach to handle the synonym problem is called page coloring, where the operating system ensures that the physical and virtual addresses match in the lower bits used for indexing the cache. This method effectively forces virtual addresses to align with physical addresses, reducing the risk of synonyms.

To get the best of both virtual and physical caches, a technique known as virtually indexed, physically tagged caching is used. In this approach, the index is derived from the virtual address, allowing the cache lookup to begin immediately while the virtual address is translated in parallel, and the tag is based on the physical address to ensure consistency. The limitation is that the index and block offset must come from the untranslated page-offset bits, which caps a direct-mapped cache at the page size (e.g., 4 KB); otherwise part of the index must be translated.

Finally, associativity can help maintain a large cache while keeping the index in the physical part of the address. The size of the index is controlled by the following formula:

$$2^{\text{Index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$

Increasing associativity and cache size allows for efficient use of both physical indexing and virtual addressing. For example, a highly associative cache, such as a 16-way set associative cache, allows large cache sizes without requiring additional index translation, even when the page size is relatively small.

Avoiding address translation during cache indexing can significantly reduce hit times, but it comes with challenges such as maintaining security, handling process switching, and addressing synonyms. Techniques like PIDs, anti-aliasing, and page coloring can mitigate these issues, while virtually indexed, physically tagged caches offer a hybrid approach that balances performance and consistency.

Advanced Cache Optimizations

  1. Small and Simple First-Level Caches to Reduce Hit Time and Power: Optimizing L1 cache design is crucial for reducing hit time and power consumption. Smaller, simpler L1 caches allow faster clock cycles and use less power, making them ideal for modern processors. Direct-mapped caches, with their minimal levels of associativity, offer quick access times by simplifying the process of locating data. However, as CPU clock rates have risen, maintaining larger L1 caches without increasing latency has become challenging. Instead, higher associativity is often favored to reduce conflict misses, especially in multithreaded environments. Tools like CACTI help designers balance cache size, power, and performance, ultimately leading to efficient and powerful CPU cache designs.

  2. Way Prediction to Reduce Hit Time: To improve cache performance, way prediction is a clever technique that maintains the fast hit time of direct-mapped caches while reducing conflict misses. Essentially, extra bits are stored to predict which "way" or block will be accessed on the next cache access. This prediction enables the cache to choose a block earlier, overlapping tag comparison with data access. If the prediction is correct, access is fast; if not, the correct block is selected in the next cycle with minimal additional latency. Way prediction can achieve over 90% accuracy for two-way associative caches and around 80% for four-way associative caches, particularly effective for instruction caches (I-caches). It was initially used in the MIPS R10000 and is also implemented in processors like the ARM Cortex-A8. An extended approach, known as way selection, uses these prediction bits to decide whether a block should even be accessed. While this method saves power, it adds significant delay when predictions are wrong, making it suitable mainly for low-power designs. Despite the drawbacks, including increased complexity in pipelining, way prediction and way selection are powerful tools for balancing cache hit time and power efficiency.

  3. Pipelined Cache Access for Increased Bandwidth: Pipelining the cache access is a simple yet powerful way to improve cache bandwidth. By breaking down the cache access into multiple stages, cache hits can span several clock cycles, allowing for a faster clock cycle time and greater overall bandwidth, even if the hit latency itself becomes slower. For instance, Intel processors in the 1990s used 1-2 clock cycles for cache access, whereas modern CPUs, like the Intel Core i7, may use up to 4 clock cycles. While this adds a penalty for mispredicted branches, the advantage lies in allowing the integration of more associativity without significantly sacrificing performance.

  4. Nonblocking Caches to Handle Multiple Misses: Nonblocking or "lockup-free" caches take cache optimization a step further by allowing the cache to keep servicing requests while one or more misses are outstanding. With a nonblocking cache, the CPU doesn't need to stall on a cache miss; instead, it can continue to fetch other instructions while waiting for the missing data. This technique, referred to as "hit under miss" or "miss under miss," reduces effective miss penalties by overlapping misses and allowing more efficient cache utilization. Studies have shown that nonblocking caches can significantly reduce effective cache latency, particularly for workloads involving multiple outstanding misses. By allowing multiple references to proceed concurrently, nonblocking caches help maintain a higher level of performance, especially in modern, high-associativity systems like the Intel Core i7. However, implementing nonblocking caches is complex, as the system must be capable of managing multiple outstanding requests without causing conflicts.

  5. Multibanked Caches to Increase Cache Bandwidth: Instead of treating the cache as a single large block, multibanked caches split it into independent banks that can support simultaneous accesses. This approach effectively increases cache bandwidth by allowing multiple memory accesses at the same time. For example, modern processors like the Intel Core i7 use multiple banks in their L1 cache, enabling up to two memory accesses per clock cycle. By using sequential interleaving, addresses are spread evenly across different banks, ensuring that memory accesses are well distributed, reducing contention, and improving cache performance. This technique is also widely used in DRAM to enhance memory access efficiency. When properly designed, multibanked caches can significantly boost the overall system bandwidth while reducing power consumption.

  6. Critical Word First and Early Restart to Reduce Miss Penalty: This optimization targets reducing cache miss penalties by fetching the most needed data sooner. The idea is simple—when a cache miss occurs, instead of waiting for the entire block to be loaded, the cache can either:

    1. Critical Word First: Request the needed word first and send it to the CPU as soon as it arrives.

    2. Early Restart: Start sending data to the CPU as soon as the requested word arrives, even if the rest of the block is still being fetched.

These techniques work well for large cache blocks, allowing the processor to continue work without waiting for the entire block, which ultimately reduces idle time and speeds up performance.

  7. Merging Write Buffer to Reduce Miss Penalty: Write-through caches often use write buffers to avoid stalling the processor while data is being written to memory. To optimize this, a technique called write merging is employed. If the write buffer has data for the same memory address, new data can be merged instead of writing separately. This optimization is used in many modern processors, including Intel's Core i7, and it helps reduce stalls and improve the efficiency of memory operations. Write merging is particularly effective when multiple stores are being made to consecutive memory addresses, reducing the time needed for writing compared to handling each store individually.

  8. Compiler Optimizations to Reduce Miss Rate: Unlike the other optimizations, which require hardware changes, this one leverages software. Compiler optimizations like loop interchange and blocking are used to improve cache performance (a C sketch of both appears after this list):

    • Loop Interchange: Changing the order of nested loops to make better use of spatial locality, ensuring that data is accessed in a more cache-friendly manner.

    • Blocking: Dividing operations into smaller "blocks" that fit into the cache, reducing the number of cache misses and making better use of temporal locality.

  9. Hardware Prefetching of Instructions and Data: To reduce miss penalties, hardware prefetching proactively fetches data or instructions before the CPU even requests them. For instructions, this means fetching the next consecutive block along with the current one on a cache miss. Prefetched data is stored in an external buffer, making it readily accessible if needed, thus reducing potential latency. For data accesses, hardware prefetching can be very effective—using stream buffers, it can capture a significant portion of misses, improving cache hit rates and reducing latency. The Intel Core i7, for instance, includes hardware prefetching for both L1 and L2 caches, making it particularly useful for high-performance applications. While prefetching can improve performance by leveraging unused memory bandwidth, it's essential to ensure that prefetched data is actually used. Otherwise, unnecessary prefetches could harm system efficiency and power consumption.

  10. Compiler-Controlled Prefetching: This is a software-based approach that instructs the processor to fetch data before it's explicitly needed. This is done by inserting prefetch instructions into the code, aiming to overlap data access with program execution. Prefetching instructions can be of two types:

    1. Register Prefetch: Loads data into a register.

    2. Cache Prefetch: Loads data into the cache without explicitly using a register.

The goal is to reduce the impact of cache misses by fetching data early and ensuring the processor doesn’t stall. Loops are a prime target for this kind of optimization, as compilers can efficiently predict memory access patterns. By unrolling loops or using software pipelining, compilers can make better use of prefetching to optimize execution.
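
As referenced under the compiler-optimization item above, the sketch below illustrates both techniques. Loop interchange keeps the inner loop walking memory contiguously (row-major order in C), and blocking (tiling) restructures a matrix multiply so that the B × B tiles being worked on stay resident in the cache; N, B, and the function names are assumed values chosen for illustration.

#include <stdio.h>

#define N 512
#define B 64   /* tile size, chosen so a B x B tile comfortably fits in cache */

static double a[N][N], b[N][N], c[N][N];

/* Loop interchange: with i as the outer loop and j as the inner loop,
   consecutive iterations read adjacent addresses; the swapped order
   (j outer, i inner) would stride through memory and thrash the cache. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Blocking (tiling): compute C += A * B one B x B tile at a time so the
   portions of a, b, and c being worked on stay resident in the cache. */
static void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = a[i][k];
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += aik * b[k][j];
                    }
}

int main(void) {
    matmul_blocked();
    printf("%f\n", sum_row_major() + c[0][0]);
    return 0;
}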

Exercises

  1. A 256 KB cache is using direct mapping with a block size of 64 bytes. The system has 32-bit addresses. How many sets are in the cache, and how many bits are used for the index, block offset, and tag? If the block size is halved to 32 bytes, how would the number of sets, index bits, and block offset bits change?

    Solution:

    Cache size = 256 KB, Block size = 64 bytes, Address size = 32 bits.

    Step 1: Calculate the number of sets. The number of blocks in the cache is \( \frac{256 \, \text{KB}}{64 \, \text{bytes}} = 4096 \, \text{blocks} \) . Since it is a direct-mapped cache, the number of sets = 4096 sets.

    Step 2: Calculate block offset bits. The block size is 64 bytes, so we need \( \log_2(64) = 6 \) bits for the block offset.

    Step 3: Calculate index bits. Number of sets = 4096, so we need \( \log_2(4096) = 12 \) bits for the index.

    Step 4: Calculate tag bits. The total address size is 32 bits, so the tag bits are \( 32 - (12 + 6) = 14 \) bits.

    If the block size is halved to 32 bytes, the number of blocks (and sets) becomes \( \frac{256 \, \text{KB}}{32 \, \text{bytes}} = 8192 \) , so the block offset becomes \( \log_2(32) = 5 \) bits, the index bits become \( \log_2(8192) = 13 \) , and the tag bits remain \( 32 - (13 + 5) = 14 \) .

  2. Consider a 4-way set associative cache with a size of 512 KB and a block size of 128 bytes. How many sets does the cache have, and how many bits are used for the tag, index, and block offset in a system with 36-bit addresses? What would happen to the number of index bits and sets if the number of ways were increased to 8?

    Solution:

    Cache size = 512 KB, Block size = 128 bytes, Address size = 36 bits, Associativity = 4-way.

    Step 1: Calculate the number of sets. The total number of blocks = \( \frac{512 \, \text{KB}}{128 \, \text{bytes}} = 4096 \, \text{blocks} \) . Since it is 4-way set associative, the number of sets is \( \frac{4096}{4} = 1024 \, \text{sets} \) .

    Step 2: Calculate block offset bits. The block size is 128 bytes, so we need \( \log_2(128) = 7 \) bits for the block offset.

    Step 3: Calculate index bits. The number of sets = 1024, so we need \( \log_2(1024) = 10 \) bits for the index.

    Step 4: Calculate tag bits. The total address size is 36 bits, so the tag bits are \( 36 - (10 + 7) = 19 \) bits.

    If the number of ways is increased to 8, the number of sets becomes \( \frac{4096}{8} = 512 \, \text{sets} \) , and the index bits become \( \log_2(512) = 9 \) . The tag bits become \( 36 - (9 + 7) = 20 \) .

  3. A fully associative cache with a total size of 128 KB has a block size of 32 bytes. In a system with 40-bit physical addresses, how many bits are used for the tag and block offset? If the block size is doubled, what happens to the number of tag bits?

    Solution:

    Cache size = 128 KB, Block size = 32 bytes, Address size = 40 bits.

    Step 1: Calculate block offset bits. The block size is 32 bytes, so we need \( \log_2(32) = 5 \) bits for the block offset.

    Step 2: Calculate tag bits. Since it is fully associative, there are no index bits, so the tag bits are \( 40 - 5 = 35 \) .

    If the block size is doubled to 64 bytes, the block offset becomes \( \log_2(64) = 6 \) , and the tag bits become \( 40 - 6 = 34 \) bits.

  4. You have a cache of 1 MB organized as a 2-way set associative cache with 64-byte blocks. How many index bits are used in a 48-bit addressing system, and how does the number of index and tag bits change if the cache becomes 4-way set associative?

    Solution:

    Cache size = 1 MB, Block size = 64 bytes, Address size = 48 bits, Associativity = 2-way.

    Step 1: Calculate the number of sets. The total number of blocks = \( \frac{1 \, \text{MB}}{64 \, \text{bytes}} = 16,384 \, \text{blocks} \) . Since it is 2-way set associative, the number of sets = \( \frac{16,384}{2} = 8192 \, \text{sets} \) .

    Step 2: Calculate block offset bits. The block size is 64 bytes, so we need \( \log_2(64) = 6 \) bits for the block offset.

    Step 3: Calculate index bits. The number of sets = 8192, so we need \( \log_2(8192) = 13 \) bits for the index.

    Step 4: Calculate tag bits. The total address size is 48 bits, so the tag bits are \( 48 - (13 + 6) = 29 \) .

    If the cache becomes 4-way set associative, the number of sets becomes \( \frac{16,384}{4} = 4096 \, \text{sets} \) , and the index bits become \( \log_2(4096) = 12 \) . The tag bits become \( 48 - (12 + 6) = 30 \) .

  5. In a direct-mapped cache with 64 KB of total size and 128-byte blocks, calculate the number of bits for the index, block offset, and tag assuming a 32-bit address space. If the block size is increased to 256 bytes, how do the index and block offset bits change?

    Solution:

    Cache size = 64 KB, Block size = 128 bytes, Address size = 32 bits.

    Step 1: Calculate the number of sets. The number of blocks = \( \frac{64 \, \text{KB}}{128 \, \text{bytes}} = 512 \, \text{blocks} \) . Since it is direct-mapped, the number of sets = 512.

    Step 2: Calculate block offset bits. The block size is 128 bytes, so we need \( \log_2(128) = 7 \) bits for the block offset.

    Step 3: Calculate index bits. The number of sets = 512, so we need \( \log_2(512) = 9 \) bits for the index.

    Step 4: Calculate tag bits. The total address size is 32 bits, so the tag bits are \( 32 - (9 + 7) = 16 \) .

    If the block size is increased to 256 bytes, the number of blocks (and sets) becomes \( \frac{64 \, \text{KB}}{256 \, \text{bytes}} = 256 \) , so the block offset becomes \( \log_2(256) = 8 \) bits and the index bits drop to \( \log_2(256) = 8 \) . The tag bits remain \( 32 - (8 + 8) = 16 \) .

  6. A system has a 2-way set associative cache of size 256 KB with 32-byte blocks and 34-bit addresses. Determine the number of index bits, tag bits, and block offset bits. If the number of ways is increased to 4, how do these values change?

    Solution:

    Cache size = 256 KB, Block size = 32 bytes, Address size = 34 bits, Associativity = 2-way.

    Step 1: Calculate the number of sets. The total number of blocks = \( \frac{256 \, \text{KB}}{32 \, \text{bytes}} = 8192 \, \text{blocks} \) . Number of sets = \( \frac{8192}{2} = 4096 \, \text{sets} \) .

    Step 2: Calculate block offset bits. The block size is 32 bytes, so we need \( \log_2(32) = 5 \) bits for the block offset.

    Step 3: Calculate index bits. Number of sets = 4096, so we need \( \log_2(4096) = 12 \) bits for the index.

    Step 4: Calculate tag bits. The total address size is 34 bits, so the tag bits are \( 34 - (12 + 5) = 17 \) bits.

    If the associativity is increased to 4-way, the number of sets becomes \( \frac{8192}{4} = 2048 \, \text{sets} \) , the index bits become \( \log_2(2048) = 11 \) , and the tag bits become \( 34 - (11 + 5) = 18 \) bits.

  7. Given a fully associative cache of 256 KB with 64-byte blocks in a 40-bit address space, how many bits are used for the tag and block offset? If the cache were changed to 8-way set associative, how many index bits would now be required?

    Solution:

    Cache size = 256 KB, Block size = 64 bytes, Address size = 40 bits.

    Step 1: Calculate block offset bits. The block size is 64 bytes, so we need \( \log_2(64) = 6 \) bits for the block offset.

    Step 2: Calculate tag bits. Since it is fully associative, there are no index bits, so the tag bits are \( 40 - 6 = 34 \) bits.

    If the cache becomes 8-way set associative, the total number of blocks = \( \frac{256 \, \text{KB}}{64 \, \text{bytes}} = 4096 \, \text{blocks} \) . Number of sets = \( \frac{4096}{8} = 512 \, \text{sets} \) , so the index bits are \( \log_2(512) = 9 \) .

  8. A direct-mapped cache has a size of 512 KB and a block size of 64 bytes. With a 48-bit address space, calculate the number of tag, index, and block offset bits. If the cache is split into two equal-sized caches (i.e., halved), how would the index and tag bits change?

    Solution:

    Cache size = 512 KB, Block size = 64 bytes, Address size = 48 bits.

    Step 1: Calculate the number of sets. The number of blocks = \( \frac{512 \, \text{KB}}{64 \, \text{bytes}} = 8192 \, \text{blocks} \) . Since it's direct-mapped, the number of sets = 8192.

    Step 2: Calculate block offset bits. The block size is 64 bytes, so we need \( \log_2(64) = 6 \) bits for the block offset.

    Step 3: Calculate index bits. The number of sets = 8192, so we need \( \log_2(8192) = 13 \) bits for the index.

    Step 4: Calculate tag bits. The total address size is 48 bits, so the tag bits are \( 48 - (13 + 6) = 29 \) bits.

    If the cache is split into two equal-sized caches, each cache is now 256 KB, which gives 4096 sets. The new index bits = \( \log_2(4096) = 12 \) . The tag bits become \( 48 - (12 + 6) = 30 \) .

  9. You have a 4-way set associative cache with a size of 1 MB and a block size of 128 bytes in a 44-bit address system. How many index, tag, and block offset bits are required? If the block size is halved, how do the index, tag, and offset bits change?

    Solution:

    Cache size = 1 MB, Block size = 128 bytes, Address size = 44 bits, Associativity = 4-way.

    Step 1: Calculate the number of sets. The total number of blocks = \( \frac{1 \, \text{MB}}{128 \, \text{bytes}} = 8192 \, \text{blocks} \) . Number of sets = \( \frac{8192}{4} = 2048 \, \text{sets} \) .

    Step 2: Calculate block offset bits. The block size is 128 bytes, so we need \( \log_2(128) = 7 \) bits for the block offset.

    Step 3: Calculate index bits. Number of sets = 2048, so we need \( \log_2(2048) = 11 \) bits for the index.

    Step 4: Calculate tag bits. The total address size is 44 bits, so the tag bits are \( 44 - (11 + 7) = 26 \) bits.

    If the block size is halved to 64 bytes, the block offset becomes \( \log_2(64) = 6 \) bits. With the cache size fixed at 1 MB, the number of blocks doubles to 16,384, so the number of sets becomes \( \frac{16,384}{4} = 4096 \) and the index bits become \( \log_2(4096) = 12 \). The tag bits become \( 44 - (12 + 6) = 26 \).

  10. In a fully associative cache of 128 KB with 128-byte blocks in a 32-bit address space, calculate the number of tag bits and block offset bits. If the cache were organized as a 16-way set associative cache instead, what would be the new number of tag and index bits?

    Solution:

    Cache size = 128 KB, Block size = 128 bytes, Address size = 32 bits.

    Step 1: Calculate block offset bits. The block size is 128 bytes, so we need \( \log_2(128) = 7 \) bits for the block offset.

    Step 2: Calculate tag bits. Since it's fully associative, there are no index bits, so the tag bits are \( 32 - 7 = 25 \) bits.

    If the cache becomes 16-way set associative, the total number of blocks = \( \frac{128 \, \text{KB}}{128 \, \text{bytes}} = 1024 \, \text{blocks} \) . Number of sets = \( \frac{1024}{16} = 64 \, \text{sets} \) , so the index bits are \( \log_2(64) = 6 \) . The tag bits become \( 32 - (6 + 7) = 19 \) bits.
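
The eight worked examples above all reduce to the same arithmetic: offset = \( \log_2(\text{block size}) \), index = \( \log_2(\text{number of sets}) \), and tag = address bits − index − offset. As a quick way to double-check such answers, here is a minimal C sketch (the helper names `ilog2` and `cache_bits` and the sample calls are mine, not part of the problems) that prints the three field widths for a given configuration.

```c
#include <stdio.h>
#include <stdint.h>

/* Integer log2 for power-of-two values. */
static unsigned ilog2(uint64_t x) {
    unsigned bits = 0;
    while (x > 1) { x >>= 1; bits++; }
    return bits;
}

/* Print offset/index/tag widths for a cache.
 * associativity == 0 is used here to mean "fully associative". */
static void cache_bits(uint64_t cache_bytes, uint64_t block_bytes,
                       uint64_t associativity, unsigned addr_bits) {
    uint64_t blocks = cache_bytes / block_bytes;
    uint64_t sets   = (associativity == 0) ? 1 : blocks / associativity;
    unsigned offset = ilog2(block_bytes);
    unsigned index  = (associativity == 0) ? 0 : ilog2(sets);
    unsigned tag    = addr_bits - index - offset;
    printf("offset=%u  index=%u  tag=%u\n", offset, index, tag);
}

int main(void) {
    /* Problem 4: 1 MB, 64-byte blocks, 2-way, 48-bit addresses -> 6/13/29. */
    cache_bits(1ULL << 20, 64, 2, 48);
    /* Problem 7: 256 KB, 64-byte blocks, fully associative, 40-bit -> 6/0/34. */
    cache_bits(256ULL << 10, 64, 0, 40);
    /* Problem 9: 1 MB, 128-byte blocks, 4-way, 44-bit -> 7/11/26. */
    cache_bits(1ULL << 20, 128, 4, 44);
    return 0;
}
```

Compiling and running this against the solutions above is an easy sanity check whenever you change the block size, associativity, or address width.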

  11. A 2-way set associative cache has 8 sets, with a total size of 512 bytes and a block size of 32 bytes. The system uses 32-bit addresses. The cache is initially empty. A sequence of reads occurs at addresses 0x00000120, 0x000001A0, 0x00000220, 0x000001A0, and 0x00000320. Using the LRU replacement policy, what will the cache contents be after these accesses? Which block will be replaced if a new read occurs at address 0x00001200?

  12. A 4-way set associative cache has 16 sets, with a total size of 2048 bytes and a block size of 32 bytes. The system uses 32-bit addresses. The following addresses are accessed in sequence: 0x00000400, 0x00000440, 0x00000480, 0x000004C0, and 0x00000440 again. Now, a read to 0x00000500 occurs. Assuming the LFU (Least Frequently Used) replacement policy, which block will be replaced in the cache?

  13. A direct-mapped cache has a total size of 1024 bytes and a block size of 64 bytes. The system uses 32-bit addresses. The following sequence of memory addresses is accessed: 0x00000200, 0x00000280, 0x00000300, 0x00000200, 0x00000380, and 0x00000400. Using the FIFO replacement policy, which block will be replaced when a new read occurs at 0x00000500?

  14. A fully associative cache has 8 blocks and a total size of 256 bytes, with a block size of 32 bytes. The system uses 32-bit addresses. The cache is initially empty. Memory addresses 0x00000100, 0x00000200, 0x00000300, 0x00000400, 0x00000500, 0x00000600, 0x00000700, and 0x00000800 are read into the cache. If a new read to 0x00000900 occurs and the cache uses random replacement, which block is most likely to be replaced?

  15. A 4-way set associative cache has 8 sets and a total size of 1024 bytes, with a block size of 32 bytes. The system uses 32-bit addresses. Using the LRU replacement policy, the cache is accessed with reads to addresses 0x00000100, 0x00000180, 0x00000200, and 0x00000180. A write operation is then issued to address 0x00000280. Which block is replaced upon the write, and what will the final cache configuration be?
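
Problems 11-15 are easiest to check by simulating the access trace step by step. Below is a minimal sketch of a set-associative cache simulator with LRU replacement; the parameters are set to those of problem 11 (8 sets, 2 ways, 32-byte blocks), but the structure, names, and output format are my own choices, and you would swap in LFU, FIFO, or random victim selection to cover problems 12-14.

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal trace simulator for a set-associative cache with LRU replacement.
 * Parameters below match problem 11: 8 sets, 2 ways, 32-byte blocks. */
#define SETS  8
#define WAYS  2
#define BLOCK 32

typedef struct {
    int      valid;
    uint32_t tag;
    unsigned last_used;   /* larger value = touched more recently */
} Line;

static Line cache[SETS][WAYS];
static unsigned tick;

static void access_addr(uint32_t addr) {
    uint32_t block = addr / BLOCK;
    uint32_t set   = block % SETS;
    uint32_t tag   = block / SETS;
    tick++;

    /* Hit: refresh the LRU timestamp and return. */
    for (int w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].last_used = tick;
            printf("0x%08X -> set %u: HIT  (way %d)\n",
                   (unsigned)addr, (unsigned)set, w);
            return;
        }
    }

    /* Miss: the way with the smallest timestamp is either empty or LRU. */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (cache[set][w].last_used < cache[set][victim].last_used)
            victim = w;
    cache[set][victim] = (Line){ 1, tag, tick };
    printf("0x%08X -> set %u: MISS (way %d, new tag 0x%X)\n",
           (unsigned)addr, (unsigned)set, victim, (unsigned)tag);
}

int main(void) {
    /* Access sequence from problem 11, followed by the read at 0x00001200. */
    uint32_t trace[] = { 0x120, 0x1A0, 0x220, 0x1A0, 0x320, 0x1200 };
    for (int i = 0; i < (int)(sizeof trace / sizeof trace[0]); i++)
        access_addr(trace[i]);
    return 0;
}
```

The printed hit/miss log tells you which set each address maps to and which way is evicted, which is exactly the bookkeeping these trace problems ask you to do by hand.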

  16. A processor with an instruction count (IC) of 10 million and a base CPI of 1.2 has a clock cycle time of 0.5 ns. If the cache miss rate is 5%, and each miss incurs a penalty of 50 cycles, how much longer does it take to execute the program due to memory stalls compared to the ideal case where no misses occur?

  17. A system has a cache with a miss rate of 3% and an average memory access time of 200 cycles. The hit time is 2 cycles. Calculate the miss penalty for the system, and determine the overall average CPU execution time for a program with 100 million instructions and an average CPI (without memory stalls) of 1.5.

  18. If a processor has a memory access rate of 40% per instruction, a miss rate of 7%, and a miss penalty of 100 cycles, calculate the total memory stall clock cycles. Then, compute the overall execution time given a base CPI of 2.5, instruction count of 500 million, and a clock cycle time of 2 ns.

  19. A program executes 500 million instructions on a processor with a base CPI of 1.8. The cache has a hit time of 3 cycles, and the miss penalty is 80 cycles. The miss rate is 6%, and memory accesses per instruction is 35%. Calculate the total memory stall clock cycles, and the total CPU execution time.

  20. Consider a system where the hit time is 1 cycle, the miss rate is 10%, and the miss penalty is 150 cycles. If the program has 20% memory accesses per instruction, and the instruction count is 2 billion with a base CPI of 1.6, calculate the total memory stall cycles and the impact on the overall CPU execution time.

  21. A CPU has a miss rate of 4% and incurs 70 cycles of miss penalty per miss. The clock cycle time is 1.2 ns, and the program executes 300 million instructions with an average CPI of 2.0. Calculate the total execution time, including memory stalls, and the percentage increase in execution time due to cache misses.

  22. A processor runs a program with 200 million instructions, and it has a base CPI of 1.4. The memory access rate is 50%, the miss rate is 8%, and the miss penalty is 60 cycles. What is the total CPU execution time, and how many cycles are spent on memory stalls?

  23. Given a system with a hit time of 5 cycles, miss rate of 2%, and miss penalty of 120 cycles, determine the average memory access time. For a program with 1 billion instructions and a memory access rate of 30%, what is the overall memory stall clock cycles, and how does this affect the total execution time if the base CPI is 2.3?

  24. A processor has a cache with a hit time of 4 cycles and a miss rate of 9%. If the miss penalty is 100 cycles, and the program executes 800 million instructions with a base CPI of 2.0, calculate the overall execution time assuming the memory access rate is 40% per instruction.

  25. The hit time of a cache is 2 cycles, and the miss rate is 5%. The miss penalty is 150 cycles, and the memory access rate is 25% per instruction. Calculate the average memory access time. Given a program with 600 million instructions and a base CPI of 1.7, determine the total execution time.

  26. A program has an instruction count of 400 million, and the processor has a base CPI of 1.5. The cache miss rate is 7%, and the miss penalty is 90 cycles. The memory access rate is 35% per instruction. Calculate how much time is spent on memory stalls, and determine the overall CPU execution time with a clock cycle time of 1 ns.

  27. If a system's cache miss rate is 6% and its hit time is 3 cycles, with a miss penalty of 80 cycles, what is the average memory access time? For a program with a memory access rate of 20%, 1.5 billion instructions, and a base CPI of 2.2, calculate the total number of memory stall cycles and the execution time.

  28. A processor executes 700 million instructions with a base CPI of 1.6. The memory access rate is 40%, the miss rate is 4%, and the miss penalty is 70 cycles. Determine the total memory stall clock cycles, and calculate the overall execution time if the clock cycle time is 0.8 ns.

  29. A cache has a hit time of 6 cycles and a miss rate of 3%. The miss penalty is 90 cycles. What is the average memory access time? For a program with 2 billion instructions and a memory access rate of 50%, calculate the total execution time with a base CPI of 1.8 and a clock cycle time of 0.6 ns.

  30. If a processor has an instruction count of 500 million and a base CPI of 1.9, and the cache miss rate is 5%, with a miss penalty of 100 cycles and a memory access rate of 30%, calculate the total memory stall cycles and the impact on the overall execution time.

  31. A CPU has a hit time of 3 cycles, a miss rate of 4%, and a miss penalty of 120 cycles. If a program with 600 million instructions and a base CPI of 1.5 is run on this system, calculate the total memory stall cycles and the overall CPU execution time assuming a clock cycle time of 1.5 ns.

  32. For a system with a cache miss rate of 8%, a hit time of 2 cycles, and a miss penalty of 90 cycles, determine the average memory access time. A program with 1.2 billion instructions and a base CPI of 2.0 has a memory access rate of 40%. Calculate the total CPU execution time, including memory stalls.

  33. A processor has a memory access rate of 35%, a miss rate of 5%, and a miss penalty of 80 cycles. For a program with 900 million instructions and a base CPI of 1.7, determine how many clock cycles are spent on memory stalls, and calculate the total execution time with a clock cycle time of 0.7 ns.

  34. The cache of a system has a miss rate of 6%, a hit time of 4 cycles, and a miss penalty of 100 cycles. The memory access rate is 45%, and the program has 1 billion instructions with a base CPI of 1.6. Calculate the average memory access time and the total CPU execution time, including memory stalls.

  35. A system has a cache with a hit time of 3 cycles, a miss rate of 7%, and a miss penalty of 120 cycles. A program with 800 million instructions runs on this system with a memory access rate of 50% per instruction and a base CPI of 2. Calculate the total memory stall cycles and the overall CPU execution time assuming the clock cycle time is 1.2 ns.
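
Problems 16-35 all rest on the same pair of formulas: AMAT = hit time + miss rate × miss penalty, memory stall cycles = IC × memory accesses per instruction × miss rate × miss penalty, and CPU time = (IC × base CPI + stall cycles) × clock cycle time. The C sketch below is only a plug-in calculator with illustrative numbers (not taken from any specific problem above); substitute the values of the problem you are working on.

```c
#include <stdio.h>

/* Plug-in calculator for problems 16-35:
 *   AMAT         = hit_time + miss_rate * miss_penalty              (cycles)
 *   stall cycles = IC * accesses_per_instr * miss_rate * miss_penalty
 *   CPU time     = (IC * base_CPI + stall cycles) * cycle_time             */
int main(void) {
    /* Illustrative values only -- replace with the numbers from the
     * problem you are solving. */
    double IC               = 100e6;  /* instruction count               */
    double base_cpi         = 1.5;    /* CPI without memory stalls       */
    double accesses_per_ins = 0.30;   /* memory accesses per instruction */
    double miss_rate        = 0.05;
    double miss_penalty     = 80.0;   /* cycles                          */
    double hit_time         = 2.0;    /* cycles                          */
    double cycle_time_ns    = 1.0;

    double amat   = hit_time + miss_rate * miss_penalty;
    double stalls = IC * accesses_per_ins * miss_rate * miss_penalty;
    double cycles = IC * base_cpi + stalls;

    printf("AMAT                = %.2f cycles\n", amat);
    printf("memory stall cycles = %.3e\n", stalls);
    printf("CPU execution time  = %.4f s\n", cycles * cycle_time_ns * 1e-9);
    return 0;
}
```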

  36. A CPU has an L1 cache with a hit time of 2 cycles and a miss rate of 5%. The L2 cache has a hit time of 10 cycles, a miss rate of 2%, and a miss penalty of 100 cycles. Given that the memory access rate is 40%, calculate the average memory access time, the total memory stall cycles, and the CPU execution time for a program with 1 billion instructions and a base CPI of 1.8.

  37. For a processor with an instruction count of 500 million and a base CPI of 2.0, the L1 miss rate is 4%, and the L2 miss rate is 1%. The miss penalty of L2 is 150 cycles, and the L1 hit time is 3 cycles. Determine the total memory stall cycles per instruction and the total CPU execution time, including memory stalls.

  38. A system has an L1 hit time of 4 cycles, an L2 hit time of 15 cycles, and an L2 miss penalty of 200 cycles. If the miss rate of L1 is 6%, and the miss rate of L2 is 2%, calculate the average memory access time. For a program with 2 billion instructions and a base CPI of 2.1, determine the total CPU execution time.

  39. If the L1 miss rate is 7%, and the L2 miss rate is 3%, with L1 and L2 hit times of 5 and 12 cycles, respectively, and an L2 miss penalty of 180 cycles, calculate the total number of memory stall cycles per instruction. Given a program with 900 million instructions and a base CPI of 1.9, determine the total execution time.

  40. A processor has an L1 miss rate of 3%, and an L2 miss rate of 1%, with L1 and L2 hit times of 3 and 10 cycles, respectively. The L2 miss penalty is 120 cycles. If the instruction count is 600 million, and the base CPI is 2.3, calculate the total memory stall cycles and the impact on the overall CPU execution time.

  41. For a system where the L1 cache has a miss rate of 5%, and the L2 miss rate is 2%, calculate the average memory access time given that the L1 hit time is 3 cycles, the L2 hit time is 20 cycles, and the L2 miss penalty is 150 cycles. If the instruction count is 1.5 billion, and the base CPI is 1.7, calculate the total execution time.

  42. A processor executes a program with 800 million instructions and a base CPI of 2.5. The L1 miss rate is 6%, and the L2 miss rate is 2%. The L1 and L2 hit times are 4 and 15 cycles, respectively, and the L2 miss penalty is 180 cycles. Calculate the total number of memory stall cycles per instruction and the total CPU execution time.

  43. Given a system where the L1 miss rate is 4%, and the L2 miss rate is 1.5%, with L1 and L2 hit times of 5 and 14 cycles, respectively, and an L2 miss penalty of 150 cycles, determine the average memory access time. For a program with 1 billion instructions and a base CPI of 1.9, calculate the total execution time.

  44. A CPU has an L1 cache with a hit time of 3 cycles and a miss rate of 4%. The L2 cache has a hit time of 12 cycles, a miss rate of 2%, and a miss penalty of 100 cycles. For a program with 1.2 billion instructions and a base CPI of 1.6, calculate the memory stall cycles per instruction and the total CPU execution time.

  45. For a system with an L1 miss rate of 5%, an L2 miss rate of 2%, and an L2 miss penalty of 120 cycles, calculate the total memory stall cycles per instruction. The L1 hit time is 4 cycles, and the L2 hit time is 10 cycles. If the instruction count is 900 million, and the base CPI is 1.8, determine the total execution time, including memory stalls.
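
Problems 36-45 extend the same idea to two cache levels. One common convention, assumed here, treats the quoted L2 miss rate as a local miss rate, giving AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty); if a problem intends a global L2 miss rate, divide it by the L1 miss rate first. The sketch below uses illustrative numbers only.

```c
#include <stdio.h>

/* Two-level AMAT, treating the L2 miss rate as a *local* miss rate
 * (misses in L2 per access that reaches L2). */
int main(void) {
    /* Illustrative values only. */
    double l1_hit     = 3.0;    /* cycles                */
    double l1_miss    = 0.05;   /* L1 miss rate          */
    double l2_hit     = 12.0;   /* cycles                */
    double l2_miss    = 0.02;   /* L2 local miss rate    */
    double l2_penalty = 150.0;  /* cycles to main memory */

    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * l2_penalty);

    printf("AMAT                    = %.2f cycles\n", amat);
    printf("stall cycles per access = %.2f (beyond an L1 hit)\n",
           amat - l1_hit);
    return 0;
}
```

Multiplying the stall cycles per access by the memory accesses per instruction and the instruction count then feeds into the same CPU-time formula used for the single-level problems.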

Problems

  1. You are analyzing the importance of locality in justifying the use of cache memory. To explore this concept, you experiment with a computer system that has an L1 data cache and main memory, focusing specifically on data accesses. The latencies (in CPU cycles) of different types of memory accesses are as follows: cache hit = 2 cycles; cache miss = 130 cycles; main memory access without cache = 120 cycles.

    a. You run a program that has an overall cache miss rate of 8%. What will the average memory access time (in CPU cycles) be?

    b. Next, you run a program that generates completely random data accesses, resulting in no locality at all. To do this, you use an array of size 512 MB (which fits entirely in main memory). Accesses are made to random elements of this array (using a uniform random number generator for element indices). If the size of your data cache is 128 KB, what will the average memory access time be?

    c. If you compare the result from part (b) with the main memory access time when the cache is disabled, what conclusion can you draw about the role of the principle of locality in justifying cache usage?

    d. You observe that a cache hit saves 128 cycles (2 cycles for a cache hit vs. 130 cycles for a miss), but a cache miss results in a loss of 10 cycles compared to not using a cache (130 cycles for a miss vs. 120 cycles without a cache). Using these values, let's define G (gain) as the cycles saved by a cache hit and L (loss) as the cycles lost due to a cache miss. Using the quantities G and L, calculate the highest miss rate at which the cache would still provide an overall performance benefit.

  2. You are developing a system around a processor with in-order execution that runs at 1.4 GHz and has a CPI of 0.8 excluding memory accesses. The only instructions that access memory are loads (25% of all instructions) and stores (10% of all instructions). The system has a split L1 cache, with both the I-cache and D-cache being direct-mapped and holding 64 KB each. The I-cache has a 3% miss rate and uses 64-byte blocks, while the D-cache uses a write-through policy with a 6% miss rate and 32-byte blocks. The D-cache also has a write buffer that reduces stalls for 90% of all writes.

    The L2 cache is unified, write-back, with a size of 1 MB, 128-byte blocks, and an access time of 10 ns. The L1 and L2 caches are connected by a 256-bit bus running at 200 MHz, transferring a 256-bit word per bus cycle. Of all memory references sent to the L2 cache, 85% are satisfied without going to main memory. Additionally, 40% of the blocks replaced in the (write-back) L2 are dirty. The main memory uses a 256-bit-wide bus with an access latency of 70 ns, after which one 256-bit word is transferred per cycle on the 150 MHz bus.

    a. What is the average memory access time for instruction accesses?

    b. What is the average memory access time for data reads?

    c. What is the average memory access time for data writes?

    d. What is the overall CPI, including memory accesses?

  3. The Least Recently Used (LRU) replacement policy assumes that if memory address B1 was accessed less recently than address B2, then B2 is more likely to be accessed again before B1. Thus, B2 is retained in the cache while B1 becomes the preferred candidate for eviction. However, this assumption can fail, for example when a loop larger than the cache is executed repeatedly.

    Consider a fully associative 256-byte instruction cache with a 16-byte block size (each block can hold multiple instructions). The cache uses the LRU replacement policy.

    a. What is the asymptotic instruction miss rate for a 128-byte loop with a large number of iterations?

    b. Repeat part (a) for loop sizes of 256 bytes and 512 bytes.

    c. If the cache replacement policy is changed to Most Recently Used (MRU) (replace the most recently accessed cache line), which of the above loop sizes (128-, 256-, or 512-byte loops) would benefit the most from this policy?

    d. Suggest alternative replacement policies that might outperform LRU under specific conditions.

  4. Increasing the associativity of a cache (while keeping all other parameters constant) statistically reduces the miss rate. However, there can be certain pathological cases where increasing the associativity actually increases the miss rate for specific workloads.

    Consider the case of a direct-mapped cache compared to a four-way set associative cache of equal size. Assume the set-associative cache uses the Most Recently Used (MRU) replacement policy. To simplify, assume that the block size is 16 bytes.

    Construct a trace of memory accesses that would result in more misses in the four-way associative cache than in the direct-mapped cache.

  5. Consider a three-level memory hierarchy consisting of L1, L2, and L3 data caches. Assume that all caches use a write-back policy on a write hit, and the L1 cache has 16 KB, the L2 cache has 256 KB, and the L3 cache has 4 MB of storage. Each cache block size is 64 bytes. Now, consider the actions taken for the following events:

    a. An L1 cache miss when the caches are organized in an inclusive hierarchy. What happens to the contents of L1, L2, and L3?

    b. An L1 cache miss when the caches are organized in an exclusive hierarchy. How does data move between the L1, L2, and L3 caches?

    c. In both parts (a) and (b), consider the case where the evicted line may be clean or dirty. How does this affect the write-back process for each hierarchy type?
