HPCA Learnings

Aug 5, 2019 09:15 · 735 words · 4 minute read

As mentioned in my previous post, I nearly dropped from high performance computing architecture (HPCA) course but ultimately decided to weather the storm and I’m happy with my decision since the course exposed me to new concepts while simulatenously solidifying concepts that I had previously learned (in previous computer science courses and in industry). And instead of enumerating a long list of concepts that the course covered, I’ll hand select a few compelling ones. In each of the following sections, I share what I had assumed (prior to taking the course) and what I learned.

Branch prediction

I never put much thought into how the the PC (program counter) value was set; and at a high level, the PC counter is simply incremented and points to the next memory address. However, it’s a bit more complicated than that for control dependencies.

I once took a computer organization course, a course that covered very little of computer architecture and as a result, this was the first time I was exposed to branch prediction. Branch prediction attempts to improve the throughput of the processing pipeline by predicting control dependencies — essentially, jump and branch statements. And of course, there are many ways to implement branch prediction; from simple ones like always not taken (or always taken) to more complex ones that maintain a history of branch predictions.

Out of order execution

The fact that the CPU can schedule instructions out of order still blows my mind. But, at least now I have a understanding of how that happens underneath the hood. With the help of register renaming, Tamasulo’s algorithm (i.e. use of reservation stations, pipeline executions), a reorder buffer, the CPU can safely execute instructions without fear that the register values are being overwritten by another instruction.

L1 Caches

When I took computer organization, I learned of L1 caches (and in general, the memory hiearchy). However, I learned about different types of caches: direct mapped, set asscotiative, fully associative caches. In fact, I had to write some C++ code that would track different types of cache misses (i.e. coherence misses, conflict misses, compulsory misses, capacity misses).

Implementing an LRU

For the second project, we were tasked with instrumenting SESC code in order to track L1 cache misses — compulsory misses, conflict misses, and capacity misses. And in order to correctly track a conflict miss, I had to implement an LRU (least recently used) cache since a conflict miss is one that would otherwise not be a miss in a fully associative cache.

Cache coherence

When I was taking operating systems (OS) course last term, cache coherency was sparsely covered and I remember being pretty confused as to 1) what cache coherence was 2) why test-and-set lock implementation was less desirable than test-test-and-set. However, after hand solving problems that required us to track each cache line (for various protocols, such as MSI and MESI and MOESI), I now have a much better understanding of the trade offs of cache coherence protocols.

Also, I learned the difference between cache coherence and cache consistency. The former deals with accesses (between cores) for a single address while the ladder deals with accesses for mutiple addresses.

Fault tolerance

Ever since I worked in industry, I’ve been building “fault tolerant” systems. But taking a step back, what exactly is fault tolerance? But in order to answer that, we need to first define a few terms:

  • Dependability - does the actual behavior of the system line up with the specified behavior?
  • Fault - a deviation (within a module) from the specified behavior
  • Error - actual behavior within system deviates from specified behavior
  • Failure - system (itself) deviates from specified behavior

Not all faults manifest into errors. Say you have a bug in your program but that code path is never called during the runtime. In that case, there’s a fault but an error never manifests. Basically, the fault is a latent bug.


I was first introduced to synchronization in my OS (operating systems) course and HPCA built on top of that existing knowledge. In the course, I learned how to correctly (as well as incorrectly) write a barrier, allowing two threads to only proceed once all of them reach a specific point in code. You would think it’s pretty straight forward but its pretty nuanced since its entirely possible to write code that can trigger a deadlock situation, when no threads can proceed.