ARCH

This is a representation of the Intel Golden Cove architecture.

Dynamic Frequency Scaling (DFS)

Also called Turbo mode.
Allows the CPU to temporarily increase its frequency above the base clock to boost performance.
This can distort benchmarks, since the CPU does not run at a constant frequency, reducing result consistency and accuracy.


Architecture models (high level)

1) Register-based, load–store architectures

  • Arithmetic ops use registers only
  • Memory accessed only via explicit load/store
  • Easy pipelining, OoO-friendly, compiler-friendly
  • Most modern ISAs

Examples: ARM, RISC-V

2) Register-based, register–memory architectures

  • Arithmetic ops may use register + memory operand
  • ISA is not load–store, but microarchitecture behaves like one
  • Legacy-friendly, complex decoding

Example: x86

Small remark: Stack-based architectures

  • Operands implicit on a stack
  • Compact encoding, poor for pipelining
  • Common in VMs, not modern CPUs

Examples: JVM, WebAssembly


Pipeline

detail cpu pipeline here

PIPELINE Example

add rax, [rbx+8]

1. Frontend — What instruction to run?

  • BPU predicts next PC
  • L1 I-cache fetches instruction bytes
  • Pre-decode finds instruction boundary
  • Decode → 2 µops:
    • µop 1: load [rbx+8]
    • µop 2: add → rax
  • µop cache may bypass decode entirely
  • µops enter IDQ

2. Rename / Allocate — Prepare execution

  • ROB allocates entries
  • RAT maps:
    • rax → physical register P42
  • Dependencies tracked
  • µops enter Scheduler (RS)

3. Execute — Do work when ready

Load µop

  • AGU computes rbx + 8 (virtual address)
  • DTLB translates VA → PA
  • L1 D-cache lookup
  • If hit → value ready

Add µop

  • Waits for load result
  • Dispatched to INT ALU port
  • Computes addition

4. Memory system — Handle misses if needed

  • If L1 miss:
    • L2 → L3 → RAM
  • Fill Buffer tracks request
  • Load resumes when data arrives

5. Retire — Make it architecturally visible

  • µops complete out of order
  • ROB retires in order
  • rax updated
  • Instruction committed

6. What each block did (one line each)

  • BPU: guessed next instruction
  • Frontend: turned bytes → µops
  • ROB / RAT: renamed + tracked speculation
  • Scheduler: waited for operands
  • AGU: computed address
  • TLB: translated address
  • Cache: supplied data
  • ALU: computed result
  • ROB retire: committed state

Pipeline Hazards (Reminder)

  • Break ideal pipelining → stalls
  • Affect throughput / effective frequency
  • Fully handled by hardware

Structural

  • Resource conflict (same unit, same cycle)
  • Mitigation: more hardware (ALUs, ports)

Data

  • RAW: true dependency → forwarding
  • WAR: false dependency → register renaming
  • WAW: false dependency → register renaming
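
To make the renaming idea concrete, here is a minimal sketch of a register alias table (RAT). The function name `rename`, the `P0, P1, ...` naming, and the three-operand instruction format are assumptions for illustration, not how any real core stores its state. Every write gets a fresh physical register, so WAR and WAW conflicts disappear while RAW dependencies are kept.

```python
from itertools import count

def rename(instructions):
    """Map (dest, src1, src2) architectural registers to physical registers."""
    rat = {}                                  # architectural -> physical register
    fresh = (f"P{i}" for i in count())        # endless supply of physical registers

    def read(reg):
        # A source that was never written gets an initial mapping on first use.
        if reg not in rat:
            rat[reg] = next(fresh)
        return rat[reg]

    renamed = []
    for dest, src1, src2 in instructions:
        p1, p2 = read(src1), read(src2)       # sources use the current mapping (RAW kept)
        rat[dest] = next(fresh)               # every write gets a new physical register
        renamed.append((rat[dest], p1, p2))
    return renamed

program = [
    ("r1", "r2", "r3"),   # r1 = r2 op r3
    ("r4", "r1", "r2"),   # RAW on r1
    ("r2", "r5", "r5"),   # WAR on r2 (still read by the instructions above)
    ("r1", "r5", "r5"),   # WAW on r1
]
renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, the third and fourth instructions write registers that no earlier instruction reads or writes, so they are free to execute early.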

Control

  • Branches / flow changes
  • Mitigation: branch prediction, speculation

A superscalar engine

executes multiple independent instructions per cycle by issuing them to several execution units in parallel.


Branch Prediction

  • Correct prediction → high throughput
  • Misprediction → expensive pipeline flush
  • Uses dynamic, adaptive predictors

Branch Types

  • Unconditional / direct
    • Always taken
    • Single target
    • Easy to predict
  • Conditional
    • Taken / not taken
    • Forward (if/else): often not taken
    • Backward (loops): often taken
  • Indirect
    • Multiple targets
    • From switch, function pointers, virtual calls
    • Returns are also indirect
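
As an illustration of a dynamic predictor, here is a sketch of the classic textbook 2-bit saturating counter. This is an assumption chosen for simplicity; real predictors in modern cores are far more elaborate. The function name and the initial state are hypothetical.

```python
def simulate_two_bit_predictor(outcomes, state=2):
    """Count mispredictions for a sequence of branch outcomes (True = taken).

    state is a saturating counter in 0..3; values >= 2 predict 'taken'.
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions

# Backward loop branch: taken 9 times, then falls through once, repeated 3 times.
loop_branch = ([True] * 9 + [False]) * 3
# Worst case for this predictor: strictly alternating outcomes.
alternating = [True, False] * 10

print(simulate_two_bit_predictor(loop_branch))   # one miss per loop exit
print(simulate_two_bit_predictor(alternating))   # misses every second branch
```

The loop branch is mispredicted only at each loop exit, which matches the "backward branches are often taken" rule of thumb.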

CACHE

Cache is address-based

  • Cache does not store variables
  • Stores data indexed by physical addresses
  • Any access is:
    • Address → set → tag match → hit/miss
  • cache line (usually 64B, Apple M chips: 128B)
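
The address → set → tag path can be sketched as plain integer arithmetic. The geometry below (48 KiB, 12-way, 64 B lines, giving 64 sets) is an assumption for illustration; adjust the constants for the cache you care about.

```python
LINE_SIZE = 64            # bytes per cache line
CACHE_SIZE = 48 * 1024    # assumed capacity: 48 KiB
WAYS = 12                 # assumed associativity
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)   # 64 sets with these numbers

def split_address(addr):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % LINE_SIZE
    set_index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, set_index, offset

print(split_address(0x12345))
# Addresses exactly LINE_SIZE * NUM_SETS bytes apart land in the same set
# with different tags, so they compete for the same 12 ways.
print(split_address(0x12345 + LINE_SIZE * NUM_SETS))
```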

Address generation

  • Address Generation Unit (AGU):
    • base register + offset
  • Result = virtual address

Addresses used by programs

  • Load / store instructions issue virtual addresses (VA)
  • All address generation happens in load–store units
  • Other execution units do not see addresses

Cache access

  • Cache stores physical memory only
  • Cache lines are physically tagged
  • L1 is typically virtually indexed and physically tagged (VIPT), which lets cache indexing and address translation proceed in parallel
  • L2 / L3 caches require translation before lookup

TLB (Translation Lookaside Buffer)

Purpose

  • Cache recent virtual → physical translations
  • Avoid expensive page-table walks

TLB hierarchy

  • L1 iTLB: instruction fetch
  • L1 dTLB: data access
  • L2 TLB (STLB): shared, larger, slower

What TLB stores

  • Virtual Page Number (VPN)
  • Physical Page Number (PPN)
  • Permission bits:
    • read / write / execute
    • user / kernel
  • State bits (valid, global, etc.)

TLB miss handling

  • L1 TLB miss → check L2 TLB
  • L2 TLB miss → hardware page-table walker
  • Page walker:
    • reads page tables from memory
    • uses cache hierarchy
  • Mapping exists → TLB filled, load/store retried
  • Mapping does not exist → page fault
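
The miss-handling flow above can be sketched with dictionaries standing in for the hardware structures. Everything here is a simplified assumption: 4 KiB pages, a flat page table instead of a multi-level walk, no eviction, and hypothetical names (`translate`, `PageFault`).

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

class PageFault(Exception):
    """Raised when no mapping exists; the OS would handle this."""

def translate(vaddr, l1_tlb, l2_tlb, page_table):
    """Return (physical address, where the translation was found)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in l1_tlb:
        source = "L1 TLB"
    elif vpn in l2_tlb:
        l1_tlb[vpn] = l2_tlb[vpn]                    # refill L1 from L2
        source = "L2 TLB"
    elif vpn in page_table:
        # Stand-in for the hardware page walk; both TLB levels are filled.
        l2_tlb[vpn] = l1_tlb[vpn] = page_table[vpn]
        source = "page walk"
    else:
        raise PageFault(f"no mapping for VPN {vpn:#x}")
    return l1_tlb[vpn] * PAGE_SIZE + offset, source

l1, l2 = {}, {}
page_table = {0x10: 0x80}                            # VPN 0x10 -> PPN 0x80
print(translate(0x10008, l1, l2, page_table))        # first access walks the table
print(translate(0x10010, l1, l2, page_table))        # same page now hits the L1 TLB
```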

Page fault

  • Page table entry not present
  • CPU raises page fault exception
  • OS handles:
    • allocate page or
    • load page from disk
  • Instruction is restarted
  • TLB: translates virtual addresses → physical addresses (validity, permissions)
  • Cache: uses physical addresses → returns data quickly

TLB lookup and L1 cache indexing can proceed in parallel

Important rules

  • TLB miss ≠ cache miss
  • TLB miss ≠ page fault
  • A TLB miss that needs a page walk usually costs more than a single cache miss (the walk itself makes several memory accesses)

TESTING

CI Performance

  1. Set up the system under test
  2. Run benchmark suite
  3. Collect and report results
  4. Detect performance changes
  5. Alert on unexpected regressions or gains
  6. Visualize results for human analysis

CI Threshold

Performance tests in CI must define explicit acceptable variation thresholds.
Unexpected performance gains in unrelated code paths often indicate measurement errors or regressions, not better code.
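
A threshold check of this kind could look like the sketch below. The 5% threshold and the function name `classify_change` are arbitrary assumptions; a real CI setup would pick the threshold from the measured noise of each benchmark.

```python
def classify_change(baseline_ms, current_ms, threshold=0.05):
    """Classify a benchmark result relative to its baseline.

    threshold is the accepted relative variation (5% here, an example value).
    """
    change = (current_ms - baseline_ms) / baseline_ms
    if change > threshold:
        return "regression"
    if change < -threshold:
        return "unexpected gain"   # flagged too: may be a measurement error
    return "within noise"

print(classify_change(100.0, 110.0))
print(classify_change(100.0, 90.0))
print(classify_change(100.0, 102.0))
```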

LUCI

CI system of the Chromium project, which also runs its performance test suite.

Manual Testing

  • Always compare baseline vs modified
  • Never trust single-run results
  • Run benchmarks multiple times
  • Treat results as a distribution
  • Check standard deviation first
    • High variance → results unreliable
  • Do not compute speedups on noisy data
  • Unexpected speedups often mean measurement error
  • Use visualization to validate results
  • Outliers matter — don’t discard blindly
  • Prefer median / percentiles if data is skewed
  • Reduce noise before increasing sample count
  • More samples ≠ better results if variance remains high
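
A sketch of these rules in code: summarize each set of runs as a distribution, check the spread first, and only then compute a speedup from the medians. The 2% limit on the coefficient of variation and the sample values are made-up assumptions.

```python
import statistics

def summarize(samples):
    """Basic distribution summary for a list of run times."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {
        "median": statistics.median(samples),
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,          # coefficient of variation (relative spread)
    }

def compare(baseline, modified, max_cv=0.02):
    """Return the median speedup, or None if either data set is too noisy."""
    b, m = summarize(baseline), summarize(modified)
    if b["cv"] > max_cv or m["cv"] > max_cv:
        return None                  # do not compute speedups on noisy data
    return b["median"] / m["median"]

baseline = [100, 101, 99, 100, 100]  # hypothetical run times in ms
modified = [90, 91, 89, 90, 90]
noisy = [100, 150, 80, 120, 95]

print(compare(baseline, modified))   # speedup from medians
print(compare(noisy, modified))      # None: reduce noise before comparing
```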

Timer

  • Long-running tests: use standard library time functions
    • Based on system (wall-clock) time
  • Short / precise tests: use CPU cycle counters
    • x86: RDTSC/RDTSCP instruction
    • ARM: architectural timer (CNTVCT_EL0)
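
For the long-running case, a minimal sketch with the Python standard library is shown here. `time.perf_counter_ns` is a monotonic high-resolution clock; the helper names and the workload are hypothetical. Cycle counters such as RDTSC are not reachable from pure Python, so the short/precise case is not covered.

```python
import time

def time_once(fn):
    """Elapsed time of one call to fn, in nanoseconds."""
    start = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - start

def time_many(fn, runs=10):
    """Repeat the measurement so the result can be treated as a distribution."""
    return [time_once(fn) for _ in range(runs)]

samples = time_many(lambda: sum(range(100_000)), runs=5)
print(sorted(samples))
```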

X86 Time Reg

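  • RDTSC: Read Time-Stamp Counter
    • Reads the TSC (Time Stamp Counter)
    • Counts TSC cycles since reset
  • RDTSCP: ordered variant of RDTSC
    • Waits until earlier instructions have executed before reading the TSC (it is not a fully serializing instruction), which gives better ordering for measurements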

Notes

  • Modern CPUs use invariant TSC
  • TSC frequency is constant (independent of DFS / Turbo)
  • Still affected by out-of-order execution → use serialization if needed

ARM Time Reg

  • CNTVCT_EL0: Virtual Count Register (most commonly used in user space)
  • CNTPCT_EL0: Physical Count Register (more privileged)

Both count ticks of the fixed-frequency architectural timer, not core clock cycles.

Typical choice:

  • User-space timing: CNTVCT_EL0
  • Kernel / low-level timing: CNTPCT_EL0

(They are the ARM equivalent of x86 RDTSC, but with a fixed-frequency timer.)


Terminology and Metrics

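Retired vs. executed instructions

  • Executed: the instruction was executed by the CPU, possibly speculatively
  • Retired: the instruction was executed and the speculation that led to it turned out to be correct, so its result is committed

If an instruction is executed on a mispredicted path (for example, after a branch that was predicted taken but ends up not taken), its results are flushed and it never retires.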

CPU Utilization

The percentage of time a core is not idle.

  • Formula: CPU Utilization = Ref Cycles (Unhalted) / Total TSC Cycles
    • Ref Cycles: how many reference clock cycles elapsed while the CPU core was awake and not idle
    • TSC: Time Stamp Counter (always ticking)
  • The Trap: High utilization != High performance. A CPU can be 100% “utilized” while completely stalled waiting for memory or spinning on a lock.

IPC and CPI

Metrics for microarchitectural efficiency.

1. IPC (Instructions Per Cycle), the metric most often used for performance analysis

  • Definition: How many instructions are retired per clock cycle on average
  • Goal: Higher is better. Rough ranges:
    • Compute-heavy: 4–6
    • Memory-heavy: 0–1
  • Formula: IPC = Instructions Retired / Core Cycles (Unhalted)

2. CPI (Cycles Per Instruction)

  • Definition: How many cycles it takes on average to retire one instruction (CPI = 1 / IPC)
  • Goal: Lower is better
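
A small sketch of both formulas. The counter values are the Blender P-core numbers from the case study table later in these notes; the function names are hypothetical.

```python
def ipc(instructions_retired, core_cycles):
    """Instructions retired per unhalted core cycle."""
    return instructions_retired / core_cycles

def cpi(instructions_retired, core_cycles):
    """Cycles per retired instruction; the reciprocal of IPC."""
    return core_cycles / instructions_retired

instructions = 6.02e12   # Blender, P-core
cycles = 4.31e12         # Blender, P-core

print(round(ipc(instructions, cycles), 2))   # 1.4
print(round(cpi(instructions, cycles), 2))   # 0.72
```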

The Performance Equation

Performance = IPC × Frequency

Frequency vs. IPC

  • Independence:
    • IPC and Frequency are independent metrics in benchmarking.
  • Clock Speed:
    • Frequency only defines how fast a cycle is
    • IPC defines how much work happens inside that cycle.
  • The Reality:
    • Running a CPU at 1 GHz vs. 5 GHz usually results in the same IPC
    • To increase IPC: Improve the design (larger caches, better branch prediction, Out-of-Order execution).
    • Design Trade-off:
      • Intel/AMD often push High Frequency
      • Apple M-series pushes High IPC (which allows for lower frequency and better power efficiency).

Micro-operation (µop)

x86 architecture is:

  • Outside (ASM): CISC (Complex Instruction Set Computer)
  • Inside (Hardware): RISC-style (Reduced Instruction Set Computer)

x86 CISC instructions are split into RISC-style µops to:

  • Enable out-of-order (OoO) execution: push rbx splits into:
    • sub rsp, 8
    • store [rsp], rbx
  • Execute in parallel: Complex instructions are broken into independent parts.
    • Example: HADDPD xmm1, xmm2 (Horizontal Add) splits into:
      1. Reduce xmm2 → tmp1
      2. Reduce xmm1 → tmp2
      3. Merge results into xmm1
    • Steps 1 and 2 are independent and can run simultaneously.
  • Simplify the Execution Pipeline: The backend only needs to handle simple, uniform operations rather than hundreds of complex legacy instructions.

Microfusion

Fuses µops from the same machine instruction into a single µop. Applies to:

  • memory writes
  • read-modify operations (e.g. `add eax, [mem]`)

Macrofusion

Fuses µops from different machine instructions into a single µop. Applies to: typically an arithmetic/logic instruction followed by a conditional jump.

Example :

dec rdi
jnz .loop

Modern updates: Zen 4 now supports macro-fusion for DIV and NOP sequences.

Pipeline Slot

A pipeline slot represents the hardware resources required to process a single µop.

  • Definition: It is the unit of “work capacity” for a CPU cycle.
  • Allocation Width in µops:
    • Skylake, Zen 3: 4-wide
    • Golden Cove / Zen 4: 6-wide
    • Apple M3: 9-wide
  • The IPC Cap: A machine’s width is its theoretical maximum IPC. If you measure an IPC of 7 on a 6-wide Golden Cove, your measurements are likely wrong.

Core vs. Reference Cycles

Because of Turbo Boost (DFS), “one cycle” has a variable duration.

  • Core Cycles (cycles): Actual pulses at the current frequency.
    • Best for: Measuring code complexity/efficiency.
  • Ref Cycles (ref-cycles): Pulses at the fixed base frequency.
    • Best for: Measuring time relative to the official clock speed.
  • Key Rule: IPC and Frequency are independent.
    • Changing frequency (1GHz → 5GHz) doesn’t change IPC.
    • IPC is improved by design (bigger caches, better predictors).
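
The relation between the three counters can be sketched as follows. It assumes ref cycles tick at the same fixed rate as the TSC (the base frequency). The counter values and the 2.0 GHz base frequency are hypothetical.

```python
def cpu_utilization(ref_cycles_unhalted, tsc_cycles):
    """Fraction of wall-clock time the core was not idle."""
    return ref_cycles_unhalted / tsc_cycles

def effective_frequency_ghz(base_ghz, core_cycles, ref_cycles_unhalted):
    """Average frequency while the core was running.

    Core cycles tick at the current frequency and ref cycles at the base
    frequency, so their ratio scales the base frequency.
    """
    return base_ghz * core_cycles / ref_cycles_unhalted

tsc = 10e9     # hypothetical: 5 s at a 2.0 GHz base clock
ref = 6e9      # the core was awake for 60% of that time
core = 9e9     # it ran faster than base while awake (turbo)

print(cpu_utilization(ref, tsc))               # 0.6
print(effective_frequency_ghz(2.0, core, ref)) # 3.0
```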

Cache Miss

Occurs when the CPU looks for data in a cache and it isn’t there. The request then goes to the next, slower level. Typical access latencies:

  • L1 Cache: 4 cycles (~1 ns)
  • L2 Cache: 10-25 cycles (5-10 ns)
  • L3 Cache: ~40 cycles (20 ns)
  • Main Memory: 200+ cycles (100 ns)

Mispredicted branch

A mispredicted branch typically incurs a penalty of 10 to 25 clock cycles.

  • All instructions executed on the wrong path must be flushed
  • some buffers may require cleanup to restore the state

Memory Latency and Bandwidth

  • Latency: how long it takes to get one piece of data (e.g. 1 byte)
  • Bandwidth: how many bytes can be transferred per second

Measuring Performance

  • Latency: Measured by pointer chasing (sequential, dependent loads).
  • Bandwidth: Measured by independent loads (often using SIMD/AVX2) to saturate the data bus.
  • Tools: Intel MLC (Memory Latency Checker) for x86; lmbench or STREAM for ARM/Generic.
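
The structure of a pointer-chasing test can be sketched as below. This only illustrates the dependent-load pattern: in Python the interpreter overhead dwarfs the memory latency, so the printed number is not a hardware latency. A real measurement would use C or assembly, or one of the tools listed above. The function names are hypothetical.

```python
import random
import time

def make_chain(n, seed=0):
    """Build one random cycle over 0..n-1: chain[i] is the index visited after i."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chain = [0] * n
    for here, there in zip(order, order[1:] + order[:1]):
        chain[here] = there
    return chain

def chase(chain, steps):
    """Follow the chain; each load depends on the result of the previous one."""
    i = 0
    for _ in range(steps):
        i = chain[i]
    return i

chain = make_chain(1 << 16)
start = time.perf_counter_ns()
chase(chain, len(chain))
ns_per_step = (time.perf_counter_ns() - start) / len(chain)
print(f"{ns_per_step:.1f} ns per dependent step (mostly interpreter overhead)")
```

The random order defeats the hardware prefetchers, and the single cycle guarantees every element is visited once before the walk repeats.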

Hierarchy Metrics (Example: Alder Lake)

Level | Size (P-core) | Latency | Max Bandwidth (All Cores)
L1 D-Cache | 48 KB | ~1.1 ns | ~TB/s scale
L2 Cache | 1.25 MB | ~3.2 ns | ~175 GB/s
L3 (Shared) | 18 MB | ~12.5 ns | ~300 GB/s
DDR4 RAM | 16 GB | ~45–90 ns | ~33.7 GB/s

Idle vs. Loaded

  • Idle Latency: Travel time on an empty highway; data arrives at the absolute maximum hardware speed.
  • Loaded Latency: Travel time during rush hour; data is delayed because it must wait in line (queue) behind other requests.

Case Study: Analyzing Performance Metrics of Four Benchmarks

To collect performance metrics, I used the toplev.py script from Andi Kleen’s pmu-tools.

Metric Name | Core Type | Blender | Stockfish | Clang 15 | CloverLeaf
Instructions | P-core | 6.02E+12 | 6.59E+11 | 2.40E+13 | 1.06E+12
Core Cycles | P-core | 4.31E+12 | 3.65E+11 | 3.78E+13 | 5.25E+12
IPC | P-core | 1.40 | 1.80 | 0.64 | 0.20
CPI | P-core | 0.72 | 0.55 | 1.57 | 4.96
Instructions | E-core | 4.97E+12 | 0 | 1.43E+13 | 1.11E+12
Core Cycles | E-core | 3.73E+12 | 0 | 3.19E+13 | 4.28E+12
IPC | E-core | 1.33 | 0 | 0.45 | 0.26
L1MPKI | P-core | 3.88 | 21.38 | 6.01 | 13.44
L2MPKI | P-core | 0.15 | 1.67 | 1.09 | 3.58
L3MPKI | P-core | 0.04 | 0.14 | 0.56 | 3.43
Br. Misp. Ratio | P-core | 0.02 | 0.08 | 0.03 | 0.01
Code STLB MPKI | P-core | 0 | 0.01 | 0.35 | 0.01
Ld STLB MPKI | P-core | 0.08 | 0.04 | 0.51 | 0.03
LdMissLat (Clk) | P-core | 12.92 | 10.37 | 76.7 | 253.89
ILP | P-core | 3.67 | 3.65 | 2.93 | 2.53
MLP | P-core | 1.61 | 2.62 | 1.57 | 2.78
Dram Bw (GB/s) | All | 1.58 | 1.42 | 10.67 | 24.57
IpCall | All | 176.8 | 153.5 | 40.9 | 2,729
IpMispredict | All | 610.4 | 214.7 | 177.7 | 2,416
IpFLOP | All | 1.1 | 1.82E+06 | 286,348 | 1.8
IpSWPF | All | 90.2 | 2,565 | 105,933 | 172,348

1. The “Big Picture” Efficiency

  • IPC (Instructions Per Cycle):
    • High (> 1.5): Likely Compute-Bound (Blender, Stockfish).
    • Low (< 1.0): Likely Memory or Latency-Bound (Clang, CloverLeaf).
  • Core Distribution: Check if work is balanced between P-cores and E-cores (Blender/CloverLeaf) or if the app is pinned to a single thread (Stockfish).

2. Compute & Instruction Mix

  • IpFLOP / IpArith: Are you doing math?
    • If high, check IpArith AVX128/256: Is the code vectorized? (Blender/CloverLeaf).
    • If zero, you are likely doing Integer logic or Pointer Chasing (Stockfish/Clang).
  • IpSWPF (Software Prefetches): Look for these to see if the compiler or programmer is manually “helping” the memory controller.

3. Branching & Control Flow

  • Br. Misp. Ratio: Anything over 2-5% is a major red flag (Stockfish/Clang).
  • IpMispredict: How many instructions pass between “surprises” for the CPU?
  • IpCall: Are functions too small? If this is low (~40), the CPU spends more time handling function “overhead” (stack setup/cleanup) than doing real work (Clang).

4. Memory Hierarchy & TLB

  • L*MPKI (Misses Per Kilo-Instruction):
    • High L1/L2 MPKI: Working set doesn’t fit in the local caches.
    • High L3 MPKI: Data is streaming from RAM (CloverLeaf).
  • LdMissLat (Load Miss Latency): This is the “pain metric.”
    • Low (~12 cycles): L1 misses are mostly served by a nearby cache level (L2).
    • High (>200 cycles): Misses are served from RAM; CPU is starving (CloverLeaf).
  • STLB MPKI (TLB Misses): Does the binary have a huge footprint? High values here mean the CPU is “lost” trying to find where code or data lives in memory (Clang).
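
The "per kilo-instruction" and "instructions per event" metrics are simple ratios of raw counters. A sketch with hypothetical counter values and hypothetical function names:

```python
def mpki(misses, instructions):
    """Misses per 1000 retired instructions."""
    return misses * 1000 / instructions

def instructions_per_event(instructions, events):
    """Ip* metrics (IpCall, IpMispredict, ...): instructions between two events."""
    return instructions / events

instructions = 1_000_000_000   # hypothetical run: 1e9 retired instructions
l1_misses = 3_880_000          # hypothetical L1 miss count
mispredicts = 5_000_000        # hypothetical branch mispredictions

print(mpki(l1_misses, instructions))                       # 3.88
print(instructions_per_event(instructions, mispredicts))   # 200.0
```

Normalizing by instruction count makes runs of different lengths comparable, which raw miss counts are not.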

5. System Limits

  • DRAM Bandwidth: Compare your measured GB/s against your RAM’s theoretical max.
    • If you are near the limit (~70-80%+), adding more threads is unlikely to help and can even slow the program down due to contention (CloverLeaf).
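
A sketch of that comparison. Pairing the ~33.7 GB/s DDR4 figure from the Alder Lake table with the measured values from the case study is an assumption; the two tables may not describe the same memory configuration. The 70% limit and the function names are also assumptions.

```python
PEAK_DRAM_GBS = 33.7   # assumed theoretical maximum, taken from the DDR4 row above

def bandwidth_utilization(measured_gbs, peak_gbs=PEAK_DRAM_GBS):
    """Fraction of the theoretical DRAM bandwidth in use."""
    return measured_gbs / peak_gbs

def is_bandwidth_bound(measured_gbs, peak_gbs=PEAK_DRAM_GBS, limit=0.7):
    """True when utilization reaches the (assumed) 70% warning level."""
    return bandwidth_utilization(measured_gbs, peak_gbs) >= limit

measured = {"Blender": 1.58, "Stockfish": 1.42, "Clang 15": 10.67, "CloverLeaf": 24.57}
for name, gbs in measured.items():
    print(f"{name}: {bandwidth_utilization(gbs):.0%} bound={is_bandwidth_bound(gbs)}")
```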