rk3588-ddr-analysis/BUG_ANALYSIS.md
marfrit e30127d056 BUG_ANALYSIS + regs_annotated.h: TRM-canonical names for poll-site regs
Per RK3588 TRM Part 2 chapter 2 (DMC, 522 pages):
  +0x10080 = DDRCTL_MRCTRL0   (Mode Register Control, was MicroReset)
  +0x10090 = DDRCTL_MRSTAT    (MR Status mr_wr_busy, was MicroContMuxSel)
  +0x10514 = DDRCTL_DFISTAT   (DFI Status dfi_init_complete, was UctWriteProtShadow)

These are uMCTL2 controller registers — Rockchip-documented — NOT the
opaque PHY firmware scratch regs our 2026-04 analysis guessed. Poll
semantics now vendor-grounded: wait for MR command roundtrip, wait
for PHY-side DFI handshake.

Low-offset polls in train_phy_block (0x110, 0x118, 0x120, 0x154, 0x160,
0x184) plus the 0x684/0xa24/0xb88 ones remain DWC PUB and thus
undocumented; kept the best-effort RE names with `(RE)` tag in the
BUG_ANALYSIS table so a reader can tell which ones are vendor-canonical
and which are guesses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 08:38:49 +02:00


RK3588 DDR Init Blob — Bug Analysis & Training Explainer

What is DDR Training?

DDR training is the process by which the memory controller and PHY (physical interface) calibrate their timing to communicate reliably with the DRAM chips. At the frequencies involved (2400-3200 MHz data clock, which is WCK on LPDDR5, giving 4800-6400 MT/s), signal integrity is the primary challenge. Electrical signals traveling on PCB traces experience:

  • Propagation delay — different trace lengths = different arrival times
  • Crosstalk — adjacent signals interfere with each other
  • ISI (inter-symbol interference) — previous bit values affect current bit
  • Voltage droop/overshoot — impedance mismatches cause reflections
  • PVT variation — process, voltage, temperature affect timing margins

Training compensates for all of these by finding the optimal timing window (the "eye") for each signal individually.

Training Stages (as seen in this blob)

The RK3588 uses a Synopsys DWC (DesignWare Core) LPDDR5/4X multiPHY. The training sequence visible in the decompiled code follows the standard DWC PHY training flow:

1. ZQ Calibration (0x684 — CalBusy register)

  • What: Calibrates output driver impedance (pull-up/pull-down strength)
  • Why: Process variation means each chip's transistors are slightly different
  • In the code: Polls DWC_DDRPHY_MASTER_CalBusy (offset 0x684) — 11 uses
  • Bug risk: No timeout on CalBusy poll (Line 3226)

2. Write Leveling

  • What: Aligns the DQS (data strobe) signal with the CK (clock) at the DRAM
  • Why: PCB trace lengths differ — the clock and data arrive at different times
  • How: Controller sends DQS edges, DRAM reports whether DQS arrived early/late
  • In the code: Part of the large training functions with nested loops (0x10 iterations = 16 DQ bits, 0x20 iterations = 32-bit data path)

3. Read Gate Training

  • What: Finds the correct time to enable the read data capture circuit
  • Why: The controller must know exactly when valid read data will arrive
  • In the code: Functions polling DfiStatus (offset 0xa24, 65 uses)

4. Read/Write DQ Training (per-bit deskew)

  • What: Adjusts timing for each individual data bit (DQ0-DQ15)
  • Why: Each bit trace has slightly different length and coupling
  • How: Sends known patterns (0xAA55AA55, 0x55AA55AA), reads back, adjusts delays
  • In the code: The 0xaa55aa55 pattern writes at offsets 0x93c-0x970 (Lines 3857-3868)
  • Size: ~30 functions dedicated to DQ training — the bulk of the blob
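
The pattern check at the heart of per-bit deskew can be sketched as follows. This is a minimal illustration, not code recovered from the blob: `dq_fail_mask` and `lane_fail_count` are hypothetical helper names, and the 8-bits-per-lane layout is an assumption.

```c
#include <stdint.h>

/* Bits set in the result are DQ lines whose read-back differs from the pattern. */
static uint32_t dq_fail_mask(uint32_t expected, uint32_t readback)
{
    return expected ^ readback;
}

/* Count failing bits within one byte lane (assumed layout: lane 0 = bits 7:0). */
static unsigned lane_fail_count(uint32_t fail_mask, unsigned lane)
{
    uint32_t bits = (fail_mask >> (lane * 8u)) & 0xFFu;
    unsigned n = 0;

    while (bits) {
        n += bits & 1u;
        bits >>= 1;
    }
    return n;
}
```

A training loop would write 0xAA55AA55, read it back at each candidate delay, and keep the delays for which the fail mask of the bit under test is zero.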

5. Read/Write Eye Training

  • What: Scans the timing and voltage range where data is reliably captured
  • Why: Finds the center of the "eye diagram" — maximum noise margin
  • How: Sweeps delay and VREF values, tests at each point
  • In the code: The "eyescan" blob variant runs extended eye analysis
  • Note: The 0xff00aa pattern (28 uses) is a BUS_GRF configuration for interleaving/routing — not directly eye training but enables the channel configuration needed for it.
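
Picking the eye center from a one-dimensional sweep can be sketched like this. It assumes the sweep results are already available as a pass/fail array indexed by delay tap; `eye_center` is a hypothetical name and the blob's actual search may differ (for example, it may sweep VREF at the same time).

```c
#include <stddef.h>
#include <stdint.h>

/* Return the center index of the longest run of passing taps, or -1 if none pass. */
static int eye_center(const uint8_t *pass, size_t n)
{
    size_t best_start = 0, best_len = 0, run_start = 0, run_len = 0;

    for (size_t i = 0; i < n; i++) {
        if (pass[i]) {
            if (run_len == 0)
                run_start = i;          /* a new passing window opens here */
            run_len++;
            if (run_len > best_len) {
                best_len = run_len;
                best_start = run_start;
            }
        } else {
            run_len = 0;                /* window closed by a failing tap */
        }
    }
    return best_len ? (int)(best_start + (best_len - 1) / 2) : -1;
}
```

Choosing the middle of the widest window, rather than the first passing tap, leaves roughly equal margin on both sides of the sampling point.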

6. VREF Training (Voltage Reference)

  • What: Finds optimal voltage threshold for distinguishing 0 and 1
  • Why: At high speeds, the voltage swing is smaller — VREF must be centered
  • Two sides:
    • Host-side VREF: PHY's input comparator threshold
    • DRAM-side VREF: DRAM's input comparator threshold (set via Mode Register Write)
  • In the code: Functions accessing offsets 0x600, 0x608, 0x60c (29+24+14 = 67 uses)

7. CA Training (Command/Address)

  • What: Calibrates timing for the command/address bus (separate from data)
  • Why: Commands must arrive reliably — a missed command corrupts everything
  • In the code: Uses the same DFI interface (offset 0xa24) but with CA-specific mode register commands

Why Training Must Run Every Boot

Training results depend on:

  • Temperature — timing shifts by ~1-2 ps/°C
  • Voltage — supply voltage affects driver strength
  • DRAM internal state — varies between power cycles
  • PCB and component aging — long-term drift

The results are stored in SRAM (offsets 0x001FE000-0x001FE010) and passed to the kernel via PMU GRF OS registers. The kernel uses these for DVFS (Dynamic Voltage and Frequency Scaling) during runtime.


Bug Analysis

CRITICAL: 20 Timeout-less Hardware Polls

The most serious class of bugs. These are do {} while loops that poll hardware registers indefinitely. If the hardware doesn't respond (due to a clock issue, reset problem, or silicon defect), the system hangs permanently during boot with no diagnostic output.

Updated 2026-04-15 per RK3588 TRM Part 2: the uMCTL2 controller offsets are now vendor-named. PHY PUB offsets remain undocumented in the TRM (DWC / Innosilicon IP is not republished) — the names below with (RE) are our reverse-engineering guesses.

| Register | Offset (base) | Uses | What it waits for | Source |
|---|---|---|---|---|
| SGRF_DDR_STATUS | abs 0xFE0500E0 | 1 | Security GRF config done | RK3588 TRM part 1 |
| SGRF_DDR_CON21 | abs 0xFE050054 | 2 | Security GRF configuration done | RK3588 TRM part 1 |
| DDRCTL_DFISTAT | DDRCTL + 0x10514 | 5 | dfi_init_complete: PHY↔controller handshake | TRM part 2 Ch.2 (renamed from "UctWriteProtShadow") |
| DDRCTL_MRSTAT | DDRCTL + 0x10090 | 4 | mr_wr_busy: Mode Register Write complete | TRM part 2 Ch.2 (renamed from "MicroContMuxSel") |
| DDRCTL_MRCTRL0 | DDRCTL + 0x10080 | 2 | Mode Register Control (not a poll target by itself, but polled by code waiting for MR command completion) | TRM part 2 Ch.2 (renamed from "MicroReset") |
| PHY_CTL_STATE | PHY + 0x14 (RE) | 4 | PHY state machine: [2:0] == 1 (idle) or (val & 7) == 3 (some training stage) | Reverse-engineered, still not in TRM |
| PHY_CALBUSY | PHY + 0x684 (RE) | 1 | ZQ calibration complete (name matches DWC PUB convention) | Reverse-engineered |
| PHY_DFI_READY | PHY + 0xa24 (RE) | 4 | DFI-side handshake bit from PHY, separate from DDRCTL_DFISTAT | Reverse-engineered |
| PHY_SHADOW_BB8 | PHY + 0xb88 (RE) | 2 | Shadow status word that carries training firmware state between sub-blocks | Reverse-engineered |
| PHY_TRAIN_STEP | PHY + 0x118 / 0x120 (RE) | 2 | Step-complete bits [31:28], used in train_phy_block (d328) | Reverse-engineered |
| PHY_HANDSHAKE | PHY + 0x184 (RE) | 2 | HANDSHAKE bits [1:0], writer/reader sync in d328 | Reverse-engineered |

Base conventions:

  • DDRCTL = per-channel uMCTL2 controller base (four channels: DDRCTL0..3 per TRM Table "DDR Channel X IO description", pp. 557-558).
  • PHY = per-channel PHY base pointer held in ctx[ch*32], with the +0x8000 sub-block for the "Master"-class PHY block seen in d328 and the +0x10000 sub-block for the larger PHY block seen in d10c.

Impact: Any of these can cause a boot hang. The most likely failure modes:

  • Cold boot at extreme temperatures (timing margins shrink)
  • DRAM module with slow ZQ calibration
  • Power supply droop during training (PHY doesn't respond)

Fix: Add a timeout counter (e.g., 1000 iterations with 1µs delay = 1ms timeout) and return an error code. The calling function already checks for 0xFFFFFFFF error returns (23 instances).
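
A bounded poll along these lines might look like the sketch below. The helper name `poll_reg_timeout`, the `delay_us` hook, and the mask/expect interface are assumptions made for illustration; only the 0xFFFFFFFF error value and the 1000 x 1 µs budget come from the text above.

```c
#include <stdint.h>

#define DDR_ERR        0xFFFFFFFFu  /* error value the blob's callers already check */
#define POLL_MAX_ITERS 1000u        /* 1000 x ~1 us = ~1 ms budget */

/* Hypothetical delay hook: a real build would spin on the SoC timer. */
static void delay_us(uint32_t us) { (void)us; }

/* Wait until (*reg & mask) == expect; return 0 on success, DDR_ERR on timeout. */
static uint32_t poll_reg_timeout(const volatile uint32_t *reg,
                                 uint32_t mask, uint32_t expect)
{
    for (uint32_t i = 0; i < POLL_MAX_ITERS; i++) {
        if ((*reg & mask) == expect)
            return 0;
        delay_us(1);
    }
    return DDR_ERR;
}
```

Each bare `do {} while` loop would then become a single call whose return value feeds the existing error path, so a stuck register produces a diagnosable failure instead of a silent hang.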

WARNING: Read-Modify-Write on MMIO Without Memory Barriers

Several MMIO registers are read, modified, and written back without an explicit barrier (dsb or dmb). On AArch64 this is usually safe when the registers are mapped as Device memory (Device-nGnRE or Device-nGnRnE), because accesses to the same peripheral are then kept in program order. If the MMU mapping is wrong (Normal memory type), the accesses could be reordered or merged.

Affected registers:

  • SGRF_DDR_ENABLE (|= 1, &= ~1)
  • FW_DDR_ACCESS_CTRL (|= 0xFFFF, &= 0xFFFF0000)

Since this runs in EL3 with the MMU configuration controlled by BL31, this is likely safe — but it's a latent risk if the memory map changes.
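
One way to make the ordering explicit is a small read-modify-write helper that ends with a barrier. This is a sketch under assumptions: `mmio_clrset32` and `mmio_barrier` are hypothetical names, and the non-AArch64 branch uses the GCC/Clang `__atomic_thread_fence` builtin only so that the example also builds on a host machine.

```c
#include <stdint.h>

/* Full barrier: DSB on AArch64, a compiler/CPU fence elsewhere (host builds). */
static inline void mmio_barrier(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb sy" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Read the register, clear the bits in clr, set the bits in set, write back, fence. */
static inline void mmio_clrset32(volatile uint32_t *reg, uint32_t clr, uint32_t set)
{
    uint32_t v = *reg;

    v = (v & ~clr) | set;
    *reg = v;
    mmio_barrier();
}
```

With such a helper, `FW_DDR_ACCESS_CTRL |= 0xFFFF` becomes `mmio_clrset32(reg, 0, 0xFFFF)` and `&= 0xFFFF0000` becomes `mmio_clrset32(reg, 0xFFFF, 0)`.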

WARNING: Firewall Left Open

ddr_open_firewall() (Line 137) sets FW_DDR_ACCESS_CTRL |= 0xFFFF, granting all bus masters DDR access. The matching ddr_close_firewall() (Line 206) re-restricts it. However, the close function may not be called on all error paths — an early return due to training failure could leave the firewall wide open.
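
A single-exit structure guarantees the close call on every path. The sketch below uses stand-in bodies for `ddr_open_firewall()` and `ddr_close_firewall()`; the `fw_open` flag, `train_step()`, and `ddr_init_guarded()` are hypothetical and exist only to make the control flow testable.

```c
#include <stdint.h>

#define DDR_ERR 0xFFFFFFFFu

/* Stand-ins for the blob's routines; fw_open models the FW_DDR_ACCESS_CTRL state. */
static int fw_open;
static void ddr_open_firewall(void)  { fw_open = 1; }
static void ddr_close_firewall(void) { fw_open = 0; }

/* Hypothetical training step that fails when asked to. */
static uint32_t train_step(int fail) { return fail ? DDR_ERR : 0; }

static uint32_t ddr_init_guarded(int fail_first, int fail_second)
{
    uint32_t ret;

    ddr_open_firewall();

    ret = train_step(fail_first);
    if (ret == DDR_ERR)
        goto out;

    ret = train_step(fail_second);

out:
    ddr_close_firewall();   /* single exit: runs on success and on every failure */
    return ret;
}
```

The `goto out` idiom is the usual C substitute for scoped cleanup and keeps the firewall state independent of how many early-return conditions are added later.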

OPTIMIZATION: Redundant Register Polls

Several functions poll the same register in sequence:

  • Lines 1969-1979: Three consecutive polls of +0x10090 and +0x10080
  • Lines 2470-2480: Same triple poll pattern

Read against the TRM names, +0x10090 is DDRCTL_MRSTAT (mr_wr_busy) and +0x10080 is DDRCTL_MRCTRL0, so these polls most plausibly bracket a Mode Register command rather than PHY firmware traffic:

  1. Wait for any in-flight MR command to finish
  2. Check the MR control register
  3. Wait for the new MR command to complete

The first and third polls are redundant only if the controller can never be busy on entry. This could be a defensive coding pattern or a workaround for a status bit that isn't updated atomically.

OPTIMIZATION: Magic Number 0xFF00AA

The value 0xFF00AA appears 28 times in BUS_GRF register writes. Written out as a full 32-bit word it is 0x00FF00AA, which follows the Rockchip GRF "write-enable mask" pattern:

  • Upper 16 bits = write mask (0x00FF = bits 7:0 writable)
  • Lower 16 bits = value (0x00AA)

This is a hardware feature of Rockchip GRF registers, not a bug, but the decompiled code obscures the intent. In readable form:

BUS_GRF_REG[7:0] = 0xAA;  // set bits 7:0 to 10101010
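
The encoding can be made explicit with a small helper. This sketch assumes the usual Rockchip convention that bits 31:16 of the written word are per-bit write enables for bits 15:0; `grf_hiword_update` and `grf_apply` are hypothetical names, and `grf_apply` only models the hardware behaviour for illustration.

```c
#include <stdint.h>

/* Encode a masked write: bits 31:16 enable writes to the matching bits 15:0. */
static inline uint32_t grf_hiword_update(uint16_t mask, uint16_t value)
{
    return ((uint32_t)mask << 16) | (uint32_t)(value & mask);
}

/* Model what the hardware does with such a write (illustration only). */
static inline uint16_t grf_apply(uint16_t old, uint32_t wr)
{
    uint16_t mask = (uint16_t)(wr >> 16);

    return (uint16_t)((old & ~mask) | ((uint16_t)wr & mask));
}
```

Under that convention, `grf_hiword_update(0x00FF, 0x00AA)` yields 0x00FF00AA, and applying it to a register leaves bits 15:8 untouched while setting bits 7:0 to 0xAA.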

OBSERVATION: Error Recovery Strategy

The blob has 23 error returns (0xFFFFFFFF) across 1405 conditional checks — a 1.6% error handling ratio. Most errors result in immediate abort with no retry. The main orchestrator function (ddr_pmu_status_check at 0x9A90, the largest at 43K chars) does attempt retries by calling training subfunctions in sequence and checking their return values.

The error flow is:

  1. Training function fails → returns 0xFFFFFFFF
  2. Orchestrator detects failure → prints error string via UART
  3. Returns failure to BL2
  4. BL2 typically resets the SoC and tries again

There is no selective retry (e.g., "write leveling passed but read gate training failed, retry only read gate training"). Each failure restarts the entire training sequence from scratch.


Code Structure Summary

| Component | Functions | Lines | Purpose |
|---|---|---|---|
| Entry/dispatch | 3 | ~100 | Reset vector, version check |
| Security setup | 2 | ~50 | SGRF, firewall open/close |
| Clock/PLL | 3 | ~200 | DPLL config, clock gating, reset |
| Bus config | 1 | ~800 | 27 BUS_GRF registers |
| PHY training | ~30 | ~6000 | DQ/CA/VREF/eye training |
| DDRC init | 5 | ~2000 | Controller configuration |
| Timing calc | 3 | ~1500 | Timing parameter computation |
| Orchestrator | 1 | ~1500 | Main sequence, error handling |
| Mailbox/SRAM | 2 | ~200 | BL31 communication |
| Scramble | 1 | ~100 | DDR encryption |
| Utilities | ~65 | ~500 | Helper functions |

Cross-Reference: Known Bugs vs Decompiled Code

v1.18 "Single-rank LPDDR5 derate crash" — Found in Code

The v1.18 release notes say: "Fixed derate issue with single-rank LPDDR5" and "System might hang in kernel when switching frequency".

In the decompiled code, the DERATEINT/MR4 logic is in the large timing calculation function FUN_0000de40 (22,819 chars, 162 branches). This function computes timing parameters including derating adjustments. The single-rank bug likely affected the branch at the CS0/CS1 asymmetric capacity handling, which was added in v1.16 but not correctly gated for single-rank configurations.

The timeout-less polls at offsets +0x10090 and +0x10080 (DDRCTL_MRSTAT and DDRCTL_MRCTRL0, earlier labelled as PHY firmware mailbox and reset) are on the DVFS frequency switch path, which is where the v1.18 hang was reported.

v1.15 "PHY skew > DLL lock" — Found in Code

The training functions contain per-bit deskew calculations with clamping logic. In the DQ training functions (e.g., at line ~2040-2100), nested loops iterate over 0x20 (32) delay taps and 4 byte lanes. The clamping check ensures the selected delay tap doesn't exceed the DLL's lock range — a boundary condition that v1.15 fixed.

The 20 Timeout-less Polls — Explain Cold Boot Failures

Community reports of cold boot failures (Armbian, Radxa forums) are consistent with the 20 timeout-less hardware polls found in this analysis. At low temperatures:

  1. ZQ calibration takes longer (CalBusy at +0x684) — silicon is slower
  2. Mode Register command completion may take longer (DDRCTL_MRCTRL0 at +0x10080, earlier misread as MicroReset)
  3. DFI interface negotiation takes longer (DfiStatus at +0xA24)

Without timeouts, any of these becoming "slightly too slow" causes a permanent boot hang. The board appears dead until power-cycled (which changes temperature slightly, possibly allowing the next boot to succeed).

The LPDDR5 Bandwidth Paradox

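ThomasKaiser documented that LPDDR5 at 5472 MT/s showed worse latency than LPDDR4X at 4224 MT/s on the Rock 5 ITX. The LPDDR5 protocol overhead visible in the decompiled code is a plausible contributor, although only the first item below is a runtime cost; the other two lengthen boot-time training rather than access latency: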

  • WCK synchronization — LPDDR5 requires WCK-to-CK alignment before every data transfer, adding ~5 ns latency per burst
  • Longer CA training — the separate CA bus requires CBT Mode 1/2 training
  • More training stages — 15 steps vs ~8 for LPDDR4X

The bus configuration in BUS_GRF (27 registers at 0xFD5F8xxx) is significantly more complex for LPDDR5, with the 0xFF00AA write-mask pattern used 28 times to configure interleaving and routing for the 4-channel LPDDR5 topology.


Synopsys DWC PHY — Training Sequence in the Code

The decompiled training flow maps to the standard Synopsys DWC PHY sequence:

Register-to-Training-Stage Mapping

| Offset | Name | Training Stage | Uses |
|---|---|---|---|
| +0x684 | CalBusy | ZQ Calibration | 11 |
| +0xA24 | DfiStatus | DFI ready / gate training | 65 |
| +0x600 | VrefDAC0 | VREF training (host-side) | 29 |
| +0x608 | VrefDAC1 | VREF training (DRAM-side) | 24 |
| +0x60C | VrefDAC2 | VREF training | 14 |
| +0x10080 | DDRCTL_MRCTRL0 (formerly "MicroReset") | Mode Register control | 13 |
| +0x10090 | DDRCTL_MRSTAT (formerly "MicroContMuxSel") | Mode Register Write status (mr_wr_busy) | varies |
| +0x10180 | AcsmPlayback | Address/Command SM | 26 |
| +0x10280 | AcsmPlayback+0x100 | AC training | 21 |
| +0x10510 | UctWriteOnlyShadow | Training write commands | 28 |
| +0x10514 | DDRCTL_DFISTAT (formerly "UctWriteProtShadow") | DFI init status (dfi_init_complete) | 28 |
| +0x12BA0 | Reserved/vendor | Vendor-specific training | 11 |

The 0xAA55AA55 Training Pattern

The distinctive pattern 0xAA55AA55 written to offsets 0x93C-0x970 (lines 3857-3868) is the DQ training data pattern. The alternating bit pattern is specifically chosen because:

  • 10101010... maximizes switching noise (worst-case ISI)
  • 01010101... tests the complementary case
  • 10101010_01010101 (0xAA55) flips every bit between consecutive bytes, stressing DQ-to-DQ crosstalk in both polarities

The variations (0xAAAA5555, 0x55AA55AA, 0x00005555) provide different crosstalk scenarios — each pattern stresses a different subset of inter-bit coupling on the PCB.


Optimization Opportunities

1. Add Timeouts to Hardware Polls (Critical)

Add a countdown with ~1ms timeout to all 20 identified polls. Return 0xFFFFFFFF on timeout — the infrastructure already exists.

2. Selective Training Retry

Currently, any training failure restarts the full sequence. The Synopsys PUB supports restarting individual training steps via the PIR register. Retrying only the failed step would reduce recovery time from ~100ms to ~10ms.
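
A stage table with a per-stage retry budget is one way to structure this. The sketch below does not model the PIR register; `run_stages`, `train_fn`, `MAX_RETRIES`, and the two demo stages are hypothetical, and only the 0xFFFFFFFF error convention comes from the blob.

```c
#include <stddef.h>
#include <stdint.h>

#define DDR_ERR     0xFFFFFFFFu
#define MAX_RETRIES 3u   /* illustrative budget, not taken from the blob */

typedef uint32_t (*train_fn)(void *ctx);

/* Run stages in order; a failing stage gets one attempt plus up to MAX_RETRIES retries. */
static uint32_t run_stages(const train_fn *stages, size_t n, void *ctx)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t ret = DDR_ERR;

        for (unsigned attempt = 0; attempt <= MAX_RETRIES && ret == DDR_ERR; attempt++)
            ret = stages[i](ctx);
        if (ret == DDR_ERR)
            return DDR_ERR;   /* stage kept failing: give up as before */
    }
    return 0;
}

/* Demo stages: one always passes, one fails on its first call only. */
static uint32_t stage_ok(void *ctx) { (void)ctx; return 0; }
static uint32_t stage_flaky(void *ctx)
{
    unsigned *calls = ctx;

    return (*calls)++ == 0 ? DDR_ERR : 0;
}
```

Stages that already passed are not rerun, so a transient failure in one late stage costs only that stage's time rather than a full restart.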

3. Parallel Channel Training

The code appears to train channels sequentially (single-channel DDRC access at 0xFE010000). The Synopsys PUB firmware supports parallel training of independent channels — this could halve training time for 4-channel configs.

4. Remove Redundant Polls

The triple-poll pattern (lines 1969-1979, 2470-2480) appears to be defensive coding around the Mode Register busy handshake on +0x10090 and +0x10080 (earlier read as a PHY firmware race condition). If the controller status can be trusted, these could be collapsed to single polls.

5. Spread-Spectrum Clocking

The ddrbin_tool supports spread-spectrum mode (center/up/down spread) for EMI reduction. This is not configured in the standard blob — enabling center-spread could reduce DDR EMI by 6-10 dB with negligible performance impact.