
This is Part 8 in the series: Linux on STM32MP135. See other articles.
My STM32MP135 board includes DDR3L RAM and initial tests show that I can fill it up with pseudo-random data and read it back correctly. ST provides a DDR test utility with a suite of memory tests, all of which pass. I decided to take it a step further and test the memory on a more intensive real-world task: “unzipping” a compressed file.
The result of the decompression test was very bad: most of the file was decompressed correctly, but a few bits were always wrong, and a few more were wrong only some of the time. I spent two or three days tracing my way through the “unzip” code, instruction by instruction, to try to catch where exactly it goes wrong.
In the end, I made an embarrassing discovery: I have partially swapped byte lanes. DDR3L on this SoC has two byte lanes, each consisting of {data, mask, strobe}. I have connected the data bits correctly, but swapped the mask & strobe between the two bytes. (Six high speed traces, some on inner layers—there’s no fixing that by hand.) Had I also swapped the data bits, everything would have been fine; indeed, the eval board swaps all the wires, which led me astray. (Partially.)
Sadly, AI was of no help in this instance. Given my DDR3L wiring, I can convince it either way: the connections are good; the connections are not good. In the end, only Rev B will tell for sure.
In this article we will proceed with debugging boot of the compressed Linux
kernel image (zImage) on a custom
board populated with the
STM32MP135 SoC. The starting point will be the build that runs on the
evaluation board as described in the previous
article.
Despite booting just fine on the evaluation board, the zImage gets stuck on
boot on the custom board, without any messages printed to the UART console.
Following along with the debugger shows that the decompressor code does run,
but it’s not clear where exactly it gets stuck.
It is possible that the burst of DDR activity during the high-speed decompression draws more current than the 1.35V supply is able to provide, despite the decoupling capacitance.
Indeed, on the scope I see a 30mV drop in the 1.35V supply voltage for about 500ms. However, if I raise the supply voltage by 30mV, the boot still gets stuck. This was with the kernel written to 0xC2008000 and the DTB to 0xC4008000, which means that relocation isn’t necessary. My interpretation is that the scope trace shows that decompression takes about half a second.
Interestingly, if the kernel is written to 0xC0008000 and the DTB to 0xC2008000, in which case relocation is necessary, the 20mV supply drop is shorter, about 150ms, and is followed by 10ms of a bigger drop, 120mV. That bigger drop is indeed enough to disturb the decompression: with the supply voltage setpoint raised to 1.38V, the bigger drop is followed by 500ms of the usual 30mV drop. My interpretation: relocation takes 150ms, followed by 500ms of decompression, but the power supply is not stiff enough for relocation/decompression.
After soldering 1000uF electrolytic capacitors to the 1.25V and 1.35V rails, both relocation and decompression complete (according to the scope trace, i.e., the 150ms and 500ms voltage drops are visible) with the two rails at 1.35V, 1.30V, 1.25V, 1.20V, 1.15V, but not below that. Restoring the supply setpoint to 1.35V, we see that the relocation and decompression complete as expected.
In order to avoid wasting time with relocation, we will from now on load the kernel to 0xC2000000 and the device tree to 0xC4000000. The scope trace of the 1.35V rail shows a small voltage drop for 500ms (decompression).
It’s not reassuring that we get zero console output during decompression. Trying
to get at least some output, I added CONFIG_DEBUG_LL=y to the .config file
and accepted most of the default options suggested by make:
Kernel low-level debugging functions (read help!) (DEBUG_LL) [Y/n/?] y
Kernel low-level debugging port
> 1. Use STM32MP1 UART for low-level debug (STM32MP1_DEBUG_UART) (NEW)
2. Kernel low-level debugging via EmbeddedICE DCC channel (DEBUG_ICEDCC) (NEW)
3. Kernel low-level debug output via semihosting I/O (DEBUG_SEMIHOSTING) (NEW)
4. Kernel low-level debugging via 8250 UART (DEBUG_LL_UART_8250) (NEW)
5. Kernel low-level debugging via ARM Ltd PL01x Primecell UART (DEBUG_LL_UART_PL01X) (NEW)
choice[1-5?]:
Enable flow control (CTS) for the debug UART (DEBUG_UART_FLOW_CONTROL) [N/y/?] (NEW)
Physical base address of debug UART (DEBUG_UART_PHYS) [0x40010000] (NEW)
Virtual base address of debug UART (DEBUG_UART_VIRT) [0xfe010000] (NEW)
Early printk (EARLY_PRINTK) [N/y/?] (NEW) y
Write the current PID to the CONTEXTIDR register (PID_IN_CONTEXTIDR) [N/y/?] n
However, no output appeared on the UART. Loading Image (rather than zImage)
produces the early prints, but the decompression hang mystery persists.
Note: follow along this section with the help of linusw’s article, “How the
ARM32 Linux kernel
decompresses”.
Let’s try to follow along the decompression using a J-Link debug probe. First, open the GDB server and connect to it:
JLinkGDBServer.exe -device STM32MP135F -if swd -port 2330
arm-none-eabi-gdb.exe -q -x load.gdb
Where the load.gdb script contains:
file build/main.elf
add-symbol-file build/compressed 0xc2000000
target remote localhost:2330
monitor reset
monitor flash device=STM32MP135F
load build/main.elf
monitor go
break handoff.S:93
Step by instruction (si) a few times until just past the handoff code:
(gdb) bt
#0 0xc2000004 in _text () at arch/arm/boot/compressed/head.S:202
This shows that execution has begun at the beginning of the decompressor, in
file arch/arm/boot/compressed/head.S, in the start: label. We can step
through the code lines (n command in gdb) until reaching the line bne not_angel, which we have to step into (si):
(gdb) si
not_angel () at arch/arm/boot/compressed/head.S:245
245 safe_svcmode_maskall r0
Go forward (n) a few steps till reaching the C function
fdt_check_mem_start() (arch/arm/boot/compressed/fdt_check_mem_start.c), then
call finish to get out of it and continue stepping through the not_angel
section:
(gdb) finish
Run till exit from #0 fdt_check_mem_start (mem_start=1, fdt=0xc4000000) at
arch/arm/boot/compressed/fdt_check_mem_start.c:106
not_angel () at arch/arm/boot/compressed/head.S:312
312 add r4, r0, #TEXT_OFFSET
Value returned is $3 = 3221225472
(gdb) n
323 mov r0, pc
324 cmp r0, r4
325 ldrcc r0, .Lheadroom
326 addcc r0, r0, pc
327 cmpcc r4, r0
328 orrcc r4, r4, #1 @ remember we skipped cache_on
329 blcs cache_on
Step into cache_on and later call_cache_fn, and go through the many lines
till reaching the return from __armv7_mmu_cache_on:. Thus we reach the
restart: section:
(gdb) b 902
Breakpoint 3 at 0xc200055c: file arch/arm/boot/compressed/head.S, line 902.
(gdb) c
Continuing.
Breakpoint 3, __armv7_mmu_cache_on () at arch/arm/boot/compressed/head.S:902
902 mcr p15, 0, r0, c7, c5, 4 @ ISB
(gdb) n
903 mov pc, r12
(gdb) si
restart () at arch/arm/boot/compressed/head.S:331
331 restart: adr r0, LC1
Continue stepping through until reaching the wont_overwrite: section, and
then not_relocated:, where we clear BSS. Step through that, and we reach the
beginning of the decompression proper: the decompress_kernel() function in
arch/arm/boot/compressed/misc.c. Interestingly, we step right past the
putstr("Uncompressing Linux..."); line without seeing anything printed on the
UART console.
The function decompress_kernel() calls do_decompress(), which calls
__decompress which calls __gunzip. Calling finish on the latter exactly
correlates with the 500ms of the voltage drop observed on the 1.35V supply as
mentioned above. Now we’re back in the decompress_kernel() function, which
should print " done, booting the kernel.\n" (but doesn’t, since there’s
something wrong with my putstr function).
We return back to the not_relocated: section of the compressed head.S and
call get_inflated_image_size to find out how large the decompressed kernel
is:
not_relocated () at arch/arm/boot/compressed/head.S:636
636 get_inflated_image_size r1, r2, r3
638 mov r0, r4 @ start of inflated image
639 add r1, r1, r0 @ end of inflated image
(gdb) p/x $r0
$3 = 0xc0008000
(gdb) p/x $r1
$4 = 0xc1241f48
(gdb)
Subtracting the r1 and r0 values, we see that the uncompressed kernel is
exactly 19111752 bytes in size, which is identical to the size of the
arch/arm/boot/Image file. So far so good!
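The subtraction above is easy to get wrong by hand, so here is a minimal C sketch of the same check. The helper name `region_size` is made up for illustration; only the two addresses and the expected size come from the GDB session and the Image file.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the half-open address range [start, end). */
uint32_t region_size(uint32_t start, uint32_t end)
{
    assert(end >= start);   /* the end address must not precede the start */
    return end - start;
}
```

Calling `region_size(0xc0008000u, 0xc1241f48u)` gives 0x1239f48, which is the 19111752 bytes quoted above.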
Next, the startup code cleans the caches, turns them off again, and jumps to
__enter_kernel, just as we could have done directly had we loaded the
uncompressed image into memory with the bootloader. This places the pointer to
the DTB into r2 and passes control to the kernel:
__enter_kernel () at arch/arm/boot/compressed/head.S:1435
1435 mov r0, #0 @ must be 0
1436 mov r1, r7 @ restore architecture number
1437 mov r2, r8 @ restore atags pointer
1438 ARM( mov pc, r4 ) @ call kernel
Just before the jump to the kernel, we can check that the register values make
sense: r0 and r1 are zero, r2 has the DTB address, and r4 holds the address the
decompressed kernel will run from, 0xC0008000 (start of RAM plus TEXT_OFFSET):
(gdb) p $r0
$5 = 0
(gdb) p $r1
$6 = 0
(gdb) p/x $r2
$8 = 0xc4000000
(gdb) p/x $r4
$9 = 0xc0008000
(gdb)
One fateful step and we’re running in the uncompressed kernel proper. Let’s load the symbols from the main kernel ELF file to see what’s going on:
(gdb) si
0xc0008000 in ?? ()
(gdb) add-symbol-file build/vmlinux 0xc0008000
add symbol table from file "build/vmlinux" at
.text_addr = 0xc0008000
Reading symbols from build/vmlinux...
(gdb)
Interesting, just one more step and the debugger stops at some much later point:
(gdb) si
0xc0114620 in perf_swevent_init_hrtimer (event=0xc0008000 <stext>) at kernel/events/core.c:10836
10836 hwc->sample_period = event->attr.sample_period;
(gdb) bt
#0 0xc0114620 in perf_swevent_init_hrtimer (event=0xc0008000 <stext>) at kernel/events/core.c:10836
#1 perf_swevent_init_hrtimer (event=0xc0008000 <stext>) at kernel/events/core.c:10818
#2 cpu_clock_event_init (event=0xc0008000 <stext>) at kernel/events/core.c:10902
#3 0xc271e9f0 in ?? ()
But if we finish running the perf_swevent_init_hrtimer function, then
somehow we end up back in arch/arm/kernel/head.S. Debugging from that point
onwards appears to have gone totally insane!
Let’s start again from scratch. Set a breakpoint at the point where the uncompressed kernel is supposed to begin executing:
(gdb) b *0xc0008000
Breakpoint 6 at 0xc0008000: file arch/arm/kernel/head.S, line 501.
(gdb) c
Continuing.
Breakpoint 6, stext () at arch/arm/kernel/head.S:501
501 mov r0, r0
(gdb) p $pc
$11 = (void (*)()) 0xc0008000 <stext>
This is strange: program counter is in the expected location, but we’re on line
501 into head.S, rather than closer to the beginning of the file. The reason
is that we have incorrectly instructed GDB that the entire vmlinux starts at
0xC0008000, instead of just the first section. We can fix it by clearing the
symbol file, re-loading the symbols at their natural link address, and
verifying everything makes sense:
(gdb) symbol-file
Error in re-setting breakpoint 1: No source file named handoff.S.
No symbol file now.
(gdb) file build/vmlinux
Reading symbols from build/vmlinux...
(gdb) p/x &stext
$15 = 0xc0008000
(gdb) si
__hyp_stub_install () at arch/arm/kernel/hyp-stub.S:73
73 store_primary_cpu_mode r4, r5
(gdb) finish
Run till exit from #0 __hyp_stub_install () at arch/arm/kernel/hyp-stub.S:73
stext () at arch/arm/kernel/head.S:105
105 safe_svcmode_maskall r9
Now we’re simply running through the beginning of the normal kernel start in
section ENTRY(stext) in file arch/arm/kernel/head.S. By single stepping
through the code, we can find the exact section where things go badly wrong:
stext () at arch/arm/kernel/head.S:162
162 badr lr, 1f @ return (PIC) address
167 mov r8, r4 @ set TTBR1 to swapper_pg_dir
169 ldr r12, [r10, #PROCINFO_INITFUNC]
170 add r12, r12, r10
171 ret r12
__v7_ca7mp_setup () at arch/arm/mm/proc-v7.S:302
302 do_invalidate_l1
0xc01197fc 302 do_invalidate_l1
0xc0119800 302 do_invalidate_l1
0xc0119804 302 do_invalidate_l1
v7_invalidate_l1 () at arch/arm/mm/cache-v7.S:40
40 mov r0, #0
41 mcr p15, 2, r0, c0, c0, 0 @ select L1 data cache in CSSELR
(gdb)
0x2fff2f08 in ?? ()
We see that after the last mcr instruction, the code ends up in SYSRAM
instead of the DDR, from where we’ve been executing so far. That address
corresponds to the vectors installed by the bootloader; in particular, we have
landed in the dummy SVC handler.
Let’s examine the program instructions at the point just before where the failure occurs:
Breakpoint 7, v7_invalidate_l1 () at arch/arm/mm/cache-v7.S:40
40 mov r0, #0
(gdb) x/4x $pc
0xc0118b2c <v7_invalidate_l1>: 0xe3a00000 0x2f400f10 0xffffffff 0xee300f10
Very interesting! The expected instruction, 0xe3a00000, is followed by
0x2f400f10 and 0xffffffff. The first one is the “mystery” SVC call, and the second one
is simply undefined:
(gdb) set {int}0xc0000000 = 0x2f400f10
(gdb) x/i 0xc0000000
0xc0000000: svccs 0x00400f10
(gdb) set {int}0xc0000000 = 0xffffffff
(gdb) x/i 0xc0000000
0xc0000000: @ <UNDEFINED> instruction: 0xffffffff
For comparison, here’s the instructions we expect to find from the disassembly of the ELF file:
$ arm-linux-gnueabi-objdump -d linux/vmlinux | grep -A 4 "v7_invalidate_l1"
c0118b2c <v7_invalidate_l1>:
c0118b2c: e3a00000 mov r0, #0
c0118b30: ee400f10 mcr 15, 2, r0, cr0, cr0, {0}
c0118b34: f57ff06f isb sy
c0118b38: ee300f10 mrc 15, 1, r0, cr0, cr0, {0}
Let’s compare the binary pattern between the expected and actual instructions:
Expected: 0xee400f10 = 0b11101110010000000000111100010000
Actual:   0x2f400f10 = 0b00101111010000000000111100010000
---------------------------------------------------------
Diff:                    ^^     ^
Three bits have been flipped in this instruction, changing it from mcr to
svc. This could be explained if DDR is miswired or misconfigured. However,
the pattern of data corruption is repeatable: reboot after reboot, the same
instruction gets corrupted in exactly the same way!
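To double-check the bit count, the comparison can be done with an XOR rather than by eye. This is a small host-side C sketch; the function names are invented for illustration, and the two instruction words are the ones from the listing above. It also reports which bytes of the word differ, which turns out to be only the most significant one.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise difference between the expected and the observed word. */
uint32_t diff_mask(uint32_t expected, uint32_t actual)
{
    return expected ^ actual;
}

/* Number of set bits, counted portably without compiler builtins. */
unsigned count_bits(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Bit i of the result is set if byte i (0 = least significant) differs. */
unsigned bytes_changed(uint32_t expected, uint32_t actual)
{
    uint32_t d = expected ^ actual;
    unsigned lanes = 0;
    for (unsigned i = 0; i < 4; i++)
        if ((d >> (8 * i)) & 0xffu)
            lanes |= 1u << i;
    return lanes;
}

/* Worked example with the mcr word from vmlinux and the word found in DDR. */
void check_mcr_word(void)
{
    assert(diff_mask(0xee400f10u, 0x2f400f10u) == 0xc1000000u);
    assert(count_bits(0xc1000000u) == 3);
    assert(bytes_changed(0xee400f10u, 0x2f400f10u) == 0x8u);
}
```

The XOR is 0xc1000000: bits 31, 30 and 24 differ, all of them in the top byte (0xee became 0x2f), while the lower three bytes are intact.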
To prove that the DDR is capable of holding data at this address, we can write it manually and step through the instructions without any weird jumps to vectors:
(gdb) x/4x $pc
0xc0118b2c <v7_invalidate_l1>: 0xe3a00000 0x2f400f10 0xffffffff 0xee300f10
(gdb) set {int}0xc0118b30 = 0xee400f10
(gdb) set {int}0xc0118b34 = 0xf57ff06f
(gdb) x/4x $pc
0xc0118b2c <v7_invalidate_l1>: 0xe3a00000 0xee400f10 0xf57ff06f 0xee300f10
(gdb) si
41 mcr p15, 2, r0, c0, c0, 0 @ select L1 data cache in CSSELR
42 isb
43 mrc p15, 1, r0, c0, c0, 0 @ read cache geometry from CCSIDR
45 movw r3, #0x3ff
We can also load and run the decompressor as usual and set a breakpoint at 0xC0008000, where the uncompressed kernel is supposed to take over. Then, from gdb, we simply overwrite whatever the decompressor has written:
(gdb) restore build/Image binary 0xc0008000
Restoring binary file build/Image into memory (0xc0008000 to 0xc1241f48)
(gdb) c
Nothing has been printed to the console, since apparently the decompressor disabled the console, but if we stop the debugger (Ctrl-C), we see that the kernel proceeded with the boot and finally came to a stop when mounting the root filesystem (understandable, since we haven’t given it a rootfs yet):
(gdb) bt
#0 0xc0b87034 in __timer_delay (cycles=63999) at arch/arm/lib/delay.c:50
#1 0xc0bb2238 in panic (fmt=0xc0defa0c "VFS: Unable to mount root fs on %s") at kernel/panic.c:451
#2 0xc1001878 in mount_block_root (name=0x51 <error: Cannot access memory at address 0x51>, name@entry=0xc0defaa0 "/dev/root", flags=3900) at init/do_mounts.c:432
#3 0xc1001b50 in mount_root () at init/do_mounts.c:592
#4 0xc1001cc8 in prepare_namespace () at init/do_mounts.c:644
#5 0xc1001448 in kernel_init_freeable () at init/main.c:1644
#6 0xc0bc5f18 in kernel_init (unused=<optimized out>) at init/main.c:1519
#7 0xc0100148 in ret_from_fork () at arch/arm/kernel/entry-common.S:148
Let’s assume that the data corruption is deterministic (repeatable) because it is caused by a voltage drop. Since the voltage drop corresponds to the CPU/DDR activity, the same activity causes the same voltage drop, which causes the same corruption.
Let’s check the same instruction at different supply voltages. At 1.35V, 1.30V, 1.25V, the corruption is:
0xc0118b2c <v7_invalidate_l1>: 0xe3a00000 0x2f400f10 0x00000000 0xee300f10
At 1.20V, the pattern is more interesting: the third instruction gets corrupted each time, but differently each reset:
0xc0118b2c <v7_invalidate_l1>: 0xe3a00000 0x2f400f10 0xe464f8f6 0xee300f10
# or this one:
0xc0118b2c <v7_invalidate_l1>: 0xe3a00000 0x2f400f10 0xcbfd2cb6 0xee300f10
# or this one:
0xc0118b2c <v7_invalidate_l1>: 0xe3a00000 0x2f400f10 0xaefc67e9 0xee300f10
Even more strange: restoring voltage back up to 1.35V, the third instruction now gets corrupted differently every time, while the first and last are always correct, and the second one is always corrupted the same way.
One obvious way that data corruption could happen is if the compressed
zImage was written incorrectly to the SD card, or if the bootloader copies it
to DDR incorrectly. First, we check how big the zImage is, and then ask the
debugger to dump the data from the DDR to a file, at the point just before the
handoff from the bootloader into the decompressor:
$ ls -l linux/arch/arm/boot/zImage
-rwxr-xr-x 1 jk jk 7461288 Jan 7 11:09 linux/arch/arm/boot/zImage
Breakpoint 1, handoff_jump () at src/handoff.S:93
93 smc #0
(gdb) dump binary memory dump.bin 0xC2000000 0xC271d9a8
We see that the original image is identical to the one we obtained from the dump, so the SD card and bootloader writes are not corrupted:
9040ec8b8da5e613aa6e56060cc0cacf6779eec670c3a4123177cd07aff63300 zImage
9040ec8b8da5e613aa6e56060cc0cacf6779eec670c3a4123177cd07aff63300 dump.bin
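The end address given to `dump binary memory` is simply the load address plus the zImage size reported by `ls`. A minimal C sketch of that calculation follows; the helper name `image_end` is an assumption for illustration, while the numbers are the ones shown above.

```c
#include <assert.h>
#include <stdint.h>

/* End address (exclusive) of an image of `size` bytes loaded at `base`. */
uint32_t image_end(uint32_t base, uint32_t size)
{
    assert(size <= UINT32_MAX - base);   /* must not wrap the 32-bit address space */
    return base + size;
}
```

With the zImage loaded at 0xC2000000 and 7461288 bytes long, `image_end()` returns 0xC271D9A8, matching the dump command.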
ST provides a utility which they recommend to run as a part of any new PCB bring-up. I have done that already and did not think much of it since all tests passed. Let’s take a closer look.
My “version” of the utility can be found in
this
repository. I made two small changes. First, instead of requiring the
complicated “Cube” software suite, there is a simple Makefile so that the whole
utility can be compiled easily with a single make invocation. Second, I have
commented out the three or so lines that initialize the STPMIC1, since my board
does not use that power controller.
Let’s load the utility through the debugger, since it is running already:
(gdb) file build/fwutil.elf
Reading symbols from build/fwutil.elf...
(gdb) load
Loading section .RESET, size 0xe000 lma 0x2ffe0000
Loading section .ARM, size 0x8 lma 0x2ffee000
Loading section .init_array, size 0x4 lma 0x2ffee008
Loading section .fini_array, size 0x4 lma 0x2ffee00c
Loading section .data, size 0x7fa lma 0x2ffee010
Start address 0x2ffe0000, load size 59402
Transfer rate: 260 KB/sec, 7425 bytes/write.
(gdb) c
Continuing.
On the serial console, we are greeted with the expected prompt:
=============== UTILITIES-DDR Tool ===============
Model: STM32MP13XX_DK
RAM: DDR3-1066 bin F 1x4Gb 533MHz v1.53
0:DDR_RESET
DDR>
As the utility readme instructs us, let us enter the DDR_READY step and then
execute all the tests:
DDR>step 3
step to 3:DDR_READY
1:DDR_CTRL_INIT_DONE
2:DDR_PHY_INIT_DONE
3:DDR_READY
DDR>test 0
result 1:Test Simple DataBus = Passed
result 2:Test DataBusWalking0 = Passed
result 3:Test DataBusWalking1 = Passed
result 4:Test AddressBus = Passed
result 5:Test MemDevice = Passed
result 6:Test SimultaneousSwitchingOutput = Passed
result 7:Test Noise = Passed
result 8:Test NoiseBurst = Passed
result 9:Test Random = Passed
result 10:Test FrequencySelectivePattern = Passed
result 11:Test BlockSequential = Passed
result 12:Test Checkerboard = Passed
result 13:Test BitSpread = Passed
result 14:Test BitFlip = Passed
result 15:Test WalkingZeroes = Passed
result 16:Test WalkingOnes = Passed
Result: Pass [Test All]
This takes about a second to complete, and on the scope trace monitoring the 1.35V supply we see a tiny (maybe 2-5mV) dip during this time.
After all the tests are done, we can use the save command to get the DDR
parameters from the utility. Here are the dynamic ones, reporting on the
status:
/* ctl.dyn */
#define DDR_STAT 0x00000001
#define DDR_INIT0 0x4002004e
#define DDR_DFIMISC 0x00000001
#define DDR_DFISTAT 0x00000001
#define DDR_SWCTL 0x00000001
#define DDR_SWSTAT 0x00000001
#define DDR_PCTRL_0 0x00000001
/* phy.dyn */
#define DDR_PIR 0x00000000
#define DDR_PGSR 0x0000001f
#define DDR_ZQ0SR0 0x80021dee
#define DDR_ZQ0SR1 0x00000000
#define DDR_DX0GSR0 0x00008001
#define DDR_DX0GSR1 0x00000000
#define DDR_DX0DLLCR 0x40000000
#define DDR_DX0DQTR 0xffffffff
#define DDR_DX0DQSTR 0x3db02001
#define DDR_DX1GSR0 0x00008001
#define DDR_DX1GSR1 0x00000000
#define DDR_DX1DLLCR 0x40000000
#define DDR_DX1DQTR 0xffffffff
#define DDR_DX1DQSTR 0x3db02001
All the other parameters returned from the utility are identical to the values already used in the bootloader. Thus, I hope I can assume that the DDR configuration in the bootloader is identical to the one used by the utility.
Above we have found that while decompression appears to finish successfully, it in fact leaves behind lots of partially corrupted data. The uncompressed kernel starts executing, only to trip into the SVC handler because of a corrupted instruction. Now, let’s try to track down exactly when the data first gets corrupted.
As seen above, in the current configuration, decompression takes place in the
__gunzip routine (decompress_inflate.c). The decompression is done by
zlib_inflate() (lib/zlib_inflate/inflate.c). First, clear the memory
location that we’re interested in observing:
set {unsigned int}0xc0118b2c = 0x0
set {unsigned int}0xc0118b30 = 0x0
set {unsigned int}0xc0118b34 = 0x0
set {unsigned int}0xc0118b38 = 0x0
Verify it has been cleared:
(gdb) x/4x 0xc0118b2c
0xc0118b2c: 0x00000000 0x00000000 0x00000000 0x00000000
Some interesting breakpoints:
(gdb) b *0xc2001878
Breakpoint 20 at 0xc2001878: file arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c, line 63.
(gdb) b *0xc2001fa4
Breakpoint 34 at 0xc2001fa4: file arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c, line 582.
As it turns out, the corruption appears after the second call to inflate_fast:
(gdb) c
Continuing.
Breakpoint 36, zlib_inflate (strm=0xc271ea44, strm@entry=0xc271e9c0, flush=1072676126, flush@entry=0) at arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c:582
582 inflate_fast(strm, out);
(gdb) x/4x 0xc0118b2c
0xc0118b2c: 0x00000000 0x00000000 0x00000000 0x00000000
(gdb) c
Continuing.
Breakpoint 36, zlib_inflate (strm=0xc271ea44, strm@entry=0xc271e9c0, flush=1072590367, flush@entry=0) at arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c:582
582 inflate_fast(strm, out);
(gdb) x/4x 0xc0118b2c
0xc0118b2c: 0xe3a00000 0x2f400f10 0xffedecfd 0xee300f10
While inflate_fast() runs after we press c (or continue) in GDB, a voltage
drop of about 30–40mV is observed very briefly (about 3.5ms) on the 1.35V
supply. In the same period, droops on VREF_DDR0, VREF_DDR1, and VREF_DDR2
are barely perceptible.
We can go a step further and set a watchpoint, so the debugger stops the first time the given memory location is written:
(gdb) watch *(uint32_t *)0xc0118b2c
Hardware watchpoint 38: *(uint32_t *)0xc0118b2c
Set the memory locations to zero as before, and after the watchpoint triggers, single step through the execution and each time check the memory. Skipping ahead many such steps, we see how the value gets progressively filled in:
0xc0118b2c: 0xe3a00000 0x00000000 0x00000000 0x00000000
0xc0118b2c: 0xe3a00000 0x00000010 0x00000000 0x00000000
0xc0118b2c: 0xe3a00000 0x00000f10 0x00000000 0x00000000
0xc0118b2c: 0xe3a00000 0x00400f10 0x00000000 0x00000000
0xc0118b2c: 0xe3a00000 0x2f400f10 0x00000000 0x00000000
We see how the word fills up one byte at a time, starting from zero: 10, 0f,
40, and finally 2f. That final 2f is erroneous; it should be ee, as we have
seen previously in the disassembly of vmlinux.
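The progression above is what one expects from four single-byte stores into a little-endian word, lowest address first. Here is a small C sketch of that bookkeeping; `store_byte_le` is a hypothetical helper, and little-endian byte order is assumed (which is what the GDB output suggests).

```c
#include <assert.h>
#include <stdint.h>

/* Place `value` into byte `index` (0 = lowest address) of a little-endian word. */
uint32_t store_byte_le(uint32_t word, unsigned index, uint8_t value)
{
    assert(index < 4);
    uint32_t shift = 8u * index;
    return (word & ~(0xffu << shift)) | ((uint32_t)value << shift);
}
```

Storing 0x10, 0x0f and 0x40 into bytes 0 to 2 reproduces the intermediate values seen in the debugger. The last store, into byte 3, should have produced 0xee400f10; on the board it produced 0x2f400f10 instead.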
The code loop that populates this word can be found in
lib/zlib_inflate/inffast.c, lines 119 through 308; in particular, the line
that wrote the incorrect 2f is number 247, in the middle of this section:
/* Align out addr */
if (!((long)(out - 1) & 1)) {
*out++ = *from++;
len--;
}
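The snippet above appears to be an alignment step ahead of a wider copy, which means the decompressor issues stores narrower than 32 bits. The following is a simplified C sketch of that idea, not the kernel's code: the function `copy_aligned16` and the `store_counts` struct are invented here, and the sketch tests the destination address directly. It copies a run of bytes and counts how many 8-bit and 16-bit stores it needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct store_counts {
    size_t byte_stores;      /* 8-bit writes */
    size_t halfword_stores;  /* 16-bit writes */
};

/* Copy len bytes, using 16-bit units wherever the destination is 2-byte aligned. */
struct store_counts copy_aligned16(uint8_t *out, const uint8_t *from, size_t len)
{
    struct store_counts c = {0, 0};

    assert(out != NULL && from != NULL);

    /* Leading single byte if the destination starts on an odd address. */
    if (len > 0 && ((uintptr_t)out & 1u)) {
        *out++ = *from++;
        len--;
        c.byte_stores++;
    }
    /* Bulk of the copy in 16-bit units (memcpy sidesteps aliasing issues). */
    while (len >= 2) {
        memcpy(out, from, 2);
        out += 2;
        from += 2;
        len -= 2;
        c.halfword_stores++;
    }
    /* Trailing single byte, if any. */
    if (len == 1) {
        *out = *from;
        c.byte_stores++;
    }
    return c;
}
```

Copying six bytes to an odd address takes two byte stores and two halfword stores in this model; none of them is a full 32-bit write. A memory test that only ever writes whole words would never exercise this path.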
Let’s recap the situation so far. DDR appears to work as far as my own tests are concerned: I can fill the memory with pseudo-random data and read it all back correctly. The STM32DDRFW-UTIL tests all pass. The kernel runs if it’s loaded into memory uncompressed, but the decompression fails. Remembering further back, when writing the bootloader I had to force all DDR writes to be 32-bit aligned. All of this brings to mind the quote from Jay Carlson:
if your design doesn’t work, length-tuning is probably the last thing you should be looking at. For starters, make sure you have all the pins connected properly — even if the failures appear intermittent. For example, accidentally swapping byte lane strobes / masks (like I’ve done) will cause 8-bit operations to fail without affecting 32-bit operations. Since the bulk of RAM accesses are 32-bit, things will appear to kinda-sorta work.
Let’s take a good hard look at the connections on my custom board (Rev
A)
between the memory chip (MT41K256M16TW-107:P TR) and the SoC
(STM32MP135FAE):
| DDR pin | DDR signal | SoC signal | SoC pin | Notes |
|---|---|---|---|---|
| M2 | BA0 | BA0 | G17 | |
| N8 | BA1 | BA1 | L16 | |
| M3 | BA2 | BA2 | G13 | |
| N3 | A0 | A0 | G16 | |
| P7 | A1 | A1 | K15 | |
| P3 | A2 | A2 | F17 | |
| N2 | A3 | A3 | G15 | |
| P8 | A4 | A4 | M14 | |
| P2 | A5 | A5 | E16 | |
| R8 | A6 | A6 | M17 | |
| R2 | A7 | A7 | G14 | |
| T8 | A8 | A8 | L15 | |
| R3 | A9 | A9 | F16 | |
| L7 | A10/AP | A10 | J14 | |
| R7 | A11 | A11 | K13 | |
| N7 | A12/BC# | A12 | K17 | |
| T3 | A13 | A13 | F14 | |
| T7 | A14 | A14 | L17 | |
| D3 | UDM | DQM0 | D15 | |
| E7 | LDM | DQM1 | N14 | |
| B7 | UDQS# | DQS0N | C16 | |
| C7 | UDQS | DQS0P | C17 | |
| G3 | LDQS# | DQS1N | R16 | |
| F3 | LDQS | DQS1P | R17 | |
| E3 | DQ0 | DQ4 | B16 | |
| F7 | DQ1 | DQ2 | C13 | |
| F2 | DQ2 | DQ0 | B17 | |
| F8 | DQ3 | DQ5 | D16 | |
| H3 | DQ4 | DQ3 | D17 | |
| H8 | DQ5 | DQ7 | E15 | |
| G2 | DQ6 | DQ1 | C15 | |
| H7 | DQ7 | DQ6 | E14 | |
| D7 | DQ8 | DQ8 | N16 | |
| C3 | DQ9 | DQ9 | P17 | |
| C8 | DQ10 | DQ10 | N15 | |
| C2 | DQ11 | DQ15 | T16 | |
| A7 | DQ12 | DQ11 | P15 | |
| A2 | DQ13 | DQ12 | R15 | |
| B8 | DQ14 | DQ13 | P16 | |
| A3 | DQ15 | DQ14 | T17 | |
| K3 | CASN | CASN | J15 | |
| K9 | CKE | CKE | K14 | 10k pulldown |
| K7 | CK# | CLKN | J17 | 100R to CK at DDR |
| J7 | CK | CLKP | J16 | |
| L2 | CS# | CSN | H16 | |
| K1 | ODT | ODT | H15 | |
| J3 | RAS# | RASN | H17 | |
| T2 | RESET# | RESETN | E17 | 10k pulldown |
| L3 | WE# | WEN | H13 | |
Let’s check carefully what the DDR datasheet considers “upper” vs “lower”:
DQ[7:0]Lower byte of bidirectional data bus for the x16 configuration.
DQ[15:8]Upper byte of bidirectional data bus for the x16 configuration.
In other words, we should have mapped DQ[7:0] together with the DDR signals
LDM and LDQS, while the upper byte DQ[15:8] should have been placed
together with UDM and UDQS. Looking at the table above, we see that the
mask/strobe signals are swapped:
DDR:UDM → SoC:DQM0
DDR:LDM → SoC:DQM1
But the data bits are not swapped, so this is incorrect:
DDR:DQ[7:0] → SoC:DQ[7:0] (scrambled)
DDR:DQ[15:8] → SoC:DQ[15:8] (scrambled)
My confusion can be traced back to the eval board design, which similarly swaps
the mask/strobe wires, except that it also (correctly) swaps the two DQ lanes.
AI seems to be of little use: I can easily convince it either way regarding the
correctness of my “semi-byte swap”.
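To see why swapped masks break narrow writes but leave full-width writes alone, here is a toy C model of a single 16-bit write beat. It is only a sketch under strong assumptions: it ignores the swapped strobes, all timing, and the bit scrambling within each lane, and the names `dram_word` and `dram_write` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-bit memory location, split into the DRAM's two byte lanes. */
struct dram_word {
    uint8_t lower;   /* DRAM DQ[7:0], gated by LDM */
    uint8_t upper;   /* DRAM DQ[15:8], gated by UDM */
};

/*
 * Model one write beat. `bus` is what the controller drives on its DQ[15:0];
 * write_lane0 / write_lane1 say which controller lanes are unmasked.
 * With masks_swapped, controller DQM0 reaches UDM and DQM1 reaches LDM,
 * while the data lanes stay straight (the Rev A wiring, as modelled here).
 */
void dram_write(struct dram_word *w, uint16_t bus,
                bool write_lane0, bool write_lane1, bool masks_swapped)
{
    assert(w != NULL);

    bool write_lower = masks_swapped ? write_lane1 : write_lane0;
    bool write_upper = masks_swapped ? write_lane0 : write_lane1;

    if (write_lower)
        w->lower = (uint8_t)(bus & 0xffu);
    if (write_upper)
        w->upper = (uint8_t)(bus >> 8);
}
```

When both lanes are written, the swap makes no difference. When only lane 0 is written, the model leaves the intended byte untouched and overwrites its neighbour with whatever happened to be on the other half of the bus, which is at least consistent with a single wrong byte showing up in an otherwise correct word.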
We saw above that the official ST DDR utility did not detect any problems with my incorrectly-wired DDR. After some prompting, Gemini 3 gave me the following test:
void ddr_align_test(int argc, uint32_t arg1, uint32_t arg2, uint32_t arg3)
{
(void)argc; (void)arg1; (void)arg2; (void)arg3;
uint32_t sctlr;
// 1. READ SCTLR
__asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r" (sctlr));
// 2. DISABLE CACHE (Bit 2) AND MMU (Bit 0)
uint32_t sctlr_disabled = sctlr & ~((1 << 2) | (1 << 0));
__asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r" (sctlr_disabled));
__asm__ volatile("isb sy"); // Instruction sync barrier
my_printf("!!! CACHE DISABLED !!! Testing raw hardware wires...\r\n");
volatile uint8_t *p8 = (volatile uint8_t *)0xc0001000;
// Perform a partial write
p8[0] = 0xAA;
__asm__ volatile("dsb sy"); // Force pin toggle
if (p8[0] != 0xAA) {
my_printf("FAILURE DETECTED: Byte 0 is 0x%02x (expected 0xAA)\r\n", p8[0]);
} else {
my_printf("SUCCESS: Byte 0 worked without cache.\r\n");
}
// 3. RE-ENABLE CACHE
__asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r" (sctlr));
__asm__ volatile("isb sy");
}
On the evaluation board, the printout is:
Eval board: !!! CACHE DISABLED !!! Testing raw hardware wires...
SUCCESS: Byte 0 worked without cache.
On my board:
!!! CACHE DISABLED !!! Testing raw hardware wires...
FAILURE DETECTED: Byte 0 is 0x55 (expected 0xAA)
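The test above writes a single byte. A slightly broader variant, sketched below in C, writes a known 32-bit background pattern, overwrites each of the four bytes in turn, and checks with a full-width read that only the targeted byte changed. The function name `byte_write_test` and the pattern values are my own choices; on the target it would still need the cache disabled (or an uncached mapping) and suitable barriers, as in the test above, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Write a known 32-bit pattern, then overwrite one byte at a time and check
 * that only that byte changed. Returns the number of mismatching bytes seen.
 */
unsigned byte_write_test(volatile uint8_t *base)
{
    static const uint8_t fill[4] = {0x11, 0x22, 0x33, 0x44};
    volatile uint32_t *word = (volatile uint32_t *)base;
    unsigned errors = 0;

    assert(((uintptr_t)base & 3u) == 0);   /* expects a word-aligned address */

    for (unsigned target = 0; target < 4; target++) {
        uint32_t pattern, readback;
        uint8_t got[4];

        memcpy(&pattern, fill, sizeof pattern);
        *word = pattern;        /* full-width write of the background */
        base[target] = 0xA5;    /* narrow write under test */
        readback = *word;       /* full-width read */
        memcpy(got, &readback, sizeof got);

        for (unsigned i = 0; i < 4; i++) {
            uint8_t want = (i == target) ? 0xA5 : fill[i];
            if (got[i] != want)
                errors++;
        }
    }
    return errors;
}
```

On working memory the function returns 0. Because the background is laid down with a word write and verified with a word read, any error it reports comes from the single-byte store in between.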
While the explanation in the previous section (swapped byte lanes) seems plausible enough to stop debugging at this point and wait for “Rev B”, in the process I noted other possible avenues to explore:
Just because we found one issue with my connections, it does not mean we have found all of them. From the same article by Jay Carlson:
Because DDR memory doesn’t care about the order of the bits getting stored, you can swap individual bits — except the least-significant one if you’re using write-leveling — in each byte lane with no issues.
I have not been able to find any evidence of the LSB swapping restriction in ST literature (datasheet, reference manual, app notes). Indeed, one app note[1] just says that the DDR3L connection features “two swappable bytes, and swappable bits in the same byte”.
However, the MT41K DDR3L datasheet includes a section on Write Leveling which
explains what’s up:
For better signal integrity, DDR3 SDRAM memory modules have adopted fly-by topology for the commands, addresses, control signals, and clocks. Write leveling is a scheme for the memory controller to adjust or de-skew the DQS strobe (DQS, DQS#) to CK relationship at the DRAM with a simple feedback feature provided by the DRAM. Write leveling is generally used as part of the initialization process, if required. For normal DRAM operation, this feature must be disabled. […]
When write leveling is enabled, the rising edge of DQS samples CK, and the prime DQ outputs the sampled CK’s status. The prime DQ for a x4 or x8 configuration is DQ0 with all other DQ (DQ[7:1]) driving LOW. The prime DQ for a x16 configuration is DQ0 for the lower byte and DQ8 for the upper byte.
So, just in case, we should make sure not to “swizzle” the LSB of either byte lane (DQ0 and DQ8).
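A wiring table can be checked for both rules mechanically. The C sketch below uses hypothetical helper names, with `map[i]` holding the SoC DQ number wired to DRAM DQ i as read off the connection table. Whether the prime-DQ rule actually matters on the STM32MP13x is the open question here, so the second check is only a "just in case" flag.

```c
#include <assert.h>
#include <stdbool.h>

/* True if every DRAM data bit stays within its own byte lane on the SoC side. */
bool bits_stay_in_lane(const int map[16])
{
    for (int i = 0; i < 16; i++) {
        assert(map[i] >= 0 && map[i] < 16);
        if (map[i] / 8 != i / 8)
            return false;
    }
    return true;
}

/* True if the prime DQ of the lane (DQ0 or DQ8) is wired straight through. */
bool prime_dq_straight(const int map[16], int lane)
{
    assert(lane == 0 || lane == 1);
    return map[8 * lane] == 8 * lane;
}
```

Fed with the Rev A table (`{4, 2, 0, 5, 3, 7, 1, 6, 8, 9, 10, 15, 11, 12, 13, 14}`), all bits stay within their lane, DQ8 is wired straight, and DRAM DQ0 lands on SoC DQ4.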
[1] Application note AN5692: DDR memory routing guidelines for STM32MP13x product lines. January 2023.