Another day of debugging

By Toshihito Kikuchi

One of the key components in the Kinesis ecosystem is Dynamo, an agent program that promotes a computer to a computing node in the network. Recently, I made a patch to enable Dynamo to run on ARM64 devices. We usually use Ubuntu as our Linux distro for everyday use. On this day, however, I was in the mood to try something new and chose Amazon Linux on AWS. The story began there.

Dynamo got stuck

On a node, we run Dynamo as a systemd service. When I started the Dynamo service on an Amazon Linux node, the process started, but I realized some important events were not logged. It seemed that the process was stuck in the middle of the startup phase.

The first triage step is to run the exact same command manually, and when I ran the command in a shell, it just worked normally. Here are the outputs: the former is from systemd; the latter is from a manual run.

#### systemd

[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ sudo systemctl start dynamo
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ sudo journalctl -u dynamo -f
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: dynamo.service: Deactivated successfully.
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: Stopped dynamo.service - "Dynamo node service".
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: dynamo.service: Consumed 42.170s CPU time.
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal systemd[1]: Started dynamo.service - "Dynamo node service".
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded your wallet" addr=0xa5e07b0a3944dd9158a8a72a6794004201669468 file=/opt/dynamo/id_ecdsa
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded AppCacheFile" file=/opt/dynamo/app-cache.json
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded a valid certificate" file=/opt/dynamo/backend.crt
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded config" file=/opt/dynamo/config.json
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.182Z level=ERROR msg="Cannot initialize Docker Manager" err="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.183Z level=INFO msg="Serving gRPC" laddr=/tmp/kinesis-dynamo.sock
^C
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ ps -ef | grep node
root         778       2  0 22:38 ?        00:00:00 [xfs-inodegc/nvm]
ec2-user   37562       1  1 23:39 ?        00:00:00 /opt/dynamo/noded -config=/opt/dynamo/config.json
ec2-user   37577   36391  0 23:39 pts/0    00:00:00 grep --color=auto node

#### manual run

[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ /opt/dynamo/noded -config=/opt/dynamo/config.json
time=2025-08-27T23:40:03.620Z level=INFO msg="Loaded your wallet" addr=0xa5e07b0a3944dd9158a8a72a6794004201669468 file=/opt/dynamo/id_ecdsa
time=2025-08-27T23:40:03.620Z level=INFO msg="Loaded AppCacheFile" file=/opt/dynamo/app-cache.json
time=2025-08-27T23:40:03.621Z level=INFO msg="Loaded a valid certificate" file=/opt/dynamo/backend.crt
time=2025-08-27T23:40:03.621Z level=INFO msg="Loaded config" file=/opt/dynamo/config.json
time=2025-08-27T23:40:03.660Z level=ERROR msg="Cannot initialize Docker Manager" err="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time=2025-08-27T23:40:03.661Z level=INFO msg="Serving gRPC" laddr=/tmp/kinesis-dynamo.sock

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
time=2025-08-27T23:40:03.767Z level=ERROR msg="Cannot initialize NVML" result=9
time=2025-08-27T23:40:03.802Z level=INFO msg="Successfully registered with Node Manager" address=0xa5e07b0a3944dd9158a8a72a6794004201669468 ip=""
^C

As you can see in the second output, we use NVML to collect information about NVIDIA GPUs. The warning and the "Cannot initialize NVML" error in the output are expected because this node doesn't have any GPU. The problem is that this warning and error were not logged when Dynamo ran as a systemd service.

Debugging with Delve

Since Dynamo is written in Go, the second triage step is to attach Delve to the systemd-managed process and debug it. I could debug in interactive mode, but if the target process is a daemon, as in this case, I always start with batch mode, i.e., running dlv to print all goroutines and dump them to a file. Why? If you pause the process for too long, systemd may trigger a restart. In general, you should keep your debug target fresh and untouched, like cooking fish. Dumping to a file also helps if the output is too long; you can use your favorite editor to analyze it for as long as you like. Here are the commands to dump goroutines to a file.
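
A rough sketch of what I mean, assuming Delve is installed on the node (the PID lookup and file paths are illustrative):

```
$ PID=$(pgrep -f /opt/dynamo/noded)
$ echo 'goroutines -t' > /tmp/dlv-init.txt
$ sudo dlv attach "$PID" --init /tmp/dlv-init.txt | tee /tmp/goroutines.txt
```

Then quit from the (dlv) prompt, making sure you detach without killing the process.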

I found a goroutine that indeed looks stuck.

As mentioned above, Dynamo uses NVML. Since NVML is a C library (Python bindings are available too), Dynamo implements a stub written in C to consume the NVML APIs and compiles/links it with cgo. The call stack above indicates that the stub code never returns: frame #7 shows that the call to C.NvmlInit() in Dynamo's Go code never returns. Well, there is nothing interesting in the Go code itself.

What does NvmlInit() do? You may be surprised: here's what Dynamo's C code amounts to.
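
A minimal sketch of the stub, assuming the standard nvml.h header (not the verbatim source):

```
/* a minimal sketch of Dynamo's cgo stub, not the verbatim source */
#include <nvml.h>

nvmlReturn_t NvmlInit(void)
{
    return nvmlInit();
}
```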

It does nothing but call the NVML function nvmlInit, which is actually an alias of the versioned function nvmlInit_v2. NVIDIA's reference has some description and notes about this function, but nothing mentions a possible hang.

So what's next? Unfortunately, NVML is closed-source. Call NVIDIA support?

Debugging with GDB

Of course not. We dig deeper with GDB (or you could go with LLDB if you prefer). No source code? No problem.

Let’s restart dynamo and attach gdb to the process.
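
Something like this (the PID lookup is illustrative):

```
$ sudo systemctl restart dynamo
$ sudo gdb -p "$(pgrep -f /opt/dynamo/noded)"
```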

If you want to dump the call stacks of all threads, as we did with dlv earlier, you can run:
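
```
(gdb) thread apply all bt
```

(You can also do this non-interactively with gdb's -batch and -ex options and redirect the output to a file.)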

However, let’s try another approach this time. First, list all threads.
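
In gdb:

```
(gdb) info threads
```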

We can probably ignore the threads sitting in runtime.futex; many are waiting there, so it looks like Go's runtime machinery. This means we're interested in thread #4 or #7. Let's check them one by one. In this case, thread #4 wasn't interesting, and #7 was what we were looking for.

It's in the middle of our stub function NvmlInit, which is calling fgets. Let's confirm whether fgets really never returns. This is an important step because the hang could also be caused by an infinite loop calling fgets repeatedly.
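
Here's a sketch of one way to check it on the stuck process: select the thread, look at its stack, then set a breakpoint on fgets and let the process run. If the code were looping over fgets, the breakpoint would be hit almost immediately.

```
(gdb) thread 7
(gdb) bt
(gdb) break fgets
(gdb) continue
```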

I left it for several seconds and this breakpoint wasn't hit. So Dynamo was stuck because fgets never returned. What kind of data are we reading, and from what? Standard input?

We'll get there, but before that, let's step back a little bit and remember that this happens only when we run Dynamo via systemd. There are two possibilities: if we run Dynamo directly from a shell, either fgets just returns without any problem, or fgets isn't called at all. Let's find out which. I took this approach because debugging a normal process is much easier than debugging a systemd process: we don't have to look up the PID, we don't have to worry about automatic restarts, etc.

I just ran dynamo with gdb and set a breakpoint at nvmlInit_v2.
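
Roughly:

```
$ gdb --args /opt/dynamo/noded -config=/opt/dynamo/config.json
(gdb) break nvmlInit_v2
(gdb) run
```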

So far so good. Next, I set a breakpoint on fgets and continued.
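
That is:

```
(gdb) break fgets
(gdb) continue
...
(gdb) bt
```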

The breakpoint got hit nicely, but wait, did you notice the difference?

It's obvious: we got the NVML warning here. Remember, we didn't see this warning in the repro. So this call to fgets is not the one getting stuck. If you look at the stack trace carefully, you'll see the return address is different too. Here's the call stack of the repro for comparison: 0xd82440 vs 0xd824a4.

Could the image base be different because this is a different process? That's a good point, but that's probably not the case. Usually the image base is reused for efficiency, and in this case the addresses are only 0x64 bytes apart. If the module were mapped at a different address, the difference would be aligned to the 4 KiB page size.

So this is a different call to fgets. The good news is that both calls are made from the same function, which is called from nvmlInit_v2. You can tell because the return address of the function calling fgets is 0xd982c0 in both cases.

Fun time to dive into assembly! Let’s start from nvmlInit_v2.
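
The command is simply (output omitted here):

```
(gdb) disas nvmlInit_v2
```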

Right above 0xd982c0, there is a BL to 0xd823d0, which must be the function call we’re looking for.

One disadvantage of gdb compared to the Windows debuggers (ntsd/cdb/kd/windbg) is that the disas command (uf in the Windows debuggers) doesn't work if there is no matching symbol at the target address. We need to use the x command instead.
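
For example, to disassemble the code at 0xd823d0 that the BL above points to (the instruction count is arbitrary):

```
(gdb) x/100i 0xd823d0
```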

Alright, we found the call to fgets that is stuck in the repro. Well, we could enjoy finding out why it went down a different path with/without systemd, but not today. Let’s focus on the original issue.

The next thing to find out is what this fgets call is for. Before asking AI for help, let's just spend some time reading the assembly. The hint is what happens before fgets is called. Around there, we can see calls to standard functions like getpid, sprintf, popen, strstr, etc. If you have some experience in C programming, you can easily draw a picture of the behavior. Here's my thinking process (with a sketch of the inferred pattern after the list):

  • There's a popen before fgets. fgets is probably there to read the output of the process spawned by this popen.

  • Before popen, the function calls sprintf. This must be to construct a command-line string to pass to popen.

  • getpid before popen is there to embed the current process's PID into the command-line string.
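
Putting these together, the inferred pattern looks roughly like this. It's a sketch, not NVML's actual code, and the command template is a placeholder that we pin down below.

```
/* a sketch of the inferred pattern, not NVML's actual code */
#include <stdio.h>
#include <unistd.h>

static void run_helper_command(void)
{
    char cmd[256];
    char line[256];

    /* the real format string is recovered from the binary below;
       "<command> %d" is just a placeholder here */
    sprintf(cmd, "<command> %d", (int)getpid());

    FILE *fp = popen(cmd, "r");     /* spawn a child process */
    if (fp != NULL) {
        /* read the child's output -- this is the fgets that hangs in the repro */
        while (fgets(line, sizeof(line), fp) != NULL) {
            /* strstr(...) etc. to inspect each line */
        }
        pclose(fp);
    }
}
```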

What does this mean? Always step back and remember the original issue. The problem is that fgets never returns. And we just found out that nvmlInit_v2 spawns a process and reads its output via fgets, which gets stuck. There is no documentation of this behavior in NVIDIA's reference. Who knew a simple initialization function spawns a child process!

The next question is, of course: what process is spawned? You could go back to the shell and run ps to find out, but hold on, let's stick to the code and figure out the command-line string first. sprintf is our next target because it looks like it constructs the command to execute.

I believe you know sprintf. To understand the command being constructed, it's best to look at the format string, which is passed to sprintf as the second argument. By the way, one advantage of ARM64 compared to x86 is its fixed-size instructions. Another advantage is a straightforward calling convention: the second argument is passed in the X1 register. Let's see that value by looking at the assembly around the call to sprintf.

If you're not familiar with ARM64 instructions, the specs can be found here. If you already know ARM64 instructions, you'll easily see that the second argument is 0x1233000 + 0xeb0. Let's check what's there.
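
gdb can evaluate the expression directly; the output (omitted here) is the format string:

```
(gdb) x/s 0x1233000 + 0xeb0
```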

This is apparently a template for a command-line string, and it explains why getpid is called.

Now, it’s time to confirm this finding. Start Dynamo and see if we’re really running this command.
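
A quick way to catch it right after a restart (the grep pattern is illustrative):

```
$ sudo systemctl restart dynamo
$ ps -ef | grep -v grep | grep lsof
```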

We're indeed running that command! The next thing to do is, of course, to run the exact same command manually.

It just worked. Weird. Anyway, we now know the root cause is not in Dynamo but in lsof, which doesn't finish if it's spawned from a Dynamo that was itself spawned via systemd. What's next? Should we attach gdb to the lsof process to find out? Well, I could, but this time I took a different approach: strace. And it worked very well.

Root cause

Let's restart dynamo and run strace on the lsof process.
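
A sketch of what I did; since the lsof child is stuck, it stays around long enough to attach to:

```
$ sudo systemctl restart dynamo
$ pgrep -a lsof
$ sudo strace -p <lsof pid> -o /tmp/lsof.strace
```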

Hmm, we found something really unusual. In the lsof process, a bunch of close calls are failing.

close(503905516) = -1 EBADF
close(503905517) = -1 EBADF
…

So, is it finally time to debug lsof? Again, I could, but to be more practical and save time, I thought it was time to ask AI about the possible cause, and here's the actual answer I got.

There is something I haven't shown yet: the service definition of Dynamo.
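
I won't paste it verbatim; a trimmed-down sketch with the relevant settings looks like this (the paths match the commands above, the rest is illustrative):

```
# /etc/systemd/system/dynamo.service -- a trimmed-down sketch, not the verbatim unit
[Unit]
Description=Dynamo node service

[Service]
ExecStart=/opt/dynamo/noded -config=/opt/dynamo/config.json
Restart=always
LimitNOFILE=infinity

[Install]
WantedBy=multi-user.target
```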

Do you see anything unusual? It must be LimitNOFILE=infinity. There was no particular reason I added it. According to the AI's answer, an old implementation of lsof guesses the maximum fd from getrlimit(RLIMIT_NOFILE). In this case, that is RLIMIT_NOFILE = RLIM_INFINITY = 2^63-1, meaning lsof tries to close every fd above its stdio fds, all the way up to 2^63-1. This is why Dynamo gets stuck.
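
In other words, the problematic pattern is roughly the following (a sketch, not lsof's actual code):

```
/* a sketch of the problematic pattern, not lsof's actual code */
#include <sys/resource.h>
#include <unistd.h>

static void close_inherited_fds(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return;

    /* with LimitNOFILE=infinity this limit is RLIM_INFINITY (2^63-1),
       so the loop effectively never finishes */
    for (rlim_t fd = 3; fd < rl.rlim_cur; fd++)
        close((int)fd);
}
```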

Alright, it seems we have found the root cause. The next move is to confirm it. It's easy: let's comment out LimitNOFILE=infinity and run Dynamo.
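
Something like:

```
$ sudo systemctl edit --full dynamo     # comment out the LimitNOFILE=infinity line
$ sudo systemctl daemon-reload
$ sudo systemctl restart dynamo
$ sudo journalctl -u dynamo -f
```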

I won't put the command output here because it's obvious. The result: it actually worked! Without LimitNOFILE=infinity, Dynamo just works normally.

What's the conclusion here? Is it a bug in lsof? Basically yes, but remember what the AI said: this is an old implementation detail of lsof. That implies the latest lsof behaves differently. And remember, I found this issue when I ran Dynamo on Amazon Linux ARM64, which we only recently started supporting. We've never seen this issue on x64 machines. Let's check the version of lsof.
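
lsof prints its version and build information with:

```
$ lsof -v
```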

It's 4.94.0. How about an x64 machine where Dynamo runs without any problem?

Oops, it’s different: 4.94.0 vs 4.95.0!

Fortunately, unlike NVML, as you can see in the output above, lsof is open-source. In fact, the release notes of 4.95.0 mention a fix for a bug that sounds pretty much like our issue.

So I built lsof from source to see whether version 4.95.0 really fixed our issue (after uncommenting the LimitNOFILE=infinity line, of course). Unfortunately, it didn't. So I tried several versions to find out which one fixed the issue. By the way, here are the commands to build lsof.
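
Roughly (tags as named in the lsof-org/lsof repository; the exact build steps may differ between versions):

```
$ git clone https://github.com/lsof-org/lsof.git
$ cd lsof
$ git checkout 4.98.0            # pick the tag you want to test
$ ./Configure -n linux && make   # classic build script; newer tags may use autotools instead
$ ./lsof -v
```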

Finally, I confirmed that the issue is gone with 4.99.0, while it still exists with 4.98.0. Let's check the release notes of 4.99.0.

Issue 281 must be our issue. To really confirm it, I could revert that patch and see, but I didn't go that far. Another mystery is why lsof 4.95.0 on the Ubuntu x64 instance doesn't hit this issue. That's a very good question. I'll dive into these another time.

Let’s recap:

  • We saw an issue where Dynamo got stuck when run through systemd on Amazon Linux ARM64.

  • Dynamo calls nvmlInit_v2, which spawns lsof, which gets stuck.

  • lsof 4.94.0 has a bug where it attempts to close every fd above its stdio fds.

  • Dynamo's service definition specifies LimitNOFILE=infinity, meaning the upper limit of lsof's close attempts is 2^63-1.

Considering all of the above, what's the fix? I simply decided to specify LimitNOFILE=1048576 instead of infinity. Again, there was no particular reason I chose infinity in the first place. Other possible approaches are to apply this setting only if the version of lsof on the system is lower than 4.99.0, or to set a custom limit only while calling nvmlInit_v2, but I prefer the simplest fix.
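
For reference, it's a one-line change in the service definition (a drop-in override would work just as well):

```
[Service]
LimitNOFILE=1048576
```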

This is just another day of debugging at Kinesis. Happy debugging!
