One of the key components in the Kinesis ecosystem is Dynamo, an agent program that promotes a computer to a computing node in the network. Recently, I made a patch to enable Dynamo to run on ARM64 devices. We usually use Ubuntu as our everyday Linux distro. This time, however, I was in the mood to try something new and chose Amazon Linux on AWS. That's where the story began.
Dynamo got stuck
On a node, we run Dynamo as a systemd service. When I started the Dynamo service on an Amazon Linux node, the process started, but I noticed that some important events were not being logged. It seemed that the process was stuck somewhere in the startup phase.
The first triage step was to run the exact same command manually, and when I ran it from a shell, it worked normally. Here is the output: the former is from systemd; the latter is from the manual run.
#### systemd
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ sudo systemctl start dynamo
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ sudo journalctl -u dynamo -f
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: dynamo.service: Deactivated successfully.
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: Stopped dynamo.service - "Dynamo node service".
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: dynamo.service: Consumed 42.170s CPU time.
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal systemd[1]: Started dynamo.service - "Dynamo node service".
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded your wallet" addr=0xa5e07b0a3944dd9158a8a72a6794004201669468 file=/opt/dynamo/id_ecdsa
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded AppCacheFile" file=/opt/dynamo/app-cache.json
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded a valid certificate" file=/opt/dynamo/backend.crt
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded config" file=/opt/dynamo/config.json
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.182Z level=ERROR msg="Cannot initialize Docker Manager" err="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.183Z level=INFO msg="Serving gRPC" laddr=/tmp/kinesis-dynamo.sock
^C
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ ps -ef | grep node
root         778       2  0 22:38 ?        00:00:00 [xfs-inodegc/nvm]
ec2-user   37562       1  1 23:39 ?        00:00:00 /opt/dynamo/noded -config=/opt/dynamo/config.json
ec2-user   37577   36391  0 23:39 pts/0    00:00:00 grep --color=auto node

### manual run
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ /opt/dynamo/noded -config=/opt/dynamo/config.json
time=2025-08-27T23:40:03.620Z level=INFO msg="Loaded your wallet" addr=0xa5e07b0a3944dd9158a8a72a6794004201669468 file=/opt/dynamo/id_ecdsa
time=2025-08-27T23:40:03.620Z level=INFO msg="Loaded AppCacheFile" file=/opt/dynamo/app-cache.json
time=2025-08-27T23:40:03.621Z level=INFO msg="Loaded a valid certificate" file=/opt/dynamo/backend.crt
time=2025-08-27T23:40:03.621Z level=INFO msg="Loaded config" file=/opt/dynamo/config.json
time=2025-08-27T23:40:03.660Z level=ERROR msg="Cannot initialize Docker Manager" err="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time=2025-08-27T23:40:03.661Z level=INFO msg="Serving gRPC" laddr=/tmp/kinesis-dynamo.sock
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
time=2025-08-27T23:40:03.767Z level=ERROR msg="Cannot initialize NVML" result=9
time=2025-08-27T23:40:03.802Z level=INFO msg="Successfully registered with Node Manager" address=0xa5e07b0a3944dd9158a8a72a6794004201669468 ip=""
^C
As you can see in the second output, we use NVML to collect information about NVIDIA GPUs. The warning and the "Cannot initialize NVML" error in the output are expected because this node doesn't have any GPU. The problem is that this warning and error were not logged when Dynamo ran as a systemd service.
Debugging with Delve
Since Dynamo is written in Go, the second triage step was to attach Delve to the systemd-managed process. I could debug in interactive mode, but if the target process is a daemon, as in this case, I always start with batch mode, i.e., running dlv to print all goroutines and dump them to a file. Why? If you pause the process for too long, systemd may trigger a restart. In general, you should keep your debug target fresh and untouched, like fish waiting to be cooked. Dumping to a file also helps if the output is long: you can analyze it in your favorite editor for as long as you like. Here are the commands to dump goroutines to a file.
I found a goroutine that indeed looked stuck.
As mentioned above, Dynamo uses NVML. Since NVML is a C library (Python bindings are also available), Dynamo implements a stub written in C to consume the NVML APIs and compiles/links it with cgo. The call stack above indicates that the stub code never returns. Below is the actual Go code of Dynamo. Frame #7 in the output above indicates that the call to C.NvmlInit() below never returns. Well, there is nothing interesting in the Go code.
What does NvmlInit() do? You may be surprised. Here’s the actual C code of Dynamo.
It does nothing but call the NVML function nvmlInit, which is actually an alias for the versioned function nvmlInit_v2. NVIDIA's reference has some description and notes about this function, but nothing that mentions a possible hang.
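If you have never written a cgo stub, the shape is roughly the following. This is a minimal sketch under the assumption that the stub simply forwards to NVML and returns the status code for the Go side to log; the function name NvmlInit comes from the stack trace, and everything else is illustrative rather than Dynamo's verbatim source.

```c
// Illustrative cgo stub (not Dynamo's verbatim source).
// nvml.h maps nvmlInit to the versioned nvmlInit_v2.
#include <nvml.h>

// Called from Go as C.NvmlInit(); forwards to NVML and returns the
// nvmlReturn_t as an int so the caller can log failures (e.g. result=9).
int NvmlInit(void)
{
    return (int)nvmlInit();
}
```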
So what's next? Unfortunately, NVML is closed source. Call NVIDIA support?
Debugging with GDB
Of course not. We dig deeper with GDB (or LLDB, if you prefer). No source code? No problem.
Let’s restart dynamo and attach gdb to the process.
If you want to dump the call stacks of all threads, as we did with dlv earlier, you can run
However, let’s try another approach this time. First, list all threads.
We can probably ignore the threads sitting in runtime.futex: many threads are waiting there, so it looks like the Go runtime's own machinery. That leaves thread #4 and thread #7. Let's check them one by one. In this case, thread #4 wasn't interesting, and #7 was what we were looking for.
It's in the middle of our stub function NvmlInit, which is calling fgets. Let's confirm that fgets really never returns. This is an important step, because the hang could also be caused by an infinite loop that calls fgets repeatedly.
I left it for several seconds and the breakpoint was never hit. So Dynamo was stuck because fgets never returned. What are we reading, and from where? Standard input?
We'll get there, but before that, let's step back a bit and remember that this happens only when we run Dynamo via systemd. There are two possibilities: when Dynamo runs directly from a shell, either fgets returns without any problem, or fgets isn't called at all. Let's find out which. I took this approach because debugging a normal process is much easier than debugging a systemd-managed one: we don't have to look up the PID, we don't have to worry about automatic restarts, and so on.
I just ran dynamo with gdb and set a breakpoint at nvmlInit_v2.
So far so good. Next, I set a breakpoint on fgets and continued.
The breakpoint got hit nicely, but wait, did you notice the difference?
It's obvious: we got the NVML warning here. Remember, we didn't see this warning in the repro. So this call to fgets is not the one getting stuck. If you look at the stack trace carefully, you'll see the return address is different too. Here's the call stack of the repro for comparison: 0xd82440 vs. 0xd824a4.
Could the image base be different because this is a different process? That's a good point, but it's probably not the case here. The image base is usually reused for efficiency, and in this case the addresses differ by only 0x64 bytes. If the module were mapped at a different address, the difference would be a multiple of the 4K page size.
So this is a different call to fgets. The good news is that both calls are made from the same function, which is itself called from nvmlInit_v2: the return address into nvmlInit_v2 is 0xd982c0 in both cases, so the function calling fgets was invoked from the same call site.
Fun time to dive into assembly! Let’s start from nvmlInit_v2.
Right above 0xd982c0, there is a BL to 0xd823d0, which must be the function call we’re looking for.
One disadvantage of gdb compared to the Windows debuggers (ntsd/cdb/kd/windbg) is that the disassemble command (uf in the Windows debuggers) doesn't work if there is no matching symbol at the target address. We need to use the x command instead.
Alright, we found the call to fgets that gets stuck in the repro. We could have fun figuring out why it takes a different path with and without systemd, but not today. Let's focus on the original issue.
The next thing to find out is what this fgets call is for. Before asking an AI for help, spend some time just reading the assembly. The hint is what happens before the call to fgets: around there, we can see calls to standard functions like getpid, sprintf, popen, strstr, etc. If you have some experience in C programming, you can easily piece the behavior together. Here's my thinking process (a C reconstruction of the pattern follows the list):
There's a popen before fgets. fgets is probably reading the output of a process spawned by this popen
Before popen, the function calls sprintf. This must be constructing the command-line string to pass to popen
The getpid before popen embeds the current process's PID into that command-line string
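Putting those observations together, the code around this fgets most likely looks something like the sketch below. This is a reconstruction of the pattern implied by the assembly, not NVIDIA's actual source; the command string, buffer sizes, and the marker being searched for are placeholders.

```c
// Reconstructed pattern (illustrative only; not NVIDIA's actual code).
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int check_something_about_this_process(void)
{
    char cmd[256];
    char line[256];
    int found = 0;

    // getpid + sprintf: build a command line that refers to our own PID.
    sprintf(cmd, "some-command -p %d", (int)getpid()); /* placeholder command */

    // popen: run the command and read its stdout through a FILE*.
    FILE *fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    // fgets: this is the call that blocks in the repro, because the
    // spawned child never finishes and never closes its end of the pipe.
    while (fgets(line, sizeof(line), fp) != NULL) {
        // strstr: scan each line of the child's output for some marker.
        if (strstr(line, "some marker") != NULL) {
            found = 1;
            break;
        }
    }

    pclose(fp);
    return found;
}
```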
What does this mean? Always step back and remember the original issue: fgets never returns. We just found out that nvmlInit_v2 spawns a process and reads its output via fgets, and that's where we get stuck. There is no documentation about this behavior in NVIDIA's reference. Who would have guessed that a simple initialization function spawns a child process!
The next question, of course, is: what process is spawned? You could go back to the shell and run ps to find out, but hold on, let's stick to the code and figure out the command-line string first. sprintf is our next target because it appears to construct the command to execute.
I assume you know sprintf. To understand the command being constructed, we want to see the format string, which is passed to sprintf as the second argument. By the way, one advantage of ARM64 over x86 is its fixed-size instructions. Another is its straightforward calling convention: the second argument is passed in the X1 register. Let's look at that value. Here's the assembly around the call to sprintf.
If you're not familiar with ARM64 instructions, the specs can be found here. If you already know them, you'll easily see that the second argument is 0x1233000 + 0xeb0. Let's check what's there.
This is apparently a template for a command-line string, and it explains why getpid is called.
Now, it’s time to confirm this finding. Start Dynamo and see if we’re really running this command.
We are indeed running that command! The next thing to do, of course, is to run the exact same command manually.
It just worked. Weird. Anyway, we now know the root cause is not in Dynamo but in lsof, which never finishes when it's spawned from a Dynamo that was itself spawned by systemd. What's next? Should we attach gdb to the lsof process to find out? I could have, but this time I took a different approach: strace. And it worked very well.
Root cause
Let's restart Dynamo and run strace on lsof.
Hmm, we found something really unusual: in the lsof process, a huge number of close calls are failing.
So, is it finally time to debug lsof? Again, I could have, but to be practical and save time, I decided to ask an AI about possible causes. Here's the actual answer I got.
There is something I haven't shown yet: the service definition of Dynamo, which is this:
Do you see anything unusual? It has to be LimitNOFILE=infinity. There was no particular reason I added it. According to the AI's answer, an old implementation of lsof guesses the maximum fd from getrlimit(RLIMIT_NOFILE). In this case, that means RLIMIT_NOFILE = RLIM_INFINITY = 2^63-1, so lsof tries to close every fd above its stdio fds, one by one, all the way up to 2^63-1. This is why Dynamo is stuck.
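To make the failure mode concrete, here is a sketch of the kind of startup code the AI answer describes. The real lsof code is organized differently; this only shows why the loop cannot finish in any reasonable time once RLIMIT_NOFILE is infinity.

```c
// Sketch of the old "close everything above stdio" startup pattern
// (illustrative; not lsof's actual code).
#include <sys/resource.h>
#include <unistd.h>

static void close_inherited_fds(void)
{
    struct rlimit rl;

    // Guess the highest possible fd from the RLIMIT_NOFILE soft limit.
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return;

    // With LimitNOFILE=infinity, rl.rlim_cur is RLIM_INFINITY, so this loop
    // has to issue an astronomical number of close() calls, almost all of
    // which fail with EBADF -- exactly what strace showed.
    for (rlim_t fd = 3; fd < rl.rlim_cur; fd++)
        close((int)fd);

    // Newer lsof releases avoid this by asking the kernel to do it in one
    // shot (close_range(2) on Linux >= 5.9, later closefrom()), as the
    // release notes quoted below describe.
}
```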
Alright, it seems we've found the root cause. The next move is to confirm it. That's easy: comment out LimitNOFILE=infinity and run Dynamo.
I won't paste the command output here because it's obvious: it actually worked! Without LimitNOFILE=infinity, Dynamo runs normally.
What's the conclusion here? Is it a bug in lsof? Basically yes, but remember what the AI said: this is an old implementation detail of lsof, which implies the latest lsof behaves differently. Also remember that I hit this issue running Dynamo on Amazon Linux ARM64, which we only recently started supporting; we had never seen it on our x64 machines. Let's check the version of lsof.
It's 4.94.0. How about an x64 machine where Dynamo runs without any problem?
Oops, it’s different: 4.94.0 vs 4.95.0!
Fortunately, unlike NVML, lsof is open source, as you can see in the output above. In fact, the release notes for 4.95.0 mention a fix for a bug that sounds pretty much like our issue.
So I built lsof from source to see whether 4.95.0 really fixed our issue (after uncommenting the LimitNOFILE=infinity line, of course). Unfortunately, it didn't. So I tried several versions to find out which one fixed it. By the way, here are the commands to build lsof.
Eventually, I confirmed that the issue is gone with 4.99.0, while it still exists with 4.98.0. Let's check the release notes for 4.99.0.
Issue 281 must be our issue. To really confirm it, I could revert this patch and see, but I didn't go that far. Another mystery is why lsof 4.95.0 on the Ubuntu x64 instance doesn't hit this issue. That's a very good question. I'll dive into these another time.
Let’s recap:
We saw an issue where Dynamo got stuck when run through systemd on Amazon Linux ARM64.
Dynamo calls nvmlInit_v2, which spawns lsof, which gets stuck
lsof 4.94.0 has a bug: it attempts to close every fd above its stdio fds, one by one
Dynamo's service definition specifies LimitNOFILE=infinity, so the upper bound of lsof's close attempts is 2^63-1
Considering all of the above, what's the fix? I simply decided to specify LimitNOFILE=1048576 instead of infinity. Again, there was no particular reason I chose infinity in the first place. Another possible approach would be to apply this setting only if the lsof on the system is older than 4.99.0, or to set a custom limit only around the call to nvmlInit_v2, but I prefer the simplest fix.
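If you want to double-check what a service actually ends up with after a change like this, looking at /proc/<pid>/limits is enough; a tiny probe like the one below (an illustrative helper, not part of Dynamo) shows the same thing programmatically from inside a process started under the unit.

```c
// Illustrative probe (not part of Dynamo): print the RLIMIT_NOFILE values a
// process inherited, e.g. from systemd's LimitNOFILE= setting.
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    // With LimitNOFILE=infinity this prints the huge RLIM_INFINITY sentinel;
    // with LimitNOFILE=1048576 it prints 1048576 for the soft limit.
    printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}
```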
This is just another day of debugging at Kinesis. Happy debugging!
* Goroutine 36 - User: _cgo_gotypes.go:150 github.com/kinesis-network/kinesis-dynamo/nvml._Cfunc_NvmlInit (0x8fb5f0) (thread 38720) [select]
 0  0x0000ffff87f68ff8 in ???
    at ?:-1
 1  0x000000000048de18 in runtime.systemstack_switch
    at ./.go/src/runtime/asm_arm64.s:249
 2  0x000000000048572c in runtime.cgocall
    at ./.go/src/runtime/cgocall.go:185
 3  0x00000000008fb5f0 in github.com/kinesis-network/kinesis-dynamo/nvml._Cfunc_NvmlInit
    at _cgo_gotypes.go:150
 4  0x00000000008fb708 in github.com/kinesis-network/kinesis-dynamo/nvml.Init.func1
    at ./kinesis-dynamo/nvml/nvml.go:59
 5  0x00000000004998e0 in sync.(*Once).doSlow
    at ./.go/src/sync/once.go:78
 6  0x00000000008fb6b4 in sync.(*Once).Do
    at ./.go/src/sync/once.go:69
 7  0x00000000008fb6b4 in github.com/kinesis-network/kinesis-dynamo/nvml.Init
    at ./kinesis-dynamo/nvml/nvml.go:58
 8  0x00000000009bca08 in github.com/kinesis-network/kinesis-dynamo/pulse.FillGpuInfo
    at ./kinesis-dynamo/pulse/gpu.go:14
 9  0x0000000000d25f1c in github.com/kinesis-network/kinesis-dynamo/core.(*Server).collectNodeInfo
    at ./kinesis-dynamo/core/server.go:389
10  0x0000000000d25720 in github.com/kinesis-network/kinesis-dynamo/core.(*Server).advertiseSelf
    at ./kinesis-dynamo/core/server.go:340
11  0x0000000000d26c8c in github.com/kinesis-network/kinesis-dynamo/core.(*Server).Serve.func1
    at ./kinesis-dynamo/core/server.go:464
12  0x0000000000490334 in runtime.goexit
    at ./.go/src/runtime/asm_arm64.s:1268
[ec2-user@ip-172-31-15-164 ~]$ sudo systemctl restart dynamo
[ec2-user@ip-172-31-15-164 ~]$ ps -ef | grep noded
ec2-user 42723 1 2 01:52 ? 00:00:00 /opt/dynamo/noded -config=/opt/dynamo/config.json
ec2-user 42787 36391 0 01:52 pts/0 00:00:00 grep --color=auto noded
[ec2-user@ip-172-31-15-164 ~]$ gdb -q /opt/dynamo/noded
Reading symbols from /opt/dynamo/noded...
warning: File "/home/ec2-user/.go/.versions/1.25.0/src/runtime/runtime-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /home/ec2-user/.go/.versions/1.25.0/src/runtime/runtime-gdb.py
line to your configuration file "/home/ec2-user/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/home/ec2-user/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
(gdb) attach 42723
Attaching to program: /opt/dynamo/noded, process 42723
[New LWP 42788]
[New LWP 42773]
[New LWP 42759]
[New LWP 42739]
[New LWP 42729]
[New LWP 42728]
[New LWP 42727]
[New LWP 42726]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
runtime.futex () at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
651 SVC
Missing rpms, try: dnf --enablerepo='*debug*' install glibc-debuginfo-2.34-196.amzn2023.0.1.aarch64
(gdb) thread apply all bt
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffffbc4f5020 (LWP 42723) "noded" runtime.futex ()
at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
2 Thread 0xffff70f3e100 (LWP 42788) "noded" runtime.futex ()
at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
3 Thread 0xffff7194e100 (LWP 42773) "noded" runtime.futex ()
at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
4 Thread 0xffff7235e100 (LWP 42759) "noded" internal/runtime/syscall.Syscall6 ()
at /home/ec2-user/.go/src/internal/runtime/syscall/asm_linux_arm64.s:17
5 Thread 0xffff72e2e100 (LWP 42739) "noded" runtime.futex ()
at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
6 Thread 0xffff7395e100 (LWP 42729) "noded" runtime.futex ()
at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
7 Thread 0xffff743ae100 (LWP 42728) "noded" 0x0000ffffbc38aff8 in read () from /lib64/libc.so.6
8 Thread 0xffff74dbe100 (LWP 42727) "noded" runtime.futex ()
at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
9 Thread 0xffff7580e100 (LWP 42726) "noded" runtime.futex ()
at /home/ec2-user/.go/src/runtime/sys_linux_arm64.s:651
(gdb) thread 7
[Switching to thread 7 (Thread 0xffff62976100 (LWP 40398))]
#0 0x0000ffffaa952ff8 in read () from /lib64/libc.so.6
(gdb) bt
#0 0x0000ffffaa952ff8 in read () from /lib64/libc.so.6
#1 0x0000ffffaa8ed888 in __GI__IO_file_underflow () from /lib64/libc.so.6
#2 0x0000ffffaa8ee944 in _IO_default_uflow () from /lib64/libc.so.6
#3 0x0000ffffaa8e0ddc in _IO_getline_info () from /lib64/libc.so.6
#4 0x0000ffffaa8dfb20 in fgets () from /lib64/libc.so.6
#5 0x0000000000d824a4 in ?? ()
#6 0x0000000000d982c0 in nvmlInit_v2 ()
#7 0x0000000000d75c3c in NvmlInit () at nvml.c:87
#8 0x0000000000d757f0 in _cgo_4e2c63ea9b15_Cfunc_NvmlInit (v=0x40002259a8) at /tmp/go-build/cgo-gcc-prolog:114
#9 0x000000000049013c in runtime.asmcgocall () at /home/ec2-user/.go/src/runtime/asm_arm64.s:1049
#10 0x0000004000003dc0 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
(gdb) b *0xd824a4
Breakpoint 1 at 0xd824a4
(gdb) c
Continuing.
[ec2-user@ip-172-31-15-164 ~]$ gdb -q /opt/dynamo/noded
Reading symbols from /opt/dynamo/noded...
warning: File "/home/ec2-user/.go/.versions/1.25.0/src/runtime/runtime-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /home/ec2-user/.go/.versions/1.25.0/src/runtime/runtime-gdb.py
line to your configuration file "/home/ec2-user/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/home/ec2-user/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
(gdb) b nvmlInit_v2
Breakpoint 1 at 0xd98264
(gdb) r -config=/opt/dynamo/config.json
Starting program: /opt/dynamo/noded -config=/opt/dynamo/config.json
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0xffffb130a100 (LWP 43272)]
[New Thread 0xffffabfff100 (LWP 43273)]
[New Thread 0xffffab5ef100 (LWP 43274)]
[New Thread 0xffffaabdf100 (LWP 43275)]
[New Thread 0xffffaa1cf100 (LWP 43276)]
time=2025-08-28T02:09:22.824Z level=INFO msg="Loaded your wallet" addr=0xa5e07b0a3944dd9158a8a72a6794004201669468 file=/opt/dynamo/id_ecdsa
time=2025-08-28T02:09:22.825Z level=INFO msg="Loaded AppCacheFile" file=/opt/dynamo/app-cache.json
time=2025-08-28T02:09:22.825Z level=INFO msg="Loaded a valid certificate" file=/opt/dynamo/backend.crt
time=2025-08-28T02:09:22.825Z level=INFO msg="Loaded config" file=/opt/dynamo/config.json
[New Thread 0xffffa97bf100 (LWP 43277)]
time=2025-08-28T02:09:22.873Z level=ERROR msg="Cannot initialize Docker Manager" err="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
[New Thread 0xffffa8daf100 (LWP 43278)]
time=2025-08-28T02:09:22.874Z level=INFO msg="Serving gRPC" laddr=/tmp/kinesis-dynamo.sock
[Switching to Thread 0xffffabfff100 (LWP 43273)]
Thread 3 "noded" hit Breakpoint 1, 0x0000000000d98264 in nvmlInit_v2 ()
Missing rpms, try: dnf --enablerepo='*debug*' install glibc-debuginfo-2.34-196.amzn2023.0.1.aarch64
(gdb) bt
#0 0x0000000000d98264 in nvmlInit_v2 ()
#1 0x0000000000d75c3c in NvmlInit () at nvml.c:87
#2 0x0000000000d757f0 in _cgo_4e2c63ea9b15_Cfunc_NvmlInit (v=0x400038d9a8) at /tmp/go-build/cgo-gcc-prolog:114
#3 0x000000000049013c in runtime.asmcgocall () at /home/ec2-user/.go/src/runtime/asm_arm64.s:1049
#4 0x00000040000a4a80 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
(gdb) b fgets
Breakpoint 2 at 0xfffff7e13a90
(gdb) c
Continuing.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
time=2025-08-28T02:09:33.691Z level=INFO msg="Disconnected from Runtime Manager" reason="connection error: desc = \"error reading server preface: EOF\""
[Detaching after vfork from child process 43280]
Thread 3 "noded" hit Breakpoint 2, 0x0000fffff7e13a90 in fgets () from /lib64/libc.so.6
(gdb) bt
#0 0x0000fffff7e13a90 in fgets () from /lib64/libc.so.6
#1 0x0000000000d82440 in ?? ()
#2 0x0000000000d982c0 in nvmlInit_v2 ()
#3 0x0000000000d75c3c in NvmlInit () at nvml.c:87
#4 0x0000000000d757f0 in _cgo_4e2c63ea9b15_Cfunc_NvmlInit (v=0x400038d9a8) at /tmp/go-build/cgo-gcc-prolog:114
#5 0x000000000049013c in runtime.asmcgocall () at /home/ec2-user/.go/src/runtime/asm_arm64.s:1049
#6 0x00000040000a4a80 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
(gdb)
#4 0x0000ffffaa8dfb20 in fgets () from /lib64/libc.so.6
#5 0x0000000000d824a4 in ?? ()
#6 0x0000000000d982c0 in nvmlInit_v2 ()
#7 0x0000000000d75c3c in NvmlInit () at nvml.c:87
Yes — that strace output is a big hint.
When you see lsof looping on
it means lsof is walking a huge range of possible fd numbers and calling close() on them one by one.
That is not normal fd enumeration (which should only read /proc/<pid>/fd). This is an old implementation detail of lsof:
Before using /proc/<pid>/fd, lsof tries to "sanitize" the fd table by closing everything above its stdio fds, to make sure it doesn’t inherit descriptors it shouldn’t.
It doesn’t ask the kernel what the max fd is, but instead guesses from getrlimit(RLIMIT_NOFILE).
[ec2-user@ip-172-31-15-164 ~]$ cat /proc/version
Linux version 6.1.147-172.266.amzn2023.aarch64 (mockbuild@ip-10-0-37-70) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.41-50.amzn2023.0.3) #1 SMP Thu Aug 7 19:28:45 UTC 2025
[ec2-user@ip-172-31-15-164 ~]$ lsof -v
lsof version information:
revision: 4.94.0
latest revision: <https://github.com/lsof-org/lsof>
latest FAQ: <https://github.com/lsof-org/lsof/blob/master/00FAQ>
latest (non-formatted) man page: <https://github.com/lsof-org/lsof/blob/master/Lsof.8>
constructed: Mon May 19 00:00:00 UTC 2025
compiler: cc
compiler version: 11.5.0 20240719 (Red Hat 11.5.0-5) (GCC)
...
ubuntu@dynamo:~$ cat /proc/version
Linux version 6.8.0-59-generic (buildd@lcy02-amd64-035) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #61-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 11 23:16:11 UTC 2025
ubuntu@dynamo:~$ lsof -v
lsof version information:
revision: 4.95.0
latest revision: <https://github.com/lsof-org/lsof>
latest FAQ: <https://github.com/lsof-org/lsof/blob/master/00FAQ>
latest (non-formatted) man page: <https://github.com/lsof-org/lsof/blob/master/Lsof.8>
compiler: cc
compiler version: 13.2.0 (Ubuntu 13.2.0-23ubuntu3)
[linux] use close_range instead of calling close repeatedly
At the starting up, lsof closes its file descriptors greater
than 2 by calling close(2) repeatedly. As reported in #186,
it can take long time. Linux 5.9 introduced close_range(2).
The new system call can close multiple file descriptors faster.
@qianzhangyl reported the original issue (#186).
$ git clone https://github.com/lsof-org/lsof.git -b 4.95.0
$ cd lsof
$ ./Configure linux
...
$ make -j4
...
$ ./lsof -v
lsof version information:
revision: 4.95.0
latest revision: <https://github.com/lsof-org/lsof>
latest FAQ: <https://github.com/lsof-org/lsof/blob/master/00FAQ>
latest (non-formatted) man page: <https://github.com/lsof-org/lsof/blob/master/Lsof.8>
constructed by and on: [email protected]
compiler: cc
compiler version: 11.5.0 20240719 (Red Hat 11.5.0-5) (GCC)
...
[linux] Improve performance by using closefrom(). Closes #281.