Another day of debugging

By Toshihito Kikuchi

One of the key components in the Kinesis ecosystem is Dynamo, an agent program that promotes a computer to a computing node in the network. Recently, I made a patch to enable Dynamo to run on ARM64 devices. We usually use Ubuntu as our Linux distro for everyday use. On this day, however, I was in the mood to try something new and chose Amazon Linux on AWS. The story began there.

Dynamo got stuck

On a node, we run Dynamo as a systemd service. When I started the Dynamo service on an Amazon Linux node, the process started, but I realized some important events were not logged. It seemed that the process was stuck in the middle of the startup phase.

The first triage step is to run the exact same command manually, and when I ran the command in a shell, it just worked normally. Here are the outputs: the former is from systemd; the latter is from a manual run.

#### systemd

[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ sudo systemctl start dynamo
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ sudo journalctl -u dynamo -f
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: dynamo.service: Deactivated successfully.
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: Stopped dynamo.service - "Dynamo node service".
Aug 27 23:39:20 ip-172-31-15-164.ec2.internal systemd[1]: dynamo.service: Consumed 42.170s CPU time.
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal systemd[1]: Started dynamo.service - "Dynamo node service".
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded your wallet" addr=0xa5e07b0a3944dd9158a8a72a6794004201669468 file=/opt/dynamo/id_ecdsa
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded AppCacheFile" file=/opt/dynamo/app-cache.json
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded a valid certificate" file=/opt/dynamo/backend.crt
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.132Z level=INFO msg="Loaded config" file=/opt/dynamo/config.json
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.182Z level=ERROR msg="Cannot initialize Docker Manager" err="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
Aug 27 23:39:27 ip-172-31-15-164.ec2.internal dynamo[37562]: time=2025-08-27T23:39:27.183Z level=INFO msg="Serving gRPC" laddr=/tmp/kinesis-dynamo.sock
^C
[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ ps -ef | grep node
root         778       2  0 22:38 ?        00:00:00 [xfs-inodegc/nvm]
ec2-user   37562       1  1 23:39 ?        00:00:00 /opt/dynamo/noded -config=/opt/dynamo/config.json
ec2-user   37577   36391  0 23:39 pts/0    00:00:00 grep --color=auto node

#### manual run

[ec2-user@ip-172-31-15-164 kinesis-dynamo]$ /opt/dynamo/noded -config=/opt/dynamo/config.json
time=2025-08-27T23:40:03.620Z level=INFO msg="Loaded your wallet" addr=0xa5e07b0a3944dd9158a8a72a6794004201669468 file=/opt/dynamo/id_ecdsa
time=2025-08-27T23:40:03.620Z level=INFO msg="Loaded AppCacheFile" file=/opt/dynamo/app-cache.json
time=2025-08-27T23:40:03.621Z level=INFO msg="Loaded a valid certificate" file=/opt/dynamo/backend.crt
time=2025-08-27T23:40:03.621Z level=INFO msg="Loaded config" file=/opt/dynamo/config.json
time=2025-08-27T23:40:03.660Z level=ERROR msg="Cannot initialize Docker Manager" err="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
time=2025-08-27T23:40:03.661Z level=INFO msg="Serving gRPC" laddr=/tmp/kinesis-dynamo.sock

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
time=2025-08-27T23:40:03.767Z level=ERROR msg="Cannot initialize NVML" result=9
time=2025-08-27T23:40:03.802Z level=INFO msg="Successfully registered with Node Manager" address=0xa5e07b0a3944dd9158a8a72a6794004201669468 ip=""
^C

As you can see in the second output, we use NVML to collect information about NVIDIA GPUs. The warning and the "Cannot initialize NVML" error in the output are expected because this node doesn't have any GPU. The problem is that this warning and error were not logged when Dynamo ran as a systemd service.

Debugging with Delve

Since Dynamo is written in Go, the second triage step is to attach Delve to the systemd-managed process and debug it. I could debug in interactive mode, but if the target process is a daemon, as in this case, I always start with batch mode, i.e., running dlv to print all goroutines and dump them to a file. Why? If you pause the process for too long, systemd may trigger a restart. In general, you should keep your debug target fresh and untouched, like cooking fish. Dumping to a file also helps if the output is too long; you can use your favorite editor to analyze it for as long as you like. Here are the commands to dump goroutines to a file.
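
A rough sketch of what I mean, assuming Delve is installed on the node (the PID lookup and file paths are illustrative):

```
$ PID=$(pgrep -f /opt/dynamo/noded)
$ echo 'goroutines -t' > /tmp/dlv-init.txt
$ sudo dlv attach "$PID" --init /tmp/dlv-init.txt | tee /tmp/goroutines.txt
```

Then quit from the (dlv) prompt, making sure you detach without killing the process.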

I found a goroutine that indeed looks stuck.

As mentioned above, Dynamo uses NVML. Since NVML is a C library (Python bindings are available too), Dynamo implements a stub written in C to consume the NVML APIs and compiles/links it with cgo. The call stack above indicates that the stub code never returns: frame #7 shows that the call to C.NvmlInit() in Dynamo's Go code never returns. Well, there is nothing interesting in the Go code itself.

What does NvmlInit() do? You may be surprised: here's what Dynamo's C code amounts to.
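
A minimal sketch of the stub, assuming the standard nvml.h header (not the verbatim source):

```
/* a minimal sketch of Dynamo's cgo stub, not the verbatim source */
#include <nvml.h>

nvmlReturn_t NvmlInit(void)
{
    return nvmlInit();
}
```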

It does nothing but call the NVML function nvmlInit, which is actually an alias of the versioned function nvmlInit_v2. NVIDIA's reference has some description and notes about this function, but nothing mentions a possible hang.

So what's next? Unfortunately, NVML is closed-source. Call NVIDIA support?

Debugging with GDB

Of course not. We dig deeper with GDB (or you could go with LLDB if you prefer). No source code? No problem.

Let’s restart dynamo and attach gdb to the process.
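
Something like this (the PID lookup is illustrative):

```
$ sudo systemctl restart dynamo
$ sudo gdb -p "$(pgrep -f /opt/dynamo/noded)"
```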

If you want to dump the call stacks of all threads, as we did with dlv earlier, you can run:
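
```
(gdb) thread apply all bt
```

(You can also do this non-interactively with gdb's -batch and -ex options and redirect the output to a file.)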

However, let’s try another approach this time. First, list all threads.
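
In gdb:

```
(gdb) info threads
```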

We can probably ignore the threads sitting in runtime.futex; many are waiting there, so it looks like Go's runtime machinery. This means we're interested in thread #4 or #7. Let's check them one by one. In this case, thread #4 wasn't interesting, and #7 was what we were looking for.

It's in the middle of our stub function NvmlInit, which is calling fgets. Let's confirm whether fgets really never returns. This is an important step because the hang could also be caused by an infinite loop calling fgets repeatedly.
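
Here's a sketch of one way to check it on the stuck process: select the thread, look at its stack, then set a breakpoint on fgets and let the process run. If the code were looping over fgets, the breakpoint would be hit almost immediately.

```
(gdb) thread 7
(gdb) bt
(gdb) break fgets
(gdb) continue
```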

I left it for several seconds and this breakpoint wasn't hit. So Dynamo was stuck because fgets never returned. What kind of data are we reading, and from what? Standard input?

We'll get there, but before that, let's step back a little bit and remember that this happens only when we run Dynamo via systemd. There are two possibilities: if we run Dynamo directly from a shell, either fgets just returns without any problem, or fgets isn't called at all. Let's find out which. I took this approach because debugging a normal process is much easier than debugging a systemd process: we don't have to look up the PID, we don't have to worry about automatic restarts, etc.

I just ran dynamo with gdb and set a breakpoint at nvmlInit_v2.
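
Roughly:

```
$ gdb --args /opt/dynamo/noded -config=/opt/dynamo/config.json
(gdb) break nvmlInit_v2
(gdb) run
```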

So far so good. Next, I set a breakpoint on fgets and continued.
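
That is:

```
(gdb) break fgets
(gdb) continue
...
(gdb) bt
```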

The breakpoint got hit nicely, but wait, did you notice the difference?

It's obvious: we got the NVML warning here. Remember, we didn't see this warning in the repro. So this call to fgets is not the one getting stuck. If you look at the stack trace carefully, you'll see the return address is different too. Here's the call stack of the repro for comparison: 0xd82440 vs 0xd824a4.

Could the image base be different because this is a different process? That's a good point, but that's probably not the case. Usually the image base is reused for efficiency, and in this case the addresses are only 0x64 bytes apart. If the module were mapped at a different address, the difference would be aligned to the 4 KiB page size.

So this is a different call to fgets. The good news is that both calls are made from the same function, which is called from nvmlInit_v2. You can tell because the return address of the function calling fgets is 0xd982c0 in both cases.

Fun time to dive into assembly! Let’s start from nvmlInit_v2.
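
The command is simply (output omitted here):

```
(gdb) disas nvmlInit_v2
```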

Right above 0xd982c0, there is a BL to 0xd823d0, which must be the function call we’re looking for.

One disadvantage of gdb compared to the Windows debuggers (ntsd/cdb/kd/windbg) is that the disas command (uf in the Windows debuggers) doesn't work if there is no matching symbol at the target address. We need to use the x command instead.
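
For example, to disassemble the code at 0xd823d0 that the BL above points to (the instruction count is arbitrary):

```
(gdb) x/100i 0xd823d0
```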

Alright, we found the call to fgets that is stuck in the repro. Well, we could enjoy finding out why it went down a different path with/without systemd, but not today. Let’s focus on the original issue.

The next thing to find out is what this fgets call is for. Before asking AI for help, let's just spend some time reading the assembly. The hint is what happens before fgets is called. Around there, we can see calls to standard functions like getpid, sprintf, popen, strstr, etc. If you have some experience in C programming, you can easily draw a picture of the behavior. Here's my thinking process (with a sketch of the inferred pattern after the list):

  • There's a popen before fgets. fgets is probably there to read the output of the process spawned by this popen.

  • Before popen, the function calls sprintf. This must be to construct a command-line string to pass to popen.

  • getpid before popen is there to embed the current process's PID into the command-line string.
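
Putting these together, the inferred pattern looks roughly like this. It's a sketch, not NVML's actual code, and the command template is a placeholder that we pin down below.

```
/* a sketch of the inferred pattern, not NVML's actual code */
#include <stdio.h>
#include <unistd.h>

static void run_helper_command(void)
{
    char cmd[256];
    char line[256];

    /* the real format string is recovered from the binary below;
       "<command> %d" is just a placeholder here */
    sprintf(cmd, "<command> %d", (int)getpid());

    FILE *fp = popen(cmd, "r");     /* spawn a child process */
    if (fp != NULL) {
        /* read the child's output -- this is the fgets that hangs in the repro */
        while (fgets(line, sizeof(line), fp) != NULL) {
            /* strstr(...) etc. to inspect each line */
        }
        pclose(fp);
    }
}
```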

What does this mean? Always step back and remember the original issue. The problem is that fgets never returns. And we just found out that nvmlInit_v2 spawns a process and reads its output via fgets, which gets stuck. There is no documentation of this behavior in NVIDIA's reference. Who knew a simple initialization function spawns a child process!

The next question is, of course: what process is spawned? You could go back to the shell and run ps to find out, but hold on, let's stick to the code and figure out the command-line string first. sprintf is our next target because it looks like it constructs the command to execute.

I believe you know sprintf. To understand the command being constructed, it's best to look at the format string, which is passed to sprintf as the second argument. By the way, one advantage of ARM64 compared to x86 is its fixed-size instructions. Another advantage is a straightforward calling convention: the second argument is passed in the X1 register. Let's see that value by looking at the assembly around the call to sprintf.

If you're not familiar with ARM64 instructions, the specs can be found here. If you already know ARM64 instructions, you'll easily see that the second argument is 0x1233000 + 0xeb0. Let's check what's there.
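
gdb can evaluate the expression directly; the output (omitted here) is the format string:

```
(gdb) x/s 0x1233000 + 0xeb0
```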

This is apparently a template for a command-line string, and it explains why getpid is called.

Now, it’s time to confirm this finding. Start Dynamo and see if we’re really running this command.
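
A quick way to catch it right after a restart (the grep pattern is illustrative):

```
$ sudo systemctl restart dynamo
$ ps -ef | grep -v grep | grep lsof
```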

We're indeed running that command! The next thing to do is, of course, to run the exact same command manually.

It just worked. Weird. Anyway, we now know the root cause is not in Dynamo but in lsof, which doesn't finish if it's spawned from a Dynamo that was itself spawned via systemd. What's next? Should we attach gdb to the lsof process to find out? Well, I could, but this time I took a different approach: strace. And it worked very well.

Root cause

Let's restart dynamo and run strace on the lsof process.
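
A sketch of what I did; since the lsof child is stuck, it stays around long enough to attach to:

```
$ sudo systemctl restart dynamo
$ pgrep -a lsof
$ sudo strace -p <lsof pid> -o /tmp/lsof.strace
```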

Hmm, we found something really unusual. In the lsof process, a bunch of close calls are failing.

close(503905516) = -1 EBADF
close(503905517) = -1 EBADF
…

So, is it finally time to debug lsof? Again, I could, but to be more practical and save time, I thought it was time to ask AI about the possible cause, and here's the actual answer I got.

There is something I haven't shown yet: the service definition of Dynamo.
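
I won't paste it verbatim; a trimmed-down sketch with the relevant settings looks like this (the paths match the commands above, the rest is illustrative):

```
# /etc/systemd/system/dynamo.service -- a trimmed-down sketch, not the verbatim unit
[Unit]
Description=Dynamo node service

[Service]
ExecStart=/opt/dynamo/noded -config=/opt/dynamo/config.json
Restart=always
LimitNOFILE=infinity

[Install]
WantedBy=multi-user.target
```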

Do you see anything unusual? It must be LimitNOFILE=infinity. There was no particular reason I added it. According to the AI's answer, an old implementation of lsof guesses the maximum fd from getrlimit(RLIMIT_NOFILE). In this case, that is RLIMIT_NOFILE = RLIM_INFINITY = 2^63-1, meaning lsof tries to close every fd above its stdio fds, all the way up to 2^63-1. This is why Dynamo gets stuck.
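
In other words, the problematic pattern is roughly the following (a sketch, not lsof's actual code):

```
/* a sketch of the problematic pattern, not lsof's actual code */
#include <sys/resource.h>
#include <unistd.h>

static void close_inherited_fds(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return;

    /* with LimitNOFILE=infinity this limit is RLIM_INFINITY (2^63-1),
       so the loop effectively never finishes */
    for (rlim_t fd = 3; fd < rl.rlim_cur; fd++)
        close((int)fd);
}
```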

Alright, it seems we have found the root cause. The next move is to confirm it. It's easy: let's comment out LimitNOFILE=infinity and run Dynamo.
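
Something like:

```
$ sudo systemctl edit --full dynamo     # comment out the LimitNOFILE=infinity line
$ sudo systemctl daemon-reload
$ sudo systemctl restart dynamo
$ sudo journalctl -u dynamo -f
```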

I won't put the command output here because it's obvious. The result: it actually worked! Without LimitNOFILE=infinity, Dynamo just works normally.

What's the conclusion here? Is it a bug in lsof? Basically yes, but remember what the AI said: this is an old implementation detail of lsof. That implies the latest lsof behaves differently. And remember, I found this issue when I ran Dynamo on Amazon Linux ARM64, which we only recently started supporting. We've never seen this issue on x64 machines. Let's check the version of lsof.
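
lsof prints its version and build information with:

```
$ lsof -v
```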

It's 4.94.0. How about an x64 machine where Dynamo runs without any problem?

Oops, it’s different: 4.94.0 vs 4.95.0!

Fortunately, unlike NVML, as you can see in the output above, lsof is open-source. In fact, the release notes of 4.95.0 mention a fix for a bug that sounds pretty much like our issue.

So I built lsof from source to see whether version 4.95.0 really fixed our issue (after uncommenting the LimitNOFILE=infinity line, of course). Unfortunately, it didn't. So I tried several versions to find out which one fixed the issue. By the way, here are the commands to build lsof.
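
Roughly (tags as named in the lsof-org/lsof repository; the exact build steps may differ between versions):

```
$ git clone https://github.com/lsof-org/lsof.git
$ cd lsof
$ git checkout 4.98.0            # pick the tag you want to test
$ ./Configure -n linux && make   # classic build script; newer tags may use autotools instead
$ ./lsof -v
```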

Finally, I confirmed that the issue is gone with 4.99.0, while it still exists with 4.98.0. Let's check the release notes of 4.99.0.

Issue 281 must be our issue. To really confirm it, I could revert that patch and see, but I didn't go that far. Another mystery is why lsof 4.95.0 on the Ubuntu x64 instance doesn't hit this issue. That's a very good question. I'll dive into these another time.

Let’s recap:

  • We saw an issue where Dynamo got stuck when run through systemd on Amazon Linux ARM64.

  • Dynamo calls nvmlInit_v2, which spawns lsof, which gets stuck.

  • lsof 4.94.0 has a bug where it attempts to close every fd above its stdio fds.

  • Dynamo's service definition specifies LimitNOFILE=infinity, meaning the upper limit of lsof's close attempts is 2^63-1.

Considering all of the above, what's the fix? I simply decided to specify LimitNOFILE=1048576 instead of infinity. Again, there was no particular reason I chose infinity in the first place. Other possible approaches are to apply this setting only if the version of lsof on the system is lower than 4.99.0, or to set a custom limit only while calling nvmlInit_v2, but I prefer the simplest fix.
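
For reference, it's a one-line change in the service definition (a drop-in override would work just as well):

```
[Service]
LimitNOFILE=1048576
```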

This is just another day of debugging at Kinesis. Happy debugging!
