A long-running `nix build` got stuck after sleep, and would not die cleanly
My Mac woke up with a half-dead `compile-smoke` build still hanging around. Killing the obvious client process was not enough; the cleanup only finished once I reset the `launchd`-managed Nix service.
I do not trust macOS sleep very much around long-running builds, and this time it earned that distrust again.
I came back to a machine that had been asleep for a while and found this still sitting in a terminal:
nix build -L -vv -f . llvmPackages_21.flang.passthru.tests.compile-smoke
It was not making progress. It was not exiting. It was just there, half alive.
The next thing I touched in Nix started failing too. `nix upgrade-nix` gave me this:
error: opening lock file '/nix/var/nix/profiles/default.lock': Permission denied
At that point the immediate problem was not “understand the Nix process model”. It was “make this build die cleanly so the machine becomes usable again”.
First pass: kill the obvious thing
I started with the foreground `nix build` and the most obvious child processes.
That was partly laziness. I already suspected the real owner was further back in the daemon chain, but a quick local cleanup is cheaper than a full service reset if it works.
It did not work.
- The foreground `nix build` could die.
- Some builders stayed around.
- Some processes turned into zombies.
- A few `nix-daemon` processes had the same name but were clearly not the same layer.
Once that starts happening, looking for the ugliest PID on screen is not a great strategy anymore.
The `ps` output that mattered
The useful command was this one:
ps -Ao pid,ppid,user,etime,state,command | rg 'compile-smoke|default-builder\.sh|nix-daemon'
At one point the output looked roughly like this:
12994 1 root Ss /nix/.../bin/nix-daemon
21459 12994 _nixbld2 Z <defunct>
68696 46807 _nixbld3 Ss bash -e ... default-builder.sh
That was enough.
- `Z` means the child is already dead and has not been reaped yet.
- `_nixbld*` plus `default-builder.sh` is the layer that is still actually running build logic.
- Multiple `nix-daemon` lines do not automatically mean “one service, many copies”. Some of them are just leftover parents from different chains.
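To pull only the zombies out of a listing like this, together with the parent that has not reaped them, a small filter is enough. This is a sketch rather than something I ran during the incident: `zombies_with_parents` is a hypothetical helper name, and it assumes three whitespace-separated columns (PID, parent PID, state) on stdin.

```shell
# Hypothetical helper: read "pid ppid state" rows on stdin and print
# "zombie_pid parent_pid" for each row whose state starts with Z.
zombies_with_parents() {
  awk '$3 ~ /^Z/ { print $1, $2 }'
}

# Example with rows shaped like the listing above.
printf '%s\n' '12994 1 Ss' '21459 12994 Z' '68696 46807 Ss' | zombies_with_parents
# prints: 21459 12994
```

On a live system the input would come from something like `ps -A -o pid= -o ppid= -o state=`, where the trailing `=` suppresses the header row.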
The only model I actually needed was this:
launchd -> nix-daemon -> _nixbldN -> default-builder.sh -> build/test subprocesses
Once I started looking at it that way, the earlier cleanup attempts made sense too:
- killing the client does not necessarily kill the builder
- killing one child builder does not guarantee the parent chain is gone
- zombies are not something you fix by sending one more signal: they are already dead, and they only disappear once their parent reaps them or exits
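To see which parent chain a leftover process hangs off, it helps to walk it upward instead of eyeballing PPIDs. A minimal sketch, assuming `pid ppid` pairs on stdin: `ancestry` is a hypothetical name, and the 64-hop cap is an arbitrary guard against loops in bad input.

```shell
# Hypothetical helper: read "pid ppid" rows on stdin and print the chain
# from the given PID up to PID 1, or up to the last parent the input knows about.
ancestry() {
  awk -v start="$1" '
    { parent[$1] = $2 }
    END {
      pid = start
      chain = pid
      # Stop at PID 1 (launchd on macOS), at an unknown PID, or after 64 hops.
      for (hops = 0; pid != 1 && (pid in parent) && hops < 64; hops++) {
        pid = parent[pid]
        chain = chain " <- " pid
      }
      print chain
    }'
}

# Example with the zombie and daemon from the listing above.
printf '%s\n' '12994 1' '21459 12994' | ancestry 21459
# prints: 21459 <- 12994 <- 1
```

A chain that stops before reaching `1` means the parent was not in the input, which is itself a hint that the listing was filtered too narrowly.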
`nix daemon` was the wrong move
I also tried the dumb thing and ran `nix daemon` manually.
That got me:
error: cannot bind to socket at '/nix/var/nix/daemon-socket/socket': Address already in use
That error is not subtle. Something else already owns the socket.
On macOS, the real owner is supposed to be `launchd`, through `org.nixos.nix-daemon`, not a daemon process I start by hand in a random shell. So this was not a recovery path. It was just another sign that the cleanup had to happen at the service layer.
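A gentler way to confirm that `launchd` still thinks it owns the service is to ask it directly. This is a sketch, not a step from the incident: `nix_daemon_service_info` is a hypothetical wrapper, the `LAUNCHCTL` override exists only so the command can be previewed without running it, and on some setups the call may need `sudo`.

```shell
# Hypothetical helper: ask launchd about the Nix daemon service in the system domain.
# Set LAUNCHCTL=echo to print the arguments instead of calling launchctl.
nix_daemon_service_info() {
  "${LAUNCHCTL:-launchctl}" print system/org.nixos.nix-daemon
}
```

Running `LAUNCHCTL=echo nix_daemon_service_info` shows the arguments that would be passed; without the override it calls the real `launchctl`.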
The layer that actually needed resetting
Once it was clear that I was dealing with a stuck daemon/builder chain, the useful commands were no longer the local ones around the foreground client. They were the ones that reset the service and clear the leftovers:
sudo launchctl bootout system /Library/LaunchDaemons/org.nixos.nix-daemon.plist
sudo pkill -9 -x nix-daemon
sudo pkill -9 -f default-builder.sh
sudo launchctl bootstrap system /Library/LaunchDaemons/org.nixos.nix-daemon.plist
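If this turns into a recurring chore, the same four commands can be wrapped so they can be previewed before anything gets killed. This is only a sketch: `run`, `reset_nix_daemon`, and the `DRY_RUN` switch are hypothetical names, and the `|| true` guards assume it is acceptable for the unload or the kills to find nothing to do.

```shell
# Hypothetical wrapper around the four commands above.
# DRY_RUN=1 prints each command instead of running it.
PLIST=/Library/LaunchDaemons/org.nixos.nix-daemon.plist

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

reset_nix_daemon() {
  # Unload the service first, so launchd is no longer managing the daemon
  # while it is being killed.
  run sudo launchctl bootout system "$PLIST" || true
  # pkill exits non-zero when nothing matches; that is fine here.
  run sudo pkill -9 -x nix-daemon || true
  run sudo pkill -9 -f default-builder.sh || true
  # Load the service again.
  run sudo launchctl bootstrap system "$PLIST"
}
```

`DRY_RUN=1 reset_nix_daemon` prints the four commands in order; calling `reset_nix_daemon` on its own runs them.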
The commands themselves are not the interesting part. The interesting part is the layer they act on.
My mistake was not that I failed to identify the service layer. I had a decent guess early on. The mistake was trying to save a step and hoping local kills would be enough.
That is the part I would keep from this incident: if a long-running Nix build wakes up from macOS sleep in a half-dead state, do not keep treating it like an ordinary foreground job. Clean up the obvious client if you want, but if the daemon/builder chain is still alive, go back to `launchctl` and reset the service properly.
Also: I still do not trust macOS sleep around builds that matter.