A long-running `nix build` got stuck after sleep, and would not die cleanly

My Mac woke up with a half-dead `compile-smoke` build still hanging around. Killing the obvious client process was not enough; the cleanup only finished once I reset the `launchd`-managed Nix service.

  • nix
  • macos
  • launchd
  • debugging

I do not trust macOS sleep very much around long-running builds, and this time it earned that distrust again.

I came back to a machine that had been asleep for a while and found this still sitting in a terminal:

nix build -L -vv -f . llvmPackages_21.flang.passthru.tests.compile-smoke

It was not making progress. It was not exiting. It was just there, half alive.

The next thing I touched in Nix started failing too. `nix upgrade-nix` gave me this:

error: opening lock file '/nix/var/nix/profiles/default.lock': Permission denied

At that point the immediate problem was not “understand the Nix process model”. It was “make this build die cleanly so the machine becomes usable again”.

First pass: kill the obvious thing

I started with the foreground `nix build` and the most obvious child processes.

That was partly laziness. I already suspected the real owner was further back in the daemon chain, but a quick local cleanup is cheaper than a full service reset if it works.

It did not work.

  • The foreground `nix build` could die.
  • Some builders stayed around.
  • Some processes turned into zombies.
  • A few `nix-daemon` processes had the same name but were clearly not the same layer.

Once that starts happening, looking for the ugliest PID on screen is not a great strategy anymore.

The ps output that mattered

The useful command was this one:

ps -Ao pid,ppid,user,etime,state,command | rg 'compile-smoke|default-builder\.sh|nix-daemon'

At one point the output looked roughly like this (elapsed-time column trimmed):

12994     1 root      Ss   /nix/.../bin/nix-daemon
21459 12994 _nixbld2  Z    <defunct>
68696 46807 _nixbld3  Ss   bash -e ... default-builder.sh

That was enough.

  • `Z` means the child is already dead and has not been reaped by its parent yet.
  • `_nixbld*` plus `default-builder.sh` is the layer that is still actually running build logic.
  • Multiple `nix-daemon` lines do not automatically mean “one service, many copies”. Some of them are just leftover parents from different chains.
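
That reading can be turned into a small filter. This is a minimal sketch, not something I ran during the incident: `classify_nix_procs` is a hypothetical helper name, and it assumes the six-column layout produced by the `ps` invocation above (pid, ppid, user, etime, state, command).

```shell
# classify_nix_procs (hypothetical helper): read lines shaped like
#   pid ppid user etime state command...
# on stdin and label each relevant one by the layer it most likely belongs to.
classify_nix_procs() {
  awk '
    # Skip the ps header line.
    $1 == "PID" { next }
    # Zombies first: their command column no longer says anything useful.
    $5 ~ /^Z/ { print $1, "zombie (parent " $2 " has not reaped it)"; next }
    # A build user running default-builder.sh is still doing build work.
    $3 ~ /^_nixbld/ && /default-builder\.sh/ { print $1, "live builder"; next }
    # Anything else mentioning nix-daemon: report its parent to tell chains apart.
    /nix-daemon/ { print $1, "daemon (parent " $2 ")"; next }
  '
}
```

Usage would be `ps -Ao pid,ppid,user,etime,state,command | classify_nix_procs`. A daemon whose parent is PID 1 is the one `launchd` started; a daemon with any other parent sits further down some chain.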

The only model I actually needed was this:

launchd -> nix-daemon -> _nixbldN -> default-builder.sh -> build/test subprocesses

Once I started looking at it that way, the earlier cleanup attempts made sense too:

  • killing the client does not necessarily kill the builder
  • killing one child builder does not guarantee the parent chain is gone
  • zombies are not something you fix by sending one more signal: they are already dead, and only their parent reaping them (or exiting) clears them
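
To see which chain a given builder or zombie hangs off, the parent links in a `ps` snapshot can be followed upward. The sketch below is an illustration under assumptions: `chain_of` is a hypothetical helper, it expects the same six-column `ps` layout as above, and it only reports the first word of each command.

```shell
# chain_of PID (hypothetical helper): read lines shaped like
#   pid ppid user etime state command...
# on stdin and print the parent chain from PID upward, one hop per line.
chain_of() {
  awk -v start="$1" '
    # Remember parent, user and first command word for every process seen.
    { ppid[$1] = $2; user[$1] = $3; cmd[$1] = $6 }
    END {
      pid = start
      # Stop when the parent is missing from the snapshot, or on a loop
      # (for example a process that lists itself as its own parent).
      while ((pid in ppid) && !(pid in seen)) {
        seen[pid] = 1
        print pid, user[pid], cmd[pid]
        pid = ppid[pid]
      }
    }
  '
}
```

For the builder in the output above, that would be `ps -Ao pid,ppid,user,etime,state,command | chain_of 68696`. The last line printed is the process that actually has to go away, or be restarted, before everything below it is cleaned up.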

`nix daemon` was the wrong move

I also tried the dumb thing and ran `nix daemon` manually.

That got me:

error: cannot bind to socket at '/nix/var/nix/daemon-socket/socket': Address already in use

That error is not subtle. Something else already owns the socket.

On macOS, the real owner is supposed to be `launchd`, through `org.nixos.nix-daemon`, not a daemon process I start by hand in a random shell. So this was not a recovery path. It was just another sign that the cleanup had to happen at the service layer.
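
A quicker way to confirm who holds the socket is to ask `launchd` and `lsof` directly. This is a sketch under assumptions, not what I ran at the time: `nix_socket_owner` is a hypothetical wrapper, and it assumes the standard multi-user install, with the service label and socket path taken from the messages above.

```shell
# nix_socket_owner (hypothetical helper): show what launchd knows about the
# Nix daemon service and which processes hold the daemon socket open.
# Assumes macOS and the service label / socket path quoted in this post.
nix_socket_owner() {
  socket_path=/nix/var/nix/daemon-socket/socket

  # Service state as launchd sees it (typically includes state and pid).
  sudo launchctl print system/org.nixos.nix-daemon

  # UNIX domain sockets currently open, filtered to the daemon socket.
  # grep exits non-zero if nothing holds it, which is itself informative.
  sudo lsof -U | grep -F "$socket_path"
}
```

If the PID that `lsof` reports does not match the one `launchd` reports for the service, the socket is being held by a leftover, which points back at a service-level reset rather than another hand-started daemon.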

The layer that actually needed resetting

Once it was clear that I was dealing with a stuck daemon/builder chain, the useful commands were no longer the local ones around the foreground client. They were the ones that reset the service and clear the leftovers:

sudo launchctl bootout system /Library/LaunchDaemons/org.nixos.nix-daemon.plist
sudo pkill -9 -x nix-daemon
sudo pkill -9 -f default-builder.sh
sudo launchctl bootstrap system /Library/LaunchDaemons/org.nixos.nix-daemon.plist

The commands themselves are not the interesting part. The interesting part is the layer they act on.

My mistake was not that I failed to identify the service layer. I had a decent guess early on. The mistake was trying to save a step and hoping local kills would be enough.

That is the part I would keep from this incident: if a long-running Nix build wakes up from macOS sleep in a half-dead state, do not keep treating it like an ordinary foreground job. Clean up the obvious client if you want, but if the daemon/builder chain is still alive, go back to `launchctl` and reset the service properly.

Also: I still do not trust macOS sleep around builds that matter.