@Bownairo
Contributor

Override systemd-fsck for var to drop the BindsTo on the underlying crypt device. We suspect that the crypt device flapping under udev can cause the mount to lock up.

@github-actions github-actions bot added the feat label Jan 15, 2026
@Bownairo Bownairo changed the title feat: Hopefully resolve udev race feat: [NODE-1810, NODE-1728] Hopefully resolve udev race Jan 15, 2026
@Bownairo Bownairo marked this pull request as ready for review January 15, 2026 07:54
@Bownairo Bownairo requested a review from a team as a code owner January 15, 2026 07:54
@github-actions github-actions bot added the @node label Jan 15, 2026

sleep 2

systemctl reboot
Contributor

Are we sure that a "soft" reboot works? Could we schedule a harder fallback, such as a SysRq trigger, in case the reboot doesn't go through?

Contributor Author

I have been reproducing this with eero/systemd-udev...eero/systemd-udev-break, and bazel run //rs/tests/node:kill_start_long_test -- --quiet.

This (seemingly always) works to unstick the node. I don't think it is a "soft" reboot unless we do something special? In the end, all we really need is to remount the drive now that fsck has cleared, but I fought with systemd enough that taking the time loss of the reboot seemed worth it.

Contributor

Well, you know better :) It's just that systemctl reboot is really a soft reboot, in the sense that it tries to shut everything down nicely, and if you have, e.g., some zombie process or whatnot, it will probably just hang while rebooting...

Contributor Author

Ahh I'm sorry, I was looking at the wrong thing. systemd has their own soft-reboot because of course they do :).

I will add more power to the reboot, in case it does get stuck 👍.

Contributor Author

When --force is used, "shutdown of all running services is skipped, however all processes are killed and all file systems are unmounted or mounted read-only, immediately followed by the system reboot."

A second --force is even stronger, skipping all the process/mount cleanup too, but I'm not sure that we're too worried about getting into this state.
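
For reference, a rough sketch of what an escalating reboot could look like, assuming the script is still alive whenever an earlier attempt hangs. The function name reboot_with_escalation, the REBOOT_CMD override, and the 30 second default wait are hypothetical and not part of this PR; only the plain, --force, and double --force reboot behaviors come from the systemctl documentation.

```shell
#!/bin/sh
# Hypothetical helper (not from this PR): try progressively more forceful reboots.
# If a tier succeeds, the machine goes down and the later lines never run.
reboot_with_escalation() {
    # REBOOT_CMD is an illustrative override for dry runs; defaults to systemctl.
    cmd="${REBOOT_CMD:-systemctl}"
    # Seconds to give each tier before escalating (assumed default: 30).
    wait_secs="${1:-30}"

    # Tier 1: normal reboot, stops all units cleanly.
    "$cmd" reboot && sleep "$wait_secs"
    # Tier 2: skip service shutdown, but still kill processes and unmount.
    "$cmd" reboot --force && sleep "$wait_secs"
    # Tier 3: reboot immediately, skipping process and mount cleanup.
    "$cmd" reboot --force --force
}
```

Calling reboot_with_escalation 60 would give each tier a minute before moving on. Setting REBOOT_CMD=echo turns it into a dry run that only prints the commands it would have issued.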

@@ -0,0 +1,16 @@
# This unit is overridden to remove BindsTo
Contributor

Could you add an explanation why we want to remove BindsTo?

Contributor Author

How does this look?

I can add more detail to the unit too, if you think it would help. The theory goes something like this:
While we are opening the crypt device, udev processes a few triggers that bring the device online and offline. When the device first comes online, the fsck starts (early, from the main path) and is then killed, and something about this interaction locks the two up.

I basically want to prevent this effect from bringing the fsck down:

Units can suddenly, unexpectedly enter inactive state for different reasons: the main process of a service unit might terminate on its own choice, the backing device of a device unit might be unplugged or the mount point of a mount unit might be unmounted without involvement of the system and service manager.
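
To check on a node whether the override took effect, something like the following could be used. This is a minimal sketch: the helper name unit_binds_to and the SYSTEMCTL override are made up for illustration, and the unit name has to be supplied by the caller, since the exact fsck unit instance is not spelled out here.

```shell
#!/bin/sh
# Hypothetical helper (not from this PR): print the BindsTo dependencies that
# systemd has loaded for a unit. Exits 0 only if at least one entry remains.
unit_binds_to() {
    # SYSTEMCTL is an illustrative override for testing; defaults to systemctl.
    systemctl_cmd="${SYSTEMCTL:-systemctl}"
    # `show --property=BindsTo --value` prints just the property's value.
    binds=$("$systemctl_cmd" show --property=BindsTo --value "$1") || return 2
    printf '%s\n' "$binds"
    [ -n "$binds" ]
}
```

With the override in place, running unit_binds_to against the var fsck unit would be expected to print an empty line and return non-zero, whereas the stock unit would list the backing device unit.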

"Var is tainted on startup" \
"gauge"

sleep 2
Contributor Author

Does anyone know how long it takes for a metric to be picked up?


