feat: [NODE-1810, NODE-1728] Hopefully resolve udev race #8368
base: master
Override systemd-fsck for var to drop the BindsTo on the underlying crypt device. We suspect the device flapping on udev can lead to the mount locking up.
    sleep 2

    systemctl reboot
Are we sure that a "soft" reboot works? Could we schedule a harder fallback, such as a SysRq trigger, in case the reboot doesn't go through?
I have been reproducing this with eero/systemd-udev...eero/systemd-udev-break and bazel run //rs/tests/node:kill_start_long_test -- --quiet.
This seems to always work to unstick the node. I don't think it is a "soft" reboot unless we do something special? In the end, all we really need is to remount the drive now that fsck has cleared, but I fought with systemd enough that taking the time loss of the reboot seemed worth it.
Well, you know better :) It's just that systemctl reboot really is a soft reboot, in the sense that it tries to shut everything down nicely, and if you have, e.g., a zombie process or whatnot, it will probably just hang while rebooting...
Ahh, I'm sorry, I was looking at the wrong thing. systemd has its own soft-reboot because of course it does :).
I will add more power to the reboot in case it does get stuck 👍.
When --force is used, "shutdown of all running services is skipped, however all processes are killed and all file systems are unmounted or mounted read-only, immediately followed by the system reboot."
A second --force is even stronger, skipping all the process/mount cleanup too, but I'm not sure that we're too worried about getting into this state.
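The escalation discussed in this thread could be sketched roughly as below. This is not the code from the PR: the `DRY_RUN` and `GRACE_SECONDS` knobs, the `run` helper, and the wait times are hypothetical, and the final SysRq step is only the fallback suggested earlier in the review. Only `systemctl reboot`, `--force`, and the `b` SysRq command are real interfaces.

```shell
#!/bin/sh
# Sketch only: escalate from a graceful reboot to progressively harsher ones.
# DRY_RUN and GRACE_SECONDS are hypothetical knobs, not taken from the PR.

DRY_RUN="${DRY_RUN:-1}"              # default to printing, so nothing reboots by accident
GRACE_SECONDS="${GRACE_SECONDS:-60}" # how long to wait before escalating

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

escalating_reboot() {
    # 1. Normal reboot: stops services cleanly, but can hang on a stuck unit.
    run systemctl reboot
    run sleep "$GRACE_SECONDS"

    # 2. --force: skips service shutdown, still kills processes and unmounts.
    run systemctl reboot --force
    run sleep "$GRACE_SECONDS"

    # 3. --force --force: reboots without process or mount cleanup.
    run systemctl reboot --force --force
    run sleep "$GRACE_SECONDS"

    # 4. Last resort: ask the kernel to reboot immediately via SysRq 'b'
    #    (requires root; no sync or unmount happens).
    run sh -c 'echo b > /proc/sysrq-trigger'
}
```

With the default `DRY_RUN=1`, calling `escalating_reboot` only prints the commands it would run, which makes the ordering easy to review before enabling it on a node.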
@@ -0,0 +1,16 @@
# This unit is overridden to remove BindsTo
Could you add an explanation of why we want to remove BindsTo?
How does this look?
I can add more detail to the unit too, if you think it needs it. The theory goes something like this:
While we are opening the crypt device, udev processes a few triggers that bring the device online and offline. When the device first comes online, the fsck starts (early from the main path) and is then killed, and something about this interaction locks the two up.
I basically want to keep this effect from bringing the fsck down.
"Units can suddenly, unexpectedly enter inactive state for different reasons: the main process of a service unit might terminate on its own choice, the backing device of a device unit might be unplugged or the mount point of a mount unit might be unmounted without involvement of the system and service manager."
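For context on why the whole unit is replaced rather than patched: systemd.unit(5) notes that dependency settings cannot be reset to an empty list from a drop-in, so removing BindsTo means shipping a full copy of the unit. The sketch below writes such a copy. It is not the 16-line file from this PR, which is not shown here; the directives are modelled on upstream `systemd-fsck@.service` as an assumption, and the `write_fsck_override` helper, the target directory, and the device instance name are hypothetical.

```shell
#!/bin/sh
# Sketch only: write a full replacement for an fsck unit instance, minus BindsTo=.
# The helper name, directory, and instance name are hypothetical; the directives
# are modelled on upstream systemd-fsck@.service, not on the PR's actual file.

write_fsck_override() {
    unit_dir="$1"   # e.g. /etc/systemd/system
    device="$2"     # escaped device instance name (hypothetical)
    mkdir -p "$unit_dir"
    cat > "$unit_dir/systemd-fsck@${device}.service" <<EOF
# This unit is overridden to remove BindsTo
[Unit]
Description=File System Check on %f
DefaultDependencies=no
# Upstream binds this unit to %i.device. That binding is left out here so
# that a device flapping under udev does not stop an in-progress fsck.
Conflicts=shutdown.target
After=%i.device local-fs-pre.target
Before=shutdown.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/systemd-fsck %f
TimeoutSec=infinity
EOF
}
```

For example, `write_fsck_override /etc/systemd/system dev-mapper-example` would place an instance-specific unit that takes precedence over the packaged template. The ordering on the device (`After=`) is kept, so the fsck still waits for the device to appear; only the stop-propagation from the device is dropped.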
    "Var is tainted on startup" \
    "gauge"

    sleep 2
Does anyone know how long it takes for a metric to be picked up?
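How quickly a metric is picked up depends on how it is collected. As an assumption not confirmed anywhere in this PR, if the metric goes through a Prometheus node_exporter textfile collector, the file is only read when a scrape happens, so a fixed `sleep 2` helps only if a scrape lands inside that window. A minimal sketch of writing such a gauge atomically is below; the `write_gauge` helper, the directory, and the metric name are hypothetical.

```shell
#!/bin/sh
# Sketch only: atomically write a gauge in Prometheus text exposition format.
# The helper name, directory, and metric name are hypothetical.

write_gauge() {
    dir="$1"; name="$2"; value="$3"; help="$4"

    # Write to a hidden temp file first; the collector only reads *.prom files.
    tmp="$(mktemp "$dir/.${name}.XXXXXX")"
    {
        printf '# HELP %s %s\n' "$name" "$help"
        printf '# TYPE %s gauge\n' "$name"
        printf '%s %s\n' "$name" "$value"
    } > "$tmp"

    # mktemp creates the file with mode 0600; make it readable by the collector.
    chmod 644 "$tmp"

    # rename() within one directory is atomic, so a scrape never sees a partial file.
    mv "$tmp" "$dir/${name}.prom"
}
```

A call such as `write_gauge /path/to/textfile_dir var_tainted 1 "Var is tainted on startup"` would produce a three-line `.prom` file. Whether the value survives the reboot depends on where that directory lives, which this thread does not state.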