FS#17389 - [openssh] SSH session hangs, when remote machine reboots.
Attached to Project:
Arch Linux
Opened by Leo Borealis (Architect) - Saturday, 05 December 2009, 03:36 GMT
Last edited by Dave Reisner (falconindy) - Sunday, 04 November 2012, 21:38 GMT
Details
Description:
SSH session hangs when the remote machine reboots. Instead, it should be disconnected by the remote machine (Connection to 192.168.0.80 closed by remote host.). The disconnect works properly with the Arch Linux network install image 2009.08.
Additional info:
* package version(s)
Name : openssh
Version : 5.3p1-2
* config and/or log files etc.
Steps to reproduce:
Log in to a remote machine running Arch Linux via ssh. Become a superuser. Say reboot.
Closed by Dave Reisner (falconindy)
Sunday, 04 November 2012, 21:38 GMT
Reason for closing: Fixed
Additional comments about closing: Original bug is fixed. If it's systemd related, it's a dupe of FS#31250
and good tip: http://www.mail-archive.com/arch-general@archlinux.org/msg05408.html
If it is not a bug, why does the ssh session close normally when the remote machine is running the Arch Linux installation media or Debian?
The "why" is explained in the mails. Basically, the sshd child processes are not stopped (only the daemon is); this is a feature of sshd. Then the network is shut down, which is like cutting the wire.
The Arch Linux installation media does not set up or start the network (you did that manually). When the machine reboots, the killall5 commands in rc.shutdown kill _all_ processes while the network is still up. This is why your connection is closed by the remote host ;)
S2: "Do not stop the network in the loop, just omit it, and stop it after the killall5 commands. This also ensures that all daemons and their children are
stopped, and then the network is shut down."
I can't find it now, but some time ago I created a trivial patch for this :(
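#!/bin/bash
# Kill every process of each user logged in on a pseudo-terminal (pts/N).
# Note: this matches all pts logins, not only ssh sessions.
SSH_USERS=$(/usr/bin/who | /bin/awk '/pts\/[0-9]/ {print $1}' | /usr/bin/sort -u)
for user in $SSH_USERS; do
    /usr/bin/wall "Killing ssh user: $user"
    /usr/bin/skill -KILL -u "$user"
done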
Delaying or even entirely skipping network shutdown might be desirable for a number of reasons, but it must be implemented where it belongs, such as in the network, net-profiles, and net-auto scripts.
Two questions we need to ask ourselves:
1) Why would anyone even want to shut down the network on shutdown?
2) Should our init scripts know the difference between boot and start, or between stop and shutdown?
1) I guess that is not necessary in all cases.
2) Can be useful in some scenarios.
This is somewhat surprising. One would expect that, with network teardown skipped, terminating all sshd processes should be enough for a graceful termination of sessions. Somehow that is not the case.
not robust enough and carries some external dependencies which could
make it fail in some cases. Let me explain.
The hanging ssh sessions problem occurs when, for whatever reason, the
network goes down during shutdown before all sshd sessions are
terminated. So any solution to it should guarantee that sshd sessions
are closed before network goes down. Also, it should prevent creating
new sessions, so the master sshd should be stopped first. A solution
should be robust enough, that is, it should not depend on the order in
which daemons (including "network") are stopped.
Thomas Bächler asked two important questions earlier in this thread:
1) Why would anyone even want to shut down the network on shutdown?
As it is now, the network is stopped during shutdown. There is an option
(NETWORK_PERSIST) to prevent this, for good reasons. Obviously, we cannot
rely on this option, since it is just an option.
2) Should our init scripts know the difference between boot and start,
or between stop and shutdown?
In my opinion, they should not. There is a well defined mechanism in
"initscripts" to attach additional actions to run level changes: hook
functions (see below).
I think it is fragile to depend on the order of daemons listed in
/run/daemons/. If (today) one uses "network", yes sshd will go down
before network with Florian's solution. But what about different
networking setups or future changes in this area?
In the spirit of my analysis above, I suggest a different solution: a
hook script installed in /etc/rc.d/functions.d/ registered to the
"shutdown_start" phase. I have attached such a script I have been
successfully using for a while. The script should be a new component of
the sshd package, that is why my attachment is not a diff.
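For readers without the attachment, here is a rough sketch of what such a hook file could look like. It is not the attached script. The file name, the pidfile path /run/sshd.pid, the helper names, and the exact add_hook call are my assumptions (add_hook is the registration function as I remember it from /etc/rc.d/functions), so treat it as an illustration of the idea only: stop the master first, then send TERM to the remaining per-session sshd processes while the network is still up.

```bash
#!/bin/bash
# Hypothetical /etc/rc.d/functions.d/sshd-sessions (file name is an assumption).

SSHD_RC=${SSHD_RC:-/etc/rc.d/sshd}          # rc.d script for the master daemon
SSHD_PIDFILE=${SSHD_PIDFILE:-/run/sshd.pid} # assumed pidfile location

# Outside initscripts add_hook is not defined; use a stub so the file still loads.
command -v add_hook >/dev/null 2>&1 ||
    add_hook() { echo "would register $2 for $1"; }

# Print every PID from the argument list except the master PID (first argument).
session_pids_excluding() {
    local master=$1 pid
    shift
    for pid in "$@"; do
        [ "$pid" = "$master" ] || echo "$pid"
    done
}

# Hook body: stop the master so no new sessions appear, then ask the
# per-session sshd processes to exit so clients see a clean close.
sshd_sessions_down() {
    local master="" pids
    [ -r "$SSHD_PIDFILE" ] && read -r master < "$SSHD_PIDFILE"
    [ -x "$SSHD_RC" ] && "$SSHD_RC" stop
    pids=$(session_pids_excluding "$master" $(pgrep -x sshd))
    [ -n "$pids" ] && kill -TERM $pids
    return 0
}

# Run at the very start of shutdown, before any daemon or the network is stopped.
add_hook shutdown_start sshd_sessions_down
```

Because the hook runs at the start of shutdown, it does not depend on where sshd or the network sits in the daemon list, and it leaves the behaviour of "/etc/rc.d/sshd stop" untouched.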
The real question is why NETWORK_PERSIST has no effect (killall kills something before sshd?). Moreover, it is still specific to /etc/rc.d/network. Then again, everything started up by initscripts should go down at reboot/poweroff via the same initscripts.
In my understanding, the only clean solution can be achieved using cgroups: if a server is started after the network comes up, it and all its descendants will go down before the network does.
I agree with his analysis, and in the meantime Dobcsanyi's solution will do.
Leonid: I do not wish to kill all sshd processes in the stop case of /etc/rc.d/sshd as many users (including myself) make use of sshd's behavior to leave current sessions open even after you've killed the main daemon.
It does not solve the problem of killing user processes before daemons in general, but I don't think that is something we can easily do anyway.
I propose the following: Use some bash magic to provide a shutdown function to each rc.d script that defaults to just calling the stop function. Then, any rc.d script can override it. From rc.shutdown, we then call shutdown instead of stop.
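To make the proposal concrete, here is a minimal sketch of the fallback. Everything in it (install_default_shutdown, run_action, the two example daemons) is made up for illustration and is not existing initscripts code. The default is installed first, so a script that defines its own shutdown simply overrides it.

```bash
#!/bin/bash
# Install the default: shutdown just calls stop. This must run before the
# daemon script defines its functions, so that a later definition wins.
install_default_shutdown() {
    shutdown() { stop; }
}

# Run an action only if it exists as a shell function, so a missing
# definition can never fall through to an external command of that name.
run_action() {
    case "$(type "$1" 2>/dev/null)" in
        *function*) "$1" ;;
        *) echo "no such action: $1" >&2; return 1 ;;
    esac
}

# A daemon without its own shutdown: shutdown falls back to stop.
# The ( ) body runs in a subshell, so definitions do not leak out.
plain_daemon() (
    install_default_shutdown
    stop() { echo "plain: stop daemon"; }
    run_action "$1"
)

# An sshd-like daemon: stop leaves sessions alone, shutdown also ends them.
sshd_like_daemon() (
    install_default_shutdown
    stop() { echo "sshd: stop master, keep sessions"; }
    shutdown() { stop; echo "sshd: end remaining sessions"; }
    run_action "$1"
)
```

With this, "sshd_like_daemon stop" behaves as it does today, and only "sshd_like_daemon shutdown" (what rc.shutdown would call) does the extra work, while "plain_daemon shutdown" is the same as "plain_daemon stop".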
I'll let you figure out the details.
1. The "bug" was filed against ssh, so why does network management suddenly need fixing, e.g. by not shutting down the network? As Thomas already said, none of this is generic; it is limited to /etc/rc.d/network. What about wireless servers with netcfg?
2. As long as there is no dependency logic in the initscripts, and network (or netcfg) is started _before_ sshd, why should network (or netcfg) care at all about sshd with its users and forks?
3. Why is this problem thought to be reboot/shutdown related? It's a generic issue. From the point of view of sshd, /etc/rc.d/sshd stop == shutdown. If you want to kill only the master daemon, why not kill it explicitly? If sessions are not cleaned up by stopping sshd, it's a real bug IMHO.
4. Is it architecturally sane to manage daemons through hooks in initscripts? Sshd has its own boot script. I agree with Tom, but really, why can't one just mount an empty cgroup hierarchy from rc.sysinit alongside /run, which can then be populated/used by individual boot scripts as necessary (for instance, sshd/httpd, but not alsa/iptables/ntp)?
1. Because ssh might not be the only daemon that has problems when the network is shut down prior to killing its children.
2. Nobody suggested that.
3. It's not a bug, it's a feature. Really. A useful one at that.
4. Please provide a patch.
No, no, no, no, no, no! We did this once, and people almost got killed (I was one of the potential killers).
Let's say you upgrade your system, and you want to restart sshd, so it utilizes a bugfix in openssl (for example). So you run "rc.d restart openssh" or "rc.d stop openssh && rc.d start openssh". What happens is this: Your ssh session gets killed (along with everyone else's) and you don't see any output from the sshd start. What else happened to people? sshd failed to start (maybe they changed their config file and screwed up, maybe something else broke) and they got LOCKED OUT from their machine (their headless server that is several hundred kilometers away). No way to get back in. Doing this is pretty common, and a sysadmin expects that his sshd sessions will remain open during a restart of the master daemon. This functionality has priority over any inconvenience, like the problem in this bug.
Also, openssh keeps the paths where it was logged into busy, making them fail umount during reboot (for example "umount /boot: target is busy" if one reboots when pwd==/boot/* and /boot is on a separate partition).