FS#17389 - [openssh] SSH session hangs, when remote machine reboots.

Attached to Project: Arch Linux
Opened by Leo Borealis (Architect) - Saturday, 05 December 2009, 03:36 GMT
Last edited by Dave Reisner (falconindy) - Sunday, 04 November 2012, 21:38 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Thomas Bächler (brain0)
Gaetan Bisson (vesath)
Tom Gundersen (tomegun)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 5
Private No

Details

Description:

The SSH session hangs when the remote machine reboots. Instead, the client should be disconnected by the remote machine (Connection to 192.168.0.80 closed by remote host.)

The session disconnects properly when the remote machine runs the Arch Linux network install image 2009.08.

Additional info:
* package version(s)

Name : openssh
Version : 5.3p1-2

* config and/or log files etc.


Steps to reproduce:
Log in to remote machine with Archlinux via ssh.
Become a superuser.
Say reboot.

Closed by  Dave Reisner (falconindy)
Sunday, 04 November 2012, 21:38 GMT
Reason for closing:  Fixed
Additional comments about closing:  Original bug is fixed. If it's systemd related, it's a dupe of  FS#31250 
Comment by Leo Borealis (Architect) - Saturday, 05 December 2009, 03:37 GMT
The problem is in sshd, because the ssh session disconnects properly when the remote host runs Debian.
Comment by Gerardo Exequiel Pozzi (djgera) - Saturday, 05 December 2009, 04:00 GMT
Please see http://www.mail-archive.com/arch-general@archlinux.org/msg05406.html
and a good tip: http://www.mail-archive.com/arch-general@archlinux.org/msg05408.html
Comment by Leo Borealis (Architect) - Saturday, 05 December 2009, 12:30 GMT
Thanks for the solution, but the bug (feature?) is still in the openssh package.
If it is not a bug, why does the ssh session close normally when the remote machine is the Arch Linux installation media or Debian?
Comment by Gerardo Exequiel Pozzi (djgera) - Monday, 07 December 2009, 02:23 GMT
No problem,

The "why" is explained in the mails: basically, the sshd child processes are not stopped (only the daemon is), which is a feature of sshd. Then the network is shut down, which is like cutting the wire.

The Arch Linux installation media does not set up or start the network (you did it manually). When you reboot the machine, the killall5 commands in rc.shutdown kill _all_ processes while the network is still up. This is why your connection is closed by the remote host ;)
Comment by Leo Borealis (Architect) - Tuesday, 08 December 2009, 21:03 GMT
Ok, but Debian, CentOS, and FreeBSD take care of the client machine and close the session normally if the server reboots.
Comment by Gerardo Exequiel Pozzi (djgera) - Wednesday, 09 December 2009, 02:20 GMT
Yes, I know. But to solve this, there is nothing to do in the sshd rc script. In the above links there are two solutions that I proposed. From my point of view, S2 is the better one:
S2: "Do not stop the network in the loop, just skip it, and stop it after the killall5 commands. This also ensures that all daemons and their children are stopped, and then the network is shut down."

I can't find it now, but some time ago I created a trivial patch for this :(
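For readers following along, the ordering that S2 describes can be sketched roughly as below. This is not the patch mentioned above (which is not reproduced in this thread); the function name `shutdown_order` is hypothetical, the reverse-order stop loop is an assumption about how rc.shutdown walks the daemon list, and the sketch only prints the steps instead of running them.

```shell
#!/bin/bash
# Illustrative sketch of the S2 ordering (assumed, not the actual patch):
# stop every daemon except "network" in reverse start order, run the
# killall5 step, and only then stop "network".
# shutdown_order is a hypothetical name; it prints the steps, it does not run them.
shutdown_order() {
    local i deferred=""
    # Walk the arguments (daemons in start order) from last to first.
    for (( i = $#; i >= 1; i-- )); do
        if [ "${!i}" = network ]; then
            deferred=network      # remember it, but do not stop it yet
            continue
        fi
        echo "stop ${!i}"
    done
    echo "killall5"               # all remaining processes die while the network is up
    if [ -n "$deferred" ]; then
        echo "stop $deferred"     # the network goes down last
    fi
}

shutdown_order network sshd crond
```

Note that the sketch hard-codes the daemon name "network", so it would not cover setups where the network is brought up by a differently named script.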
Comment by Nathan Crandall (cactus.ed) - Thursday, 10 December 2009, 20:48 GMT
I had the same problem... My fix was to make a small script and call it from rc.local.shutdown. Solved my problem.

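#!/bin/bash

# Collect the unique names of users logged in on a pseudo-terminal (pts/N).
SSH_USERS=$(/usr/bin/who | /bin/awk '/pts\/[0-9]/ {print $1}' | /usr/bin/sort -u)

# Announce each user, then kill all of that user's processes.
for user in $SSH_USERS; do
    /usr/bin/wall "Killing ssh user: $user"
    /usr/bin/skill -KILL -u "$user"
done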
Comment by Paul Mattal (paul) - Saturday, 06 February 2010, 14:47 GMT
djgera's S2 sounds like the best solution running. Can you create a patch?
Comment by Gerardo Exequiel Pozzi (djgera) - Tuesday, 02 March 2010, 15:25 GMT
@Paul: oops, sorry for the delay, I stopped watching this task by error. Yes, the patch is here :)
Comment by Paul Mattal (paul) - Saturday, 06 March 2010, 21:13 GMT
Is someone with access willing to commit this on initscripts?
Comment by Gerardo Exequiel Pozzi (djgera) - Saturday, 08 May 2010, 23:12 GMT
@Thomas: any possibility to push the patch to initscripts so we can close this issue?
Comment by Thomas Bächler (brain0) - Saturday, 08 May 2010, 23:56 GMT
Definitely not. This is the ugliest solution I could think of and it doesn't work if your network daemon isn't called 'network', but ... let's say, 'net-profiles' or whatever.

Delaying or even entirely skipping network shutdown is something that might be desirable for a number of reasons, but must be implemented in a place where it belongs, like the network, net-profiles, net-auto and so on scripts.

Two questions we need to ask ourselves:
1) Why would anyone even want to shut down the network on shutdown?
2) Should our init scripts know the difference between boot and start, or between stop and shutdown?
Comment by Gerardo Exequiel Pozzi (djgera) - Sunday, 09 May 2010, 13:56 GMT
OK. Anyway, in the initscripts package the network daemon is called "network". But I agree. What about adding something like a "network hook" and making it independent from the "rc.d" scripts? Then "network hook start" is called before "rc.d start" and "network hook stop" is called after "rc.d stop".

1) I guess that is not necessary in all cases.
2) Can be useful in some scenarios.
Comment by Gerardo Exequiel Pozzi (djgera) - Wednesday, 09 June 2010, 22:03 GMT
Comment by Petrus (petrus) - Tuesday, 01 March 2011, 18:28 GMT
The problem still persists regardless of the setting of NETWORK_PERSIST, so the mentioned patch has no effect on this.

This is somewhat surprising. One would expect that, with the network teardown skipped, terminating all sshd processes would be enough for a graceful termination of sessions. Somehow that is not the case.
Comment by Pieter Praet (praet) - Saturday, 11 June 2011, 21:46 GMT
Still an issue in 5.8p2-6
Comment by Florian Pritz (bluewind) - Saturday, 18 June 2011, 12:29 GMT
The attached patch will kill all ssh sessions when the system is shutting down, but keep them alive when you restart sshd yourself.
Comment by Petrus (petrus) - Saturday, 18 June 2011, 14:46 GMT
While Florian's solution would work most of the time, I believe it is
not robust enough and carries some external dependencies which could
make it fail in some cases. Let me explain.

The hanging ssh sessions problem occurs when, for whatever reason, the
network goes down during shutdown before all sshd sessions are
terminated. So any solution to it should guarantee that sshd sessions
are closed before network goes down. Also, it should prevent creating
new sessions, so the master sshd should be stopped first. A solution
should be robust enough, that is, it should not depend on the order in
which daemons (including "network") are stopped.

Thomas Bächler asked two important questions earlier in this thread:

1) Why would anyone even want to shut down the network on shutdown?

As it is now, network is stopped during shutdown. There is an option
(NETWORK_PERSIST) to prevent this for good reasons. Obviously, we cannot
rely on this option, since it is just an option.

2) Should our init scripts know the difference between boot and start,
or between stop and shutdown?

In my opinion, they should not. There is a well defined mechanism in
"initscripts" to attach additional actions to run level changes: hook
functions (see below).

I think it is fragile to depend on the order of daemons listed in
/run/daemons/. If (today) one uses "network", yes sshd will go down
before network with Florian's solution. But what about different
networking setups or future changes in this area?

In the spirit of my analysis above, I suggest a different solution: a
hook script installed in /etc/rc.d/functions.d/ registered to the
"shutdown_start" phase. I have attached such a script I have been
successfully using for a while. The script should be a new component of
the sshd package, that is why my attachment is not a diff.
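The attachment itself is not reproduced in this thread, so the following is only a rough sketch of what such a hook file could look like, not the attached script. It assumes that initscripts provides an `add_hook` helper to files in /etc/rc.d/functions.d/; the function names `sshd_pids` and `sshd_close_sessions` and the optional directory argument (added so the lookup can be tried against a fake /proc) are hypothetical.

```shell
#!/bin/bash
# Sketch of a shutdown hook that closes ssh sessions (assumed layout, not the
# attached script). Intended to live in /etc/rc.d/functions.d/.

# Print the PIDs whose executable is sshd. The optional argument replaces
# /proc, which makes the lookup testable (hypothetical convenience).
sshd_pids() {
    local proc_root=${1:-/proc} pid_dir exe
    for pid_dir in "$proc_root"/[0-9]*; do
        exe=$(readlink "$pid_dir/exe" 2>/dev/null) || continue
        case $exe in
            */sshd|*/sshd" (deleted)") echo "${pid_dir##*/}" ;;
        esac
    done
}

sshd_close_sessions() {
    # Stop the master daemon first so that no new sessions are accepted.
    if [ -x /etc/rc.d/sshd ]; then
        /etc/rc.d/sshd stop
    fi
    # Then terminate the remaining per-session sshd processes while the
    # network is still up, so clients see a proper disconnect.
    local pid
    for pid in $(sshd_pids); do
        kill "$pid" 2>/dev/null
    done
}

# Register for the "shutdown_start" phase only where add_hook is available
# (assumed to be defined by initscripts when this file is sourced).
if declare -F add_hook >/dev/null; then
    add_hook shutdown_start sshd_close_sessions
fi
```

Because the hook runs at the start of shutdown, it does not depend on where "network" or any other networking script sits in the daemon list.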
Comment by Pieter Praet (praet) - Sunday, 19 June 2011, 07:00 GMT
+1 for Peter Dobcsanyi's solution.
Comment by Leonid Isaev (lisaev) - Thursday, 02 February 2012, 19:54 GMT
-1

The real question is why NETWORK_PERSIST has no effect (killall kills something before sshd?). And moreover, it is still specific to /etc/rc.d/network. Then again, everything started up by initscripts should go down at reboot/poweroff via same initscripts.

In my understanding, the only clean solution can be achieved using cgroups: if a server is woken up after net, it and all its descendants will go down before the net.
Comment by Gaetan Bisson (vesath) - Monday, 19 March 2012, 16:06 GMT
Tom (initscripts dev) said the best solution would be to have initscripts kill all user processes at shutdown before starting to touch system daemons, but unfortunately that's not possible with the current initscripts framework.

I agree with his analysis, and in the meantime Dobcsanyi's solution will do.

Leonid: I do not wish to kill all sshd processes in the stop case of /etc/rc.d/sshd as many users (including myself) make use of sshd's behavior to leave current sessions open even after you've killed the main daemon.
Comment by Thomas Bächler (brain0) - Monday, 19 March 2012, 16:34 GMT
Let us simply distinguish boot/start/stop/shutdown in initscripts, and we'll be just fine. There are some technical issues here, but nothing unsolvable.
Comment by Tom Gundersen (tomegun) - Monday, 19 March 2012, 16:40 GMT
brain0: not sure what you are referring to. It is true that we could add "shutdown" as an additional action in the rc script and let rc.shutdown call both "stop" and "shutdown" for each script. That will have the same effect as what Gaetan proposed (but will have the added benefit that other rc scripts could do the same).

It does not solve the problem of killing user processes before daemons in general, but I don't think that is something we can easily do anyway.
Comment by Gaetan Bisson (vesath) - Monday, 19 March 2012, 17:01 GMT
Does it make sense to distinguish stop and shutdown for other daemons? I much prefer adding a shutdown hook to fix this specific SSH issue until initscripts kills user processes before system daemons, rather than having you add a shutdown case that only SSH will use...
Comment by Thomas Bächler (brain0) - Monday, 19 March 2012, 17:06 GMT
Tom, I refer to the problem of this bug report. If we kill all sshd processes on shutdown, but not on regular sshd stop/restart, then we win.

I propose the following: Use some bash magic to provide a shutdown function to each rc.d script that defaults to just calling the stop function. Then, any rc.d script can override it. From rc.shutdown, we then call shutdown instead of stop.

I'll let you figure out the details.
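One possible reading of this proposal, as a minimal sketch: a dispatcher calls a daemon-specific shutdown function when the script defines one and falls back to its stop function otherwise. All names here (`dispatch_shutdown`, the `<name>_stop` / `<name>_shutdown` convention, the echoed messages) are hypothetical; the real rc.d scripts are not structured this way, and the sketch only prints what it would do.

```shell
#!/bin/bash
# Sketch of "shutdown defaults to stop, but can be overridden".
# dispatch_shutdown and the <name>_stop/<name>_shutdown naming are hypothetical.
dispatch_shutdown() {
    local name=$1
    if declare -F "${name}_shutdown" >/dev/null; then
        "${name}_shutdown"        # the script overrides shutdown
    else
        "${name}_stop"            # default: shutdown behaves like stop
    fi
}

# Stand-in daemons for illustration only.
sshd_stop()     { echo "sshd: stop master daemon, keep sessions"; }
sshd_shutdown() { sshd_stop; echo "sshd: close remaining sessions"; }
crond_stop()    { echo "crond: stop"; }

dispatch_shutdown crond
dispatch_shutdown sshd
```

With this shape, a manual stop or restart of sshd still leaves open sessions alone, while a system shutdown closes them.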
Comment by Leonid Isaev (lisaev) - Monday, 19 March 2012, 17:15 GMT
After re-reading this, let me recap what I don't understand:

1. The "bug" was filed against ssh, so why suddenly net management needs fixing? E.g. not shutting down network, etc. As Thomas already said, this all is not generic and is limited to /etc/rc.d/network. What about wireless servers with netcfg?

2. As long as there is no dep logic in the initscripts, and network (or netcfg) is started _before_ sshd, why should network (or netcfg) care at all about sshd with its users and forks?

3. Why this problem is thought to be reboot/shutdown related? It's a generic issue. From the point of view of sshd, /etc/rc.d/sshd stop ==== shutdown. If you want to kill the master daemon why don't kill it explicitly; if sessions are not cleaned up by stopping sshd, it's a real bug IMHO.

4. Is it architecturally sane to manage daemons through hooks in initscripts? Sshd has its own boot script. I agree with Tom, but really, why can't one just mount an empty cgroup hierarchy from rc.sysinit alongside with /run, which then can be populated/used by individual boot scripts as necessary (for instance, sshd/httpd, but not alsa/iptables/ntp)?
Comment by Gaetan Bisson (vesath) - Monday, 19 March 2012, 17:27 GMT
Leonid:

1. Because ssh might not be the only daemon that has problems when the network is shut down prior to killing its children.

2. Nobody suggested that.

3. It's not a bug, it's a feature. Really. A useful one at that.

4. Please provide a patch.
Comment by Thomas Bächler (brain0) - Monday, 19 March 2012, 17:35 GMT
> 3. Why this problem is thought to be reboot/shutdown related? It's a generic issue. From the point of view of sshd, /etc/rc.d/sshd stop ==== shutdown. If you want to kill the master daemon why don't kill it explicitly; if sessions are not cleaned up by stopping sshd, it's a real bug IMHO.

No, no, no, no, no, no! We did this once, and people almost got killed (I was one of the potential killers).

Let's say you upgrade your system, and you want to restart sshd, so it utilizes a bugfix in openssl (for example). So you run "rc.d restart openssh" or "rc.d stop openssh && rc.d start openssh". What happens is this: Your ssh session gets killed (along with everyone else's) and you don't see any output from the sshd start. What else happened to people? sshd failed to start (maybe they changed their config file and screwed up, maybe something else broke) and they got LOCKED OUT from their machine (their headless server that is several hundred kilometers away). No way to get back in. Doing this is pretty common, and a sysadmin expects that his sshd sessions will remain open during a restart of the master daemon. This functionality has priority over any inconvenience, like the problem in this bug.
Comment by Gaetan Bisson (vesath) - Monday, 19 March 2012, 19:10 GMT
Alright, I'm pushing a new openssh package with Dobcsanyi's fix to [testing]. The fix will be removed as soon as initscripts offers a better solution, but that's heavier work that isn't IMHO warranted just by this specific issue.
Comment by Eric Belanger (Snowman) - Tuesday, 20 March 2012, 05:11 GMT
The sshd_close_sessions function in /etc/rc.d/functions.d/sshd-close-sessions tries to stop the sshd daemon without first checking whether it is running. This creates a failure message if you don't run the daemon. There's a function in initscripts to check whether a certain daemon is running.
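A guard of that kind could look roughly like the sketch below. It does not use the initscripts function referred to above; instead it assumes that running daemons are recorded as files under /run/daemons/ (the directory mentioned earlier in this thread). The names `daemon_is_running`, `stop_if_running` and `DAEMON_STATE_DIR` are hypothetical, and the sketch prints the action instead of calling the rc script.

```shell
#!/bin/bash
# Sketch of "only stop the daemon if it is running" (hypothetical helpers).
# Assumption: a running daemon has a marker file in /run/daemons/.
DAEMON_STATE_DIR=${DAEMON_STATE_DIR:-/run/daemons}

daemon_is_running() {
    [ -e "$DAEMON_STATE_DIR/$1" ]
}

stop_if_running() {
    local name=$1
    if daemon_is_running "$name"; then
        echo "stopping $name"     # a real hook would run /etc/rc.d/$name stop here
    else
        echo "skipping $name (not running)"
    fi
}

stop_if_running sshd
```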
Comment by Gaetan Bisson (vesath) - Tuesday, 20 March 2012, 07:01 GMT
Thanks Eric. I'm pushing a fix to [testing].
Comment by Andrzej Giniewicz (Giniu) - Sunday, 04 November 2012, 21:31 GMT
  • Field changed: Percent Complete (100% → 0%)
This fails again when using pure systemd. Looks like there is no sshd.close-sessions counterpart for it.

Also, openssh keeps the paths where it was logged into busy, making them fail umount during reboot (for example "umount /boot: target is busy" if one reboots when pwd==/boot/* and /boot is on separate partition)
Comment by Dave Reisner (falconindy) - Sunday, 04 November 2012, 21:38 GMT
Please don't reopen this. There's been other similar bug reports closed.
