FS#52129 - [openssh] segfault

Attached to Project: Arch Linux
Opened by Felix Krohn (kro) - Monday, 12 December 2016, 14:19 GMT
Last edited by Gaetan Bisson (vesath) - Tuesday, 17 January 2017, 22:05 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Gaetan Bisson (vesath)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description: openssh dies on a fresh install of ArchLinux

# pacman -Ss openssh
core/openssh 7.3p1-2 [installed]
Free version of the SSH connectivity tools
# /usr/sbin/sshd -D
Segmentation fault
# ldd /usr/bin/sshd
/usr/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
linux-vdso.so.1 (0x00007ffee76c9000)
libpam.so.0 => /usr/lib/libpam.so.0 (0x00007f383a0e0000)
libcrypto.so.1.0.0 => /usr/lib/libcrypto.so.1.0.0 (0x00007f3839c68000)
libutil.so.1 => /usr/lib/libutil.so.1 (0x00007f3839a65000)
libz.so.1 => /usr/lib/libz.so.1 (0x00007f383984f000)
libcrypt.so.1 => /usr/lib/libcrypt.so.1 (0x00007f3839617000)
libgssapi_krb5.so.2 => /usr/lib/libgssapi_krb5.so.2 (0x00007f38393c9000)
libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0x00007f38390e4000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f3838d46000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f3838b42000)
libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0x00007f3838911000)
libcom_err.so.2 => /usr/lib/libcom_err.so.2 (0x00007f383870d000)
libkrb5support.so.0 => /usr/lib/libkrb5support.so.0 (0x00007f3838500000)
libkeyutils.so.1 => /usr/lib/libkeyutils.so.1 (0x00007f38382fc000)
libresolv.so.2 => /usr/lib/libresolv.so.2 (0x00007f38380e5000)
/lib64/ld-linux-x86-64.so.2 (0x00007f383a2ee000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f3837ec8000)


Additional info:
* complete strace output ("strace /usr/sbin/sshd -D") in attachment

Steps to reproduce:
- install most recent Archlinux with openssh

Closed by  Gaetan Bisson (vesath)
Tuesday, 17 January 2017, 22:05 GMT
Reason for closing:  No response
Comment by Felix Krohn (kro) - Monday, 12 December 2016, 14:47 GMT
To be more precise it seems to be a problem with libc, not sshd, according to messages in dmesg and gdb.
(see attached screenshot)
Comment by Doug Newgard (Scimmia) - Monday, 12 December 2016, 15:02 GMT
I doubt it's the problem, but fix your locale first.
Comment by Felix Krohn (kro) - Tuesday, 13 December 2016, 11:02 GMT
I confirm the behaviour doesn't change at all when fixing locales.
Comment by Gaetan Bisson (vesath) - Wednesday, 14 December 2016, 03:19 GMT
Is sshd the only program that segfaults? What if you compile it manually from source? Could you also post the contents of your /proc/cpuinfo?
I'm adding our glibc experts in case they can shed any light on this.
Comment by Allan McRae (Allan) - Wednesday, 14 December 2016, 03:24 GMT
My guess... nothing to do with glibc.

Related to  FS#51709  maybe?
Comment by Felix Krohn (kro) - Wednesday, 14 December 2016, 18:12 GMT
Thanks for your replies.
- I tried recompiling openssh and for good measure also glibc - no change
- I'm currently trying this on an "Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz", but I see the same issue across many different hardware configurations.
- Only a few processes showed the same issue: strace itself, sshd and a couple of 'ld' invocations while recompiling glibc
- systemd-coredump output:
# strace /usr/sbin/sshd -D 2>ssh.strace
Segmentation fault (core dumped)

Dec 14 18:54:51 ns229132 systemd[1]: Started Process Core Dump (PID 30677/UID 0).
Dec 14 18:54:51 ns229132 systemd[1]: Started Process Core Dump (PID 30681/UID 0).
Dec 14 18:54:51 ns229132 systemd-coredump[30682]: Resource limits disable core dumping for process 30674 (strace).
Dec 14 18:54:51 ns229132 systemd-coredump[30682]: Process 30674 (strace) of user 0 dumped core.
Dec 14 18:54:51 ns229132 systemd-coredump[30678]: Process 30676 (sshd) of user 0 dumped core.

Stack trace of thread 30676:
#0 0x00007f11064bf107 __memset_sse2_unaligned_erms (libc.so.6)
#1 0x0000556394151c98 n/a (sshd)
#2 0x00007f110645b291 __libc_start_main (libc.so.6)
#3 0x000055639415461a n/a (sshd)
- I'm attaching the dump file to this thread accordingly
- I'm now trying out some of the hints given in FS#51709 (nsswitch.conf, systemd-resolve), but no luck so far
- I'm using a dropbear sshd to access the server in question and can provide access if helpful.
Comment by Allan McRae (Allan) - Wednesday, 14 December 2016, 22:16 GMT
Have you run memcheck?
Comment by Felix Krohn (kro) - Monday, 19 December 2016, 12:59 GMT
yes, all good on the memory side.
Comment by Gaetan Bisson (vesath) - Tuesday, 20 December 2016, 05:05 GMT
Could you try openssh-7.4p1 from [testing]? It's the latest upstream release, you never know...
Comment by Adam Rosadziński (adamros) - Tuesday, 20 December 2016, 14:56 GMT
Same problem here. I've tried openssh-7.4p1 from testing - also crashed.
Comment by Felix Krohn (kro) - Tuesday, 20 December 2016, 15:20 GMT
Adam, I'm sorry for you, but actually very relieved I'm not the only one with this problem :)
Same for me, testing/openssh doesn't change anything.
Comment by Gaetan Bisson (vesath) - Tuesday, 20 December 2016, 21:34 GMT
Sorry, but I have no idea what the problem might be or why your configuration differs from mine (no segfaults here). Could you please report this issue upstream? https://bugzilla.mindrot.org/
Comment by Adam Rosadziński (adamros) - Tuesday, 20 December 2016, 22:52 GMT
I tried recompiling both OpenSSH and glibc from source, with no luck. After a little research, I found that the crash is caused by this call:
explicit_bzero(privsep_pw->pw_passwd,
strlen(privsep_pw->pw_passwd));
in sshd.c, line 1643.

Valgrind result:
==18568== Process terminating with default action of signal 11 (SIGSEGV)
==18568== Bad permissions for mapped region at address 0x40B1E35
==18568== at 0x6B74107: __memset_sse2_unaligned_erms (in /usr/lib/libc-2.24.so)
==18568== by 0x1129A1: main (sshd.c:1643)

This behavior is easily reproduced with a simple application, which I attach to this comment along with a /proc/cpuinfo dump.
I hope this will help
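
For readers without access to the attachment, the failing pattern can be sketched roughly as follows. This is not the attached crash.c: the function name probe_pw_passwd_write, the use of plain memset in place of explicit_bzero, and the "root" account in the usage are all assumptions made for illustration. The write is done in a forked child so that the program can report the fatal signal instead of dying itself.

```c
/* Hypothetical sketch of the failing pattern; NOT the crash.c attached above. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pwd.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that looks up `user` with getpwnam() and overwrites the
 * returned pw_passwd string in place, like the sshd.c call quoted above.
 * Returns the number of the signal that killed the child (e.g. SIGSEGV),
 * 0 if the child exited normally (write succeeded or user not found),
 * and -1 if fork() or waitpid() failed.
 */
int probe_pw_passwd_write(const char *user)
{
    assert(user != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: write into storage owned by the C library. */
        struct passwd *pw = getpwnam(user);
        if (pw != NULL && pw->pw_passwd != NULL)
            memset(pw->pw_passwd, 0, strlen(pw->pw_passwd));
        _exit(0);
    }

    /* Parent: report how the child ended. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

#ifdef PROBE_MAIN /* build with -DPROBE_MAIN for a standalone program */
int main(void)
{
    /* "root" is an assumed account name; any existing user will do. */
    int sig = probe_pw_passwd_write("root");
    if (sig > 0)
        printf("child killed by signal %d (%s)\n", sig, strsignal(sig));
    else if (sig == 0)
        printf("in-place write did not crash\n");
    else
        printf("could not run the probe\n");
    return 0;
}
#endif
```

On a system that shows the problem, one would expect the child to be killed by SIGSEGV; on an unaffected system the write simply succeeds.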
Comment by Gaetan Bisson (vesath) - Wednesday, 21 December 2016, 07:03 GMT
Great work! Here crash.c works just fine, but I'll let Allan comment.
Comment by Felix Krohn (kro) - Wednesday, 21 December 2016, 09:38 GMT
Gaetan, if you want I can provide you with root access (ssh through dropbear) on a freshly (arch-bootstrap.sh-) installed server in order to reproduce and debug.
Comment by Allan McRae (Allan) - Wednesday, 21 December 2016, 11:30 GMT
Do you have microcode updates applied?
Comment by Adam Rosadziński (adamros) - Wednesday, 21 December 2016, 15:20 GMT
Yes, from latest intel-ucode package:
extra/intel-ucode 20161104-1 [installed]
Microcode update files for Intel CPUs

Microcode update is being applied at boot:
[ 0.892448] microcode: CPU0 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892495] microcode: CPU1 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892544] microcode: CPU2 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892555] microcode: CPU3 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892604] microcode: CPU4 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892651] microcode: CPU5 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892697] microcode: CPU6 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892745] microcode: CPU7 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892795] microcode: CPU8 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892812] microcode: CPU9 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892860] microcode: CPU10 sig=0x206d7, pf=0x1, revision=0x710
[ 0.892908] microcode: CPU11 sig=0x206d7, pf=0x1, revision=0x710
[ 0.893022] microcode: Microcode Update Driver: v2.01 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
Comment by Felix Krohn (kro) - Thursday, 22 December 2016, 14:21 GMT
Surprisingly, I have now managed to produce a setup where the sshd segfault doesn't happen, even though the system has an identical package list.
The same applies to crash.c attached above by Adam.

broken: fresh installation using arch-bootstrap.sh and the up-to-date package repository.

functional: fresh installation using arch-bootstrap.sh and a package repository snapshot from December 8th, then run "pacman -Sy; pacman -Su" which installs 24 package updates (bash-4.4.005-2 coreutils-8.26-1 filesystem-2016.12-2 geoip-database-20161206-1 gnupg-2.1.16-2 gnutls-3.4.17-1 icu-58.2-1 iproute2-4.9.0-1 libarchive-3.2.2-1 libgcrypt-1.7.5-1 libsystemd-232-6 libunistring-0.9.7-1 linux-lts-4.4.39-1 logrotate-3.11.0-1 man-db-2.7.6.1-2 man-pages-4.09-1 man-pages-de-1.18-1 nano-2.7.2-1 ncurses-6.0+20161203-1 openssh-7.4p1-1 pacman-mirrorlist-20161214-1 readline-7.0.001-1 systemd-232-6 systemd-sysvcompat-232-6)

- the package list (output of 'pacman -Qs|grep -v "^ "|cut -d/ -f2-') is 100% identical between both installations
- the installation script (arch-bootstrap.sh) is also exactly identical, only the given mirror repo differs (official mirror versus snapshot of official mirror)

My conclusion is that one of the updated packages behaves differently depending on whether it is installed in a chroot by arch-bootstrap.sh or on the booted system by pacman -Su. Perhaps some ldconfig hooks?
My intuition and prior experiences tell me I should automatically blame systemd :-), but I can't prove it (yet).
Comment by Timur Aydin (taydin) - Thursday, 22 December 2016, 15:50 GMT
This is most likely the same problem I saw when I attempted to install Arch Linux on my dedicated server at OVH. The automated installer finishes the installation and right at the end, an error message is issued saying that "ssh didn't start". I know that this setup does not lend itself to any debugging, but I just wanted to mention it as an additional data point. The system is an Intel(R) Xeon(R) CPU D-1521 @ 2.40GHz, with 8 cores.
Comment by Felix Krohn (kro) - Thursday, 22 December 2016, 16:28 GMT
@ Timur: It is exactly this problem :-)
The above workaround is active at OVH and you can now relaunch your installation.
Comment by Gaetan Bisson (vesath) - Thursday, 22 December 2016, 18:24 GMT
So you are saying that running "sudo pacman -Syu `pacman -Qq`" would fix the issue? Adam, could you perhaps confirm this? Felix, are your CPUs also Xeons?
Comment by Allan McRae (Allan) - Thursday, 22 December 2016, 23:54 GMT
I'm assuming that nothing has been done at the OVH end to the packages that are installed from their snapshot mirror.

Otherwise, it looks like a post_install scriptlet has not run in the first install.
Comment by Allan McRae (Allan) - Thursday, 22 December 2016, 23:55 GMT
Also, can you install new packages one by one and locate what package update is fixing this?
Comment by Felix Krohn (kro) - Friday, 23 December 2016, 10:36 GMT
@Gaetan: I couldn't see any correlation between CPU models and this error - I had it on all kinds of (Intel) CPUs, for example also an old Core2 Q6600. The pacman command you gave doesn't change the behaviour, neither in postinstall (chroot) nor after boot on the freshly installed system. I'd like to completely purge and reinstall the mentioned packages, but of course this is not possible for the critical ones like systemd, filesystem etc.

@Allan: yes, the snapshot used is unmodified, and really just a snapshot of our official ArchLinux mirror on this date. A simple reinstall of the mentioned packages also doesn't change the behaviour.

I'm not an Arch/pacman expert, so please get in touch on #archlinux-bugs or per email (firstname dot lastname @ovh.net) if you want shell access to do some tests on your own. I can easily re-install the servers in question, there's nothing to lose :)
Comment by Andrey Vihrov (andreyv) - Sunday, 25 December 2016, 16:53 GMT
FWIW, crash.c seems to be an invalid program. "man 3p getpwnam" says this:

> The application shall not modify the structure to which the return value points, nor any storage areas pointed to by pointers within the structure.
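
One pattern that stays within that rule is to use getpwnam_r(), which places the strings in a buffer supplied by the caller, so the caller may overwrite them. The sketch below is only an illustration under assumptions, not a proposed OpenSSH patch: the names scrub and lookup_and_scrub are hypothetical, and the volatile byte loop is a stand-in for explicit_bzero. Whether this avoids the crash on the affected systems has not been tested here.

```c
/* Hypothetical sketch; names and structure are illustrative assumptions. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Zero `n` bytes through a volatile pointer so the stores are not elided. */
static void scrub(void *p, size_t n)
{
    volatile unsigned char *vp = p;
    while (n-- > 0)
        *vp++ = 0;
}

/*
 * Look up `user` with getpwnam_r() into a heap buffer owned by this
 * function, store the uid in *uid_out, then scrub the password field.
 * Returns 0 on success, 1 if no entry was found, -1 on error.
 */
int lookup_and_scrub(const char *user, uid_t *uid_out)
{
    assert(user != NULL && uid_out != NULL);

    long hint = sysconf(_SC_GETPW_R_SIZE_MAX);
    size_t len = hint > 0 ? (size_t)hint : 1024;
    char *buf = NULL;
    struct passwd pw, *result = NULL;
    int rc;

    for (;;) {
        char *tmp = realloc(buf, len);
        if (tmp == NULL) {
            free(buf);
            return -1;
        }
        buf = tmp;
        rc = getpwnam_r(user, &pw, buf, len, &result);
        if (rc != ERANGE)
            break;
        len *= 2; /* buffer too small: grow and retry */
    }

    if (rc != 0 || result == NULL) {
        free(buf);
        return rc != 0 ? -1 : 1;
    }

    /* pw.pw_passwd points into buf, which we own, so writing is allowed. */
    if (pw.pw_passwd != NULL)
        scrub(pw.pw_passwd, strlen(pw.pw_passwd));

    *uid_out = pw.pw_uid;
    free(buf);
    return 0;
}

#ifdef LOOKUP_MAIN /* build with -DLOOKUP_MAIN for a standalone program */
int main(void)
{
    uid_t uid;
    /* "root" is an assumed account name used only for the demonstration. */
    if (lookup_and_scrub("root", &uid) == 0)
        printf("uid=%lu, password field scrubbed in our own buffer\n",
               (unsigned long)uid);
    return 0;
}
#endif
```

The design point is ownership: the storage behind getpwnam()'s return value belongs to the C library, while the buffer passed to getpwnam_r() belongs to the application.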
Comment by Adam Rosadziński (adamros) - Monday, 26 December 2016, 22:40 GMT
@Vihrov: To be honest, I didn't check whether this behavior is correct or not. The only purpose of this application was to show the OpenSSH behaviour that leads to the crash.
Comment by Gaetan Bisson (vesath) - Friday, 06 January 2017, 07:28 GMT
I suggest you recompile openssh without the bzero call. If this solves your issue, then the proper course of action would be to contact upstream, mention the segfaults you are observing, and quote the Linux getpwnam man page: "The return value may point to a static area." They will be better qualified to decide if and how the ssh code needs fixing. Cheers.
