FS#67715 - [glibc] libm-2.32.so SIGILL in pow() due to FMA4 instruction on non-FMA4 system

Attached to Project: Arch Linux
Opened by Ondřej Hošek (RavuAlHemio) - Tuesday, 25 August 2020, 10:54 GMT
Last edited by Bartłomiej Piotrowski (Barthalion) - Friday, 04 September 2020, 06:08 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Bartłomiej Piotrowski (Barthalion)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

/usr/lib/libm-2.32.so (glibc 2.32-2), built with gcc 10.2.0 (gcc 10.2.0-1), exits with SIGILL when calling the pow() function because it executes the FMA4 instruction "vfmaddsd %xmm4,0x8(%rdx),%xmm6,%xmm0" on a system that does not support FMA4.

When glibc is built from ABS with debug symbols, the debugger points to sysdeps/ieee754/dbl-64/e_pow.c:77 as the culprit:

r = __builtin_fma (z, invc, -1.0);

I assume that something is not entirely okay with glibc's multi-arch support (i.e. detecting the instruction set extensions supported by the running system and swizzling in the optimal codepath on first call), but I don't know whether this is a glibc, gcc or binutils issue. It appears that a non-FMA4 implementation of pow() is chosen but was compiled with FMA4 support for some reason, so __builtin_fma is compiled to an FMA4 instruction, which raises the illegal instruction signal on execution.
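For readers unfamiliar with the mechanism described above, here is a minimal sketch of dispatch on first call. It is not glibc's code: glibc uses IFUNC resolvers and its own cpu_features structure, whereas this sketch assumes a plain function pointer and the GCC/Clang builtin __builtin_cpu_supports. The names muladd, muladd_generic, muladd_fma, muladd_select and muladd_resolve are hypothetical, and both variants are plain C stand-ins for code that would normally be compiled with different instruction-set flags.

```c
/* Hypothetical stand-ins: in a real multi-arch library each variant would be
   built with different instruction-set flags. Both are plain C here. */
static double muladd_generic (double x, double y, double z) { return x * y + z; }
static double muladd_fma (double x, double y, double z) { return x * y + z; }

typedef double (*muladd_fn) (double, double, double);

static double muladd_resolve (double x, double y, double z);

/* Starts out pointing at the resolver; replaced on the first call. */
static muladd_fn muladd_impl = muladd_resolve;

/* Pick a variant based on what the running CPU reports. */
static muladd_fn
muladd_select (void)
{
#if defined (__x86_64__) || defined (__i386__)
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("fma"))
    return muladd_fma;
#endif
  return muladd_generic;
}

/* First call lands here: choose once, remember the choice, then forward. */
static double
muladd_resolve (double x, double y, double z)
{
  muladd_impl = muladd_select ();
  return muladd_impl (x, y, z);
}

/* Public entry point; later calls go straight to the chosen variant. */
double
muladd (double x, double y, double z)
{
  return muladd_impl (x, y, z);
}
```

If the selection step tests the wrong feature, the caller ends up in a variant whose instructions the CPU cannot execute, which is the kind of failure a SIGILL would indicate.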

This issue crashes most nontrivial Python scripts, so I have increased the severity to High.

I will now attempt to build glibc with --disable-multi-arch and report back.

Closed by Bartłomiej Piotrowski (Barthalion)
Friday, 04 September 2020, 06:08 GMT
Reason for closing: Fixed
Additional comments about closing: glibc 2.32-4
Comment by Ondřej Hošek (RavuAlHemio) - Tuesday, 25 August 2020, 11:35 GMT
Building with --disable-multi-arch makes the issue disappear.
Comment by Allan McRae (Allan) - Tuesday, 25 August 2020, 11:54 GMT
What processor do you have?
Comment by Ondřej Hošek (RavuAlHemio) - Tuesday, 25 August 2020, 12:04 GMT
An Intel Xeon Gold 6150 virtualized under VMware. Excerpt from /proc/cpuinfo:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
stepping : 0
microcode : 0x2000069
cpu MHz : 2693.672
cache size : 25344 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase smep arat md_clear flush_l1d arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5389.81
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

(+ processor 1 with the same specs)
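When checking a flags line like the one above for a specific extension, note that "fma" is a prefix of "fma4", so a plain substring search can give the wrong answer. A small sketch of a whole-token check follows; the helper name has_cpu_flag is hypothetical and the function only inspects a string passed to it, it does not read /proc/cpuinfo itself.

```c
#include <string.h>

/* Return 1 if `name` appears as a whole whitespace-separated token in
   `flags`, 0 otherwise. Hypothetical helper for illustration. */
int
has_cpu_flag (const char *flags, const char *name)
{
  size_t len = strlen (name);
  const char *p = flags;

  if (len == 0)
    return 0;

  while ((p = strstr (p, name)) != NULL)
    {
      /* The match must start and end on a token boundary. */
      int starts_token = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
      int ends_token = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
      if (starts_token && ends_token)
        return 1;
      p += 1;
    }
  return 0;
}
```

Applied to the excerpt above, such a check would report "fma" and "avx" as present and "fma4" and "avx2" as absent.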
Comment by Ondřej Hošek (RavuAlHemio) - Tuesday, 25 August 2020, 13:00 GMT
I've looked deeper into the build artifacts of a multi-arch debug build. The FMA4 instructions only appear in the fma4 variants of the relevant functions (e.g. __ieee754_pow_fma4), so the issue is not with the non-FMA4 implementations suddenly getting FMA4 instructions, but the FMA4 implementations being chosen erroneously.

(Fortunately, this is not a Heisenbug: it happens both with release and debug builds, and independent of whether a debugger is attached or not.)
Comment by Ondřej Hošek (RavuAlHemio) - Tuesday, 25 August 2020, 13:06 GMT
Comment by Allan McRae (Allan) - Tuesday, 25 August 2020, 13:32 GMT
Great - that does look to be the issue. I'll ping upstream on their IRC channel to confirm.
Comment by Eli Schwartz (eschwartz) - Tuesday, 25 August 2020, 14:55 GMT
see commit https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=107e6a3c2212ba7a3a4ec7cae8d82d73f7c95d0b

-  if (CPU_FEATURES_ARCH_P (cpu_features, FMA_Usable)
-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable))
+  if (CPU_FEATURE_USABLE_P (cpu_features, FMA)
+      && CPU_FEATURE_USABLE_P (cpu_features, AVX2))
     return OPTIMIZE (fma);

-  if (CPU_FEATURES_ARCH_P (cpu_features, FMA4_Usable))
+  if (CPU_FEATURE_USABLE_P (cpu_features, FMA))
     return OPTIMIZE (fma4);

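Seems like the second diff hunk has a clear and obvious typo: the fma4 variant is now selected when FMA is usable, where the old code checked FMA4. A CPU that reports fma but not avx2 (like the one above) fails the first check and then passes the second. :)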
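To make the effect of the quoted hunks concrete, here is a small model of the selector, assuming a simplified three-flag view of the CPU. It is a sketch, not glibc source: struct cpu_model, enum pow_impl, select_buggy, select_fixed and reporter_cpu are hypothetical names, the fallback IMPL_BASELINE is an assumption because the quoted diff does not show the final return, and select_fixed reflects the presumably intended FMA4 check rather than a confirmed upstream patch.

```c
/* Hypothetical model of the feature bits that matter here. */
struct cpu_model
{
  int fma;   /* reported as "fma" in /proc/cpuinfo */
  int avx2;
  int fma4;
};

/* IMPL_BASELINE is an assumed name for whatever the selector falls back to. */
enum pow_impl { IMPL_BASELINE, IMPL_FMA, IMPL_FMA4 };

/* Selector as in the second hunk: the fma4 variant is gated on FMA. */
enum pow_impl
select_buggy (const struct cpu_model *cpu)
{
  if (cpu->fma && cpu->avx2)
    return IMPL_FMA;
  if (cpu->fma)            /* typo: tests fma instead of fma4 */
    return IMPL_FMA4;
  return IMPL_BASELINE;
}

/* Selector with the presumably intended check restored. */
enum pow_impl
select_fixed (const struct cpu_model *cpu)
{
  if (cpu->fma && cpu->avx2)
    return IMPL_FMA;
  if (cpu->fma4)
    return IMPL_FMA4;
  return IMPL_BASELINE;
}

/* Feature set from the /proc/cpuinfo excerpt in this report:
   "fma" present, "avx2" and "fma4" absent. */
const struct cpu_model reporter_cpu = { 1, 0, 0 };
```

With reporter_cpu, select_buggy returns IMPL_FMA4 while select_fixed returns IMPL_BASELINE. A CPU with both fma and avx2 takes the first branch in either version, which would explain why only some machines were affected.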
Comment by Ondřej Hošek (RavuAlHemio) - Tuesday, 25 August 2020, 14:55 GMT
Comment by Kai (b4lt1c3r) - Monday, 31 August 2020, 19:33 GMT
This bug also breaks matrix-synapse-1.19.1-1. :-(

regards
Kai
Comment by mirh (mirh) - Tuesday, 01 September 2020, 12:21 GMT
This was a hell of a bug.
Not only did it prevent X from starting at all on my VM, it *also* made gcc crash when trying to compile a fixed version.

I had to build from a live cd to get it working.
Comment by Kai (b4lt1c3r) - Tuesday, 01 September 2020, 14:23 GMT
Making a custom build with the suggested "--disable-multi-arch" fixed it for now. I had to build it on another machine. Besides matrix-synapse, another application that requires dotnet also runs again.
Would it be possible to backport the upstream fix and deploy a new build revision somehow?

regards
Kai
Comment by Ondřej Hošek (RavuAlHemio) - Wednesday, 02 September 2020, 09:52 GMT
Barthalion has integrated the patch into glibc 2.32-4 (thanks!).
Comment by Ondřej Hošek (RavuAlHemio) - Wednesday, 02 September 2020, 13:01 GMT
Comment by Bartłomiej Piotrowski (Barthalion) - Friday, 04 September 2020, 06:08 GMT
Thanks Ondřej for the patch!
