FS#75747 - [arrow] Crashes with SIGILL (illegal instruction) on CPUs without AVX2 support

Attached to Project: Community Packages
Opened by Thomas (thomastc) - Tuesday, 30 August 2022, 13:36 GMT
Last edited by Bruno Pagani (ArchangeGabriel) - Monday, 02 January 2023, 22:55 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Bruno Pagani (ArchangeGabriel)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Version: arrow 8.0.1-1

Description:

On x86_64, Arrow is compiled with AVX2 forcibly enabled. This causes programs using Arrow to crash with SIGILL (illegal instruction) on CPUs that don't support this instruction set.

AVX2 is explicitly enabled in the PKGBUILD file using `-DARROW_SIMD_LEVEL=AVX2`:
https://github.com/archlinux/svntogit-community/blob/48857f8373ddf21f3cffb4f039df07804fc827d3/trunk/PKGBUILD#L54

Note that this is an unconditional switch, as described in the Arrow manual:
https://arrow.apache.org/docs/cpp/env_vars.html?highlight=arrow_user_simd_level#envvar-ARROW_USER_SIMD_LEVEL

> In addition to runtime dispatch, the compile-time SIMD level can be set using the ARROW_SIMD_LEVEL CMake configuration variable. Unlike runtime dispatch, compile-time SIMD optimizations cannot be changed at runtime (for example, if you compile Arrow C++ with AVX512 enabled, the resulting binary will only run on AVX512-enabled CPUs).

As far as I know, Arch Linux has not dropped support for any x86_64 compatible CPU, so this setting should probably be left at the default.

Steps to reproduce:

* Be on a CPU without AVX2 support. Mine is an Intel Core i7 920, and gives no output for `cat /proc/cpuinfo | grep avx`. See https://stackoverflow.com/questions/55762372/disabling-avx2-in-cpu-for-testing-purposes for emulation options if needed.
* Run `plasma-store-server`, an executable included in the `arrow` package.
* See `Illegal instruction (core dumped)`.

gdb points to `libarrow.so.800` for the location of the instruction:

$ gdb plasma-store-server
Reading symbols from plasma-store-server...
(No debugging symbols found in plasma-store-server)
(gdb) run
Starting program: /usr/bin/plasma-store-server
...
Program received signal SIGILL, Illegal instruction.
0x00007ffff6388309 in ?? () from /usr/lib/libarrow.so.800
(gdb) bt
#0 0x00007ffff6388309 in ?? () from /usr/lib/libarrow.so.800
#1 0x00007ffff7fcecee in call_init (env=0x7fffffffe228, argv=0x7fffffffe218, argc=1, l=<optimized out>) at dl-init.c:70
#2 call_init (l=<optimized out>, argc=1, argv=0x7fffffffe218, env=0x7fffffffe228) at dl-init.c:26
#3 0x00007ffff7fceddc in _dl_init (main_map=0x7ffff7ffe2c0, argc=1, argv=0x7fffffffe218, env=0x7fffffffe228) at dl-init.c:117
#4 0x00007ffff7fe56a0 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#5 0x0000000000000001 in ?? ()
#6 0x00007fffffffe5e3 in ?? ()
#7 0x0000000000000000 in ?? ()

Related report from Manjaro:

https://forum.manjaro.org/t/issues-with-gdal-and-arrow-after-update/117670
This task depends upon

Closed by  Bruno Pagani (ArchangeGabriel)
Monday, 02 January 2023, 22:55 GMT
Reason for closing:  Fixed
Additional comments about closing:  Done in arrow 10.0.1-3. Sorry for not being able to handle it earlier.
Comment by loqs (loqs) - Wednesday, 31 August 2022, 01:12 GMT
@thomastc can you try applying the attached diff and building arrow locally. Note as the version is updated to 9.0.0 with an soname bump anything depending on arrow will need to be rebuilt as well.
The version update is due to a failing test which is fixed in 9.0.0. Unfortunately 9.0.0 does not support xsimd 9.0.0 [1]. So that is fixed up in prepare.
ARROW_SIMD_LEVEL=NONE as the default is SSE4.2 which I believe is not in baseline X86_64 support.
ARROW_RUNTIME_SIMD_LEVEL=MAX which is the current default is for documentation, currently allows runtime support up to AVX512.

[1] https://github.com/apache/arrow/pull/14005
Comment by Thomas (thomastc) - Wednesday, 31 August 2022, 11:26 GMT
Thanks! I can confirm that arrow builds successfully with the patch, and that `plasma-store-server` now runs without SIGILL.
Comment by Lukas (LukeLR) - Tuesday, 18 October 2022, 18:48 GMT
Hi,
I think this is still relevant in 9.0.0. Installed arrow-9.0.0-1 from community, and still get "Illegal instruction (core dumped)" when running

```
python -c 'from osgeo import gdal'
```

or

```
Rscript -e 'library(sf)'
```

which both depend on arrow. When simply removing the line `-DARROW_SIMD_LEVEL=AVX2`from the PKGBUILD, everything works fine. So how can we fix this regression?
Comment by Thomas (thomastc) - Tuesday, 25 October 2022, 09:22 GMT
Attached an updated patch for 9.0.0. It's very simple and identical to @LukeLR's proposal, because the switch to xsimd 9.0.1 has already been applied in the new PKGBUILD, and the default value for ARROW_SIMD_LEVEL already does the right thing (even on ARM now): https://github.com/apache/arrow/blob/ea6875fd2a3ac66547a9a33c5506da94f3ff07f2/cpp/cmake_modules/DefineOptions.cmake#L122

I tested before/after that this fixes the issue. Conveniently, the makepkg tests even catch the crash at build time.

Is there anything else I can do to help getting this merged ASAP?
Comment by loqs (loqs) - Tuesday, 25 October 2022, 12:27 GMT
@thomastc the default value for ARROW_SIMD_LEVEL is SSE4_2 is that not above x86-64 baseline?
gcc -c -Q -march=x86-64 --help=target
Comment by Thomas (thomastc) - Tuesday, 25 October 2022, 12:48 GMT
@loqs Well spotted! I would expect requiring SSE4.2 to be fine on the vast majority of x86_64 systems running today, including my Intel i7-920 from 2008, but only up to SSE2 is actually guaranteed [1]. So we should set -DARROW_SIMD_LEVEL=NONE to be perfectly safe. New patch attached.

[1] https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels
Comment by Thomas (thomastc) - Thursday, 17 November 2022, 08:33 GMT
*gentle nudge* We have a trivial 1-line patch, which does the right thing according to the documentation, and is verified to work in practice. I don't know that there's anything else I can do. Could this please be merged?
Comment by Conor (TheMacDads) - Sunday, 20 November 2022, 22:45 GMT
Thanks for the suggestion in this thread. Just to add - when I built arrow from source, v9.0.0, with the suggested modification, I have one failed test in check() - StreamingReaderTests.InvalidRowsSkippedAsync. Skipping this test I was able to use the package and avoid the illegal instruction segfaults. Will try to reproduce on another machine, curious if others have encountered this though

EDIT: tried this on a second machine and wasn't reproduced. All tests passed

Loading...