
FS#79725 - [python-pytorch-rocm][python-pytorch-opt-rocm] 2.0.1-10 segfaults

Attached to Project: Arch Linux
Opened by 65a (65a) - Sunday, 17 September 2023, 23:12 GMT
Last edited by Toolybird (Toolybird) - Friday, 29 September 2023, 08:49 GMT
Task Type: Bug Report
Category: Packages: Extra
Status: Assigned
Assigned To: Sven-Hendrik Haase (Svenstaro), Konstantin Gizdov (kgizdov), Torsten Keßler (tpkessler)
Architecture: x86_64
Severity: High
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 0%
Votes: 2
Private: No

Details

Description:
python-pytorch-rocm segfaults in libamdhip64.so after a recent update. It no longer works on at least the Radeon W7900 and RX 7900 XTX (gfx1100).

Additional info:
* package version(s)
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
Install python-pytorch-rocm or python-pytorch-opt-rocm version 2.0.1-9 and use it with gfx1100 (W7900, RX 7900 XTX). Any attempt to use these PyTorch packages will segfault. A good example is trying to makepkg python-safetensors, which runs benchmarks in various frameworks; the PyTorch one segfaults almost instantly.
Reverting to 2.0.1-8 restores functionality. Looking at the PKGBUILD, I am not sure why this would be the case, given that the changes in -9 appear to be minimal and limited to packaging/building. C++ applications linked directly to ROCm libraries work fine, as does tensorflow-rocm.
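For illustration, a minimal script along the lines of the interactive session quoted later in this report (moving a tensor to the GPU and running any kernel on it) is enough to hit the failing path on an affected system:

import torch

# 'cuda' maps to the HIP/ROCm device in python-pytorch-rocm.
device = torch.device('cuda')
a = torch.rand(1, 2).to(device)  # the allocation alone may already log HIP fatbin errors
print(a + 0)                     # launching any kernel segfaults on an affected build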

Comment by Toolybird (Toolybird) - Sunday, 17 September 2023, 23:34 GMT
> segfaults in libamdhip64.so

Belongs to pkg "hip-runtime-amd". Please supply a backtrace containing debugging information [1]. It's usually as simple as:

$ coredumpctl gdb (then answer y when it asks "Enable debuginfod for this session?")
(gdb) set logging enabled
(gdb) bt (or bt full)

Then post gdb.txt

[1] https://wiki.archlinux.org/title/Debugging/Getting_traces
Comment by 65a (65a) - Monday, 18 September 2023, 00:07 GMT
I think this is actually a bad GPU kernel crashing the HIP runtime rather than a hip-runtime-amd problem, as all other uses of hip-runtime-amd do not have issues; only pytorch segfaults. The python-safetensors tests are a good demonstration of this. I am going to try to reproduce on a different CPU and AMD GPU combination.

Here is the stack trace:

Message: Process 1078607 (pytest) of user 1806200001 dumped core.

Module [dso] without build-id.
Module librocsolver.so.0 without build-id.
Module libhipblas.so.1 without build-id.
Module librocsparse.so.0 without build-id.
Module librocrand.so.1 without build-id.
Module librocfft.so.0 without build-id.
Module libmagma.so without build-id.
Module librccl.so.1 without build-id.
Module libhipsparse.so.0 without build-id.
Module libhiprand.so.1 without build-id.
Module libhipfft.so without build-id.
Module librocblas.so.3 without build-id.
Module libMIOpen.so.1 without build-id.
Stack trace of thread 1078607:
#0 0x00007f24fa68e83c n/a (libc.so.6 + 0x8e83c)
#1 0x00007f24fa63e668 raise (libc.so.6 + 0x3e668)
#2 0x00007f24fa63e710 n/a (libc.so.6 + 0x3e710)
#3 0x00007f2483904d5d n/a (libamdhip64.so.5 + 0x104d5d)
#4 0x00007f24838cd050 n/a (libamdhip64.so.5 + 0xcd050)
#5 0x00007f2483a48e71 n/a (libamdhip64.so.5 + 0x248e71)
#6 0x00007f2483a1ee2e n/a (libamdhip64.so.5 + 0x21ee2e)
#7 0x00007f2483a2178f hipLaunchKernel (libamdhip64.so.5 + 0x22178f)
#8 0x00007f2485a6fdea n/a (libtorch_hip.so + 0xa6fdea)
#9 0x00007f2485a63d43 n/a (libtorch_hip.so + 0xa63d43)
#10 0x00007f2485a62b9f n/a (libtorch_hip.so + 0xa62b9f)
#11 0x00007f2486ebacd2 n/a (libtorch_hip.so + 0x1ebacd2)
#12 0x00007f2486ebae57 n/a (libtorch_hip.so + 0x1ebae57)
#13 0x00007f24d9187efb _ZN2at4_ops9eq_Tensor4callERKNS_6TensorES4_ (libtorch_cpu.so + 0x1f87efb)
#14 0x00007f24d8c59f87 _ZN2at6native7iscloseERKNS_6TensorES3_ddb (libtorch_cpu.so + 0x1a59f87)
#15 0x00007f24d9b8e30c n/a (libtorch_cpu.so + 0x298e30c)
#16 0x00007f24d9782927 _ZN2at4_ops7isclose4callERKNS_6TensorES4_ddb (libtorch_cpu.so + 0x2582927)
#17 0x00007f24d8c59188 _ZN2at6native8allcloseERKNS_6TensorES3_ddb (libtorch_cpu.so + 0x1a59188)
#18 0x00007f24db9cc37c n/a (libtorch_cpu.so + 0x47cc37c)
#19 0x00007f24d917f87d _ZN2at4_ops8allclose4callERKNS_6TensorES4_ddb (libtorch_cpu.so + 0x1f7f87d)
#20 0x00007f24e3f6d7c2 n/a (libtorch_python.so + 0x56d7c2)
#21 0x00007f24fabf9ea1 n/a (libpython3.11.so.1.0 + 0x1f9ea1)
#22 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#23 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#24 0x00007f24fac2b583 n/a (libpython3.11.so.1.0 + 0x22b583)
#25 0x00007f24fac2aabb n/a (libpython3.11.so.1.0 + 0x22aabb)
#26 0x00007f24fac13bca PyObject_Call (libpython3.11.so.1.0 + 0x213bca)
#27 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#28 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#29 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#30 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#31 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#32 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#33 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#34 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#35 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#36 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#37 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#38 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#39 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#40 0x00007f24fac13d35 PyObject_Call (libpython3.11.so.1.0 + 0x213d35)
#41 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#42 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#43 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#44 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#45 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#46 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#47 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#48 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#49 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#50 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#51 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#52 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#53 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#54 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#55 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#56 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#57 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#58 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#59 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#60 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#61 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#62 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#63 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
Comment by 65a (65a) - Monday, 18 September 2023, 00:16 GMT
I can also reproduce this with python-pytorch-rocm on a Broadwell-EX (vs SPR-SP) and gfx900 (vs gfx1100). Rolling back python-pytorch-rocm fixes it there as well.
Comment by 65a (65a) - Monday, 18 September 2023, 00:29 GMT
Ah, the debuginfod environment variables weren't getting passed through sudo; here is the same trace with more symbols attached.
   trace.log (286.7 KiB)
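For reference, one way to keep the debuginfod configuration when running coredumpctl under sudo is to preserve the variable explicitly; the URL below is the standard Arch debuginfod server and may differ on other setups:

# Export the debuginfod server and pass it through sudo's environment filter.
export DEBUGINFOD_URLS="https://debuginfod.archlinux.org"
sudo --preserve-env=DEBUGINFOD_URLS coredumpctl gdb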
Comment by Torsten Keßler (tpkessler) - Monday, 18 September 2023, 08:25 GMT
When you say you restored pytorch 2.0.1-8, did you also roll back to ROCm 5.6.0? The only difference between 5.6.1 and 5.6.0 is an updated HIP runtime library.
Comment by 65a (65a) - Monday, 18 September 2023, 14:51 GMT
No. I have rocm 5.6.1 installed. If I install pytorch 2.0.1-8, inference on the card works. If I install pytorch 2.0.1-9, it will reliably segfault every time.
Comment by 65a (65a) - Monday, 18 September 2023, 14:52 GMT
I should add that I can roll back to 5.6.0 and the pytorch version still dictates whether the crash occurs.
Comment by Torsten Keßler (tpkessler) - Wednesday, 20 September 2023, 06:19 GMT
I can reproduce the issue. pytorch even crashes with the simple test.py script we use for testing: https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/main/test.py?ref_type=heads.
An assertion fails in HIP:

hip::FatBinaryInfo::DeviceIdCheck (device_id=0, this=0x0) at /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_fatbin.hpp:51
Downloading source file /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_fatbin.hpp
51 guarantee(static_cast<size_t>(device_id) < fatbin_dev_info_.size(), "Invalid DeviceId, greater than no of fatbin device info!");

When trying to reference fatbin_dev_info_ in the debugger, I'm hitting a Python memory error. This indicates that the problem is on the Python side, not within ROCm. This matches what you already reported.
Comment by c (grinness) - Wednesday, 20 September 2023, 10:05 GMT
I can reproduce the issue on an RX 6800: it segfaults when using the GPU device in torch (CPU works fine).

Downgrading to python-pytorch-rocm-2.0.1-8 solves the issue (the current version is python-pytorch-rocm-2.0.1-9).

I have version 5.6.1-1 of the ROCm stack (including HIP and all libs).
Comment by 65a (65a) - Wednesday, 20 September 2023, 15:07 GMT
If you look at the attached trace.log, the payload struct (search for __api_tracer) that is sent to ROCm appears to be full of garbage, e.g. pciDomainID = 0, pciBusID = -368848272, pciDeviceID = 22090, maxSharedMemoryPerMultiProcessor = 140733423165720, isMultiGpuBoard = 229775480, canMapHostMemory = 32767, gcnArch = 229776384, gcnArchName = "\377\177\000\000\220\032\262\r\377\177\000\000#JV\330$\177\000\000\000\0". It is emitted from libtorch_hip.so, which makes me suspect something bad is happening to memory inside or before that point.
Comment by c (grinness) - Thursday, 21 September 2023, 06:49 GMT
Hi,

after installing yesterday's upgrades, the old version python-pytorch-rocm-2.0.1-8 throws an error:

File "/usr/lib/python3.11/site-packages/torch/__init__.py", line 229, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libabsl_log_internal_check_op.so.2301.0.0: cannot open shared object file: No such file or directory

and python-pytorch-rocm-2.0.1-10 segfaults
Comment by Paul Sargent (psarge) - Thursday, 21 September 2023, 23:44 GMT
The same failure mechanism is in this bug <https://github.com/ROCm-Developer-Tools/clr/issues/4> with a small reproduction case. The user in that thread seems to think it's down to trying to execute a function which isn't available for the graphics architecture. Indeed, running the small test.py statements by hand with AMD_LOG_LEVEL set to 1, I get...

➜ AMD_LOG_LEVEL=1 python
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> d = torch.device('cuda')
>>> a = torch.rand(1, 2).to(d)
:1:hip_code_object.cpp :505 : 362894524736 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 362894524748 us: 555926: [tid:0x7f4f504b1740] Devices:
:1:hip_code_object.cpp :509 : 362894524751 us: 555926: [tid:0x7f4f504b1740] amdgcn-amd-amdhsa--gfx1102 - [Not Found]
:1:hip_code_object.cpp :514 : 362894524753 us: 555926: [tid:0x7f4f504b1740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 362894524755 us: 555926: [tid:0x7f4f504b1740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 362894524758 us: 555926: [tid:0x7f4f504b1740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 362894524768 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 362894524774 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
>>> print(a + 0)
[1] 555926 segmentation fault (core dumped) AMD_LOG_LEVEL=1 python


Why it's trying to load a gfx906 code object on a gfx1102, I'm not sure.
Comment by 65a (65a) - Friday, 22 September 2023, 04:19 GMT
I'm not convinced that is the same issue, since gfx1100 is included and, for me, *ONLY* pytorch is broken; llama.cpp and tensorflow can use ROCm just fine. So while the symptom (a segfault) is the same, there are many different causes that can produce one. I'll try AMD_LOG_LEVEL=1 and post my log, though.
Comment by 65a (65a) - Friday, 22 September 2023, 04:28 GMT
It is the same. Is torch not being built for GPU_TARGETS=gfx1100...etc?

:1:hip_code_object.cpp :505 : 292689222585 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 292689222602 us: 412717: [tid:0x7f5b1b4c5740] Devices:
:1:hip_code_object.cpp :509 : 292689222604 us: 412717: [tid:0x7f5b1b4c5740] amdgcn-amd-amdhsa--gfx1100 - [Not Found]
:1:hip_code_object.cpp :514 : 292689222605 us: 412717: [tid:0x7f5b1b4c5740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 292689222607 us: 412717: [tid:0x7f5b1b4c5740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 292689222609 us: 412717: [tid:0x7f5b1b4c5740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 292689222611 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 292689222614 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
Comment by 65a (65a) - Friday, 22 September 2023, 04:32 GMT
Note this also fails on gfx900, so it seems like it's maybe only working for gfx906 (aka Vega20)?
Comment by 65a (65a) - Friday, 22 September 2023, 04:39 GMT
This comment is the relevant one: https://github.com/ROCm-Developer-Tools/clr/issues/4#issuecomment-1656707401. The error occurs when pytorch is compiled without support for the card. So the hypothesis is that the PKGBUILD needs to set some environment variable the pytorch build understands to the list of common graphics targets; something like export AMDGPU_TARGETS=gfx900,gfx906,gfx1030,gfx1031,gfx1100,gfx1102 (etc.) works for my local llama.cpp build.
EDIT: I see this is set as PYTORCH_ROCM_ARCH in _prepare, so I suspect that is either broken or no longer read.
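For reference, the upstream pytorch build picks up its target list from PYTORCH_ROCM_ARCH; a minimal sketch of what would need to be exported before the build (the target list here is illustrative, not the packaging team's actual choice):

# Semicolon-separated list of gfx targets the fat binary should contain (illustrative list).
export PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx1030;gfx1100;gfx1102"
python setup.py build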
Comment by 65a (65a) - Friday, 22 September 2023, 04:40 GMT
Comment by c (grinness) - Friday, 22 September 2023, 07:49 GMT
Hi,

the issue also happens on gfx1030 (RX 6800)

Running the following simple script with AMD_LOG_LEVEL=1:

---
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
else:
    print('NO GPU!')
---

I see that pytorch is not compiled with the target GPU architecture enabled:

> AMD_LOG_LEVEL=1 python ./test-init-torch.py
:1:hip_code_object.cpp :505 : 0998946758 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 0998946770 us: 5531 : [tid:0x7f7cae803740] Devices:
:1:hip_code_object.cpp :509 : 0998946773 us: 5531 : [tid:0x7f7cae803740] amdgcn-amd-amdhsa--gfx1030 - [Not Found]
:1:hip_code_object.cpp :514 : 0998946775 us: 5531 : [tid:0x7f7cae803740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 0998946777 us: 5531 : [tid:0x7f7cae803740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 0998946779 us: 5531 : [tid:0x7f7cae803740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 0998946785 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 0998946791 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
AMD Radeon RX 6800


The above is with python-pytorch-rocm-2.0.1-10
Comment by 65a (65a) - Friday, 22 September 2023, 09:37 GMT
I think setting AMDGPU_TARGETS, or passing -DAMDGPU_TARGETS to cmake, is necessary. I don't know why the PYTORCH_ROCM_ARCH environment variable is not working; it does look correctly specified, but other hipBLAS/ROCm components tend to require AMDGPU_TARGETS. I am also testing with the latest version.
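A quick way to check which targets a given build actually contains, without going through HIP logging, is to ask torch for its compiled architecture list; on ROCm builds this should report gfx targets (a small diagnostic sketch, not part of the packaged tests):

import torch

# On an affected build only gfx906 is expected to show up here;
# a correctly built package should list all intended AMDGPU targets.
print(torch.cuda.get_arch_list())
print(torch.version.hip)  # HIP/ROCm version the package was built against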
Comment by 65a (65a) - Saturday, 23 September 2023, 03:03 GMT
I am trying a local build with 01cd50b12b1a59d0a89531adbdc5f96a8e702fc3 rolled back. I have also noticed some strange issues with mkl dnnl, so I disabled it due to multiple imports of identical symbols. Build has not completed, but I observed it generating code for gfx1100 without other PKGBUILD changes.

Comment by 65a (65a) - Saturday, 23 September 2023, 03:48 GMT
git revert 01cd50b12b1a59d0a89531adbdc5f96a8e702fc3 + disabling mkl dnnl works fine for me, so I would recommend doing so upstream until the change can be stabilized.
Comment by Toolybird (Toolybird) - Friday, 29 September 2023, 08:50 GMT
Merging FS#79815 here.
