Please read this before reporting a bug:
https://wiki.archlinux.org/title/Bug_reporting_guidelines
Do NOT report bugs when a package is just outdated, or it is in the AUR. Use the 'flag out of date' link on the package page, or the Mailing List.
REPEAT: Do NOT report bugs for outdated packages!
FS#79725 - [python-pytorch-rocm][python-pytorch-opt-rocm] 2.0.1-10 segfaults
Attached to Project: Arch Linux
Opened by 65a (65a) - Sunday, 17 September 2023, 23:12 GMT
Last edited by Toolybird (Toolybird) - Friday, 29 September 2023, 08:49 GMT
Details
Description:
Pytorch-rocm segfaults in libamdhip64.so after a recent update. It no longer works for at least the Radeon W7900 or RX 7900 XTX (gfx1100).

Additional info:
* package version(s)
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
Install python-pytorch-rocm or python-pytorch-opt-rocm version 2.0.1-9 and use it with gfx1100 (W7900, RX 7900 XTX). Any attempt to use these pytorch packages will segfault. A good example is trying to makepkg python-safetensors, which runs benchmarks in various frameworks; the pytorch one segfaults almost instantly. Reverting to 2.0.1-8 restores functionality. Looking at the PKGBUILD, I am not sure why this would be the case, given that the changes in -9 seem to be minimal and packaging/building only. C++ applications linked directly to rocm libraries work fine, as does tensorflow-rocm.
Belongs to pkg "hip-runtime-amd". Please supply a backtrace containing debugging information [1]. It's usually as simple as:
$ coredumpctl gdb (then answer y when it asks "Enable debuginfod for this session?")
(gdb) set logging enabled
(gdb) bt (or bt full)
Then post gdb.txt
[1] https://wiki.archlinux.org/title/Debugging/Getting_traces
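For reference, a non-interactive variant of the same steps (a sketch only; the PID placeholder, the executable path and the Arch debuginfod URL are assumptions, adjust to the actual core dump):
$ coredumpctl dump <PID> --output=core.dump
$ DEBUGINFOD_URLS="https://debuginfod.archlinux.org" gdb -batch \
    -ex "set debuginfod enabled on" -ex "bt full" /usr/bin/python core.dump > gdb.txt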
Here is the stack trace:
Message: Process 1078607 (pytest) of user 1806200001 dumped core.
Module [dso] without build-id.
Module librocsolver.so.0 without build-id.
Module libhipblas.so.1 without build-id.
Module librocsparse.so.0 without build-id.
Module librocrand.so.1 without build-id.
Module librocfft.so.0 without build-id.
Module libmagma.so without build-id.
Module librccl.so.1 without build-id.
Module libhipsparse.so.0 without build-id.
Module libhiprand.so.1 without build-id.
Module libhipfft.so without build-id.
Module librocblas.so.3 without build-id.
Module libMIOpen.so.1 without build-id.
Stack trace of thread 1078607:
#0 0x00007f24fa68e83c n/a (libc.so.6 + 0x8e83c)
#1 0x00007f24fa63e668 raise (libc.so.6 + 0x3e668)
#2 0x00007f24fa63e710 n/a (libc.so.6 + 0x3e710)
#3 0x00007f2483904d5d n/a (libamdhip64.so.5 + 0x104d5d)
#4 0x00007f24838cd050 n/a (libamdhip64.so.5 + 0xcd050)
#5 0x00007f2483a48e71 n/a (libamdhip64.so.5 + 0x248e71)
#6 0x00007f2483a1ee2e n/a (libamdhip64.so.5 + 0x21ee2e)
#7 0x00007f2483a2178f hipLaunchKernel (libamdhip64.so.5 + 0x22178f)
#8 0x00007f2485a6fdea n/a (libtorch_hip.so + 0xa6fdea)
#9 0x00007f2485a63d43 n/a (libtorch_hip.so + 0xa63d43)
#10 0x00007f2485a62b9f n/a (libtorch_hip.so + 0xa62b9f)
#11 0x00007f2486ebacd2 n/a (libtorch_hip.so + 0x1ebacd2)
#12 0x00007f2486ebae57 n/a (libtorch_hip.so + 0x1ebae57)
#13 0x00007f24d9187efb _ZN2at4_ops9eq_Tensor4callERKNS_6TensorES4_ (libtorch_cpu.so + 0x1f87efb)
#14 0x00007f24d8c59f87 _ZN2at6native7iscloseERKNS_6TensorES3_ddb (libtorch_cpu.so + 0x1a59f87)
#15 0x00007f24d9b8e30c n/a (libtorch_cpu.so + 0x298e30c)
#16 0x00007f24d9782927 _ZN2at4_ops7isclose4callERKNS_6TensorES4_ddb (libtorch_cpu.so + 0x2582927)
#17 0x00007f24d8c59188 _ZN2at6native8allcloseERKNS_6TensorES3_ddb (libtorch_cpu.so + 0x1a59188)
#18 0x00007f24db9cc37c n/a (libtorch_cpu.so + 0x47cc37c)
#19 0x00007f24d917f87d _ZN2at4_ops8allclose4callERKNS_6TensorES4_ddb (libtorch_cpu.so + 0x1f7f87d)
#20 0x00007f24e3f6d7c2 n/a (libtorch_python.so + 0x56d7c2)
#21 0x00007f24fabf9ea1 n/a (libpython3.11.so.1.0 + 0x1f9ea1)
#22 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#23 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#24 0x00007f24fac2b583 n/a (libpython3.11.so.1.0 + 0x22b583)
#25 0x00007f24fac2aabb n/a (libpython3.11.so.1.0 + 0x22aabb)
#26 0x00007f24fac13bca PyObject_Call (libpython3.11.so.1.0 + 0x213bca)
#27 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#28 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#29 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#30 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#31 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#32 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#33 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#34 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#35 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#36 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#37 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#38 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#39 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#40 0x00007f24fac13d35 PyObject_Call (libpython3.11.so.1.0 + 0x213d35)
#41 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#42 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#43 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#44 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#45 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#46 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#47 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#48 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#49 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#50 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#51 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#52 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#53 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#54 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#55 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#56 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#57 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#58 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#59 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#60 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#61 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#62 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#63 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
An assertion fails in HIP:
hip::FatBinaryInfo::DeviceIdCheck (device_id=0, this=0x0) at /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_fatbin.hpp:51
Downloading source file /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_fatbin.hpp
51 guarantee(static_cast<size_t>(device_id) < fatbin_dev_info_.size(), "Invalid DeviceId, greater than no of fatbin device info!");
When trying to reference fatbin_dev_info_ in the debugger, I'm hitting a Python memory error. This indicates that the problem is on the Python side, not within ROCm. This matches what you already reported.
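For reference, the inspection described above looks roughly like this in the gdb session (a sketch; the frame number is session-specific, and the print output is taken from the frame shown above):
(gdb) frame <N>               # select the hip::FatBinaryInfo::DeviceIdCheck frame
(gdb) print this              # 0x0 in the frame shown above
(gdb) print fatbin_dev_info_  # this is where the memory error shows up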
Downgrading to python-pytorch-rocm-2.0.1-8 solves the issue (the current version is python-pytorch-rocm-2.0.1-9).
I have version 5.6.1-1 of the ROCm stack (including HIP and all libs).
After installing yesterday's upgrades, the old version python-pytorch-rocm-2.0.1-8 throws an error:
File "/usr/lib/python3.11/site-packages/torch/__init__.py", line 229, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libabsl_log_internal_check_op.so.2301.0.0: cannot open shared object file: No such file or directory
and python-pytorch-rocm-2.0.1-10 segfaults
➜ AMD_LOG_LEVEL=1 python
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> d = torch.device('cuda')
>>> a = torch.rand(1, 2).to(d)
:1:hip_code_object.cpp :505 : 362894524736 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 362894524748 us: 555926: [tid:0x7f4f504b1740] Devices:
:1:hip_code_object.cpp :509 : 362894524751 us: 555926: [tid:0x7f4f504b1740] amdgcn-amd-amdhsa--gfx1102 - [Not Found]
:1:hip_code_object.cpp :514 : 362894524753 us: 555926: [tid:0x7f4f504b1740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 362894524755 us: 555926: [tid:0x7f4f504b1740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 362894524758 us: 555926: [tid:0x7f4f504b1740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 362894524768 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 362894524774 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
>>> print(a + 0)
[1] 555926 segmentation fault (core dumped) AMD_LOG_LEVEL=1 python
Why it's trying to load a gfx906 code object on a gfx1102, I'm not sure.
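A quick way to see which gfx targets a given build actually contains, without launching a kernel (a sketch using stock PyTorch APIs; on a ROCm build, get_arch_list() should report the bundled gfx targets, which going by the log above would only be gfx906 here):
>>> import torch
>>> torch.version.hip           # ROCm/HIP version the package was built against
>>> torch.cuda.get_arch_list()  # gfx targets compiled into this build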
:1:hip_code_object.cpp :505 : 292689222585 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 292689222602 us: 412717: [tid:0x7f5b1b4c5740] Devices:
:1:hip_code_object.cpp :509 : 292689222604 us: 412717: [tid:0x7f5b1b4c5740] amdgcn-amd-amdhsa--gfx1100 - [Not Found]
:1:hip_code_object.cpp :514 : 292689222605 us: 412717: [tid:0x7f5b1b4c5740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 292689222607 us: 412717: [tid:0x7f5b1b4c5740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 292689222609 us: 412717: [tid:0x7f5b1b4c5740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 292689222611 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 292689222614 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
EDIT: I see this is set as PYTORCH_ROCM_ARCH in _prepare, so I suspect that is either broken or no longer being read.
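For context, PYTORCH_ROCM_ARCH is the environment variable pytorch's ROCm build reads to decide which gfx targets to compile kernels for; a minimal sketch of how it would normally be exported before the build (not the actual PKGBUILD, and the target list is only illustrative):
export PYTORCH_ROCM_ARCH="gfx906;gfx1030;gfx1100;gfx1102"
python setup.py build    # the ROCm build should then emit code objects for each listed target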
The issue also happens on gfx1030 (RX 6800).
Running the following simple script with AMD_LOG_LEVEL=1:
---
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
else:
    print('NO GPU!')
---
I see that pytorch is not compiled with the target GPU architecture enabled:
> AMD_LOG_LEVEL=1 python ./test-init-torch.py
:1:hip_code_object.cpp :505 : 0998946758 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 0998946770 us: 5531 : [tid:0x7f7cae803740] Devices:
:1:hip_code_object.cpp :509 : 0998946773 us: 5531 : [tid:0x7f7cae803740] amdgcn-amd-amdhsa--gfx1030 - [Not Found]
:1:hip_code_object.cpp :514 : 0998946775 us: 5531 : [tid:0x7f7cae803740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 0998946777 us: 5531 : [tid:0x7f7cae803740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 0998946779 us: 5531 : [tid:0x7f7cae803740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 0998946785 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 0998946791 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
AMD Radeon RX 6800
The above is with python-pytorch-rocm-2.0.1-10.
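A crude offline cross-check of which targets actually ended up in the shipped library (a sketch; the library path is assumed from the traceback above, and strings/grep only approximates what the ROCm object tools would report):
$ strings /usr/lib/python3.11/site-packages/torch/lib/libtorch_hip.so | grep -oE 'gfx[0-9a-f]+' | sort -u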
See FS#79815.