FS#79725 - [python-pytorch-rocm][python-pytorch-opt-rocm] 2.0.1-10 segfaults

Attached to Project: Arch Linux
Opened by 65a (65a) - Sunday, 17 September 2023, 23:12 GMT
Last edited by Torsten Keßler (tpkessler) - Friday, 17 November 2023, 07:16 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Konstantin Gizdov (kgizdov)
Torsten Keßler (tpkessler)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 4
Private No

Details

Description:
Pytorch-rocm segfaults in libamdhip64.so after a recent update. It no longer works on at least the Radeon W7900 and RX 7900 XTX (gfx1100).

Additional info:
* package version(s)
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
Install python-pytorch-rocm or python-pytorch-opt-rocm version 2.0.1-9 and use it with gfx1100 (W7900, RX 7900 XTX). Any attempt to use these pytorch packages segfaults. A good example is trying to makepkg python-safetensors, which runs benchmarks in various frameworks; the pytorch benchmark segfaults almost instantly.
Reverting to 2.0.1-8 restores functionality. Looking at the PKGBUILD I am not sure why this would be the case, given that the changes in -9 seem to be minimal and packaging/building only. C++ applications linked directly to ROCm libraries work fine, as does tensorflow-rocm.
This task depends upon

Closed by  Torsten Keßler (tpkessler)
Friday, 17 November 2023, 07:16 GMT
Reason for closing:  Fixed
Comment by Toolybird (Toolybird) - Sunday, 17 September 2023, 23:34 GMT
> segfaults in libamdhip64.so

Belongs to pkg "hip-runtime-amd". Please supply a backtrace containing debugging information [1]. It's usually as simple as:

$ coredumpctl gdb (then answer y when it asks "Enable debuginfod for this session?")
(gdb) set logging enabled
(gdb) bt (or bt full)

Then post gdb.txt

[1] https://wiki.archlinux.org/title/Debugging/Getting_traces
Comment by 65a (65a) - Monday, 18 September 2023, 00:07 GMT
I think this is actually a bad GPU kernel crashing the HIP runtime rather than a hip-runtime-amd problem, as all other uses of hip-runtime-amd have no issues; only pytorch segfaults. The python-safetensors tests are a good demonstration of this. I am going to try to reproduce on a different CPU and AMD GPU combination.

Here is the stack trace:

Message: Process 1078607 (pytest) of user 1806200001 dumped core.

Module [dso] without build-id.
Module librocsolver.so.0 without build-id.
Module libhipblas.so.1 without build-id.
Module librocsparse.so.0 without build-id.
Module librocrand.so.1 without build-id.
Module librocfft.so.0 without build-id.
Module libmagma.so without build-id.
Module librccl.so.1 without build-id.
Module libhipsparse.so.0 without build-id.
Module libhiprand.so.1 without build-id.
Module libhipfft.so without build-id.
Module librocblas.so.3 without build-id.
Module libMIOpen.so.1 without build-id.
Stack trace of thread 1078607:
#0 0x00007f24fa68e83c n/a (libc.so.6 + 0x8e83c)
#1 0x00007f24fa63e668 raise (libc.so.6 + 0x3e668)
#2 0x00007f24fa63e710 n/a (libc.so.6 + 0x3e710)
#3 0x00007f2483904d5d n/a (libamdhip64.so.5 + 0x104d5d)
#4 0x00007f24838cd050 n/a (libamdhip64.so.5 + 0xcd050)
#5 0x00007f2483a48e71 n/a (libamdhip64.so.5 + 0x248e71)
#6 0x00007f2483a1ee2e n/a (libamdhip64.so.5 + 0x21ee2e)
#7 0x00007f2483a2178f hipLaunchKernel (libamdhip64.so.5 + 0x22178f)
#8 0x00007f2485a6fdea n/a (libtorch_hip.so + 0xa6fdea)
#9 0x00007f2485a63d43 n/a (libtorch_hip.so + 0xa63d43)
#10 0x00007f2485a62b9f n/a (libtorch_hip.so + 0xa62b9f)
#11 0x00007f2486ebacd2 n/a (libtorch_hip.so + 0x1ebacd2)
#12 0x00007f2486ebae57 n/a (libtorch_hip.so + 0x1ebae57)
#13 0x00007f24d9187efb _ZN2at4_ops9eq_Tensor4callERKNS_6TensorES4_ (libtorch_cpu.so + 0x1f87efb)
#14 0x00007f24d8c59f87 _ZN2at6native7iscloseERKNS_6TensorES3_ddb (libtorch_cpu.so + 0x1a59f87)
#15 0x00007f24d9b8e30c n/a (libtorch_cpu.so + 0x298e30c)
#16 0x00007f24d9782927 _ZN2at4_ops7isclose4callERKNS_6TensorES4_ddb (libtorch_cpu.so + 0x2582927)
#17 0x00007f24d8c59188 _ZN2at6native8allcloseERKNS_6TensorES3_ddb (libtorch_cpu.so + 0x1a59188)
#18 0x00007f24db9cc37c n/a (libtorch_cpu.so + 0x47cc37c)
#19 0x00007f24d917f87d _ZN2at4_ops8allclose4callERKNS_6TensorES4_ddb (libtorch_cpu.so + 0x1f7f87d)
#20 0x00007f24e3f6d7c2 n/a (libtorch_python.so + 0x56d7c2)
#21 0x00007f24fabf9ea1 n/a (libpython3.11.so.1.0 + 0x1f9ea1)
#22 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#23 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#24 0x00007f24fac2b583 n/a (libpython3.11.so.1.0 + 0x22b583)
#25 0x00007f24fac2aabb n/a (libpython3.11.so.1.0 + 0x22aabb)
#26 0x00007f24fac13bca PyObject_Call (libpython3.11.so.1.0 + 0x213bca)
#27 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#28 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#29 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#30 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#31 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#32 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#33 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#34 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#35 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#36 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#37 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#38 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#39 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#40 0x00007f24fac13d35 PyObject_Call (libpython3.11.so.1.0 + 0x213d35)
#41 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#42 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#43 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#44 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#45 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#46 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#47 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#48 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#49 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#50 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#51 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#52 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#53 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#54 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#55 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
#56 0x00007f24fac119cd _PyObject_Call_Prepend (libpython3.11.so.1.0 + 0x2119cd)
#57 0x00007f24facd8b92 n/a (libpython3.11.so.1.0 + 0x2d8b92)
#58 0x00007f24fabd953c _PyObject_MakeTpCall (libpython3.11.so.1.0 + 0x1d953c)
#59 0x00007f24fabe3839 _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e3839)
#60 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#61 0x00007f24fabe722b _PyEval_EvalFrameDefault (libpython3.11.so.1.0 + 0x1e722b)
#62 0x00007f24fac09560 _PyFunction_Vectorcall (libpython3.11.so.1.0 + 0x209560)
#63 0x00007f24fabdceb3 _PyObject_FastCallDictTstate (libpython3.11.so.1.0 + 0x1dceb3)
Comment by 65a (65a) - Monday, 18 September 2023, 00:16 GMT
I can also reproduce this with python-pytorch-rocm on a Broadwell-EX (vs SPR-SP) and gfx900 (vs gfx1100). Rolling back python-pytorch-rocm fixes it there as well.
Comment by 65a (65a) - Monday, 18 September 2023, 00:29 GMT
Ah, the debuginfod environment variable wasn't getting passed through sudo; here is the same trace with more symbols attached.
   trace.log (286.7 KiB)
Comment by Torsten Keßler (tpkessler) - Monday, 18 September 2023, 08:25 GMT
When you say you restored pytorch 2.0.1-8, did you also roll back to ROCm 5.6.0? The only difference between 5.6.1 and 5.6.0 is an updated HIP runtime library.
Comment by 65a (65a) - Monday, 18 September 2023, 14:51 GMT
No. I have rocm 5.6.1 installed. If I install pytorch 2.0.1-8, inference on the card works. If I install pytorch 2.0.1-9, it will reliably segfault every time.
Comment by 65a (65a) - Monday, 18 September 2023, 14:52 GMT
I should add that I can roll back to 5.6.0 and the pytorch version still dictates whether the crash occurs.
Comment by Torsten Keßler (tpkessler) - Wednesday, 20 September 2023, 06:19 GMT
I can reproduce the issue. pytorch even crashes with the simple test.py script we use for testing, https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/main/test.py?ref_type=heads.
An assertion fails in HIP:

hip::FatBinaryInfo::DeviceIdCheck (device_id=0, this=0x0) at /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_fatbin.hpp:51
Downloading source file /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.1/hipamd/src/hip_fatbin.hpp
51 guarantee(static_cast<size_t>(device_id) < fatbin_dev_info_.size(), "Invalid DeviceId, greater than no of fatbin device info!");

When trying to reference fatbin_dev_info_ in the debugger, I'm hitting a Python memory error. This indicates that the problem is on the Python side, not within ROCm. That matches what you already reported.
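For anyone who wants a quicker check than the full script, a minimal one-liner in the spirit of that test.py (its exact contents are assumed here, not quoted) that exercises the same allclose/eq path seen in the backtrace above:

# Run against the installed python-pytorch-rocm; segfaults on 2.0.1-9/-10, works on 2.0.1-8.
AMD_LOG_LEVEL=1 python -c "import torch; x = torch.rand(1, 2).to('cuda'); print(torch.allclose(x + 0, x))"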
Comment by c (grinness) - Wednesday, 20 September 2023, 10:05 GMT
I can reproduce the issue on an RX 6800 -- segfault when using the GPU device in torch (the CPU works fine).

Downgrading to python-pytorch-rocm 2.0.1-8 solves the issue (the current version is python-pytorch-rocm 2.0.1-9).

I have version 5.6.1-1 of the ROCm stack (including HIP and all libs).
Comment by 65a (65a) - Wednesday, 20 September 2023, 15:07 GMT
If you look at the attached trace.log, the payload struct (search for __api_tracer) that is sent to ROCm seems to be full of garbage, e.g. "pciDomainID = 0, pciBusID = -368848272, pciDeviceID = 22090, maxSharedMemoryPerMultiProcessor = 140733423165720, isMultiGpuBoard = 229775480, canMapHostMemory = 32767, gcnArch = 229776384, gcnArchName = "\377\177\000\000\220\032\262\r\377\177\000\000#JV\330$\177\000\000\000\0". This is emitted from libtorch_hip.so, which makes me suspect that something bad is happening to memory inside or before that point.
Comment by c (grinness) - Thursday, 21 September 2023, 06:49 GMT
Hi,

after installing yesterday's upgrades, the old version python-pytorch-rocm 2.0.1-8 throws an error:

File "/usr/lib/python3.11/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import * # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libabsl_log_internal_check_op.so.2301.0.0: cannot open shared object file: No such file or directory

and python-pytorch-rocm-2.0.1-10 segfaults
Comment by Paul Sargent (psarge) - Thursday, 21 September 2023, 23:44 GMT
The same failure mechanism is in this bug <https://github.com/ROCm-Developer-Tools/clr/issues/4>, which has a small reproduction case. The user in that thread thinks it comes down to trying to execute a function that isn't available for the graphics architecture. Indeed, running the small test.py statements by hand with AMD_LOG_LEVEL set to 1, I get...

➜ AMD_LOG_LEVEL=1 python
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> d = torch.device('cuda')
>>> a = torch.rand(1, 2).to(d)
:1:hip_code_object.cpp :505 : 362894524736 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 362894524748 us: 555926: [tid:0x7f4f504b1740] Devices:
:1:hip_code_object.cpp :509 : 362894524751 us: 555926: [tid:0x7f4f504b1740] amdgcn-amd-amdhsa--gfx1102 - [Not Found]
:1:hip_code_object.cpp :514 : 362894524753 us: 555926: [tid:0x7f4f504b1740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 362894524755 us: 555926: [tid:0x7f4f504b1740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 362894524758 us: 555926: [tid:0x7f4f504b1740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 362894524768 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 362894524774 us: 555926: [tid:0x7f4f504b1740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
>>> print(a + 0)
[1] 555926 segmentation fault (core dumped) AMD_LOG_LEVEL=1 python


Why it's trying to load a gfx906 code object on a gfx1102, I'm not sure.
Comment by 65a (65a) - Friday, 22 September 2023, 04:19 GMT
I'm not convinced that is the same issue, since gfx1100 is included and for me *ONLY* pytorch is broken; llama.cpp and tensorflow can use ROCm just fine. So while the symptom (a segfault) is the same, many different things can cause a segfault, each with a different root cause. I'll try AMD_LOG_LEVEL=1 and post my log though.
Comment by 65a (65a) - Friday, 22 September 2023, 04:28 GMT
It is the same. Is torch not being built with GPU_TARGETS=gfx1100, etc.?

:1:hip_code_object.cpp :505 : 292689222585 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 292689222602 us: 412717: [tid:0x7f5b1b4c5740] Devices:
:1:hip_code_object.cpp :509 : 292689222604 us: 412717: [tid:0x7f5b1b4c5740] amdgcn-amd-amdhsa--gfx1100 - [Not Found]
:1:hip_code_object.cpp :514 : 292689222605 us: 412717: [tid:0x7f5b1b4c5740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 292689222607 us: 412717: [tid:0x7f5b1b4c5740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 292689222609 us: 412717: [tid:0x7f5b1b4c5740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 292689222611 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 292689222614 us: 412717: [tid:0x7f5b1b4c5740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
Comment by 65a (65a) - Friday, 22 September 2023, 04:32 GMT
Note this also fails on gfx900, so it seems like it may only be working for gfx906 (aka Vega 20)?
Comment by 65a (65a) - Friday, 22 September 2023, 04:39 GMT
https://github.com/ROCm-Developer-Tools/clr/issues/4#issuecomment-1656707401 This comment is the relevant one: this occurs when pytorch is compiled without support for the card. So the hypothesis is that the PKGBUILD needs to set some environment variable the pytorch build understands to the list of common graphics targets; something like export AMDGPU_TARGETS=gfx900,gfx906,gfx1030,gfx1031,gfx1100,gfx1102 (etc.) works for my local llama.cpp build.
EDIT: I see this is set as PYTORCH_ROCM_ARCH in _prepare, so I suspect it is either broken or no longer read.
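As a concrete sketch of what I mean (exact variable names and the separator are assumptions; the PKGBUILD spells it PYTORCH_ROCM_ARCH, while other ROCm components use AMDGPU_TARGETS):

# Hypothetical target list, set before the pytorch build so HIP code objects
# are generated for the common consumer cards rather than only gfx906.
export PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx1030;gfx1100;gfx1102"
export AMDGPU_TARGETS="${PYTORCH_ROCM_ARCH}"   # some components expect ',' instead of ';'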
Comment by 65a (65a) - Friday, 22 September 2023, 04:40 GMT
Comment by c (grinness) - Friday, 22 September 2023, 07:49 GMT
Hi,

the issue also happens on gfx1030 (RX 6800)

Running the following simple script with AMD_LOG_LEVEL=1:

---
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
else:
    print('NO GPU!')
---

I can see that pytorch is not compiled with the target GPU architecture enabled:

> AMD_LOG_LEVEL=1 python ./test-init-torch.py
:1:hip_code_object.cpp :505 : 0998946758 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :507 : 0998946770 us: 5531 : [tid:0x7f7cae803740] Devices:
:1:hip_code_object.cpp :509 : 0998946773 us: 5531 : [tid:0x7f7cae803740] amdgcn-amd-amdhsa--gfx1030 - [Not Found]
:1:hip_code_object.cpp :514 : 0998946775 us: 5531 : [tid:0x7f7cae803740] Bundled Code Objects:
:1:hip_code_object.cpp :530 : 0998946777 us: 5531 : [tid:0x7f7cae803740] host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp :527 : 0998946779 us: 5531 : [tid:0x7f7cae803740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
:1:hip_code_object.cpp :534 : 0998946785 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :267 : 0998946791 us: 5531 : [tid:0x7f7cae803740] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
AMD Radeon RX 6800


The above is with python-pytorch-rocm-2.0.1-10
Comment by 65a (65a) - Friday, 22 September 2023, 09:37 GMT
I think AMDGPU_TARGETS, or passing -DAMDGPU_TARGETS to cmake, is necessary. I don't know why the PYTORCH_ROCM_ARCH environment variable is not working; it does look correctly specified, but other hipBLAS/ROCm components tend to require AMDGPU_TARGETS. I am also testing with the latest version.
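A hedged example of forcing it at the CMake level instead of relying on the environment (the cache variable name is borrowed from other ROCm packages such as rocBLAS, so treat it as an assumption for pytorch):

# Re-run the configure step of an existing CMake build tree ('build' here)
# with the target list passed explicitly on the command line.
cmake -DAMDGPU_TARGETS="gfx900;gfx906;gfx1030;gfx1100" build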
Comment by 65a (65a) - Saturday, 23 September 2023, 03:03 GMT
I am trying a local build with 01cd50b12b1a59d0a89531adbdc5f96a8e702fc3 rolled back. I have also noticed some strange issues with MKL-DNN, so I disabled it due to multiple imports of identical symbols. The build has not completed, but I observed it generating code for gfx1100 without other PKGBUILD changes.

Comment by 65a (65a) - Saturday, 23 September 2023, 03:48 GMT
git revert 01cd50b12b1a59d0a89531adbdc5f96a8e702fc3 plus disabling MKL-DNN works fine for me, so I would recommend doing the same upstream until the change can be stabilized.
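A sketch of that local rebuild, for anyone who wants to try it (assuming the reverted commit lives in the packaging repo linked earlier in this thread; the MKL-DNN switch is my assumption about how the feature gets disabled):

git clone https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch.git
cd python-pytorch
git revert 01cd50b12b1a59d0a89531adbdc5f96a8e702fc3
# Assumption: mkl-dnn disabled via pytorch's USE_MKLDNN switch (or edit the PKGBUILD to the same effect).
export USE_MKLDNN=0
makepkg -s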
Comment by Toolybird (Toolybird) - Friday, 29 September 2023, 08:50 GMT
Merging  FS#79815  here.
Comment by Paul Sargent (psarge) - Saturday, 28 October 2023, 11:39 GMT
I notice work has been done updating all the ROCm packages to 5.7.1 in staging, but this package has not been updated. I suspect it might be due to this bug.

Are things stuck?
Comment by c (grinness) - Monday, 06 November 2023, 20:39 GMT
Hi,

to the best of my knowledge no solution to the issue is available.
I ended up building pytorch 2.1 from source with my own PKGBUILD -- with full support for gfx1030 (I have the full ROCm & HIP stack, version 5.6.1, installed from the official repos)

Thanks
Comment by Paul G (paulieg) - Friday, 10 November 2023, 11:03 GMT
@grinness: might you share your working PKGBUILD here or on a pastebin?
Comment by Paul G (paulieg) - Friday, 10 November 2023, 12:45 GMT
It's been nearly two months now that the -rocm variant has been completely broken for everyone with an AMD GPU (save for the few who happen to have a gfx906 card). I understand that most people, most likely including the maintainers of the package, use Nvidia GPUs for their ML work. However, I would expect the broad base of Arch users to trend much more towards AMD, because of the open source (ish) driver and the desire to support a manufacturer who provides one. This is certainly the case for me. Sure, we may not all be doing professional/critical work, but (without wishing to sound entitled) it's not unreasonable for us to expect that some effort might be undertaken to fix this. This is purely a build issue, as all of the diagnostics in this thread have shown; it is a matter of getting hipblas built for more than gfx906. Fixing it doesn't even require someone to have an AMD GPU, just an understanding of the (complex) build process.

I'm not a seer, but I have a strong intuition that if an update had broken the CUDA variant, the response would be immediate (if it even got that far).

Perhaps, if the maintainers of this multi-package do not have the time to support the -rocm variants, they should be returned to the AUR with dedicated maintainers who do, with a now-simplified build process that no longer has to support nvidia. Again, I do not wish to sound entitled, and I'm trying to fix the build myself (made harder by the convoluted build process that isn't iterable), but not every user has a) a clue, b) the time, c) a fat pipe to sync all repos from scratch every time (prepare breaks in ways I can't understand if you re-run makepkg after making changes), d) enough cores, space and RAM to build.
Comment by Torsten Keßler (tpkessler) - Friday, 10 November 2023, 12:51 GMT
The problem is not with hipblas (it works fine on my gfx900); it's the multi-GPU compilation that is broken at the moment. If you want a working pytorch with ROCm, take the PKGBUILD, keep only the GPU target you're using, and recompile. ROCm 5.7.1 will be in [testing] soon. Hopefully it addresses this issue!
Comment by Torsten Keßler (tpkessler) - Friday, 10 November 2023, 13:40 GMT
Comment by Paul G (paulieg) - Friday, 10 November 2023, 15:21 GMT
I've done that, with multiple build failures and hours of build time wasted. The latest errors, which I haven't yet figured out how to fix by hand:

```
/usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/stl_algobase.h:431:30: error: argument 1 null where non-null expected [-Werror=nonnull]
/usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/stl_algobase.h:431:30: error: argument 1 null where non-null expected [-Werror=nonnull]
```

Right when it's building the wheel.

With all due respect, this has been broken for months, it's not an upstream issue, and the response is 'just wait for the next major version'? Again, I don't want to sound entitled, and I understand maintainers do this in their free time of their own free will. But Arch does have *some* standards when it comes to packages in the repo, yes? I've never had this sort of breakage outside the AUR, let alone had it go unresolved for this long. I'm just 'surprised'.
Comment by c (grinness) - Friday, 10 November 2023, 17:07 GMT
@Paul G

Compiling from source is extremely painful.
For whatever reason, not all the exported environment variables are picked up by the generated CMake files.
In my PKGBUILD I had to force the build to stop after the CMake configuration step, open ccmake/cmake-gui, edit the relevant variables (e.g. HIP paths, the GFX target I wanted to build for, USE_ROCM, USE_CUDA, etc.), and from there generate the CMake files and compile.

For reference, the PKGBUILD file I used to build pytorch 2.1.0:

_pkgname=pytorch
pkgbase="python-${_pkgname}"
pkgname=("${pkgbase}-rocm")
pkgver=2.1.0
_pkgver=${pkgver}
pkgrel=0
_pkgdesc='Tensors and Dynamic neural networks in Python with strong GPU acceleration'
pkgdesc="${_pkgdesc}"
arch=('x86_64')
url="https://pytorch.org"
license=('BSD')
source=("${_pkgname}::git+https://github.com/pytorch/pytorch.git#tag=v$_pkgver")
#source=("${_pkgname}::git+https://github.com/pytorch/pytorch.git")
b2sums=('SKIP')
options=('!lto' '!debug')
conflicts=(python-pytorch)
provides=(python-pytorch)

get_pyver () {
    python -c 'import sys; print(str(sys.version_info[0]) + "." + str(sys.version_info[1]))'
}

prepare() {
    cd "${srcdir}/${_pkgname}"
    git submodule sync
    git submodule update --init --recursive
}
build() {
    cd "${srcdir}/${_pkgname}"
    export VERBOSE=1
    #export PYTORCH_BUILD_VERSION="${pkgver}"
    #export PYTORCH_BUILD_NUMBER=1

    #export _GLIBCXX_USE_CXX11_ABI=1
    export HIP_ROOT_DIR=/opt/rocm
    export ROCM_HOME=/opt/rocm/
    #export USE_CUDA=OFF
    #export USE_ROCM=1
    export PYTORCH_ROCM_ARCH="gfx1030"
    export AMDGPU_TARGETS=${PYTORCH_ROCM_ARCH}
    export GPU_TARGETS=${PYTORCH_ROCM_ARCH}
    export MAGMA_HOME=/opt/rocm

    python tools/amd_build/build_amd.py
    #python setup.py build

    python setup.py build --cmake-only
    ccmake build # or cmake-gui build
}

package() {
    cd "${srcdir}/${_pkgname}"
    python setup.py install --root="${pkgdir}"/
    install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/${pkgname}/LICENSE"
}
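Usage is the normal makepkg flow (a sketch; see my step-by-step further down in the thread for the exact option screens), but be aware that the ccmake call in build() makes the build interactive:

makepkg -si   # expect to be dropped into the ccmake UI partway through build()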
Comment by c (grinness) - Friday, 10 November 2023, 17:22 GMT
@Paul G

A couple of notes:

* since I built my package there has been an update to magma-rocm -- I do not know what impact the update has
* if you're interested in using torchvision from the AUR ('python-torchvision-rocm'), I had to manually update that package, as it pulls a version of torchvision that is not compatible with pytorch 2.1.0 -- you need to set pkgver=0.16.0 in the AUR PKGBUILD



Comment by Paul G (paulieg) - Friday, 10 November 2023, 17:34 GMT
@grinness

Thank you very much, that's incredibly helpful. I'll let it build overnight and update here with any tweaks necessary.
Comment by Paul G (paulieg) - Friday, 10 November 2023, 21:57 GMT
@grinness

Did you enable CUDA in ccmake? If I don't, ccmake configures successfully and everything builds, but then pytorch can't function because ROCm support essentially piggybacks on the CUDA API (so things like torch.cuda.current_device() fail). If I enable CUDA (and set all the requisite paths), ccmake errors out with 'enabling CUDA language, recursive call not allowed'.

It should be possible to use the required NVML-style GPU enumeration without enabling the CUDA language as a backend overall.

Incidentally, for the benefit of others, I had to hand-patch `src/pytorch/binaries/dump_operator_names.cc`, which is missing an `#include <iostream>`. I don't understand how it ever built for anyone else, including the upstream devs, without this.
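For anyone else who hits it, a one-liner that applies the same hand-patch (GNU sed assumed; the path is relative to my checkout):

# Prepend the missing include to the offending file before rebuilding.
sed -i '1i #include <iostream>' src/pytorch/binaries/dump_operator_names.cc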
Comment by c (grinness) - Saturday, 11 November 2023, 10:47 GMT
@Paul G,

in my build I did not have to patch anything, and I did set USE_CUDA=0 in the cmake-gui -- but I am going to rebuild as a test; I do not remember all the details and will report asap.
Note that with my package everything works as expected (on my system):

Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 6800'
>>>
Comment by c (grinness) - Saturday, 11 November 2023, 12:58 GMT
@Paul G,

I just retested my build and I can confirm that everything works correctly -- note that I do not have to patch anything and USE_CUDA is set to OFF.
Moreover, I can confirm that all the 'export variable' statements in the build() function of the PKGBUILD are completely irrelevant for setting options and building the package.

Below is a step-by-step process to get pytorch 2.1.0 with ROCm support using the PKGBUILD provided.

0. Make sure that the directory where you put the PKGBUILD I provided contains neither a pkg/ nor a src/ subdirectory (remove them if necessary), then run makepkg.

1. You will be presented with a first screen of options that does not show the AMDGPU_TARGETS variable, nor any other GPU-relevant variable.

In the first screen, set the options as follows:
HIP_ROOT_DIR /opt/rocm
USE_ROCM ON

Press 'c' to configure and then 'e' to exit

NOTE: do not set USE_MAGMA to ON (i.e. keep it OFF), as it leads to the error 'hipErrorNoBinaryForGpu: Unable to find code object for all current devices!' (at least for my gfx1030)

2. You will be presented with a second screen of options, this time showing the AMDGPU_TARGETS variable with default values (adjust if needed -- note that there are other GPU-relevant options now; change them accordingly), USE_ROCM set to ON, and HIP_ROOT_DIR set to /opt/rocm. Review it and make sure everything is ok.
Hit 'c', then 'e' -> you will be back at the last option screen -> hit 'g' to generate, then wait for the package to be built.

NOTE: if the second screen of options does not include AMDGPU_TARGETS (i.e. you are presented with the first screen again), make sure that HIP_ROOT_DIR and USE_ROCM are set properly and repeat the process in step 1.

3. Install the package and test:
> AMD_LOG_LEVEL=1 python
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 6800'
>>>


Note that I have installed all packages from the ROCm and HIP stack, plus magma-hip (which I believe is the source of the error in this bug report), intel-oneapi-common and intel-oneapi-mkl (plus dependencies -- TBH I do not know if they are needed) -- all from the official repos.

For reference I attach two files with the contents of the first and second screens of options from the process described above.
Comment by c (grinness) - Saturday, 11 November 2023, 13:44 GMT
@Paul G.

NOTE that I have no NVIDIA/CUDA packages installed -- if you don't have an NVIDIA device and you have the relevant packages installed, remove them -- they are not needed for ROCm support and may cause issues when configuring and compiling the package
Comment by c (grinness) - Saturday, 11 November 2023, 14:04 GMT
@the maintainers

As mentioned in my previous posts, the issues are twofold:

1. 'export variable' statements in build() (also within the PKGBUILD you created) are not picked up by the generated CMake files
2. setting USE_MAGMA to ON leads to the error 'hipErrorNoBinaryForGpu: Unable to find code object for all current devices!' (at least for my gfx1030) -- this indicates an issue in magma-hip (I wrongly mentioned a non-existent package, magma-rocm, in one of my previous posts)

Comment by Torsten Keßler (tpkessler) - Tuesday, 14 November 2023, 18:28 GMT
ROCm 5.7.1 with the new pytorch 2.1.0 is in [extra-testing]. The new version fixes the segfault issue on my gfx900.
Comment by Evert Heylen (Evert7) - Wednesday, 15 November 2023, 10:11 GMT
The package(s) in testing also fixes it for me (RX 6600, gfx1030?) :)
Comment by 65a (65a) - Friday, 17 November 2023, 04:19 GMT
Can confirm that package(s) in testing solve the original issue I reported (gfx1102, rx7900xtx), so we are just waiting for them to move out of testing. Thank you @tpkessler, I recognize the mess of the ecosystem around AI/ML and appreciate the fix!
Comment by Torsten Keßler (tpkessler) - Friday, 17 November 2023, 07:16 GMT
Packages are in [extra] now.
