FS#78306 - [python-pytorch-opt-rocm] Segmentation fault when accessing variables in gpu memory

Attached to Project: Community Packages
Opened by Moshiur Rahman (moshiur_rahman) - Wednesday, 26 April 2023, 05:20 GMT
Last edited by Toolybird (Toolybird) - Wednesday, 26 April 2023, 21:21 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Konstantin Gizdov (kgizdov)
Torsten Keßler (tpkessler)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

I have an AMD RX 6600 and I'm trying to use PyTorch with ROCm, but when I try to access GPU memory the program crashes. Memory appears to be allocated, yet reading it segfaults. Here's the code:

```python
import torch

# check for amd hip
print(torch.cuda.is_available())
print(torch.version.hip)

device = torch.device('cuda')
id = torch.cuda.current_device()
# print gpu name
print(torch.cuda.get_device_name(id))
# no memory is allocated at first
print(torch.cuda.memory_allocated(id))

# store some variable in gpu memory
r = torch.rand(16).to(device)
# memory is allocated
print(torch.cuda.memory_allocated(id))
# crashes when accessing r
print(r[0])
```

And here's the output:
```
~ > python test.py
True # gpu compute is available
5.4.22804- # rocm version
AMD Radeon RX 6600 # name of gpu
0 # memory allocation at start
512 # memory allocation after storing variable
zsh: segmentation fault (core dumped) python test.py # program crashes when reading variable
```

I have also tried running the sample code from the PyTorch docs, and it also results in segfaults.
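For diagnosing issues like this, it can help to confirm which ISA target the HSA runtime actually sees (a sketch, assuming the `rocminfo` tool from the rocminfo package is installed):

```shell
# Print the GPU ISA target the HSA runtime reports; an RX 6600 should
# show gfx1032. Falls back to a message if rocminfo is not installed.
rocminfo 2>/dev/null | grep -m1 -o 'gfx[0-9a-f]*' || echo "rocminfo not available"
```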


Additional info: version 2.0.0-2
This task depends upon

Closed by  Toolybird (Toolybird)
Wednesday, 26 April 2023, 21:21 GMT
Reason for closing:  None
Additional comments about closing:  Reporter says "resolved"
Comment by Moshiur Rahman (moshiur_rahman) - Wednesday, 26 April 2023, 06:13 GMT
PyTorch recommends ROCm 5.4.2 on their website. However, the ROCm runtime provided in the Arch repos is version 5.4.3. I am not sure if this is causing the issue.
Comment by Toolybird (Toolybird) - Wednesday, 26 April 2023, 07:05 GMT
> segmentation fault

Please post a backtrace with debug symbols [1]. It might shed some light. i.e.

$ coredumpctl gdb (then answer y when it asks "Enable debuginfod for this session?")

[1] https://wiki.archlinux.org/title/Debugging/Getting_traces#Debuginfod
Comment by Moshiur Rahman (moshiur_rahman) - Wednesday, 26 April 2023, 08:06 GMT
I haven't done any backtraces before, so I'm not sure if I did it correctly. I've attached the backtrace from gdb.
And here's the output of running coredumpctl gdb.

```
~ > coredumpctl gdb
PID: 7563 (python)
UID: 1000 (moshiur)
GID: 1000 (moshiur)
Signal: 11 (SEGV)
Timestamp: Wed 2023-04-26 13:36:18 +06 (9min ago)
Command Line: python test.py
Executable: /usr/bin/python3.10
Control Group: /user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-5e4145d2-f467-4902-9e22-419452dbc5da.scope
Unit: user@1000.service
User Unit: vte-spawn-5e4145d2-f467-4902-9e22-419452dbc5da.scope
Slice: user-1000.slice
Owner UID: 1000 (moshiur)
Boot ID: 6a6c376b91ff459face9f5b9115398f6
Machine ID: f76121a43d0e4912a6c2371f993ec564
Hostname: archlinux
Storage: /var/lib/systemd/coredump/core.python.1000.6a6c376b91ff459face9f5b9115398f6.7563.1682494578000000.zst (present)
Size on Disk: 37.7M
Message: Process 7563 (python) of user 1000 dumped core.

Module [dso] without build-id.
Module libmkl_vml_def.so.2 without build-id.
Module librocfft-device-3.so.0 without build-id.
Module librocfft-device-2.so.0 without build-id.
Module librocfft-device-1.so.0 without build-id.
Module librocfft-device-0.so.0 without build-id.
Module librocsparse.so.0 without build-id.
Module librocrand.so.1 without build-id.
Module librocfft.so.0 without build-id.
Module librccl.so.1 without build-id.
Module libhipsparse.so.0 without build-id.
Module libhiprand.so.1 without build-id.
Module libhipfft.so without build-id.
Module librocblas.so.0 without build-id.
Module libMIOpen.so.1 without build-id.
Stack trace of thread 7563:
#0 0x00007f6dc42d9cd8 n/a (libamdhip64.so.5 + 0xd9cd8)
#1 0x00007f6dc42a9d5f n/a (libamdhip64.so.5 + 0xa9d5f)
#2 0x00007f6dc43fb2a3 n/a (libamdhip64.so.5 + 0x1fb2a3)
#3 0x00007f6dc43db3f2 n/a (libamdhip64.so.5 + 0x1db3f2)
#4 0x00007f6dc43dd1bb hipLaunchKernel (libamdhip64.so.5 + 0x1dd1bb)
#5 0x00007f6dc60dc2ae _ZN2at6native15gpu_kernel_implINS0_10AbsFunctorIfEEEEvRNS_18TensorIteratorBaseERKT_ (libtorch_hip.so + 0x8dc2ae)
#6 0x00007f6dc60d3034 _ZN2at6native15abs_kernel_cudaERNS_18TensorIteratorBaseE (libtorch_hip.so + 0x8d3034)
#7 0x00007f6dff514c3d n/a (libtorch_cpu.so + 0x1b14c3d)
#8 0x00007f6dc76a62f1 n/a (libtorch_hip.so + 0x1ea62f1)
#9 0x00007f6dffb47d48 _ZN2at4_ops7abs_out4callERKNS_6TensorERS2_ (libtorch_cpu.so + 0x2147d48)
#10 0x00007f6dff514345 _ZN2at6native3absERKNS_6TensorE (libtorch_cpu.so + 0x1b14345)
#11 0x00007f6e001e9305 n/a (libtorch_cpu.so + 0x27e9305)
#12 0x00007f6dffafc587 _ZN2at4_ops3abs10redispatchEN3c1014DispatchKeySetERKNS_6TensorE (libtorch_cpu.so + 0x20fc587)
#13 0x00007f6e022c6267 n/a (libtorch_cpu.so + 0x48c6267)
#14 0x00007f6e022c6957 n/a (libtorch_cpu.so + 0x48c6957)
#15 0x00007f6dffb3ccf7 _ZN2at4_ops3abs4callERKNS_6TensorE (libtorch_cpu.so + 0x213ccf7)
#16 0x00007f6dff47d653 _ZN2at6native8isfiniteERKNS_6TensorE (libtorch_cpu.so + 0x1a7d653)
#17 0x00007f6e003a6575 n/a (libtorch_cpu.so + 0x29a6575)
#18 0x00007f6dffe3e717 _ZN2at4_ops8isfinite4callERKNS_6TensorE (libtorch_cpu.so + 0x243e717)
#19 0x00007f6e0a77fa87 n/a (libtorch_python.so + 0x57fa87)
#20 0x00007f6e13356c31 n/a (libpython3.10.so.1.0 + 0x156c31)
#21 0x00007f6e1335031b _PyObject_MakeTpCall (libpython3.10.so.1.0 + 0x15031b)
#22 0x00007f6e1334b726 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x14b726)
#23 0x00007f6e1334f5fb _PyObject_FastCallDictTstate (libpython3.10.so.1.0 + 0x14f5fb)
#24 0x00007f6e1335f21d n/a (libpython3.10.so.1.0 + 0x15f21d)
#25 0x00007f6e133502f3 _PyObject_MakeTpCall (libpython3.10.so.1.0 + 0x1502f3)
#26 0x00007f6e1334b14c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x14b14c)
#27 0x00007f6e133570e9 _PyFunction_Vectorcall (libpython3.10.so.1.0 + 0x1570e9)
#28 0x00007f6e13346336 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x146336)
#29 0x00007f6e133570e9 _PyFunction_Vectorcall (libpython3.10.so.1.0 + 0x1570e9)
#30 0x00007f6e13347476 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x147476)
#31 0x00007f6e133570e9 _PyFunction_Vectorcall (libpython3.10.so.1.0 + 0x1570e9)
#32 0x00007f6e13347476 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x147476)
#33 0x00007f6e133a4851 n/a (libpython3.10.so.1.0 + 0x1a4851)
#34 0x00007f6e13438560 n/a (libpython3.10.so.1.0 + 0x238560)
#35 0x00007f6e1336c974 PyObject_Str (libpython3.10.so.1.0 + 0x16c974)
#36 0x00007f6e133f95ef PyFile_WriteObject (libpython3.10.so.1.0 + 0x1f95ef)
#37 0x00007f6e133f8c5e n/a (libpython3.10.so.1.0 + 0x1f8c5e)
#38 0x00007f6e1334df3f n/a (libpython3.10.so.1.0 + 0x14df3f)
#39 0x00007f6e13346336 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x146336)
#40 0x00007f6e13344f80 n/a (libpython3.10.so.1.0 + 0x144f80)
#41 0x00007f6e133f39e4 PyEval_EvalCode (libpython3.10.so.1.0 + 0x1f39e4)
#42 0x00007f6e13404383 n/a (libpython3.10.so.1.0 + 0x204383)
#43 0x00007f6e133ffaea n/a (libpython3.10.so.1.0 + 0x1ffaea)
#44 0x00007f6e132a223f n/a (libpython3.10.so.1.0 + 0xa223f)
#45 0x00007f6e132a1ef0 _PyRun_SimpleFileObject (libpython3.10.so.1.0 + 0xa1ef0)
#46 0x00007f6e132a28a3 _PyRun_AnyFileObject (libpython3.10.so.1.0 + 0xa28a3)
#47 0x00007f6e13410b5d Py_RunMain (libpython3.10.so.1.0 + 0x210b5d)
#48 0x00007f6e133e4f3b Py_BytesMain (libpython3.10.so.1.0 + 0x1e4f3b)
#49 0x00007f6e1303c790 n/a (libc.so.6 + 0x23790)
#50 0x00007f6e1303c84a __libc_start_main (libc.so.6 + 0x2384a)
#51 0x0000564ace917045 _start (python3.10 + 0x1045)

Stack trace of thread 7564:
#0 0x00007f6e1311553f ioctl (libc.so.6 + 0xfc53f)
#1 0x00007f6d6d4d8541 n/a (libhsakmt.so.1 + 0xc541)
#2 0x00007f6d6d4d1fbf hsaKmtWaitOnMultipleEvents (libhsakmt.so.1 + 0x5fbf)
#3 0x00007f6d6d275b87 n/a (libhsa-runtime64.so.1 + 0x75b87)
#4 0x00007f6d6d257897 n/a (libhsa-runtime64.so.1 + 0x57897)
#5 0x00007f6d6d26934b n/a (libhsa-runtime64.so.1 + 0x6934b)
#6 0x00007f6d6d224bac n/a (libhsa-runtime64.so.1 + 0x24bac)
#7 0x00007f6e1309ebb5 n/a (libc.so.6 + 0x85bb5)
#8 0x00007f6e13120d90 n/a (libc.so.6 + 0x107d90)
ELF object binary architecture: AMD x86-64

```
   gdb.txt (15.1 KiB)
Comment by Toolybird (Toolybird) - Wednesday, 26 April 2023, 08:49 GMT
It appears to be the Python interpreter crashing... but I'm unsure where the blame lies in cases like this.
Comment by Moshiur Rahman (moshiur_rahman) - Wednesday, 26 April 2023, 10:01 GMT
I think I found the problem. The ROCm docs on GitHub say the RX 6600 only has HIP runtime support; HIP SDK support, which PyTorch seems to need, is missing. Setting HSA_OVERRIDE_GFX_VERSION=10.3.0 (I think this is the target for the RX 6900 XT, which does have HIP SDK support) fixes the issue. Maybe HIP SDK support for the RX 6600 will be added in a future release; for now I'll have to use the environment variable.

https://github.com/RadeonOpenCompute/ROCm/blob/ROCm-5.4.0/docs/release/gpu_os_support.md
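For reference, a minimal sketch of applying the workaround from Python itself, assuming the variable only needs to be in the environment before the HIP runtime initialises (i.e. before the first torch.cuda call):

```python
import os

# Assumption: HSA_OVERRIDE_GFX_VERSION is read when the HIP runtime
# initialises, so it must be set before the first torch.cuda call.
# 10.3.0 selects the gfx1030 ISA, for which the ROCm libraries ship
# kernels; the RX 6600 (gfx1032) is the same RDNA2 generation.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

try:
    import torch  # imported after the override is in place

    if torch.cuda.is_available():
        r = torch.rand(16).to("cuda")
        print(r[0].item())  # segfaulted without the override
except ImportError:
    pass  # torch not installed; the override is still set for this process
```

Equivalently, a one-off run from the shell: `HSA_OVERRIDE_GFX_VERSION=10.3.0 python test.py`.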
