FS#78306 - [python-pytorch-opt-rocm] Segmentation fault when accessing variables in gpu memory
Attached to Project:
Community Packages
Opened by Moshiur Rahman (moshiur_rahman) - Wednesday, 26 April 2023, 05:20 GMT
Last edited by Toolybird (Toolybird) - Wednesday, 26 April 2023, 21:21 GMT
Details
I have an AMD RX 6600 and I'm trying to use PyTorch with
ROCm. When I try to access GPU memory, the program
crashes. Memory appears to be allocated, but I cannot
read it back. Here's the code:
```
import torch

# check for amd hip
print(torch.cuda.is_available())
print(torch.version.hip)

device = torch.device('cuda')
id = torch.cuda.current_device()

# print gpu name
print(torch.cuda.get_device_name(id))

# no memory is allocated at first
print(torch.cuda.memory_allocated(id))

# store some variable in gpu memory
r = torch.rand(16).to(device)

# memory is allocated
print(torch.cuda.memory_allocated(id))

# crashes when accessing r
print(r[0])
```
And here's the output:
```
~ > python test.py
True                 # gpu compute is available
5.4.22804-           # rocm version
AMD Radeon RX 6600   # name of gpu
0                    # memory allocation at start
512                  # memory allocation after storing variable
zsh: segmentation fault (core dumped)  python test.py   # program crashes when reading variable
```
I have also tried running the sample code in the PyTorch docs, and it also results in segfaults.

Additional info: version 2.0.0-2
This task depends upon
Closed by Toolybird (Toolybird)
Wednesday, 26 April 2023, 21:21 GMT
Reason for closing: None
Additional comments about closing: Reporter says "resolved"
Please post a backtrace with debug symbols [1]. It might shed some light, i.e.
$ coredumpctl gdb (then answer "y" when it asks "Enable debuginfod for this session?")
[1] https://wiki.archlinux.org/title/Debugging/Getting_traces#Debuginfod
And here's the output for running coredumpctl gdb.
```
~ > coredumpctl gdb
PID: 7563 (python)
UID: 1000 (moshiur)
GID: 1000 (moshiur)
Signal: 11 (SEGV)
Timestamp: Wed 2023-04-26 13:36:18 +06 (9min ago)
Command Line: python test.py
Executable: /usr/bin/python3.10
Control Group: /user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-5e4145d2-f467-4902-9e22-419452dbc5da.scope
Unit: user@1000.service
User Unit: vte-spawn-5e4145d2-f467-4902-9e22-419452dbc5da.scope
Slice: user-1000.slice
Owner UID: 1000 (moshiur)
Boot ID: 6a6c376b91ff459face9f5b9115398f6
Machine ID: f76121a43d0e4912a6c2371f993ec564
Hostname: archlinux
Storage: /var/lib/systemd/coredump/core.python.1000.6a6c376b91ff459face9f5b9115398f6.7563.1682494578000000.zst (present)
Size on Disk: 37.7M
Message: Process 7563 (python) of user 1000 dumped core.
Module [dso] without build-id.
Module libmkl_vml_def.so.2 without build-id.
Module librocfft-device-3.so.0 without build-id.
Module librocfft-device-2.so.0 without build-id.
Module librocfft-device-1.so.0 without build-id.
Module librocfft-device-0.so.0 without build-id.
Module librocsparse.so.0 without build-id.
Module librocrand.so.1 without build-id.
Module librocfft.so.0 without build-id.
Module librccl.so.1 without build-id.
Module libhipsparse.so.0 without build-id.
Module libhiprand.so.1 without build-id.
Module libhipfft.so without build-id.
Module librocblas.so.0 without build-id.
Module libMIOpen.so.1 without build-id.
Stack trace of thread 7563:
#0 0x00007f6dc42d9cd8 n/a (libamdhip64.so.5 + 0xd9cd8)
#1 0x00007f6dc42a9d5f n/a (libamdhip64.so.5 + 0xa9d5f)
#2 0x00007f6dc43fb2a3 n/a (libamdhip64.so.5 + 0x1fb2a3)
#3 0x00007f6dc43db3f2 n/a (libamdhip64.so.5 + 0x1db3f2)
#4 0x00007f6dc43dd1bb hipLaunchKernel (libamdhip64.so.5 + 0x1dd1bb)
#5 0x00007f6dc60dc2ae _ZN2at6native15gpu_kernel_implINS0_10AbsFunctorIfEEEEvRNS_18TensorIteratorBaseERKT_ (libtorch_hip.so + 0x8dc2ae)
#6 0x00007f6dc60d3034 _ZN2at6native15abs_kernel_cudaERNS_18TensorIteratorBaseE (libtorch_hip.so + 0x8d3034)
#7 0x00007f6dff514c3d n/a (libtorch_cpu.so + 0x1b14c3d)
#8 0x00007f6dc76a62f1 n/a (libtorch_hip.so + 0x1ea62f1)
#9 0x00007f6dffb47d48 _ZN2at4_ops7abs_out4callERKNS_6TensorERS2_ (libtorch_cpu.so + 0x2147d48)
#10 0x00007f6dff514345 _ZN2at6native3absERKNS_6TensorE (libtorch_cpu.so + 0x1b14345)
#11 0x00007f6e001e9305 n/a (libtorch_cpu.so + 0x27e9305)
#12 0x00007f6dffafc587 _ZN2at4_ops3abs10redispatchEN3c1014DispatchKeySetERKNS_6TensorE (libtorch_cpu.so + 0x20fc587)
#13 0x00007f6e022c6267 n/a (libtorch_cpu.so + 0x48c6267)
#14 0x00007f6e022c6957 n/a (libtorch_cpu.so + 0x48c6957)
#15 0x00007f6dffb3ccf7 _ZN2at4_ops3abs4callERKNS_6TensorE (libtorch_cpu.so + 0x213ccf7)
#16 0x00007f6dff47d653 _ZN2at6native8isfiniteERKNS_6TensorE (libtorch_cpu.so + 0x1a7d653)
#17 0x00007f6e003a6575 n/a (libtorch_cpu.so + 0x29a6575)
#18 0x00007f6dffe3e717 _ZN2at4_ops8isfinite4callERKNS_6TensorE (libtorch_cpu.so + 0x243e717)
#19 0x00007f6e0a77fa87 n/a (libtorch_python.so + 0x57fa87)
#20 0x00007f6e13356c31 n/a (libpython3.10.so.1.0 + 0x156c31)
#21 0x00007f6e1335031b _PyObject_MakeTpCall (libpython3.10.so.1.0 + 0x15031b)
#22 0x00007f6e1334b726 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x14b726)
#23 0x00007f6e1334f5fb _PyObject_FastCallDictTstate (libpython3.10.so.1.0 + 0x14f5fb)
#24 0x00007f6e1335f21d n/a (libpython3.10.so.1.0 + 0x15f21d)
#25 0x00007f6e133502f3 _PyObject_MakeTpCall (libpython3.10.so.1.0 + 0x1502f3)
#26 0x00007f6e1334b14c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x14b14c)
#27 0x00007f6e133570e9 _PyFunction_Vectorcall (libpython3.10.so.1.0 + 0x1570e9)
#28 0x00007f6e13346336 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x146336)
#29 0x00007f6e133570e9 _PyFunction_Vectorcall (libpython3.10.so.1.0 + 0x1570e9)
#30 0x00007f6e13347476 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x147476)
#31 0x00007f6e133570e9 _PyFunction_Vectorcall (libpython3.10.so.1.0 + 0x1570e9)
#32 0x00007f6e13347476 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x147476)
#33 0x00007f6e133a4851 n/a (libpython3.10.so.1.0 + 0x1a4851)
#34 0x00007f6e13438560 n/a (libpython3.10.so.1.0 + 0x238560)
#35 0x00007f6e1336c974 PyObject_Str (libpython3.10.so.1.0 + 0x16c974)
#36 0x00007f6e133f95ef PyFile_WriteObject (libpython3.10.so.1.0 + 0x1f95ef)
#37 0x00007f6e133f8c5e n/a (libpython3.10.so.1.0 + 0x1f8c5e)
#38 0x00007f6e1334df3f n/a (libpython3.10.so.1.0 + 0x14df3f)
#39 0x00007f6e13346336 _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x146336)
#40 0x00007f6e13344f80 n/a (libpython3.10.so.1.0 + 0x144f80)
#41 0x00007f6e133f39e4 PyEval_EvalCode (libpython3.10.so.1.0 + 0x1f39e4)
#42 0x00007f6e13404383 n/a (libpython3.10.so.1.0 + 0x204383)
#43 0x00007f6e133ffaea n/a (libpython3.10.so.1.0 + 0x1ffaea)
#44 0x00007f6e132a223f n/a (libpython3.10.so.1.0 + 0xa223f)
#45 0x00007f6e132a1ef0 _PyRun_SimpleFileObject (libpython3.10.so.1.0 + 0xa1ef0)
#46 0x00007f6e132a28a3 _PyRun_AnyFileObject (libpython3.10.so.1.0 + 0xa28a3)
#47 0x00007f6e13410b5d Py_RunMain (libpython3.10.so.1.0 + 0x210b5d)
#48 0x00007f6e133e4f3b Py_BytesMain (libpython3.10.so.1.0 + 0x1e4f3b)
#49 0x00007f6e1303c790 n/a (libc.so.6 + 0x23790)
#50 0x00007f6e1303c84a __libc_start_main (libc.so.6 + 0x2384a)
#51 0x0000564ace917045 _start (python3.10 + 0x1045)
Stack trace of thread 7564:
#0 0x00007f6e1311553f ioctl (libc.so.6 + 0xfc53f)
#1 0x00007f6d6d4d8541 n/a (libhsakmt.so.1 + 0xc541)
#2 0x00007f6d6d4d1fbf hsaKmtWaitOnMultipleEvents (libhsakmt.so.1 + 0x5fbf)
#3 0x00007f6d6d275b87 n/a (libhsa-runtime64.so.1 + 0x75b87)
#4 0x00007f6d6d257897 n/a (libhsa-runtime64.so.1 + 0x57897)
#5 0x00007f6d6d26934b n/a (libhsa-runtime64.so.1 + 0x6934b)
#6 0x00007f6d6d224bac n/a (libhsa-runtime64.so.1 + 0x24bac)
#7 0x00007f6e1309ebb5 n/a (libc.so.6 + 0x85bb5)
#8 0x00007f6e13120d90 n/a (libc.so.6 + 0x107d90)
ELF object binary architecture: AMD x86-64
```
https://github.com/RadeonOpenCompute/ROCm/blob/ROCm-5.4.0/docs/release/gpu_os_support.md
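For context, the linked support matrix shows the RX 6600 (gfx1032) is not in the list of officially supported ROCm 5.4 GPUs. A commonly reported workaround for RDNA2 consumer cards, offered here only as a sketch and not as the confirmed resolution of this report, is to set `HSA_OVERRIDE_GFX_VERSION` so the runtime falls back to the gfx1030 code objects that do ship with ROCm:

```python
import os

# Hypothetical workaround (not confirmed as the fix in this report): the
# RX 6600 is gfx1032, for which ROCm 5.4 ships no official kernel binaries,
# so a common suggestion is to run the gfx1030 code path instead. The
# variable must be set before the HIP runtime loads, i.e. before importing
# torch (alternatively, export it in the shell before starting Python).
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

try:
    import torch
except ImportError:
    # torch/ROCm not installed here; the override itself is harmless
    torch = None

if torch is not None and torch.cuda.is_available():
    device = torch.device("cuda")
    r = torch.rand(16).to(device)
    print(r[0])  # the read that previously segfaulted
```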