FS#62282 - [cuda] $PATH and $LD_LIBRARY_PATH were not updated automatically

Attached to Project: Community Packages
Opened by Zhen Xi (Mayrixon) - Tuesday, 09 April 2019, 00:12 GMT
Last edited by Konstantin Gizdov (kgizdov) - Thursday, 11 April 2019, 16:50 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Konstantin Gizdov (kgizdov)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:
Packages that require cuda, such as python-tensorflow-cuda, cannot find the cuda *.so libraries automatically.

Additional info:
* package version(s)
cuda 10.1.105-6

* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
This task depends upon

Closed by  Konstantin Gizdov (kgizdov)
Thursday, 11 April 2019, 16:50 GMT
Reason for closing:  Fixed
Additional comments about closing:  glibc-2.28-6
cuda-10.1.105-8
Comment by Zhen Xi (Mayrixon) - Tuesday, 09 April 2019, 00:15 GMT
Sorry for the misclick.

Package versions:
cuda 10.1.105-6
python-tensorflow-cuda 1.13.1-4

Logs:
2019-04-09 01:14:15.966295: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-09 01:14:15.983045: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4009500000 Hz
2019-04-09 01:14:15.983530: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x564380471ef0 executing computations on platform Host. Devices:
2019-04-09 01:14:15.983542: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-04-09 01:14:16.038294: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-09 01:14:16.038739: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56437f2f9fd0 executing computations on platform CUDA. Devices:
2019-04-09 01:14:16.038753: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-04-09 01:14:16.038956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.39GiB
2019-04-09 01:14:16.038965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-09 01:14:16.282489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-09 01:14:16.282508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-09 01:14:16.282512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-09 01:14:16.282705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7120 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-04-09 01:14:17.015797: I tensorflow/stream_executor/dso_loader.cc:142] Couldn't open CUDA library libcublas.so.10.1. LD_LIBRARY_PATH:
2019-04-09 01:14:17.015819: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcublas.so.10.1; dlerror: libcublas.so.10.1: cannot open shared object file: No such file or directory

The bug can be worked around by adding the following commands to .bashrc or .zshrc:
export PATH=/opt/cuda/bin:$PATH
export LD_LIBRARY_PATH=/opt/cuda/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
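A quick way to check whether the workaround is needed (or took effect) is to query the dynamic linker cache directly; the /opt/cuda path below assumes the Arch cuda package layout:

```shell
# Does the file exist where the Arch cuda package installs it?
ls /opt/cuda/targets/x86_64-linux/lib/libcublas.so.10.1 2>/dev/null \
    || echo "file not present at the expected path"
# Does the linker cache know the SONAME TensorFlow dlopens?
ldconfig -p | grep libcublas.so.10.1 \
    || echo "SONAME not in the linker cache; the LD_LIBRARY_PATH workaround applies"
```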
Comment by Konstantin Gizdov (kgizdov) - Wednesday, 10 April 2019, 11:13 GMT
python-tensorflow does not depend on or use cuda. Only python-tensorflow-cuda or python-tensorflow-opt-cuda try to look for cuda, so I'm very sceptical this log was produced by python-tensorflow.

Please describe your setup in detail (CPU, GPU, relevant installed packages) and provide full steps to reproduce the issue.
Comment by Zhen Xi (Mayrixon) - Wednesday, 10 April 2019, 13:42 GMT
That is my fault. I am using python-tensorflow-cuda rather than python-tensorflow. The log was produced by python-tensorflow-cuda.
Comment by Konstantin Gizdov (kgizdov) - Wednesday, 10 April 2019, 18:31 GMT
Could you please provide the details requested? It would also be nice to know if you have a hook to recompile nvidia modules on upgrade. Thanks
Comment by Zhen Xi (Mayrixon) - Wednesday, 10 April 2019, 19:43 GMT
hardware:
CPU: i7-6700k
GPU: GTX 1080

relevant installed packages:
linux 5.0.7.arch1-1
nvidia 418.56-7
nvidia-utils 418.56-1
cuda 10.1.105-6
cudnn 7.5.0.56-1
python 3.7.3-1
python-tensorflow-cuda 1.13.1-4

relevant settings:
/etc/mkinitcpio.conf
MODULES=(nvidia nvidia_modeset nvidia_uvm nvidia_drm)
HOOKS=(base udev autodetect modconf block filesystems keyboard fsck)

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvidia-drm.modeset=1 nowatchdog"

/etc/pacman.d/hooks
[Trigger]
Operation=Install
Operation=Upgrade
Operation=Remove
Type=Package
Target=nvidia
Target=linux
# Change the linux part above and in the Exec line if a different kernel is used

[Action]
Description=Update Nvidia module in initcpio
Depends=mkinitcpio
When=PostTransaction
NeedsTargets
Exec=/bin/sh -c 'while read -r trg; do case $trg in linux) exit 0; esac; done; /usr/bin/mkinitcpio -P'

Shell variables:
$PATH
/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/lib:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/command-not-found:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/fzf:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/tmux:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/git:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/gitignore:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/pip:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/colorize:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/history:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/thefuck:/home/zhen/.antigen/bundles/Vifon/deer:/home/zhen/.antigen/bundles/supercrabtree/k:/home/zhen/.antigen/bundles/zsh-users/zsh-autosuggestions:/home/zhen/.antigen/bundles/zsh-users/zsh-completions
$LD_LIBRARY_PATH (this variable is not set)


Steps to reproduce:
1. Create a python script minimum_script.py as follows:
import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
2. Execute the command python minimum_script.py
3. Logs as follows:
WARNING:tensorflow:From /usr/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-04-10 20:42:50.494741: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-10 20:42:50.515438: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4009500000 Hz
2019-04-10 20:42:50.515766: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55879a667250 executing computations on platform Host. Devices:
2019-04-10 20:42:50.515781: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-04-10 20:42:50.585591: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-10 20:42:50.586063: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x558799e29500 executing computations on platform CUDA. Devices:
2019-04-10 20:42:50.586075: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-04-10 20:42:50.586274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 6.12GiB
2019-04-10 20:42:50.586283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-10 20:42:50.842967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-10 20:42:50.842988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-10 20:42:50.842995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-10 20:42:50.843184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5892 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Epoch 1/5
2019-04-10 20:42:50.989782: I tensorflow/stream_executor/dso_loader.cc:142] Couldn't open CUDA library libcublas.so.10.1. LD_LIBRARY_PATH:
2019-04-10 20:42:50.989802: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcublas.so.10.1; dlerror: libcublas.so.10.1: cannot open shared object file: No such file or directory
[1] 28870 abort (core dumped) python minimum_script.py
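The failure can be reproduced without the full Keras script by asking the dynamic loader for the same SONAME directly (python3 is assumed to be the interpreter used above):

```shell
# Try to dlopen the exact SONAME TensorFlow's stream_executor requests;
# on an affected system this fails just like the abort in the log above.
python3 - <<'EOF'
import ctypes
try:
    ctypes.CDLL("libcublas.so.10.1")
    print("libcublas.so.10.1: loaded OK")
except OSError as err:
    print("dlopen failed:", err)
EOF
```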

Comment by Konstantin Gizdov (kgizdov) - Thursday, 11 April 2019, 10:05 GMT
I am able to replicate. This turns out to be a regression in the libraries shipped with CUDA 10.1: not all of them carry a SONAME with a 10.1 variant, so names like libcublas.so.10.1 cannot be resolved. There is now a version of cuda in testing that attempts to correct for this regression, but it requires a new release of glibc (ldconfig). Thus we cannot do much in the immediate term. My recommendation is to downgrade to cuda 10 and an older tensorflow that was built against it (1.13.1-3, for example).
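The SONAME mismatch described above can be inspected with readelf; libc.so.6 is used below only as a universally present stand-in, whereas on an affected system one would point readelf at the libraries under /opt/cuda:

```shell
# Print the SONAME embedded in a shared object. ldconfig creates the
# runtime symlinks from these SONAMEs, so a library shipped without a
# .so.10.1 SONAME never becomes resolvable as libcublas.so.10.1.
libc_path=$(ldconfig -p | awk '/libc\.so\.6 / {print $NF; exit}')
readelf -d "$libc_path" | grep SONAME
```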
Comment by Konstantin Gizdov (kgizdov) - Thursday, 11 April 2019, 12:51 GMT
Progress update: we now have glibc-2.28-6 in [testing] and cuda-10.1.105-8 in [community-testing], which should resolve the issue. Please try them and see if it works.
Comment by Zhen Xi (Mayrixon) - Thursday, 11 April 2019, 14:37 GMT
Updated packages:
glibc-2.28-6
cuda-10.1.105-8

The problem has been solved. Thank you!
