FS#72231 - mlx5_core module no longer works with Connectx4-LX (LTS Kernel 5.10.68-1-lts)

Attached to Project: Arch Linux
Opened by Michael Brock (hrast) - Friday, 24 September 2021, 22:26 GMT
Last edited by Andreas Radke (AndyRTR) - Wednesday, 29 September 2021, 05:07 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: No-one
Architecture: x86_64
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 0
Private: No

Details

Description:

The mlx5_core driver apparently changed somewhere between 5.10.61 and 5.10.68.

After updating to the current LTS kernel (5.10.68-1-lts), the driver no longer loads correctly:
# dmesg | grep mlx
[ 18.001398] mlx5_core 0000:02:00.0: firmware version: 14.30.1004
[ 18.001430] mlx5_core 0000:02:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 18.219630] mlx5_core 0000:02:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 18.222305] mlx5_core 0000:02:00.0: Port module event: module 0, Cable unplugged
[ 20.335506] mlx5_core 0000:02:00.0: E-Switch: cleanup
[ 21.058695] mlx5_core 0000:02:00.0: init_one:1371:(pid 306): mlx5_load_one failed with error code -22
[ 21.059022] mlx5_core: probe of 0000:02:00.0 failed with error -22
[ 21.059413] mlx5_core 0000:02:00.1: firmware version: 14.30.1004
[ 21.059443] mlx5_core 0000:02:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 21.261641] mlx5_core 0000:02:00.1: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 21.263970] mlx5_core 0000:02:00.1: Port module event: module 1, Cable plugged
[ 22.935551] mlx5_core 0000:02:00.1: E-Switch: cleanup
[ 23.627161] mlx5_core 0000:02:00.1: init_one:1371:(pid 306): mlx5_load_one failed with error code -22
[ 23.627463] mlx5_core: probe of 0000:02:00.1 failed with error -22
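
For reference, the -22 in the log above is the negated errno value 22, which is EINVAL on Linux. Below is a minimal Python sketch, not part of the original report, that pulls failed probes out of dmesg text and decodes the error number. The function name `find_probe_failures` and the regular expression are illustrative assumptions; the pattern is only known to match the lines shown in this report.

```python
import errno
import os
import re

# Matches kernel lines such as:
#   mlx5_core: probe of 0000:02:00.0 failed with error -22
# (pattern is an assumption derived from the log lines in this report)
PROBE_FAILED = re.compile(
    r"(?P<driver>\w+): probe of (?P<device>\S+) failed with error -(?P<code>\d+)"
)


def find_probe_failures(dmesg_text):
    """Return (driver, device, errno name, errno message) per failed probe."""
    failures = []
    for line in dmesg_text.splitlines():
        match = PROBE_FAILED.search(line)
        if match is None:
            continue
        code = int(match.group("code"))
        name = errno.errorcode.get(code, "unknown")
        failures.append(
            (match.group("driver"), match.group("device"), name, os.strerror(code))
        )
    return failures


# Sample lines copied from the failing boot above.
sample = """\
[ 21.058695] mlx5_core 0000:02:00.0: init_one:1371:(pid 306): mlx5_load_one failed with error code -22
[ 21.059022] mlx5_core: probe of 0000:02:00.0 failed with error -22
[ 23.627463] mlx5_core: probe of 0000:02:00.1 failed with error -22
"""

for failure in find_probe_failures(sample):
    print(failure)
```

On a live system the same function could be fed the captured output of `dmesg` instead of the sample string.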

Reverting to the previous kernel (5.10.61-1-lts) resolves the issue:
# dmesg | grep mlx
[ 26.818341] mlx5_core 0000:02:00.0: firmware version: 14.30.1004
[ 26.818370] mlx5_core 0000:02:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 27.019482] mlx5_core 0000:02:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 27.021747] mlx5_core 0000:02:00.0: Port module event: module 0, Cable unplugged
[ 27.032310] mlx5_core 0000:02:00.1: firmware version: 14.30.1004
[ 27.032369] mlx5_core 0000:02:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 27.250396] mlx5_core 0000:02:00.1: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 27.253858] mlx5_core 0000:02:00.1: Port module event: module 1, Cable plugged
[ 27.265641] mlx5_core 0000:02:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 27.475908] mlx5_core 0000:02:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[ 27.492488] mlx5_core 0000:02:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 27.697332] mlx5_core 0000:02:00.1: Supported tc offload range - chains: 4294967294, prios: 4294967295
[ 27.717900] mlx5_core 0000:02:00.0 enp2s0f0np0: renamed from eth0
[ 27.806765] mlx5_core 0000:02:00.1 enp2s0f1np1: renamed from eth1
[ 52.465495] mlx5_core 0000:02:00.1 enp2s0f1np1: Link down

# lspci | grep -i mel
02:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
02:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

# mstconfig -d 02:00.0 q

Device #1:
----------

Device type: ConnectX4LX
Name: MCX4121A-ACA_Ax
Description: ConnectX-4 Lx EN network interface card; 25GbE dual-port SFP28; PCIe3.0 x8; ROHS R6
Device: 02:00.0

Additional info:
* package version:
linux-lts 5.10.68-1

I have two systems with identical cards, and both show the same issue after the upgrade despite having different CPU types (E3-1241 v3 vs. E5-2650 v4).

Closed by Andreas Radke (AndyRTR)
Wednesday, 29 September 2021, 05:07 GMT
Reason for closing: Fixed
Comment by Michael Brock (hrast) - Friday, 24 September 2021, 22:37 GMT
Reverted to 5.10.63 on the other system, and that works now. So the change that broke this was introduced somewhere between .64 and .68 (I think maybe even .67).
Comment by Michael Brock (hrast) - Friday, 24 September 2021, 22:40 GMT
Looking at the changelogs, it was probably this bundle of changes that landed in 5.10.65 (https://lwn.net/Articles/869305/):

drivers/net/ethernet/mellanox/mlx5/core/devlink.c | 52 +++
drivers/net/ethernet/mellanox/mlx5/core/en/fs.h | 6
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c | 10
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 15 +
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 5
drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 18 -
Comment by loqs (loqs) - Friday, 24 September 2021, 22:46 GMT
Comment by Michael Brock (hrast) - Friday, 24 September 2021, 23:39 GMT
Thanks for that. I looked around a bit, but didn't come across that post.
Comment by loqs (loqs) - Sunday, 26 September 2021, 14:33 GMT
5.10.69 contains 473cea4983b582fedb10f84b43e8924716ebc4fc which reverts fe6322774ca28669868a7e231e173e09f7422118.
Can you confirm this resolves the issue?
Comment by Michael Brock (hrast) - Tuesday, 28 September 2021, 16:01 GMT
Yep, after updating to 5.10.69 the problem is resolved.
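
To summarise the data points in this thread, here is a small Python sketch that classifies a kernel release string against the versions reported here. It is an illustration only: the names `parse_release` and `classify` are hypothetical, the version sets contain only what was reported in this task, and releases 5.10.64 through 5.10.67 were never tested here, so they are labelled as such rather than guessed at.

```python
import re

# Data points taken from this thread only; nothing else is implied.
REPORTED_WORKING = {(5, 10, 61), (5, 10, 63)}
REPORTED_BROKEN = {(5, 10, 68)}
REPORTED_FIXED = (5, 10, 69)


def parse_release(release):
    """Turn a release string such as '5.10.68-1-lts' into (5, 10, 68)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def classify(release):
    """Map a kernel release to what this bug report says about it."""
    version = parse_release(release)
    if version in REPORTED_WORKING:
        return "reported working"
    if version in REPORTED_BROKEN:
        return "reported broken"
    if version == REPORTED_FIXED:
        return "reported fixed"
    if (5, 10, 63) < version < (5, 10, 68):
        # Somewhere in this range the regression appeared (assumption:
        # no one in the thread tested these releases individually).
        return "untested, possibly affected"
    return "not covered by this report"


for release in ("5.10.61-1-lts", "5.10.65-1-lts", "5.10.68-1-lts", "5.10.69-1-lts"):
    print(release, "->", classify(release))
```

The running kernel's release string can be obtained with `platform.release()` from the Python standard library and passed to `classify`.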
