FS#63870 - [linux][wireguard] causes system freeze after latest updates

Attached to Project: Arch Linux
Opened by mike (mbalajew) - Saturday, 21 September 2019, 02:23 GMT
Last edited by Christian Hesse (eworm) - Tuesday, 08 October 2019, 07:43 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Christian Hesse (eworm)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 4
Private No

Details

System freezes after running `wg-quick up`

Additional info:
linux 5.3.arch1-1
wireguard-tools 0.0.20190913-1
wireguard-arch 0.0.20190913-2

Steps to reproduce:

1. run `wg-quick up <some wireguard conf file>`
2. system doesn't always freeze immediately, sometimes it takes a few minutes.

Closed by  Christian Hesse (eworm)
Tuesday, 08 October 2019, 07:43 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 5.3.4.arch1-1
Comment by mike (mbalajew) - Saturday, 21 September 2019, 03:04 GMT
I added loglevel=8 to my kernel parameters, ran wg-quick up ..., and found the following in my dmesg output:

[Fri Sep 20 20:53:23 2019] wireguard: WireGuard 0.0.20190913 loaded. See www.wireguard.com for information.
[Fri Sep 20 20:53:23 2019] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[Fri Sep 20 20:53:38 2019] dst_release: dst:00000000ac66f553 refcnt:-1
[Fri Sep 20 20:53:38 2019] dst_release: dst:00000000ac66f553 refcnt:-2
[Fri Sep 20 20:53:38 2019] dst_release: dst:00000000ac66f553 refcnt:-3
[Fri Sep 20 20:53:38 2019] dst_release: dst:00000000ac66f553 refcnt:-4
[Fri Sep 20 20:53:38 2019] dst_release: dst:00000000ac66f553 refcnt:-5
[Fri Sep 20 20:53:38 2019] dst_release: dst:00000000ac66f553 refcnt:-6
[Fri Sep 20 20:53:38 2019] dst_release: dst:00000000ac66f553 refcnt:-7
[Fri Sep 20 20:53:38 2019] dst_release: dst:000000001b776db6 refcnt:-1
[Fri Sep 20 20:53:38 2019] BUG: kernel NULL pointer dereference, address: 0000000000000000
[Fri Sep 20 20:53:38 2019] #PF: supervisor read access in kernel mode
[Fri Sep 20 20:53:38 2019] #PF: error_code(0x0000) - not-present page
[Fri Sep 20 20:53:38 2019] PGD 0 P4D 0
[Fri Sep 20 20:53:38 2019] Oops: 0000 [#1] PREEMPT SMP PTI
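The negative `refcnt` values appear in this log just before the NULL pointer dereference. A minimal sketch for spotting them in saved kernel log text follows; the function name `count_negative_dst_refcnt` is made up for illustration, and the pattern only assumes the `dst_release: dst:... refcnt:-N` format shown above:

```shell
# Hypothetical helper: count dst_release messages with a negative refcnt.
# Reads kernel log text on stdin and prints the number of matching lines.
count_negative_dst_refcnt() {
    # grep -c exits non-zero when nothing matches, so mask that status.
    grep -c -E 'dst_release: dst:[0-9a-f]+ refcnt:-[0-9]+' || true
}

# Usage (not run here): dmesg | count_negative_dst_refcnt
```

A non-zero count on a 5.3 kernel with IPv6 in use would suggest the same symptom as reported here, though it does not prove the same cause.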
Comment by mike (mbalajew) - Saturday, 21 September 2019, 13:36 GMT
I'm not sure if this is helpful, but I wanted to add that I'm not getting any crashes on the -lts or -hardened kernel variants.
Comment by loqs (loqs) - Saturday, 21 September 2019, 15:11 GMT
You could contact the wireguard developers using the mailing list [1] or IRC channel #wireguard on Freenode.

[1] https://www.wireguard.com/#contact-the-team
Comment by mike (mbalajew) - Saturday, 21 September 2019, 16:20 GMT
loqs, thanks for the suggestion. I just chatted with a few people on #wireguard and limiting wireguard to IPv4 traffic fixes the issue. So this seems to be caused by an IPv6 bug in the latest kernel.
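How the tunnel was limited to IPv4 is not stated here. One possible way, assuming the tunnel is brought up from a wg-quick configuration file, is to drop the IPv6 entries (anything containing a colon) from the peer's `AllowedIPs` list. The helper name `ipv4_only` below is hypothetical:

```shell
# Hypothetical helper: keep only the IPv4 entries of a comma-separated
# AllowedIPs value. IPv6 entries are recognised by the ':' they contain.
ipv4_only() {
    printf '%s\n' "$1" \
        | tr ',' '\n' \
        | sed 's/^ *//; s/ *$//' \
        | grep -v ':' \
        | paste -sd, -
}

# Usage: ipv4_only '0.0.0.0/0, ::/0'    # prints 0.0.0.0/0
```

Any IPv6 `Address` lines in the same file would presumably need the same treatment; that part is not covered by this sketch.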
Comment by loqs (loqs) - Saturday, 21 September 2019, 16:27 GMT
Can you reproduce the issue using IPv6 without the wireguard module loaded?
Also can you try applying the fix from https://lore.kernel.org/netdev/20190919171236.111294-1-edumazet%40google.com/
Comment by mike (mbalajew) - Saturday, 21 September 2019, 17:03 GMT
Loqs, yes, I'll apply the patch and get back to you. Regarding reproducing the issue without wireguard, no, I haven't tried anything, yet, but others have, I think; see this discussion: https://old.reddit.com/r/archlinux/comments/d73jca/panics_and_null_pointer_derereferences_when_using/
Comment by mike (mbalajew) - Saturday, 21 September 2019, 19:04 GMT
loqs, tried the patch and unfortunately it still crashes with wireguard.
Comment by loqs (loqs) - Saturday, 21 September 2019, 19:09 GMT
If someone who can reproduce the issue without using wireguard could report it upstream, that would be preferable,
because until wireguard is merged into the kernel it is officially an unsupported module that taints the kernel.
Alternatively, bisecting between 5.2 and 5.3 should locate the causal commit, which could then be reported upstream.
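For readers unfamiliar with bisecting: git picks kernel commits between a known-good and a known-bad version, and each one is built and tested until the first bad commit is found. A rough sketch, assuming a clone of the mainline kernel tree in which the release tags `v5.2` and `v5.3` exist (the function names and the `~/src/linux` path in the usage comment are hypothetical):

```shell
# Start a bisect between the first bad and last good release.
# $1 is the path to a kernel git tree containing the v5.2 and v5.3 tags.
start_bisect() {
    git -C "$1" bisect start v5.3 v5.2
}

# After building and booting the commit git checked out, record the result:
# "bad" if wg-quick up still freezes the system, "good" otherwise.
mark_result() {
    git -C "$1" bisect "$2"
}

# Usage (not run here):
#   start_bisect ~/src/linux
#   mark_result ~/src/linux bad       # or: good
#   git -C ~/src/linux bisect reset   # when finished
```

Each `mark_result` call makes git check out the next commit to test, so the build, boot and test cycle repeats until git names a single commit.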
Comment by mike (mbalajew) - Saturday, 21 September 2019, 19:15 GMT
loqs, ok, will do. But just out of curiosity, could you suggest something besides wireguard to test with? In the meantime, I'll try bisecting as you mentioned.
Comment by loqs (loqs) - Saturday, 21 September 2019, 20:55 GMT
Does connecting to a website using IPV6 trigger the issue e.g. https://ipv6.google.com ?
You could try limiting the bisect to the path net to try speeding the process up.
Edit:
To save possibly wasted effort, how did you apply the patch from the mailing list?
Comment by mike (mbalajew) - Saturday, 21 September 2019, 21:48 GMT
Since the change was only one line, I thought I could get away with just a single sed command. So this is what my PKGBUILD looks like:
prepare() {
cd $_srcname
sed -i "321s/|/\&/g" net/ipv6/ip6_fib.c
...
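Editing by line number is fragile, because the sed command silently changes whatever happens to be on line 321 of the checked-out source. A hedged variant of the same one-line edit that first checks the line contains the text you expect; `patch_line_checked` and its arguments are hypothetical, and GNU sed is assumed for `-i`:

```shell
# Hypothetical helper: replace every '|' with '&' on one line of a file,
# but only if that line currently contains the expected text.
#   $1 = file, $2 = line number, $3 = text expected on that line
patch_line_checked() {
    if sed -n "${2}p" "$1" | grep -qF -- "$3"; then
        sed -i "${2}s/|/\\&/g" "$1"
    else
        echo "line $2 of $1 does not contain: $3" >&2
        return 1
    fi
}

# Usage inside prepare() (the expected text is a placeholder to fill in):
#   patch_line_checked net/ipv6/ip6_fib.c 321 '<text expected on line 321>'
```

If the check fails, the build stops instead of producing a kernel with an unintended change.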
Comment by mike (mbalajew) - Saturday, 21 September 2019, 21:51 GMT
Oh, well, I should probably be more precise. First, I cloned the kernel repo by running:
git clone --single-branch -b packages/linux https://projects.archlinux.org/svntogit/packages.git
and then modified the PKGBUILD in the repos/core-x86_64 directory.
Comment by mike (mbalajew) - Saturday, 21 September 2019, 22:01 GMT
And to answer your first question, no, visiting a website using IPV6 (e.g. https://ipv6.google.com) does not seem to trigger the issue. And here I am, of course, talking about the vanilla 5.3.0 kernel, so no patch applied.
Comment by loqs (loqs) - Saturday, 21 September 2019, 23:13 GMT
If dst_release is significant then I would suggest:
git revert d64a1f574a2957b4bcb06452d36cc1c6bf16e9fc
git revert -m 1 7d30a7f6424e88c958c19a02f6f54ab8d25919cd

A patch of the resulting diff is attached. If you are already bisecting, ignore this.
   tmp.diff (12.9 KiB)
Comment by mike (mbalajew) - Sunday, 22 September 2019, 00:09 GMT
Hmm, well, I think at this point we might have reached the limit of my coding abilities :) I could continue fumbling around trying to figure this all out, and it would definitely be fun to learn, but I suspect your time is probably better spent doing more important things. And given that I misunderstood your original request to try the patch, I worry I might also have misunderstood your other request about "bisecting between 5.2 and 5.3". I was under the impression you just wanted me to downgrade the kernel (from the official Arch repos) until we identify the origin of the bug. But now I have a feeling you're talking about something much more specific. Is that the case? Anyway, if at this point I'm becoming more of a nuisance, don't be shy and let me know :D
Comment by Piotr (piorekf) - Sunday, 22 September 2019, 13:21 GMT
@loqs:
On Gentoo I'm experiencing the same crash. Applying your tmp.diff fixed the bug for me. Wireguard with IPv6 looks to be working fine now. Thanks.

What should we do to report this upstream?
Comment by mike (mbalajew) - Sunday, 22 September 2019, 15:45 GMT
After enabling the testing repos, I updated the kernel to 5.3.1 and now wireguard with IPv6 seems to be working just fine. This is strange because it was my understanding that the tmp.diff fix from above was not included in 5.3.1, or am I missing something?
Comment by loqs (loqs) - Sunday, 22 September 2019, 18:23 GMT
@Piotr ideally, narrow down which commit is triggering it, then report it upstream, probably to the netdev mailing list.

d64a1f574a29 ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logic
7d30a7f6424e Merge branch 'ipv6-avoid-taking-refcnt-on-dst-during-route-lookup' #merge contains the commits below so can be ignored when reverting the commits one by one
74109218b051 ipv6: initialize rt6->rt6i_uncached in all pre-allocated dst entries
7d9e5f422150 ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF
0e09edcce7ad ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route()
67f415dd2906 ipv6: convert rx data path to not take refcnt on dst

Updated tmp.diff without 67f415dd2906 to check if that is the cause.

@mbalajew - 5.3.1 does not contain any of the changes from tmp.diff https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.3.1

Comment by mike (mbalajew) - Sunday, 22 September 2019, 23:49 GMT
I take that back, still crashing on 5.3.1. But I'm not crazy. Wireguard worked perfectly fine this morning when I was working at a coffee shop but then started crashing again after I got home and connected to my home wifi. Very strange. I guess there was something about the configuration of the coffee shop wifi that prevented the bug from surfacing.
Comment by Piotr (piorekf) - Sunday, 22 September 2019, 23:53 GMT
@mike:
I observed that too: works great on company wifi and cable, crashes at home and in the hackerspace.

@loqs:
So far I have compiled the kernel with debugging symbols and run gdb on it as described here: https://www.kernel.org/doc/html/latest/admin-guide/bug-hunting.html
And this is what gdb told me:

(gdb) l *fib6_rule_action+0xda
0xffffffff819e6cba is in fib6_rule_action (./include/net/ip6_fib.h:212).
207 for (rt = (w)->leaf; rt; \
208 rt = rcu_dereference_protected(rt->fib6_next, 1))
209
210 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
211 {
212 return ((struct rt6_info *)dst)->rt6i_idev;
213 }
214
215 static inline void fib6_clean_expires(struct fib6_info *f6i)
216 {
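The `fib6_rule_action+0xda` location passed to gdb normally comes from the `RIP:` line of the oops. A small sketch of that lookup, assuming the usual `RIP: 0010:symbol+0xoffset/0xsize` layout of that line and a `vmlinux` built with debug info; both function names are made up for illustration:

```shell
# Hypothetical helper: extract "symbol+0xoffset" from the first RIP line
# found in kernel log text read on stdin.
rip_symbol() {
    sed -n -E 's#.*RIP: [0-9a-f]+:([A-Za-z0-9_.]+\+0x[0-9a-f]+)/0x[0-9a-f]+.*#\1#p' \
        | head -n 1
}

# Hypothetical helper: show the source around that location.
#   $1 = path to vmlinux with debug info, $2 = location such as func+0xda
list_oops_location() {
    gdb -batch -ex "list *${2}" "$1"
}

# Usage (not run here):
#   loc=$(dmesg | rip_symbol)
#   list_oops_location ./vmlinux "$loc"
```

The offsets differ between builds, as the two listings in this thread show, so the `vmlinux` has to match the kernel that produced the oops.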
Comment by loqs (loqs) - Monday, 23 September 2019, 00:16 GMT
@piorekf is that from 5.3 with no patches?
Comment by Piotr (piorekf) - Monday, 23 September 2019, 12:26 GMT
That was from 5.3.1 gentoo-sources so with small patches which Gentoo adds. On a vanilla 5.3.1 I get the same thing, just the addresses are slightly different:

(gdb) l *fib6_rule_action+0xe0
0xffffffff819c8490 is in fib6_rule_action (./include/net/ip6_fib.h:212).
207 for (rt = (w)->leaf; rt; \
208 rt = rcu_dereference_protected(rt->fib6_next, 1))
209
210 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
211 {
212 return ((struct rt6_info *)dst)->rt6i_idev;
213 }
214
215 static inline void fib6_clean_expires(struct fib6_info *f6i)
216 {
Comment by loqs (loqs) - Monday, 23 September 2019, 17:00 GMT
https://bugs.archlinux.org/task/63855#comment182065 reports the second tmp.diff also works.
I have reduced the number of reverted commits again; now only three commits are reverted.
Please test if the latest tmp.diff still works.
   tmp.diff (8.2 KiB)
Comment by Piotr (piorekf) - Monday, 23 September 2019, 17:05 GMT
For me everything works with only this one reverted: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7d9e5f422150ed00de744e02a80734d74cc9704d
I also reported my findings on #wireguard on freenode and zx2c4 is looking into it too.
Comment by persson (persson) - Monday, 23 September 2019, 18:06 GMT
@piorekf @loqs reverting that single commit fixes the panics for me too.
Comment by Jason A. Donenfeld (zx2c4) - Tuesday, 24 September 2019, 06:36 GMT
Can you let me know if <https://lore.kernel.org/netdev/20190924073615.31704-1-Jason@zx2c4.com/raw> fixes it for you?
Comment by Piotr (piorekf) - Tuesday, 24 September 2019, 08:05 GMT
Can confirm that it's working for me.
Comment by persson (persson) - Tuesday, 24 September 2019, 09:42 GMT
@zx2c4 works for me too with just that patch applied.
Comment by Ronan Pigott (Brocellous) - Tuesday, 08 October 2019, 07:32 GMT
This is fixed now, right? https://git.archlinux.org/linux.git/commit/?h=v5.3.4-arch1&id=ecc265624956ea784cb2bd2b31a95bd54c4f5f13

So can we mark this resolved? Wireguard works for me right now on 5.3.5
Comment by Christian Hesse (eworm) - Tuesday, 08 October 2019, 07:42 GMT
Yes, fixed with 5.3.4.
