FS#58355 - [linux][linux-lts] crng init really slow

Attached to Project: Arch Linux
Opened by qwerty (macrocdd) - Wednesday, 25 April 2018, 22:56 GMT
Last edited by Andreas Radke (AndyRTR) - Tuesday, 17 March 2020, 09:53 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Tobias Powalowski (tpowa)
Andreas Radke (AndyRTR)
Jan Alexander Steffens (heftig)
Christian Hesse (eworm)
Levente Polyak (anthraxx)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 6
Private No

Details

Description:
After updating linux-lts 4.14.35-1 -> 4.14.36-1, the kernel takes about 10 seconds to reach "random: crng init done" during boot. The system boots from an SSD; the CPU is an Intel Core i3.

Additional info:
* 4.14.36-1
* # journalctl -b


Steps to reproduce:
Apr 26 01:35:06 archlabs ntpd[383]: Listen normally on 4 wlan0 192.168.1.9:123
Apr 26 01:35:06 archlabs ntpd[383]: Listen normally on 5 wlan0 [fe80::7867:3b18:41>
Apr 26 01:35:06 archlabs ntpd[383]: new interface(s) found: waking up resolver
Apr 26 01:35:17 archlabs kernel: random: crng init done
Apr 26 01:35:17 archlabs systemd[452]: Started D-Bus User Message Bus.
Apr 26 01:35:19 archlabs systemd[452]: Starting Sound Service...

Closed by  Andreas Radke (AndyRTR)
Tuesday, 17 March 2020, 09:53 GMT
Reason for closing:  Won't fix
Additional comments about closing:  if boot is stalling, try adding this: random.trust_cpu=1
Comment by qwerty (macrocdd) - Thursday, 26 April 2018, 00:03 GMT
After rolling back the kernel, the "random: fast init done" line no longer appears at the beginning of boot.
Comment by loqs (loqs) - Thursday, 26 April 2018, 00:52 GMT
Comment by loqs (loqs) - Sunday, 29 April 2018, 15:24 GMT
If the system has a TPM is the issue still present in 4.14.37?
Comment by Dimos Dimoulis (dimosd) - Monday, 30 April 2018, 13:31 GMT
I also have this problem, crng init is delayed 45 secs. Still affects 4.14.38.
Comment by loqs (loqs) - Monday, 30 April 2018, 14:21 GMT
Please bisect between 4.14.35 and 4.14.36 and report the bad commit upstream.
Comment by qwerty (macrocdd) - Tuesday, 01 May 2018, 09:08 GMT
The 4.16.5 kernel also has this bug.

Comment by qwerty (macrocdd) - Thursday, 03 May 2018, 09:30 GMT
The problem can be worked around, though not very cleanly:
# pacman -S haveged
# systemctl enable haveged
Comment by Dimos Dimoulis (dimosd) - Thursday, 03 May 2018, 09:46 GMT
I had noticed that pressing a few keys to provide some entropy quickly initialized the random generator.
There are also several messages such as this:
random: systemd: uninitialized urandom read (16 bytes read)
Could it be that the recent kernel changes broke systemd? And why does this only affect a few people?
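To see which processes read from urandom before the CRNG is ready, the warnings quoted above can be counted per process. The following is only a minimal sketch: the message format is taken from the single line quoted in this comment, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern follows the message quoted above, e.g.
# "random: systemd: uninitialized urandom read (16 bytes read)".
UNINIT_READ = re.compile(
    r"random: (?P<proc>.+?): uninitialized urandom read \((?P<n>\d+) bytes read\)"
)


def uninitialized_reads(log_text):
    """Count uninitialized urandom reads per process name in a kernel log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = UNINIT_READ.search(line)
        if match:
            counts[match.group("proc")] += 1
    return counts
```

The text passed in could be the saved output of `journalctl -k -b` or `dmesg`; the function itself does not touch the system.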
Comment by loqs (loqs) - Thursday, 03 May 2018, 11:21 GMT
Once you have located which commit is the cause you could discuss that commit upstream with the kernel and systemd developers.
Comment by loqs (loqs) - Saturday, 05 May 2018, 21:06 GMT
Comment by Andreas Radke (AndyRTR) - Sunday, 06 May 2018, 07:41 GMT
Only a few people seem to be affected. Better to use custom builds or stay with older package versions until an upstream solution is available.
Comment by Antonio Tessarolo (anthonytex) - Saturday, 09 June 2018, 13:51 GMT
Same here with 4.16.12-1-ARCH
Comment by Christian Galander (twoCore) - Saturday, 23 June 2018, 07:45 GMT
It looks like only systems without a TPM are affected:

- Fujitsu PC, Intel Core i3-4150, no TPM built-in ( delayed up to 30s - Kernel 4.17.2 )
- Acer Swift 3, Intel Core i5-8250U, TPM built-in ( no delay during boot - Kernel 4.17.2 )

Regards from Germany
Comment by Dimos Dimoulis (dimosd) - Saturday, 23 June 2018, 15:31 GMT
For me and without TPM, only linux-lts is affected. linux-4.17.2 is not affected. Also, a custom 4.14 build with a configuration based on linux-stable, is not affected.
Reverting the patch as suggested gave me stability problems with suspend/resume.
Comment by Tobias Powalowski (tpowa) - Sunday, 24 June 2018, 19:12 GMT
I'm also affected on 4.17.2.
Comment by tleo (tleo) - Wednesday, 04 July 2018, 07:48 GMT
I'm also affected on two of my machines (on 4.14 and 4.17), none of them has tpm.
Comment by loqs (loqs) - Sunday, 29 July 2018, 21:02 GMT
Comment by loqs (loqs) - Tuesday, 31 July 2018, 08:09 GMT
Comment by Jan (medhefgo) - Tuesday, 31 July 2018, 20:09 GMT
FYI, that doesn't fix the issue at all. It only mixes in rdrand entropy to any entropy provided by userspace. Which is funny when you use rng-tools to work around this: mixing rdrand entropy into rdrand entropy. Why not just give us a kernel command line option to tell the kernel to trust rdrand? Intel has much better options available to fuck us over rather than surreptitiously tamper with their hardware rng.
Comment by Jan (medhefgo) - Monday, 08 October 2018, 15:16 GMT
Comment by loqs (loqs) - Friday, 26 October 2018, 19:59 GMT
Can those affected test linux 4.19.arch1-1, which has CONFIG_RANDOM_TRUST_CPU=y?
Comment by Jan (medhefgo) - Friday, 26 October 2018, 20:16 GMT
4.19.arch1-1 works for me. Though, should this really be enabled by default considering a lot of people distrust their CPU vendors?
Comment by Jan Alexander Steffens (heftig) - Sunday, 28 October 2018, 09:33 GMT
I'll probably revert the config change so you will have to boot with the parameter.
Comment by Dimos Dimoulis (dimosd) - Sunday, 28 October 2018, 09:46 GMT
I haven't yet tried 4.19, however if you do revert the option please keep it as a boot time parameter. Without it, the system appears to hang until I press a few keys and I think this is unacceptable as default behaviour. Since the kernel has several sources of entropy and the lack of trust for the CPU is more of a problem in virtual machines and such, maybe you should consider keeping CONFIG_RANDOM_TRUST_CPU=y as default and letting people change it if they so wish.
Comment by Jensen McKenzie (your_doomsday) - Sunday, 28 October 2018, 18:36 GMT
"please keep it as a boot time parameter."

This isn't a distro choice. The boot parameter will always exist.

"maybe you should consider keeping CONFIG_RANDOM_TRUST_CPU=y as default and letting people change it if they so wish."

The problem with this is that people who may want to switch it off won't be aware that such a thing exists in the first place, as there won't be any visible changes in their system unless they look under the hood. On the other hand, with CONFIG_RANDOM_TRUST_CPU=N, people who may want to switch it on will have to be aware of it, otherwise their system won't boot. CONFIG_RANDOM_TRUST_CPU=N is also the same behaviour as before Linux 4.19, so having to switch something on or use other tools like haveged won't be a regression.
Comment by Dimos Dimoulis (dimosd) - Monday, 29 October 2018, 09:46 GMT
The behaviour introduced in 4.18 was causing problems for some people. I think it was mentioned that Fedora went as far as reverting the patch. We'll have to wait and see how other distros are handling this in 4.19, but I am guessing that they will default to =Y, because it will cause fewer problem reports for them.
Also, not trusting the CPU and its RNG really only affects the first seconds of booting: after that, network activity, keyboard etc. mix in more entropy. This option was introduced for certain low entropy situations, such as virtual machines. If someone wants to be extra cautious and it doesn't cause problems, then they can enable it (and it would be advertised in wiki/Security), but imho it's not a "must have" feature for everyone.
Comment by Jensen McKenzie (your_doomsday) - Monday, 29 October 2018, 10:58 GMT
"Also, not trusting the CPU and its RNG really only affects the first seconds of booting"

That's right, but that's the actual concern. The CPU RNG was always added to the entropy mix AFTER boot and that wasn't controversial at all. CONFIG_RANDOM_TRUST_CPU=Y allows the CPU RNG to be used as the seed for entropy in early boot, which theoretically can affect the further entropy mix in a deterministic way.

"If someone wants to be extra cautious and it doesn't cause problems, then they can enable it"

Reading from context I think you meant "disable" not "enable".

"it's not a "must have" feature for everyone"

I know you mean the opposite, but this fits perfectly: CONFIG_RANDOM_TRUST_CPU=y is not a "must have" feature for everyone. I bet it's needed for 1% of use cases.
Comment by Dimos Dimoulis (dimosd) - Friday, 09 November 2018, 09:53 GMT
https://lkml.org/lkml/2018/7/17/1279

A discussion about the original patch, its intention, pros and cons.
Comment by Dimos Dimoulis (dimosd) - Saturday, 12 January 2019, 07:25 GMT
With 4.19 kernels, CONFIG_RANDOM_TRUST_CPU=n doesn't cause too much of a delay any more.
[ 4.206676] random: crng init done
Even if it did, the solution is to use random.trust_cpu=on
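For anyone who wants to compare kernels, the timestamp of the "crng init done" line can be pulled out of a saved kernel log. This is a rough sketch that assumes the dmesg-style format shown above (seconds since boot in square brackets); the function name is only illustrative.

```python
import re

# Matches dmesg-style lines such as "[    4.206676] random: crng init done".
CRNG_DONE = re.compile(r"\[\s*(\d+\.\d+)\]\s+random: crng init done")


def crng_init_seconds(log_text):
    """Return the seconds-since-boot timestamp of 'crng init done', or None."""
    for line in log_text.splitlines():
        match = CRNG_DONE.search(line)
        if match:
            return float(match.group(1))
    return None
```

Running it over the `dmesg` output of a boot with and without random.trust_cpu=on should make the difference in delay easy to compare.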
Comment by loqs (loqs) - Saturday, 12 January 2019, 19:30 GMT
@dimosd see FS#61233 for a recent case of the delay being in the tens of seconds range and random.trust_cpu=on having no effect due to the absence of RDRAND and RDSEED. haveged did resolve the issue.
Systemd 240 will prefer using the RDRAND processor instruction over /dev/urandom whenever it requires randomness that neither has to be crypto-grade nor should be reproducible.
Possibly close as Not a bug / Won't Fix as this is what upstream has chosen to do?
Comment by Dimos Dimoulis (dimosd) - Sunday, 13 January 2019, 10:19 GMT
With systemd 240, linux 4.19.14, cpu supports RDRAND: I get delays of 4-24 secs and no delay if I use random.trust_cpu=on
If the cpu didn't support RDRAND then I would have to use haveged.
These are the two known workarounds and I don't think there's anything more we can do for now.

https://www.phoronix.com/scan.php?page=news_item&px=Systemd-RdRand-Direct also mentions a "high_quality_required" systemd option.
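The choice between the two workarounds above can be sketched as a small check on the contents of /proc/cpuinfo and /proc/cmdline. This is only an illustration under some assumptions: the accepted "enabled" spellings are a guess at what the kernel parses, the compile-time default (CONFIG_RANDOM_TRUST_CPU) is ignored, and all function names are invented here.

```python
# Assumed spellings that count as "enabled" for random.trust_cpu.
TRUE_VALUES = {"1", "y", "yes", "on"}


def cpu_flags(cpuinfo_text):
    """Return the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def trust_cpu_enabled(cmdline_text):
    """Return True if the last random.trust_cpu= option enables it."""
    enabled = False
    for token in cmdline_text.split():
        if token.startswith("random.trust_cpu="):
            enabled = token.split("=", 1)[1].lower() in TRUE_VALUES
    return enabled


def suggest_workaround(cpuinfo_text, cmdline_text):
    """Pick one of the two workarounds discussed in this task."""
    if "rdrand" not in cpu_flags(cpuinfo_text):
        return "no RDRAND: use a userspace entropy daemon such as haveged"
    if trust_cpu_enabled(cmdline_text):
        return "random.trust_cpu is already enabled"
    return "RDRAND present: consider booting with random.trust_cpu=on"
```

On a running system the two arguments would be the text read from /proc/cpuinfo and /proc/cmdline.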
Comment by Andreas Radke (AndyRTR) - Wednesday, 04 September 2019, 05:46 GMT
With the recent systemd 243.0 release the issue gets worse. The LTS kernel now hangs on both of my systems.

I guess this is related to the bug linked here and can be resolved by updating haveged and fixing its service file.
https://github.com/systemd/systemd/issues/13252 and

@eworm - can you please have a look?
Comment by Jan Alexander Steffens (heftig) - Wednesday, 04 September 2019, 07:56 GMT
@mtorromeo I think the rngd.service unit is also lacking. It specifies WantedBy=sysinit.target but that doesn't do what you want without a few more directives:

DefaultDependencies=no
Before=sysinit.target shutdown.target

(Technically also Conflicts=shutdown.target but I'm not sure shutting down the entropy gatherer while systemd still needs bits is a good idea)
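A quick way to spot units with the gap described above is to parse the unit file and look for the two directives. The sketch below is an assumption-laden illustration: it uses Python's configparser as a rough stand-in for systemd's own parser, only looks at the [Unit] section, and the function name is made up.

```python
import configparser


def early_boot_ordering_issues(unit_text):
    """List which of the directives named above are missing from a unit."""
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, delimiters=("=",)
    )
    parser.optionxform = str  # keep directive names case-sensitive
    parser.read_string(unit_text)
    unit = parser["Unit"] if parser.has_section("Unit") else {}

    issues = []
    if unit.get("DefaultDependencies", "yes").strip().lower() != "no":
        issues.append("DefaultDependencies=no is missing")
    if "sysinit.target" not in unit.get("Before", "").split():
        issues.append("Before= does not list sysinit.target")
    return issues
```

An empty list means both directives are present; it says nothing about whether the rest of the unit is correct.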
Comment by Andreas Radke (AndyRTR) - Wednesday, 04 September 2019, 11:22 GMT
Fixed for me with haveged 1.9.6-1 using 4.19 lts kernel. Please test it.
Comment by Massimiliano Torromeo (mtorromeo) - Wednesday, 04 September 2019, 12:50 GMT
Published rng-tools-6.7-2 with the changes to dependency resolution.
Comment by Andreas Radke (AndyRTR) - Monday, 18 November 2019, 21:19 GMT
Is this still an issue? (Works for me.)
Comment by loqs (loqs) - Monday, 18 November 2019, 22:11 GMT
5.4 is adding https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=50ee7529ec4500c88f8664560770a7a1b65db72b which should reduce the issue.
It is still reported as an issue on the forums for new installs with GDM that have no rotational media and require no keyboard input before GDM starts.
Edit:
https://lore.kernel.org/lkml/alpine.DEB.2.21.1909290010500.2636%40nanos.tec.linutronix.de/
Comment by Dimos Dimoulis (dimosd) - Tuesday, 19 November 2019, 09:34 GMT
I still have problems with it so I decided to enable random.trust_cpu=on in boot options. I would like to point out that
- haveged is not necessarily a good alternative to RDRAND because it has the same problem: it cannot be verified
- The following distributions have now enabled CONFIG_RANDOM_TRUST_CPU: Debian, Ubuntu, Fedora, Alpine. In particular Ubuntu and Alpine had initially disabled it but later reverted it because of bug reports.
Any decision on the default would be a compromise: CONFIG_RANDOM_TRUST_CPU=y deals with the known while =n deals with the unknown.
Comment by Levente Polyak (anthraxx) - Tuesday, 19 November 2019, 10:13 GMT
From a technical reasoning standpoint it's totally not important what other distros do. Other distros also heavily patch their software, so should we do that as well?
Back to technical reasoning:
Right now it's an opt-in, as it should be. People who really need to trust the CPU's RNG because they are affected can opt in, while the majority of people who are not affected are not exposed. The only thing that really is unknown is random.trust_cpu itself; I fail to see where all the blind trust comes from. Hardware vendors mostly choose performance over security to compete on the market. With all the recent discoveries of Spectre/Meltdown/L1TF/MDS/TAA/iTLB, I fail to see why we would want to blindly trust the RNG on a global scale instead of making it opt-in for those who _really_ need it. The CPU RNGs are closed-spec, not auditable in the classical sense, a non-blocking infinite source of numbers, and are considered safe purely by the vendors themselves because "it's surely too complicated for anyone to ever understand what influences it so it must be safe".

The current setting of disabling it by default is the only sane option for a user base that is considered technically competent enough to enable it themselves if they really are affected and have decided for themselves that it's fine to trust it. It could surely be documented better in the wiki (contributions welcome) but the setting shall not be changed.
Comment by Dimos Dimoulis (dimosd) - Tuesday, 19 November 2019, 11:13 GMT
The good thing about Arch Linux is that it gives users a lot of flexibility to make their own decisions, in case they disagree with the decisions made by the distribution for them. But sane defaults are still important.
>From a technical reasoning standpoint it's totally not important what other distros do
I very much disagree here. Arch Linux maintainers are highly capable, but so are the maintainers of other respected distributions. There have been discussions elsewhere on the subject and the current consensus is that disabling CONFIG_RANDOM_TRUST_CPU causes harm, most often in low entropy situations such as virtual machines but in real hardware as well.
For me the sane option would be to ship bug-free. An extra security conscious user should disable RDRAND (if possible) and also disable hyper-threading for instance, but we don't do it by default because it causes a large performance penalty. There has been no demonstration of exploiting RDRAND to my knowledge, unlike HT.
Comment by Levente Polyak (anthraxx) - Tuesday, 19 November 2019, 13:05 GMT
Security is not something that comes from opting out of thousands of switches. Security must come by default and allow users who do not want something for some reason to opt out; that's literally the only sane approach to security.
Users have the flexibility to turn it on, like you did. No insecure default is needed, no matter what you claim. Again: why on earth should something be considered a good source of entropy for early boot (which fundamentally determines how secure KASLR will be) if it's purely based on closed specs and deemed secure only by its creators because "it's surely too complicated for anyone to ever understand what influences it so it must be safe"?
The good thing about our user base is that we can expect competence: we neither enable, start nor restart systemd units, and we expect users to configure their systems how they like. We are neither Debian nor Ubuntu. You are mixing up my statement: it's not about the package maintainers of the distros, it's about the expectations related to the user base, which frankly is fundamentally different from distros like Ubuntu and Debian.
Comment by Dimos Dimoulis (dimosd) - Tuesday, 19 November 2019, 14:48 GMT
Fine. I understand your concerns about security and I am not suggesting that we should be sloppy. However, I have been banging on the keyboard to get the system to boot for the past few months, exactly to avoid turning this option on, until it occurred to me that the situation is slightly ridiculous.
Comment by Levente Polyak (anthraxx) - Tuesday, 19 November 2019, 15:33 GMT
That's why I suggested we improve the documentation in the wiki and make this use case easier to spot and address, with a small subsection on the implications, just to raise a little awareness of what people may turn on :)
Comment by Eli Schwartz (eschwartz) - Tuesday, 19 November 2019, 20:20 GMT
> >From a technical reasoning standpoint it's totally not important what other distros do

> I very much disagree here. Arch Linux maintainers are highly capable, but so are the maintainers of other respected distributions. There have been discussions elsewhere on the subject and the current consensus is that disabling CONFIG_RANDOM_TRUST_CPU causes harm, most often in low entropy situations such as virtual machines but in real hardware as well.

Then do not defend your stance by saying "but some other distros did it". Defend your stance by saying "the well-respected maintainers of X distro had the following observation to make on the pros/cons of it, and I agree with their analysis".

Levente is saying, let's focus on merit-based arguments rather than simply blindly trusting another distro's judgment calls. Based purely on this argument, I have no idea why those distros made the decision they did.

Given one of the example distros is Ubuntu, it is plausible to me that the rationale was "users know nothing about computers and most of them don't have anything secure, but the distro comes preinstalled with gdm so we should optimize for this use case".

This fails to apply to archlinux for a whole bunch of reasons, including the fact that gdm is not preinstalled and the archlinux user base is explicitly targeted at people who tend to have biased ideas like "gnome is evil and DEs in general sort of suck, let me use this niche tiling WM that really enhances my personal use". I somehow doubt the i3 and sway users use gdm!

The logical conclusion here is that *iff* the tradeoff by Ubuntu and others was "our gdm users are sufficiently problematic that we're willing to make security sacrifices on their behalf", we should be definitively doing the opposite.

So: what technical arguments did these other distros use? Maybe something from them applies to us as well.
Comment by Dimos Dimoulis (dimosd) - Wednesday, 20 November 2019, 09:00 GMT
What I am trying to do is summarize 1.5 years' experience of others on the subject. It's not a black and white issue, because it's a policy decision rather than a purely technical one. It introduces bugs, therefore it has a cost.
Some use cases that may trigger the bug:
- Encrypted swap (read from /dev/urandom, block booting early)
- Losing connection to a remote server after reboot for some time (booting blocks before the network is initialized)
- Initializing encrypted LVM on a low entropy system leads to reduced security

Some comments:
https://lists.debian.org/debian-devel/2018/12/msg00204.html (the whole thread brings several pros-and-cons)
https://gitlab.alpinelinux.org/alpine/aports/issues/9960

The above are mostly server oriented and thus security minded distros with a technical user base.
Comment by Jan Alexander Steffens (heftig) - Wednesday, 20 November 2019, 09:28 GMT
Please don't derail this issue; it's not just a problem for GDM.

As far as my opinion goes, I would enable RANDOM_TRUST_CPU. Stalling boots are quite painful (especially if it's not obvious why) and I think we're better served with the smoother experience than satisfying our paranoia about attacks on the RNG that haven't been demonstrated (especially if it would have to be pre-boot or early-boot).

But I also agree that our users should be competent enough to discover "if boot is stalling, try random.trust_cpu=1".

What about enabling it after the active entropy generation lands in our kernel? Theodore Ts'o (who added the above config) thinks the HWRNG is trustworthier than the jitter entropy and I'm inclined to agree with him.
Comment by Levente Polyak (anthraxx) - Wednesday, 20 November 2019, 09:43 GMT
No attacks? Well, for the bugs that have already appeared no "attack" was needed: the RNG constantly returned the same value, and even systemd needed to work around it:
https://github.com/systemd/systemd/commit/b62bc66018fa1ada09554e7ee46abbbfc8e6b3ad
And yet another set of kernel workaround patches to handle this borked hardware/firmware combination: https://lore.kernel.org/patchwork/patch/1115413/.
And yes, this is a CVE-worthy hardware issue with the RNG, so please stop always dragging security discussions down to "paranoia". You can leave that part out and still be technically reasonable in your arguments.
Comment by Dimos Dimoulis (dimosd) - Wednesday, 20 November 2019, 10:50 GMT
The AMD RDRAND bugs are well known and fairly recent, but they are not the reason the kernel stopped trusting RDRAND. And I am not saying it should trust it, provided there were an alternative, but there isn't one in all cases (yet?). So disabling RDRAND will cause breakage for some for the time being. This is why I think this should be a "hardening" option rather than the default. Besides, it should be mentioned that even when RDRAND is enabled it is mixed with other (lower quality) sources of entropy.
Comment by Levente Polyak (anthraxx) - Wednesday, 20 November 2019, 10:55 GMT
@dimosd: You are repeating yourself, so we are going full circle now. Sane hardening must come by default; that's how security must work. Let's not repeat the same arguments over and over again. heftig provided new ones and I am going to think about them, but we are wasting everyone's time if we go in circles all over again.
PS: Read the kernel docs: the kernel itself neither blindly trusts nor blindly mistrusts it; it's a matter of downstream choice.
PPS: An alternative is using the TPM's RNG in case you have a TPM, if someone decides to trust that more than the CPU, but that requires defining the trust in terms of rng_core.default_quality.
PPPS: I wouldn't call a bug known since 2014 fairly recent: https://bugzilla.kernel.org/show_bug.cgi?id=85911
Comment by Dimos Dimoulis (dimosd) - Monday, 25 November 2019, 18:39 GMT
Not sure if those uninitialized urandom reads are safe.
   crng.txt (0.7 KiB)
Comment by Andreas Radke (AndyRTR) - Wednesday, 11 December 2019, 13:05 GMT
I'm for staying on the safe side. The workaround to add random.trust_cpu=1 to the kernel boot prompt is well documented.

Maybe someone is willing to write a note in our wiki so the solution can be found easily.
The best place would be either https://wiki.archlinux.org/index.php/Arch_boot_process or https://wiki.archlinux.org/index.php/Random_number_generation,
maybe with links to each other.

Then we should close this issue.
