FS#54478 - [nginx] OCSP stapling fails due to not yet working DNS after reboots

Attached to Project: Arch Linux
Opened by Pascal Ernster (hardfalcon) - Friday, 16 June 2017, 15:05 GMT
Last edited by Giancarlo Razzolini (grazzolini) - Tuesday, 04 July 2017, 19:21 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Sébastien Luttringer (seblu)
Giancarlo Razzolini (grazzolini)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No


nginx's shoddy OCSP stapling support fails when DNS resolution is not yet available at the time the nginx daemon is started. This affects using systemd-resolved or running their own DNS resolver (for example unbound).

In theory, the fix would consist of adding the following line to nginx.service:

After=network-online.target nss-lookup.target

In practice, however, the following bugs would need to be fixed first:

A workaround might be to temporarily add the following line until the two aforementioned bugs are closed:

After=network-online.target nss-lookup.target systemd-resolved.service unbound.service

Should I file a separate bug report for nginx-mainline (since its a community package and has a different package maintainer), or is this bug report sufficient to cover both packages?
This task depends upon

Closed by  Giancarlo Razzolini (grazzolini)
Tuesday, 04 July 2017, 19:21 GMT
Reason for closing:  Implemented
Additional comments about closing:  Added the workaround on our package. Upstream won't fix this.
Comment by Daniel M. Capella (polyzen) - Saturday, 17 June 2017, 00:32 GMT
This bug is sufficient. [nginx-mainline] will probably be added to the title and all 3 maintainers assigned.

Just wondering, why do you call NGINX's OCSP support "shoddy"?
Comment by Pascal Ernster (hardfalcon) - Saturday, 17 June 2017, 07:11 GMT
There's several reasons:

a) It claims to have support for OCSP stapling, and even lets you specify a separate DNS resolver for that (you need working DNS to resolve the host name of the CA's OCSP responder, which is specified in your certificate), but the whole thing falls apart when that DNS resolver is not reachable at the time when nginx is started. When nginx can't reach the resolver the first time, it will not try to use that resolver (or any other resolver, for example the systems default resolver) later on - instead, it will just tell the browser that the CA's OCSP responder is not available (although it is, and it's just nginx being too stupid to properly use DNS.

b) Even if DNS is working at the time when nginx is started, it doesn't actually care to provision it's OCSP cache by itself when started. Instead, it will only do so once the first client request actually requesting OCSP stapling comes along. And that first request will still always get an answer without an OCSP response stapled to it. If you use TLS certificates with the "Must-Staple" flag, and you use Firefox (Chrome seems to ignore all forms of OCSP, including OCSP stapling and the Must-Staple flag), you will get a nasty error page by Firefox on that first request.

c) Even if your DNS is working at the time when nginx is started, and even if you provisioned nginx' OCSP cache by manually loading the page using Firefox (or "curl --cert-status"), nginx will fail you once the CA's OCSP responder becomes unreachable (like it has happened with Let's Encrypt a few weeks ago). The reason for this is that nginx doesn't really have a cache for OCSP responses. The closest thing would be the "ssl_stapling_file" directive, but then you'll have to write a daemon by yourself which provisions and maintains that file.

And to top it all off, nginx developers come up with lame excuses for not fixing this in the corresponding bug report, claiming that "Must-Staple is an attempt to turn OCSP Stapling into something it was never designed to be" and similar nonsense[1].

Btw, if you are considering switching to Apache to get proper support for OCSP stapling, you can spare the effort - Apache's OCSP stapling code is just as shitty[2].

[1] https://trac.nginx.org/nginx/ticket/990
[2] https://blog.hboeck.de/archives/886-The-Problem-with-OCSP-Stapling-and-Must-Staple-and-why-Certificate-Revocation-is-still-broken.html
Comment by Daniel M. Capella (polyzen) - Saturday, 17 June 2017, 22:49 GMT
Thank you for taking the time to share that.

Oh, I would never, ever, consider Apache. :P
Comment by Giancarlo Razzolini (grazzolini) - Monday, 26 June 2017, 03:55 GMT
Well, I really think this bug should be fixed upstream. Adding it to the systemd unit is a workaround that will only work on arch. At least, if I'm understanding this correctly, what the reporter is asking is that we add this to the official nginx packages. Not to mention, there's the dependency on those other bugs, which in fact aren't really, really bugs. Don't get me wrong, I run unbound, and I also had to go through lengths to make #54476 a non-issue. And I have nginx servers on machines that don't initially have internet connectivity (I've noticed this bug, or should I say design issue, too).

Don't forget that not having stapling is not the end of the world, the browser will go through one more request (at most) and things remain fine. It's not like revocation stops working, just that things are a little slower. And, if you rely on serving anything on scale, through a server that does not have connectivity from boot, then you're doing it wrong. It's ok for hosting your own things on home made servers and all, but at scale, things get slow and OCSP stapling is, indeed, something you *want* to have. If it is up to me, we either open this up upstream, or close it right now as not a bug. But I will wait on the other maintainers. I maintain nginx-mainline.
Comment by Pascal Ernster (hardfalcon) - Monday, 26 June 2017, 05:13 GMT
I'm sorry, but IMHO you are wrong on this. The fact that current nginx/nginx-mainline unit files are missing the correct After= line is IMHO plain and simply a bug in the nginx/nginx-mainline packages, as per the official systemd documentation[1]. nginx *does* use DNS lookups, so it *is* supposed to have nss-lookup.target on that After= line, period. Nginx *isn't* working properly in common configurations when it doesn't have a working network connection right from the start, so network-online.target belongs in the After= line. Nginx *does* use DNS lookups, so nss-lookup.target belongs in the After= line.

Note that After= does *not* cause any additional units to be started, it merely assures that the nginx unit file is started at the correct point in time during boot[2]. Please do not confuse Before= and After= with Wants= or Requires=[3]. I'm *not* asking you to add "Wants=network-online.target", but merely to add the proper After= line.

They only situation where adding those targets to the After= line could actually cause an undesired behavior would be a system where a unit proving nss-lookup.target or network-online.target is enabled (=configured to be started on boot) but is broken in such a way that the start of that unit fails. And even in that case, it wouldn't be a bug in the nginx unit file but rather in the failing unit file, *and* still it would only cause a delay of 90 seconds before the nginx unit is started.

It is *exactly* bugs like this one that make people complain about systemd causing problems for them or "Why can't you somehow force systemd to start units in a serialized fashion like SysV did?".

So, as far as I can tell, there is no technical reason not to add the proper After= line, and it is even a bug not to add it according to the official systemd documentation.

[1] https://www.freedesktop.org/software/systemd/man/systemd.special.html#nss-lookup.target
[2] https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before=
[3] https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Requires=
Comment by Bartłomiej Piotrowski (Barthalion) - Monday, 26 June 2017, 09:06 GMT
+1 for 'After=network-online.target nss-lookup.target', not for particular services. I will put it in trunk soon.
Comment by Giancarlo Razzolini (grazzolini) - Monday, 26 June 2017, 11:32 GMT
Hi Pascal,

While you can correctly argue that this might be a bug on the packaging (and I'm fully aware of After= and Wants= differences), this still doesn't change the fact that this is an upstream bug. I will follow Bartłomiej decision to add this, but I still think this should be fixed there and, doing a workaround on our end, is counter-productive. I will look later into creating a bug report there, if there's not one yet.
Comment by Pascal Ernster (hardfalcon) - Monday, 26 June 2017, 12:45 GMT

Giancarlo: I agree that this is actually an upstream bug, but TBH, I have little confidence that upstream is willing to get their act together on their implementation of OCSP stapling anytime soon. And as long as this is not fixed upstream, the fix leveraging the After= directive is still the "right" and "clean" way to fix this, because nginx, in its current shape, *does* require working network connectivity and DNS resolution to work properly.
Besides, the "dysfunctional OCSP is not the end of the world" argumentation is a slippery slope IMHO, because it is exactly this argumentation which made upstream ship their broken implementation in the first place (though I might be a little biased on this because I run servers which use the Must-Staple flag in their TLS certificates, and I prefer to configure my browsers to enforce OCSP validation).
Comment by Giancarlo Razzolini (grazzolini) - Monday, 26 June 2017, 13:29 GMT
Well, to be fair, I never argued against the usage of OCSP. Just that if stapling doesn't work, your site still does, regardless. I'll bug upstream to fix this, because it is clearly a design issue. I should push an update to nginx-mainline later today, or tomorrow.
Comment by Bartłomiej Piotrowski (Barthalion) - Thursday, 29 June 2017, 10:50 GMT
Pushed the change to trunk for extra/nginx.
Comment by Giancarlo Razzolini (grazzolini) - Thursday, 29 June 2017, 13:23 GMT
Pushed the change to community/nginx-mainline. It's on community-testing right now, due to a new version of nginx-mainline. Just don't close this ticket yet. I'm going to report this upstream.
Comment by Giancarlo Razzolini (grazzolini) - Thursday, 29 June 2017, 13:44 GMT Comment by Giancarlo Razzolini (grazzolini) - Tuesday, 04 July 2017, 19:19 GMT
Well, upstream closed the ticket as wontfix. And they won't even answer my question. This is a design decision. But I think it should be properly documented. Hopefully we fixed it with our workaround.