FS#69465 - [man.archlinux.org] HTML "a href=" heading links not URL escaped
Attached to Project:
Arch Linux
Opened by Adam Nielsen (Malvineous) - Friday, 29 January 2021, 11:16 GMT
Last edited by Jelle van der Waa (jelly) - Tuesday, 07 September 2021, 20:48 GMT
Details
Apologies if this is the wrong place to post this, but it appears I can't add issues to https://gitlab.archlinux.org/archlinux/archmanweb
Description:
The HTML "a href" links generated for headings don't appear to be URL escaped, so you can't copy and paste them into other places (like the Arch Wiki) if the heading has special characters in it.

Steps to reproduce:
1. Go to https://man.archlinux.org/man/systemd.network.5
2. Scroll down to the fourth heading "[MATCH] SECTION OPTIONS" (or any other heading with square brackets in it).
3. Right-click on the heading and copy the URL.
4. Observe it is not a valid URL, as the anchor contains square brackets rather than %5D type escape codes. For example, you can't paste it into the Arch Linux wiki, as MediaWiki does not recognise the square brackets as part of the URL.

The heading just needs to be URL-encoded before having the '#' added to the front of it.
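A minimal sketch of the suggested fix, in Python since archmanweb is a Python project. The function name `heading_href` is hypothetical and not taken from archmanweb, and the heading text is used directly as the anchor only for illustration; the site's real anchor IDs may be built differently.

```python
from urllib.parse import quote


def heading_href(heading: str) -> str:
    """Return an href fragment for a heading (hypothetical helper)."""
    # safe="" makes quote() percent-encode every reserved character,
    # including "/" which it would otherwise leave untouched.
    return "#" + quote(heading, safe="")


print(heading_href("[MATCH] SECTION OPTIONS"))
# → #%5BMATCH%5D%20SECTION%20OPTIONS
```

The encoded form contains no square brackets or spaces, so it should survive being pasted into software that stops parsing a URL at those characters.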
Closed by Jelle van der Waa (jelly)
Tuesday, 07 September 2021, 20:48 GMT
Reason for closing: Implemented
Additional comments about closing: https://gitlab.archlinux.org/archlinux/archmanweb/-/commit/2e325ac6ba4300c608ca1af46f1fbaec9a440cda
Technically, archmanweb uses the same encoding function [1] for the URL fragments as MediaWiki. See also the comparison table in [2].
[1] https://gitlab.archlinux.org/archlinux/archmanweb/-/blob/master/archmanweb/utils/encodings.py
[2] https://www.mediawiki.org/wiki/Manual:PAGENAMEE_encoding#Encodings_compared
[3] https://wiki.archlinux.org/index.php/Template:Man
[4] https://wiki.archlinux.org/index.php/Systemd-networkd#Speeding_up_TCP_slow-start
If I look at RFC3986 at https://www.ietf.org/rfc/rfc3986.txt and how it defines URLs, it says:
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
fragment = *( pchar / "/" / "?" )
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
I read this as saying that normal brackets (parentheses) don't need to be percent-encoded, but square brackets do. Am I misunderstanding this?
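The reading above can be checked mechanically. This is only a sketch: it builds the set of characters the quoted grammar allows to appear literally in a fragment. The `unreserved` rule is not quoted above; it is taken from RFC 3986 section 2.3 (ALPHA / DIGIT / "-" / "." / "_" / "~"). The name `needs_percent_encoding` is hypothetical.

```python
import string

# unreserved is from RFC 3986 section 2.3; the other rules are quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")
FRAGMENT_CHARS = PCHAR | set("/?")


def needs_percent_encoding(ch: str) -> bool:
    """True if ch may not appear literally in an RFC 3986 fragment.

    "%" is reported as True as well, because it is only valid as the
    start of a pct-encoded triplet, not as a standalone character.
    """
    return ch not in FRAGMENT_CHARS


for ch in "()[]":
    print(ch, needs_percent_encoding(ch))
# ( False
# ) False
# [ True
# ] True
```

Under RFC 3986 alone, then, parentheses are fine in a fragment and square brackets are not. This says nothing about what the WHATWG URL parser tolerates.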
https://url.spec.whatwg.org/ is the relevant specification. Square brackets aren't valid in a URL-fragment string (https://url.spec.whatwg.org/#url-fragment-string), but the parser doesn't care: https://url.spec.whatwg.org/#fragment-state.
From the RFC3986, section 2.4. When to Encode or Decode:
> Under normal circumstances, the only time when octets within a URI
> are percent-encoded is during the process of producing the URI from
> its component parts. This is when an implementation determines which
> of the reserved characters are to be used as subcomponent delimiters
> and which can be safely used as data. Once produced, a URI is always
> in its percent-encoded form.
The encoding happens inside the browser when it takes the (un-encoded) URL and contacts the server. Specifically, it is not required of HTML documents to contain percent-encoded URLs (see e.g. the example in https://url.spec.whatwg.org/#query-encoding-example where the href contains even HTML entities). It depends on the browser if its "copy link URL" button gives you an encoded or unmodified URL, if it shows an encoded or decoded URL in the address bar, etc.
The reported issue seems to be only a MediaWiki limitation, so I'm inclined to close this as "not a bug".
I suppose the question then becomes, should archmanweb be stubborn and make it difficult for people to link back to it just to make a point that other systems aren't following the spec properly, or should it encode the URLs even though it doesn't have to, to make it easier for people to share links to it?
I guess it depends on whether you want to promote the site and make it easy for people to reference, or whether you don't especially want people to post links to it.
FWIW it's not just the Arch Linux MediaWiki instance that won't accept these links, the Arch Linux forums won't accept them either. So I would argue that at least temporarily archmanweb should escape the links, at least until the rest of the Arch ecosystem correctly accepts the un-escaped URLs.
But as I said, man pages should be linked via a template [3] on the wiki. We've just improved it so that brackets do not have to be encoded manually [6].
[5] https://gitlab.archlinux.org/archlinux/archmanweb/-/commit/2e325ac6ba4300c608ca1af46f1fbaec9a440cda
[6] https://wiki.archlinux.org/index.php?title=Template:Man&diff=693490&oldid=648692