FS#69465 - [man.archlinux.org] HTML "a href=" heading links not URL escaped

Attached to Project: Arch Linux
Opened by Adam Nielsen (Malvineous) - Friday, 29 January 2021, 11:16 GMT
Last edited by Jelle van der Waa (jelly) - Tuesday, 07 September 2021, 20:48 GMT
Task Type Bug Report
Category Arch Projects
Status Closed
Assigned To Jelle van der Waa (jelly)
Sven-Hendrik Haase (Svenstaro)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Apologies if this is the wrong place to post this but it appears I can't add issues to https://gitlab.archlinux.org/archlinux/archmanweb

Description:

The HTML "a href" links generated for headings don't appear to be URL escaped, so you can't copy and paste them into other places (like the Arch Wiki) if the heading has special characters in it.

Steps to reproduce:

1. Go to https://man.archlinux.org/man/systemd.network.5
2. Scroll down to the fourth heading "[MATCH] SECTION OPTIONS" (or any other heading with square brackets in it)
3. Right-click on the heading and copy the URL.
4. Observe that it is not a valid URL, as the anchor contains literal square brackets rather than percent-encoded escapes such as %5B and %5D. For example, you can't paste it into the Arch Linux wiki because MediaWiki does not recognise the square brackets as part of the URL.

The heading just needs to be URL-encoded before having the '#' added to the front of it.
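
A minimal sketch of that suggestion in Python, assuming the anchor is the heading text with spaces replaced by underscores. The function name `heading_to_href` is hypothetical and is not taken from archmanweb.

```python
from urllib.parse import quote


def heading_to_href(heading: str) -> str:
    """Build a percent-encoded '#fragment' link from a heading (illustrative only)."""
    # Assumption: anchors use underscores in place of spaces.
    anchor = heading.replace(" ", "_")
    # safe="" makes quote() percent-encode every reserved character,
    # including "[" and "]"; letters, digits and "_.-~" are left alone.
    return "#" + quote(anchor, safe="")


print(heading_to_href("[MATCH] SECTION OPTIONS"))
# prints: #%5BMATCH%5D_SECTION_OPTIONS
```

Note that `quote()` would also percent-encode any non-ASCII characters in a heading, which may or may not be desirable.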

Closed by  Jelle van der Waa (jelly)
Tuesday, 07 September 2021, 20:48 GMT
Reason for closing:  Implemented
Additional comments about closing: https://gitlab.archlinux.org/archlinux/archmanweb/-/commit/2e325ac6ba4300c608ca1af46f1fbaec9a440cda
Comment by Jelle van der Waa (jelly) - Wednesday, 21 April 2021, 19:59 GMT
Comment by Jakub Klinkovský (lahwaacz) - Wednesday, 21 April 2021, 20:02 GMT
MediaWiki's limitations do not make these URLs invalid. The links work perfectly fine when you click the heading on man.archlinux.org.

Technically, archmanweb uses the same encoding function [1] for the URL fragments as MediaWiki. See also the comparison table in [2].

[1] https://gitlab.archlinux.org/archlinux/archmanweb/-/blob/master/archmanweb/utils/encodings.py
[2] https://www.mediawiki.org/wiki/Manual:PAGENAMEE_encoding#Encodings_compared
Comment by Jakub Klinkovský (lahwaacz) - Wednesday, 21 April 2021, 21:01 GMT
Also note that links to manual pages from the wiki should be formatted using the man template [3]. To deal with square brackets, see how the link to "systemd.network(5) § [ROUTE] SECTION OPTIONS" is formatted in [4].

[3] https://wiki.archlinux.org/index.php/Template:Man
[4] https://wiki.archlinux.org/index.php/Systemd-networkd#Speeding_up_TCP_slow-start
Comment by Adam Nielsen (Malvineous) - Thursday, 22 April 2021, 02:53 GMT
Am I looking in the wrong place then?

If I look at RFC3986 at https://www.ietf.org/rfc/rfc3986.txt and how it defines URLs, it says:

sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

fragment = *( pchar / "/" / "?" )

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

I read this as saying that parentheses don't need to be percent-encoded but square brackets do. Am I misunderstanding this?
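
The grammar quoted above can be checked mechanically. The sketch below is an illustration only: the helper name `disallowed_in_fragment` is made up, and the `unreserved` set (ALPHA / DIGIT / "-" / "." / "_" / "~") comes from RFC 3986 section 2.3 rather than from the excerpt.

```python
import re
import string

# Character classes from RFC 3986; "unreserved" is defined in section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
# fragment = *( pchar / "/" / "?" ), pchar adds ":" and "@".
FRAGMENT_LITERALS = UNRESERVED | SUB_DELIMS | set(":@/?")
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def disallowed_in_fragment(fragment: str) -> list:
    """Return the characters that RFC 3986 does not allow literally in a fragment."""
    # Drop well-formed pct-encoded triplets first, since those are always allowed.
    remainder = PCT_ENCODED.sub("", fragment)
    return sorted({ch for ch in remainder if ch not in FRAGMENT_LITERALS})


print(disallowed_in_fragment("[MATCH]_SECTION_OPTIONS"))      # square brackets are reported
print(disallowed_in_fragment("(MATCH)_SECTION_OPTIONS"))      # parentheses are sub-delims
print(disallowed_in_fragment("%5BMATCH%5D_SECTION_OPTIONS"))  # pct-encoded form passes
```

Under this reading of the RFC, only the first fragment contains characters outside the grammar.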
Comment by Jakub Klinkovský (lahwaacz) - Thursday, 22 April 2021, 06:59 GMT
That RFC deals only with ASCII so if everybody followed it to the letter, all Unicode non-ASCII symbols would have to be encoded, which would lead to very ugly URLs. MediaWiki has the "html5" encoding style for fragments specifically to avoid useless encoding of Unicode symbols. When it comes to brackets, it should be reported to MediaWiki, since our algorithm just matches their encoding. I don't know which RFC or whatever covers this, but notice that when you manually encode the brackets, it leads to the same resource: https://man.archlinux.org/man/systemd.network.5#%5BMATCH%5D_SECTION_OPTIONS
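
As a quick illustration of why both forms reach the same heading, percent-decoding the fragment of the URL above gives back the bracketed anchor. This is only a sketch using Python's standard library; it says nothing about how any particular browser handles the fragment.

```python
from urllib.parse import unquote

encoded = "https://man.archlinux.org/man/systemd.network.5#%5BMATCH%5D_SECTION_OPTIONS"

# Split off the fragment (everything after the first "#") and percent-decode it.
_, _, fragment = encoded.partition("#")
decoded_fragment = unquote(fragment)

print(decoded_fragment)
# prints: [MATCH]_SECTION_OPTIONS
```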
Comment by Kristian (klausenbusk) - Thursday, 22 April 2021, 13:02 GMT
> I don't know which RFC or whatever covers this,

https://url.spec.whatwg.org/ is the relevant specification. Square brackets aren't valid in a URL-fragment string (https://url.spec.whatwg.org/#url-fragment-string), but the parser doesn't care: https://url.spec.whatwg.org/#fragment-state.
Comment by Jakub Klinkovský (lahwaacz) - Sunday, 29 August 2021, 13:01 GMT
According to https://url.spec.whatwg.org/#url-fragment-string, even spaces are not allowed in URL fragments, but they can be seen very often.

From RFC 3986, section 2.4, "When to Encode or Decode":

> Under normal circumstances, the only time when octets within a URI
> are percent-encoded is during the process of producing the URI from
> its component parts. This is when an implementation determines which
> of the reserved characters are to be used as subcomponent delimiters
> and which can be safely used as data. Once produced, a URI is always
> in its percent-encoded form.

The encoding happens inside the browser when it takes the (un-encoded) URL and contacts the server. Specifically, it is not required of HTML documents to contain percent-encoded URLs (see e.g. the example in https://url.spec.whatwg.org/#query-encoding-example where the href contains even HTML entities). It depends on the browser if its "copy link URL" button gives you an encoded or unmodified URL, if it shows an encoded or decoded URL in the address bar, etc.

The reported issue seems to be only a MediaWiki limitation, so I'm inclined to close this as "not a bug".
Comment by Adam Nielsen (Malvineous) - Monday, 30 August 2021, 04:24 GMT
It's a fair argument: if the spec allows it, then it seems archmanweb is doing nothing wrong.

I suppose the question then becomes, should archmanweb be stubborn and make it difficult for people to link back to it just to make a point that other systems aren't following the spec properly, or should it encode the URLs even though it doesn't have to, to make it easier for people to share links to it?

I guess it depends on whether you want to promote the site and make it easy for people to reference, or whether you don't especially want people to post links to it.

FWIW, it's not just the Arch Linux MediaWiki instance that won't accept these links; the Arch Linux forums won't accept them either. So I would argue that archmanweb should escape the links, at least until the rest of the Arch ecosystem correctly accepts the un-escaped URLs.
Comment by Jakub Klinkovský (lahwaacz) - Monday, 30 August 2021, 07:04 GMT
[Sorry for this noise, my browser just double-posted a comment...]
Comment by Jakub Klinkovský (lahwaacz) - Monday, 30 August 2021, 17:33 GMT
I've actually changed it differently [5]. The id attributes are left readable (unencoded), but the href attributes have the characters "[]|" percent-encoded, so the URLs that users will copy will be "compatible". Also the existing links with unencoded brackets should still work.
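
A rough sketch of the approach described above, not the code from the commit: the id stays readable while only "[", "]" and "|" are percent-encoded in the href. The function name `heading_anchor`, the `HREF_ESCAPES` table, and the `<h2>` markup are assumptions made for illustration.

```python
import html

# Only these three characters are percent-encoded in the href.
HREF_ESCAPES = str.maketrans({"[": "%5B", "]": "%5D", "|": "%7C"})


def heading_anchor(anchor_id: str, title: str) -> str:
    """Render a heading with a readable id and a copy-friendly href (hypothetical markup)."""
    href = "#" + anchor_id.translate(HREF_ESCAPES)
    # html.escape() guards the attribute values and text against HTML injection;
    # it does not touch square brackets or percent signs.
    return '<h2 id="{}"><a href="{}">{}</a></h2>'.format(
        html.escape(anchor_id), html.escape(href), html.escape(title)
    )


print(heading_anchor("[MATCH]_SECTION_OPTIONS", "[MATCH] SECTION OPTIONS"))
```

Because the id is unchanged, links that still carry unencoded brackets in the fragment keep pointing at the same element.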

But as I said, man pages should be linked via a template [3] on the wiki. We've just improved it so that brackets do not have to be encoded manually [6].

[5] https://gitlab.archlinux.org/archlinux/archmanweb/-/commit/2e325ac6ba4300c608ca1af46f1fbaec9a440cda
[6] https://wiki.archlinux.org/index.php?title=Template:Man&diff=693490&oldid=648692
