FS#45384 - [netctl] ESSIDs with irregular latin characters require intervention

Attached to Project: Arch Linux
Opened by L T (stozi) - Thursday, 18 June 2015, 18:34 GMT
Last edited by Jouke Witteveen (jouke) - Thursday, 05 July 2018, 08:36 GMT
Task Type Bug Report
Category Arch Projects
Status Closed
Assigned To Jouke Witteveen (jouke)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description: for example, I'm using the following netctl profile:

Description='Automatically generated profile by wifi-menu'
Interface=wlp2s0
Connection=wireless
Security=wpa
ESSID=MARİGOLD_2
IP=dhcp
Key=whatever

Note that the İ in the ESSID is a Turkish special character. It was translated into a sequence of escape codes (MAR\xc4\xb0GOLD) by netctl/wifi-menu, and I had to find out what the original character was and manually edit the profile for the network to work. This isn't the first time.

Additional info:
* package version(s) netctl 1.10-2
* config and/or log files etc.


Steps to reproduce: try using wifi-menu to add a network with a special latin character.

Closed by  Jouke Witteveen (jouke)
Thursday, 05 July 2018, 08:36 GMT
Reason for closing:  Fixed
Additional comments about closing:  e4638274
Comment by Nick (kousu) - Tuesday, 29 August 2017, 00:26 GMT
+1 to this. It's not super common, but in non-English areas there are usually at least a few ESSIDs with non-ASCII UTF-8 characters.

For example, there's a cafe named "SHMIT-invité". If I connect to it with wifi-menu it makes this profile and fails:

```
Description='Automatically generated profile by wifi-menu'
Interface=wlan0
Connection=wireless
Security=none
ESSID=SHMIT-invit\xc3\xa9
IP=dhcp
```

wifi-menu also uses "SHMIT-invit\xc3\xa9" as the name displayed in its interactive menu.

If I correct the profile to this then it works

```
Description='Automatically generated profile by wifi-menu'
Interface=wlan0
Connection=wireless
Security=none
ESSID=SHMIT-invité
IP=dhcp
```

so it seems that wifi-menu is trying to be *too* clever: my system handles utf-8 just fine, and that's confusing wpa_supplicant.


Android was perfect, though, of course. I took a look at what it made (in /data/misc/wifi/wpa_supplicant.conf, which is a very sneaky place to hide that file, Android!) and saw

```
network={
ssid=53484d49542d696e766974c3a9
bssid=4a:f8:b3:8f:18:6e
key_mgmt=NONE
priority=45
disabled=1
id_str="%7B%22creatorUid%22%3A%221000%22%2C%22configKey%22%3A%22%5C%22SHMIT-invit%C3%A9%5C%22NONE%22%7D"
}
```

That ssid is the hex-encoding of the utf-8 encoding of "SHMIT-invité":

```
Python 3.6.2 (default, Jul 20 2017, 03:52:27)
[GCC 7.1.1 20170630] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s="53484d49542d696e766974c3a9"
>>> import binascii
>>> binascii.unhexlify(s)
b'SHMIT-invit\xc3\xa9'
>>> binascii.unhexlify(s).decode("utf-8")
'SHMIT-invité'
>>>
```

I didn't even know you could hex-encode them for wpa_supplicant, but now I do. This is the only ssid in the Android file like that; I guess Google decided that if there were any non-7-bit-ASCII characters in an ssid it was safer to use that; maybe Arch should too? But I would actually just prefer that it wrote the Unicode in directly.
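A minimal sketch of the reverse direction, i.e. producing the hex form from a readable SSID. The helper name `ssid_to_wpa_hex` is made up for illustration, and it assumes the SSID is meant to be UTF-8:

```python
def ssid_to_wpa_hex(ssid: str) -> str:
    """Hex-encode the UTF-8 bytes of an SSID (hypothetical helper)."""
    return ssid.encode("utf-8").hex()


print(ssid_to_wpa_hex("SHMIT-invité"))  # 53484d49542d696e766974c3a9
```

The output matches the ssid value in the Android file above.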
Comment by Nick (kousu) - Tuesday, 29 August 2017, 00:27 GMT
Notable maybe that iwlist uses \x quoting too, but with upper case letters, so I guess netctl isn't pulling /directly/ from that.
```
$ iwlist scanning
wlan0 Scan completed :
Cell 01 - Address: 4A:F8:B3:8F:18:6E
ESSID:"SHMIT-invit\xC3\xA9"
Protocol:IEEE 802.11bgn
Mode:Master
Frequency:2.432 GHz (Channel 5)
Encryption key:off
Bit Rates:144 Mb/s
Quality=99/100 Signal level=74/100
Extra:fm =0003
```
Comment by Declspeck (declspeck) - Sunday, 17 June 2018, 17:08 GMT
Hi,

I attached a patch which undoes the \x quoting for scan results and the current connection. This fixed the issue for me (the ESSID contained 'ä').
Comment by Jouke Witteveen (jouke) - Sunday, 17 June 2018, 18:41 GMT
While this is the only official bug tracker for netctl, some more information can be found on a comparable GitHub bug: https://github.com/joukewitteveen/netctl/issues/132

To fix this properly, we should detect whether the network advertises a UTF-8 encoded SSID.

Since I really appreciate people contributing patches, I lean towards just assuming SSIDs are UTF-8 encoded. However, before something like the proposed patch can be accepted, we need to be sure that the decoding of \x-sequences follows UTF-8. I guess `echo -e` just uses the current locale?
Comment by Declspeck (declspeck) - Monday, 18 June 2018, 10:58 GMT
Wow, that was a fast reply for an old issue!


About the UTF-8:ness of the \x thing:

wpa_supplicant encodes bytes into the \x notation; it is not concerned with Unicode or UTF-8. It's probably not supposed to be, since as you mentioned in the GitHub thread, SSIDs are strings of bytes: https://w1.fi/cgit/hostap/tree/src/utils/common.c#n504

`echo -e` in coreutils decodes \x into bytes; it is concerned with neither UTF-8 nor the locale:

From the man page: \xHH byte with hexadecimal value HH (1 to 2 digits)
And from the source: https://github.com/coreutils/coreutils/blob/master/src/echo.c#L214

So we get the SSID as verbatim bytes, which may or may not be valid UTF-8.


I tried the patch for a few corner cases, and yes I noticed some issues, one of them (tab in SSID) is quite severe:

- For saving in the profile, the string may still contain characters that break the formatting. Specifically, if the SSID contains a character which will be quoted with `printf %q`, the ESSID will be wrong. E.g. if the SSID contains a backslash.
- For dialog, "\n" in the SSID gets stripped out when displaying as a menu option, but when it is included in the message, it will be an actual newline.
- If a string printed by dialog contains invalid UTF-8, the whole string will be discarded - the ESSID containing invalid Unicode would not be shown, but it would not invalidate the whole list. E.g. in the following example, Hidden would not be shown, but the other strings would:
`dialog --menu "Title" 24 50 12 $(echo -e 'Hello\tworld \x80Hidden\tVisible Test\t2')`
- This is a bad one: I believe that if the ESSID contains a tab, it will break the dialog completely. I did not try this out with my AP though, I just ran dialog with a wrong number of tabs in an item.

So, the patch will work for non-whitespace non-quote non-backslash UTF-8 characters. AP names that worked previously would not be broken, unless I'm missing something. Some AP names will now fail in unacceptable ways though.


For a proper solution, I think the following things would need to be done:

- Escape the ESSID field in the profile differently - I believe this would require a fallback to a hex-encoded string if the SSID contains broken Unicode or newlines or backslashes. This would also mean that there would have to be a function to compare SSIDs, since the strings might be encoded differently.
- Deal with invalid Unicode, newlines, and tabs when printing the SSID to dialog.
- Escape the profile filename differently (I'm thinking transliteration here) - while filenames may contain arbitrary bytes, it's probably not a good thing...

Would that approach sound good to you? I'd be willing to work on this. If I do, are there any gotchas that I should know about? Also, can I assume that iconv exists? Currently, netctl does not call it, but it comes with glibc so it should be installed with base.
Comment by Jouke Witteveen (jouke) - Monday, 18 June 2018, 20:12 GMT
Before hacking away on this, let's set some targets. I see several:

1) Make sure that profiles generated by wifi-menu are usable for all ESSIDs.
2) Display ESSIDs using UTF-8 when possible.
3) Make full use of wpa_supplicant's printf-escaped strings.
4) Improve SSID matching in wifi-menu.

1)
This is a bug that should be fixed. Luckily, the scan results used internally by netctl are printf-encoded according to the printf_encode function in https://w1.fi/cgit/hostap/tree/src/utils/common.c. This means we do not need to worry about nasty stuff like tabs in ESSIDs. It may be beneficial to tackle 3) first.
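To illustrate why tabs are not a concern at this level, here is a rough Python approximation of that encoding. It is a sketch based on my reading of printf_encode, not the real code; in particular the exact set of named escapes is an assumption that should be checked against common.c:

```python
def printf_encode(data: bytes) -> str:
    """Approximate wpa_supplicant's printf_encode (illustrative only)."""
    # Bytes assumed to get a named escape instead of a \xNN sequence.
    named = {
        0x22: '\\"',
        0x5C: "\\\\",
        0x1B: "\\e",
        0x0A: "\\n",
        0x0D: "\\r",
        0x09: "\\t",
    }
    out = []
    for byte in data:
        if byte in named:
            out.append(named[byte])
        elif 32 <= byte <= 126:
            out.append(chr(byte))          # printable ASCII is kept as-is
        else:
            out.append("\\x%02x" % byte)   # everything else becomes \xNN
    return "".join(out)


print(printf_encode("SHMIT-invité".encode("utf-8")))  # SHMIT-invit\xc3\xa9
print(printf_encode(b"my\tssid"))                     # my\tssid
```

Since the result only ever contains printable ASCII, a tab in an SSID reaches the script as the two characters `\t`, never as a real tab.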

2)
This is a feature enhancement that may be very hard to get right. If you want to look into it, I am open to patches. Usage of iconv is okay, since bash depends on its presence anyway. Most important to me is that wifi-menu can be used to make profiles for as many ESSIDs as possible, not necessarily that the ESSIDs look very appealing. As stated before, I am leaning towards accepting the assumption that ESSIDs are UTF-8 encoded, even though we cannot know.

3)
While this bug may be old, these issues are far older. In 2012 the developer of wpa_supplicant surprised me by adding proper escaping support to wpa_supplicant (http://lists.shmoo.com/pipermail/hostap/2012-August/026456.html , implemented in https://w1.fi/cgit/hostap/commit/?id=5c4b93d72ecf6d1d5b21a60b3e78db3948d0f034 ). Unfortunately, netctl still only implements the double quotation and no quotation modes. Maybe it should just use printf-encoded ESSIDs whenever it would currently use double quoted? Would this break reasonable use cases?

4)
Fixes to the above issues may introduce a situation where wifi-menu no longer correctly detects whether the user is connected to any of the listed networks. If so, this should be fixed.


I aim to have 1) fixed in the next version of netctl. The first and last points of your proposed path to a solution I agree with. Ideally, no new encoding schemes need to be invented for netctl.
Comment by Declspeck (declspeck) - Wednesday, 20 June 2018, 10:55 GMT
1) Nice! The tabs and other control characters should still be worried about if the SSIDs are displayed as UTF-8, since they broke dialog.
2) I tried out implementing it and came to an ALMOST satisfactory solution with two major issues, discussed below. Of course, you might find more obvious issues that I might have missed:

Algorithm for formatting SSID for display:
1. Check for NULs in ESSID - grep for \x00 in the encoded ESSID and if found, fall back to the old representation. I learned that Bash doesn't like NULs in variables and that SSIDs are not NUL-terminated, but rather variable-length binary buffers.
2. Check for invalid UTF-8 - Run `echo -e "$1" | iconv -f UTF-8 -t UTF-8` to check if there are errors. If there are, show the escaped version.
3. Unescape the output with `echo -e`
4. Substitute control characters - replace \a \b \e \t \v with their slash-escaped versions. Substitute spaces with an underscore (non-breaking space)
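The steps above can be sketched in Python as follows. This is only an illustration, not the shell code I actually tried: `printf_decode` and `display_ssid` are made-up names, the decoder handles only the escapes I expect from wpa_supplicant, and the space substitution from step 4 is left out.

```python
import re

# Named escapes assumed to occur in printf-encoded SSIDs.
_SIMPLE = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"e": b"\x1b",
           b"\\": b"\\", b'"': b'"'}
_ESCAPE = re.compile(rb"\\(?:x([0-9a-fA-F]{1,2})|(.))", re.DOTALL)
# Control characters that would confuse dialog, shown slash-escaped instead.
_CONTROL = {"\a": "\\a", "\b": "\\b", "\x1b": "\\e", "\t": "\\t", "\v": "\\v"}


def printf_decode(encoded: str) -> bytes:
    """Turn a printf-encoded SSID back into raw bytes (hypothetical helper)."""
    def repl(match):
        if match.group(1) is not None:
            return bytes([int(match.group(1).decode("ascii"), 16)])
        # Unknown escapes are kept verbatim.
        return _SIMPLE.get(match.group(2), match.group(0))
    return _ESCAPE.sub(repl, encoded.encode("utf-8"))


def display_ssid(encoded: str) -> str:
    """Format a printf-encoded SSID for display, following steps 1-4."""
    if "\\x00" in encoded:                # step 1: NUL, keep the escaped form
        return encoded
    try:
        text = printf_decode(encoded).decode("utf-8")   # steps 2 and 3
    except UnicodeDecodeError:
        return encoded                    # invalid UTF-8, keep the escaped form
    return "".join(_CONTROL.get(ch, ch) for ch in text)  # step 4


print(display_ssid("SHMIT-invit\\xc3\\xa9"))  # SHMIT-invité
print(display_ssid("Invalid\\x80UTF-8"))      # Invalid\x80UTF-8
```

Note that the NUL check in step 1 is a plain substring test, so an SSID containing a literal backslash followed by "x00" also takes the fallback path, which is harmless.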

3) Sounds good - I did not know about the escaping in wpa_supplicant, that'll make everything a lot easier. About breaking reasonable use cases - I guess this can be done without breaking BC at all? Since the printf-escaped syntax can encode any SSIDs, I don't see how it could not account for some use cases.
4) If printf-escaped syntax is used both in profiles and output from wpa_supplicant, then detecting the connected network will be rather easy I think - but might require support for previously added non-escaped names if I understand correctly? I.e. that wifi-menu would write ESSID=P"\xsomething" in the future, but old profiles with ESSID="Something" would still need to be supported?



PROBLEM 1: The output is ambiguous for several reasons, and the user cannot differentiate between similar-looking SSID names:

A) If you have an SSID with a NUL or invalid UTF-8, it will be shown as e.g. "Invalid\x80UTF-8". If you also have an access point called "Invalid\x80UTF-8" (with the \x80 as the actual four-character string "\x80", not the byte 0x80), the user won't be able to differentiate between them.

B) If you have an SSID with e.g. a tab character and another with the literal string "\t", the user again won't be able to differentiate them in the menu.

C) You can have different UTF-8 strings mapping to the same glyphs - one example being:
- "Ä" is normally represented as a single code point U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS,
- But it can be represented as two code points, U+0041 LATIN CAPITAL LETTER A, followed by U+0308 COMBINING DIAERESIS

Example:
```
echo -e '\xC3\x84' # LATIN CAPITAL LETTER A WITH DIAERESIS
echo -e 'A\xCC\x88' # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
```

These yield the following results:
- Both print Ä in xterm, gnome-terminal (and other VTE-based terminals I assume), urxvt, and kitty
- First prints Ä, second prints A¨ in alacritty
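A small Python check of the same two byte sequences, only to show that they are different strings that become equal under NFC normalization. It says nothing about how a given terminal renders them:

```python
import unicodedata

precomposed = b"\xc3\x84".decode("utf-8")  # U+00C4
decomposed = b"A\xcc\x88".decode("utf-8")  # U+0041 followed by U+0308

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```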

Possible solution for A - add a marker "!" or something to the name if the fallback is taken.
Possible solution for B - escape \ as \\ in the input.
Possible solution for C - give the user an option for displaying the SSIDs in their unescaped form. E.g. add an extra button to dialog "Show unescaped SSIDs" or a command-line option. Note that it is impossible to detect whether two UTF-8 strings look the same, even if Unicode normalization was available (and it is not without additional dependencies), as terminals might display Unicode differently, user's fonts might have missing symbols, etc.



PROBLEM 2: The user's terminal might not be able to support UTF-8, in at least two different ways:

A) It uses a different character set, e.g. ASCII or ISO-8859-1.
B) It parses UTF-8, but cannot display most of the glyphs.

When first installing Arch and connecting to wifi on the TTY, either A) or B) is the case, depending on whether or not the TTY has been put into Unicode mode. There are a few possible solutions, but none of them are completely robust:

A) Ignore the issue - the missing characters would just display incorrectly.
B) Try to detect UTF-8 support, and fall back to showing printf-escaped names if not supported.
C) Try to detect UTF-8 support, and iconv the SSIDs to ASCII if not supported, e.g. with `iconv -f UTF-8 -t ASCII//TRANSLIT -c`
D) Try to detect the terminal charset, and iconv the SSIDs to that, e.g. with `iconv -f UTF-8 -t KOI-8//TRANSLIT -c`
E) In addition to the previous approaches, try to detect if we are in a Linux TTY, put it into Unicode mode with `unicode_start` if not already, restore it after.

Detecting the terminal charset is not possible in a robust way - you can get the charset specified in LC_ALL, LC_CTYPE, or LANG, but that does not really mean that the terminal is able to display that charset. E.g. the Linux TTY can be in non-Unicode mode while LANG=en_US.UTF-8.

If the charset should be detected, I suppose the following approach might be the most robust, though still not completely, and it might have too many moving parts:

```
# Succeed if iconv lists a charset whose name ends in $1.
is_valid_charset() {
    if [ -z "$1" ]; then
        false
    else
        iconv -l | grep -F "$1//" > /dev/null
    fi
}

# Print the charset named in the locale environment, or ASCII if none is usable.
get_environment_charset() {
    if is_valid_charset "${LC_ALL#*.}"; then
        echo "${LC_ALL#*.}"
    elif is_valid_charset "${LC_CTYPE#*.}"; then
        # Strip the language part here too, as for LC_ALL and LANG.
        echo "${LC_CTYPE#*.}"
    elif is_valid_charset "${LANG#*.}"; then
        echo "${LANG#*.}"
    else
        echo ASCII
    fi
}

# Print the charset the terminal is assumed to be able to display.
get_terminal_charset() {
    env_charset=$(get_environment_charset)
    if [ "$TERM" = "linux" ]; then
        # stty prints "-iutf8" when the flag is off, so no match means it is on.
        if ! stty -a | grep -F -- "-iutf8" >/dev/null ; then
            # Linux TTY in UTF-8 mode
            echo UTF-8
        elif [ "$env_charset" = "UTF-8" ]; then
            # Linux TTY in non-UTF-8 mode, yet locale is UTF-8 -> fallback to ASCII.
            echo ASCII
        else
            # Linux TTY in non-UTF-8 mode in non-UTF-8 locale, assume that the locale
            # charset works with the terminal.
            echo "$env_charset"
        fi
    else
        echo "$env_charset"
    fi
}
```
Comment by Declspeck (declspeck) - Wednesday, 20 June 2018, 11:38 GMT
Interestingly, nmtui displays UTF-8 SSIDs directly, without any concern for [UNICODE-FOR-ZERO-WIDTH-SPACE]Free_Airport_Wifi or anything - I still think that there should at least be an option to show the raw values, so the user could at least in theory connect to the correct wifi in a hostile environment. Of course in practice the user does not know the real name of the airport wifi, so the attacker could just put "Free_Wifi_LAX" instead of "Free_Airport_Wifi", so I'm not sure how important this issue really is...

With all the issues, this starts to sound like a situation where we should consider if we even want to pretty-print the SSIDs? What do you think? I'd still be willing to work on this, but I'm pretty sure that incorrectly guessed terminal encodings and problems with Unicode trickery cannot be 100% solved.
Comment by Declspeck (declspeck) - Wednesday, 20 June 2018, 14:07 GMT
DELETED Turns out that reloading the tab reposts the comment.
Comment by Jouke Witteveen (jouke) - Wednesday, 20 June 2018, 16:30 GMT
How about this?

When the terminal is using UTF-8, it simply dumps the decoded SSIDs, assuming that it is UTF-8 (which it need not be).
Otherwise, it displays the escaped sequences, but internally processes them as 'raw strings' (byte sequences, really).

This method does in fact not use the possibility to encode a ssid as

ESSID='"P"my\tssid"'

in netctl. Not (officially) supporting this syntax saves some code in the init_profiles function. Otherwise, we would have needed to decode such values too. I know, that would only add two lines, but already the script is quite long.
Comment by Declspeck (declspeck) - Thursday, 21 June 2018, 11:46 GMT
Or how about something like this, built on top of your version:

- SSIDs are always handled as printf-encoded, except when displaying. I used a hack to pass it hidden to dialog, I hid it after invalid UTF-8, which dialog does not display but will still output: 'Nätwerk'$'\x80''N\x12\x34twerk'
- P"-encoding is used - init_profiles does the same amount of work, but I removed the ""-handling that you added (I assume you added it for this scenario?). This works with tabs etc. in wifi names.
- Since profile name is once again generated from the printf-encoded version, I reverted the iconv change. If you prefer the iconv, it can easily be added back, iconv should have the -c flag though so it won't choke on invalid Unicode.
- I also added a per-item info line to dialog, which shows the printf-encoded version of the SSID if it differs from the decoded version.
Comment by Jouke Witteveen (jouke) - Thursday, 21 June 2018, 12:19 GMT
Thanks for your review!

I don't think I fully understand. You rely on undocumented behavior to piggyback the encoded ssid on the decoded version and you do not care for my original point (4) in the sense that encoded and decoded versions of the same SSID are no longer matched? Matching ESSID='""ssid"' and ESSID=ssid was the point of my ""-handling (this also takes care of SSIDs starting with ").

Can you elaborate on the problems your version solves with respect to mine? Honestly, I don't care to support embedded NUL characters or otherwise malicious SSIDs in wifi-menu (although they should not prevent wifi-menu from working for the other networks available). I presume that if you want to connect to such a network you sort-of know what you are doing and are able to write your ESSID in hex (ESSID=\"4E0055004C). Note that I also made no attempt to match such ESSIDs.

Thanks in particular for the iconv -c catch. Using `LC_ALL=C wifi-menu` the transliteration is really an attempt to pretty-print SSIDs in profile names.
I do like the idea of the per-network info line, I just very much oppose the \x80-hack.
Comment by Jouke Witteveen (jouke) - Thursday, 28 June 2018, 14:32 GMT
For 1.17, I went with my original patch:
https://git.archlinux.org/netctl.git/commit/?id=e4638274ac7f84b749cdcbd8e93f06a5564280e7

If we need to use the printf-encoded string anyway, we should simply prefix the menu entries by an index like so:

1. first-ssid
2. second-ssid
etc.

We can then use the index to retrieve the encoded ssid.
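A rough sketch of that idea in Python, just to show the mapping (wifi-menu itself is a bash script; `build_menu` and `selected_ssid` are hypothetical names, and the display function is a stand-in for whatever decoding is used):

```python
def build_menu(encoded_ssids, display=lambda ssid: ssid):
    """Return menu labels of the form '<index>. <displayed ssid>'."""
    return [f"{i}. {display(ssid)}" for i, ssid in enumerate(encoded_ssids, 1)]


def selected_ssid(encoded_ssids, label):
    """Recover the encoded SSID from the index at the start of a label."""
    index = int(label.split(".", 1)[0])
    return encoded_ssids[index - 1]


ssids = ["first-ssid", "SHMIT-invit\\xc3\\xa9"]
menu = build_menu(ssids)
print(menu[1])                         # 2. SHMIT-invit\xc3\xa9
print(selected_ssid(ssids, menu[1]))   # SHMIT-invit\xc3\xa9
```

Only the index is parsed back, so the displayed text can be decoded or mangled freely without affecting which network is selected.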

Further, note that netctl already supports a printf-like syntax:

ESSID=$'\xF0\x9F\x93\xB6'

Lastly, I added UTF-8 exposing to wpa_supplicant, and found out that (at least in my surroundings) most access points don't set the UTF-8 bit.
http://lists.infradead.org/pipermail/hostap/2018-June/038658.html
