FS#15250 - [filesystem] incorrect sorting of accentuated filenames

Attached to Project: Arch Linux
Opened by Pierre Buard (Pierre_Buard) - Wednesday, 24 June 2009, 13:35 GMT
Last edited by Allan McRae (Allan) - Saturday, 19 December 2009, 22:22 GMT
Task Type: Bug Report
Category: Packages: Core
Status: Closed
Assigned To: Aaron Griffin (phrakture), Thomas Bächler (brain0), Roman Kyrylych (Romashka), Allan McRae (Allan)
Architecture: All
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 0
Private: No

Details

Description:
When mixing unaccented filenames with accented ones, the latter are
always at the bottom of the list. Rather than being sorted after the "z", they
should be sorted with their base letters, e.g. "é" and "è" with the "e", "à" with
the "a", and so on.

This happens across the board: from Dolphin to xterm.

Additional info:
* package: unknown
* version: unknown
* locale set in 'rc.conf': LOCALE="fr_FR.utf8"
* 'locale.gen' uncommented lines:
fr_FR.UTF-8 UTF-8
fr_FR ISO-8859-1
fr_FR@euro ISO-8859-15
* 'locale' output:
LANG=fr_FR.utf8
LC_CTYPE="fr_FR.utf8"
LC_NUMERIC="fr_FR.utf8"
LC_TIME="fr_FR.utf8"
LC_COLLATE="fr_FR.utf8"
LC_MONETARY="fr_FR.utf8"
LC_MESSAGES="fr_FR.utf8"
LC_PAPER="fr_FR.utf8"
LC_NAME="fr_FR.utf8"
LC_ADDRESS="fr_FR.utf8"
LC_TELEPHONE="fr_FR.utf8"
LC_MEASUREMENT="fr_FR.utf8"
LC_IDENTIFICATION="fr_FR.utf8"
LC_ALL=

Steps to reproduce:
- in a folder, create some files whose names contain no accented characters,
- add a file whose name begins with an accented character (e.g. é),
- try to find your newly named file.

Here's an example of filenames and their incorrect order:
chou eloigne eloigné etalage hibou éloigne étaler
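
For what it's worth, the order above is exactly what plain byte-wise ("C") collation gives for these names, since the UTF-8 bytes of "é" compare higher than any ASCII letter. A minimal sketch to reproduce it; the helper name sort_bytewise is made up for this example:

```shell
# Hypothetical helper: sort the given names byte-wise and print them on one line.
# LC_ALL=C forces the "C" collation for this single sort invocation only.
sort_bytewise() {
    printf '%s\n' "$@" | LC_ALL=C sort | paste -s -d ' ' -
}

# The example filenames from this report, given in arbitrary order.
sort_bytewise chou éloigne hibou eloigné étaler eloigne etalage
```

Whether a locale-aware sort then yields the corrected order depends on the locale being generated on the system, so that part is not shown here.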

How to fix:
Turns out this is due to the empty LC_ALL. Running 'export LC_ALL=fr_FR.utf8' in a terminal and then 'ls' in the folder solves the problem. The same fix works if the export is issued before 'startx'.

Here's the corrected order obtained after the fix:
chou eloigne éloigne eloigné etalage étaler hibou

Closed by Allan McRae (Allan)
Saturday, 19 December 2009, 22:22 GMT
Reason for closing: Fixed
Comment by Roman Kyrylych (Romashka) - Wednesday, 24 June 2009, 13:44 GMT
@Aaron, Thomas: does this mean we should set LC_ALL during locale setup process?
Comment by Thomas Bächler (brain0) - Wednesday, 24 June 2009, 14:52 GMT
In the Arch default settings, LC_COLLCATE should be set to C, which would cause the problem (but weirdly, your locale output does not show that).

All that LC_ALL does is override all other LC_* and LANG parameters.
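
As an illustration of that precedence, here is a small sketch of the POSIX lookup order for the collation category: LC_ALL first, then LC_COLLATE, then LANG, then the implementation default (assumed to be "C" here). The function effective_collate and its three-argument interface are invented for this example; they are not part of any Arch script.

```shell
# effective_collate LC_ALL LC_COLLATE LANG
# Hypothetical helper: prints the locale name that collation would end up using.
# Empty arguments stand for unset or empty variables.
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # otherwise the category variable wins
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # otherwise LANG is the fallback
    else
        printf 'C\n'            # assumed implementation default
    fi
}

# LC_ALL empty, LC_COLLATE=C as in the Arch default settings, LANG=fr_FR.utf8
effective_collate "" "C" "fr_FR.utf8"
# Same, but with LC_ALL exported as in the reporter's workaround
effective_collate "fr_FR.utf8" "C" "fr_FR.utf8"
```

The first call prints C and the second prints fr_FR.utf8, which would explain why exporting LC_ALL changes the sorting even though LANG is the same in both cases.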
Comment by Pierre Buard (Pierre_Buard) - Wednesday, 24 June 2009, 15:31 GMT
Just to be sure I tested your remark concerning LC_COLLCATE on a week old installation. It's indeed set to 'C' on that system but the sorting is still screwed unless the LC_ALL fix is applied.
Comment by Aaron Griffin (phrakture) - Wednesday, 24 June 2009, 16:26 GMT
Sorting is controlled solely by LC_COLLATE (note: typos above). The issue here is that LC_COLLATE is actually set in /etc/profile, and not in /etc/profile.d/locale.sh

Try making a profile.d script like so:

echo "LC_COLLATE=fr_FR.utf8" > /etc/profile.d/set_collation.sh
chmod +x /etc/profile.d/set_collation.sh

Note that the script is named so it runs _after_ locale.sh
Comment by Aaron Griffin (phrakture) - Wednesday, 24 June 2009, 16:31 GMT
Amazingly, this sorts fine in _any_ utf8 locale, I think... at least it's fine for en_US here

$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=en_US.utf8
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

$ ls
9993243 chou eloigne éloigne eloigné etalage étaler hibou sdsad zzz
Comment by Pierre Buard (Pierre_Buard) - Wednesday, 24 June 2009, 21:42 GMT
I'll try the LC_COLLATE script. In the meantime I have added the following line to rc.sysinit:
echo "export LC_ALL=$LOCALE" >>/etc/profile.d/locale.sh
right after:
echo "export LANG=$LOCALE" >>/etc/profile.d/locale.sh
Comment by Pierre Buard (Pierre_Buard) - Thursday, 25 June 2009, 22:24 GMT
I've tested the script with success. I only needed to modify it a bit:
export LC_COLLATE=fr_FR.utf8

'locale' now reports:
LANG=fr_FR.utf8
LC_CTYPE="fr_FR.utf8"
LC_NUMERIC="fr_FR.utf8"
LC_TIME="fr_FR.utf8"
LC_COLLATE=fr_FR.utf8
LC_MONETARY="fr_FR.utf8"
LC_MESSAGES="fr_FR.utf8"
LC_PAPER="fr_FR.utf8"
LC_NAME="fr_FR.utf8"
LC_ADDRESS="fr_FR.utf8"
LC_TELEPHONE="fr_FR.utf8"
LC_MEASUREMENT="fr_FR.utf8"
LC_IDENTIFICATION="fr_FR.utf8"
LC_ALL=

and the sorting is correct:
chou eloigne éloigne eloigné etalage étaler hibou
Comment by Roman Kyrylych (Romashka) - Friday, 26 June 2009, 08:07 GMT
Hm, it is weird that setting only LANG and setting both LANG and LC_ALL produce the same locale output but behave differently.
But if setting LC_ALL works - I see no reason why it cannot be added to the default rc.sysinit
(I think it wasn't done before only because judging by locale output it seemed that setting LANG is enough).
Comment by Thomas Bächler (brain0) - Friday, 26 June 2009, 08:18 GMT
I still think it would be sufficient to omit setting LC_COLLATE. Could you try removing the LC_COLLATE references from /etc/profile and try again?
Comment by Thomas Bächler (brain0) - Friday, 26 June 2009, 08:21 GMT
Just a remark, simply "unset LC_COLLATE" followed by "ls -l" gives me correct output. When LC_COLLATE is set to C, the 'ä' is sorted after the 'z'; when I unset it, locale reports de_DE.utf8 for LC_COLLATE and the 'ä' is sorted somehow with the 'a'.
Comment by Roman Kyrylych (Romashka) - Friday, 26 June 2009, 11:13 GMT
So this gets us back to FS#10428 and FS#10435, which asked for removal of LC_COLLATE=C from the default /etc/profile.
Comment by Pierre Buard (Pierre_Buard) - Friday, 26 June 2009, 11:47 GMT
And I thought I had researched the subject before posting this bug report!
I commented out the LC_COLLATE line in /etc/profile and everything works.
Do what you feel is best and in the meantime I'll be modifying the wiki guides in order to make this <http://wiki.archlinux.org/index.php/Configuring_locales#Collation> more apparent.
Comment by Thomas Bächler (brain0) - Friday, 26 June 2009, 13:37 GMT
I am against forcing LC_COLLATE=C on anyone. We need to rethink locale handling in /etc/profile anyway, rc.conf should be parsed instead of generating a static locale.sh on every boot!
Comment by Aaron Griffin (phrakture) - Friday, 26 June 2009, 17:26 GMT
@Roman: We don't want to set LC_ALL at all. LC_ALL is an override, whereas LANG is the default for anything that is not set.

@Thomas: We absolutely _cannot_ remove the LC_COLLATE line unless we can ensure that all scripts remain working with different collation. C collation is forced on a few other distros as well, as it has great potential to break scripts.
Comment by Aaron Griffin (phrakture) - Friday, 26 June 2009, 17:28 GMT
More to the point - the /etc/profile file is just a suggestion. Anyone is free to edit it, but the default C collation covers most cases. See FS#10435.
Comment by Aaron Griffin (phrakture) - Friday, 26 June 2009, 17:32 GMT
Case in point:
$ touch a A b
$ export LC_COLLATE=C
$ echo [a-z]
a b
$ export LC_COLLATE=en_US.utf8
$ echo [a-z]
a A b
Comment by Thomas Bächler (brain0) - Friday, 26 June 2009, 17:58 GMT
None of our scripts can rely on that: When not started from a login shell, LANG=C is the default anyway. When started from a (login) shell, we cannot be sure that the user didn't set anything different.

If a script relies on it, LC_COLLATE=C must be set inside the script, period. If we relied on it anywhere, it would have manifested as a bug before.
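
A minimal sketch of what pinning the collation inside a script could look like. The helper name is_ascii_lower is hypothetical, and LC_ALL is used rather than LC_COLLATE because a user-set LC_ALL would otherwise override it; it is applied to the one grep call only, so the rest of the script keeps the user's locale.

```shell
# is_ascii_lower WORD
# Hypothetical helper: succeeds only if WORD is non-empty and consists solely
# of the 26 ASCII lowercase letters, whatever collation the caller has set.
is_ascii_lower() {
    # The LC_ALL=C prefix affects this grep invocation and nothing else.
    printf '%s\n' "$1" | LC_ALL=C grep -q '^[a-z][a-z]*$'
}

if is_ascii_lower "hibou"; then
    echo "hibou: only ASCII lowercase letters"
fi
```

Setting the variable per command instead of once at the top of the script is just one option; it keeps locale-specific sorting available elsewhere in the same script.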
Comment by Aaron Griffin (phrakture) - Friday, 26 June 2009, 18:54 GMT
Right right, I'm not saying that it's a good thing. It's just the way it is. If we remove the default, we're going to have breakage we never expected. In the future, sure, maybe we should do this, but for right now, we shouldn't switch the default LC_COLLATE=C unless we're sure we covered most bases with scripts we ship (and even third-party scripts that use grep and sed)
Comment by Thomas Bächler (brain0) - Friday, 26 June 2009, 19:12 GMT
I hate to repeat myself, but most scripts are not even run from a login shell, so whatever you set in profile will not affect anything. I will disable this LC_COLLATE locally and see what happens.
Comment by Roman Kyrylych (Romashka) - Sunday, 28 June 2009, 13:26 GMT
@Thomas: "rc.conf should be parsed instead of generating a static locale.sh on every boot!"
There are actually two things done in locale.sh
I agree that 'echo "export LANG=$LOCALE" >>/etc/profile.d/locale.sh' can be removed so LANG is only set once after parsing rc.conf,
but other things done in locale.sh must not be changed, they must be done _after_ a terminal is set up,
and we cannot do this only once in rc.sysinit for all terminals, because then dynamically spawned terminals would be broken (currently, even if a terminal was reset, a simple relogin fixes it).
Comment by Roman Kyrylych (Romashka) - Saturday, 18 July 2009, 18:02 GMT
@Thomas: did you see any breakage with LC_COLLATE=C removed? If not, then we should make the change.
Comment by Thomas Bächler (brain0) - Saturday, 18 July 2009, 20:12 GMT
I did not actually remove it. Good you are reminding me. I'm doing it now.
Comment by Roman Kyrylych (Romashka) - Saturday, 18 July 2009, 20:20 GMT
I've disabled it too. Though I won't be able to see any difference in sorting order because Latin and Cyrillic chars are not mixed in any sorting order.
Comment by Roman Kyrylych (Romashka) - Sunday, 19 July 2009, 13:04 GMT
So far the only difference I've noticed is that Latin items (e.g. filenames, menu entries) in Gnome are now sorted after Cyrillic, it was the other way before.
Comment by Pierre Buard (Pierre_Buard) - Monday, 10 August 2009, 09:23 GMT
I haven't seen any problem on 2 workstations for nearly 2 months with 'LC_COLLATE="C"' commented out in my /etc/profile.
An interesting side effect is that the GNOME Open/Save dialog now sorts the folders correctly regardless of case (previously, lowercase folders were always sorted after the capitalized ones).
Comment by Thomas Bächler (brain0) - Monday, 10 August 2009, 11:26 GMT
I think that we can remove it and IF any script breaks, introduce LC_COLLATE=C _inside_ the script. @Aaron, can we do this now?
Comment by Roman Kyrylych (Romashka) - Monday, 10 August 2009, 11:48 GMT
Agree. Let's get rid of LC_COLLATE
Comment by Aaron Griffin (phrakture) - Monday, 10 August 2009, 18:55 GMT
Committed to SVN. This isn't worth releasing a new filesystem package though, so it will show up later
Comment by Marti (intgr) - Tuesday, 06 October 2009, 14:11 GMT
Don't rush it, this causes REALLY obscure problems with locales that tamper with the ordering of the alphabet.

For instance, in Estonian, the 'z' character comes right after 's'. Scripts frequently do use ranges like [a-z]. So if you run with LC_COLLATE=et_EE, this range will only match letters a-s and z -- EXCLUDING letters t u v x y
I cannot begin to imagine the security problems that arise from broken assumptions about alphabet ordering.
This is also known to break lots of configure scripts, such as Firefox, when using locale-aware shells like bash.

So please don't make this change until there's a sane and documented workaround. Perhaps warning users with weird locales to keep at LC_COLLATE=C?
Maybe there's a better solution, but I don't know it.

Attached is a test case to demonstrate the problem with Estonian locale.
Comment by Thomas Bächler (brain0) - Tuesday, 06 October 2009, 14:29 GMT
A script that doesn't set LC_COLLATE to what it expects is simply broken. I've used this for a long time now and didn't notice any problems.
Comment by Roman Kyrylych (Romashka) - Tuesday, 06 October 2009, 14:37 GMT
@Marti: Can you name any such scripts? They should be fixed to use LC_COLLATE=C if they depend on the order of ASCII chars, or use [:alpha:]
On the other hand, there are examples where setting LC_COLLATE=C in /etc/profile breaks real applications: FS#16481
Comment by Aaron Griffin (phrakture) - Tuesday, 06 October 2009, 15:09 GMT
Hrmm... if it's known to break configure scripts, would setting LC_COLLATE in makepkg solve enough of this so that we can then deal with broken scripts on a case-by-case basis?
Comment by Marti (intgr) - Tuesday, 06 October 2009, 15:56 GMT
Thomas Bächler: *YOU* didn't notice any problems because German is one of the sane locales that does not change the alphabet ordering of 'z'.
Particularly because this affects only a few locales with few speakers, these bugs will probably be added faster than they can be debugged and fixed.

The fix is not as easy as setting LC_COLLATE once per script either, because at one place a script might want locale-specific sorting, but in another place it might want ASCII pattern matching. And it gets even more ambiguous what the user expects when the regular expression is input from the user.
If that's not complicated enough, how about having to call setlocale() at runtime in a multithreaded application to get some particular semantics? :)

Roman Kyrylych: I don't know many cases mainly because I use English locale on my own machines, as do most Estonian Linux users I know. But here's a brief teaser of bugs I could quickly find on the net:
http://bugs.gentoo.org/228005 http://bugs.gentoo.org/261363 http://bugs.gentoo.org/99013 http://bugs.gentoo.org/242332 http://markmail.org/message/fnvywvci3djsqp5h http://bugs.php.net/bug.php?id=25259 http://bugs.php.net/bug.php?id=23709

These are mostly buildscript issues because that is what I searched for, but I bet this problem affects many other applications as well.
Comment by Thomas Bächler (brain0) - Tuesday, 06 October 2009, 16:02 GMT
Broken applications don't justify broken-ness on our side. If you really run into trouble, then you can set LC_COLLATE=C somewhere in your login scripts, but it is broken as a distro default.

And btw: German collation DOES fuck up sorting.
Comment by Aaron Griffin (phrakture) - Tuesday, 06 October 2009, 16:04 GMT
How about we make sure to announce this and provide a list of locales which break some scripts, and recommend setting LC_COLLATE for these people?
Comment by Roman Kyrylych (Romashka) - Wednesday, 07 October 2009, 08:51 GMT
I agree with that.
Comment by Marti (intgr) - Wednesday, 07 October 2009, 13:30 GMT
Turns out that et_EE is the _only_ locale that causes this unexpected behavior. In all other locales, the range [a-z] will always match all English letters in range a-z.

If you want to verify this, you can uncomment all locales in /etc/locale.gen, run locale-gen as root, and then run my attached script.
Comment by Roman Kyrylych (Romashka) - Wednesday, 07 October 2009, 13:55 GMT
Hm, is it something that can be fixed on locale level?
I mean perhaps et_EE locale information files can be fixed?
Comment by Marti (intgr) - Wednesday, 07 October 2009, 14:23 GMT
I very much wish the problem could be fixed on the locale level.
But technically the locale is correct. The internationalized Estonian alphabet goes like this: a-p, q, r, s, š, z, ž, t, u, v, w, õ, ä, ö, ü, x, y

Different pattern matching engines behave differently in this respect. Some attempt to match the characters to the locale-specific alphabet (so a-z won't match y). Others will treat character ranges as numeric ranges of Unicode codepoints (so a-y won't match š).

Either behavior can be surprising depending on your expectations. Personally, I see it as an unsolvable problem, so warning users is the only thing that can be done. However, other distros don't warn them, and people spend days trying to figure it out.
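
To make the locale-aware interpretation concrete, here is a toy model of it in shell. It only uses the alphabet order quoted in this comment and compares positions in that list; it is not how glibc or any real pattern matching engine is implemented, and the names ET_ORDER, position and in_range are made up for the example.

```shell
# Alphabet order as quoted in the comment (a-p written out in full).
ET_ORDER="a b c d e f g h i j k l m n o p q r s š z ž t u v w õ ä ö ü x y"

# position LETTER: prints the 1-based index of LETTER in ET_ORDER,
# or fails if the letter is not in the list.
position() {
    i=0
    for ch in $ET_ORDER; do
        i=$((i + 1))
        if [ "$ch" = "$1" ]; then
            printf '%s\n' "$i"
            return 0
        fi
    done
    return 1
}

# in_range LO HI LETTER: succeeds if LETTER sorts between LO and HI
# (inclusive) according to ET_ORDER, i.e. a model of the range [LO-HI].
in_range() {
    lo=$(position "$1") || return 1
    hi=$(position "$2") || return 1
    pos=$(position "$3") || return 1
    [ "$pos" -ge "$lo" ] && [ "$pos" -le "$hi" ]
}

# Print the ASCII letters that this model of [a-z] would miss.
for letter in s t u v w x y z; do
    in_range a z "$letter" || echo "[a-z] misses $letter"
done
```

In this model the range a-z stops at position 21 (the 'z'), so t, u, v, w, x and y fall outside it, while a-y covers the whole list including š and z.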
Comment by Allan McRae (Allan) - Sunday, 01 November 2009, 04:05 GMT
I have released filesystem-2009.11 with this change to [testing]. Are we going to make an announcement regarding this?
Comment by Roman Kyrylych (Romashka) - Sunday, 01 November 2009, 10:32 GMT
Yes, an announcement is needed. Can you write the text?
