FS#15250 - [filesystem] incorrect sorting of accentuated filenames
Attached to Project: Arch Linux
Opened by Pierre Buard (Pierre_Buard) - Wednesday, 24 June 2009, 13:35 GMT
Last edited by Allan McRae (Allan) - Saturday, 19 December 2009, 22:22 GMT
Details
Description:
When mixing unaccented filenames with accented ones, the latter always end up at the bottom of the list. Rather than being sorted after "z", they should sit with their relatives: "é" and "è" with "e", "à" with "a", and so on. This happens across the board, from Dolphin to xterm.

Additional info:
* package: unknown
* version: unknown
* locale set in 'rc.conf': LOCALE="fr_FR.utf8"
* 'locale.gen' uncommented lines:
  fr_FR.UTF-8 UTF-8
  fr_FR ISO-8859-1
  fr_FR@euro ISO-8859-15
* 'locale' output:
  LANG=fr_FR.utf8
  LC_CTYPE="fr_FR.utf8"
  LC_NUMERIC="fr_FR.utf8"
  LC_TIME="fr_FR.utf8"
  LC_COLLATE="fr_FR.utf8"
  LC_MONETARY="fr_FR.utf8"
  LC_MESSAGES="fr_FR.utf8"
  LC_PAPER="fr_FR.utf8"
  LC_NAME="fr_FR.utf8"
  LC_ADDRESS="fr_FR.utf8"
  LC_TELEPHONE="fr_FR.utf8"
  LC_MEASUREMENT="fr_FR.utf8"
  LC_IDENTIFICATION="fr_FR.utf8"
  LC_ALL=

Steps to reproduce:
- in a folder, create some files whose names contain no accented characters,
- add a file whose name begins with an accented character (e.g. é),
- try to find your newly named file.

Here's an example of filenames and their incorrect order:
chou eloigne eloigné etalage hibou éloigne étaler

How to fix:
Turns out this is due to the empty LC_ALL. Typing 'export LC_ALL=fr_FR.utf8' in a terminal and then running 'ls' on the folder solves the problem. The same holds if the export is issued before 'startx'.

Here's the corrected order obtained after the fix:
chou eloigne éloigne eloigné etalage étaler hibou
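The two orderings in the report can be reproduced without creating any files, by feeding the same names to `sort`. This is only a sketch: the variable `names` is made up for illustration, and the second half assumes the French locale is installed and listed by `locale -a` as `fr_FR.utf8`, as on the reporter's system. Byte-wise (C) collation is always available and gives the "incorrect" order.

```shell
# File names taken from the report (variable name is illustrative).
names='chou
eloigne
eloigné
etalage
hibou
éloigne
étaler'

# Byte-wise (C locale) sort: the bytes encoding "é" compare higher than
# any ASCII letter, so the accented names land after "hibou".
printf '%s\n' "$names" | LC_ALL=C sort

# Locale-aware sort, attempted only if the locale is installed
# (assumption: it appears in `locale -a` exactly as fr_FR.utf8).
if locale -a 2>/dev/null | grep -qx 'fr_FR.utf8'; then
    printf '%s\n' "$names" | LC_ALL=fr_FR.utf8 sort
fi
```

If the second command prints nothing, the locale is probably not generated on that machine.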
All that LC_ALL does is override all other LC_* and LANG parameters.
Try making a profile.d script like so:
echo "LC_COLLATE=fr_FR.utf8" > /etc/profile.d/set_collation.sh
chmod +x /etc/profile.d/set_collation.sh
Note that the script is named so it runs _after_ locale.sh
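The override rule mentioned above follows the POSIX lookup order for a locale category: LC_ALL first, then the specific LC_* variable, then LANG, with an empty value treated as unset. The sketch below spells that out for collation. `resolve_collate` is a hypothetical helper, not part of any package, and the final fallback to `C` is an assumption about the implementation default (it is what glibc does). In `locale` output, a value shown in double quotes is implied by another variable rather than set explicitly.

```shell
# Hypothetical helper: given the values of LC_ALL, LC_COLLATE and LANG
# (in that order), print the locale that would govern collation.
resolve_collate() {
    if [ -n "${1-}" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "${2-}" ]; then
        printf '%s\n' "$2"      # then the category-specific variable
    elif [ -n "${3-}" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf 'C\n'            # assumed implementation default
    fi
}

# LC_ALL empty, LC_COLLATE=C, LANG=fr_FR.utf8: C collation wins.
resolve_collate "" C fr_FR.utf8

# LC_ALL set: it overrides both LC_COLLATE and LANG.
resolve_collate fr_FR.utf8 C en_US.utf8

# What the current environment resolves to.
resolve_collate "${LC_ALL-}" "${LC_COLLATE-}" "${LANG-}"
```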
$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=en_US.utf8
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
$ ls
9993243 chou eloigne éloigne eloigné etalage étaler hibou sdsad zzz
echo "export LC_ALL=$LOCALE" >>/etc/profile.d/locale.sh
right after:
echo "export LANG=$LOCALE" >>/etc/profile.d/locale.sh
export LC_COLLATE=fr_FR.utf8
'locale' now reports:
LANG=fr_FR.utf8
LC_CTYPE="fr_FR.utf8"
LC_NUMERIC="fr_FR.utf8"
LC_TIME="fr_FR.utf8"
LC_COLLATE=fr_FR.utf8
LC_MONETARY="fr_FR.utf8"
LC_MESSAGES="fr_FR.utf8"
LC_PAPER="fr_FR.utf8"
LC_NAME="fr_FR.utf8"
LC_ADDRESS="fr_FR.utf8"
LC_TELEPHONE="fr_FR.utf8"
LC_MEASUREMENT="fr_FR.utf8"
LC_IDENTIFICATION="fr_FR.utf8"
LC_ALL=
and the sorting is correct:
chou eloigne éloigne eloigné etalage étaler hibou
But if setting LC_ALL works - I see no reason why it cannot be added to the default rc.sysinit
(I think it wasn't done before only because judging by locale output it seemed that setting LANG is enough).
FS#10428 and FS#10435 that asked for removal of LC_COLLATE=C from the default /etc/profile
I commented out the LC_COLLATE line in /etc/profile and everything works.
Do what you feel is best and in the meantime I'll be modifying the wiki guides in order to make this <http://wiki.archlinux.org/index.php/Configuring_locales#Collation> more apparent.
@Thomas: We absolutely _cannot_ remove the LC_COLLATE line unless we can ensure that all scripts remain working with different collation. C collation is forced on a few other distros as well, as it has great potential to break scripts.
FS#10435
$ touch a A b
$ export LC_COLLATE=C
$ echo [a-z]
a b
$ export LC_COLLATE=en_US.utf8
$ echo [a-z]
a A b
If a script relies on it, LC_COLLATE=C must be set inside the script, period. If we relied on it anywhere, it would have shown up as a bug before now.
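One way a script can pin C collation for a single test without disturbing the caller's locale is to do the match in a subshell. This is a minimal sketch, not a recommendation from this thread: `is_ascii_lower` is a made-up name, and LC_ALL is used rather than LC_COLLATE so that an inherited LC_ALL cannot override the setting.

```shell
# Hypothetical helper: succeed only if $1 is exactly one ASCII lowercase
# letter. The function body is a subshell, so the locale change is
# discarded when the function returns.
is_ascii_lower() (
    LC_ALL=C
    export LC_ALL
    case $1 in
        [a-z]) true ;;
        *)     false ;;
    esac
)

for c in a A é; do
    if is_ascii_lower "$c"; then
        echo "$c: lowercase ASCII"
    else
        echo "$c: not lowercase ASCII"
    fi
done
```

A character class such as `[[:lower:]]` is another option, but it is still locale-aware and may accept accented letters, which may or may not be what the script wants.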
There are actually two things done in locale.sh
I agree that 'echo "export LANG=$LOCALE" >>/etc/profile.d/locale.sh' can be removed so that LANG is only set once, after parsing rc.conf,
but the other things done in locale.sh must not be changed: they must be done _after_ a terminal is set up,
and we cannot do this only once in rc.sysinit for all terminals, because dynamically spawned terminals would then be broken (currently, even if a terminal was reset, a simple relogin fixes it).
An interesting side effect is that the GNOME Open/Save dialog now sorts the folders correctly regardless of case (previously, the lowercase folders were always sorted after the capitalized ones).
For instance, in Estonian, the 'z' character comes right after 's'. Scripts frequently do use ranges like [a-z]. So if you run with LC_COLLATE=et_EE, this range will only match letters a-s and z, EXCLUDING the letters t u v w x y.
I cannot begin to imagine the security problems that arise from broken assumptions about alphabet ordering.
This is also known to break lots of configure scripts, such as Firefox, when using locale-aware shells like bash.
So please don't make this change until there's a sane and documented workaround. Perhaps warning users with weird locales to keep at LC_COLLATE=C?
Maybe there's a better solution, but I don't know it.
Attached is a test case to demonstrate the problem with Estonian locale.
On the other hand, there are examples where setting LC_COLLATE=C in /etc/profile breaks real applications:
FS#16481
Particularly because this affects only a few locales with few speakers, these bugs will probably be added faster than they can be debugged and fixed.
The fix is not as easy as setting LC_COLLATE once per script either, because at one place a script might want locale-specific sorting, but in another place it might want ASCII pattern matching. And it gets even more ambiguous what the user expects when the regular expression is input from the user.
If that's not complicated enough, how about having to call setlocale() at runtime in a multithreaded application to get some particular semantics? :)
Roman Kyrylych: I don't know many cases mainly because I use English locale on my own machines, as do most Estonian Linux users I know. But here's a brief teaser of bugs I could quickly find on the net:
http://bugs.gentoo.org/228005
http://bugs.gentoo.org/261363
http://bugs.gentoo.org/99013
http://bugs.gentoo.org/242332
http://markmail.org/message/fnvywvci3djsqp5h
http://bugs.php.net/bug.php?id=25259
http://bugs.php.net/bug.php?id=23709
These are mostly buildscript issues because that is what I searched for, but I bet this problem affects many other applications as well.
And btw: German collation DOES fuck up sorting.
If you want to verify this, you can uncomment all locales in /etc/locale.gen, run locale-gen as root, and then run my attached script.
I mean perhaps et_EE locale information files can be fixed?
But technically the locale is correct. The internationalized Estonian alphabet goes like this: a-p, q, r, s, š, z, ž, t, u, v, w, õ, ä, ö, ü, x, y
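To see what a collation-aware range covers under that ordering, the alphabet can be walked directly, with no et_EE locale installed. This is only a simulation under stated assumptions: `ET_ALPHABET` copies the order quoted above (with a-p expanded), `letters_in_range` is a hypothetical helper, and real matching engines may differ in details.

```shell
# Estonian alphabet order as quoted above, a-p expanded (assumed complete).
ET_ALPHABET="a b c d e f g h i j k l m n o p q r s š z ž t u v w õ ä ö ü x y"

# Hypothetical helper: print the letters of ET_ALPHABET lying between
# $1 and $2 inclusive, in alphabet order. The body is a subshell so the
# working variables do not leak into the caller.
letters_in_range() (
    inside=0
    out=""
    for letter in $ET_ALPHABET; do
        if [ "$letter" = "$1" ]; then inside=1; fi
        if [ "$inside" -eq 1 ]; then out="$out $letter"; fi
        if [ "$letter" = "$2" ]; then inside=0; fi
    done
    # Drop the leading space added by the loop.
    printf '%s\n' "${out# }"
)

# What a collation-aware [a-z] would cover under this ordering:
letters_in_range a z

# Everything that sorts after z, and is therefore outside [a-z]:
letters_in_range ž y
```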
Different pattern matching engines behave differently in this respect. Some attempt to match the characters to the locale-specific alphabet (so a-z won't match y). Others will treat character ranges as numeric ranges of Unicode codepoints (so a-y won't match š).
Either behavior can be surprising depending on your expectations. Personally, I see it as an unsolvable problem, so warning users is the only thing that can be done. However, other distros don't warn them, and people spend days trying to figure it out.