FS#24553 - broken character range expressions in bash

Attached to Project: Arch Linux
Opened by Felix (thetrivialstuff) - Thursday, 02 June 2011, 23:33 GMT
Last edited by Allan McRae (Allan) - Friday, 03 June 2011, 03:43 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To No-one
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

Either some aspect of the locale settings or a bug in bash is causing unexpected behaviour in shell globbing.

Additional info:
Locale: en_GB.UTF-8 (but en_US.UTF-8 or any of the other en_ locales probably have equivalent behaviour)
* package version(s)
Latest.
* config and/or log files etc.
Defaults.


Steps to reproduce:

Try these commands in bash:
touch a A b B
ls [a-z]
ls [a-b]

Expected output:
a b
a b

Actual output:
a A b B
a A b

More details on exactly what's going on:

http://teaching.idallen.com/net2003/06w/notes/character_sets.txt


Workaround:
Add "export LC_COLLATE=C" to either the system or the user bash profile.
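
A minimal sketch of the workaround in action, assuming a scratch directory created with mktemp (the names demo_dir and c_locale_matches are made up for illustration). One caveat that is easy to miss: if LC_ALL is set it takes precedence over LC_COLLATE, so the sketch clears it first.

```shell
# Scratch directory so the demo does not touch real files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/A" "$demo_dir/b" "$demo_dir/B"

# LC_ALL, if set, overrides LC_COLLATE, so clear it for this shell.
unset LC_ALL
export LC_COLLATE=C

# Expand the range inside the scratch directory (the subshell keeps our cwd).
c_locale_matches=$(cd "$demo_dir" && echo [a-z])
echo "$c_locale_matches"   # a b

rm -r "$demo_dir"
```

With C collation the range covers only the lowercase ASCII letters, so the uppercase file names are no longer matched.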


Philosophical argument:

While case-insensitive collation makes sense in a lot of contexts, it does not make sense in character ranges. Character ranges should obey a well-defined, definite, predictable order.

Everyone who would use an expression like [a-c] knows the order of the ASCII table, and would never expect that expression to be equivalent to [aAbBc] (note the absence of an uppercase 'C' -- that's what makes this behaviour so strange; it's neither case-sensitive nor case-insensitive, it's something else in between and not at all intuitive).

I'm not arguing for changes to the locale* or even that LC_COLLATE should be set to C by default -- ideally, the best solution to this is to alter the code in bash that interprets character ranges to make it ignore collation entirely (which appears to be what grep does; grep treats char ranges in expressions as expected).
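
Until something like that exists, one way to sidestep the collation order without touching the locale is to use POSIX character classes in place of ranges. The following is only a sketch under that assumption; class_dir, lower_matches and upper_matches are illustrative names, not anything bash provides.

```shell
# Scratch directory with the same four files as in the reproduction steps.
class_dir=$(mktemp -d)
touch "$class_dir/a" "$class_dir/A" "$class_dir/b" "$class_dir/B"

# [[:lower:]] and [[:upper:]] select by character class, not by collation range.
lower_matches=$(cd "$class_dir" && echo [[:lower:]])
upper_matches=$(cd "$class_dir" && echo [[:upper:]])

echo "$lower_matches"   # a b
echo "$upper_matches"   # A B

rm -r "$class_dir"
```

This does not help with partial ranges such as [a-c]; for those, listing the characters explicitly ([abc]) is the locale-independent spelling.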


Ranked medium severity because this can cause data loss; see https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687


* Unimportant footnote, and more philosophizing:

Even if [a-c] is equivalent to [aAbBc], that does not agree with the collation order. In locales where this equivalence holds, collation is entirely case-insensitive, whereas [a-c] == [aAbBc] implies that each lowercase letter always precedes its uppercase counterpart. This is not the case:

$ touch Aa ab
$ ls -1
Aa
ab

(If lowercase 'a' preceded uppercase 'A' in the collation order, the expected result would be:
ab
Aa
)

So, the behaviour of character ranges in a locale with case-insensitive collation should actually be complete case-insensitivity, not this strange leaving out of the last letter.

Finally, I would like to point out that bash's behaviour is not even consistent with its own documentation. man bash says:

"A pair of characters separated by a hyphen denotes a range expression; any character that ***sorts between those two characters,*** inclusive, using the current locale's collating sequence and character set, is matched." (emphasis mine)

Well, in the case of [a-c], uppercase C clearly sorts between 'a' and 'c' (sometimes), so it should be matched. Observe:

$ touch aa cd C
$ ls -1
aa
C
cd

Closed by Allan McRae (Allan)
Friday, 03 June 2011, 03:43 GMT
Reason for closing: Not a bug
Additional comments about closing: Defined behavior
Comment by Gerardo Exequiel Pozzi (djgera) - Friday, 03 June 2011, 03:36 GMT
I understand your frustration because I suffered the same surprise as you a long time ago...

This is correct and defined behaviour (ISO 14651). You can read the definitions in /usr/share/i18n/locales/iso14651_t1_common
These are called "equivalence classes": the class for the letter "a" contains lots of symbols, the class for "b" contains others, and there is even one for "0" (yes, zero!). Examples of symbols in the equivalence class of "0": "¼" (1/4), "½" (1/2), "¾" (3/4).

>> (If lowercase 'a' preceded uppercase 'A' in the collation order, the expected result would be:
This is a wrong assumption in this context. Why? Because strings are sorted linguistically, not by code point, which is a much more complex thing.
character range (order: code points) != collation order (order: linguistically)

If you want "the old school" behaviour, just use the POSIX/C locale.
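
As a rough illustration of what the POSIX/C locale means for ordering, the three names from the report's last example can be sorted with LC_ALL=C set for a single command. The variable c_sorted is just an illustrative name.

```shell
# Sort the names from the report by code point (C locale) for this one command.
c_sorted=$(printf '%s\n' aa cd C | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# C
# aa
# cd
```

Under C collation every uppercase ASCII letter sorts before every lowercase one, which is why "C" moves to the front compared with the en_GB.UTF-8 listing quoted in the report.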
