FS#12888 - "sort" in core/coreutils-6.12-1 is broken with UTF-8 locales

Attached to Project: Arch Linux
Opened by Daniel Thaler (danielthaler) - Thursday, 22 January 2009, 10:01 GMT
Last edited by Andreas Radke (AndyRTR) - Tuesday, 24 February 2009, 18:26 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Andreas Radke (AndyRTR)
Architecture All
Severity Medium
Priority Normal
Reported Version None
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

"sort" in core/coreutils-6.12-1 is broken with UTF-8 locales

On my system I have LANG="en_US.UTF-8"

running sort results in
> sort: sort.c:1150: inittables_mb: Assertion `mblength != (size_t)-1 && mblength != (size_t)-2' failed.
> Aborted
No sorting takes place.

If I call sort like this
$ LANG="en_US" sort textfile.txt
sorting works as expected.

Since sort is used by a number of scripts and utilities on the system (mkinitcpio, for example), this is not acceptable as a workaround.

This problem is NOT an upstream problem; the bug is in Arch's coreutils-i18n.patch, which is applied by the pkgbuild.

Closed by  Andreas Radke (AndyRTR)
Tuesday, 24 February 2009, 18:26 GMT
Reason for closing:  Won't fix
Additional comments about closing:  Please ask for reopening if the bug persists in coreutils 7.1 with any locale setting.
Comment by Andreas Radke (AndyRTR) - Thursday, 22 January 2009, 15:45 GMT
I can't confirm this here.

[andyrtr@workstation64 ~]$ locale
LANG=de_DE.UTF-8
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE=C
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

So far I can pipe everything to sort without an error. Please give an example and show your output of "locale".
Comment by Aaron Griffin (phrakture) - Thursday, 22 January 2009, 17:03 GMT
@Daniel, what is your LC_COLLATE setting?
Comment by Daniel Thaler (danielthaler) - Thursday, 22 January 2009, 21:00 GMT
I have

LANG=en_US.utf8
LANGUAGE=
LC_COLLATE=C
LC_TIME=de_DE

No other localization-related environment variables are set.

I've just played around with this some more and I have found that the actual problem appears to be the mismatch between LC_TIME and LANG.
If I set LC_TIME=de_DE.UTF-8 and LANG=en_US.UTF-8, it works. Similarly, it also works if neither is UTF-8. sort breaks only if one of the two is UTF-8 and the other isn't.
(Aside: the intent of this setup is to get system messages in English, since most translations annoy me, while using familiar dates/times.)
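
For anyone who wants to check their own settings for this kind of mismatch, here is a minimal sketch in POSIX shell. The helper names locale_codeset and same_codeset are made up for illustration and are not part of coreutils or glibc. The script only compares the text after the dot in each locale name, so a name without a suffix (such as de_DE or C) is reported as "none" even though the C library maps it to some default encoding.

```shell
#!/bin/sh
# Hypothetical helper: print the codeset part of a locale name,
# lowercased and with "-" removed ("en_US.UTF-8" -> "utf8").
# Names without a codeset suffix (e.g. "de_DE", "C") yield "none".
locale_codeset() {
    case "$1" in
        *.*)
            cs=${1#*.}      # drop everything up to the first dot
            cs=${cs%%@*}    # drop a modifier such as "@euro"
            printf '%s\n' "$cs" | tr '[:upper:]' '[:lower:]' | sed 's/-//g'
            ;;
        *)
            printf 'none\n'
            ;;
    esac
}

# Hypothetical helper: return 0 if all given locale names share a codeset,
# 1 otherwise.
same_codeset() {
    first=$(locale_codeset "$1")
    shift
    for name in "$@"; do
        [ "$(locale_codeset "$name")" = "$first" ] || return 1
    done
    return 0
}

# The combination reported above: LANG=en_US.utf8 with LC_TIME=de_DE.
same_codeset "en_US.utf8" "de_DE" || echo "mixed encodings"
```

With the values from this report, the last line prints "mixed encodings". With en_US.UTF-8 and de_DE.UTF-8 it prints nothing.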

Comment by Aaron Griffin (phrakture) - Thursday, 22 January 2009, 21:04 GMT
So are we SURE this is from our patch? Where did we get this patch?

Side note: would you mind documenting this on this page, for other users:
http://wiki.archlinux.org/index.php/Locale

Just mention that there is a bug when LC_TIME and LANG use different encodings.
Comment by Daniel Thaler (danielthaler) - Thursday, 22 January 2009, 21:18 GMT
Yes, I am sure that patch is the cause:
The failing assertion doesn't exist in the vanilla source; it is added by that patch.

I also just tested the LANG=en_US.UTF-8, LC_TIME=de_DE setup on my desktop which is running Gentoo and it worked fine.

Comment by Damjan Georgievski (damjan) - Wednesday, 28 January 2009, 16:14 GMT
@Daniel

For one thing, you should always use UTF-8 locales; for another, you can set LANGUAGE=en to get the messages in English. This is how it works:
LANG is the default locale setting; it can be overridden by specific LC_xxx variables. LC_ALL overrides everything.

LANGUAGE is the variable that controls the messages (gettext). It's a list, so you can have LANGUAGE=de:fr:en and the first available will be used.
If LANGUAGE is not set, gettext will try to guess it from LANG (or LC_MESSAGES I guess).

Long story short, I guess you need
LANG=de_DE.UTF-8
LANGUAGE=en
and optionally LC_COLLATE=C (personally I don't like it)
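
The override order described above (LC_ALL first, then the specific LC_xxx variable, then LANG) can be sketched in POSIX shell as follows. The function name resolve_locale is hypothetical, and the final fallback to "C" when nothing is set is an assumption based on the usual POSIX default. The sketch only picks a name from the three values and does not model LANGUAGE or gettext.

```shell
#!/bin/sh
# Hypothetical helper: pick the locale name in effect for one category.
# Arguments: the value of LC_ALL, the value of the category's own
# variable (e.g. LC_TIME), and the value of LANG, in that order.
# An empty value is treated the same as an unset variable.
resolve_locale() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "${2:-}" ]; then
        printf '%s\n' "$2"      # the specific LC_xxx variable comes next
    elif [ -n "${3:-}" ]; then
        printf '%s\n' "$3"      # LANG is the default
    else
        printf 'C\n'            # assumed fallback when nothing is set
    fi
}

# Example: which locale applies to LC_TIME in the current environment?
resolve_locale "${LC_ALL:-}" "${LC_TIME:-}" "${LANG:-}"
```

With LC_ALL empty, LC_TIME=de_DE and LANG=en_US.utf8, as in the original report, this selects de_DE for the time category while every category without its own variable falls back to en_US.utf8.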
Comment by Andreas Radke (AndyRTR) - Saturday, 31 January 2009, 10:34 GMT
The patch is included in every distribution (Fedora, Debian, Gentoo, Mandriva). You might look at where it is maintained upstream.
Comment by Daniel Thaler (danielthaler) - Tuesday, 03 February 2009, 21:04 GMT
Sorry for not getting back to you earlier about this; I was rather busy.

Anyway: as far as I'm concerned, the bug is fixed (for me) by changing my locale settings so that I am no longer mixing UTF-8 with non-UTF-8 locales.
If you don't want or intend to modify the patch (or delegate the problem to upstream), this bug could be closed.
