FS#36670 - [mkinitcpio] locale bug

Attached to Project: Arch Linux
Opened by A Web (aweb) - Monday, 26 August 2013, 03:46 GMT
Last edited by Dave Reisner (falconindy) - Wednesday, 28 August 2013, 17:13 GMT
Task Type Bug Report
Category Arch Projects
Status Closed
Assigned To Dave Reisner (falconindy)
Architecture All
Severity Very Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

mkinitcpio truncates the list of hooks in non-UTF8 locales.

Additional info:
* package version(s)
mkinitcpio 0.15.0-1
* config and/or log files etc.


Steps to reproduce:

Run "LANG=C mkinitcpio -L". Tons of hooks will be missing (including mdadm and mdadm_udev).

Closed by Dave Reisner (falconindy)
Wednesday, 28 August 2013, 17:13 GMT
Reason for closing: Won't fix
Additional comments about closing: See discussion
Comment by A Web (aweb) - Monday, 26 August 2013, 05:50 GMT
As pointed out here, the issue is the column command choking on UTF-8 annotations:

https://bbs.archlinux.org/viewtopic.php?pid=1317159#p1317159
Comment by Dave Reisner (falconindy) - Monday, 26 August 2013, 11:44 GMT
Next you're going to tell me that pacman is broken because it truncates packager names when they contain UTF-8 characters. Then you'll file bugs about wc and other utilities wrongly counting the lengths of lines.

Your setup is broken, and I can't fix util-linux (not mkinitcpio) to abide by your strange religion. You really need to have a UTF-8 LC_CTYPE these days, or all sorts of programs are going to behave strangely.

> I find it mildly disturbing that mkinitcpio behaves differently depending on the locale, especially since it uses bsdtar which can be quite finicky with locale stuff.
I can only assume you're referring to warnings like this:

https://projects.archlinux.org/mkinitcpio.git/commit/?id=34b9a2b1509c0288

Please, fix your locale.
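
For illustration, a minimal sketch of how one might check whether the current environment selects a UTF-8 character map before running locale-sensitive tools. The helper name is_utf8_charmap is hypothetical and is not part of mkinitcpio or util-linux; the only external command used is the standard locale utility.

```sh
#!/bin/sh
# Hypothetical helper (not part of mkinitcpio): succeeds if the given
# charmap name denotes UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# `locale charmap` prints the character map selected by LC_ALL, LC_CTYPE or LANG.
# Fall back to "unknown" if the locale utility is unavailable.
charmap=$(locale charmap 2>/dev/null) || charmap=unknown

if is_utf8_charmap "$charmap"; then
    echo "charmap is $charmap"
else
    echo "warning: charmap is $charmap, not UTF-8" >&2
fi
```

Running the script as "LANG=C sh check.sh" (the file name is arbitrary) should take the warning branch, since the C locale does not select a UTF-8 charmap.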
Comment by A Web (aweb) - Tuesday, 27 August 2013, 06:06 GMT
There are two problems with "fixing my locale".

The first issue is that LC_CTYPE requires a full-blown locale archive, including messages, even if you don't want the messages. (If I'm wrong here, you will make me very happy by explaining how I can use a charmap without having a corresponding locale in the archive.) A partial solution might be to do something like other linux distributions, which include a C.UTF-8 locale. This could be as simple as running "localedef --prefix "$BUILDROOT" -i POSIX -c -f UTF-8 C.UTF-8". I do this in a mkinitcpio hook I use for images I boot over the network to install arch on new machines. Otherwise, including en_US.UTF-8 can easily add 10% to the size of my boot images.

The second issue is that UTF-8 makes tools behave incorrectly in some cases, because not all byte sequences are valid UTF-8. Say, for example, that you want to use lvm snapshots and bsdtar to back up a file system. With a UTF-8 locale, this flat-out does not work, because bsdtar will choke on file names that contain invalid UTF-8 byte sequences. Sure, I can tell my users that they should fix their locales, but if they don't, I should still be able to recreate the actual state of the file system from my backups. This requires the use of an 8-bit charmap such as ISO-8859-1, because even if my backups don't internally represent the state of the file system correctly (e.g., storing a single two-byte UTF-8 character as two unicode code points), at least what comes out of the restore process is exactly the same as the original file system.
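
As a rough sketch of the underlying problem, file names can be checked for UTF-8 validity independently of the current locale by round-tripping them through iconv. The helper is_valid_utf8 is hypothetical, and the sketch assumes an iconv implementation (such as the one shipped with glibc) that exits non-zero on malformed input. It does not involve bsdtar itself.

```sh
#!/bin/sh
# Hypothetical helper: succeeds if the argument is a valid UTF-8 byte sequence.
# Assumption: iconv rejects malformed input with a non-zero exit status.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Three sample names: plain ASCII, a valid two-byte sequence (U+00E9),
# and the invalid overlong sequence 0xC0 0x80.
i=0
for name in "plain" "$(printf 'caf\303\251')" "$(printf 'annoying\300\200')"; do
    i=$((i + 1))
    if is_valid_utf8 "$name"; then
        echo "sample $i: valid UTF-8"
    else
        echo "sample $i: NOT valid UTF-8"
    fi
done
```

A backup script could use a check like this to flag names that a UTF-8 locale would reject, before deciding which LC_CTYPE to run the archiver under.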

Basically I find I end up having to change my locale around to get different programs to work correctly, so I usually just run with the C locale and then alias various commands to 'LC_CTYPE=whatever command'. Also, I like using the 8th bit for meta in bash, emacs, and screen, and doing so requires a 7-bit chartype and non-UTF-8 keyboard input.

I guess the upshot is that I find this internationalization stuff frustratingly finicky in linux, but under the circumstances think you would be justified in marking the bug as closed/wontfix.

As always, I'm grateful for how promptly and clearly you respond to bug reports, even when the bug is unimportant and you consider it caused by some idiosyncrasy of my system. This is one of the things that makes arch so great.
Comment by Dave Reisner (falconindy) - Tuesday, 27 August 2013, 13:55 GMT
I still don't understand what you mean by 'using the 8th bit for meta'.

I'm trying to understand your use case, but you're fighting an uphill battle. I can't throw any numbers, but I suspect that there are more tools which will misbehave without a UTF-8 locale than tools which will misbehave WITH a UTF-8 locale. That's just the world we live in.
Comment by A Web (aweb) - Tuesday, 27 August 2013, 19:36 GMT
Well, we're heading off topic, but for example I like to use \341 (M-a) as my escape character in screen, and type Alt-a to get it. There doesn't appear to be a way to configure a unicode escape sequence in screen. (I haven't tried tmux as recently, but seem to remember it is no better.)

Another example is that when running emacs in a terminal (emacs -nw), I want to type Alt-x to get M-x. I can't use "XTerm*metaSendsEscape: true", because this inserts ESC characters that mess up viper mode in emacs. Bash is a bit more forgiving, because you can bind multi-byte characters to actions, but this doesn't seem to be there by default. Hence, I would have to bind every single Meta character manually if I want things like Alt-Backspace to delete a word rather than insert "LATIN SMALL LETTER Y WITH DIAERESIS".

Now maybe there is an easy way to fix these problems, in which case you will vastly improve my life by pointing me at a solution. However, as someone who spends most of my day at the command-line, I want all the key combinations I can get. Taking away the Alt key (or making it equivalent to the already-overloaded ESC) makes it that much harder to find unused key combinations for escape characters or short-cuts.

I do agree that everything would be nicer if we had UTF-8 throughout the system. Many years ago, I worked on the Plan 9 operating system, where this was the case. Plan 9 was great, because you could write code that was Unicode clean almost without thinking about it. Unfortunately, most linux tools are only half-way there, and the file system allows illegal (non-UTF-8) file names that lead to unexpected or dangerous behavior in system administrative tasks.

Here's a simple example. Try the following two commands:

LC_CTYPE=C touch $(printf "annoying\xc0\x80")

LC_CTYPE=en_US.UTF-8 find . -name 'annoy*' -print

In this case, find does not even find the annoying file. So if you are a system administrator trying to understand or deal with weird stuff users have done, it's safest to start with the C locale (or at least some locale that admits all byte sequences, such as en_US.ISO-8859-1), and just switch your LC_CTYPE when necessary.
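
The C-locale half of this advice can be sketched in a self-contained way, without requiring any UTF-8 locale to be installed. The sketch assumes a Linux file system that accepts arbitrary bytes in file names, and it uses octal escapes because \x escapes are not supported by every printf implementation.

```sh
#!/bin/sh
# Assumption: the file system accepts arbitrary bytes in file names
# (true for typical Linux file systems such as ext4 or tmpfs).
dir=$(mktemp -d)

# Create a file whose name ends in the invalid UTF-8 sequence 0xC0 0x80.
touch "$dir/$(printf 'annoying\300\200')"

# In the C locale every byte counts as one character, so the glob
# pattern matches the name regardless of its encoding.
matches=$(LC_ALL=C find "$dir" -name 'annoy*' | wc -l)
echo "matches in C locale: $matches"

rm -rf "$dir"
```

Replacing LC_ALL=C with a UTF-8 locale in the find invocation is the comparison described above; whether the file is then missed may depend on the find version.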
Comment by Dave Reisner (falconindy) - Wednesday, 28 August 2013, 17:13 GMT
I'm afraid I don't have any panacea for you, but I'm sufficiently intrigued that I might try to delve into this one day.

Thanks for the background. I'm going to close this as wontfix, since there's nothing that mkinitcpio itself is doing wrong here.
