FS#36670 - [mkinitcpio] locale bug
Attached to Project:
Arch Linux
Opened by A Web (aweb) - Monday, 26 August 2013, 03:46 GMT
Last edited by Dave Reisner (falconindy) - Wednesday, 28 August 2013, 17:13 GMT
Details
Description:
mkinitcpio truncates the list of hooks in non-UTF8 locales.
Additional info:
* package version(s): mkinitcpio 0.15.0-1
* config and/or log files etc.
Steps to reproduce:
Run "LANG=C mkinitcpio -L". Tons of hooks will be missing (including mdadm and mdadm_udev).
Closed by Dave Reisner (falconindy)
Wednesday, 28 August 2013, 17:13 GMT
Reason for closing: Won't fix
Additional comments about closing: See discussion
https://bbs.archlinux.org/viewtopic.php?pid=1317159#p1317159
Your setup is broken, and I can't fix util-linux (not mkinitcpio) to abide by your strange religion. You really need to have a UTF-8 LC_CTYPE these days, or all sorts of programs are going to behave strangely.
> I find it mildly disturbing that mkinitcpio behaves differently depending on the locale, especially since it uses bsdtar which can be quite finicky with locale stuff.
I can only assume you're referring to warnings like this:
https://projects.archlinux.org/mkinitcpio.git/commit/?id=34b9a2b1509c0288
Please, fix your locale.
The first issue is that LC_CTYPE requires a full-blown locale archive, including messages, even if you don't want the messages. (If I'm wrong here, you will make me very happy by explaining how I can use a charmap without having a corresponding locale in the archive.) A partial solution might be to do what other Linux distributions do and include a C.UTF-8 locale. This could be as simple as running 'localedef --prefix "$BUILDROOT" -i POSIX -c -f UTF-8 C.UTF-8'. I do this in a mkinitcpio hook I use for images I boot over the network to install Arch on new machines. Otherwise, including en_US.UTF-8 can easily add 10% to the size of my boot images.
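The localedef command above could be wrapped in an install hook along these lines. This is only a sketch, not the hook the commenter actually uses: the file name is hypothetical, the build() and help() functions follow the usual mkinitcpio install-hook layout, and the mkdir step assumes localedef needs the locale directory to exist under the prefix.

```shell
#!/bin/bash
# Hypothetical install hook, e.g. /usr/lib/initcpio/install/c-utf8
# (the file name "c-utf8" is an assumption made for illustration).

build() {
    # Assumption: localedef writes its archive below $BUILDROOT/usr/lib/locale
    # and expects that directory to exist already.
    mkdir -p "$BUILDROOT/usr/lib/locale"

    # Command quoted from the comment above: compile a C.UTF-8 locale from
    # the POSIX definition and the UTF-8 charmap into the image root.
    localedef --prefix "$BUILDROOT" -i POSIX -c -f UTF-8 C.UTF-8
}

help() {
    cat <<HELPEOF
Adds a minimal C.UTF-8 locale to the image so that tools get a UTF-8
LC_CTYPE without pulling in a full locale such as en_US.UTF-8.
HELPEOF
}
```

The hook would then be listed in the HOOKS array of mkinitcpio.conf under whatever name the file was given.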
The second issue is that UTF-8 makes tools behave incorrectly in some cases, because not all byte sequences are valid UTF-8. Say, for example, that you want to use lvm snapshots and bsdtar to back up a file system. With a UTF-8 locale, this flat-out does not work, because bsdtar will choke on file names that contain invalid UTF-8 byte sequences. Sure, I can tell my users that they should fix their locales, but if they don't, I should still be able to recreate the actual state of the file system from my backups. This requires the use of an 8-bit charmap such as ISO-8859-1, because even if my backups don't internally represent the state of the file system correctly (e.g., storing a single two-byte UTF-8 character as two unicode code points), at least what comes out of the restore process is exactly the same as the original file system.
Basically I find I end up having to change my locale around to get different programs to work correctly, so I usually just run with the C locale and then alias various commands to 'LC_CTYPE=whatever command'. Also, I like using the 8th bit for meta in bash, emacs, and screen, and doing so requires a 7-bit chartype and non-UTF-8 keyboard input.
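The per-command override described above might look roughly like this. It is a minimal sketch: with_ctype is a hypothetical helper name, and en_US.UTF-8 is only an example value that has to exist on the system for the override to have any effect on the program being run.

```shell
#!/bin/sh
# Keep the session in the C locale and override LC_CTYPE per command.
# with_ctype is a hypothetical helper, not part of any tool discussed here.
export LANG=C

with_ctype() {
    ctype=$1
    shift
    # env sets LC_CTYPE only in the environment of the command it launches.
    # Note that a set LC_ALL would still take precedence over LC_CTYPE.
    env LC_CTYPE="$ctype" "$@"
}

# The child process sees the override; the calling shell keeps LANG=C.
child_ctype=$(with_ctype en_US.UTF-8 sh -c 'printf %s "$LC_CTYPE"')
echo "child LC_CTYPE: $child_ctype"
echo "parent LANG: $LANG"
```

An alias such as alias foo='LC_CTYPE=en_US.UTF-8 foo' achieves the same thing for a single command; the function form just avoids repeating the assignment.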
I guess the upshot is that I find this internationalization stuff frustratingly finicky in linux, but under the circumstances think you would be justified in marking the bug as closed/wontfix.
As always, I'm grateful for how promptly and clearly you respond to bug reports, even when the bug is unimportant and you consider it caused by some idiosyncrasy of my system. This is one of the things that makes arch so great.
I'm trying to understand your use case, but you're fighting an uphill battle. I can't cite any numbers, but I suspect that more tools misbehave without a UTF-8 locale than misbehave WITH one. That's just the world we live in.
Another example is that when running emacs in a terminal (emacs -nw), I want to type Alt-x to get M-x. I can't use "XTerm*metaSendsEscape: true", because this inserts ESC characters that mess up viper mode in emacs. Bash is a bit more forgiving, because you can bind multi-byte characters to actions, but this doesn't seem to be there by default. Hence, I would have to bind every single Meta character manually if I want things like Alt-Backspace to delete a word rather than insert "LATIN SMALL LETTER Y WITH DIAERESIS".
Now maybe there is an easy way to fix these problems, in which case you will vastly improve my life by pointing me at a solution. However, as someone who spends most of my day at the command line, I want all the key combinations I can get. Taking away the Alt key (or making it equivalent to the already-overloaded ESC) makes it that much harder to find unused key combinations for escape characters or shortcuts.
I do agree that everything would be nicer if we had UTF-8 throughout the system. Many years ago, I worked on the Plan 9 operating system, where this was the case. Plan 9 was great, because you could write code that was Unicode clean almost without thinking about it. Unfortunately, most linux tools are only half-way there, and the file system allows illegal (non-UTF-8) file names that lead to unexpected or dangerous behavior in system administrative tasks.
Here's a simple example. Try the following two commands:
LC_CTYPE=C touch $(printf "annoying\xc0\x80")
LC_CTYPE=en_US.UTF-8 find . -name 'annoy*' -print
In this case, find does not even find the annoying file. So if you are a system administrator trying to understand or deal with weird stuff users have done, it's safest to start with the C locale (or at least some locale that admits all byte sequences, such as en_US.ISO-8859-1), and just switch your LC_CTYPE when necessary.
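One way to see why the file name above is a problem is to ask iconv whether its bytes decode as UTF-8. The sketch below assumes an iconv that rejects illegal input with a non-zero exit status (glibc's does); is_valid_utf8 is a hypothetical helper name, and the octal escapes are used only because they are portable across printf implementations.

```shell
#!/bin/sh
# Classify a file name as valid or invalid UTF-8 using iconv.
# is_valid_utf8 is a hypothetical helper, not part of find or bsdtar.
is_valid_utf8() {
    # Assumption: iconv exits non-zero on an illegal input sequence.
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \300\200 are the bytes 0xC0 0x80 from the touch command above.
bad_name=$(printf 'annoying\300\200')
# \303\277 is U+00FF (y with diaeresis) in well-formed UTF-8.
good_name=$(printf 'annoying\303\277')

if is_valid_utf8 "$bad_name"; then bad_result=valid; else bad_result=invalid; fi
if is_valid_utf8 "$good_name"; then good_result=valid; else good_result=invalid; fi

echo "bad_name: $bad_result"
echo "good_name: $good_result"
```

A check like this could be run over a directory tree under LC_ALL=C before a backup to flag names that UTF-8-aware tools may skip or reject.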
Thanks for the background. I'm going to close this as wontfix, since there's nothing that mkinitcpio itself is doing wrong here.