FS#7641 - non-latin1 keymaps mistranslation in unicode mode + fix

Attached to Project: Arch Linux
Opened by Michal Soltys (msoltyspl) - Saturday, 21 July 2007, 00:21 GMT
Last edited by Roman Kyrylych (Romashka) - Monday, 22 October 2007, 13:12 GMT
Task Type Bug Report
Category System
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Roman Kyrylych (Romashka)
Architecture All
Severity Medium
Priority Normal
Reported Version 2007.05 Duke
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 6
Private No

Details

Arch Linux uses a plain dumpkeys | loadkeys --unicode when switching to unicode, without specifying any keymap charset. When a different keymap is used, say a Polish one prepared for a specific charset such as iso-8859-2, the mapping won't be loaded properly (the charset defaults to iso-8859-1). For example, if we have Ą under 0xA1 in 8859-2, it will be loaded directly as 0xA1 (U+00A1) under unicode too, but it should be 0x104 (U+0104); a proper -c parameter guarantees that.

Remark: many font / keymap combinations will still appear to work with an improperly translated keymap, because the fonts were prepared for the respective charsets - e.g. lat2-16 will output Ą both at 0xA1 and at 0x104.
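
To make the mistranslation concrete, here is a minimal sketch that decodes the same byte (0xA1) under both charsets and prints the resulting UTF-8 bytes. It only assumes that iconv and od are available; the helper name byte_as_utf8_hex is made up for this illustration and is not part of any Arch script.

```shell
#!/bin/sh
# Hypothetical helper: decode one byte under a given 8-bit charset and
# print the resulting UTF-8 encoding as hex digits.
byte_as_utf8_hex() {
    # $1 = source charset, $2 = the byte in octal (241 octal = 0xA1)
    printf "\\$2" | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# Treated as iso-8859-1 (the default): U+00A1, inverted exclamation mark.
byte_as_utf8_hex ISO-8859-1 241; echo    # c2a1
# Treated as iso-8859-2 (what the Polish keymap means): U+0104, the letter Ą.
byte_as_utf8_hex ISO-8859-2 241; echo    # c484
```

The two results differ, which is why the keymap charset has to be known before the keymap is translated to unicode.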

I've made a simple workaround in my install:

- added a KEYMAP_CHARSET parameter to rc.conf
- changed the above sequence to: /usr/bin/dumpkeys ${KEYMAP_CHARSET:+"-c${KEYMAP_CHARSET}"} | /bin/loadkeys --unicode
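
The ${KEYMAP_CHARSET:+...} expansion in the second item is what keeps the change backward compatible: the -c option only appears when the variable is set and non-empty. A small sketch of just that expansion follows; the function name charset_arg is an assumption made for the example, and nothing here touches the real console.

```shell
#!/bin/sh
# Hypothetical helper: print the extra dumpkeys argument, if any.
charset_arg() {
    # ${VAR:+word} expands to word only when VAR is set and non-empty.
    printf '%s' "${KEYMAP_CHARSET:+-c${KEYMAP_CHARSET}}"
}

KEYMAP_CHARSET="iso-8859-2"
echo "with charset:    [$(charset_arg)]"    # [-ciso-8859-2]

KEYMAP_CHARSET=""
echo "without charset: [$(charset_arg)]"    # []
```

With an empty KEYMAP_CHARSET the pipeline degrades to the original dumpkeys | loadkeys --unicode, so existing rc.conf files keep their old behaviour.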

Trivial patch attached.

Closed by  Roman Kyrylych (Romashka)
Monday, 22 October 2007, 13:12 GMT
Reason for closing:  Fixed
Comment by Roman Kyrylych (Romashka) - Tuesday, 24 July 2007, 10:49 GMT
note that the patch contains KEYMAP_CODING instead of KEYMAP_CHARSET
Comment by Michal Soltys (msoltyspl) - Wednesday, 25 July 2007, 17:53 GMT
Ahh, yes. Later I changed the name in the files to KEYMAP_CHARSET, as it seemed to make a bit more sense. But the patch was made earlier. Thanks for noticing.
Comment by Michal Soltys (msoltyspl) - Saturday, 25 August 2007, 23:04 GMT
So... any news from the people in charge? It's a trivial issue and simple to fix. Without -c it will create a small mess when national characters are used. Going back to the mentioned Ą: if we have a filename containing that letter, it will be recorded on disk in utf-8, but as the encoding of 0xA1 instead of 0x104.

Extra note: an analogous change should be made in initcpio's keymap module as well.
Comment by sda (sda00) - Thursday, 13 September 2007, 16:53 GMT
I've got some questions, if you please... I'm not an expert, so I beg your pardon for possible stupidity...

I'm using the following settings:
[code]
$ locale

LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=C
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
[/code]
[code]
$ locale -a
C
POSIX
de_DE.utf8
en_US
en_US.iso88591
en_US.utf8
ru_RU
ru_RU.cp1251
ru_RU.iso88595
ru_RU.koi8r
ru_RU.utf8
russian
[/code]
and extract from /etc/rc.conf
[code]
LOCALE="en_US.utf8"
HARDWARECLOCK="localtime"
TIMEZONE="Europe/Moscow"
KEYMAP="ru-utf"
KEYMAP_CHARSET=""
CONSOLEFONT="Cyr_a8x16.psfu"
CONSOLEMAP=""
USECOLOR="yes"
[/code]
and here's the last "bullet in the head"
[code]
$ cat /usr/share/kbd/keymaps/i386/qwerty/ru-utf.map.gz | gzip -d | enca -L russian -

Universal transformation format 7 bits; UTF-7
LF line terminators
[/code]
and everything runs just fine (except for what is mentioned in the questions below). Maybe the "case" is inside the *.map files? Yes, this ru-utf.map.gz is far from the default. Let's see:
[code]
$ cat /usr/share/kbd/keymaps/i386/qwerty/ru-ms.map.gz | gzip -d | enca -L russian -
7bit ASCII characters
[/code]
Moreover, I've got CONFIG_NLS_DEFAULT="iso8859-1" along with CONFIG_NLS_UTF8=y in the running kernel.

And here are my stupid questions:
Why do gtk1 apps with these settings require export LC_CTYPE="ru_RU.UTF-8" before launch (for correct display of utf8 content)?
Why are xdvi/xpdf unable to save a file with non-latin characters in the filename?
Why is xpdf not capable of searching for non-latin characters in documents?

Thank you.
Comment by SKOCDOPOLE Tomas (skocdopolet) - Thursday, 13 September 2007, 20:11 GMT
Comment by Michal Soltys (msoltyspl) - Thursday, 13 September 2007, 20:49 GMT
sda00:

Well, I don't really use X windows, so I won't be much help here. Still - KEYMAP, CONSOLEFONT, CONSOLEMAP and the KEYMAP_CHARSET I added are relevant only to the console driver. Afaik they are ignored by X, which uses the keyboard in raw mode, and the display part is of course completely different there (that's why I said on the forums it's a long shot :)

Still, your setup (regarding the console input/output part) seems fine - assuming the ru-utf keymap is what I think it is.

Blind guess about the gtk1 thing - it might be using the locale setting to make X select appropriate fonts for display. So only ru_RU will make it select fonts with cyrillic glyphs.

In the gentoo docs (more in http://www.gentoo.org/doc/en/utf-8.xml) I've found the following comment: "The exceptions to this rule come in Xlib and GTK+1. GTK+1 requires a iso-10646-1 FontSpec in the ~/.gtkrc, for example -misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1. Also, applications using Xlib or Xaw will need to be given a similar FontSpec, otherwise they will not work."

Comment by sda (sda00) - Thursday, 13 September 2007, 22:00 GMT
Michal:

I'll try to do some testing, but currently xpdf, xdvi, snd and maybe all other Motif apps refuse to accept any non-latin characters...
Comment by Roman Kyrylych (Romashka) - Friday, 14 September 2007, 08:32 GMT
> Why xdvi/xpdf are unable to save file with non-latin characters in filename?

Try creating /etc/profile.d/gtk+.sh with the following content:
#!/bin/sh
G_BROKEN_FILENAMES=1
export G_BROKEN_FILENAMES
G_FILENAME_ENCODING=@locale
export G_FILENAME_ENCODING

and relogin.
Comment by sda (sda00) - Saturday, 15 September 2007, 15:24 GMT
Roman, thanks, but it doesn't work. I beg your pardon, but I suppose that Motif apps don't care much about gtk settings...
Comment by Bogdan Szczurek (thebodzio) - Tuesday, 25 September 2007, 21:47 GMT
How about restructuring the rc.sysinit file to use loadkeys -u [keymap] if LOCALE is utf? It's a small complication in the script, but then we don't need a new parameter.
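
A rough sketch of what that suggestion could look like follows. This is not the patch that was eventually applied; the function name keymap_command, the -q flag, the example keymap name "pl" and the LOCALE pattern are all assumptions chosen for illustration. The function only prints the command instead of running it, so it can be tried outside a real console.

```shell
#!/bin/sh
# Hypothetical helper, not the actual rc.sysinit code: print the loadkeys
# command that would be run for a given LOCALE and KEYMAP.
keymap_command() {
    locale=$1
    keymap=$2
    case "$locale" in
        *.[Uu][Tt][Ff]8*|*.[Uu][Tt][Ff]-8*)
            # UTF-8 locale: let loadkeys translate the keymap to unicode.
            echo "/bin/loadkeys -q -u $keymap" ;;
        *)
            # Any other locale: load the keymap as before.
            echo "/bin/loadkeys -q $keymap" ;;
    esac
}

keymap_command "en_US.utf8" "pl"    # /bin/loadkeys -q -u pl
keymap_command "pl_PL" "pl"         # /bin/loadkeys -q pl
```

The appeal of this design is that the decision is derived from LOCALE, which rc.conf already has, so no KEYMAP_CHARSET parameter is needed.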
Comment by Michal Soltys (msoltyspl) - Wednesday, 03 October 2007, 16:42 GMT
You're right, it would be simpler. And hardly a complication.
Comment by Bogdan Szczurek (thebodzio) - Wednesday, 03 October 2007, 20:10 GMT
Glad to hear that :).
I don't have much time to dig into this and provide a patch, but if by some chance I do, I'll try to.
Comment by Bogdan Szczurek (thebodzio) - Saturday, 13 October 2007, 12:25 GMT
I've forged a little patch for rc.sysinit that takes care of loading the utf8 keymap properly. It's sufficient for me and I hope I haven't messed anything up ;).
By the way: you were right, Michal -- it was hardly a complication once one got to it :).
Comment by Michal Soltys (msoltyspl) - Tuesday, 16 October 2007, 10:23 GMT
I've made something similar, but more verbose. I also got rid of the 2x echo called from locale.sh (setfont sets ESC ( K implicitly, while ESC % G can be set for all consoles in a loop in rc.sysinit - seems to be working fine).

I could also provide a script that fixes broken filenames.

Attached this time - the whole part responsible for setting the locale, instead of a diff.

Either way - is there even a point in providing diffs? Dunno how the arch devs approach it, but it's been almost 3 months since I submitted it...
Comment by Roman Kyrylych (Romashka) - Tuesday, 16 October 2007, 10:49 GMT
Expect rc.sysinit-utf8-keymap.patch to be applied soon.
ESC % G should be sent explicitly from profile.sh too (this way, if a user or application resets the console, it can be restored with a simple relogin).
Since I don't use a non-UTF-8 locale anymore, I forgot that broken filenames ( FS#5487 ) are fixed now.
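
For readers unfamiliar with the sequence being discussed: the Linux console switches to UTF-8 mode on ESC % G and back on ESC % @. The sketch below only shows the raw bytes; the function names are made up, and the loop over /dev/tty[1-6] in the comment is an assumed usage, not the code from the attached scripts.

```shell
#!/bin/sh
# ESC % G switches a Linux console to UTF-8 mode; ESC % @ switches it back.
utf8_on_seq()  { printf '\033%%G'; }
utf8_off_seq() { printf '\033%%@'; }

# Show the raw bytes instead of sending them to a terminal.
utf8_on_seq  | od -An -tx1    # 1b 25 47
utf8_off_seq | od -An -tx1    # 1b 25 40

# Assumed usage in an init script (the device names are an assumption):
#   for tty in /dev/tty[1-6]; do utf8_on_seq > "$tty"; done
```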
Comment by Michal Soltys (msoltyspl) - Tuesday, 16 October 2007, 11:14 GMT
All right, good point (although unicode_start can be used for that too).

Regarding broken filenames, I meant something different - if someone has been relying on the console in utf8 until now, they will not be able to reference their files normally after that patch is applied. All filenames with national characters would have to be converted back from utf8 to iso-8859-1, then treated as iso-8859-N and converted to utf8 again. Thus my offer to write a simple script to do so.

Thanks for update.
Comment by Roman Kyrylych (Romashka) - Tuesday, 16 October 2007, 11:17 GMT
why? there are codepage and iocharset mount options for this.
Comment by Michal Soltys (msoltyspl) - Tuesday, 16 October 2007, 11:50 GMT
No,no - that's not it either.

Example (from the first post here): the letter 'Ą' is 0x104 in unicode and 0xA1 in 8859-2. Currently under archlinux (let's say on ext3), when the console is set to utf8 without any patches and you type that letter, you end up with U+00A1 encoded in utf8 (and used, for instance, in a filename) - due to the improperly loaded keymap. After the patch, typing 'Ą' gives the proper U+0104 encoded as utf8. So you will have problems referencing pre-patch files with national characters, just as you already would when mounting the disk under a different linux distro, when logging in from putty set to utf-8, etc.
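
The round trip described in this thread (utf8 back to iso-8859-1, then reinterpreted in the keymap's real charset) can be sketched with iconv. This is a minimal illustration of the idea, not the script offered above; the function name repair_name is hypothetical, and it only prints the repaired name rather than renaming anything.

```shell
#!/bin/sh
# Hypothetical helper: undo the mistranslation for one file name.
repair_name() {
    # $1 = mis-encoded name (UTF-8), $2 = charset the keymap was written for
    printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1 | iconv -f "$2" -t UTF-8
}

# A pre-patch "Ą": U+00A1 stored as the UTF-8 bytes c2 a1.
broken=$(printf '\302\241')
repair_name "$broken" ISO-8859-2 | od -An -tx1    # c4 84, i.e. U+0104
```

Plain ASCII characters pass through unchanged. A name that already contains a correctly encoded character outside iso-8859-1 should make the first iconv fail rather than be converted twice, but that behaviour is worth checking on real data before any renaming is automated.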
Comment by Roman Kyrylych (Romashka) - Tuesday, 16 October 2007, 12:12 GMT
Ah, got it, nice explanation. :)
I had such an issue when I switched from KOI8-U to UTF-8, but because the number of such files was very small - I did manual renames. :P
It would be nice if you provided a conversion script.
Comment by Tobias Powalowski (tpowa) - Tuesday, 16 October 2007, 14:54 GMT
There is already a program for that:
convmv
Comment by Thomas Bächler (brain0) - Sunday, 21 October 2007, 08:35 GMT
I am confused. Our initscripts don't use dumpkeys anymore, what do we do now to fix the original bug?
Comment by Michal Soltys (msoltyspl) - Sunday, 21 October 2007, 13:44 GMT
If they load the keymap with -u when utf-8 mode is chosen, then they are most likely fine, analogously to how it's done in the attachments above.
Comment by Roman Kyrylych (Romashka) - Sunday, 21 October 2007, 14:32 GMT
Please test initscripts-2007.11-1 from Testing.
Comment by Michal Soltys (msoltyspl) - Sunday, 21 October 2007, 18:54 GMT
Just tested that version of the scripts - works fine on my arch.
