FS#7641 - non-latin1 keymaps mistranslation in unicode mode + fix
Attached to Project: Arch Linux
Opened by Michal Soltys (msoltyspl) - Saturday, 21 July 2007, 00:21 GMT
Last edited by Roman Kyrylych (Romashka) - Monday, 22 October 2007, 13:12 GMT
Details
Arch Linux uses a plain dumpkeys | loadkeys --unicode when switching to unicode mode, without specifying any keymap charset. When a different keymap is in use, let's say a Polish one prepared for a specific charset such as iso-8859-2, loadkeys won't load the mapping properly (it defaults to iso-8859-1). So for example, if we have Ą at 0xA1 in 8859-2, it will be loaded directly as 0xA1 in unicode mode too, but it should be 0x104 (and a proper -c parameter guarantees that).

Remark: many fonts / keymaps will still work with improperly translated keymaps, as they were prepared for the respective charsets, so e.g. lat2-16 will output Ą both at 0xA1 and 0x104.

I've made a simple workaround in my install: I added a KEYMAP_CHARSET parameter to rc.conf and changed the above sequence to:
/usr/bin/dumpkeys ${KEYMAP_CHARSET:+"-c${KEYMAP_CHARSET}"} | /bin/loadkeys --unicode
Trivial patch attached.
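To see what the expansion in the workaround does, here is a small sketch that only prints the command line instead of running dumpkeys. The helper name show_dumpkeys_cmd is made up for illustration; the expansion itself is copied from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of rc.sysinit): print the dumpkeys command
# line that the workaround would build for a given KEYMAP_CHARSET value.
show_dumpkeys_cmd() {
    KEYMAP_CHARSET=$1
    # ${VAR:+word} expands to nothing when VAR is unset or empty, so no
    # stray empty argument is passed; otherwise it yields one word, -c<charset>.
    set -- /usr/bin/dumpkeys ${KEYMAP_CHARSET:+"-c${KEYMAP_CHARSET}"}
    echo "$*"
}

show_dumpkeys_cmd ""            # prints: /usr/bin/dumpkeys
show_dumpkeys_cmd "iso-8859-2"  # prints: /usr/bin/dumpkeys -ciso-8859-2
```

So with KEYMAP_CHARSET="" in rc.conf the behaviour stays exactly as before, and the -c option only appears when a charset is actually given.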
rc.sysinit.patch
Extra note: an analogous change should be made in initcpio's keymap module as well.
I'm using the following settings:
[code]
$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=C
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
[/code]
[code]
$ locale -a
C
POSIX
de_DE.utf8
en_US
en_US.iso88591
en_US.utf8
ru_RU
ru_RU.cp1251
ru_RU.iso88595
ru_RU.koi8r
ru_RU.utf8
russian
[/code]
and an extract from /etc/rc.conf:
[code]
LOCALE="en_US.utf8"
HARDWARECLOCK="localtime"
TIMEZONE="Europe/Moscow"
KEYMAP="ru-utf"
KEYMAP_CHARSET=""
CONSOLEFONT="Cyr_a8x16.psfu"
CONSOLEMAP=""
USECOLOR="yes"
[/code]
and here's the last "bullet in the head"
[code]
$ cat /usr/share/kbd/keymaps/i386/qwerty/ru-utf.map.gz | gzip -d | enca -L russian -
Universal transformation format 7 bits; UTF-7
LF line terminators
[/code]
and everything is running just fine (except for what is mentioned in the questions below). Maybe the "case" is inside the *.map files? Yes, this ru-utf.map.gz is far from the default. Let's see:
[code]
$ cat /usr/share/kbd/keymaps/i386/qwerty/ru-ms.map.gz | gzip -d | enca -L russian -
7bit ASCII characters
[/code]
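If enca is not at hand, a rough check is to count the bytes above 0x7F in the decompressed keymap: zero means the file is plain 7-bit ASCII. This is only a sketch; the function name is made up, and the path in the usage comment is the one from the post above.

```shell
#!/bin/sh
# Hypothetical helper: count bytes >= 0x80 on stdin.
count_high_bytes() {
    # LC_ALL=C makes tr work on single bytes; delete 0x00-0x7F, count the rest.
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Usage (path taken from the post above):
#   gzip -dc /usr/share/kbd/keymaps/i386/qwerty/ru-ms.map.gz | count_high_bytes
printf 'abc\241\n' | count_high_bytes   # prints: 1
```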
Moreover, I've got CONFIG_NLS_DEFAULT="iso8859-1" along with CONFIG_NLS_UTF8=y in the running kernel.
And here are my stupid questions:
Why do gtk1 apps with these settings require export LC_CTYPE="ru_RU.UTF-8" before launch (for correct display of utf8 content)?
Why are xdvi/xpdf unable to save a file with non-latin characters in the filename?
Why is xpdf not capable of searching for non-latin characters in documents?
Thank you.
Well, I don't really use X windows, so I won't be of much help here. Still, KEYMAP, CONSOLEFONT, CONSOLEMAP and the KEYMAP_CHARSET I added are relevant only to the console driver. AFAIK they are ignored by X, which uses the keyboard in raw mode, and the display part is of course completely different there (that's why I said on the forums it's a long shot :)
Still, your setup (regarding console input/output part) seems fine - assuming ru-utf keymap is what I think it is.
A blind guess about the gtk1 thing: it might be using the locale setting to make X select appropriate fonts for display. So only ru_RU would make it select fonts with Cyrillic glyphs.
In the Gentoo docs (more at http://www.gentoo.org/doc/en/utf-8.xml) I've found the following comment: "The exceptions to this rule come in Xlib and GTK+1. GTK+1 requires a iso-10646-1 FontSpec in the ~/.gtkrc, for example -misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1. Also, applications using Xlib or Xaw will need to be given a similar FontSpec, otherwise they will not work."
I'll try to do some testing, but currently xpdf, xdvi, snd and maybe all other Motif apps refuse to accept any non-latin characters...
Try creating /etc/profile.d/gtk+.sh with the following content:
#!/bin/sh
G_BROKEN_FILENAMES=1
export G_BROKEN_FILENAMES
G_FILENAME_ENCODING=@locale
export G_FILENAME_ENCODING
and relogin.
I don't have much time to dig into this and provide a patch, but if by some chance I do, I'll try to.
By the way: you were right, Michal -- it was hardly a complication once one got to it :).
I could also provide a script that fixes broken filenames.
Attached this time: the whole part responsible for setting the locale, instead of a diff.
Either way, is there even a point in providing diffs? Dunno how Arch devs approach it, but it's been almost 3 months since I submitted it...
%G should be sent explicitly from profile.sh too (this way, if a user or application resets the console, it can be restored with a simple relogin).
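For reference, %G here is the console escape sequence ESC % G, which selects UTF-8 mode on the Linux console (ESC % @ switches back). A minimal sketch of what such a profile snippet could look like, assuming it should only fire on a Linux virtual console; the function name is invented for the example:

```shell
#!/bin/sh
# Hypothetical helper: emit ESC % G (select UTF-8 on the Linux console).
utf8_console_seq() {
    printf '\033%%G'
}

# Only send it to a real Linux virtual console, never into pipes or xterms.
if [ "$TERM" = "linux" ] && [ -t 1 ]; then
    utf8_console_seq
fi
```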
Since I don't use a non-UTF-8 locale anymore, I forgot that broken filenames (FS#5487) are fixed now.
Regarding broken filenames, I meant something different: if someone is relying on the console in utf8 now, they will not be able to reference their files normally after applying that patch. All the filenames with national characters would have to be converted back from utf8 to iso-8859-1, then the result treated as iso-8859-N and converted to utf8 again. Thus my offer to write a simple script to do so.
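The conversion chain described here (utf8 back to iso 1, reinterpret as iso N, then to utf8 again) can be sketched with iconv. This is an illustration for a single name, not the script offered above; fix_name is a made-up helper, and ISO-8859-2 is just the charset from the original report.

```shell
#!/bin/sh
# Hypothetical helper: repair one mistranslated name.
#   $1 = name as stored now (UTF-8 text typed through the unpatched keymap)
#   $2 = charset the keymap was actually written for, e.g. ISO-8859-2
fix_name() {
    # Step 1: UTF-8 -> ISO-8859-1 recovers the raw keymap bytes.
    # glibc iconv exits non-zero here if the name holds characters outside
    # Latin-1 (for example ones that are already correct), so we stop then.
    raw=$(printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1) || return 1
    # Step 2: reinterpret those bytes in the intended charset, back to UTF-8.
    printf '%s' "$raw" | iconv -f "$2" -t UTF-8
}

# "za" followed by U+00A1 (UTF-8 bytes C2 A1, octal 302 241) becomes "zaĄ".
fix_name "$(printf 'za\302\241')" ISO-8859-2; echo
```

Pure-ASCII names pass through unchanged. Before renaming anything for real, it is worth doing a dry run that only prints the old and new names.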
Thanks for the update.
Example (from the first post here): the letter 'Ą' is 0x104 in unicode and 0xA1 in 8859-2. Currently under Arch Linux (let's say on ext3), when the console is set to utf8 without any patches and you use that letter, you end up with U+00A1 encoded in utf8 (and used, for instance, in a filename) due to the improperly loaded keymap. After the patch, typing 'Ą' gives the proper U+0104 encoded as utf8. So you will have problems referencing before-patch files with national characters, when you mount the disk under a different Linux distro, when you log in from PuTTY set to utf-8, etc.
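The difference is easy to reproduce outside the console. A small sketch, assuming iconv and od are available: it decodes the single byte 0xA1 under each charset and prints the resulting UTF-8 bytes in hex. The function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: decode byte 0xA1 (octal 241) as charset $1 and
# print the UTF-8 encoding of the resulting character as hex.
decode_a1() {
    printf '\241' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

decode_a1 ISO-8859-1   # prints: c2a1  (U+00A1, what the unpatched setup stores)
decode_a1 ISO-8859-2   # prints: c484  (U+0104, the letter Ą)
```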
I had such an issue when I switched from KOI8-U to UTF-8, but because there were very few such files, I renamed them manually. :P
It would be nice if you provided a conversion script.
convmv