FS#42088 - pacman executed with C locale can sometimes output UTF-8 characters (causes problems with puppet)
Attached to Project:
Arch Linux
Opened by Damien Gombault (Desintegr) - Tuesday, 23 September 2014, 14:46 GMT
Last edited by Dave Reisner (falconindy) - Tuesday, 23 September 2014, 16:48 GMT
Details
Hi.
I use Puppet to manage my Arch Linux installations. Puppet runs pacman with the C locale to install packages, and it expects ASCII output from the external programs it runs. Puppet uses the -Qi option to check whether a package is installed, but I get an error with some packages (p7zip, freerdp, etc.).

Debug log:

Debug: Executing '/usr/bin/pacman -Q'
Debug: Executing '/usr/bin/pacman -Qg'
Debug: Executing '/usr/bin/pacman -Qi p7zip'
Debug: Executing '/usr/bin/pacman --noconfirm --noprogressbar -Sy p7zip'
Debug: Executing '/usr/bin/pacman -Qi p7zip'
Error: Could not set 'present' on ensure: invalid byte sequence in US-ASCII at 205:/root/Puppet/modules/packages/manifests/init.pp

Here is the output of 'pacman -Qi p7zip' (run manually with the C locale):

Name : p7zip
Version : 9.20.1-9
Description : Command-line version of the 7zip compressed file archiver
Architecture : i686
URL : http://p7zip.sourceforge.net/
Licenses : GPL custom
Groups : None
Provides : None
Depends On : gcc-libs bash
Optional Deps : wxgtk2.8: GUI
                desktop-file-utils: desktop entries [installed]
Required By : None
Optional For : kdeutils-ark
Conflicts With : None
Replaces : None
Installed Size : 7871.00 KiB
Packager : Bartłomiej Piotrowski <bpiotrowski@archlinux.org>
Build Date : Mon Jan 6 20:32:00 2014
Install Date : Tue Sep 23 16:29:29 2014
Install Reason : Explicitly installed
Install Script : Yes
Validated By : Signature

I noticed that the Packager line contains a UTF-8 encoded, non-ASCII character. This character raises the 'invalid byte sequence in US-ASCII' exception in the Puppet code.

Could the pacman output be normalized when running it with the C locale? This should fix the Puppet problem.

Thank you.
Closed by Dave Reisner (falconindy)
Tuesday, 23 September 2014, 16:48 GMT
Reason for closing: Won't fix
Additional comments about closing: Not something to be tackled in pacman -- your environment must be consistent with itself. unicode or no unicode...
> Could the pacman output be normalized when running it with the C locale ?
What does it even mean to "normalize" utf8 data? It's already valid utf8 codepoints. Your locale is simply set to ignore the idea that characters can be multiple bytes.
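To make the multi-byte point concrete, here is a minimal Python sketch (Python is used only for illustration; Puppet itself is Ruby, and the variable names are made up for this example). It shows that 'ł' occupies two bytes in UTF-8 and that a strict ASCII decoder rejects those bytes, which is analogous to the 'invalid byte sequence in US-ASCII' error in the report.

```python
# The packager's first name as it appears in the report.
name = "Bartłomiej"

# UTF-8 encodes 'ł' (U+0142) as two bytes, so 10 characters become 11 bytes.
raw = name.encode("utf-8")
l_stroke_bytes = "ł".encode("utf-8")

# A consumer that insists on ASCII cannot decode these bytes.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The same bytes are perfectly valid when read as UTF-8.
utf8_roundtrip = raw.decode("utf-8")
```

The data is identical in both cases; only the reader's assumption about the encoding differs.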
Should a program run with the C locale display a non-ASCII character?
By "normalize" I mean converting all UTF-8 characters to ASCII ("Last Packager: Bartłomiej Piotrowski" to "Last Packager: Bartlomiej Piotrowski"): perform Normalization Form Canonical Decomposition (NFD), then remove the special characters.
Should I report this on the Puppet bug tracker?
> By "normalize" I mean converting all UTF-8 characters to ASCII ("Last Packager: Bartłomiej Piotrowski" to "Last Packager: Bartlomiej Piotrowski"): perform Normalization Form Canonical Decomposition (NFD), then remove the special characters.
If a program did this without me intentionally instructing it to do so, I'd be pretty annoyed. There's no way to programmatically convert, e.g. ł -> l, so you'd need to maintain some static mapping. Really don't think this is pacman's job, and linking to some library that does this for the sole purpose of placating non-utf8 locales seems completely backwards...
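A small Python sketch of the proposed "decompose, then strip" approach illustrates the limitation described here. The helper names `strip_marks` and `to_ascii` and the `FALLBACK` table are hypothetical and exist only for this example; this is not something pacman does.

```python
import unicodedata

def strip_marks(text):
    """Apply NFD, then drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical hand-maintained table for characters that NFD leaves intact.
FALLBACK = str.maketrans({"ł": "l", "Ł": "L"})

def to_ascii(text):
    """NFD stripping followed by the static fallback mapping."""
    return strip_marks(text).translate(FALLBACK)

accented = strip_marks("caf\u00e9")        # 'é' decomposes into 'e' + accent
unchanged = strip_marks("Bartłomiej")      # 'ł' has no canonical decomposition
with_table = to_ascii("Bartłomiej")        # only the static table handles it
```

NFD handles 'é' because it decomposes into a base letter plus a combining accent, but 'ł' is an atomic code point, so the conversion only succeeds once a static mapping is supplied.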
Tomorrow I will try configuring a UTF-8 locale before running Puppet.
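One way to sketch that workaround in Python is to run the command with a UTF-8 locale in its environment and decode the output as UTF-8. The helper name `run_with_locale` and the default locale `en_US.UTF-8` are assumptions for this example (the locale must actually be generated on the system), and the thread does not confirm whether this resolves the Puppet error.

```python
import os
import subprocess

def run_with_locale(cmd, locale_name="en_US.UTF-8"):
    """Run cmd with LC_ALL set to locale_name and return stdout decoded as UTF-8."""
    # Copy the current environment and override only LC_ALL.
    env = dict(os.environ, LC_ALL=locale_name)
    result = subprocess.run(
        cmd,
        env=env,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout

# Example call on an Arch system (not executed here):
# info = run_with_locale(["/usr/bin/pacman", "-Qi", "p7zip"])
```

The point is the consistency asked for in the closing comment: the producer's locale and the consumer's decoder agree on UTF-8.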
There are ways to programmatically convert ł -> l (I don't know about the C language), but libraries exist for Python, for example:
http://stackoverflow.com/questions/14682397/can-somone-explain-how-unicodedata-normalizeform-unistr-work-with-examples
But I agree, adding a library to pacman just for UTF-8 manipulation is not a great idea.
Thank you for your replies.
Right -- libraries exist to do this with hard-coded tables of translations. There's no relationship between the codepoint for ł and l, unlike the ASCII relationship between upper and lower case (i.e. (upper_case | (1<<5)) == lower_case).
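The contrast can be checked with a short Python sketch (variable names are illustrative). It verifies the ASCII case-bit relationship for every uppercase letter and then queries the Unicode character database for the canonical decompositions of 'é' and 'ł'.

```python
import string
import unicodedata

# In ASCII, setting bit 5 (0x20) turns an uppercase letter into its lowercase form.
case_bit = 1 << 5
case_bit_holds = all(
    (ord(upper) | case_bit) == ord(upper.lower())
    for upper in string.ascii_uppercase
)

# 'é' (U+00E9) decomposes into 'e' (U+0065) plus a combining acute accent (U+0301).
e_acute_decomposition = unicodedata.decomposition("\u00e9")

# 'ł' (U+0142) has no decomposition entry, so nothing links it to 'l'.
l_stroke_decomposition = unicodedata.decomposition("\u0142")
```

An empty decomposition string means the character database records no structural link between the two code points, which is why a lookup table is required.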