FS#42088 : pacman executed with C locale can sometimes output UTF-8 characters (cause problems with puppet)

Attached to Project: Arch Linux
Opened by Damien Gombault (Desintegr) - Tuesday, 23 September 2014, 14:46 GMT
Last edited by Dave Reisner (falconindy) - Tuesday, 23 September 2014, 16:48 GMT

Task Type	Bug Report
Category	Arch Projects
Status	Closed
Assigned To	No-one
Architecture	All
Severity	Low
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	1 Damien Gombault (Desintegr) (2014-09-23)
Private	No

Details

Hi.

I use Puppet to manage my Arch Linux installations.
Puppet runs pacman with the C locale to install packages.
Puppet expects ASCII output from the external programs it runs.

Puppet uses the -Qi option to check whether a package is installed, but I get an error with some packages (p7zip, freerdp, etc.).

Debug log:

Debug: Executing '/usr/bin/pacman -Q'
Debug: Executing '/usr/bin/pacman -Qg'
Debug: Executing '/usr/bin/pacman -Qi p7zip'
Debug: Executing '/usr/bin/pacman --noconfirm --noprogressbar -Sy p7zip'
Debug: Executing '/usr/bin/pacman -Qi p7zip'
Error: Could not set 'present' on ensure: invalid byte sequence in US-ASCII at 205:/root/Puppet/modules/packages/manifests/init.pp

Here is the output of 'pacman -Qi p7zip' (executed manually with the C locale):

Name : p7zip
Version : 9.20.1-9
Description : Command-line version of the 7zip compressed file archiver
Architecture : i686
URL : http://p7zip.sourceforge.net/
Licenses : GPL custom
Groups : None
Provides : None
Depends On : gcc-libs bash
Optional Deps : wxgtk2.8: GUI
                desktop-file-utils: desktop entries [installed]
Required By : None
Optional For : kdeutils-ark
Conflicts With : None
Replaces : None
Installed Size : 7871.00 KiB
Packager : Bartłomiej Piotrowski <bpiotrowski@archlinux.org>
Build Date : Mon Jan 6 20:32:00 2014
Install Date : Tue Sep 23 16:29:29 2014
Install Reason : Explicitly installed
Install Script : Yes
Validated By : Signature

I noticed that the Packager line contains a non-ASCII character encoded as UTF-8.
This character raises the 'invalid byte sequence in US-ASCII' exception in the Puppet code.
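
The failure can be reproduced outside Puppet. The sketch below is only a Python analogue (Puppet is written in Ruby, and its actual code path is not shown in this report): it encodes the Packager line as UTF-8 bytes and then decodes it the way a strict ASCII-only consumer would. The helper name `decode_as_ascii` is made up for illustration.

```python
# Packager line as stored in the package metadata: UTF-8 bytes,
# independent of the locale pacman is run with.
packager_line = "Packager : Bartłomiej Piotrowski <bpiotrowski@archlinux.org>"
raw = packager_line.encode("utf-8")


def decode_as_ascii(data: bytes) -> str:
    """Decode bytes the way a strict ASCII-only consumer would (hypothetical helper)."""
    return data.decode("ascii")


try:
    decode_as_ascii(raw)
    failed_at = None
except UnicodeDecodeError as exc:
    # exc.start is the byte offset of the first byte outside the ASCII range.
    failed_at = exc.start
    print(f"cannot decode byte {raw[exc.start]:#x} at offset {exc.start}")
```

The decode fails on the first byte of ł, which is the Python counterpart of the "invalid byte sequence in US-ASCII" error above.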

Could the pacman output be normalized when running it with the C locale?
This should fix the puppet problem.

Thank you.

Closed by Dave Reisner (falconindy)
Tuesday, 23 September 2014, 16:48 GMT
Reason for closing: Won't fix
Additional comments about closing: Not something to be tackled in pacman -- your environment must be consistent with itself. unicode or no unicode...

Comment by Dave Reisner (falconindy) - Tuesday, 23 September 2014, 15:14 GMT

Seems like something to be solved in Puppet. Changing the locale doesn't change the data, just the interpretation of the data.

> Could the pacman output be normalized when running it with the C locale?
What does it even mean to "normalize" utf8 data? It's already valid utf8 codepoints. Your locale is simply set to ignore the idea that characters can be multiple bytes.
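
A small Python sketch of "the locale changes the interpretation, not the data". It assumes Latin-1 as a stand-in for a single-byte view of the bytes (the C locale proper only defines ASCII, so this is an illustration, not what any particular libc does).

```python
name = "Bartłomiej"
raw = name.encode("utf-8")         # the bytes as stored; these never change

as_utf8 = raw.decode("utf-8")      # multi-byte aware: ł is one character
as_latin1 = raw.decode("latin-1")  # one byte per character: ł shows up as two

# Same bytes, two readings with different lengths.
print(len(raw), len(as_utf8), len(as_latin1))
print(raw[4:6])                    # the two bytes that encode ł
```

Both decodes start from the identical byte string; only the number of characters the reader sees differs.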

Comment by Damien Gombault (Desintegr) - Tuesday, 23 September 2014, 15:54 GMT

I opened this task on the Arch Linux bug tracker, but I'm not sure whether the bug is in pacman or Puppet.
Should a program run with the C locale display a non-ASCII character?

By "normalize" I mean converting all UTF-8 characters to ASCII, e.g. "Last Packager: Bartłomiej Piotrowski" to "Last Packager: Bartlomiej Piotrowski" (perform Normalization Form Canonical Decomposition, then remove the special characters).

Should I report this on the Puppet bug tracker?

Comment by Dave Reisner (falconindy) - Tuesday, 23 September 2014, 16:12 GMT

No, sorry, I meant your Puppet configuration. Surely you can control the environment that the program runs in...

> By "normalize" I mean converting all UTF-8 characters to ASCII, e.g. "Last Packager: Bartłomiej Piotrowski" to "Last Packager: Bartlomiej Piotrowski" (perform Normalization Form Canonical Decomposition, then remove the special characters).
If a program did this without me intentionally instructing it to do so, I'd be pretty annoyed. There's no way to programmatically convert, e.g. ł -> l, so you'd need to maintain some static mapping. Really don't think this is pacman's job, and linking to some library that does this for the sole purpose of placating non-utf8 locales seems completely backwards...
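
To make the point concrete, here is a minimal Python sketch using the standard `unicodedata` module. It applies the proposed "NFD, then drop the marks" approach and shows that it handles é but leaves ł untouched, because ł has no canonical decomposition. `strip_marks` and `STATIC_MAP` are hypothetical names, and the two-entry table is deliberately incomplete.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """NFD-decompose the text, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical, deliberately tiny table for letters that NFD cannot split.
STATIC_MAP = str.maketrans({"ł": "l", "Ł": "L"})

print(strip_marks("\u00e9"))                             # é decomposes into e + accent
print(strip_marks("Bartłomiej"))                         # ł survives unchanged
print(strip_marks("Bartłomiej".translate(STATIC_MAP)))   # needs the table
```

Any real transliteration library carries a much larger version of such a table.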

Comment by Damien Gombault (Desintegr) - Tuesday, 23 September 2014, 16:40 GMT

My current environment uses the C locale (fresh installation).
Tomorrow I will try configuring a UTF-8 locale before running Puppet.

There are ways to programmatically convert ł -> l (I don't know about the C language), but libraries exist for Python, for example:
http://stackoverflow.com/questions/14682397/can-somone-explain-how-unicodedata-normalizeform-unistr-work-with-examples
But I agree, adding a library to pacman just for UTF-8 manipulation is not a great idea.

Thank you for your replies.

Comment by Dave Reisner (falconindy) - Tuesday, 23 September 2014, 16:47 GMT

> There are ways to programmatically convert ł -> l (I don't know about the C language), but libraries exist for Python, for example:
Right -- libraries exist to do this with hard-coded tables of translations. There's no relationship between the codepoints for ł and l, unlike the ASCII relationship between upper and lower case (i.e. (upper_case | (1<<5)) == lower_case).
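
The ASCII case relationship mentioned here can be checked with a short Python sketch. `to_lower_ascii` is a hypothetical helper written for this illustration; it only accepts a single ASCII uppercase letter.

```python
def to_lower_ascii(ch: str) -> str:
    """Set bit 5 on an ASCII uppercase letter to get its lowercase form."""
    if len(ch) != 1 or not ("A" <= ch <= "Z"):
        raise ValueError("expected a single ASCII uppercase letter")
    return chr(ord(ch) | (1 << 5))


print(to_lower_ascii("L"))

# The code points of l and ł are not related by a single bit like that.
print(hex(ord("l")), hex(ord("ł")))
```

Upper and lower case ASCII letters differ by exactly one bit, while l (U+006C) and ł (U+0142) differ in several, so no comparable bit trick maps one to the other.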
