FS#44848 - [coreutils] tr produces messy nonsense when using UTF-8 characters

Attached to Project: Arch Linux
Opened by Mathias Steiger (mathiassteiger) - Monday, 04 May 2015, 13:58 GMT
Last edited by Sébastien Luttringer (seblu) - Wednesday, 13 May 2015, 22:43 GMT
Task Type: Bug Report
Category: Packages: Core
Status: Closed
Assigned To: Sébastien Luttringer (seblu)
Architecture: All
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 0
Private: No

Details

/usr/bin/tr is owned by coreutils 8.23-1


Example: echo -en "asdgf\nadsfdssdsaf" | tr '\n' '≠'
Expected output: asdgf≠adsfdssdsaf
Actual output: asdgfâadsfdssdsaf
Hex: 7361 6764 e266 6461 6673 7364 6473 6173 0066

Example: echo -en "asdgf\nadsfdssdsaf" | tr 'asd' '≠'
Expected output: ≠gf\nadsfdssdsaf
Actual output: ≠gf\n⠉f  âf
Hex: 89e2 67a0 0a66 a0e2 6689 89a0 a089 e289 0066

Example: echo -en "asdgf≠adsfdssdsaf" | tr '≠' '\n'
Expected output: asdgf\nadsfdssdsaf
Actual output: asdgf\n\n\nadsfdssdsaf
Hex: 7361 6764 0a66 0a0a 6461 6673 7364 6473 6173 0066

This worked for years; now the output is garbled.

Locale settings, the terminal used, and any aliases set are irrelevant.
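
The hex dumps above look like 16-bit little-endian words (the default layout of hexdump, or of od -x), so each pair of bytes appears swapped and an odd-length stream is padded with 00. A minimal sketch for viewing the same data one byte at a time, assuming od and xargs are available and the script is saved as UTF-8; hex_bytes is a hypothetical helper name, not part of coreutils:

```shell
# hex_bytes: hypothetical helper, prints stdin as space-separated hex bytes.
# od -tx1 prints one byte per column; xargs joins od's lines with single spaces.
hex_bytes() {
    od -An -v -tx1 | xargs
}

# '≠' (U+2260) is three bytes in UTF-8.
printf '≠' | hex_bytes    # e2 89 a0

# First example from the report, dumped in byte order.
# printf replaces `echo -en`, which is not portable across shells.
printf 'asdgf\nadsfdssdsaf' | tr '\n' '≠' | hex_bytes
```

Read in byte order, the first dump in the report is 61 73 64 67 66 e2 61 64 73 66 64 73 73 64 73 61 66: the newline was replaced by e2 alone, which is the first byte of '≠'.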

Closed by Sébastien Luttringer (seblu)
Wednesday, 13 May 2015, 22:43 GMT
Reason for closing: Upstream
Comment by Sébastien Luttringer (seblu) - Tuesday, 12 May 2015, 07:15 GMT
I'm able to reproduce.

Have you reported this bug upstream?
Comment by Johannes Löthberg (demize) - Tuesday, 12 May 2015, 11:17 GMT
This is documented in the info file, and has nothing to do with Arch.
Comment by Sébastien Luttringer (seblu) - Wednesday, 13 May 2015, 22:42 GMT
@Kyrias: Thanks for pointing out the info doc.

<cut>
Currently ‘tr’ fully supports only single-byte characters.
Eventually it will support multibyte characters; when it does, the ‘-C’
option will cause it to complement the set of characters, whereas ‘-c’
will cause it to complement the set of values. This distinction will
matter only when some values are not characters, and this is possible
only in locales using multibyte encodings when the input contains
encoding errors.
</cut>

If it used to work for years, maybe upstream will accept your report. Nonetheless, I will close it here.
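
Because tr maps single bytes, a tool that treats the pattern and replacement as strings can stand in for the three examples. This is a possible workaround, not something proposed in this thread. A minimal sketch, assuming GNU sed (the \n escape in the replacement and the one-line `:a;N;$!ba` form are GNU conveniences) and a script saved as UTF-8; the function names are hypothetical:

```shell
# Example 1: replace every newline with '≠'.
# The loop appends each following line to the pattern space, then substitutes.
join_lines() {
    sed ':a;N;$!ba;s/\n/≠/g'
}

# Example 2: replace each of the characters a, s, d with '≠'.
asd_to_neq() {
    sed 's/[asd]/≠/g'
}

# Example 3: replace every '≠' with a newline (\n in the replacement is GNU sed).
split_on_neq() {
    sed 's/≠/\n/g'
}

printf 'asdgf\nadsfdssdsaf' | join_lines      # asdgf≠adsfdssdsaf
printf 'asdgf≠adsfdssdsaf' | split_on_neq
```

Note that asd_to_neq follows tr semantics (each listed character is mapped individually), so its output for the second example is ≠≠≠gf followed by ≠≠≠f≠≠≠≠≠≠f on the next line, not the "Expected output" given in the report.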
