FS#18691 - [unzip] iconv patch needed to support UTF-8 filenames created in Windows

Attached to Project: Arch Linux
Opened by Semen Soldatov (simplexe) - Monday, 15 March 2010, 12:54 GMT
Last edited by Allan McRae (Allan) - Friday, 16 November 2012, 11:29 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Roman Kyrylych (Romashka)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 6
Private No

Details

Description:
Please restore the patch "unzip60-alt-iconv-utf8.patch" for unzip.
Without this patch, all archives created on Windows extract with broken filename encoding.

Additional info:
6.0-5


Steps to reproduce:
Create an archive on Windows containing a filename in a national (non-ASCII) encoding. Extract it on Arch with unzip.

Closed by  Allan McRae (Allan)
Friday, 16 November 2012, 11:29 GMT
Reason for closing:  Upstream
Comment by Thayer Williams (thayer) - Monday, 15 March 2010, 14:39 GMT
This is a problem that has been raised several times with the unzip developers. It appears they have no interest in fixing this issue, despite patches being sent to them. We have removed our patch because it conflicts with other programs.

If you require international win32 zip extraction, please use 'p7zip' instead, which will properly handle these zip files.
Comment by Jun Wu (quark) - Thursday, 27 May 2010, 08:59 GMT
  • Field changed: Percent Complete (100% → 0%)
p7zip doesn't handle non-UTF-8 zips properly. There are lots of these zips from the Windows world.
Comment by Roman Kyrylych (Romashka) - Thursday, 27 May 2010, 09:01 GMT
Comment by Roman Kyrylych (Romashka) - Thursday, 27 May 2010, 09:05 GMT
@ Jun Wu: please provide more details about why p7zip doesn't handle UTF-8 well (perhaps p7zip can be fixed).
Comment by Allan McRae (Allan) - Thursday, 27 May 2010, 10:37 GMT
From memory, not being supported upstream was not the only reason to remove that patch. It also brought incompatibilities with some zip archives made on Windows.
Comment by Jun Wu (quark) - Friday, 28 May 2010, 07:02 GMT
@Roman Kyrylych:

p7zip has no option for specifying the filename encoding. It does read $LC_CTYPE, but still gets the names wrong.

Take tankrule.zip (http://astardata.baidu.com/download/tankrule.zip) as example.

Directly extract it using p7zip:

% 7z x tankrule.zip
7-Zip 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=zh_CN.UTF-8,Utf16=on,HugeFiles=on,2 CPUs) <--- Notice locale here
Processing archive: tankrule.zip
Extracting Astar2010̹¿Ë´óÕ½Ïêϸ¹æÔò.pdf <--- Wrong

Changing $LC_CTYPE doesn't resolve this issue:

% export LC_CTYPE=zh_CN.GBK
% \7z x tankrule.zip
7-Zip 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=zh_CN.GBK,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: tankrule.zip
Extracting Astar2010̹�˴�ս��ϸ����.pdf <--- Still Wrong

The only way to extract this file correctly under Linux is to
use unzip with the iconv patch and LC_CTYPE=*.UTF-8
(the filesystem uses UTF-8 encoding):

% unzip -O GBK tankrule.zip
Archive: tankrule.zip
inflating: Astar2010坦克大战详细规则.pdf
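
The conversion that '-O GBK' performs above can be illustrated outside unzip. The following is a minimal Python sketch, not the patch itself: repair_name() is a hypothetical helper, and it relies on the fact that Python's zipfile module decodes a member name as cp437 when the archive entry does not set the UTF-8 flag (general-purpose bit 11).

```python
# Sketch only: repair_name() is a hypothetical helper, not part of unzip or p7zip.
UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def repair_name(name: str, flag_bits: int, legacy_encoding: str) -> str:
    """Re-decode a member name that Python's zipfile decoded as cp437."""
    if flag_bits & UTF8_FLAG:
        return name  # the entry declares UTF-8, so the name is already correct
    raw = name.encode("cp437")          # recover the bytes stored in the archive
    return raw.decode(legacy_encoding)  # decode them with the archive's real encoding


# Simulate what zipfile would report for a name stored as GBK bytes.
original = "Astar2010坦克大战详细规则.pdf"
as_seen = original.encode("gbk").decode("cp437")
print(repair_name(as_seen, 0, "gbk"))
```

With a real archive, the same helper could be applied to info.filename and info.flag_bits for each entry returned by zipfile.ZipFile.infolist(). The legacy encoding still has to be supplied by the user, just as with '-O'.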


@Allan McRae:

This patch doesn't change anything if unzip is used without the '-O' option. How can it bring incompatibilities?
Comment by Thayer Williams (thayer) - Friday, 28 May 2010, 13:26 GMT
The incompatibility was that it breaks Zsh auto-completion, with or without the -O option.
Comment by Jun Wu (quark) - Friday, 28 May 2010, 17:56 GMT
@thayer

I think zsh completion is a minor issue compared to the fact that there is no other way to correctly extract these non-UTF-8 zips in Linux. Almost all zips containing non-ASCII filenames are created in Windows; there are a great many of them, and they all have this annoying issue.
Comment by Alexander Mityunin (xandry) - Thursday, 24 June 2010, 20:00 GMT
I have checked: this patch works.
Comment by Greg (dolby) - Friday, 04 March 2011, 03:28 GMT
Brilliant! What does upstream think about your patch? The above link from the Info-ZIP forum doesn't really work.
Comment by Alexander Mityunin (xandry) - Friday, 04 March 2011, 10:21 GMT
They probably don't know about it either. At least, I haven't contacted them about it.
Comment by Greg (dolby) - Friday, 04 March 2011, 10:46 GMT
Well, some patch made it to 6.10. See http://www.info-zip.org/phpBB3/viewtopic.php?f=7&t=223&p=2023&hilit=iconv#p2113
Isn't that your issue?
Comment by Alexander Mityunin (xandry) - Friday, 04 March 2011, 10:55 GMT
Nope.
Comment by Greg (dolby) - Friday, 04 March 2011, 11:14 GMT
Well, you're wrong. That's it.
Comment by Greg (dolby) - Friday, 04 March 2011, 11:26 GMT
Can someone try the 6.10 beta and report if it's fixed? ftp://ftp.info-zip.org/pub/infozip/beta/unzip610b.zip
http://www.info-zip.org/phpBB3/viewtopic.php?f=1&t=326

This should be implemented:
6.10b02 (16 Sep 2010):
- Implement -I (--iso-char-set) and -O (--oem-char-set) options to allow user
to set a specific ISO or OEM character set that UnZip should translate
from to create the internal file name strings. Based on the
unzip60-alt-iconv-utf8.patch suggested in a forum thread. This patch is
rather simple, just providing the new options to manually set the input
character set (generally OEM if archive from Windows and ISO otherwise).
These options are currently only available on Unix. This is separate from
the Unicode implementation that allows direct conversion between character
sets and bypasses these new options if an entry has Unicode. These options
are only for converting names from older archives without stored UTF-8 to
names that are readable on a Unix platform. This implementation includes
a short table to automatically guess the code page for a few common
locales. More locales can be added later, and a more complete solution
might be implemented later based on something like the libnatspec or
librcc libraries. However, the goal is for everyone to move to zippers
that support UTF-8 so these options are no longer needed. Enabled by
setting the USE_ICONV_MAPPING macro. Uses the iconv library which must be
available. (unzip.c, zipinfo.c, unix/unix.c, unix/unxcfg.h) [EG]
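
The "short table to automatically guess the code page" mentioned in the changelog can be pictured roughly as follows. This is only a Python sketch of the idea: the table entries and the name guess_oem_charset() are illustrative assumptions, not taken from the UnZip sources.

```python
# Hypothetical table: the entries are illustrative and not copied from UnZip.
OEM_GUESS = {
    "ru_RU": "cp866",
    "zh_CN": "gbk",
    "ja_JP": "cp932",
    "pl_PL": "cp852",
}


def guess_oem_charset(locale_name: str, default: str = "cp437") -> str:
    """Guess a legacy OEM code page from a locale name such as 'zh_CN.UTF-8'."""
    # Drop the codeset ('.UTF-8') and modifier ('@euro') parts of the locale name.
    lang = locale_name.split(".", 1)[0].split("@", 1)[0]
    return OEM_GUESS.get(lang, default)


print(guess_oem_charset("zh_CN.UTF-8"))
print(guess_oem_charset("en_US.UTF-8"))
```

A guess like this can only serve as a default. An archive created under a different locale than the one used to extract it still needs the character set given explicitly.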
Comment by Philip Nilsson (leffe) - Sunday, 06 March 2011, 10:17 GMT
The beta works for me, but I have to use -I where I used to use -O.
Comment by Phil Schaf (flying-sheep) - Tuesday, 17 April 2012, 11:43 GMT
the choice of -I and -O is unintuitive. one might read that as Input-encoding and Output-encoding instead of --iso-char-set and --oem-char-set…
Comment by Allan McRae (Allan) - Tuesday, 17 April 2012, 11:56 GMT
@Phil - you should tell the unzip developers. That is not an Arch issue.

If someone finds the commit that implements this, we may consider actually adding it to the package. It appears that 6.10 is in perpetual beta...
Comment by Phil Schaf (flying-sheep) - Tuesday, 17 April 2012, 12:07 GMT
do they have a public VCS?

else we are down to diffing 6.10b against 6.10a (because -O and -I are listed as major change in 6.10b)

http://sourceforge.net/projects/infozip/files/unreleased%20Betas/UnZip%20betas/

also, there is unzip-iconv in the AUR https://aur.archlinux.org/packages.php?ID=40047

this is what i currently use and what did the trick for me for a .zip from a windows xp install, i.e. the probably most common use case for this functionality (-O cp850 -I utf-8)
