FS#18691 : [unzip] iconv patch needed to support UTF-8 filenames created in Windows

Attached to Project: Arch Linux
Opened by Semen Soldatov (simplexe) - Monday, 15 March 2010, 12:54 GMT
Last edited by Allan McRae (Allan) - Friday, 16 November 2012, 11:29 GMT

Task Type	Bug Report
Category	Upstream Bugs
Status	Closed
Assigned To	Roman Kyrylych (Romashka)
Architecture	All
Severity	Medium
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	6 Jekyll Wu (adaptee) (2011-05-09) Philip Nilsson (leffe) (2011-03-06) Shun-Yi Huang (ShunYi) (2010-08-28) Alexander Mityunin (xandry) (2010-06-24) Semen Soldatov (simplexe) (2010-05-29) Jun Wu (quark) (2010-05-27)
Private	No

Details

Description:
Please restore the patch "unzip60-alt-iconv-utf8.patch" for unzip.
Without this patch, all archives created on Windows are extracted with broken filename encoding.

Additional info:
6.0-5

Steps to reproduce:
Create an archive in Windows containing a file whose name uses a national (non-ASCII) encoding, then extract it on Arch with unzip.

This task depends upon

Closed by Allan McRae (Allan)
Friday, 16 November 2012, 11:29 GMT
Reason for closing: Upstream

Comment by Thayer Williams (thayer) - Monday, 15 March 2010, 14:39 GMT

This is a problem that has been raised several times with the unzip developers. It appears they have no interest in fixing this issue, despite patches being sent to them. We have removed our patch because it conflicts with other programs.

If you require international win32 zip extraction, please use 'p7zip' instead, which will properly handle these zip files.

Comment by Jun Wu (quark) - Thursday, 27 May 2010, 08:59 GMT

Field changed: Percent Complete (100% → 0%)

p7zip doesn't handle non-UTF-8 zips properly. There are lots of these zips coming from the Windows world.

Comment by Roman Kyrylych (Romashka) - Thursday, 27 May 2010, 09:01 GMT

just adding more info here:

topic with Info-ZIP team's answer: http://www.info-zip.org/board/board.pl?m-1248086794/

Ubuntu bug reports mentioned in one of the duplicate reports:
https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/203609
https://bugs.launchpad.net/debian/+source/unzip/+bug/10979

Debian bug report:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=197427

Comment by Roman Kyrylych (Romashka) - Thursday, 27 May 2010, 09:05 GMT

@ Jun Wu: please provide more details about why p7zip doesn't handle UTF-8 well (perhaps p7zip can be fixed).

Comment by Allan McRae (Allan) - Thursday, 27 May 2010, 10:37 GMT

From memory, not being supported upstream was not the only reason to remove that patch. It also brought incompatibilities with some zip archives made on Windows.

Comment by Jun Wu (quark) - Friday, 28 May 2010, 07:02 GMT

@Roman Kyrylych:

p7zip has no option for specifying the filename encoding. It does read $LC_CTYPE, but still gets the filenames wrong.

Take tankrule.zip (http://astardata.baidu.com/download/tankrule.zip) as example.

Directly extract it using p7zip:

% 7z x tankrule.zip
7-Zip 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=zh_CN.UTF-8,Utf16=on,HugeFiles=on,2 CPUs) <--- Notice locale here
Processing archive: tankrule.zip
Extracting Astar2010Ì¹¿Ë´óÕ½ÏêÏ¸¹æÔò.pdf <--- Wrong

Changing $LC_CTYPE doesn't resolve this issue:

% export LC_CTYPE=zh_CN.GBK
% \7z x tankrule.zip
7-Zip 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=zh_CN.GBK,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: tankrule.zip
Extracting Astar2010̹�˴�ս��ϸ��.pdf <--- Still Wrong
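
For illustration, the first garbled name above is consistent with the GBK bytes of the filename being read as Latin-1. The following Python sketch reproduces that under this assumption; it is not a statement about how p7zip works internally, and the variable names are made up for the example.

```python
# Illustration only: reproduce the garbled name from the intended one.
original = "Astar2010坦克大战详细规则.pdf"

# Assumption: the Windows zip tool stored the name in the local code page (GBK).
raw = original.encode("gbk")

# Reading those bytes with the wrong single-byte charset gives the mojibake.
garbled = raw.decode("latin-1")
print(garbled)

# No bytes were lost, so the damage is reversible once the real encoding is known.
recovered = garbled.encode("latin-1").decode("gbk")
print(recovered)
```

The second attempt differs: the replacement characters in its output suggest that some bytes were already discarded, so a name in that state could not be recovered this way.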

The only way to extract this file correctly under Linux is to
use unzip with the iconv patch and LC_CTYPE=*.UTF-8
(the filesystem uses UTF-8 encoding):

% unzip -O GBK tankrule.zip
Archive: tankrule.zip
inflating: Astar2010坦克大战详细规则.pdf
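
As a rough sketch of what the -O option does, the same re-decoding can be written with Python's standard zipfile module. This is not the patch itself: the helper names `fixed_name` and `extract_with_encoding` are hypothetical, and the sketch assumes the caller already knows the legacy encoding (GBK in this case). It relies on zipfile decoding names as cp437 when an entry lacks the UTF-8 flag.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the entry name is stored as UTF-8


def fixed_name(info: zipfile.ZipInfo, legacy_encoding: str) -> str:
    """Return the entry name, re-decoded from legacy_encoding if needed."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already Unicode, leave it untouched
    # zipfile decoded the raw bytes as cp437; undo that, then decode properly.
    return info.filename.encode("cp437").decode(legacy_encoding)


def extract_with_encoding(zip_path: str, dest: str, legacy_encoding: str) -> None:
    """Extract every entry, writing files under their re-decoded names."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            info.filename = fixed_name(info, legacy_encoding)
            zf.extract(info, dest)


# Hand-built entry that mimics a name stored as GBK bytes without the UTF-8 flag.
legacy = zipfile.ZipInfo("Astar2010坦克大战详细规则.pdf".encode("gbk").decode("cp437"))
print(fixed_name(legacy, "gbk"))
```

With the archive from this report, the call would be `extract_with_encoding("tankrule.zip", ".", "gbk")`. Python 3.11 and later also accept a `metadata_encoding` argument to `zipfile.ZipFile`, which covers the same case without the manual round trip.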

@Allan McRae:

This patch doesn't change anything if unzip is used without the '-O' option. How can it bring incompatibilities?

Comment by Thayer Williams (thayer) - Friday, 28 May 2010, 13:26 GMT

The incompatibility was that it breaks Zsh auto-completion with or without -O option.

Comment by Jun Wu (quark) - Friday, 28 May 2010, 17:56 GMT

@thayer

I think zsh completion is a minor issue compared to the fact that there is no other way to correctly extract these non-UTF-8 zips in Linux. Almost all zips containing non-ASCII filenames are created in Windows; there are a great many of them, and they all have this annoying issue.

Comment by Alexander Mityunin (xandry) - Thursday, 24 June 2010, 20:00 GMT

I have checked: this patch works.

unzip-6.0-iconv.patch (13.9 KiB)

Comment by Greg (dolby) - Friday, 04 March 2011, 03:28 GMT

Brilliant! What does upstream think about your patch? The above link from the Info-ZIP forum doesn't really work.

Comment by Alexander Mityunin (xandry) - Friday, 04 March 2011, 10:21 GMT

They probably don't know about it either. At least, I haven't contacted them about it.

Comment by Greg (dolby) - Friday, 04 March 2011, 10:46 GMT

Well, some patch made it to 6.10. See http://www.info-zip.org/phpBB3/viewtopic.php?f=7&t=223&p=2023&hilit=iconv#p2113
Isn't that your issue?

Comment by Alexander Mityunin (xandry) - Friday, 04 March 2011, 10:55 GMT

Nope.

Comment by Greg (dolby) - Friday, 04 March 2011, 11:14 GMT

Well, you're wrong. That's it.

Comment by Greg (dolby) - Friday, 04 March 2011, 11:26 GMT

Can someone try the 6.10 beta and report if it's fixed? ftp://ftp.info-zip.org/pub/infozip/beta/unzip610b.zip
http://www.info-zip.org/phpBB3/viewtopic.php?f=1&t=326

This should be implemented:
6.10b02 (16 Sep 2010):
- Implement -I (--iso-char-set) and -O (--oem-char-set) options to allow user
to set a specific ISO or OEM character set that UnZip should translate
from to create the internal file name strings. Based on the
unzip60-alt-iconv-utf8.patch suggested in a forum thread. This patch is
rather simple, just providing the new options to manually set the input
character set (generally OEM if archive from Windows and ISO otherwise).
These options are currently only available on Unix. This is separate from
the Unicode implementation that allows direct conversion between character
sets and bypasses these new options if an entry has Unicode. These options
are only for converting names from older archives without stored UTF-8 to
names that are readable on a Unix platform. This implementation includes
a short table to automatically guess the code page for a few common
locales. More locales can be added later, and a more complete solution
might be implemented later based on something like the libnatspec or
librcc libraries. However, the goal is for everyone to move to zippers
that support UTF-8 so these options are no longer needed. Enabled by
setting the USE_ICONV_MAPPING macro. Uses the iconv library which must be
available. (unzip.c, zipinfo.c, unix/unix.c, unix/unxcfg.h) [EG]

Comment by Philip Nilsson (leffe) - Sunday, 06 March 2011, 10:17 GMT

The beta works for me, but I have to use -I where I used to use -O.

Comment by Phil Schaf (flying-sheep) - Tuesday, 17 April 2012, 11:43 GMT

the choice of -I and -O is unintuitive. one might read that as Input-encoding and Output-encoding instead of --iso-char-set and --oem-char-set…

Comment by Allan McRae (Allan) - Tuesday, 17 April 2012, 11:56 GMT

@Phil - you should tell the unzip developers. That is not an Arch issue.

If someone finds the commit that implements this, we may consider actually adding it to the package. It appears that 6.10 is in perpetual beta...

Comment by Phil Schaf (flying-sheep) - Tuesday, 17 April 2012, 12:07 GMT

do they have a public VCS?

else we are down to diffing 6.10b against 6.10a (because -O and -I are listed as a major change in 6.10b)

http://sourceforge.net/projects/infozip/files/unreleased%20Betas/UnZip%20betas/

also, there is unzip-iconv in the AUR https://aur.archlinux.org/packages.php?ID=40047

this is what i currently use, and it did the trick for me for a .zip from a windows xp install, i.e. probably the most common use case for this functionality (-O cp850 -I utf-8)

	Tasks related to this task (1)
	~~FS#17503 - [unzip] zsh completion missing for unzip patches~~

Duplicate tasks of this task (2)
~~FS#17503 - [unzip] zsh completion missing for unzip patches~~
~~FS#19603 - unzip should include iconv patch~~
