FS#18691 - [unzip] iconv patch needed to support UTF-8 filenames created in Windows
Attached to Project:
Arch Linux
Opened by Semen Soldatov (simplexe) - Monday, 15 March 2010, 12:54 GMT
Last edited by Allan McRae (Allan) - Friday, 16 November 2012, 11:29 GMT
Details
Description:
Please restore the patch "unzip60-alt-iconv-utf8.patch" for unzip. Without this patch, all archives created on Windows are extracted with broken encoding in their filenames.
Additional info: 6.0-5
Steps to reproduce: create an archive in Windows with filenames in a national encoding, then extract it in Arch with unzip.
If you require international win32 zip extraction, please use 'p7zip' instead, which will properly handle these zip files.
Topic with the Info-ZIP team's answer: http://www.info-zip.org/board/board.pl?m-1248086794/
Ubuntu bug reports mentioned in one of the duplicate reports:
https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/203609
https://bugs.launchpad.net/debian/+source/unzip/+bug/10979
Debian's bug report:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=197427
p7zip has no option for setting the filename encoding. It does read $LC_CTYPE, but still gets the filenames wrong.
Take tankrule.zip (http://astardata.baidu.com/download/tankrule.zip) as an example.
Extracting it directly with p7zip:
% 7z x tankrule.zip
7-Zip 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=zh_CN.UTF-8,Utf16=on,HugeFiles=on,2 CPUs) <--- Notice locale here
Processing archive: tankrule.zip
Extracting Astar2010̹¿Ë´óÕ½Ïêϸ¹æÔò.pdf <--- Wrong
Changing $LC_CTYPE doesn't resolve this issue:
% export LC_CTYPE=zh_CN.GBK
% \7z x tankrule.zip
7-Zip 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=zh_CN.GBK,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: tankrule.zip
Extracting Astar2010̹�˴�ս��ϸ����.pdf <--- Still Wrong
The only way to extract this file correctly under Linux is to
use unzip with the iconv patch and LC_CTYPE=*.UTF-8
(the filesystem uses UTF-8 encoding):
% unzip -O GBK tankrule.zip
Archive: tankrule.zip
inflating: Astar2010坦克大战详细规则.pdf
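To illustrate why the names above come out garbled, here is a minimal Python sketch. It assumes the archive stores the name as GBK bytes and that the extracting tool decodes those bytes with some other single-byte charset (CP437 is used here purely as an example; the helper names `garble` and `recover` are made up for this illustration). Because no bytes are lost in the wrong decoding, re-encoding with the wrong charset and decoding as GBK recovers the name, which is roughly the kind of conversion that '-O GBK' asks for.

```python
# Sketch only: helper names and the CP437 assumption are illustrative,
# not taken from unzip or p7zip.

def garble(name: str, real: str = "gbk", assumed: str = "cp437") -> str:
    """Decode the bytes of `name` (stored as `real`) with the wrong `assumed` charset."""
    return name.encode(real).decode(assumed)


def recover(garbled: str, real: str = "gbk", assumed: str = "cp437") -> str:
    """Undo garble(): get the stored bytes back, then decode them correctly."""
    return garbled.encode(assumed).decode(real)


original = "Astar2010坦克大战详细规则.pdf"
broken = garble(original)   # what a tool unaware of GBK would show
fixed = recover(broken)     # what a charset-aware tool would show

print(broken != original)   # the wrongly decoded name differs
print(fixed == original)    # the round trip restores the name
```

The round trip only works when the wrongly assumed charset maps every byte to some character, as CP437 does in Python; a lossy decoding cannot be undone this way.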
@Allan McRae:
This patch doesn't change anything if unzip is used without the '-O' option. How can it introduce incompatibilities?
I think zsh completion is a minor issue compared to the fact that there is no other way to correctly extract these non-UTF-8 zips in Linux. Almost all zips containing non-ASCII filenames are created in Windows; there are a great many of them, and they all have this annoying issue.
Isn't that your issue?
http://www.info-zip.org/phpBB3/viewtopic.php?f=1&t=326
This should be implemented:
6.10b02 (16 Sep 2010):
- Implement -I (--iso-char-set) and -O (--oem-char-set) options to allow user
to set a specific ISO or OEM character set that UnZip should translate
from to create the internal file name strings. Based on the
unzip60-alt-iconv-utf8.patch suggested in a forum thread. This patch is
rather simple, just providing the new options to manually set the input
character set (generally OEM if archive from Windows and ISO otherwise).
These options are currently only available on Unix. This is separate from
the Unicode implementation that allows direct conversion between character
sets and bypasses these new options if an entry has Unicode. These options
are only for converting names from older archives without stored UTF-8 to
names that are readable on a Unix platform. This implementation includes
a short table to automatically guess the code page for a few common
locales. More locales can be added later, and a more complete solution
might be implemented later based on something like the libnatspec or
librcc libraries. However, the goal is for everyone to move to zippers
that support UTF-8 so these options are no longer needed. Enabled by
setting the USE_ICONV_MAPPING macro. Uses the iconv library which must be
available. (unzip.c, zipinfo.c, unix/unix.c, unix/unxcfg.h) [EG]
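The distinction the changelog draws between entries with stored UTF-8 and older entries without it can be sketched in Python. This is not Info-ZIP's code, only an illustration under stated assumptions: Python's zipfile module decodes names as CP437 when the UTF-8 flag (general-purpose bit 11) is not set, so the stored bytes can be recovered and re-decoded with a user-chosen charset. The function name `decoded_names` and the default of "gbk" are hypothetical.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def decoded_names(archive, legacy_encoding="gbk"):
    """Return member names, re-decoding legacy names with `legacy_encoding`.

    `archive` may be a path or a binary file-like object.
    Assumes zipfile used its default CP437 decoding for unflagged names.
    """
    names = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                # Entry carries UTF-8: leave it alone, as the changelog describes.
                names.append(info.filename)
            else:
                # Legacy entry: recover the stored bytes, then decode them
                # with the charset the user asked for.
                raw = info.filename.encode("cp437")
                names.append(raw.decode(legacy_encoding))
    return names
```

For the archive discussed earlier in this thread, a call such as decoded_names("tankrule.zip", "gbk") would be the analogue of passing '-O GBK', assuming that archive's entries do not have the UTF-8 flag set.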
If someone finds the commit that implements this, we may consider actually adding it to the package. It appears that 6.10 is in perpetual beta...
Otherwise we are down to diffing 6.10b against 6.10a (because -O and -I are listed as a major change in 6.10b):
http://sourceforge.net/projects/infozip/files/unreleased%20Betas/UnZip%20betas/
Also, there is unzip-iconv in the AUR: https://aur.archlinux.org/packages.php?ID=40047
This is what I currently use, and it did the trick for me with a .zip from a Windows XP install, i.e. probably the most common use case for this functionality (-O cp850 -I utf-8).