FS#68481 - [hunspell-de] German dictionary doesn't work in multi-dictionary mode with UTF8-encoded dicts

Attached to Project: Arch Linux
Opened by Mikhail Skorzhinskii (rasmi) - Friday, 30 October 2020, 16:29 GMT
Last edited by Andreas Radke (AndyRTR) - Tuesday, 17 November 2020, 19:41 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Andreas Radke (AndyRTR)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

When the German (de_DE) dictionary is used together with other dictionaries that are encoded in UTF-8, hunspell fails with the following error:

# hunspell -d ru_RU,de_DE
'error - iconv: ISO8859-1 -> UTF-8'

This happens because the German dictionary is encoded in ISO8859-1. The issue is discussed in more detail here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864864. After applying the fix suggested there (basically re-encoding the German dictionary to UTF-8), the problem is gone.

I've filed a bug about this on the hunspell GitHub: https://github.com/hunspell/hunspell/issues/688 and also wrote a personal letter to the German dictionary maintainer. Unfortunately, I haven't received any answer yet.

Would it be possible to re-encode the German dictionary (and possibly other dictionaries) to UTF-8 in Arch Linux? Fixing this in the distribution is arguably not ideal, but I see no other way to improve the user experience given that upstream seems reluctant to fix it.

Package version: extra/hunspell-de 20161207-4
aur/hunspell-ru-aot 0.4.5-1

Closed by  Andreas Radke (AndyRTR)
Tuesday, 17 November 2020, 19:41 GMT
Reason for closing:  Fixed
Comment by Mikhail Skorzhinskii (rasmi) - Monday, 16 November 2020, 21:25 GMT
  • Field changed: Percent Complete (100% → 0%)
Thanks for that! Unfortunately, it is still not fixed for

# hunspell -d de_DE,ru_RU

Note the different order of the dictionaries on the command line. To fix this case, the dictionary .aff files need to be changed as well. Each .aff file contains a line that explicitly declares the file encoding:

SET ISO8859-1

After re-encoding the files to UTF-8, it should read

SET UTF-8

For example:

sed -i 's/SET ISO8859-1/SET UTF-8/' de_DE.aff
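Putting the two steps together, a minimal sketch of the whole conversion could look like the following. It assumes both files are currently encoded in ISO8859-1 and that iconv and GNU sed are available; `convert_dict` is a hypothetical helper name, not something shipped by hunspell or the package.

```shell
# Sketch only: re-encode a hunspell dictionary pair to UTF-8 and update SET.
# Assumption: <base>.aff and <base>.dic are currently ISO8859-1.
# convert_dict is a hypothetical helper name.
convert_dict() {
    base=$1   # path without extension, e.g. ./de_DE
    for ext in aff dic; do
        # Convert into a temporary file first, then replace the original.
        iconv -f ISO-8859-1 -t UTF-8 "$base.$ext" > "$base.$ext.utf8" &&
            mv "$base.$ext.utf8" "$base.$ext" || return 1
    done
    # Declare the new encoding so the bytes are not decoded as ISO8859-1 again.
    sed -i 's/^SET ISO8859-1/SET UTF-8/' "$base.aff"
}
```

For a working copy of the dictionary in the current directory this would be called as `convert_dict ./de_DE`. Running it twice on the same files would double-encode them, so it should only be applied to the original ISO8859-1 files.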
Comment by Andreas Radke (AndyRTR) - Monday, 16 November 2020, 22:07 GMT
Are you sure this is required? I can run here "hunspell -d de_DE,en_US somefile" without any error.
Comment by Mikhail Skorzhinskii (rasmi) - Monday, 16 November 2020, 22:26 GMT
I think it works fine as long as you pass only ASCII characters. The problem can be hard to catch when combining German and English dictionaries, but it's very easy to reproduce with Russian: Russian uses Cyrillic, so every Russian word triggers the error. For example:

# hunspell -d de_DE,ru_RU
Hunspell 1.7.0
hallo
*

привет
error - iconv: ISO8859-1 -> UTF-8
*

But even when combining English and German dictionaries, there is a way to trigger the problem. Example:

# hunspell -d de_DE,en_GB
Hunspell 1.7.0
fünfundfünfzig
& fünfundfünfzig 1 0: fünfundfünfzig


Changing the "SET ISO8859-1" line fixes this problem.
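A quick way to check whether a dictionary pair is consistent after such a change is sketched below. It verifies that the .aff file declares UTF-8 and that both files actually decode as UTF-8, using the common `iconv -f UTF-8 -t UTF-8` validation trick. `check_dict_utf8` is a hypothetical helper name, and the check assumes the SET line starts at the beginning of a line with no byte order mark in front of it.

```shell
# Sketch only: succeed if the .aff declares UTF-8 and both files are valid UTF-8.
# check_dict_utf8 is a hypothetical helper name.
check_dict_utf8() {
    base=$1   # path without extension, e.g. /usr/share/hunspell/de_DE
    grep -q '^SET UTF-8' "$base.aff" ||
        { echo "$base.aff: SET line does not declare UTF-8"; return 1; }
    for ext in aff dic; do
        # iconv exits non-zero when the input contains invalid UTF-8 sequences.
        iconv -f UTF-8 -t UTF-8 "$base.$ext" > /dev/null 2>&1 ||
            { echo "$base.$ext: not valid UTF-8"; return 1; }
    done
    echo "$base: OK"
}
```

Called as `check_dict_utf8 /usr/share/hunspell/de_DE` (the install path is an assumption), it reports which half of the fix, the re-encoding or the SET line, is still missing.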
Comment by Andreas Radke (AndyRTR) - Tuesday, 17 November 2020, 16:55 GMT
I've applied the sed fix to all affected hunspell-{de,it,el,pl} packages and also made sure the proper ISO value is used when iconv'ing the aff/dic files.

Please report back if this has been solved now. For the future please also try nuspell that may some day replace hunspell.
Comment by Mikhail Skorzhinskii (rasmi) - Tuesday, 17 November 2020, 19:27 GMT
I confirm that the issue is solved for me now with hunspell-de 20161207-6. Thanks again for your help and support!

> For the future please also try nuspell that may some day replace hunspell.

Wow, that is a discovery for me. I played with it a little today. Judging by the project description, it looks very promising from my perspective. I will try it out more later.
