Arch Linux

FS#711 - SpamAssassin is confused with UTF-8

Attached to Project: Arch Linux
Opened by Jan Willemson (janwill) - Thursday, 08 April 2004, 11:26 GMT
Last edited by Dale Blount (dale) - Thursday, 08 April 2004, 12:08 GMT
Task Type: Bug Report
Category: Packages: Extra
Status: Closed
Assigned To: Eric Johnson (eric)
Architecture: not specified
Severity: Medium
Priority: Normal
Reported Version: 0.7 Wombat
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 0%
Votes: 0
Private: No

Details

When running

sa-learn --spam path/to/spam

almost every message that is analyzed produces a warning such as

Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc0) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/Bayes.pm line 319.

or

Malformed UTF-8 character (unexpected non-continuation byte 0xc7, immediately after start byte 0xcc) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/Bayes.pm line 319.

Besides 0xc0 and 0xc7, the warnings also mention 0xbd, 0x20 and many other byte values. The line number 319 is the same in all of the warnings. As a result, sa-learn discards almost all of the input and does not learn anything.

Jan
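
The warnings above suggest that Perl is treating the message bytes as UTF-8 and hitting sequences that are not valid UTF-8. A minimal sketch for checking whether a given file decodes cleanly as UTF-8, assuming `iconv` is installed; the function name `is_valid_utf8` and the throwaway sample files are hypothetical and not part of SpamAssassin:

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Demo with two throwaway files (octal escapes keep printf portable).
tmpdir=$(mktemp -d)
printf 'plain ASCII text\n' > "$tmpdir/ok.txt"
printf 'bad: \300 \n' > "$tmpdir/bad.txt"   # byte 0xc0 followed by a space (0x20)

for f in "$tmpdir/ok.txt" "$tmpdir/bad.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
rm -rf "$tmpdir"
```

Running the same check over a few of the affected spam messages would show whether they are in some other 8-bit encoding rather than UTF-8.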
Closed by Eric Johnson (eric)
Wednesday, 14 April 2004, 11:28 GMT
Reason for closing: Fixed
Comment by Dale Blount (dale) - Thursday, 08 April 2004, 12:09 GMT
IIRC, SpamAssassin sometimes ignores UTF-8 because it is currently "way too slow". I'm not sure whether this can be fixed in the packaging or whether we have to wait for an upstream fix.
Comment by Eric Johnson (eric) - Sunday, 11 April 2004, 23:35 GMT
Jan, before running sa-learn next time, try setting your environment variable LANG=en_US and see if that helps.

$ export LANG=en_US
$ sa-learn --spam path/to/spam

Let me know whether that fixes the problem, either in another comment or directly at eric@archlinux.org.
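
One thing worth checking alongside this: `LANG` is only the fallback in the locale lookup order. `LC_ALL` overrides everything, then the category-specific variable (`LC_CTYPE` for character handling), and only then `LANG`. A minimal sketch that prints which setting would take effect for character handling; the function name `effective_ctype` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the locale name that governs character
# handling, following the precedence LC_ALL > LC_CTYPE > LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'POSIX\n'     # nothing set: the default locale applies
    fi
}

effective_ctype
```

If this still prints a UTF-8 locale after `export LANG=en_US`, then `LC_ALL` or `LC_CTYPE` is set and the export alone does not change character handling.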
Comment by Jan Willemson (janwill) - Tuesday, 13 April 2004, 12:44 GMT
This did not fix the problem -- I still get the same messages.

It does not seem to be a "SpamAssassin does not handle UTF-8 because it's too slow" issue, since at the end of the sa-learn run I am told that it has learned from only 1 message, although my spam box contains over 13000 messages. So all but one of the messages contain UTF-8 symbols and one does not? Not very plausible ...

Jan
Comment by Eric Johnson (eric) - Tuesday, 13 April 2004, 14:06 GMT
Try upgrading to the new release (2.63-3). I added a patch that will hopefully fix the problem for you.
Comment by Jan Willemson (janwill) - Tuesday, 13 April 2004, 17:31 GMT
Thanks for the upgrade. The problem is still there, but in a different form. Now the error messages look like this:

--- cut here

Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately after start byte 0xc1) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0xbd, with no preceding start byte) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc0) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.

[--- many messages like this ---]

Malformed UTF-8 character (unexpected non-continuation byte 0x2e, immediately after start byte 0xd9) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Learned from 0 message(s) (1 message(s) examined).

--- cut here

It should also be noted that the whole process was much faster this time: only a few minutes, compared to 15-20 minutes with release 2.63-2.

I can give more specific test results if you give me some idea of what to test.

Jan
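
One way to condense output like the above into something easier to share is to count the warnings per source location. A minimal sketch, assuming the stderr output of sa-learn was saved to a file and that `grep` supports `-o`; the function name `summarize_warnings` and the log file name are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count Perl warnings per "file line N" location,
# most frequent first. Expects a file of saved warning output.
summarize_warnings() {
    grep -o 'at /[^ ]* line [0-9][0-9]*' "$1" | sort | uniq -c | sort -rn
}

# Example usage (sa-learn.log is an assumed file name):
#   sa-learn --spam path/to/spam 2> sa-learn.log
#   summarize_warnings sa-learn.log
```

A single dominant location, as with line 1293 of PerMsgStatus.pm here, points at one statement rather than at many unrelated problems.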
Comment by Eric Johnson (eric) - Tuesday, 13 April 2004, 21:17 GMT
Jan - can you try this next version out for me before I update to the AL community?

wget www.coding-zone.com/spamassassin-2.63-4.pkg.tar.gz
pacman -U spamassassin-2.63-4.pkg.tar.gz

Also, in case this doesn't solve your problems, do you mind continuing via regular email rather than these comments? Send your next set of problems to eric@archlinux.org.

Comment by Eric Johnson (eric) - Wednesday, 14 April 2004, 11:27 GMT
Closing the bug with the release of 2.63-4.
