Please read this before reporting a bug:
https://wiki.archlinux.org/title/Bug_reporting_guidelines
Do NOT report bugs when a package is just outdated, or it is in the AUR. Use the 'flag out of date' link on the package page, or the Mailing List.
REPEAT: Do NOT report bugs for outdated packages!
FS#711 - SpamAssassin is confused with UTF-8
Attached to Project:
Arch Linux
Opened by Jan Willemson (janwill) - Thursday, 08 April 2004, 11:26 GMT
Last edited by Dale Blount (dale) - Thursday, 08 April 2004, 12:08 GMT
Details
When running
sa-learn --spam path/to/spam
almost every message that is analyzed produces the warning
Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc0) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/Bayes.pm line 319.
or
Malformed UTF-8 character (unexpected non-continuation byte 0xc7, immediately after start byte 0xcc) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/Bayes.pm line 319.
Instead of 0xc0 or 0xc7, other byte values such as 0xbd and 0x20 also occur. The line number 319 is the same in all the warnings. As a result, sa-learn discards almost all the input and does not learn anything.
Jan
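One way to check whether the mail itself contains byte sequences that are not valid UTF-8, independently of SpamAssassin, is to try decoding the raw bytes. The following is a minimal Python sketch and not part of the original report; the function name `invalid_utf8_offsets` is hypothetical. It returns the offsets at which decoding fails, which is where bytes like the 0xc0 and 0xcc/0xc7 pairs named in the warnings would show up.

```python
def invalid_utf8_offsets(data):
    """Return the offset of each byte sequence in `data` that is not valid UTF-8.

    `data` is a bytes object, e.g. the raw contents of one mail message.
    """
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from `pos` onward decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:]
            offsets.append(pos + err.start)
            pos += err.end  # resume after the offending byte(s)
    return offsets
```

Applied to a message file read in binary mode, an empty result would suggest the message is plain ASCII or clean UTF-8, while a non-empty result would suggest the text uses some other 8-bit encoding. For example, `invalid_utf8_offsets(b"\xc0\x20")` reports offset 0, since 0xc0 can never start a valid UTF-8 sequence.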
$ export LANG=en_US
$ sa-learn --spam path/to/spam
Either let me know if that fixes the problem in another comment, or contact me directly at eric@archlinux.org.
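The `export LANG=en_US` suggestion selects a locale name with no UTF-8 codeset. Whether it takes effect depends on the other locale variables: under POSIX rules `LC_ALL` overrides `LC_CTYPE`, which in turn overrides `LANG`. A small Python sketch of that lookup follows. The helper names are hypothetical, and the idea that the warnings are tied to a UTF-8 locale is an assumption drawn from the suggested workaround, not something this thread confirms.

```python
def effective_ctype_locale(env):
    """Return the locale name that governs character handling.

    `env` is a mapping such as os.environ. POSIX precedence is
    LC_ALL, then LC_CTYPE, then LANG; empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # typical default when nothing is set


def is_utf8_locale(locale_name):
    """Report whether a locale name such as 'en_US.UTF-8' names a UTF-8 codeset."""
    # The codeset sits between "." and an optional "@modifier".
    codeset = locale_name.partition("@")[0].partition(".")[2]
    return codeset.lower().replace("-", "") == "utf8"
```

Calling `effective_ctype_locale(os.environ)` (after `import os`) before running `sa-learn` shows which setting is actually in force. If `LC_ALL` or `LC_CTYPE` is set to a UTF-8 locale, changing `LANG` alone would not change the result.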
This does not seem to be the "SpamAssassin does not handle UTF-8 because it's too slow" issue, since at the end of its run sa-learn reports that it has learned from only 1 message, although my spam box contains over 13000 messages. So all but one of the messages contain UTF-8 symbols and one does not? Not very plausible ...
Jan
--- cut here
Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately after start byte 0xc1) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (unexpected continuation byte 0xbd, with no preceding start byte) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc0) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
[--- many messages like this ---]
Malformed UTF-8 character (unexpected non-continuation byte 0x2e, immediately after start byte 0xd9) in transliteration (tr///) at /usr/lib/perl5/site_perl/current/Mail/SpamAssassin/PerMsgStatus.pm line 1293.
Learned from 0 message(s) (1 message(s) examined).
--- cut here
It should also be noted that the whole process was much faster this time: only a few minutes, compared to 15-20 minutes with release 2.63-2.
I can give more specific test results if you give me some idea of what to test.
Jan
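With this many warnings, it can help to condense the output before passing it on. The Python sketch below counts warnings by source file and line and tallies the start bytes they mention. It assumes the warning format quoted in this thread, and the names `WARNING_RE`, `START_BYTE_RE` and `summarize_warnings` are hypothetical.

```python
import re
from collections import Counter

# Matches the Perl warning wording quoted in this thread (an assumption:
# other Perl versions may phrase the warning differently).
WARNING_RE = re.compile(
    r"Malformed UTF-8 character \((?P<reason>[^)]*)\) in transliteration "
    r"\(tr///\) at (?P<path>\S+) line (?P<line>\d+)\."
)
START_BYTE_RE = re.compile(r"start byte (0x[0-9a-f]{2})")


def summarize_warnings(log_text):
    """Return (locations, start_bytes) counters for the warnings in log_text."""
    locations = Counter()    # (file path, line number) -> number of warnings
    start_bytes = Counter()  # start byte such as "0xc0" -> number of warnings
    for match in WARNING_RE.finditer(log_text):
        locations[(match["path"], int(match["line"]))] += 1
        byte = START_BYTE_RE.search(match["reason"])
        if byte:  # some warnings name no start byte at all
            start_bytes[byte.group(1)] += 1
    return locations, start_bytes
```

Feeding it the captured output of an sa-learn run would show at a glance whether all warnings come from a single location, as with line 1293 of PerMsgStatus.pm in the output quoted here.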
wget www.coding-zone.com/spamassassin-2.63-4.pkg.tar.gz
pacman -U spamassassin-2.63-4.pkg.tar.gz
Also, in case this doesn't solve your problems, would you mind continuing via regular email rather than in these comments? Send your next set of problems to eric@archlinux.org.