FS#79582 - [postgresql] Invalid multibyte char w/ full text search + plperl extension

Attached to Project: Arch Linux
Opened by Matt (mattarch) - Tuesday, 05 September 2023, 19:09 GMT
Last edited by Toolybird (Toolybird) - Wednesday, 22 November 2023, 00:44 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To No-one
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

This problem happens consistently with PostgreSQL 15.4 but has never been seen on 14, even after a clean install of the entire OS on newly formatted partitions. It happens when all 3 conditions are satisfied:

1. TEXT @@ to_tsquery() is executed.
2. The TEXT column contains a multi-byte UTF-8 character.
3. The statement is run after a simple plperl function call in the same session.

The error is:

ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.

Additional info:

* package version: postgresql 15.4-2

Steps to reproduce:

$ createdb -U postgres test
$ psql -U postgres test
psql (15.4)
Type "help" for help.

test=# \l test
List of databases
Name | Owner | Encoding | Collate | Ctype | ICU Locale | Locale Provider | Access privileges
------+----------+----------+-------------+-------------+------------+-----------------+-------------------
test | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc |
(1 row)

test=# CREATE EXTENSION IF NOT EXISTS plperl;
CREATE EXTENSION
test=# CREATE FUNCTION public.simple_quoter(text) RETURNS TEXT
LANGUAGE plperl
AS $_X$
return "'$_[0]'"
$_X$;
CREATE FUNCTION
test=# CREATE TABLE t (c TEXT);
CREATE TABLE
test=# INSERT INTO t VALUES ('Queensrÿche');
INSERT 0 1
test=# SELECT * FROM t WHERE to_tsquery('english','Queen:*') @@ to_tsvector('english',c);
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.

The same error also occurs when running the two statements in the same session:

SELECT simple_quoter('hi');
SELECT c @@ to_tsquery('english','Queen:*') FROM t;

If the second is run without the first, there is not problem. So it somehow involves use of the `plperl` extension prior to the second query.

There is also an open Stackoverflow issue of the problem: https://stackoverflow.com/questions/77033003/bug-with-postgresql-full-text-search-multi-byte-utf-8-characters-plperl-exte
This task depends upon

Closed by  Toolybird (Toolybird)
Wednesday, 22 November 2023, 00:44 GMT
Reason for closing:  No response
Comment by Toolybird (Toolybird) - Wednesday, 06 September 2023, 08:53 GMT
> So it somehow involves use of the `plperl` extension

I followed your steps and see the same error. But it's hard to see how this is an Arch packaging issue. Have you reported the problem upstream? You might have better luck visiting the proper Arch support channels (Forum/IRC/Mailing Lists/Reddit/etc) to see if anyone has any ideas. Please let us know what you find out.
Comment by Matt (mattarch) - Wednesday, 06 September 2023, 10:41 GMT
> it's hard to see how this is an Arch packaging issue.

One possibility is the patch applied to the plperl makefile: https://gitlab.archlinux.org/archlinux/packaging/packages/postgresql/-/blob/main/postgresql-perl-rpath.patch

> Have you reported the problem upstream?

No - but thanks for confirming the issue and the suggestions.
Comment by loqs (loqs) - Wednesday, 06 September 2023, 17:04 GMT
> One possibility is the patch applied to the plperl makefile: https://gitlab.archlinux.org/archlinux/packaging/packages/postgresql/-/blob/main/postgresql-perl-rpath.patch
You can test if that is the cause by rebuilding the package with [1] applied to the PKGBUILD. You could also try 16rc1 with [2] applied to the PKGBUILD instead of [1].
You could also attempt to bisect the bug [3].

[1] PKGBUILD.diff.1
[2] PKGBUILD.diff.2
[3] https://wiki.archlinux.org/title/Bisecting_bugs_with_Git
Comment by Matt (mattarch) - Saturday, 09 September 2023, 11:48 GMT
The error still occurs using both patches. With the second patch, I also had to apply
```
diff --git a/postgresql-check-db-dir b/postgresql-check-db-dir
index a2134cd..16d6559 100755
--- a/postgresql-check-db-dir
+++ b/postgresql-check-db-dir
@@ -16,9 +16,9 @@ then
fi

# PGMAJORVERSION is major version
-PGMAJORVERSION=15
+PGMAJORVERSION=16
# PREVMAJORVERSION is the previous major version
-PREVMAJORVERSION=14
+PREVMAJORVERSION=15

# Check for the PGDATA structure
if [ -f "$PGDATA/PG_VERSION" ] && [ -d "$PGDATA/base" ]
```
in order for the server to start via systemd.

Laurenz Albe in the comments of https://stackoverflow.com/questions/77033003/bug-with-postgresql-full-text-search-multi-byte-utf-8-characters-plperl-exte claims to not see the error w/ PostgreSQL v16.
Comment by loqs (loqs) - Saturday, 09 September 2023, 12:30 GMT
Has the issue occurred with other postgresql 15 packages than the 15.4-2 release? Wondering if the update to llvm15 in 15.4-2 could be related.
Comment by Matt (mattarch) - Sunday, 10 September 2023, 01:28 GMT
It seems the problem may have been first suppressed with tag 15.3-3 (which also exhibits the same error) in which the primary change was adding
```
export LANG=C
```
to the `check()` function of `PKGBUILD`. Running `makepkg` on the prior tag 15.3-2 fails, ending with
```
============== running regression test queries ==============
test plperl_setup ... ok 45 ms
test plperl ... ok 24 ms
test plperl_lc ... ok 19 ms
test plperl_trigger ... ok 20 ms
test plperl_shared ... ok 20 ms
test plperl_elog ... FAILED 15 ms
test plperl_util ... ok 14 ms
test plperl_init ... ok 21 ms
test plperlu ... ok 10 ms
test plperl_array ... ok 15 ms
test plperl_call ... FAILED 14 ms
test plperl_transaction ... ok 20 ms
test plperl_plperlu ... ok 17 ms
============== shutting down postmaster ==============

=======================
2 of 13 tests failed.
=======================

The differences that caused some tests to fail can be viewed in the
file "/home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/regression.diffs". A copy of the test summary that you see
above is saved in the file "/home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/regression.out".

make[2]: *** [GNUmakefile:119: check] Error 1
make[2]: Leaving directory '/home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl'
make[1]: *** [Makefile:35: check-plperl-recurse] Error 2
make[1]: Leaving directory '/home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl'
make: *** [GNUmakefile:71: check-world-src/pl-recurse] Error 2
make check-world failure: ./src/pl/plperl/regression.diffs
diff -U3 /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/expected/plperl_elog_1.out /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/results/plperl_elog.out
--- /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/expected/plperl_elog_1.out 2023-05-08 17:13:20.000000000 -0400
+++ /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/results/plperl_elog.out 2023-09-09 21:05:06.004063285 -0400
@@ -76,6 +76,7 @@
RETURN 1;
END;
$$;
+WARNING: could not determine encoding for locale "en_US.UTF-8": codeset is "ANSI_X3.4-1968"
select die_caller();
NOTICE: caught die
die_caller
diff -U3 /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/expected/plperl_call.out /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/results/plperl_call.out
--- /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/expected/plperl_call.out 2023-05-08 17:13:20.000000000 -0400
+++ /home/matt/gitlab/postgresql/src/postgresql-15.3/src/pl/plperl/results/plperl_call.out 2023-09-09 21:05:06.097397160 -0400
@@ -64,6 +64,7 @@
RAISE NOTICE '_a: %, _b: %', _a, _b;
END
$$;
+WARNING: could not determine encoding for locale "en_US.UTF-8": codeset is "ANSI_X3.4-1968"
NOTICE: a: 10, b:
NOTICE: _a: 10, _b: 20
DROP PROCEDURE test_proc1;
==> ERROR: A failure occurred in check().
Aborting...
```
The test failures involving plperl and locale/encoding are closely related to the issues posted above.

I wonder why `LANG=C` was needed? If it was due to the above errors, perhaps there is a better resolution that would be compatible with environments with `LANG=en_US.UTF-8`?
Comment by Toolybird (Toolybird) - Sunday, 10 September 2023, 02:17 GMT
> I wonder why `LANG=C` was needed?

Clearly, it was done to allow `make check' to pass. What happens if you follow the advice here [1] and/or here [2]?

[1] https://dba.stackexchange.com/questions/65268/how-can-i-load-utf8-into-plperl-by-default
[2] https://stackoverflow.com/questions/20355007/why-does-this-postgres-stored-procedure-want-to-use-utf8
Comment by Matt (mattarch) - Sunday, 10 September 2023, 14:39 GMT
Adding `plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";'` to `postgresql.conf` doesn't work because `utf8_heavy.pl` was removed in perl 5.31.6: https://github.com/Perl/perl5/commit/483a80b4eb1ce75c33945f69455138be14944460

I've tried variations on this theme, based on the above perl5 commit, e.g.
```
plperl.on_init = 'use utf8; use re; package utf8; use Unicode::UCD; require "unicore/UCD.pl";'
```
plus various subsets/permutations of these init commands but they result in the same error.
Comment by Toolybird (Toolybird) - Monday, 30 October 2023, 20:53 GMT
postgresql 16.0-2 is in [extra-testing]. How does it fare?

Loading...